Monitoring LyftData
Keeping LyftData healthy in production means watching the control plane, the workers, and the jobs they execute. Use the practices below to get fast feedback when something drifts from normal.
Built-in observability (UI)
- Dashboard: high-level health, recent job activity, and key charts.
- Jobs: per-job deploy state, run history, and issues.
- Workers: which workers are online, what they’re running, and their current limits.
- Metrics Explorer: query stored metrics by job/worker and time range.
- Observe → Logs / Problems / Messages: historical logs, “what’s broken right now”, and a live event stream.
Quick checks from the shell
```shell
# Liveness (unauthenticated). Add `-k` if you are using the default self-signed cert.
curl -fsS https://<server>:3000/api/liveness

# Health summary (admin auth required)
curl -fsS -H "Authorization: Bearer <admin-token>" \
  https://<server>:3000/api/health | jq '{status: .status, version: .version}'
```

If you have the CLI configured (see CLI reference), you can also run:
```shell
lyftdata doctor
lyftdata workers list
lyftdata jobs list
```

Key signals to alert on
| Signal | Investigate when | Typical response |
|---|---|---|
| Workers offline | Any production worker is unexpectedly offline | Check connectivity/TLS, restart worker, or replace host |
| Backlog growing | Queue depth or “pending work” trends up over several minutes | Add workers, reduce job cadence, or tune expensive steps |
| Deployments stuck | Jobs sit in staged/deploying states unusually long | Check worker availability, review Issues, redeploy |
| Error spikes | Errors or retries rise suddenly | Investigate downstream systems before scaling |
| Disk pressure | Staging/log storage approaches your limits | Increase disk, tighten retention, or move staging to a larger volume |
| License risk | License nearing expiry or limits being hit | Resolve licensing before it blocks production runs |
Most teams start with the UI dashboards and add alerts as the “normal” baseline becomes clear for their workloads.
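The “backlog growing” signal above can be approximated with a trivial trend check over sampled queue depths. This is a sketch: the sample values are illustrative, and in practice you would feed them from Metrics Explorer or your own collector.

```shell
# Sketch of a "backlog growing" check: alert when queue depth rises across
# every consecutive sample. The values below are illustrative only.
samples="12 18 25 31"

prev=-1
growing=1
for s in $samples; do
  # Any non-increasing step clears the alert condition.
  if [ "$s" -le "$prev" ]; then
    growing=0
  fi
  prev=$s
done

if [ "$growing" -eq 1 ]; then
  echo "ALERT: backlog growing across all samples"
fi
```

A sustained-trend check like this avoids paging on a single noisy sample; tune the window length to your workload.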
Alerting and integrations
- Forward host-level logs to your central platform (Loki/ELK/Splunk/etc.) for long-term retention and correlation.
- For LyftData telemetry (logs/issues/metrics), prefer the UI surfaces and CLI (`lyftdata workers logs <worker-id>`, `lyftdata workers metrics <worker-id>`).
- If you run collectors from fixed IPs, consider the server allowlist (`--whitelist` / `LYFTDATA_API_WHITELIST`), which enables read-only access to selected worker telemetry endpoints when the collector sends `Authorization: WHITELIST` (see Security hardening).
- Notify on-call channels when pipelines stall, workers flap, or error budgets are exceeded.
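A notification hook can be as small as a grep over worker status. In this sketch the two-column output format is an assumption about what `lyftdata workers list` prints, and the final `echo` is a placeholder for your real notifier; adjust both to your environment.

```shell
# Sketch: page the on-call channel when any worker reports offline.
# The two-column layout below is an assumed shape for `lyftdata workers list`
# output; adjust the grep pattern to what your CLI version actually prints.
workers_output="worker-1  online
worker-2  offline
worker-3  online"
# In practice: workers_output=$(lyftdata workers list)

offline_count=$(printf '%s\n' "$workers_output" | grep -c 'offline$' || true)

if [ "$offline_count" -gt 0 ]; then
  # Replace echo with your notifier (Slack webhook, PagerDuty, etc.).
  echo "ALERT: $offline_count worker(s) offline"
fi
```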
Health checks and diagnostics
- HTTP probes: `GET /api/liveness` for basic reachability; `GET /api/health` for a richer status summary.
- CLI: `lyftdata doctor` and `lyftdata server health` for guided checks.
- Synthetic jobs: schedule a tiny canary job that runs every few minutes and alerts if it fails.
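Probes are most useful when scripted. The sketch below turns the `/api/health` summary into a pass/fail exit code; the hard-coded payload is a stand-in with the same fields queried earlier (its values are illustrative, not real output), and the `sed` extraction avoids a hard dependency on `jq`.

```shell
# Sketch: turn the /api/health summary into a pass/fail exit code for scripts.
# The payload below is an illustrative stand-in; in practice:
#   payload=$(curl -fsS -H "Authorization: Bearer <admin-token>" \
#     https://<server>:3000/api/health)
payload='{"status":"ok","version":"1.4.2"}'

# Extract the "status" field without requiring jq.
status=$(printf '%s' "$payload" | sed -n 's/.*"status":"\([^"]*\)".*/\1/p')

if [ "$status" = "ok" ]; then
  echo "health: ok"
else
  echo "health: ${status:-unknown}" >&2
  exit 1
fi
```

Wired into cron or a systemd timer, the non-zero exit code is enough for most alerting glue.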
Troubleshooting signals
- Jobs stuck in Staged: verify target workers are online and the scheduling queue is clear.
- Rising retries on a connector: inspect connector logs and downstream APIs for throttling or auth errors.
- High backlog with low CPU: scale out workers or increase job concurrency; see the Scaling guide.
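For the “stuck in Staged” case, a quick filter over job state can confirm the symptom before you dig into workers. The column layout here is an assumed shape for `lyftdata jobs list` output; adapt the match to what your CLI actually prints.

```shell
# Sketch: flag jobs that sit in the Staged state. The sample output below is
# an assumption about the `lyftdata jobs list` format.
jobs_output="nightly-etl   Deployed
orders-sync   Staged
events-ingest Deployed"
# In practice: jobs_output=$(lyftdata jobs list)

staged=$(printf '%s\n' "$jobs_output" | awk '$2 == "Staged" { print $1 }')

if [ -n "$staged" ]; then
  echo "Jobs stuck in Staged: $staged"
fi
```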
What’s next
- Hardening operations continues in the rest of the Operate section.
- For immediate triage, pair this guide with the Troubleshooting reference.