Monitoring LyftData
Keeping LyftData healthy in production means watching the control plane, the workers, and the jobs they execute. Use the practices below to get fast feedback when something drifts from normal.
Built-in observability (UI)
- Dashboard: high-level health, recent job activity, and key charts.
- Jobs: per-job deploy state, run history, and issues.
- Workers: which workers are online, what they’re running, and their current limits.
- Metrics Explorer: query stored metrics by job/worker and time range.
- Observe → Logs / Problems / Messages: historical logs, “what’s broken right now”, and a live event stream.
Quick checks from the shell
```shell
# Liveness (unauthenticated). Add `-k` if you are using the default self-signed cert.
curl -fsS https://<server>:3000/api/liveness

# Health summary (admin auth required)
curl -fsS -H "Authorization: Bearer <admin-token>" \
  https://<server>:3000/api/health | jq '{status: .status, version: .version}'
```

If you have the CLI configured (see CLI reference), you can also run:
```shell
lyftdata doctor
lyftdata workers list
lyftdata jobs list
```

Key signals to alert on
| Signal | Investigate when | Typical response |
|---|---|---|
| Workers offline | Any production worker is unexpectedly offline | Check connectivity/TLS, restart worker, or replace host |
| Backlog growing | Queue depth or “pending work” trends up over several minutes | Add workers, reduce job cadence, or tune expensive steps |
| Deployments stuck | Jobs sit in staged/deploying states unusually long | Check worker availability, review Issues, redeploy |
| Error spikes | Errors or retries rise suddenly | Investigate downstream systems before scaling |
| Disk pressure | Staging/log storage approaches your limits | Increase disk, tighten retention, or move staging to a larger volume |
| License risk | License nearing expiry or limits being hit | Resolve licensing before it blocks production runs |
Most teams start with the UI dashboards and add alerts as the “normal” baseline becomes clear for their workloads.
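The “backlog growing” signal above can be approximated with a trivial trend check over sampled queue depths. This is a sketch: the sample values are illustrative, and in practice you would feed them from Metrics Explorer or your own collector.

```shell
# Sketch of a "backlog growing" check: alert when queue depth rises across
# every consecutive sample. The values below are illustrative only.
samples="12 18 25 31"

prev=-1
growing=1
for s in $samples; do
  # Any non-increasing step clears the alert condition.
  if [ "$s" -le "$prev" ]; then
    growing=0
  fi
  prev=$s
done

if [ "$growing" -eq 1 ]; then
  echo "ALERT: backlog growing across all samples"
fi
```

A sustained-trend check like this avoids paging on a single noisy sample; tune the window length to your workload.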
Alerting and integrations
- Forward host-level logs to your central platform (Loki/ELK/Splunk/etc.) for long-term retention and correlation.
- For LyftData telemetry (logs/issues/metrics), prefer the UI surfaces and CLI (`lyftdata workers logs <worker-id>`, `lyftdata workers metrics <worker-id>`).
- If you run collectors from fixed IPs, consider the server allowlist (`--whitelist` / `LYFTDATA_API_WHITELIST`), which enables read-only access to selected worker telemetry endpoints when the collector sends `Authorization: WHITELIST` (see Security hardening).
- Notify on-call channels when pipelines stall, workers flap, or error budgets are exceeded.
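A notification hook can be as small as a grep over worker status. In this sketch the two-column output format is an assumption about what `lyftdata workers list` prints, and the final `echo` is a placeholder for your real notifier; adjust both to your environment.

```shell
# Sketch: page the on-call channel when any worker reports offline.
# The two-column layout below is an assumed shape for `lyftdata workers list`
# output; adjust the grep pattern to what your CLI version actually prints.
workers_output="worker-1  online
worker-2  offline
worker-3  online"
# In practice: workers_output=$(lyftdata workers list)

offline_count=$(printf '%s\n' "$workers_output" | grep -c 'offline$' || true)

if [ "$offline_count" -gt 0 ]; then
  # Replace echo with your notifier (Slack webhook, PagerDuty, etc.).
  echo "ALERT: $offline_count worker(s) offline"
fi
```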
Health checks and diagnostics
- HTTP probes: `GET /api/liveness` for basic reachability; `GET /api/health` for a richer status summary.
- CLI: `lyftdata doctor` and `lyftdata server health` for guided checks.
- Synthetic jobs: schedule a tiny canary job that runs every few minutes and alerts if it fails.
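Probes are most useful when scripted. The sketch below turns the `/api/health` summary into a pass/fail exit code; the hard-coded payload is a stand-in with the same fields queried earlier (its values are illustrative, not real output), and the `sed` extraction avoids a hard dependency on `jq`.

```shell
# Sketch: turn the /api/health summary into a pass/fail exit code for scripts.
# The payload below is an illustrative stand-in; in practice:
#   payload=$(curl -fsS -H "Authorization: Bearer <admin-token>" \
#     https://<server>:3000/api/health)
payload='{"status":"ok","version":"1.4.2"}'

# Extract the "status" field without requiring jq.
status=$(printf '%s' "$payload" | sed -n 's/.*"status":"\([^"]*\)".*/\1/p')

if [ "$status" = "ok" ]; then
  echo "health: ok"
else
  echo "health: ${status:-unknown}" >&2
  exit 1
fi
```

Wired into cron or a systemd timer, the non-zero exit code is enough for most alerting glue.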
Troubleshooting signals
- Jobs stuck in Staged: verify target workers are online and the scheduling queue is clear.
- Rising retries on a connector: inspect connector logs and downstream APIs for throttling or auth errors.
- High backlog with low CPU: scale out workers or increase job concurrency; see the Scaling guide.
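For the “stuck in Staged” case, a quick filter over job state can confirm the symptom before you dig into workers. The column layout here is an assumed shape for `lyftdata jobs list` output; adapt the match to what your CLI actually prints.

```shell
# Sketch: flag jobs that sit in the Staged state. The sample output below is
# an assumption about the `lyftdata jobs list` format.
jobs_output="nightly-etl   Deployed
orders-sync   Staged
events-ingest Deployed"
# In practice: jobs_output=$(lyftdata jobs list)

staged=$(printf '%s\n' "$jobs_output" | awk '$2 == "Staged" { print $1 }')

if [ -n "$staged" ]; then
  echo "Jobs stuck in Staged: $staged"
fi
```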
What’s next
- Hardening operations continues in the rest of the Operate section.
- For immediate triage, pair this guide with the Troubleshooting reference.