# Backup and Recovery
A resilient LyftData deployment needs regular backups and rehearsed recovery procedures. This runbook captures what to back up, how often to do it, and how to validate restores.
## What to back up
| Component | Why it matters | Suggested cadence |
|---|---|---|
| Job definitions (export) | Portable backup of pipeline definitions, independent of server state | Nightly |
| Server staging directory (`LYFTDATA_STAGING_DIR`) | Primary server state (job metadata, deployment state, stored telemetry) | Daily snapshots |
| Server startup config | How the server is started (systemd unit/launchd, container manifests, env files) | Weekly or whenever changed |
| Worker startup config + jobs directory (`LYFTDATA_JOBS_DIR`) | Worker identity/credentials and local state | Weekly (or treat as rebuildable) |
| TLS material | Certificates/keys if using built-in TLS or a reverse proxy | Aligned to rotation schedule |
| Secrets and master keys | Worker API keys, variables/credential-manager keys, license key | Aligned to rotation schedule |
| Logs / audit exports | Forensics and compliance outside LyftData retention windows | Daily export with 30–90 day retention |
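The last two rows of this table hold key material that should never be stored in plaintext. As one minimal sketch of encrypting it at rest before it is copied anywhere, using GnuPG for symmetric encryption; the `/etc/lyftdata/tls` path is a placeholder for wherever your certificates actually live:

```bash
# Sketch: archive TLS material and encrypt it before storing (paths are placeholders)
TLS_DIR="/etc/lyftdata/tls"   # placeholder; use your real certificate directory
OUT="backups/tls-$(date +%Y%m%d).tar.gz"
sudo tar -czf "$OUT" -C "$(dirname "$TLS_DIR")" "$(basename "$TLS_DIR")"

# Symmetric encryption prompts for a passphrase and writes ${OUT}.gpg;
# remove the plaintext archive once the encrypted copy exists.
gpg --symmetric --cipher-algo AES256 "$OUT" && shred -u "$OUT"
```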
## Quick export commands
```bash
# Export all jobs to a dated directory and compress it
EXPORT_ROOT=backups/jobs-$(date +%Y%m%d)
lyftdata jobs export --dir "$EXPORT_ROOT"
tar -czf "${EXPORT_ROOT}.tar.gz" -C "$(dirname "$EXPORT_ROOT")" "$(basename "$EXPORT_ROOT")"
```
```bash
# Snapshot server state (example for a Linux systemd install)
# Adjust for your deployment method (containers/VMs/launchd/etc).
STAGING_DIR="${LYFTDATA_STAGING_DIR:-/var/lib/lyftdata-server}"
sudo systemctl stop lyftdata-server
sudo tar -czf "backups/server-staging-$(date +%Y%m%d).tar.gz" -C "$(dirname "$STAGING_DIR")" "$(basename "$STAGING_DIR")"
sudo systemctl start lyftdata-server
```
```bash
# Optional: back up Linux service config if you followed the install guide
sudo tar -czf "backups/server-service-config-$(date +%Y%m%d).tar.gz" \
  /etc/systemd/system/lyftdata-server.service \
  /etc/default/lyftdata-server
```

Store backups in two places: fast local storage for quick restores and offsite object storage for disasters. Encrypt sensitive archives before upload.
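For the offsite copy, one sketch that assumes the sensitive archives have already been encrypted to `.gpg` files (as in the TLS example above) and an S3-compatible bucket named `lyftdata-backups`, which is a placeholder:

```bash
# Push only encrypted archives offsite; plaintext stays on local storage.
aws s3 sync backups/ "s3://lyftdata-backups/$(hostname)/" \
  --exclude '*' --include '*.gpg'
```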
## Automate daily configuration backup
```bash
#!/usr/bin/env bash
set -euo pipefail

BACKUP_DIR="/var/backups/lyftdata"
DATE=$(date +%Y%m%d-%H%M%S)
mkdir -p "$BACKUP_DIR"

# Job definitions: export, archive, then drop the uncompressed copy
EXPORT_DIR="$BACKUP_DIR/jobs-$DATE"
lyftdata jobs export --dir "$EXPORT_DIR"
tar -czf "$BACKUP_DIR/jobs-$DATE.tar.gz" -C "$BACKUP_DIR" "jobs-$DATE"
rm -rf "$EXPORT_DIR"

# Server state + config (systemd example; adjust for your environment)
STAGING_DIR="${LYFTDATA_STAGING_DIR:-/var/lib/lyftdata-server}"
systemctl stop lyftdata-server
tar -czf "$BACKUP_DIR/server-staging-$DATE.tar.gz" -C "$(dirname "$STAGING_DIR")" "$(basename "$STAGING_DIR")"
systemctl start lyftdata-server
tar -czf "$BACKUP_DIR/server-config-$DATE.tar.gz" \
  /etc/systemd/system/lyftdata-server.service \
  /etc/default/lyftdata-server

# Prune archives older than 30 days
find "$BACKUP_DIR" -type f -mtime +30 -delete
```

## Validate backups
Run automated checks to ensure archives are usable:
```bash
#!/usr/bin/env bash
set -euo pipefail

BACKUP="$1"
TEMP_DIR=$(mktemp -d)
trap 'rm -rf "$TEMP_DIR"' EXIT

tar -xzf "$BACKUP" -C "$TEMP_DIR"
EXPORT_DIR=$(find "$TEMP_DIR" -maxdepth 1 -type d -name 'jobs-*' -print -quit)
if [ -z "$EXPORT_DIR" ]; then
  echo "could not locate exported jobs directory" >&2
  exit 1
fi

# Dry-run import confirms the job definitions still load
lyftdata jobs import --dry-run --dir "$EXPORT_DIR"
```

Schedule validation weekly; surface failures in monitoring.
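To put both scripts on a schedule, a cron sketch; the script paths and times below are placeholders, not part of any LyftData install:

```bash
# /etc/cron.d/lyftdata-backup (illustrative)
# Daily backup at 02:15; weekly dry-run validation of the newest jobs archive.
15 2 * * * root /usr/local/bin/lyftdata-backup.sh >> /var/log/lyftdata-backup.log 2>&1
0  3 * * 0 root /usr/local/bin/lyftdata-validate.sh "$(ls -t /var/backups/lyftdata/jobs-*.tar.gz | head -n 1)" >> /var/log/lyftdata-backup.log 2>&1
```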
## Recovery steps
- Restore the server:
  - Rebuild the host/container and reinstall the same (or a compatible) LyftData version.
  - Restore your startup config (systemd unit/launchd, env file, container manifests, TLS material).
  - Restore the server staging directory (`LYFTDATA_STAGING_DIR`) and restart the service.
- Re-register workers:
  - Reinstall `lyftdata-worker` on each worker host.
  - Restore each worker's config and jobs directory (`LYFTDATA_JOBS_DIR`) so it retains its identity and credentials.
  - Confirm workers appear in the Workers UI or via `lyftdata workers list`.
- Redeploy jobs:
  - Extract the latest job archive and run `lyftdata jobs import --dir <path> --update`.
  - Confirm job state matches expectations in the UI.
- Validate (see the smoke-test sketch after this list):
  - Confirm liveness (`GET /api/liveness`) and sign-in.
  - Run canary jobs or sample pipelines.
  - Watch the first hour of Logs & Issues for new errors.
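As a post-restore smoke test, a minimal sketch that checks the liveness endpoint and worker registration; `SERVER_URL` is a placeholder for your restored server's address:

```bash
#!/usr/bin/env bash
set -euo pipefail
SERVER_URL="${SERVER_URL:-https://lyftdata.example.com}"   # placeholder

# Liveness endpoint from the checklist above; -f makes curl fail on HTTP errors.
curl -fsS "$SERVER_URL/api/liveness" > /dev/null && echo "liveness: OK"

# Workers should reappear once their config and jobs directories are restored.
lyftdata workers list
```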
## Disaster recovery tips
- Keep infrastructure-as-code scripts handy to recreate servers and workers in new regions.
- Document RPO/RTO expectations (e.g., at most one hour of data loss and a four-hour recovery window).
- Test restore procedures quarterly to ensure runbooks stay current.
See also: Monitoring LyftData for detection signals and the troubleshooting guide for incident triage.