Backup and Recovery

A resilient LyftData deployment needs regular backups and rehearsed recovery procedures. This runbook captures what to back up, how often to do it, and how to validate restores.

What to back up

| Component | Why it matters | Suggested cadence |
| --- | --- | --- |
| Job definitions (export) | Portable backup of pipeline definitions, independent of server state | Nightly |
| Server staging directory (LYFTDATA_STAGING_DIR) | Primary server state (job metadata, deployment state, stored telemetry) | Daily snapshots |
| Server startup config | How the server is started (systemd unit/launchd, container manifests, env files) | Weekly, or whenever it changes |
| Worker startup config + jobs directory (LYFTDATA_JOBS_DIR) | Worker identity/credentials and local state | Weekly (or treat as rebuildable) |
| TLS material | Certificates/keys if using built-in TLS or a reverse proxy | Aligned to rotation schedule |
| Secrets and master keys | Worker API keys, variables/credential-manager keys, license key | Aligned to rotation schedule |
| Logs / audit exports | Forensics and compliance beyond LyftData retention windows | Daily export with 30–90 day retention |

Quick export commands

# Export all jobs to a dated directory and compress it
EXPORT_ROOT=backups/jobs-$(date +%Y%m%d)
lyftdata jobs export --dir "$EXPORT_ROOT"
tar -czf "${EXPORT_ROOT}.tar.gz" -C "$(dirname "$EXPORT_ROOT")" "$(basename "$EXPORT_ROOT")"

# Snapshot server state (example for a Linux systemd install).
# Adjust for your deployment method (containers/VMs/launchd/etc).
STAGING_DIR="${LYFTDATA_STAGING_DIR:-/var/lib/lyftdata-server}"
sudo systemctl stop lyftdata-server
sudo tar -czf "backups/server-staging-$(date +%Y%m%d).tar.gz" -C "$(dirname "$STAGING_DIR")" "$(basename "$STAGING_DIR")"
sudo systemctl start lyftdata-server

# Optional: back up Linux service config if you followed the install guide
sudo tar -czf "backups/server-service-config-$(date +%Y%m%d).tar.gz" \
  /etc/systemd/system/lyftdata-server.service \
  /etc/default/lyftdata-server

Store backups in two places: fast local storage for quick restores and offsite object storage for disasters. Encrypt sensitive archives before upload.
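
One minimal sketch of the encrypt-then-upload step, assuming age for encryption and the AWS CLI for object storage; the recipient key and bucket name are placeholders, and equivalent tools (gpg, rclone) work the same way:

ARCHIVE="backups/server-staging-$(date +%Y%m%d).tar.gz"
# Encrypt before anything leaves the host; $AGE_RECIPIENT is a placeholder public key
age -r "$AGE_RECIPIENT" -o "${ARCHIVE}.age" "$ARCHIVE"
# Ship only the encrypted copy to offsite object storage; the bucket is a placeholder
aws s3 cp "${ARCHIVE}.age" "s3://example-lyftdata-offsite/$(hostname)/"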

Automate daily backups

#!/usr/bin/env bash
# Daily LyftData backup: job export, server staging snapshot, and service config.
# Intended to run as root via cron or a systemd timer.
set -euo pipefail

BACKUP_DIR="/var/backups/lyftdata"
DATE=$(date +%Y%m%d-%H%M%S)
mkdir -p "$BACKUP_DIR"

# Export job definitions, then keep only the compressed archive
EXPORT_DIR="$BACKUP_DIR/jobs-$DATE"
lyftdata jobs export --dir "$EXPORT_DIR"
tar -czf "$BACKUP_DIR/jobs-$DATE.tar.gz" -C "$BACKUP_DIR" "jobs-$DATE"
rm -rf "$EXPORT_DIR"

# Server state + config (systemd example; adjust for your environment).
# Stop the server so the staging snapshot is internally consistent.
STAGING_DIR="${LYFTDATA_STAGING_DIR:-/var/lib/lyftdata-server}"
systemctl stop lyftdata-server
tar -czf "$BACKUP_DIR/server-staging-$DATE.tar.gz" -C "$(dirname "$STAGING_DIR")" "$(basename "$STAGING_DIR")"
systemctl start lyftdata-server
tar -czf "$BACKUP_DIR/server-config-$DATE.tar.gz" /etc/systemd/system/lyftdata-server.service /etc/default/lyftdata-server

# Prune local archives older than 30 days
find "$BACKUP_DIR" -type f -mtime +30 -delete

Validate backups

Run automated checks to ensure archives are usable:

#!/usr/bin/env bash
set -euo pipefail

# Usage: pass the path to a jobs-*.tar.gz archive produced by the backup script
BACKUP="${1:?usage: $0 <jobs-archive.tar.gz>}"
TEMP_DIR=$(mktemp -d)
trap 'rm -rf "$TEMP_DIR"' EXIT

tar -xzf "$BACKUP" -C "$TEMP_DIR"
EXPORT_DIR=$(find "$TEMP_DIR" -maxdepth 1 -type d -name 'jobs-*' -print -quit)
if [ -z "$EXPORT_DIR" ]; then
  echo "could not locate exported jobs directory" >&2
  exit 1
fi

# Dry-run import confirms the job definitions still load
lyftdata jobs import --dry-run --dir "$EXPORT_DIR"

Schedule validation weekly; surface failures in monitoring.
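
One way to wire that up, assuming the validation script above is installed at /usr/local/sbin/lyftdata-validate-backup.sh (a placeholder path) and cron mail is routed into your alerting:

# /etc/cron.d/lyftdata-validate: check the newest jobs archive every Monday at 03:00
MAILTO=ops@example.com
0 3 * * 1 root /usr/local/sbin/lyftdata-validate-backup.sh "$(ls -t /var/backups/lyftdata/jobs-*.tar.gz | head -n 1)"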

Recovery steps

  1. Restore the server (see the command sketch after this list):
    • Rebuild host/container and reinstall the same (or compatible) LyftData version.
    • Restore your startup config (systemd unit/launchd, env file, container manifests, TLS material).
    • Restore the server staging directory (LYFTDATA_STAGING_DIR) and restart the service.
  2. Re-register workers:
    • Reinstall lyftdata-worker on each worker host.
    • Restore each worker’s config and jobs directory (LYFTDATA_JOBS_DIR) so it retains its identity and credentials.
    • Confirm workers appear in the Workers UI or via lyftdata workers list.
  3. Redeploy jobs:
    • Extract the latest job archive and run lyftdata jobs import --dir <path> --update.
    • Confirm the job state matches expectations in the UI.
  4. Validate:
    • Confirm liveness (GET /api/liveness) and sign-in.
    • Run canary jobs or sample pipelines.
    • Watch the first hour of Logs & Issues for new errors.
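
A condensed sketch of steps 1, 3, and 4 on a rebuilt systemd host. It assumes archives produced by the backup script above and the default staging path; the dates and hostname are placeholders:

# Step 1: restore service config (GNU tar stored these paths relative to /) and staging state
sudo tar -xzf server-config-20240101-020000.tar.gz -C /
sudo tar -xzf server-staging-20240101-020000.tar.gz -C /var/lib
sudo systemctl daemon-reload
sudo systemctl start lyftdata-server

# Step 3: re-import job definitions from the latest export
tar -xzf jobs-20240101-020000.tar.gz
lyftdata jobs import --dir jobs-20240101-020000 --update

# Step 4: basic liveness check before deeper validation
curl -fsS https://lyftdata.example.com/api/liveness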

Disaster recovery tips

  • Keep infrastructure-as-code scripts handy to recreate servers and workers in new regions.
  • Document RPO/RTO expectations (e.g., at most one hour of data loss, a four-hour recovery window).
  • Test restore procedures quarterly to ensure runbooks stay current.

See also: Monitoring LyftData for detection signals and the troubleshooting guide for incident triage.