File Handling

LyftData can write data to two broad classes of destinations:

  • Log files via the file output (newline-delimited events written to the local filesystem)
  • Block stores / object stores via outputs like file-store, s3, gcs, azure-blob, and web-dav-store (objects written per batch; no append)

These can look similar in the editor, but their on-disk (and on-cloud) semantics differ in important ways.

Pick the right output

| You need | Prefer | Why |
| --- | --- | --- |
| Append events to a local file you can tail | file | Writes one line per event and keeps a single growing file. |
| Durable, partitioned exports to a “bucket” | Block store outputs (s3, gcs, azure-blob, file-store, web-dav-store) | Writes whole objects; supports batching and preprocessors; avoids “append to object” pitfalls. |
| “Object store semantics” on disk | file-store | Mirrors cloud object stores while writing to a local directory. |

Where data is written

All outputs run on the worker executing the job:

  • If you run on the built-in worker, paths are on the server host/container.
  • If you run on an external worker, paths are on that worker host/container.

Use absolute paths and ensure the target directory is persistent (volume-mounted in containers) and writable.

Log Files (file): append vs overwrite (and flushing)

The Log Files output writes one line per event (NDJSON-style when writing full JSON events).

  • Default behavior is append: file-per-event: false appends to an existing file (and creates it if missing).
  • file-per-event: true writes each event by opening the path for a fresh write (so a constant path will be overwritten repeatedly). Use this only with a unique path (for example, by including ${} expansions in the filename).
  • compress-after: true gzips the previous file when the expanded path changes (and removes the uncompressed file).
  • truncate: true truncates a file the first time it is seen during a run (useful when re-running a job into the same path).
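The behaviors above can be combined in a single file output. A minimal sketch, assuming a `path` field for the target file (the field name is an assumption; file-per-event, compress-after, and truncate are documented above):

```yaml
output:
  file:
    # Assumed field name for the target file; include ${} expansions
    # only if you intend a new file per expansion value.
    path: /var/log/lyftdata/events-${partition||unknown}.ndjson
    file-per-event: false   # default: append to the file, creating it if missing
    compress-after: true    # gzip the previous file when the expanded path changes
    truncate: true          # truncate the first time this path is seen in a run
```

With `file-per-event: false` and a per-partition path, each partition gets its own growing NDJSON file; `compress-after` then gzips a file once events move on to a new path.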

Flushing: as of LyftData 2.0.2, the file output flushes each event write (the flush-at-end field exists in the schema but is not yet honored). For high-throughput exports, prefer a block store output with batching instead of relying on file buffering.

Block store outputs: objects, GUIDs, and “no append”

Block store outputs write objects, not log files:

  • There is no append operation for an existing object. Each put writes a complete object body.

  • By default, the outputs append a GUID to avoid collisions. The default guid-prefix is /, so object-name.name: processed/summary.json becomes keys like processed/summary.json/<uuid>.

  • If you want filename-style keys, keep the GUID enabled but set guid-prefix / guid-suffix. For example:

    • object-name: { name: processed/${partition||unknown}/summary }
    • guid-prefix: -
    • guid-suffix: .json.gz

    …yields keys like processed/2026-03-19/summary-<uuid>.json.gz.

  • Set disable-*-name-guid: true only when you intend deterministic overwrites or your name is already unique.

  • Preprocessors like gzip compress bytes but do not rename objects; include .gz in the key yourself.
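Putting the naming rules together, here is a hedged sketch of a block store output with filename-style keys and gzip compression. The `directory` field and the exact preprocessor syntax are assumptions; object-name, guid-prefix, and guid-suffix follow the example above:

```yaml
output:
  file-store:
    directory: /exports            # assumed field name for the local root
    preprocessors:
      - gzip: {}                   # assumed syntax; compresses bytes, does not rename
    object-name:
      name: processed/${partition||unknown}/summary
    guid-prefix: "-"
    guid-suffix: ".json.gz"        # include .gz yourself; gzip will not add it
```

This keeps the GUID for collision safety while producing keys like processed/2026-03-19/summary-&lt;uuid&gt;.json.gz.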

For large batched uploads, prefer streaming multipart writes when supported:

runtime-options:
  prefer-streaming-outputs: true

This reduces peak memory by streaming batched uploads instead of buffering the entire combined batch in memory.

Naming and partitioning with ${} expansions

Many output fields accept ${} expansions evaluated per event (including file paths and object names). Use || to provide defaults:

  • ${partition||unknown}
  • ${host||unassigned}

It is usually clearer to create a single partition field in your actions (and sanitize it), then reference it from outputs.

When batching is enabled on a block store output, expansions are resolved once per batch (using the last event in the batch). Ensure each batch contains only one logical partition (or use ${stat|batch_number} instead of per-event fields).

Example: build a partition prefix and use it in an object-store key:

actions:
  - slugify:
      input-field: host
      output-field: host_slug
  - add:
      output-fields:
        partition: host/${host_slug}
output:
  s3:
    bucket-name: analytics-prod-archive
    object-name:
      name: exports/${partition||unknown}/events
    guid-prefix: "-"
    guid-suffix: ".ndjson"

If you need a simple per-batch uniqueness knob, outputs can also expand ${stat|batch_number} when batching is enabled (for example: events-${stat|batch_number}.ndjson).
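As a sketch of that knob in context (field names other than those documented above are assumptions), a batched s3 output keyed by batch number might look like:

```yaml
output:
  s3:
    bucket-name: analytics-prod-archive   # assumed bucket name, as in the example above
    object-name:
      name: exports/events-${stat|batch_number}
    guid-suffix: ".ndjson"
```

Because ${stat|batch_number} is resolved per batch rather than per event, it sidesteps the last-event-wins caveat described earlier for per-event expansions under batching.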