
Transforming Data

LyftData jobs transform events as they flow from a single input to a single output. This guide helps you choose the right action chain, apply AI where it adds value, and harden pipelines before they reach production.

Choose the right transformation path

Goal                               | Recommended action chain                       | Edition
-----------------------------------|------------------------------------------------|--------------------
Parse logs into fields             | extract -> key-value -> convert                | Both
Normalize JSON payloads            | json -> flatten -> rename -> remove            | Both
Join reference data                | enrich -> time                                 | Both
Prepare documents for AI workflows | pdf-text or docx-to-text -> chunk -> tokenize  | Both
Generate structured model output   | chunk -> infer -> assert                       | infer is Enterprise
Embed and cluster records          | infer (embedding) -> cluster                   | infer is Enterprise

Sequence actions for predictable behavior

Use this order unless you have a specific reason not to:

  1. Parse (json, csv, xml, extract).
  2. Shape fields (rename, copy, remove, flatten, convert).
  3. Enrich (add, enrich, time, optional script).
  4. Validate and gate (filter, assert, abort).
  5. Route signals (message) and deliver via output.

This keeps validation close to the final payload shape and reduces hard-to-debug side effects.
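Conceptually, this ordering is plain function composition over an event. A minimal Python sketch of the idea (the log format and field names are invented for illustration; this is not LyftData syntax):

```python
# Each stage mirrors one step of the recommended order:
# parse -> shape -> enrich -> validate. Validation runs last,
# against the final payload shape.
def parse(raw):
    # Split "key=value" pairs out of a log line (hypothetical format).
    return dict(kv.split("=", 1) for kv in raw.split())

def shape(event):
    # Rename "lvl" -> "level" and convert "code" to an integer.
    return {"level": event["lvl"], "code": int(event["code"])}

def enrich(event):
    # Add a static reference field.
    return {**event, "source": "app"}

def validate(event):
    # Gate on the final shape, so failures point at real output problems.
    assert isinstance(event["code"], int), "code must be numeric"
    return event

def pipeline(raw):
    return validate(enrich(shape(parse(raw))))
```

Because `validate` sees the fully shaped event, a failure there points at the payload you actually deliver, not at an intermediate form.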

AI-assisted transformation patterns

Structured extraction with guardrails

actions:
  - chunk:
      input-field: body
      output-field: chunks
  - infer:
      workload:
        llm-completion:
          llm:
            provider: openai-compat
            model: your-model
          input-field: chunks
          response-field: ai_result
          response-format: json
          prompt:
            schema: '{"type":"object"}'
          timeout-ms: 15000
          on-error: dlq:ai_failures
  - assert:
      behaviour: drop-on-failure
      schema:
        schema-string: '{"type":"object"}'
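The role of the assert step can be illustrated outside the DSL: parse the model's raw response and drop anything that is not a JSON object. A minimal Python sketch of that drop-on-failure guardrail (the helper name is invented; this is not LyftData code):

```python
import json

def guard(raw_response):
    """Drop-on-failure check for {"type": "object"}: return the parsed
    payload, or None when the event should be dropped."""
    try:
        value = json.loads(raw_response)
    except ValueError:
        return None  # not valid JSON: drop the event
    if not isinstance(value, dict):
        return None  # valid JSON but not an object: schema fails
    return value
```

Running such a check immediately after inference keeps malformed model output from reaching downstream systems.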

Embedding pipeline for downstream analytics

actions:
  - infer:
      workload:
        embedding:
          embedding:
            provider: openai-compat
            model: your-embedding-model
          input-field: text
          response-field: vector
  - cluster:
      input-field: vector
      output-field: cluster_id
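The cluster step assigns each embedding vector a cluster id. A toy Python sketch of the idea, using nearest-centroid assignment with fixed centroids (real clustering would learn the centroids, for example with k-means; the function name is illustrative):

```python
def assign_cluster(vector, centroids):
    """Return the index of the centroid closest to vector
    (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: dist2(vector, centroids[i]))
```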

Defaults, filters, and empty events

When you write line-oriented outputs (for example CSV/text) through output.file with input-field, treat missing/null/empty values explicitly:

  • Missing or null input-field: LyftData emits nothing (no line is written).
  • Empty payload events (for example events dropped/filtered earlier): LyftData emits nothing.
  • Empty strings ("") are valid values: a blank line is written.
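The three rules above fit in a few lines. A minimal Python model of what gets written per event (the function name is invented; this is not LyftData code, and dropped/filtered events simply never reach it):

```python
def line_for(value):
    """Return the text written for one event's input-field value,
    or None when nothing is emitted at all."""
    if value is None:
        return None           # missing or null field: no line written
    return str(value) + "\n"  # empty string "" still yields a blank line
```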

Recommended pattern:

  1. Use add to set default values before output when a field must always exist.
  2. Use filter to drop rows you do not want (for example empty-string rows).
  3. Use csv-stringify to create correctly escaped CSV rows, then point output.file.input-field to that generated field.

Example:

actions:
  - add:
      kv-pairs:
        csv_row: ""
  - csv-stringify:
      fields: [name, price, category]
      output-field: csv_row
  - filter:
      condition: "csv_row ~= ''"
output:
  file:
    path: /tmp/export.csv
    input-field: csv_row
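The reason to generate rows with csv-stringify rather than string concatenation is escaping. Python's standard csv module shows the behavior a correct stringifier must have (an illustration, not LyftData internals):

```python
import csv
import io

def csv_row(values):
    """Render one correctly escaped CSV row (no trailing newline)."""
    buf = io.StringIO()
    csv.writer(buf).writerow(values)
    return buf.getvalue().rstrip("\r\n")
```

Fields containing commas or quotes are quoted and embedded quotes are doubled, which a naive `",".join(...)` would get wrong.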

Production guardrails

  • Treat script as an escape hatch; prefer declarative actions first.
  • Add explicit conversion and validation behavior (convert + assert / filter) so bad events fail predictably.
  • For infer, always set timeout-ms, rate-limit, concurrency, and on-error explicitly.
  • Use response-format: json and a schema when model output feeds downstream systems.
  • Keep secrets in variables such as ${dyn|OPENAI_API_KEY}, never inline in job definitions.

Run & Trace checklist

  • Verify each step changes only the fields you expect.
  • Check field types after convert and time, not just field names.
  • Test malformed inputs to validate failure behavior (drop, abort, or DLQ).
  • For AI jobs, inspect response shape stability across multiple samples and confirm latency under realistic payload sizes.

For full field-level options, use the DSL index and open the linked action pages.