
Transforming Data

LyftData jobs transform events as they flow from a single input to a single output. This guide helps you choose the right action chain, apply AI where it adds value, and harden pipelines before they reach production.

Choose the right transformation path

Goal                               | Recommended action chain                       | Edition
-----------------------------------|------------------------------------------------|--------------------
Parse logs into fields             | extract -> key-value -> convert                | Both
Normalize JSON payloads            | json -> flatten -> rename -> remove            | Both
Join reference data                | enrich -> time                                 | Both
Prepare documents for AI workflows | pdf-text or docx-to-text -> chunk -> tokenize  | Both
Generate structured model output   | chunk -> infer -> assert                       | infer is Enterprise
Embed and cluster records          | infer (embedding) -> cluster                   | infer is Enterprise

Sequence actions for predictable behavior

Use this order unless you have a specific reason not to:

  1. Parse (json, csv, xml, extract).
  2. Shape fields (rename, copy, remove, flatten, convert).
  3. Enrich (add, enrich, time, optional script).
  4. Validate and gate (filter, assert, abort).
  5. Route signals (message) and deliver via output.

This keeps validation close to the final payload shape and reduces hard-to-debug side effects.
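Conceptually, this ordering is plain function composition over an event. A minimal Python sketch of the idea (the log format and field names are invented for illustration; this is not LyftData syntax):

```python
# Each stage mirrors one step of the recommended order:
# parse -> shape -> enrich -> validate. Validation runs last,
# against the final payload shape.
def parse(raw):
    # Split "key=value" pairs out of a log line (hypothetical format).
    return dict(kv.split("=", 1) for kv in raw.split())

def shape(event):
    # Rename "lvl" -> "level" and convert "code" to an integer.
    return {"level": event["lvl"], "code": int(event["code"])}

def enrich(event):
    # Add a static reference field.
    return {**event, "source": "app"}

def validate(event):
    # Gate on the final shape, so failures point at real output problems.
    assert isinstance(event["code"], int), "code must be numeric"
    return event

def pipeline(raw):
    return validate(enrich(shape(parse(raw))))
```

Because `validate` sees the fully shaped event, a failure there points at the payload you actually deliver, not at an intermediate form.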

AI-assisted transformation patterns

Structured extraction with guardrails

actions:
  - chunk:
      input-field: body
      output-field: chunks
  - infer:
      workload:
        llm-completion:
          llm:
            provider: openai-compat
            model: your-model
          input-field: chunks
          response-field: ai_result
          response-format: json
          prompt:
            schema: '{"type":"object"}'
          timeout-ms: 15000
          on-error: dlq:ai_failures
  - assert:
      behaviour: drop-on-failure
      schema:
        schema-string: '{"type":"object"}'
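The role of the assert step can be illustrated outside the DSL: parse the model's raw response and drop anything that is not a JSON object. A minimal Python sketch of that drop-on-failure guardrail (the helper name is invented; this is not LyftData code):

```python
import json

def guard(raw_response):
    """Drop-on-failure check for {"type": "object"}: return the parsed
    payload, or None when the event should be dropped."""
    try:
        value = json.loads(raw_response)
    except ValueError:
        return None  # not valid JSON: drop the event
    if not isinstance(value, dict):
        return None  # valid JSON but not an object: schema fails
    return value
```

Running such a check immediately after inference keeps malformed model output from reaching downstream systems.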

Embedding pipeline for downstream analytics

actions:
  - infer:
      workload:
        embedding:
          embedding:
            provider: openai-compat
            model: your-embedding-model
          input-field: text
          response-field: vector
  - cluster:
      input-field: vector
      output-field: cluster_id
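The cluster step assigns each embedding vector a cluster id. A toy Python sketch of the idea, using nearest-centroid assignment with fixed centroids (real clustering would learn the centroids, for example with k-means; the function name is illustrative):

```python
def assign_cluster(vector, centroids):
    """Return the index of the centroid closest to vector
    (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: dist2(vector, centroids[i]))
```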

Defaults, filters, and empty events

When you write line-oriented outputs (for example CSV/text) through output.file with input-field, treat missing/null/empty values explicitly:

  • Missing or null input-field: LyftData emits nothing (no line is written).
  • Empty payload events (for example events dropped/filtered earlier): LyftData emits nothing.
  • Empty strings ("") are valid values: a blank line is written.
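The three rules above fit in a few lines. A minimal Python model of what gets written per event (the function name is invented; this is not LyftData code, and dropped/filtered events simply never reach it):

```python
def line_for(value):
    """Return the text written for one event's input-field value,
    or None when nothing is emitted at all."""
    if value is None:
        return None           # missing or null field: no line written
    return str(value) + "\n"  # empty string "" still yields a blank line
```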

Recommended pattern:

  1. Use add to set default values before output when a field must always exist.
  2. Use filter to drop rows you do not want (for example empty-string rows).
  3. Use csv-stringify to create correctly escaped CSV rows, then point output.file.input-field to that generated field.

Example:

actions:
  - add:
      kv-pairs:
        csv_row: ""
  - csv-stringify:
      fields: [name, price, category]
      output-field: csv_row
  - filter:
      condition: "csv_row ~= ''"
output:
  file:
    path: /tmp/export.csv
    input-field: csv_row
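The reason to generate rows with csv-stringify rather than string concatenation is escaping. Python's standard csv module shows the behavior a correct stringifier must have (an illustration, not LyftData internals):

```python
import csv
import io

def csv_row(values):
    """Render one correctly escaped CSV row (no trailing newline)."""
    buf = io.StringIO()
    csv.writer(buf).writerow(values)
    return buf.getvalue().rstrip("\r\n")
```

Fields containing commas or quotes are quoted and embedded quotes are doubled, which a naive `",".join(...)` would get wrong.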

Production guardrails

  • Treat script as an escape hatch; prefer declarative actions first.
  • Add explicit conversion and validation behavior (convert + assert / filter) so bad events fail predictably.
  • For infer, always set timeout-ms, rate-limit, concurrency, and on-error explicitly.
  • Use response-format: json and a schema when model output feeds downstream systems.
  • Keep secrets in variables such as ${dyn|OPENAI_API_KEY}, never inline in job definitions.

Run & Trace checklist

  • Verify each step changes only the fields you expect.
  • Check field types after convert and time, not just field names.
  • Test malformed inputs to validate failure behavior (drop, abort, or DLQ).
  • For AI jobs, inspect response shape stability across multiple samples and confirm latency under realistic payload sizes.

For full field-level options, use the DSL index and open the linked action pages.