Transform Google Analytics 4 Exports

This how-to shows how to convert GA4 export files into a predictable event schema you can route to storage, APIs, or downstream jobs.

What this pipeline should produce

  • One event per GA4 events[] item.
  • A canonical @timestamp field.
  • Consistent, flattened field names for downstream consumers.
  • Optional AI-derived labels (Enterprise) for segmentation.

1. Configure the input

  1. Create a job (for example ga4-normalized).
  2. Select your object-store input (s3, gcs, azure-blob, or file-store).
  3. Set the object prefix (for example exports/ga4/daily/) and enable fingerprinting.
  4. Under response handling, split records from the events array.

Tip: keep replay simple by using date-based prefixes and rotating job versions rather than editing production definitions in place.
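The date-based prefix convention from the tip can be sketched in plain Python; `daily_prefix` and the base path are illustrative names, not part of the product:

```python
from datetime import date, timedelta

def daily_prefix(base: str, day: date) -> str:
    """Build a date-based object prefix, e.g. exports/ga4/daily/2024/05/16/."""
    return f"{base}{day:%Y/%m/%d}/"

# Replaying a specific day is then just pointing a new job version
# at that day's prefix instead of editing the production definition.
yesterday = date(2024, 5, 17) - timedelta(days=1)
print(daily_prefix("exports/ga4/daily/", yesterday))
# exports/ga4/daily/2024/05/16/
```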

2. Baseline transformation stack

actions:
- json:
    input-field: data
- expand-events:
    array-field: events
- flatten:
    input-field: events
    separator: "."
- rename:
    fields:
      "events.event_params.key": param_key
      "events.event_params.value.string_value": param_value
- filter:
    how:
      expression: "events.name == 'purchase'"
- convert:
    fields:
      events.event_timestamp: num
    units:
      - field: events.event_timestamp
        from: microseconds
        to: milliseconds
- time:
    input-field: events.event_timestamp
    input-formats:
      - epoch_msecs
    output-field: '@timestamp'
    output-format: default_iso
- add:
    output-fields:
      dataset: "{{dataset}}"
      environment: "{{environment}}"
      source_object: "${msg|message_content.object_name||unknown}"

Why this order

  1. Parse first (json, expand-events).
  2. Shape the schema (flatten, rename, filter, convert).
  3. Add operational metadata last (time, add), including timestamp normalization.

This ordering keeps traces easier to read and reduces accidental field drift.
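The flatten step in the shaping phase can be sketched in plain Python. This is a hypothetical stand-in for the pipeline's flatten action, assuming dict-shaped records and the "." separator from the config above:

```python
def flatten(obj: dict, prefix: str = "", sep: str = ".") -> dict:
    """Recursively flatten nested dicts into dotted keys."""
    out = {}
    for key, value in obj.items():
        full = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, full, sep))  # descend into nested objects
        else:
            out[full] = value                      # leaves keep their values
    return out

sample = {"events": {"name": "purchase", "event_timestamp": 1715947200123456}}
print(flatten(sample))
# {'events.name': 'purchase', 'events.event_timestamp': 1715947200123456}
```

Flattening before rename and filter is what makes dotted paths like events.name addressable in later actions.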

events.event_timestamp in GA4 exports is typically in microseconds, so this example normalizes it to milliseconds before using time with epoch_msecs. If your upstream payload already uses milliseconds, remove the units conversion.
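A quick way to sanity-check the unit conversion on a real sample value, in plain Python and independent of the pipeline:

```python
from datetime import datetime, timezone

def ga4_ts_to_iso(event_timestamp_us: int) -> str:
    """GA4 event_timestamp is microseconds since epoch; normalize and format."""
    ms = event_timestamp_us // 1000                       # microseconds -> milliseconds
    dt = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)  # epoch_msecs -> datetime
    return dt.isoformat(timespec="milliseconds")

print(ga4_ts_to_iso(1715947200123456))
# 2024-05-17T12:00:00.123+00:00
```

If the printed value is off by a factor of ~1000 in either direction against your source data, the upstream payload is not in microseconds and the units conversion should be adjusted or removed.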

3. Optional AI enrichment (Enterprise)

Use infer when you want automated classification (for example purchase intent, campaign grouping, or anomaly flags) without a separate service.

- infer:
    workload:
      llm-completion:
        llm:
          provider: openai-compat
          model: your-model
        input-field: events.name
        response-field: ai_labels
        response-format: json
        prompt:
          schema: '{"type":"object"}'
        timeout-ms: 10000
        on-error: skip

Recommended production defaults: set rate-limit, concurrency, and cache, and validate the shape of ai_labels in Run & Trace.
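A minimal shape check you could run on sampled ai_labels values outside the pipeline (a sketch; inside the pipeline, validation happens in Run & Trace):

```python
import json

def valid_ai_labels(raw: str) -> bool:
    """Cheap shape check for an LLM response expected to be a JSON object."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False  # not JSON at all
    return isinstance(parsed, dict)  # must match {"type":"object"}

print(valid_ai_labels('{"intent": "purchase"}'))  # True
print(valid_ai_labels('["purchase"]'))            # False: array, not object
print(valid_ai_labels('not json'))                # False
```

Running this across samples from multiple runs catches the common LLM failure mode of occasionally emitting prose or arrays instead of the requested object.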

4. Add quality gates

Use filters for recoverable issues and assertions for hard contract failures.

- filter:
    how:
      expression: "exists(events.user_pseudo_id)"
- assert:
    schema:
      schema-string: '{"type":"object","required":["events.event_timestamp","events.user_pseudo_id"]}'
    behaviour: abort-on-failure
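The filter-versus-assert split can be illustrated in Python. `apply_quality_gates` is a hypothetical helper, not part of the product; it mirrors the semantics of the two actions above:

```python
from typing import Optional

def apply_quality_gates(event: dict) -> Optional[dict]:
    """Filter drops recoverable records; assert raises on contract failures."""
    # Recoverable: a record without a pseudo ID is simply dropped (filter).
    if "events.user_pseudo_id" not in event:
        return None
    # Hard contract: a missing timestamp means something is broken upstream
    # and the run should stop (assert with abort-on-failure).
    if "events.event_timestamp" not in event:
        raise ValueError("contract violation: events.event_timestamp missing")
    return event

ok = {"events.user_pseudo_id": "u1", "events.event_timestamp": 1715947200123}
print(apply_quality_gates(ok) is not None)  # True
print(apply_quality_gates({"events.event_timestamp": 1}))  # None (dropped)
```

The rule of thumb: if dropping the record is acceptable, filter; if the record's absence would silently corrupt downstream contracts, assert.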

5. Run & Trace checklist

  • Confirm that the event count after expand-events matches the size of the GA4 events[] array.
  • Verify @timestamp conversion on real samples.
  • Check renamed fields used by downstream dashboards.
  • For AI steps, verify ai_labels remains valid JSON across multiple runs.
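The first checklist item can also be verified offline against an export sample. This sketch assumes a newline-delimited file where each record carries an events[] array; names are illustrative:

```python
import json

def expanded_count_matches(raw_export: str, expanded: list) -> bool:
    """Check that expand-events produced one output record per events[] item."""
    records = [json.loads(line) for line in raw_export.splitlines() if line.strip()]
    expected = sum(len(r.get("events", [])) for r in records)
    return expected == len(expanded)

sample = '{"events":[{"name":"a"},{"name":"b"}]}\n{"events":[{"name":"c"}]}'
print(expanded_count_matches(sample, [{}, {}, {}]))  # True: 2 + 1 == 3
print(expanded_count_matches(sample, [{}]))          # False: records were lost
```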

6. Stage, deploy, and monitor

  1. Stage the job after trace validation.
  2. Deploy to non-production first and monitor throughput/error rates.
  3. Promote to production via your standard release path (manual or CI/CD automation).
  4. Add alerts for parse failures, assertion failures, and output delivery errors.