# Tokenize (`tokenize`)
Convert free-form text into token sequences for downstream analytics.
## Minimal example

```yaml
actions:
  - tokenize: {}
```

```json
{ "actions": [ { "tokenize": {} } ] }
```
## Fields
| Field | Type | Required | Description |
|---|---|---|---|
| description | string | | Describe this step. |
| condition | lua-expression (string) | | Only run this action if the condition is met. Examples: 2 * count() |
| input-field | field (string) | | Field containing the text to tokenize. Examples: data_field |
| output-field | field (string) | | Field to write tokens to (array output). Examples: data_field |
| tokenizer | Tokenizer | | Tokenizer implementation to use. Allowed values: whitespace, regex, byte-pair, wordpiece |
| pattern | string | | Optional regex or pattern used by certain tokenizer modes. |
| lowercase | boolean (bool) | ✓ | Convert text to lowercase prior to tokenization. Default: false |
| keep-punctuation | boolean (bool) | ✓ | Retain punctuation tokens. Default: false |
| emit-metadata | boolean (bool) | ✓ | Emit token metadata (offsets/ids) alongside raw values. Default: false |
| tokenizer-path | string | | Optional Hugging Face tokenizer identifier or path. |
| metadata-field | field (string) | | Field to capture metadata when emit-metadata is enabled. Examples: data_field |
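
Putting the fields together, a fuller configuration might look like the sketch below. The action syntax follows the minimal example above; the event field names (message, message_tokens, message_token_meta) and the Lua condition are illustrative, not part of the schema.

```yaml
actions:
  - tokenize:
      description: Tokenize the message body for downstream search
      # Illustrative condition: skip events without a message field.
      condition: "message ~= nil"
      input-field: message           # text to tokenize
      output-field: message_tokens   # receives the token array
      tokenizer: whitespace
      lowercase: true                # default: false
      keep-punctuation: false        # default: false
      emit-metadata: true            # default: false
      metadata-field: message_token_meta
```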
## Authentication
| Field | Type | Required | Description |
|---|---|---|---|
| bearer-token | string | | Optional bearer token for Hugging Face private repositories (used with tokenizer-path). Provide a static string or reference a secret; if omitted, unauthenticated access is used. |
| tokenizer-sha256 | string | | Optional expected SHA-256 of the downloaded tokenizer.json. If set, the runtime verifies the integrity of the artifact after download and errors on mismatch. |
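
For private Hugging Face repositories, authentication and integrity checking could be combined as in this sketch; the repository identifier, token, and digest below are placeholders only.

```yaml
actions:
  - tokenize:
      input-field: message
      output-field: message_tokens
      tokenizer: wordpiece
      tokenizer-path: my-org/private-tokenizer   # hypothetical repository id
      bearer-token: hf_XXXXXXXXXXXXXXXXXXXX      # placeholder; prefer a secret reference
      # Placeholder digest: replace with the expected SHA-256 of tokenizer.json.
      tokenizer-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
```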
## Schema

### Tokenizer Options
| Value | Name | Description |
|---|---|---|
| whitespace | whitespace | Split on runs of whitespace. |
| regex | regex | Split using the regular expression given in pattern. |
| byte-pair | byte-pair | Byte-pair encoding (BPE) subword tokenization. |
| wordpiece | wordpiece | WordPiece subword tokenization. |
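
As a sketch of the regex mode, the pattern field supplies the expression; the pattern and field names here are illustrative, and whether the pattern matches tokens or delimiters depends on the runtime.

```yaml
actions:
  - tokenize:
      input-field: message
      output-field: message_tokens
      tokenizer: regex
      pattern: "[A-Za-z0-9_]+"   # illustrative: runs of word characters
```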