# Tokenize (`tokenize`)
Convert free-form text into token sequences for downstream analytics.
## Minimal example

```yaml
actions:
  - tokenize: {}
```

```json
{ "actions": [ { "tokenize": {} } ] }
```
## Fields
| Field | Type | Required | Description |
|---|---|---|---|
| description | string | | Describe this step. |
| condition | lua-expression (string) | | Only run this action if the condition is met. Examples: 2 * count() |
| input-field | field (string) | | Field containing the text to tokenize. Examples: data_field |
| output-field | field (string) | | Field to write tokens to (array output). Examples: data_field |
| tokenizer | Tokenizer | | Tokenizer implementation to use. Allowed values: whitespace, regex, byte-pair, wordpiece |
| pattern | string | | Optional regex or pattern used by certain tokenizer modes. |
| lowercase | boolean (bool) | ✓ | Convert text to lowercase prior to tokenization. Default: false |
| keep-punctuation | boolean (bool) | ✓ | Retain punctuation tokens. Default: false |
| emit-metadata | boolean (bool) | ✓ | Emit token metadata (offsets/ids) alongside raw values. Default: false |
| tokenizer-path | string | | Optional Hugging Face tokenizer identifier or path. |
| metadata-field | field (string) | | Field to capture metadata when emit-metadata is enabled. Examples: data_field |
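
Putting the fields together, a fuller configuration might look like the sketch below. The action syntax follows the minimal example above; the event field names (message, message_tokens, message_token_meta) and the Lua condition are illustrative, not part of the schema.

```yaml
actions:
  - tokenize:
      description: Tokenize the message body for downstream search
      # Illustrative condition: skip events without a message field.
      condition: "message ~= nil"
      input-field: message           # text to tokenize
      output-field: message_tokens   # receives the token array
      tokenizer: whitespace
      lowercase: true                # default: false
      keep-punctuation: false        # default: false
      emit-metadata: true            # default: false
      metadata-field: message_token_meta
```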
## Authentication
| Field | Type | Required | Description |
|---|---|---|---|
| bearer-token | string | | Optional bearer token for Hugging Face private repositories (used with tokenizer-path). Provide a static string or reference a secret; if omitted, unauthenticated access is used. |
| tokenizer-sha256 | string | | Optional expected SHA-256 of the downloaded tokenizer.json. If set, the runtime verifies the integrity of the artifact after download and errors on mismatch. |
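
For private Hugging Face repositories, authentication and integrity checking could be combined as in this sketch; the repository identifier, token, and digest below are placeholders only.

```yaml
actions:
  - tokenize:
      input-field: message
      output-field: message_tokens
      tokenizer: wordpiece
      tokenizer-path: my-org/private-tokenizer   # hypothetical repository id
      bearer-token: hf_XXXXXXXXXXXXXXXXXXXX      # placeholder; prefer a secret reference
      # Placeholder digest: replace with the expected SHA-256 of tokenizer.json.
      tokenizer-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
```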
## Schema

### Tokenizer Options
| Value | Name | Description |
|---|---|---|
| whitespace | whitespace | Split on runs of whitespace. |
| regex | regex | Split using the regular expression given in pattern. |
| byte-pair | byte-pair | Byte-pair encoding (BPE) subword tokenization. |
| wordpiece | wordpiece | WordPiece subword tokenization. |
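
As a sketch of the regex mode, the pattern field supplies the expression; the pattern and field names here are illustrative, and whether the pattern matches tokens or delimiters depends on the runtime.

```yaml
actions:
  - tokenize:
      input-field: message
      output-field: message_tokens
      tokenizer: regex
      pattern: "[A-Za-z0-9_]+"   # illustrative: runs of word characters
```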