
Reading From S3

Concepts

The basic idea shared by all the object store inputs is that buckets contain objects. Although it’s tempting to think in terms of ‘Folders/Files’, this is misleading - an object name might contain slashes, but no directories are implied.

At a minimum, you need the Access Key and the Secret Key. Depending on the account, you might also need a Session Token and a Role ARN if you are assuming a role.
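For illustration only, here is a minimal sketch of wiring these credentials up in Python with boto3 (which also talks to S3-compatible servers); the key values, endpoint and Role ARN are placeholders, and the assume-role step applies only if your account requires it.

import boto3

# Placeholder credentials - substitute your own.
session = boto3.session.Session(
    aws_access_key_id="MY_ACCESS_KEY",
    aws_secret_access_key="MY_SECRET_KEY",
    # aws_session_token="MY_SESSION_TOKEN",  # only if a session token is needed
)

# If a Role ARN has to be assumed, exchange the base credentials for temporary ones.
# creds = session.client("sts").assume_role(
#     RoleArn="arn:aws:iam::123456789012:role/my-role",
#     RoleSessionName="s3-reader",
# )["Credentials"]

# endpoint_url points at the S3-compatible server (for example a local MinIO instance).
s3 = session.client("s3", endpoint_url="http://localhost:9000")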

The example I’m using here is the MinIO S3-compatible server, which is easy to obtain and run.

The objects available in the bucket:

$ mc ls minio/my-first-bucket
[2024-04-29 11:15:46 SAST] 39B STANDARD my-object
[2024-04-29 11:45:03 SAST] 44B STANDARD my-object-name
[2024-11-18 16:50:32 SAST] 173KiB STANDARD out.json
[2024-09-06 14:24:26 SAST] 195MiB STANDARD scratch.parquet

Listing

The first mode is List Objects: the input lists object metadata. Endpoint and Bucket have to be specified (and of course the key and secret).

It is one of those cases where thinking about ‘files’ will be confusing. There is no wildcard ‘glob’ notation. Instead, the Object Names are prefixes that match the start of object names.

Running this and looking at the Run Output shows two events, one for each object whose name matched the prefix.

{ "object_name": "my-object", "object_size": 39, "ts": "2024-12-10T12:40:14.668Z" }
{ "object_name": "my-object-name", "object_size": 44, "ts": "2024-12-10T12:40:14.668Z" }

If Mode is List Objects then there are a number of properties which you can include:

  • Creation Time Field (not always available)
  • Object Name Field (default is “object_name”)
  • Last Modified Field
  • Content Length Field (default is “object_size”)
  • Content Type Field
  • Etag Field
  • Data Field

In List Objects mode, the Data Field is not available. If no fields are specified, the two defaults (Object Name Field and Content Length Field) are assumed.

Just Downloading

Mode is download-objects. Here you know exactly which objects to download, and the Object Names must match exactly!

So Object Names is just ‘my-object’, an object which contains exactly one line of JSON.

Inputs generally behave in this way; they will by default treat each line of input as a separate event.

If you do wish to treat an object as a single event, then set Ignore Linebreaks.
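A rough sketch of the same two behaviours with boto3, continuing from the client created earlier: download a single named object, then either emit one event per line (the default) or keep the whole body as one event (the Ignore Linebreaks case). The bucket and object names mirror the example above.

body = s3.get_object(Bucket="my-first-bucket", Key="my-object")["Body"].read()
text = body.decode("utf-8")

# Default behaviour: each line becomes a separate event.
for line in text.splitlines():
    if line.strip():
        print(line)

# Ignore Linebreaks equivalent: the whole object is one event.
whole_object_event = text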

There are some available preprocessors:

  • gzip - assume the object is in gzip format, and decompress it
  • parquet - the object is in Apache Parquet format
  • base64 - the object may be binary, so it is encoded as Base64 to pass through the system. (The corresponding object store outputs also have this encoding as an option, so binary data can be routed from one store to another.)
  • extension - work out the preprocessors from the object’s extension, e.g. “.parquet.gz”

Generally we can stream data, and decompress on the fly, but Parquet files must be completely downloaded before they can be converted into JSON events.
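As an illustration of streaming decompression, the snippet below wraps the response body in a gzip reader so lines are decompressed as they are read, rather than after a full download; the object name ‘out.json.gz’ is hypothetical.

import gzip

resp = s3.get_object(Bucket="my-first-bucket", Key="out.json.gz")
with gzip.GzipFile(fileobj=resp["Body"]) as stream:
    for raw_line in stream:            # decompressed on the fly, line by line
        print(raw_line.decode("utf-8").rstrip())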

List and Download

This mode works by listing the available objects and then downloading them.

In this mode (and in List Objects), you have several ways to further filter the candidate list:

  • Include Regex - a set of regular expressions that must match the object name
  • Exclude Regex - similar, except they must not match the object name

Maximum Age (in seconds) can be used to exclude objects older than the given age.

Fingerprinting is the way to avoid reprocessing objects: the fingerprint database is persisted, so objects that have already been processed are skipped on later runs.
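Putting the pieces together, here is a conceptual sketch of List and Download with the filters above, using the object’s ETag as a stand-in fingerprint; the input’s real fingerprint scheme and persistence format are not described here, so the regexes, age limit and the in-memory ‘seen’ set are all illustrative assumptions.

import re
from datetime import datetime, timezone

include = [re.compile(r"\.json$")]     # hypothetical Include Regex
exclude = [re.compile(r"^scratch")]    # hypothetical Exclude Regex
max_age = 7 * 24 * 3600                # hypothetical Maximum Age, in seconds

seen = set()                           # stands in for the persisted fingerprint database

for obj in s3.list_objects_v2(Bucket="my-first-bucket").get("Contents", []):
    name, etag = obj["Key"], obj["ETag"]
    age = (datetime.now(timezone.utc) - obj["LastModified"]).total_seconds()
    if not all(r.search(name) for r in include):   # every Include Regex must match
        continue
    if any(r.search(name) for r in exclude):       # no Exclude Regex may match
        continue
    if age > max_age:                               # too old
        continue
    if etag in seen:                                # fingerprint already processed
        continue
    seen.add(etag)
    data = s3.get_object(Bucket="my-first-bucket", Key=name)["Body"].read()
    # ... split into line events as in the download example ...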