Reading From S3
Concepts
The basic idea shared by all the object store inputs is that buckets contain objects. Although it is tempting to think in terms of folders and files, that model is misleading: an object name may contain slashes, but no directories are implied.
At a minimum, you need the Access Key and the Secret Key. Depending on the account, you might also need a Session Token, and a Role ARN if you are assuming a role.
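To make that concrete, here is a rough sketch using boto3 (not the input itself) of how those pieces fit together; the endpoint, keys and Role ARN are placeholders:

```python
import boto3

# Direct credentials, pointed at a local MinIO endpoint (all values are placeholders).
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
    # aws_session_token="...",  # only needed when using temporary credentials
)

# Assuming a role: exchange long-lived keys for temporary credentials via STS.
sts = boto3.client("sts", aws_access_key_id="AKIA...", aws_secret_access_key="...")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/reader",  # hypothetical Role ARN
    RoleSessionName="s3-input",
)["Credentials"]
s3_assumed = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```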
The example I’m using here is the MinIO S3-compatible server, which is easy to obtain and run.
The available files:
Listing
The first mode is List Objects: the input lists object metadata. Endpoint and Bucket have to be specified (and of course the key and secret).
This is one of those cases where thinking about ‘files’ will be confusing. There is no wildcard ‘glob’ notation. Instead, the Object Names are prefixes that are matched against the start of each object name.
Running this and looking at the Run Output shows two events, representing the two matching objects found.
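For comparison, here is a minimal boto3 sketch of the same prefix-style listing (bucket name, prefix and credentials are assumptions):

```python
import boto3

# Client pointed at a local MinIO endpoint (endpoint and credentials are placeholders).
s3 = boto3.client("s3", endpoint_url="http://localhost:9000",
                  aws_access_key_id="minioadmin", aws_secret_access_key="minioadmin")

# Object Names act as prefixes: every key starting with "my-" is listed.
resp = s3.list_objects_v2(Bucket="test-bucket", Prefix="my-")
for obj in resp.get("Contents", []):
    # Listing returns metadata only; no object data is downloaded.
    print(obj["Key"], obj["Size"], obj["LastModified"], obj["ETag"])
```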
If Mode is List Objects then there are a number of properties which you can display:
- Creation Time Field (not always available)
- Object Name Field (default is “object_name”)
- Last Modified Field
- Content Length Field (default is “object_size”)
- Content Type Field
- Etag Field
- Data Field
In this case, Data Field is not available. If the mode is list-objects and no fields are specified, then the two defaults (object_name and object_size) are assumed.
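As a rough sketch of what a listing event might carry, using the two documented default field names (the other, commented-out names are just illustrative):

```python
# Build a list-objects style event using the two documented defaults.
def listing_event(obj):
    return {
        "object_name": obj["Key"],   # Object Name Field default
        "object_size": obj["Size"],  # Content Length Field default
        # "last_modified": obj["LastModified"].isoformat(),
        # "etag": obj["ETag"],
    }

print(listing_event({"Key": "my-object", "Size": 42}))
# {'object_name': 'my-object', 'object_size': 42}
```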
Just Downloading
Mode is download-objects. Here you know exactly what objects to download, and Object Names must match exactly!
So Object Names is just ‘my-object’, which contains exactly one line of JSON.
Inputs generally behave in this way: by default, they treat each line of input as a separate event. If you wish to treat the whole object as a single event, then set Ignore Linebreaks.
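Here is a small boto3 sketch of the same download behaviour, showing the line-per-event default versus treating the whole object as one event (bucket and object name are assumptions):

```python
import boto3

s3 = boto3.client("s3", endpoint_url="http://localhost:9000",
                  aws_access_key_id="minioadmin", aws_secret_access_key="minioadmin")

body = s3.get_object(Bucket="test-bucket", Key="my-object")["Body"].read()

ignore_linebreaks = False  # mirrors the Ignore Linebreaks setting
if ignore_linebreaks:
    events = [body.decode()]             # the whole object is a single event
else:
    events = body.decode().splitlines()  # each line becomes its own event

for event in events:
    print(event)
```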
There are some available preprocessors:
- gzip - assume the object is in gzip format, and decompress it
- parquet - the object is in Apache Parquet format
- base64 - the object may be binary, so we must encode it as Base64 to pass through the system (the corresponding object store outputs also have this encoding as an option, so binary data can be routed from one store to another)
- extension - work out the preprocessors from the object extension, e.g. “.parquet.gz”
Generally we can stream data and decompress on the fly, but Parquet files must be completely downloaded before they can be converted into JSON events.
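A rough Python sketch of those three preprocessing ideas, using the standard library and pyarrow (bucket, object names and credentials are assumptions):

```python
import base64
import gzip
import io

import boto3
import pyarrow.parquet as pq

s3 = boto3.client("s3", endpoint_url="http://localhost:9000",
                  aws_access_key_id="minioadmin", aws_secret_access_key="minioadmin")

# gzip: decompress on the fly while streaming the body.
stream = s3.get_object(Bucket="test-bucket", Key="events.json.gz")["Body"]
with gzip.GzipFile(fileobj=stream) as gz:
    for line in gz:
        print(line.decode().rstrip())

# base64: binary objects are encoded so they can travel through the system as text.
blob = s3.get_object(Bucket="test-bucket", Key="image.png")["Body"].read()
encoded = base64.b64encode(blob).decode("ascii")

# parquet: the whole object must be downloaded before it can be turned into rows.
data = s3.get_object(Bucket="test-bucket", Key="table.parquet")["Body"].read()
table = pq.read_table(io.BytesIO(data))
for row in table.to_pylist():
    print(row)  # each row becomes a JSON-like dict
```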
List and Download
This mode works by listing the available objects, and downloading them.
In this mode and in List Objects, you have several ways to further filter the candidate list:
- Include Regex - a set of regular expressions that must match the object name
- Exclude Regex - similar, except they must not match the object name
Maximum Age (in seconds) can exclude old objects.
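A minimal sketch of that filtering logic (the regexes and age threshold are just illustrative):

```python
import re
from datetime import datetime, timezone

include = [re.compile(r"\.json$")]  # must match the object name
exclude = [re.compile(r"^tmp/")]    # must not match the object name
max_age_seconds = 3600

def wanted(name, last_modified):
    if include and not any(rx.search(name) for rx in include):
        return False
    if any(rx.search(name) for rx in exclude):
        return False
    age = (datetime.now(timezone.utc) - last_modified).total_seconds()
    return age <= max_age_seconds

now = datetime.now(timezone.utc)
print(wanted("logs/app.json", now))  # True
print(wanted("tmp/app.json", now))   # False
```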
Fingerprinting is the way to avoid reprocessing objects, since the fingerprint database is persisted.
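As a rough illustration of the idea (the input’s actual fingerprint scheme isn’t described here; object name plus ETag is an assumption), a persisted fingerprint set might look like this:

```python
import json
from pathlib import Path

DB = Path("fingerprints.json")  # stands in for the persisted fingerprint database

def load_seen():
    return set(json.loads(DB.read_text())) if DB.exists() else set()

def mark_seen(seen, name, etag):
    seen.add(f"{name}:{etag}")  # name + ETag as the fingerprint (an assumption)
    DB.write_text(json.dumps(sorted(seen)))

seen = load_seen()
if "my-object:abc123" not in seen:
    # ... download and process the object ...
    mark_seen(seen, "my-object", "abc123")
```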