S3 sink connector by Confluent naming and data formats

The Apache Kafka Connect® S3 sink connector enables you to move data from an Aiven for Apache Kafka® cluster to Amazon S3 for long term storage. The following document describes advanced parameters defining the naming and data formats.

Warning

Aiven provides two version of S3 sink connector: one developed by Aiven, another developed by Confluent.

This article is about the Confluent version. Documentation for the Aiven version is available in the dedicated page.

S3 naming format

The Apache Kafka Connect® S3 sink connector by Confluent stores a series of files as objects in the specified S3 bucket. By default, each object is named using the pattern:

topics/<TOPIC_NAME>/partition=<PARTITION_NUMBER>/<TOPIC_NAME>+<PARTITIOIN_NUMBER>+<START_OFFSET>.<FILE_EXTENSION>

The placeholders are the following:

  • TOPIC_NAME: Name of the topic to be pushed to S3

  • PARTITION_NUMBER: Topic partitions number

  • START_OFFSET: File starting offset

  • FILE_EXTENSION: The file extension depends on serialization format defined. The bin extension is generated when serializing messages in binary format.

For example, a topic with 3 partitions generates initially the following files in the destination S3 bucket:

topics/<TOPIC_NAME>/partition=0/<TOPIC_NAME>+0+0000000000.bin
topics/<TOPIC_NAME>/partition=1/<TOPIC_NAME>+1+0000000000.bin
topics/<TOPIC_NAME>/partition=2/<TOPIC_NAME>+2+0000000000.bin

S3 data format

By default, data is stored in binary format, one line per message. The connector creates a file every X messages, where X is defined by the flush.size parameter. Setting the flush.size parameter to 1 generates a file for each message in a topic.

In the above example, having a topic with 3 partitions and 10 messages, setting the flush.size parameter to 1 generates the following files (one per message) in the destination S3 bucket:

topics/<TOPIC_NAME>/partition=0/<TOPIC_NAME>+0+0000000000.bin
topics/<TOPIC_NAME>/partition=0/<TOPIC_NAME>+0+0000000001.bin
topics/<TOPIC_NAME>/partition=0/<TOPIC_NAME>+0+0000000002.bin
topics/<TOPIC_NAME>/partition=0/<TOPIC_NAME>+0+0000000003.bin
topics/<TOPIC_NAME>/partition=1/<TOPIC_NAME>+1+0000000000.bin
topics/<TOPIC_NAME>/partition=1/<TOPIC_NAME>+1+0000000001.bin
topics/<TOPIC_NAME>/partition=1/<TOPIC_NAME>+1+0000000002.bin
topics/<TOPIC_NAME>/partition=2/<TOPIC_NAME>+2+0000000000.bin
topics/<TOPIC_NAME>/partition=2/<TOPIC_NAME>+2+0000000001.bin
topics/<TOPIC_NAME>/partition=2/<TOPIC_NAME>+2+0000000002.bin

You can find additional documentation at the dedicated page.