Data Preparation

Applications publishing data to the Data Lake typically micro-batch records to optimize both publishing throughput and downstream query performance.

As a best practice, applications should include condition-based logic that micro-batches data for a predetermined amount of time or until a certain amount of data has been buffered, whichever comes first. The service imposes no file size limit, but as a best practice files should be no smaller than 5 MB and no larger than 50 MB. Keeping files in this range reduces the risk associated with network interruptions and improves the overall processing performance of downstream consumers that interface directly with raw data objects.
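The condition-based logic described above can be sketched as a small buffer that flushes on whichever threshold is crossed first. This is an illustrative sketch, not service-provided code; the `MicroBatcher` class, the `flush_fn` callback, and the specific thresholds are all assumptions you would adapt to your own publishing pipeline.

```python
import time


class MicroBatcher:
    """Buffers records and flushes when either a size or a time threshold is hit.

    Hypothetical helper: the 5 MB floor and 60-second window below are
    illustrative defaults, chosen to match the guidance above.
    """

    def __init__(self, flush_fn, max_bytes=5 * 1024 * 1024, max_seconds=60.0):
        self.flush_fn = flush_fn            # callable that publishes one batch
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.buffer = []
        self.buffered_bytes = 0
        self.window_start = time.monotonic()

    def add(self, record: bytes):
        """Buffer one record; flush if either threshold is now exceeded."""
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        size_hit = self.buffered_bytes >= self.max_bytes
        time_hit = time.monotonic() - self.window_start >= self.max_seconds
        if size_hit or time_hit:
            self.flush()

    def flush(self):
        """Publish the buffered batch (if any) and reset the window."""
        if self.buffer:
            self.flush_fn(b"".join(self.buffer))
        self.buffer = []
        self.buffered_bytes = 0
        self.window_start = time.monotonic()
```

A producer would call `add()` per record and call `flush()` once more on shutdown so a final partial batch is not lost.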

Data objects must be compressed using the ZLIB data format with the DEFLATE compression method. Requests containing uncompressed files, or files compressed with a different method, are rejected.
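In Python, the standard-library `zlib` module produces exactly this format: `zlib.compress` wraps a DEFLATE stream in the ZLIB container. A minimal sketch, assuming the payload is a byte string you are about to publish:

```python
import zlib

# Hypothetical payload; in practice this would be your buffered micro-batch.
payload = b'{"id": 1, "event": "created"}\n{"id": 2, "event": "updated"}\n'

# zlib.compress emits the ZLIB container around a DEFLATE-compressed stream,
# which is the format the service requires.
compressed = zlib.compress(payload)

# Sanity check: round-trip the data before publishing it.
assert zlib.decompress(compressed) == payload
```

Note that raw DEFLATE and gzip wrap the same compression method in different containers; only the ZLIB container satisfies the requirement above.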

You can ingest any supported data format, but we recommend sending data objects as flat newline-delimited JSON (NDJSON).