Using the data object retrieval APIs

Data Lake is a scalable, elastic object store for capturing raw data in its original and native format. Data Lake provides interfaces for these tasks:

  • Retrieving a list of data objects stored in Data Lake
  • Retrieving a single data object by a data object ID
  • Marking an object as corrupt to prevent its extraction
  • Retrieving statistical information about a stored object

Interface and consumption methods are exposed through the Data Lake API Service registered within the Data Fabric Suite in API Gateway. For more information on how to use API Gateway and how to interact with Swagger documentation for the API methods, see the ION section of this documentation.

The Data Lake APIs can match more than 10,000 data objects; however, the built-in pagination returns at most the first 10,000 results. For incremental extraction beyond this limit, we recommend that you use the high watermark timestamp strategy described below to efficiently navigate the results. Alternatively, you can break large filter queries into smaller ones with /dataobjects/splitquery.

Incremental extraction with /dataobject/byfilter

We recommend that you follow these best practices when you implement the incremental loading logic:

  • Use dl_document_indexed_date as the primary field for incremental loading.
  • To ensure completeness, apply a 5-second lag from the highest indexed timestamp that was retrieved in the previous call.
  • Sort the results by both dl_document_indexed_date and dl_id to avoid missing or duplicating records. This is because multiple data objects can share the same indexed timestamp.
    Note: We recommend that you do not use dl_id alone for incremental loading because the sequence of dl_id is not reliably ordered.

    This approach provides an advantage over the API's built-in pagination. By tracking the last retrieved dl_document_indexed_date and dl_id (high watermark logic), you can continuously retrieve new data without depending on page tokens. With this method, you also avoid the API's 10,000-object limit, which can be a constraint in high-volume environments. As a result, the /splitquery endpoint is redundant because the incremental retrieval strategy handles large datasets more efficiently and reliably. The /splitquery endpoint will be deprecated in a future release. A sketch of this retrieval loop is shown after this list.

    Example:

    First call:

    GET /dataobjects?records=100&sort=dl_document_indexed_date,dl_id&page=<empty>&filter=<some filter>

    Second call:

    GET /dataobjects?records=100&sort=dl_document_indexed_date,dl_id&page=<empty>&filter=<some filter> AND (dl_document_indexed_date > <highest dl_document_indexed_date from previous response> OR (dl_document_indexed_date = <highest dl_document_indexed_date> AND dl_id > <highest dl_id from previous response>))
  • To optimize performance, avoid wildcard searches and include dl_document_name in filters when possible.
  • Data Lake content is compressed by default when it is stored and streamed, which makes large data transfers more efficient. Clients can specify the content encoding with Accept-Encoding: deflate for compressed data or Accept-Encoding: identity for uncompressed data, as shown in the second sketch after this list.
    Note: Not all clients support the identity value.
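The following Python sketch shows one way to implement the high watermark loop described above. It is an illustration only: the base URL, the authorization header, and the response shape (a JSON array in which each object carries dl_document_indexed_date and dl_id) are assumptions for this example, not documented API details.

    import requests

    # Hypothetical connection details; replace with your tenant's values.
    BASE_URL = "https://<tenant>/api/datalake"
    HEADERS = {"Authorization": "Bearer <token>"}
    BASE_FILTER = "<some filter>"
    LAG_SECONDS = 5  # completeness lag recommended above

    def fetch_batch(last_date=None, last_id=None):
        """Fetch the next batch, sorted by dl_document_indexed_date and dl_id."""
        filter_expr = BASE_FILTER
        if last_date is not None:
            # High watermark predicate, as in the second call of the example above.
            filter_expr = (
                f"{BASE_FILTER} AND (dl_document_indexed_date > '{last_date}'"
                f" OR (dl_document_indexed_date = '{last_date}'"
                f" AND dl_id > '{last_id}'))"
            )
        params = {
            "records": 100,
            "sort": "dl_document_indexed_date,dl_id",
            "page": "",  # no page token; the watermark drives the iteration
            "filter": filter_expr,
        }
        response = requests.get(BASE_URL + "/dataobjects",
                                headers=HEADERS, params=params)
        response.raise_for_status()
        return response.json()  # assumed: a list of data object descriptors

    last_date, last_id = None, None
    while True:
        batch = fetch_batch(last_date, last_id)
        if not batch:
            break
        for obj in batch:
            ...  # process or extract the data object here
        # Because results are sorted by dl_document_indexed_date and dl_id,
        # the last element of each batch is the new high watermark.
        last_date = batch[-1]["dl_document_indexed_date"]
        last_id = batch[-1]["dl_id"]
    # When persisting the watermark between runs, subtract LAG_SECONDS from
    # last_date so that late-indexed records are not missed, and deduplicate
    # any re-read records by dl_id.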
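To request compressed or uncompressed content explicitly, set the Accept-Encoding header on the call. A minimal sketch, again with a placeholder URL, token, and filter:

    import requests

    # "deflate" requests compressed data; "identity" requests uncompressed
    # data. Note that not all clients support "identity".
    response = requests.get(
        "https://<tenant>/api/datalake/dataobjects",
        headers={
            "Authorization": "Bearer <token>",
            "Accept-Encoding": "deflate",
        },
        params={"records": 100, "filter": "<some filter>"},
    )
    response.raise_for_status()
    data = response.json()  # requests transparently decodes deflate content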