Using the data object retrieval APIs
Data Lake is a scalable, elastic object store for capturing raw data in its original and native format. Data Lake provides interfaces for these tasks:
- Retrieving a list of data objects stored in Data Lake
- Retrieving a single data object by a data object ID
- Marking an object as corrupt to prevent its extraction
- Retrieving statistical information about a stored object
The interfaces and consumption methods are exposed through the Data Lake API Service, which is registered within the Data Fabric Suite in API Gateway. For more information about using API Gateway and interacting with the Swagger documentation for the API methods, see the ION section of this documentation.
A single filter query can match more than 10,000 data objects; however, pagination is limited to 10,000 results. For incremental extraction, we recommend the high watermark timestamp strategy to navigate the results efficiently. Alternatively, you can break large filter queries into smaller ones with `/dataobjects/splitquery`.
Incremental extraction with /dataobject/byfilter
We recommend that you follow these best practices when you implement the incremental loading logic:
- Use `dl_document_indexed_date` as the primary field for incremental loading.
- To ensure completeness, apply a 5-second lag from the highest indexed timestamp that was retrieved in the previous call.
- Sort the results by both `dl_document_indexed_date` and `dl_id` to avoid missing or duplicating records. This is necessary because multiple data objects can share the same indexed timestamp.

  Note: We recommend that you do not use `dl_id` alone for incremental loading, because the sequence of `dl_id` values is not reliably ordered.

  This approach provides an advantage over the API's built-in pagination. By tracking the last retrieved `dl_document_indexed_date` and `dl_id` (high watermark logic), you can continuously retrieve new data without depending on page tokens. With this method, you avoid the API's 10,000-object limit, which can be a constraint in high-volume environments. As a result, the `/splitquery` endpoint is redundant, because the incremental retrieval strategy handles large datasets more efficiently and reliably. The `/splitquery` endpoint will be deprecated in a future release.

  The first call has this format:
  ```
  GET /dataobjects?records=100&sort=dl_document_indexed_date&sort=dl_id&filter=<some filter>
  ```

  First call example:
  ```
  GET /dataobjects?records=100&sort=dl_document_indexed_date&sort=dl_id&filter=(dl_document_name eq 'MITMAS')
  ```

  The second call has this format:
  ```
  GET /dataobjects?records=100&sort=dl_document_indexed_date&sort=dl_id&filter=<some filter> AND (dl_document_indexed_date gt '<highest_dl_document_indexed_date>' OR (dl_document_indexed_date eq '<highest_dl_document_indexed_date>' AND dl_id gt '<highest_dl_id>'))
  ```

  Second call example:
  ```
  GET /dataobjects?records=100&sort=dl_document_indexed_date&sort=dl_id&filter=dl_document_name eq 'MITMAS' AND (dl_document_indexed_date gt '2025-09-15T14:30:00Z' OR (dl_document_indexed_date eq '2025-09-15T14:30:00Z' AND dl_id gt '123456'))
  ```

  A consolidated sketch of this retrieval loop is shown after this list.

- To optimize performance, avoid wildcard searches and include `dl_document_name` in filters when possible.
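The loop below is a minimal Python sketch of the high watermark strategy, for illustration only. The base URL, the bearer-token authentication, and the assumed response shape (a JSON array of descriptors that carry `dl_document_indexed_date` and `dl_id`) are hypothetical placeholders rather than documented behavior; only the `records`, `sort`, and `filter` parameters and the watermark predicate come from the calls above.

```python
from datetime import datetime, timedelta
from typing import Iterator, Optional

import requests

BASE_URL = "https://example.gateway/datalake"  # hypothetical; use your API Gateway URL
TOKEN = "<access token>"                       # hypothetical; use your gateway credentials
PAGE_SIZE = 100


def watermark_filter(base: str, ts: Optional[str], last_id: Optional[str]) -> str:
    """Extend the base filter with the high watermark predicate from the second call."""
    if ts is None:
        return base  # first call: no watermark yet
    return (
        f"{base} AND (dl_document_indexed_date gt '{ts}' "
        f"OR (dl_document_indexed_date eq '{ts}' AND dl_id gt '{last_id}'))"
    )


def extract_incrementally(base_filter: str) -> Iterator[dict]:
    """Yield data object descriptors until no new records are returned."""
    ts: Optional[str] = None       # highest dl_document_indexed_date retrieved so far
    last_id: Optional[str] = None  # highest dl_id at that timestamp
    while True:
        resp = requests.get(
            f"{BASE_URL}/dataobjects",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={
                "records": PAGE_SIZE,
                # Two sort keys: ties on the indexed timestamp are broken by dl_id.
                "sort": ["dl_document_indexed_date", "dl_id"],
                "filter": watermark_filter(base_filter, ts, last_id),
            },
            timeout=60,
        )
        resp.raise_for_status()
        records = resp.json()  # assumed shape: a JSON array of data object descriptors
        if not records:
            break
        yield from records
        # Records are sorted, so the last one carries the new high watermark.
        ts = records[-1]["dl_document_indexed_date"]
        last_id = records[-1]["dl_id"]


def lagged(ts: str, seconds: int = 5) -> str:
    """Apply the recommended 5-second lag before persisting the watermark for the
    next scheduled run; the run then re-reads a small window, so downstream
    consumers should deduplicate by dl_id."""
    parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return (parsed - timedelta(seconds=seconds)).strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    for obj in extract_incrementally("dl_document_name eq 'MITMAS'"):
        print(obj["dl_id"])
```

Because each request carries the watermark in its filter, the loop never depends on page tokens and is unaffected by the 10,000-result pagination limit.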
Data Lake content is compressed by default when stored and streamed, which makes large data transfers more efficient. Clients can request compressed or uncompressed data by specifying `Accept-Encoding: deflate` or `Accept-Encoding: identity`, respectively.

Note: Not all clients support the `identity` value.
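As a short sketch of the two encoding options, assuming a hypothetical payload URL (only the `Accept-Encoding` values come from the documentation above; everything else is a placeholder):

```python
import requests

# Hypothetical payload URL; substitute the documented retrieval endpoint and a real dl_id.
url = "https://example.gateway/datalake/dataobjects/<dl_id>/payload"

# Request deflate-compressed content (the default storage and streaming format).
# The requests library transparently decodes a deflate-encoded response body,
# so resp.content already holds the decompressed payload.
resp = requests.get(url, headers={"Accept-Encoding": "deflate"}, timeout=60)

# Request uncompressed content; as noted above, not all clients support "identity".
resp = requests.get(url, headers={"Accept-Encoding": "identity"}, timeout=60)
```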