Using the data object retrieval APIs
Data Lake is a scalable, elastic object store for capturing raw data in its original and native format. Data Lake provides interfaces for these tasks:
- Retrieving a list of data objects stored in Data Lake
- Retrieving a single data object by a data object ID
- Marking an object as corrupt to prevent its extraction
- Retrieving statistical information about a stored object
Interface and consumption methods are exposed through the Data Lake API Service registered within the Data Fabric Suite in API Gateway. For more information on how to use API Gateway and how to interact with Swagger documentation for the API methods, see the ION section of this documentation.
The Data Lake APIs can return more than 10,000 data objects in total; however, pagination is limited to 10,000 results. For incremental extraction, we recommend the high watermark timestamp strategy to navigate the results efficiently. Alternatively, you can break large filter queries into smaller ones with `/dataobjects/splitquery`.
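As an illustration of working within these limits, the sketch below assembles a `/dataobjects` request URL using the parameter names (`records`, `sort`, `page`, `filter`) shown in the examples later in this section. The base URL and the filter expression are placeholders for illustration, not actual Data Lake filter syntax.

```python
from urllib.parse import urlencode

# Pagination is capped at 10,000 results, as noted above.
MAX_PAGINATED_RESULTS = 10_000

def build_dataobjects_url(base_url, filter_expr, records=100, page=""):
    """Build a /dataobjects request URL with the documented query parameters."""
    params = {
        "records": records,
        "sort": "dl_document_indexed_date,dl_id",
        "page": page,
        "filter": filter_expr,
    }
    return f"{base_url}/dataobjects?{urlencode(params)}"

# Placeholder base URL and filter; substitute your gateway URL and a real filter.
url = build_dataobjects_url("https://example.gateway/datalake", "<some filter>")
```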
Incremental extraction with /dataobject/byfilter
We recommend that you follow these best practices when you implement the incremental loading logic:
- Use `dl_document_indexed_date` as the primary field for incremental loading.
- To ensure completeness, apply a 5-second lag from the highest indexed timestamp that was retrieved in the previous call.
- Sort the results by both `dl_document_indexed_date` and `dl_id` to avoid missing or duplicating records. This is because multiple data objects can share the same indexed timestamp.

Note: We recommend that you do not use `dl_id` alone for incremental loading, because the sequence of `dl_id` values is not reliably ordered.

This approach provides an advantage over the API's built-in pagination. By tracking the last retrieved `dl_document_indexed_date` and `dl_id` (high watermark logic), you can continuously retrieve new data without depending on page tokens. With this method, you avoid the API's 10,000-object limit, which can be a constraint in high-volume environments. As a result, the `/splitquery` endpoint is redundant, because the incremental retrieval strategy handles large datasets more efficiently and reliably. The `/splitquery` endpoint will be deprecated in a future release.

Example:
First call:

```
GET /dataobjects?records=100&sort=dl_document_indexed_date,dl_id&page=<empty>&filter=<some filter>
```

Second call:

```
GET /dataobjects?records=100&sort=dl_document_indexed_date,dl_id&page=<empty>&filter=<some filter> AND (dl_document_indexed_date > <highest dl_document_indexed_date from previous response> OR (dl_document_indexed_date = <highest dl_document_indexed_date> AND dl_id > <highest dl_id from previous response>))
```
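The watermark bookkeeping behind these two calls could be sketched as follows. The helper names and the ISO-8601 timestamp format are assumptions for illustration; only the field names and the filter shape come from the example above.

```python
from datetime import datetime, timedelta

def lagged_watermark(highest_indexed_iso, lag_seconds=5):
    """Apply the recommended 5-second lag to the highest indexed timestamp
    retrieved in the previous call (assumes ISO-8601 timestamps)."""
    ts = datetime.fromisoformat(highest_indexed_iso)
    return (ts - timedelta(seconds=lag_seconds)).isoformat()

def next_watermark_filter(base_filter, last_indexed_date, last_id):
    """Extend the base filter with the high-watermark predicates, mirroring
    the second call shown above. On the first call there is no watermark
    yet, so the base filter is used as-is."""
    if last_indexed_date is None:
        return base_filter
    return (
        f"{base_filter} AND (dl_document_indexed_date > {last_indexed_date} "
        f"OR (dl_document_indexed_date = {last_indexed_date} "
        f"AND dl_id > {last_id}))"
    )

# First call: no watermark yet.
first = next_watermark_filter("<some filter>", None, None)
# Second call: watermark taken from the previous response, with the 5-second lag.
wm = lagged_watermark("2024-01-01T00:00:10")
second = next_watermark_filter("<some filter>", wm, "<highest dl_id>")
```

Sorting by both fields is what makes the tie-breaking `dl_id >` clause safe when several objects share one indexed timestamp.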
- To optimize performance, avoid wildcard searches and include `dl_document_name` in filters when possible.

Data Lake content is compressed by default when stored and streamed, which improves the efficiency of large data transfers. Clients can specify content encoding with `Accept-Encoding: deflate` or `Accept-Encoding: identity` for compressed or uncompressed data, respectively.

Note: Not all clients support the `identity` value.
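As a sketch of how a client might handle these encodings, assuming `deflate` here means a zlib-wrapped stream (with a raw-deflate fallback, since server implementations vary on this point):

```python
import zlib

def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Decode a response body according to its Content-Encoding value."""
    if content_encoding == "deflate":
        try:
            # zlib-wrapped deflate (RFC 1950), the common interpretation
            return zlib.decompress(body)
        except zlib.error:
            # fallback: raw deflate stream with no zlib header
            return zlib.decompress(body, -15)
    # "identity": body is passed through untransformed
    return body

# Roundtrip check with a sample payload (the dl_id value is a placeholder).
payload = b'{"dl_id": "<some id>"}'
assert decode_body(zlib.compress(payload), "deflate") == payload
assert decode_body(payload, "identity") == payload
```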