Data Lake input functions

There are four functions available in the Data Lake Input step.

This list explains the functions:

  • queryAll - Query Object from Data Lake v1 streambyfilter

    This function is the default for each new Data Lake Input step and is considered the best practice for moving large volumes of data from Data Lake payloads using a filter set against Data Lake properties. It is the only function that adds dlDocumentDate to the output properties, and it uses an in-transformation table input step to get the maximum dlDocumentDate from the on-premises table for incremental processing. Both the filter against the Data Lake properties and the value stored in the dlDocumentDate property use the Indexed Date object property, dl_document_indexed_date.

    Note: The updated behavior that uses the Indexed Date is available only as of the 2022-08 release of the ETL Tool. Earlier versions of the ETL Tool use the Stored Date, dl_document_date.

  • query - Query Object from Data Lake Compass v1 APIs

    This function is intended for Data Lake extractions that require complex joins, filters, and column transformations. It requires a second transformation to obtain the max timestamp path value used for incremental processing.

  • queryAllCsv - Query Object from Data Lake v1 payloads

    This function is specific to extracting CSV payloads from Data Lake. For incremental processing, it requires the original QueryString, which uses the DATALAKE_HOURS environment variable set in kettle.properties.

    Query string example:

    dl_document_date ge '$time.addHours(${DATALAKE_HOURS})'

  • queryAllOld - Query Object from Data Lake v1 payloads

    This function uses the streambyid APIs to extract payloads from Data Lake. For incremental processing, it requires the original QueryString, which uses the DATALAKE_HOURS environment variable set in kettle.properties.

    Query string example:

    dl_document_date ge '$time.addHours(${DATALAKE_HOURS})'

    Note: As of the 2022.08 release, queryAllOld is deprecated. Use the queryAll function instead.
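
The two incremental-processing approaches described in this list can be sketched in Python. This is an illustration only, not the ETL Tool's implementation. The helper names (parse_properties, substitute_variables, max_document_date), the DATALAKE_HOURS value of -24, and the orders_stage table are hypothetical. The sketch covers only the ${...} variable replacement and the lookup of the maximum dlDocumentDate. The $time.addHours(...) expression is left unchanged because it is evaluated elsewhere.

```python
import re
import sqlite3


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def substitute_variables(text, props):
    """Replace each ${NAME} with its value; unknown names are left as-is."""
    def replace(match):
        return props.get(match.group(1), match.group(0))
    return re.sub(r"\$\{([A-Za-z0-9_.]+)\}", replace, text)


def max_document_date(conn, table):
    """Return the largest dlDocumentDate in the table, or None if it is empty.

    The table name is interpolated directly, so pass trusted names only.
    """
    row = conn.execute(f'SELECT MAX(dlDocumentDate) FROM "{table}"').fetchone()
    return row[0]


if __name__ == "__main__":
    # Hypothetical kettle.properties content; -24 is a placeholder value.
    props = parse_properties("# incremental window\nDATALAKE_HOURS=-24\n")
    query = "dl_document_date ge '$time.addHours(${DATALAKE_HOURS})'"
    print(substitute_variables(query, props))
    # prints: dl_document_date ge '$time.addHours(-24)'

    # Hypothetical on-premises staging table, simulated in memory.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders_stage (id INTEGER, dlDocumentDate TEXT)")
    conn.executemany(
        "INSERT INTO orders_stage VALUES (?, ?)",
        [(1, "2022-08-01T09:00:00Z"), (2, "2022-08-03T10:00:00Z")],
    )
    print(max_document_date(conn, "orders_stage"))
```

The first part corresponds to the QueryString approach used by queryAllCsv and queryAllOld, where the extraction window depends on a configured variable. The second part corresponds to the queryAll approach, where the starting point comes from the data already loaded. Comparing the dates as text gives the correct maximum only when every value uses the same sortable format, as in this sample.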