Data Lake Input

To create a DataLakeInput stage:

  1. Click the Design tab.
  2. Select the Input tab.
  3. Drag the input type, DataLakeInput , into your transformation or double-click the transformation to bring it into the transformation flow.
  4. Double-click the box, the Deem DataLake Input dialog box is displayed.
  5. To change the IONAPI_FILE parameter in the kettle.properties file, go to this folder:
    DEEM_HOME/.kettle
    When the parameter is changed, further modifications are not required when creating a new DataLakeInput stage.
    Alternatively, you can browse the Host and select the ION API file, that can be created in this way:
    1. Within ION, create a new Authorized Application named ETL Client for Data Lake .
    2. When downloading the credentials, select the Create Service Account option.
    3. To avoid issues when configuring the DataLakeInput stage specify the Full Name of your user within ION.
  6. Click Test connection and select the preferred API Object.
  7. Click Get cloud-api fields.
  8. Select a Function name (query – Compass or queryAll – Payloads).

    These queries exist:

    • query - Query Object from Data Lake Compass v1 APIs used for AD-HOC queries - for FULL LOADS and suitable for small data sets only. Eventually, performance issues.
    • queryAll - Query Object from Data Lake v1 streambyfilter (all fields, no sql support).

      This function optimizes the API calls with streamByFilter instead of streamByID.

      This is used especially for INCREMENTAL LOADS and huge data sets.

    • queryAllCsv - Query Object from Data Lake v1 payloads (all fields, no sql support). This function creates a CSV file.
    • queryAllOld - Query Object from Data Lake v1 payloads (all fields, no sql support). This function uses streamByID. Upon API Call optimization criteria, the use of this in not recommended.

    The query string depends on:

    • Compass – example select $fields from $object where timestamp>'$maxdate.addDays(${DAYS})’
    • Payload – example

      dl_document_date gt $maxdate.addHours(-2)

      dl_document_date gt $maxdate.addDays(-2)

      dl_document_date range [$maxdate.addDays(-2), $maxdate.addDays(1)]

      dl_document_date range [2019-09-01T10:00:01.410Z, 2019-09-13T11:00:01.700Z]

    The object with the fields and keys for further handling of the data are now selected.