Data Lake incremental load

A full data load from the Data Lake is not recommended, primarily because of the large data volumes involved. This section describes the incremental load of a set of spaces from the Data Lake.

Load Frequency

Load frequency is assumed to be once daily. It is also assumed that processing occurs outside of business hours and that business users do not need access to the data during the processing window.

Change Data Capture (CDC)

A definitive method is required to determine which records have been modified or inserted between loads. This set of records is sometimes referred to as the delta. The M3 Analytics solution uses the date and time that the object arrived in the Data Lake, identified by the lastModified field. The field is in UTC and uses the ISO 8601 format YYYY-MM-DDTHH:mm:SS.mmmZ. Queries in Modeler Enterprise contain a WHERE clause that filters the lastModified field to a value between a start date and an end date, which are stored in the variables M3A_ExtractStartDate and M3A_ExtractEndDate, respectively. These are single-valued query variables that are retrieved from the IncrementalLoad hierarchy.
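
The lastModified filter described above can be illustrated with a minimal Python sketch. This is not the Modeler Enterprise query itself: the function names are hypothetical, and treating the window as start-inclusive and end-exclusive is an assumption made here so that adjacent windows would not load the same object twice. The source only states that lastModified falls between the two dates.

```python
from datetime import datetime, timezone

# Format used by lastModified: YYYY-MM-DDTHH:mm:SS.mmmZ (ISO 8601, UTC).
_PARSE_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def to_last_modified(ts: datetime) -> str:
    """Format a timezone-aware datetime as a UTC string with milliseconds."""
    ts = ts.astimezone(timezone.utc)
    millis = ts.microsecond // 1000
    return ts.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


def parse_last_modified(value: str) -> datetime:
    """Parse a lastModified string into a timezone-aware UTC datetime."""
    return datetime.strptime(value, _PARSE_FORMAT).replace(tzinfo=timezone.utc)


def in_extract_window(last_modified: str, start: str, end: str) -> bool:
    """Return True if last_modified lies in the extract window.

    Assumption: the window is [start, end), that is, start-inclusive and
    end-exclusive. The actual boundary handling of the product queries
    is not stated in this section.
    """
    value = parse_last_modified(last_modified)
    return parse_last_modified(start) <= value < parse_last_modified(end)
```

Here the start and end arguments stand in for the values of M3A_ExtractStartDate and M3A_ExtractEndDate. Parsing the strings before comparing them avoids relying on string ordering.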

The length of the extraction windows is managed in these scenarios:

  • Initial load

    When preparing for the initial load of data, set the M3A_ExtractInitialLoadDate variable to the date when an initial load is run in M3. On initial load, the date specified in this variable serves as the start date. Use the M3A_ExtractWindowMin variable to control the end date. When you set M3A_ExtractWindowMin to zero (0), the end date is always the current date and time. In this case, the Birst Connect queries retrieve all files with a lastModified value between M3A_ExtractInitialLoadDate and the current date and time.

    To load data in manageable chunks, set M3A_ExtractWindowMin to a nonzero number of minutes. The number of minutes specified in this variable is added to the start date to determine the end date. This method is similar to running several incremental loads until all the data is loaded.

  • Incremental load

    The incremental load is the primary use case. This method uses change data capture to load only the records that were inserted or modified since the previous load. The start date and end date are calculated by the incremental load start and end scripts, which are discussed in Incremental load start and end scripts.
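
The way the initial-load window is derived from M3A_ExtractInitialLoadDate and M3A_ExtractWindowMin can be sketched in Python as follows. This is an illustration only and not the product's load scripts: the function names are hypothetical, and clamping the last chunk to the current time and rejecting negative values are assumptions that this section does not state.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def extract_end_date(start: datetime, window_min: int, now: datetime) -> datetime:
    """Return the end date for an extract window that begins at start.

    A window_min of 0 means the end date is the current date and time;
    otherwise window_min minutes are added to the start date.
    """
    if window_min < 0:
        # Assumption: negative windows are treated as invalid.
        raise ValueError("window_min must be zero or positive")
    if window_min == 0:
        return now
    return start + timedelta(minutes=window_min)


def initial_load_windows(
    initial_load_date: datetime, window_min: int, now: datetime
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (start, end) windows until now is reached."""
    start = initial_load_date
    while start < now:
        end = extract_end_date(start, window_min, now)
        # Assumption: the final chunk is clamped so it does not pass now.
        end = min(end, now)
        yield start, end
        start = end


if __name__ == "__main__":
    begin = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
    current = datetime(2024, 1, 1, 2, 30, tzinfo=timezone.utc)
    for window in initial_load_windows(begin, 60, current):
        print(window[0].isoformat(), "to", window[1].isoformat())
```

With a window of 0 the generator yields a single window from the initial load date to the current time. With a nonzero window it yields a sequence of chunks, which mirrors running several incremental loads in a row.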