ETL Services Overview

In some cases, raw data that is available for Birst to analyze cannot be directly mapped into the desired logical dimensional model. Either the structure of the source data is too different from the ultimate star model, or complex manipulation, filtering, or augmentation of the data is required to ensure that it maps correctly. Birst offers ETL Services for this purpose. ETL Services provide a powerful and flexible mechanism for transforming and manipulating data within Birst so that it can be loaded into the dimensional analysis environment.

Note: Birst does not trim any whitespace when importing data. If you want to remove whitespace, you need to trim it at the source level via an ETL script. If a source contains multiple line breaks, these may also skew data after publishing.
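
For example, a minimal sketch of trimming whitespace at the source level, assuming a Trim string function is available in your environment (see Birst ETL Services Script Functions for the functions your edition supports); the Customers source and its column are illustrative:

SELECT [Customers.CustomerName] FROM [Customers]

[CustomerName]=Trim([Customers.CustomerName]) // remove leading and trailing whitespace
WriteRecord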

Birst employs parallel script execution to improve performance. When an ETL script is executed, it runs multi-threaded as long as it does not have a WRITERECORD command in the COMPLETE block and does not use an ORDER BY statement in its SELECT statement. Scripts with an ORDER BY in the SELECT statement can still use multi-threaded processing by adding a PARTITION BY statement after the ORDER BY, as in the following example:

SELECT [Categories.CategoryID], [Categories.CategoryName], [Categories.Description]
FROM [Categories]
ORDER BY [Categories.CategoryID] ASCENDING
PARTITION BY [Categories.CategoryID]

For information on writing the Select Statement for a script, see Birst ETL Services Input Query Reference. For information on writing the script, see Birst ETL Services Script Functions. For step-by-step instructions on creating a script, see Creating a Scripted (ETL) Data Source. For script examples, see Example Script: Rolling 200 and Example Script: Using an Employee Table to Find Average Head Count Over Any Time Span.

Data Flow and Procedural Scripting for Transformation

ETL Services combines two powerful models of data manipulation to provide significant flexibility in how data can be transformed. The data flow model runs data progressively through a series of manipulation steps until an output is produced, letting you sequence individual transformations into powerful chains. The procedural model provides a programming-style mechanism for creating the individual transformations: procedural scripts can use global variables, loops, and conditional logic to perform complex manipulations.

A data flow model allows you to create a transformation process composed of many steps. In each step, data is sourced either from the original data or from the results of another step. By stringing many steps together, manipulations can be layered to create powerful transformations.
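
For example, a minimal sketch of chaining: if one scripted source (call it Step1, a hypothetical name) produces a Revenue column, a second script can use the output of Step1 as its input query and build on it:

SELECT [Step1.OrderID], [Step1.Revenue] FROM [Step1]

Because each step can read the output of another, a series of small, testable transformations can be layered into a single data flow.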

Procedural Scripting by Record

Procedural scripts allow you to write complex and layered transformations. Using scripts with variables, data can be manipulated across records as well as on a record-by-record basis. ETL Services steps through the data supplied by an input query (SELECT statement) and executes a script for each record.

For example, suppose you want to create a new calculation of Revenue based on order line item data in a given source. Further, suppose that the data has only two existing data elements: unit price and quantity sold (in addition to keys). You could simply take each record, calculate the revenue by multiplying those two values, and then save the output record to a new data source. With ETL Services, this is accomplished by creating a new data source that is based on a script instead of uploaded data. That script consists of a SELECT statement (the input query) and a script block.

Before creating the script, you first need to add to the data source the columns that the script will populate. For this example, you would add a column for Revenue.

For this example, the SELECT statement is as follows:

SELECT [Order Details.OrderID], [Order Details.ProductID], [Order Details.UnitPrice], [Order Details.Quantity] FROM [Order Details]

This query selects four fields from the data source Order Details. ETL Services executes this query and then, for each record returned, runs the associated script:

[OrderID]=[Order Details.OrderID] // pass the keys through to the output
[ProductID]=[Order Details.ProductID]
[Revenue]=[Order Details.UnitPrice]*[Order Details.Quantity] // calculate the new Revenue column

WriteRecord // save the output record to the new data source

Debugging Scripts

Often during script execution, internal variables are calculated along the way. These variables are not necessarily part of the output, so it may be difficult to trace what is happening. A convenient mechanism for debugging scripts is to add each of these internal variables to the output so that their intermediate values can be seen. These columns do not need to be targeted (untargeted columns are ignored), but they help you trace how your script is proceeding. In the same way, you can add a tracing column to which assertions or other comments are written during processing in order to verify that your script is functioning correctly.
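
For example, a minimal sketch building on the Revenue script above; Debug_LineValue and Trace are hypothetical debugging columns added to the source but left untargeted:

[Debug_LineValue]=[Order Details.UnitPrice]*[Order Details.Quantity] // expose the intermediate value
[Revenue]=[Debug_LineValue]
[Trace]='Revenue calculated' // hypothetical tracing column
WriteRecord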

Print

In addition to the debugging method above, you can use the Print statement. Add the keyword Print followed by a value to write that value to an output log after the script has executed. This is useful when you need to see the value of a variable at a certain stage, or to find which path the data follows through your logic.

Print [myVariable] //String values can be printed by encapsulating with single quotes
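
For instance, a short sketch combining a string literal with a variable (the message text and variable name are illustrative):

Print 'Entered revenue calculation' // string literal encapsulated with single quotes
Print [Revenue] // prints the current value of the Revenue column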

Comments

You can comment out lines of ETL code using the following two methods:

  • Single Line: //
  • Multi Line:  /*  */
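
For example (the statements shown are illustrative):

// This single-line comment runs to the end of the line
[Revenue]=[Order Details.UnitPrice]*[Order Details.Quantity] // comments can also follow a statement
/* This multi-line comment spans several lines
   and is useful for temporarily disabling a block of script */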

Cloning Existing Sources

When adding a script, ETL Services provides a convenient cloning function. This is very useful when you want to make standard transformations based on a single data source and substitute the transformed values for the originals. If you choose to clone an existing source when adding a script, a new source is created with the same columns as the original, and a script is pre-populated that simply copies the original. You can then un-target the original data source and make any desired modifications to the cloned one in order to use the transformed data source. Common cases include manipulating date or string columns to conform to standard reporting.
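
As a sketch of what the pre-populated copy script might look like for the Categories source used earlier (the exact generated script may differ):

[CategoryID]=[Categories.CategoryID]
[CategoryName]=[Categories.CategoryName]
[Description]=[Categories.Description]
WriteRecord

You could then, for example, replace the [CategoryName] assignment with a transformed value while leaving the rest of the copy intact.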

Order of Execution

Scripts are executed at the time the data source would normally be processed. Dependencies are analyzed so that if a script depends on another data source, that data source is loaded first. ETL Services also identifies loops and will not execute a script that refers to itself, either directly or through a chain of its dependencies.

See Also
Creating a Scripted (ETL) Data Source
Birst ETL Services Input Query Reference
Birst ETL Services Script Functions