Prepare the Data activity catalog
The Prepare Data activity catalog offers these preprocessing activities:
Select Columns
You can select a subset of the columns in the dataset. Include or exclude specific columns by clicking the column name and moving them to the Selected panel.
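For orientation, the following is a minimal pandas sketch of the same include/exclude selection; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical dataset with three columns.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Include "a" and "c" (the Selected panel), excluding "b".
selected = df[["a", "c"]]
```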
Construct Feature
You can build new intermediate features out of the existing features using mathematical, logical, or casting operations.
Use the Insert Feature and Insert Function options in the Configuration panel to build an expression from the existing features of the input dataset and the selected operations. Specify the name and type of the new feature. As you make the selections, the expression is displayed in the Expression field. When done, return to the Configuration panel of the activity. You can repeat this process to construct multiple features.
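As a sketch, the following pandas equivalents illustrate the three kinds of operations; the column names are hypothetical, and in the activity itself you build the same expressions with Insert Feature and Insert Function.

```python
import pandas as pd

# Hypothetical input dataset.
df = pd.DataFrame({"temperature_c": [10.0, 21.5, 30.0]})

# Mathematical operation: derive Fahrenheit from Celsius.
df["temperature_f"] = df["temperature_c"] * 9 / 5 + 32

# Logical operation: flag warm readings.
df["is_warm"] = df["temperature_f"] > 70

# Casting operation: store the Fahrenheit value as an integer.
df["temperature_f_int"] = df["temperature_f"].astype(int)
```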
Smooth Data
You can replace noisy data and outliers with an approximating function. Smoothing eliminates random variation in the data so that important patterns stand out, which is an important step before building the model.
Use the column selector to exclude or include the specific columns you want to smooth. Select the Moving Average Smoothing Mode option and specify the Window Length, Padding Mode, and Padding Value.
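For illustration, here is a minimal NumPy sketch of moving-average smoothing under the same parameters; the column name and the mapping of padding modes are assumptions, not the activity's exact implementation.

```python
import numpy as np
import pandas as pd

def moving_average(values, window=3, pad_mode="edge", pad_value=0.0):
    """Smooth a 1-D series with a centered moving average.

    window    -- stands in for the activity's Window Length (assumed odd)
    pad_mode  -- stands in for Padding Mode ("edge", "reflect", "constant")
    pad_value -- stands in for Padding Value (used with "constant")
    """
    half = window // 2
    if pad_mode == "constant":
        padded = np.pad(values, half, mode="constant", constant_values=pad_value)
    else:
        padded = np.pad(values, half, mode=pad_mode)
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

# Hypothetical noisy column: the outlier 9.0 is pulled toward its neighbors.
df = pd.DataFrame({"sensor": [1.0, 1.2, 9.0, 1.1, 1.3, 1.2]})
df["sensor_smoothed"] = moving_average(df["sensor"].to_numpy(), window=3)
```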
Scripting
You can execute a customized Python script to perform an activity. Add the Scripting activity in the flow to execute Python scripts for a variety of tasks. Check the currently supported version of Python in the Release Report.
The input and output of the Scripting activity are handled with Python variables. For the input variable, a list of datasets is displayed from which you can select one or more datasets; these are mapped to a single, named output variable. Write the script against the selected input datasets and store the result in the defined output variable; the author of the script is required to assign the datasets to the output variable. The scripts run in network isolation, so all input must be provided through the Scripting activity itself: the input datasets and the code of the script.
For an input dataset, a Spark DataFrame is normally passed to the script. For more information on the pyspark.sql.DataFrame data type, see https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.
Alternatively, you can use a Pandas DataFrame in the executed script. For more information on the Pandas DataFrame, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html.
When a Pandas DataFrame is expected as an input dataset, the script must include a special Python comment on a separate line: #:request:params-api:pandas
For the output dataset, a Spark DataFrame is expected back and assigned to the output variable. The script can also return a Pandas DataFrame in the output variable; both cases are handled transparently by the background code.
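A minimal sketch of a script body follows. The variable names df_in and df_out are hypothetical stand-ins for the input and output variables you configure on the activity; only the #:request:params-api:pandas comment comes from the documentation above.

```python
#:request:params-api:pandas
# With the comment above, each selected input dataset is passed to
# the script as a pandas DataFrame instead of a Spark DataFrame.

# df_in is injected by the activity at run time (hypothetical name);
# it is not defined in the script itself.
result = df_in.copy()
result["value_squared"] = result["value"] ** 2  # hypothetical column

# The script author must assign the result to the configured output
# variable; here the hypothetical name df_out is assumed.
df_out = result
```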
The following pre-deployed Python packages are available:
- pandas
- numpy
- scipy
- scikit-learn==0.23.2
- python-dateutil
- docutils
- urllib3
- six
- jmespath
- joblib
- category_encoders
- imbalanced-learn
- pyarrow
- fsspec
Optionally, you can install additional libraries, but they must be packaged with the script or supplied as input datasets.
Feature Scaling
You can scale features with varying magnitudes, units, and range into normalized values.
Scaling numeric columns whose data ranges differ widely avoids range distortion and loss of data. The new values maintain the general distribution and ratios of the source data while keeping all numeric columns used in the model within a common scale.
Use the column selector to exclude or include specific columns that you want to scale. Then select the scaling mode:
- Min-Max Normalization redistributes all features to values between 0 and 1, or -1 and 1.
- Mean Normalization redistributes all features to values between -1 and 1 with μ = 0.
- Standardization redistributes all features with their mean μ = 0 and standard deviation σ =1.
μ and σ are the estimated parameters of the distribution, computed from the data as maximum likelihood estimates for each column separately.
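The three modes correspond to the following standard formulas, sketched here in pandas on one hypothetical column (ddof=0 gives the maximum likelihood estimate of σ).

```python
import pandas as pd

# Hypothetical numeric column.
x = pd.Series([10.0, 20.0, 30.0, 40.0])

# Min-Max Normalization: x' = (x - min) / (max - min), values in [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Mean Normalization: x' = (x - μ) / (max - min), mean 0, values within [-1, 1].
mean_norm = (x - x.mean()) / (x.max() - x.min())

# Standardization: x' = (x - μ) / σ, mean 0, standard deviation 1.
standard = (x - x.mean()) / x.std(ddof=0)
```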
Handle Missing Data
You can remove, replace, or infer missing values. Select the Handle Missing Data activity and choose whether to replace missing data, remove the entire row containing missing data, or impute values using interpolation methods. Use the column selector to exclude or include the specific columns for which you want missing data handled.
Set the Minimum / Maximum Missing Value Ratio, the minimum and maximum ratios of missing values to all values in a column. Select the cleaning mode for the columns that meet the specified ratio condition. You can select one of these options:
- Remove entire row
- Replace with constant and specify a replacement value
- Replace with mean
- Replace with mode
- Interpolation: Specify the interpolation type. You can select one of these options:
  - Linear
  - Cubic
  - Quadratic
  - Piecewise Cubic Hermite Polynomial
  - Nearest-Neighbor
  - Previous-Neighbor
  - Next-Neighbor
  - Akima
Select the Missing Value Indicator column to generate an additional column in the output that indicates whether a column met the specified minimum and maximum ratio condition.
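For illustration, the following pandas sketch maps the cleaning modes to library calls; the column name is hypothetical, and the activity's exact behavior may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing values.
df = pd.DataFrame({"temp": [20.0, np.nan, 22.0, np.nan, 25.0]})

removed  = df.dropna()                           # Remove entire row
constant = df.fillna(0.0)                        # Replace with constant (here 0.0)
mean_rep = df.fillna(df["temp"].mean())          # Replace with mean
mode_rep = df.fillna(df["temp"].mode().iloc[0])  # Replace with mode
linear   = df.interpolate(method="linear")       # Interpolation: Linear
pchip    = df.interpolate(method="pchip")        # Piecewise Cubic Hermite (needs SciPy)
akima    = df.interpolate(method="akima")        # Akima (needs SciPy)

# Missing Value Indicator: flag the rows that were originally missing.
df["temp_missing"] = df["temp"].isna()
```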
Edit Metadata
You can change metadata such as the data types of the features. For example, you can change date and time values to a numeric type. Use the column selector to exclude or include the specific columns whose data type you want to change, and then change the data type using the Data Type list.
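As a sketch, the date-and-time-to-numeric example looks like this in pandas; the column name is hypothetical.

```python
import pandas as pd

# Hypothetical date/time column.
df = pd.DataFrame({"timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"])})

# Change the data type from datetime to numeric
# (nanoseconds since the Unix epoch).
df["timestamp"] = df["timestamp"].astype("int64")
```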
Execute SQL
You can perform an SQL operation to filter out data, join datasets, or aggregate data.
The Execute SQL activity box lists all datasets to which this transformation activity can be applied. There is no limit on how many datasets you can select from the list, and you can write a combined query over multiple datasets as the model requires. To aggregate data and work with a subset of a dataset, perform SQL aggregation operations across columns.
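For illustration, a combined query that filters, joins, and aggregates might look like this; the dataset and column names (readings, assets, asset_id, value, site) are all hypothetical.

```sql
-- Filter, join, and aggregate across two hypothetical datasets.
SELECT a.site,
       AVG(r.value) AS avg_value
FROM readings r
JOIN assets a
  ON r.asset_id = a.asset_id
WHERE r.value IS NOT NULL
GROUP BY a.site
```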
Aliasing
You can change column names using this activity. Click the icon in the column list to rename the column.
General settings
Each preprocessing activity can be given a custom description in the settings panel.