Prepare Data activity catalog

Data preprocessing plays an important part in the machine learning process. The data that is generated in business applications can be inconsistent and not viable as training data. To increase model accuracy, you should clean and transform the data by correcting errors and anomalies in the raw data and by building new features that are relevant to the problem that you are trying to solve.

The Prepare Data activity catalog offers these preprocessing activities:

General settings

Each preprocessing activity can be given a custom name and description in the settings panel on the right. This name is displayed on the activity box so that you can distinguish between activities in complex flows.

Select Columns

You can select a subset of the columns in the dataset. Include or exclude specific columns by clicking a column name and moving it to the Selected panel.
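
As a point of reference, selecting a subset of columns is equivalent to this pandas operation (a sketch with hypothetical column names, not the platform's internals):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

    # Keep only the columns moved to the Selected panel.
    selected = df[["a", "b"]]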

Remove Duplicates

If the dataset contains duplicate records, you can remove the repeated rows from the dataset. Use the column selector to identify which columns to include or exclude when detecting duplicates.
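
In pandas terms (an illustrative sketch; the column names are hypothetical), duplicate removal over a chosen set of columns looks like this:

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "country": ["US", "US", "DE"],
    })

    # Keep the first occurrence of each (customer_id, country) pair.
    deduplicated = df.drop_duplicates(subset=["customer_id", "country"])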

Handle Missing Data

You can remove, replace, or infer missing values. Select the Handle Missing Data activity and choose an option to replace missing data, remove the entire row that contains missing data, or impute values using interpolation methods. Use the column selector to exclude or include specific columns for which you want missing data handled.

Set the Minimum / Maximum Missing Value Ratio. This is the minimum and maximum ratio of missing values to all values in a column. Select the desired Cleaning Mode for the columns that meet the specified ratio condition. You can select one of these options:

  • Remove entire row
  • Replace with constant and specify a replacement value
  • Replace with mean
  • Replace with mode
  • Interpolation: Specify the interpolation type. You can select one of these options:
    • Linear
    • Cubic
    • Quadratic
    • Piecewise Cubic Hermite Polynomial
    • Nearest-Neighbor
    • Previous-Neighbor
    • Next-Neighbor
    • Akima

Select the Missing Value Indicator Column option to generate an additional column in the output that indicates whether a column met the specified min / max ratio condition.
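
As a rough pandas analogy (not the activity's implementation; the column name is hypothetical), mean replacement, linear interpolation, and a missing value indicator look like this:

    import pandas as pd

    df = pd.DataFrame({"temperature": [21.0, None, 23.5, None, 25.0]})

    # Indicator column: flag the rows where the value was missing.
    df["temperature_missing"] = df["temperature"].isna()

    # Replace with mean.
    filled = df["temperature"].fillna(df["temperature"].mean())

    # Or impute with linear interpolation.
    interpolated = df["temperature"].interpolate(method="linear")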

Edit Metadata

You can change metadata such as the values and data types of the features. For example, you can tag numeric or Boolean type features as categorical, change datetime values to numeric type, or change column names. By default, columns are treated as non-categorical features.

Use the column selector to exclude or include the specific columns that you want to change. You can change these options:

  • Click the Pencil icon in the column selector to rename the column.
  • To change the machine learning type, select one of these from the list:
    • Feature
    • Label: Select Label to specify the label you want to predict.
    • Time Series
    • Score
    • Ignored
  • To change the categorical value, select Categorical or Non-Categorical.
  • To change the data type, use the list.
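
For orientation, renaming a column, changing its data type, and tagging it as categorical map to these pandas operations (a sketch under assumed column names):

    import pandas as pd

    df = pd.DataFrame({"qty": ["1", "2", "3"], "region": ["N", "S", "N"]})

    # Rename the column and change its data type to numeric.
    df = df.rename(columns={"qty": "quantity"})
    df["quantity"] = df["quantity"].astype(int)

    # Tag a column as categorical.
    df["region"] = df["region"].astype("category")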

Feature Scaling

You can scale features with varying magnitudes, units, and range into normalized values.

Numeric columns with large differences in data ranges must be scaled for machine learning algorithms that use Euclidean distance when calculating the model, such as Linear Learner or Principal Component Analysis (PCA). You must bring those features to a common scale while avoiding range distortion or loss of data.

The new values maintain the general distribution and ratios of the source data while keeping values within a common scale across all numeric columns used in the model.

Use the column selector to exclude or include specific columns that you want to scale. Then select the scaling mode:

  • Min-Max Normalization: redistribute all features to values between 0 and 1, or -1 and 1.
  • Mean Normalization: redistribute all features to values between -1 and 1 with μ=0.
  • Standardization: redistribute all features with mean μ = 0 and standard deviation σ = 1.
Note: μ and σ are the estimated model parameters of the distribution, computed from the data as maximum likelihood estimates for each column separately.
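
A minimal numeric sketch of the three modes, written in pandas over a hypothetical column x:

    import pandas as pd

    x = pd.Series([10.0, 20.0, 30.0, 40.0])

    # Min-Max Normalization: values in [0, 1].
    min_max = (x - x.min()) / (x.max() - x.min())

    # Mean Normalization: values in [-1, 1] with mean 0.
    mean_norm = (x - x.mean()) / (x.max() - x.min())

    # Standardization: mean 0, standard deviation 1.
    # ddof=0 gives the maximum likelihood estimate of sigma.
    standardized = (x - x.mean()) / x.std(ddof=0)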

Smooth Data

Replace noisy data and outliers with an approximating function. Eliminating random variation so that important patterns stand out is an essential step before training the model. In time series analysis, smoothing is applied to reveal underlying trends and seasonal and cyclic components, and to support forecasting.

Use the column selector to exclude or include the specific columns that you want to smooth. Select the Moving Average Smoothing Mode and specify the Window Length, Padding Mode, and Padding Value.
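
As an illustration of moving average smoothing (the platform's Padding Mode and Padding Value control edge behavior; min_periods=1 here is only one possible stand-in):

    import pandas as pd

    signal = pd.Series([1.0, 9.0, 2.0, 8.0, 3.0, 7.0])

    # Moving average with a window length of 3, centered on each point.
    smoothed = signal.rolling(window=3, center=True, min_periods=1).mean()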

Construct Feature

You can build new intermediate features out of the existing features using mathematical, logical, or casting operations.

Click Add in the configuration panel to create a new feature. Specify the name and type. Use the Insert Feature and Insert Function selections to reference existing features of the input dataset and apply the desired operations.

As you make the selections, the expression is displayed in the Expression field. When you are done, click Add to return to the configuration panel of the activity. Repeat these steps to construct multiple features.
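
For example, a constructed feature that multiplies two existing features corresponds to this pandas expression (the column names are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"price": [10.0, 20.0], "quantity": [3, 5]})

    # New intermediate feature built from existing features.
    df["total"] = df["price"] * df["quantity"]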

Index Data

Transform categorical values into numeric values for the selected columns. Most machine learning algorithms cannot handle categorical features, so these must be converted to numeric values before model training.

Each category is assigned a number according to how often it occurs in the data. Categories are ranked by count in descending order, and the category with the highest occurrence is assigned the value 0.

Use the column selector to exclude or include specific columns that you want indexed.
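
A small pandas sketch of this frequency-based indexing (illustrative, not the activity's code):

    import pandas as pd

    colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

    # value_counts() sorts categories by count in descending order,
    # so the most frequent category receives index 0.
    order = {cat: i for i, cat in enumerate(colors.value_counts().index)}
    indexed = colors.map(order)  # red -> 0, blue -> 1, green -> 2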

One-Hot Encoder

Transform categorical features into a binary matrix (vectors) to distinguish each categorical label. Most machine learning algorithms cannot handle categorical features. One-Hot Encoder changes categorical data to a numerical format by spreading the values in a column across multiple flag columns and assigning 0 or 1 to them.

Categorical values are vectorized by generating a new column for each unique category. Each row contains 0 in all of these columns except the single column that corresponds to its category label, which contains 1.

Use the column selector to exclude or include specific columns. Select Keep Initial Column to preserve the initial column in the output.
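
In pandas terms, the transformation resembles pd.get_dummies (a sketch; the column name is hypothetical):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "blue", "red"]})

    # One flag column per unique category; 1 marks the matching label.
    encoded = pd.get_dummies(df["color"], prefix="color", dtype=int)

    # Keep Initial Column: keep the original next to the flag columns.
    result = pd.concat([df, encoded], axis=1)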

Note: One-Hot Encoder can compute up to one hundred thousand unique values. When handling features with high cardinality, the One-Hot Encoder activity can run for a long time, up to 5-6 hours. In these cases, consider using Target Encoder instead.

Target Encoder

Transform categorical features into numeric values through the target, also referred to as the label. The categorical variable is replaced with a single new numeric variable: each category is replaced with the corresponding probability of the target if the target is categorical, or with the average of the target if it is numerical.

Use the column selector to exclude or include specific columns. Select Keep Initial Column to preserve the initial column in the output.
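
A minimal pandas sketch of mean target encoding for a binary target (illustrative; the column names are hypothetical, and real implementations often add smoothing to reduce overfitting):

    import pandas as pd

    df = pd.DataFrame({
        "city": ["A", "A", "B", "B", "B"],
        "target": [1, 0, 1, 1, 0],
    })

    # Replace each category with the mean of the target within it;
    # for a binary target this is the probability of the positive class.
    means = df.groupby("city")["target"].mean()
    df["city_encoded"] = df["city"].map(means)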

Execute SQL

You can perform an SQL operation to filter out data, join datasets, or aggregate data.

The activity box of Execute SQL includes four input ports. Input tables can be given custom names, which you then reference when building the query. You can join up to four different datasets into one.

To aggregate data or work with a subset of the dataset, perform SQL aggregation operations across columns.
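
As a loose analogy (assumed table and column names, not the platform's API), the named inputs behave like SQL views, as in this pyspark sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    orders = spark.createDataFrame(
        [(1, 100.0), (1, 50.0), (2, 75.0)], ["customer_id", "amount"])
    customers = spark.createDataFrame(
        [(1, "US"), (2, "DE")], ["customer_id", "country"])

    orders.createOrReplaceTempView("orders")
    customers.createOrReplaceTempView("customers")

    # Join two inputs and aggregate the result.
    result = spark.sql("""
        SELECT c.country, SUM(o.amount) AS total_amount
        FROM orders o
        JOIN customers c ON o.customer_id = c.customer_id
        GROUP BY c.country
    """)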

Scripting

You can execute a customized Python script to perform an activity that is not available in the Coleman AI Platform catalog. Add the Scripting activity in the flow to execute Python scripts for a variety of tasks. Check the currently supported version of Python in the Release Report.

The input and output of the Scripting activity are handled with Python variables. For example, if value1 is specified as the Input Port 1 Variable, then the dataset that is connected to the first, upper port is available as value1 in Python. The second, lower port is specified in Input Port 2 Variable.

Similar logic applies to the output variable, Output Port Variable. The author of the script is required to assign a dataset to the output variable.

Both input ports must be connected to input datasets. If you are working with one input dataset, connect the same dataset to both input ports of the Scripting activity. Scripts run network-isolated: the input can be provided only through the Scripting activity and consists of the two input datasets and the code of the script.

For an input dataset, a Spark DataFrame is normally passed to the script. For additional information about the pyspark.sql.DataFrame data type, copy this URL to a browser: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.

Additionally, you can use a pandas DataFrame in the executed script. For additional information about the pandas DataFrame, copy this URL to a browser: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html.

When a pandas DataFrame is expected as an input dataset, you must include a special Python comment on a separate line in the script. Use this comment: #:request:params-api:pandas

For the output dataset, a Spark DataFrame is expected back and assigned to the output variable. The script can also return a pandas DataFrame in the output variable; both cases are handled transparently by the background code.

Optionally, you can install additional libraries, but everything must be packaged with the script or supplied as input datasets.
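
Assuming Input Port 1 Variable is set to value1, Input Port 2 Variable to value2, and Output Port Variable to output (all three names are whatever you configure), a minimal pandas-mode script could look like this:

    #:request:params-api:pandas
    # Because of the directive above, value1 and value2 arrive as
    # pandas DataFrames. The join key "id" is hypothetical.
    joined = value1.merge(value2, on="id", how="inner")

    # Assign the result to the configured output variable.
    output = joined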

Balance Data

You can balance the dataset so that classes are evenly represented. Use a downsampling method to reduce the majority class to a random sample that matches the number of samples in the minority class, or an oversampling method to do the reverse.

In the settings panel, select Downsampling or Oversampling as the Balance mode and select Random.
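
A pandas sketch of random downsampling (illustrative; the column and label values are hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        "feature": range(6),
        "label": ["a", "a", "a", "a", "b", "b"],
    })

    # Randomly sample every class down to the minority class size.
    n_min = df["label"].value_counts().min()
    balanced = df.groupby("label").sample(n=n_min, random_state=0)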

Split Data

You can split the dataset into training data and test data. Before training the model, the dataset is split into a training part and a test part. The training part is ingested to train the model, and the test part assesses the predictions that the trained model returns on unseen data belonging to the same distribution as the training data.

Specify the First Part Ratio (0-1) in the configuration panel. This is the training portion of the dataset; the remaining portion is the test data.
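
For example, a First Part Ratio of 0.8 corresponds to this pandas split (a sketch, not the platform's mechanism):

    import pandas as pd

    df = pd.DataFrame({"x": range(10), "y": range(10)})

    # 80% of the rows for training, the remaining 20% for testing.
    train = df.sample(frac=0.8, random_state=0)
    test = df.drop(train.index)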

Ingest to Data Lake

You can apply this activity to the output port of another activity to ingest the resulting dataset from a quest into Data Lake. For example, you can make the preprocessed data or the result dataset of the scoring activity available in Data Lake and reuse it in another quest, or perform additional exploratory analysis within other Infor applications.

The configuration of the Ingest to Data Lake activity requires that you select, from a menu, the name of the object that you want updated in Data Lake. If the object is new and does not have a definition in the Data Catalog, then you must first specify the object name and metadata in the Data Catalog.