Apply algorithm activity catalog

After the dataset has been transformed using preprocessing activities, the next phase is the application of a machine learning algorithm. Feeding the transformed dataset and the selected algorithm into the Train Model activity produces the trained model.

Different algorithms accomplish different tasks. An algorithm examines the data and determines the model that most closely fits it.

The Algorithms activity catalog provides supervised and unsupervised algorithms:

  • Supervised algorithms are for regression, classification, or forecasting problems.
    • XGBoost
    • Linear Learner
    • Random Forest
    • Decision Tree
    • Extra Trees
    • Multilayer Perceptron
    • DeepAR: forecasting
    • Anomaly Detection
    • Time-series Forecasting
  • Unsupervised algorithms are for cluster and associations analysis problems and dimensionality reduction.
    • K-Means
    • Auto Clustering
    • PCA: Principal Component Analysis
  • Custom algorithms are for packaging and deploying custom algorithm code to Infor AI. The code is used for model training.

XGBoost

Use this gradient boosted trees algorithm to provide an accurate prediction of a target variable by combining the estimate of a set of simpler, weaker models. Additionally, it uses a gradient descent algorithm to minimize the loss when adding new models.

XGBoost minimizes a regularized objective function that combines a convex loss function, based on the difference between the predicted and target outputs, with a penalty term for the complexity of the model, that is, of the regression tree functions.
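
The residual-fitting idea behind gradient boosted trees can be sketched in a few lines. This is an illustration of the principle only, not XGBoost's actual implementation; the data and the one-split "stump" weak learner are invented for the example:

```python
# Gradient boosting sketch for squared loss: each round fits a weak model
# (a one-split "stump") to the current residuals, then adds a shrunken
# copy of it to the ensemble.

def fit_stump(x, residuals):
    """Find the single split of 1-D inputs that best fits the residuals."""
    best = None
    for threshold in x:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def boost(x, y, rounds=50, lr=0.1):
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(rounds):
        # for squared loss, the negative gradient is simply the residual
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * s(xi) for s in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]   # step-like target
model = boost(x, y)
```

Each round fits a stump to what the current ensemble still gets wrong, and the learning rate shrinks each stump's contribution, mirroring the gradient descent step described above.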

Linear Learner

Use the Linear Learner algorithm to explore a large number of models and choose the best model that optimizes either continuous objectives, such as mean square error, cross entropy loss, absolute error, or discrete objectives suited for classification, such as F1 measure, precision and recall, or accuracy.

Compared with implementations that optimize only continuous objectives, this implementation provides a significant increase in speed over naive hyperparameter optimization techniques.
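
As a minimal sketch of optimizing one continuous objective, mean squared error, by gradient descent (Linear Learner itself explores many such models in parallel and keeps the best; the data and learning rate here are invented):

```python
# Fit y = w*x + b by gradient descent on mean squared error.

def train_linear(xs, ys, lr=0.01, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # gradients of MSE with respect to w and b
        gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]        # generated from y = 2x + 1
w, b = train_linear(xs, ys)
```

A discrete objective such as F1 or accuracy cannot be descended directly in this way, which is one reason the engine evaluates many candidate models rather than one.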

Random Forest

Use the Random Forest algorithm to construct and combine multiple decision trees to provide a more accurate prediction. Unlike the decision tree algorithm, the Random Forest algorithm randomly selects observations and features and builds several decision trees before averaging the results.
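
The bagging idea can be sketched as follows. This is illustrative only: real forests fit full decision trees and also randomly subsample features, whereas here each tree is a one-split stump and the data is invented:

```python
import random

# Random-forest sketch: draw bootstrap samples (random observations,
# with replacement), fit a simple tree on each, and average predictions.

def fit_stump(points):
    """One-split regressor: pick the threshold minimizing squared error."""
    best = None
    for t, _ in points:
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:                     # degenerate sample: constant predictor
        m = sum(y for _, y in points) / len(points)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def random_forest(points, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = [fit_stump([rng.choice(points) for _ in points])
             for _ in range(n_trees)]
    return lambda x: sum(tree(x) for tree in trees) / len(trees)

data = list(zip([1, 2, 3, 4, 5, 6], [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]))
forest = random_forest(data)
```

Averaging over many trees trained on different random samples is what smooths out the variance of any single tree.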

Decision Tree

Use the Decision Tree algorithm to continuously split the dataset according to a certain parameter, forming a decision tree. The tree has two main entities: decision nodes and leaves. The leaves are the outcomes and the decision nodes are the points where the data is split.
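
A minimal recursive-splitting sketch, with decision nodes holding the split threshold and leaves holding the outcome (the splitting criterion, depth limit, and data are invented for illustration):

```python
# Decision-tree sketch: recursively split the data at the threshold that
# minimizes squared error; stop at a depth limit or a pure node.

def build_tree(points, depth=0, max_depth=2):
    ys = [y for _, y in points]
    if depth == max_depth or len(set(ys)) == 1:
        return {"leaf": sum(ys) / len(ys)}        # leaf: the outcome
    best = None
    for t, _ in points:
        left = [p for p in points if p[0] <= t]
        right = [p for p in points if p[0] > t]
        if not left or not right:
            continue
        lm = sum(y for _, y in left) / len(left)
        rm = sum(y for _, y in right) / len(right)
        err = (sum((y - lm) ** 2 for _, y in left)
               + sum((y - rm) ** 2 for _, y in right))
        if best is None or err < best[0]:
            best = (err, t, left, right)
    if best is None:
        return {"leaf": sum(ys) / len(ys)}
    _, t, left, right = best
    return {"split": t,                           # decision node: where data splits
            "left": build_tree(left, depth + 1, max_depth),
            "right": build_tree(right, depth + 1, max_depth)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["split"] else tree["right"]
    return tree["leaf"]

tree = build_tree([(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)])
```

Following the decision nodes from the root down to a leaf is what produces a prediction for a new observation.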

Extra Trees

The Extra Trees algorithm implements an estimator that fits many randomized decision trees, also called extra trees, on various sub-samples of the dataset. This algorithm uses averaging to improve predictive accuracy and control over-fitting.

Multilayer Perceptron

Use the Multilayer Perceptron (MLP) algorithm to learn a model of the correlation between a set of input-output pairs. Training involves adjusting the parameters to minimize error, and finding their correct balance to prevent model overfitting or underfitting.

The MLP algorithm can be thought of as a deep artificial neural network. The perceptron's input layer receives the signal, and the output layer makes the decision or prediction about the input. Between the input and output layers sit one or more hidden layers; these are the true computational engine, combining basic attributes into higher-level concepts.
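
The layer structure can be illustrated with a tiny hand-wired perceptron. The weights below are chosen by hand purely for the example (real MLP training learns them by minimizing error): the hidden layer combines the raw inputs into two higher-level features, OR and AND, which the output layer combines into XOR:

```python
# A 2-2-1 perceptron with hand-picked weights computing XOR, a function
# no single-layer perceptron can represent.

def step(z):
    return 1 if z > 0 else 0

def mlp(x1, x2):
    h_or = step(x1 + x2 - 0.5)       # hidden unit 1: logical OR
    h_and = step(x1 + x2 - 1.5)      # hidden unit 2: logical AND
    return step(h_or - h_and - 0.5)  # output layer: OR and not AND = XOR
```

XOR is the classic case where the hidden layer is essential: the output unit alone cannot separate the classes, but it can once the inputs are re-expressed as OR and AND.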

DeepAR

Use the DeepAR algorithm for training models with time-dependent patterns. The algorithm forecasts scalar (one-dimensional) time series using recurrent neural networks (RNNs).

Use case examples include time series groupings for product demands, server loads, and web page requests.

K-Means

Use the K-Means algorithm to find discrete groupings within data, where members of a group are as similar as possible to one another, and as different as possible from members of other groups.

You define the attributes that you want the algorithm to use to determine similarity. The implementation scales to massive datasets and improves training time by streaming mini-batches: small, random subsets of the training data.
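
The mini-batch idea can be sketched as follows. This is a simplified two-cluster, one-dimensional illustration; real engines use smarter centroid seeding than the data extremes used here:

```python
import random

# Mini-batch K-Means sketch: each step draws a small random subset of the
# data and nudges the nearest centroid toward each point in it.

def minibatch_kmeans(points, batch_size=4, steps=200, seed=0):
    rng = random.Random(seed)
    centroids = [min(points), max(points)]   # simple spread-out seeding
    counts = [1, 1]
    for _ in range(steps):
        for p in rng.sample(points, batch_size):      # stream a mini-batch
            i = 0 if abs(p - centroids[0]) <= abs(p - centroids[1]) else 1
            counts[i] += 1
            eta = 1.0 / counts[i]                     # per-centroid learning rate
            centroids[i] += eta * (p - centroids[i])
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.2, 8.8, 9.1]
centroids = minibatch_kmeans(data)
```

Because each update touches only a small batch, memory and per-step cost stay bounded even when the full dataset is very large.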

Auto Clustering

The Auto Clustering algorithm is a highly configurable and scalable clustering engine. It uses the Gower distance metric to handle numerical, categorical, or mixed inputs, and it automatically determines the optimal number of clusters.

Specify a minimal configuration, designate inputs as optional, and let the engine work out the rest. Or, tell the engine to specifically implement a method for a tailored clustering approach. For large datasets, the engine utilizes an automatic sampling approach to avoid bias.

Developed for Infor's Augmented Intelligence (AI) suite, Auto Clustering is designed to provide a valuable solution through a combination of K-means, K-modes, and K-prototypes automatically based on the features.
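
As a sketch of the Gower distance on mixed inputs (the column names and range below are invented): numeric features contribute a range-normalized absolute difference, categorical features contribute 0 for a match and 1 for a mismatch, and the distance is the average contribution:

```python
# Gower distance sketch for records mixing numeric and categorical fields.

def gower(a, b, numeric_ranges):
    total = 0.0
    for key in a:
        if key in numeric_ranges:                    # numeric feature
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:                                        # categorical feature
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

ranges = {"age": 50}                                 # observed max - min
x = {"age": 30, "segment": "retail"}
y = {"age": 40, "segment": "wholesale"}
d = gower(x, y, ranges)                              # (10/50 + 1) / 2 = 0.6
```

Because every feature's contribution lands in [0, 1], numeric and categorical columns can be compared on an equal footing, which is what makes mixed-input clustering workable.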

Note: A new version of this algorithm is available. (The new version is the default.) You can choose the version from the configuration panel using the Algorithm Version list. We recommend using the latest available version to benefit from the most up-to-date improvements and achieve the best possible performance.

Principal Component Analysis

Use the Principal Component Analysis (PCA) algorithm to reduce the number of features (dimensions) within a dataset while retaining as much information as possible.
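
As an illustration, the first principal component of two-dimensional data can be found by power iteration on the covariance matrix, reducing two features to one projection while keeping most of the variance (the data is invented, and real PCA computes all components at once):

```python
# PCA sketch: center the data, build the 2x2 covariance matrix, find its
# dominant eigenvector by power iteration, and project onto it.

def first_component(points, iters=100):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    cxx = sum(x * x for x, _ in centered) / n        # covariance entries
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    vx, vy = 1.0, 0.0                                # power iteration
    for _ in range(iters):
        nx, ny = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = (nx * nx + ny * ny) ** 0.5
        vx, vy = nx / norm, ny / norm
    projections = [x * vx + y * vy for x, y in centered]
    return (vx, vy), projections

data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9)]
(vx, vy), proj = first_component(data)
```

For data lying close to the line y = x, the component comes out near (0.71, 0.71), so the single projected value preserves nearly all of the original two-dimensional variance.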

Anomaly Detection

Anomaly Detection plays a pivotal role in data processing by identifying and handling anomalies in time series and regression data. It leverages outlier detection techniques, including statistical methods, clustering algorithms, and machine learning approaches, to mitigate the impact of anomalies and improve forecast accuracy.

Anomaly Detection includes grouping features that let you organize data based on specific attributes. This makes it easier to detect unusual patterns within particular groups or categories. It is especially handy when analyzing trends across different segments.

In addition, the explainability features help you understand what is causing these anomalies, making the results clearer and easier to trust.
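
A two-sided moving-median detector with "smooth" handling can be sketched as follows. This is an illustration of the idea only; the deviation threshold rule is an assumption, not the engine's exact logic:

```python
import statistics

# Compare each point with the median of a window centered on it; flag
# points that deviate too far, then "smooth" them back to the median.

def detect_and_smooth(values, window_size=3, threshold=2.0):
    half = window_size // 2
    cleaned = list(values)
    flags = []
    for i in range(len(values)):
        neighborhood = values[max(0, i - half): i + half + 1]
        med = statistics.median(neighborhood)
        # robust spread estimate (median absolute deviation); avoid zero
        spread = statistics.median([abs(v - med) for v in neighborhood]) or 1.0
        is_outlier = abs(values[i] - med) > threshold * spread
        flags.append(is_outlier)
        if is_outlier:
            cleaned[i] = med               # "smooth" handling method
    return flags, cleaned

series = [10, 11, 10, 250, 11, 10, 12]
flags, cleaned = detect_and_smooth(series)
```

A one-sided variant would use only past values in the window, which matters when anomalies must be caught as data arrives.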

You can define these hyperparameters:

  • Input Keys
    • Specifies a comma-separated list of features or columns in the input data that must be included. These features are essential for the model to index and retrieve results and must be present in the input data. Ensure that the specified columns are correctly named and formatted in your dataset.

      Example: Date

  • Target Variable
    • Defines the name of the column in the original input data representing the target variable. This is the value you want to detect outliers for. Make sure the target variable is numeric and properly preprocessed to avoid any inconsistencies.

      Example: Price

  • Detection method
    • Indicates the method used for anomaly detection. It specifies the technique or algorithm employed to identify anomalies within the dataset. Choose the method that best suits the characteristics of your data and the type of anomalies you expect to detect.

      The available options are: two_sided_moving_median, one_sided_moving_median, distribution_based

      Example: two_sided_moving_median

  • User Tagged
    • Defines the name of the column that denotes whether the value is not an outlier. It serves as a flag indicating whether the data points have been tagged by the user as non-outliers. A value of 1 means this value is not an outlier. Ensure that this column is binary (0 or 1) and accurately reflects the tagging.
    • Example: user_tagged
  • Group By
    • Specifies a comma-separated list of fields used for grouping within the dataset. It determines the level at which we want to detect anomalies. Grouping helps in identifying anomalies within specific segments or categories of the data.

      Example: LocationId

  • Date Column
    • Specifies the name of the column in the input data that contains date information. It identifies the column representing dates within the dataset in YYYY-MM-DD format. Ensure that the date column is correctly formatted and free of missing values.

      Example: Date

  • Handling Method
    • Specifies the method used for handling anomalies within the dataset. It determines how anomalies are treated. Choose the method that aligns with your data handling strategy and the impact of anomalies on your analysis.

      The available options are: smooth, median, remove.

      Example: smooth

  • Advanced
    • Provides additional parameters for advanced usage of the engine, such as window size settings. It keeps the interface simple by letting parameters that can be set internally stay hidden. Only two of the three detection methods use these settings, so this parameter is not always required.

      Example: window_size:'14'.

Note: A new version of this algorithm is available. (The new version is the default.) You can choose the version from the configuration panel using the Algorithm Version list. We recommend using the latest available version to benefit from the most up-to-date improvements and achieve the best possible performance.

Time-series Forecasting

Time-series Forecasting incorporates three algorithms: Prophet, Croston, and Croston Teunter-Syntetos-Babai (TSB).

Prophet is a forecasting model based on an additive or multiplicative model in which non-linear trends are fitted with seasonality. It is well suited to time series with strong seasonal patterns and several seasons of historical data, and it is highly adaptable to missing data, trend shifts, and outliers.

Croston is designed for intermittent demand forecasting, ideal for time series with missing values and sporadic demand patterns.

Croston TSB (Teunter-Syntetos-Babai) is an enhanced Croston method incorporating a smoothing parameter for both the demand interval and size, improving accuracy and responsiveness to fluctuations in demand.

Automated Machine Learning (AutoML) techniques are integrated within Time-series Forecasting to further automate and optimize the forecasting process.
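
As an illustration of the intermittent-demand idea behind Croston's method (a textbook sketch, not the engine's implementation; the alpha value and demand series are invented): demand size and the interval between nonzero demands are each smoothed exponentially, and the per-period forecast is their ratio:

```python
# Croston sketch: separate exponential smoothing of demand size and of
# the interval between nonzero demands.

def croston(demand, alpha=0.1):
    size, interval, periods_since = None, None, 0
    for d in demand:
        periods_since += 1
        if d > 0:
            if size is None:                   # initialize on first demand
                size, interval = float(d), float(periods_since)
            else:
                size += alpha * (d - size)
                interval += alpha * (periods_since - interval)
            periods_since = 0
    return size / interval                     # per-period demand forecast

forecast = croston([0, 0, 6, 0, 0, 0, 4, 0, 0, 5])
```

The TSB variant replaces the interval estimate with a smoothed demand probability (using a second smoothing parameter, beta), which responds faster when demand patterns change.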

You can define these hyperparameters:

  • Uncertainty Samples
    • Affects the uncertainty forecast interval and generates the uncertainty intervals (upper and lower bands). Setting this to 0.0 reduces the number of output columns. Ensure the value is a float.
  • Interval Width
    • The width of the uncertainty intervals. This hyperparameter determines the confidence interval for the forecast. For example, 0.8 stands for 80% confidence interval. Ensure the value is a float.
  • Growth
    • Type of growth or trend in the time series. Can be set to linear, flat, or logistic. The auto setting uses Kendall's tau method to detect the trend type from the data. Choose the option that best fits the trend characteristics of your data.
    • The available options are: auto, flat, linear, logistic
  • Exogenous
    • External factors or additional regressors that can be included to improve forecasting accuracy. These factors should be relevant to the target variable and can help enhance the model's predictions.
    • The available options are: Temperature, price, or false (no exogenous factors)
  • Exogenous Selection
    • Select from the list of exogenous features the most relevant regressors that can be included in the forecasting model. This helps in improving the model's accuracy by including significant external factors.
    • The available options are: True or False
  • Frequency
    • Describes the time frequency. This hyperparameter defines the granularity of the time series data. Choose the option that matches the frequency of your data collection.
    • The available options are: Daily, Weekly, Monthly
  • Algorithm
    • Choose the name of the algorithm to apply. The auto setting selects the appropriate algorithm based on the characteristics of the time series. Select the algorithm that best suits your data and forecasting needs.
    • The available options are: auto, prophet, croston, croston_tsb
  • Holidays Prior Scale
    • Controls flexibility to fit holiday effects. This hyperparameter adjusts the model's sensitivity to holiday-related variations in the data. Choose a value that balances flexibility and accuracy.
    • The available range is: [0.01, 10]
  • Changepoint Prior Scale
    • Determines the flexibility of the trend. Bigger values mean more flexibility. Too small values lead to underfitting, too large values lead to overfitting. Select a value that provides an optimal balance for your data.
    • The advised range is: [0.001, 0.05]
  • Changepoint Range
    • By default, changepoints are inferred for the first 80% of the time series to avoid overfitting fluctuations at the end. Adjust this hyperparameter to control the portion of the time series used for changepoint detection.
    • The available range is: [0.8, 0.95]
  • Seasonality Mode
    • Tuned by looking at the time series: use multiplicative if the seasonality is not a constant additive factor. Choose the mode that best represents the seasonal patterns in your data.
    • The available options are: additive, multiplicative
  • Seasonality Prior Scale
    • Controls the flexibility of the seasonality. Large values allow fitting large fluctuations, small values shrink the magnitude of the seasonality. Select a value that appropriately captures the seasonal variations in your data.
    • The available range is: [0.01, 10]
  • Seasonalities
    • The default setting activates auto mode, which automatically generates seasonalities according to the time level of the data. Additionally, you can specify seasonal components to incorporate into the modeling process. Seasonalities consist of period-Nterms pairs, where "period" refers to the periodicity of the Fourier terms and "Nterms" represents the number of Fourier terms (cosine and sine). You can include multiple seasonal components, separated by commas. For example, "12-4,4-2,3-1" translates to [(12, 4), (4, 2), (3, 1)].

      Example: auto or 12-4,4-2,3-1

  • Croston
    • Parameters for the Croston model. The default alpha value is 0.7. This hyperparameter is used for intermittent demand forecasting.
    • The available range is: [0.001, 1]
  • Croston TSB
    • Croston TSB model parameters, alpha and beta values, string of floats comma separated. These hyperparameters are used for enhanced intermittent demand forecasting.
    • The available range is: Values between 0 and 1
  • Benchmark
    • Benchmark model parameters and name. For example, last year provides the last year observation as a forecast. For moving average, you need to set two parameters: window and offset. The hyperparameter values must be comma-separated. For example, moving average,6,365 means moving average(window=6, offset=365).
    • The available options are: moving average, last year, average
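
The period-Nterms format used by the Seasonalities hyperparameter above could be parsed with a small helper like this (the function itself is hypothetical, not part of the product):

```python
# Parse a Seasonalities spec: either "auto", or comma-separated
# period-Nterms pairs such as "12-4,4-2,3-1".

def parse_seasonalities(spec):
    if spec.strip().lower() == "auto":
        return "auto"                 # let the engine infer seasonalities
    pairs = []
    for item in spec.split(","):
        period, nterms = item.strip().split("-")
        pairs.append((int(period), int(nterms)))
    return pairs

parse_seasonalities("12-4,4-2,3-1")   # → [(12, 4), (4, 2), (3, 1)]
```
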
Note: A new version of this algorithm is available. (The new version is the default.) You can choose the version from the configuration panel using the Algorithm Version list. We recommend using the latest available version to benefit from the most up-to-date improvements and achieve the best possible performance.

Custom Algorithm

Custom algorithms that are packaged and deployed in the Custom Algorithms section are available in the Apply Algorithm activity catalog.