Encoding string and categorical features

Most machine learning algorithms cannot work with string features directly, so these features must be converted to numeric values.

These encoders are pre-built:

  • Index Data: Each category is assigned a number according to its representation value in the data.
  • One Hot Encoder: Each unique category becomes a new column. In every row, the column that corresponds to the row's category label contains 1, and all other category columns contain 0. This encoder can take a long time to process.
  • Target Encoder: Replaces the categorical variable with a single new numerical variable. Each category is replaced by the probability of the target for that category if the target is categorical, or by the average of the target for that category if the target is numerical.
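
To make the three encodings concrete, here is a minimal sketch in Python with pandas. It is not the product's implementation: the column names (color, bought) and the toy values are hypothetical, and the index encoding assumes that codes are assigned by descending frequency, which may differ from how the pre-built Index Data encoder orders categories.

```python
import pandas as pd

# Toy data: one string feature and a binary target (illustrative values only).
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "bought": [1, 0, 1, 0, 1, 0],
})

# Index encoding.
# Assumption: the most frequent category gets 0, the next gets 1, and so on.
freq_order = df["color"].value_counts().index
index_map = {category: code for code, category in enumerate(freq_order)}
df["color_index"] = df["color"].map(index_map)

# One-hot encoding: one 0/1 column per unique category.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Target encoding: replace each category with the mean of the target
# for that category. For a 0/1 target, this mean is the probability of 1.
target_means = df.groupby("color")["bought"].mean()
df["color_target"] = df["color"].map(target_means)
```

Note that one-hot encoding adds one column per unique category, while index and target encoding each add a single column regardless of how many categories the feature has.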

Consider applying these encoders in the following scenarios:

  • One Hot Encoder when the feature cardinality does not exceed 100 unique values.
  • Target Encoder when the feature cardinality exceeds 100 unique values.
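
The cardinality guideline above can be sketched as a small helper in Python with pandas. The function name choose_encoder and the returned labels are hypothetical and are not part of the product; only the threshold of 100 unique values comes from the guideline.

```python
import pandas as pd

# Threshold taken from the guideline: up to 100 unique values favors one-hot.
CARDINALITY_THRESHOLD = 100


def choose_encoder(feature: pd.Series, threshold: int = CARDINALITY_THRESHOLD) -> str:
    """Suggest an encoder label based on the number of unique values.

    Hypothetical helper: returns "one_hot" when the cardinality does not
    exceed the threshold, and "target" otherwise. Missing values are not
    counted as a category (pandas nunique() ignores them by default).
    """
    n_unique = feature.nunique()
    return "one_hot" if n_unique <= threshold else "target"


# Example features with low and high cardinality (made-up values).
low_cardinality = pd.Series(["a", "b", "c"] * 10)
high_cardinality = pd.Series([f"id_{i}" for i in range(250)])

suggestion_low = choose_encoder(low_cardinality)
suggestion_high = choose_encoder(high_cardinality)
```

A feature with exactly 100 unique values still falls on the one-hot side, since the guideline says "does not exceed 100".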

After encoding, apply the Edit Metadata activity to indicate the feature's categorical type.