Data preparation

Before preparing the data, investigate the dataset until you have a good understanding of its structure and contents.

Data must be transformed so that algorithms can understand and consume it. The data generated in business applications is generally incomplete (it lacks attribute values, lacks certain attributes of interest, or contains only aggregate data), noisy (it contains errors or outliers), and inconsistent (it contains discrepancies in codes or names).

The data must be formatted, cleaned, and organized before it is fed into model training.

Answer this question: What are the inconsistencies and defects in the data that need to be resolved?
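
One way to start answering that question is to profile the dataset for missing values, duplicate rows, and inconsistent codes. The following is a minimal sketch in plain Python, assuming the records are available as a list of dictionaries in which None marks a missing value. The sample records, the field names, and the profile_defects helper are hypothetical and only serve as an illustration.

```python
from collections import Counter

# Hypothetical sample records; None marks a missing attribute value.
records = [
    {"customer": "A01", "country": "US", "amount": 120.0},
    {"customer": "A02", "country": "usa", "amount": None},
    {"customer": "A03", "country": "US", "amount": 95.5},
    {"customer": "A03", "country": "US", "amount": 95.5},  # exact duplicate
    {"customer": "A04", "country": None, "amount": 110.0},
]


def profile_defects(rows):
    """Count missing values per field and exact duplicate rows."""
    missing = Counter()
    seen = set()
    duplicates = 0
    for row in rows:
        for field, value in row.items():
            if value is None:
                missing[field] += 1
        # Sort by field name so that equal rows produce equal keys.
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return dict(missing), duplicates


missing, duplicates = profile_defects(records)

# Distinct spellings of a code reveal inconsistencies such as "US" vs "usa".
country_codes = Counter(r["country"] for r in records if r["country"] is not None)

print("missing values per field:", missing)
print("duplicate rows:", duplicates)
print("country codes:", dict(country_codes))
```

A profile like this does not fix anything by itself. It only lists the defects that the cleaning steps then have to resolve.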

These are common pre-processing practices:

  • Cleaning: fill in or remove missing values, smooth noisy data, identify or remove outliers, resolve inconsistencies
  • Transformation: normalization and aggregation
  • Reduction: reduce the volume of data while still producing the same or similar analytical results
  • Discretization: replace numerical attributes with nominal ones
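
The four practices above can be sketched on a single numeric attribute. This is a minimal illustration in plain Python, not a definitive recipe: the sample values, the plausible age range of 0 to 120, the choice of the median as a fill value, and the age_group labels are all assumptions made for the example.

```python
from statistics import median

# Hypothetical raw values for one numeric attribute; None marks a missing value.
ages = [23, 25, None, 31, 38, 41, 29, 250]

# Cleaning: fill missing values with the median of the observed values.
observed = [a for a in ages if a is not None]
fill = median(observed)
filled = [fill if a is None else a for a in ages]

# Cleaning: remove outliers outside an assumed plausible range of 0 to 120.
cleaned = [a for a in filled if 0 <= a <= 120]

# Transformation: min-max normalization to the range [0, 1].
low, high = min(cleaned), max(cleaned)
normalized = [(a - low) / (high - low) for a in cleaned]

# Reduction: keep every second value as a simple systematic sample.
sample = cleaned[::2]


# Discretization: replace each number with a nominal label.
def age_group(age):
    if age < 30:
        return "young"
    if age < 40:
        return "middle"
    return "senior"


groups = [age_group(a) for a in cleaned]

print(cleaned)
print(normalized)
print(sample)
print(groups)
```

The order matters in this sketch: the outlier is removed before normalization, because a single extreme value such as 250 would otherwise compress all other values toward 0.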

After each transformation, you can save the resulting dataset so that you can use it in another quest.
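
If you work with the data outside the tool, the same idea of saving an intermediate result can be sketched with Python's csv module. The rows, the file name customers_step1.csv, and the save_dataset and load_dataset helpers are hypothetical; a temporary directory is used only to keep the example self-contained.

```python
import csv
import os
import tempfile

# Hypothetical dataset after one transformation step.
rows = [
    {"customer": "A01", "age_group": "young"},
    {"customer": "A02", "age_group": "senior"},
]


def save_dataset(rows, path):
    """Write a list of dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def load_dataset(path):
    """Read the CSV file back into a list of dictionaries (values as strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# The file name is an assumption chosen for this sketch.
path = os.path.join(tempfile.mkdtemp(), "customers_step1.csv")
save_dataset(rows, path)
reloaded = load_dataset(path)
print(reloaded)
```

Note that CSV stores every value as text, so numeric columns need to be converted back to numbers after loading.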