tags:
  - preprocess
aliases:
  - preprocess data

General idea

Data preprocessing is a crucial step performed on data before Data Analysis or before feeding the data to models in Machine Learning or deep learning. It usually involves steps such as normalization or standardization, handling outliers and missing data, and reducing data dimensionality (or cardinality), for example with dimensionality reduction techniques such as Principal Component Analysis (PCA), UMAP or t-SNE.

PCA illustration :

UMAP illustration :

T-SNE illustration :

These techniques assume some things about the underlying data :

PCA is sensitive to the variances of the initial variables, so if the variables are on different scales, the principal components are dominated by the variables with the largest variance. Standardization is therefore usually applied before PCA: it ensures that each variable contributes equally to the principal components, making the PCA results more interpretable and reliable. In practice, standardization works well when the variables are roughly normally distributed. If the data is strongly skewed or heavy-tailed, you might need to consider other preprocessing steps or transformations (for example a log transform) to make it suitable for PCA.
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

The data is uniformly distributed on a Riemannian manifold;
The Riemannian metric is locally constant (or can be approximated as such);
The manifold is locally connected.

t-SNE is a non-parametric algorithm, which means that it does not make many assumptions about the data or the way that features are related. If you are working with a dataset that has many features that are not linearly related, you may be better off using t-SNE than another algorithm that makes stronger assumptions about the structure of the input data.
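
The scale sensitivity of PCA described above can be seen in a small NumPy sketch. The feature names (`height_m`, `income`), the synthetic values and the helper `first_pc` are assumptions made up for illustration, not part of any library.

```python
import numpy as np

def first_pc(X):
    """Return the unit-length first principal component of X (rows = samples)."""
    Xc = X - X.mean(axis=0)                  # PCA works on centered data
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    return eigvecs[:, -1]                    # eigenvector of the largest eigenvalue

rng = np.random.default_rng(0)               # fixed seed, illustrative data only
height_m = rng.normal(1.7, 0.1, size=500)    # hypothetical feature, small scale
income = rng.normal(30_000, 5_000, size=500) # hypothetical feature, large scale
X = np.column_stack([height_m, income])

# Raw data: the large-variance feature dominates the first component.
pc_raw = first_pc(X)

# Standardized data: both features have unit variance before PCA.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc_std = first_pc(X_std)

print("first PC loadings, raw data:         ", np.abs(pc_raw))
print("first PC loadings, standardized data:", np.abs(pc_std))
```

On the raw data the first component points almost entirely along `income`; after standardization the two features load with equal magnitude.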

Normalization & standardization

Normalization and standardization are two different preprocessing techniques used in data analysis and machine learning to bring data into a common scale. The primary difference between the two lies in the method of scaling and their objectives.

Normalization

Normalization scales the data to a fixed range, usually [0, 1]. It is also known as Min-Max scaling: each value is mapped to $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$. This method is useful when you want to bring all variables onto the same scale, but it does not take into account the distribution of the data. This means that normalization might not be suitable for data with skewed distributions, as it does not change the shape of the distribution.
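
A minimal sketch of Min-Max scaling with NumPy, assuming a one-dimensional array of values. The function name `min_max_scale` and its `lo`/`hi` parameters are hypothetical, chosen for this example.

```python
import numpy as np

def min_max_scale(x, lo=0.0, hi=1.0):
    """Linearly map x so that its minimum becomes lo and its maximum becomes hi."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # All values are identical: there is no range to stretch.
        return np.full_like(x, lo)
    return lo + (x - x.min()) * (hi - lo) / span

values = np.array([2.0, 4.0, 6.0, 10.0])
scaled = min_max_scale(values)
print(scaled)  # the values 0, 0.25, 0.5 and 1
```

The mapping is linear, so the relative spacing of the values (and therefore the shape of the distribution) is preserved.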

Standardization

Standardization, on the other hand, scales the data based on the mean and standard deviation of the data, resulting in a distribution with a mean of 0 and a standard deviation of 1. This method is useful when you want to ensure that all variables contribute equally to the model, regardless of their original scale. It is particularly useful for algorithms that are sensitive to the scale of the input features, such as Support Vector Machines (SVMs), K-Nearest Neighbors (KNN), and Principal Component Analysis (PCA).

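$z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ the standard deviation of the feature.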

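In other words, standardization (or Z-score normalization) is the process of rescaling the features so that they have a mean of μ=0 and a standard deviation of σ=1, like a standard normal distribution. It does not make the data Gaussian: the shape of the distribution is unchanged.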
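
A minimal sketch of Z-score standardization with NumPy, applied to deliberately skewed synthetic data. The helpers `standardize` and `skewness` are hypothetical names written for this example; the sketch uses the population standard deviation (`ddof=0`), which is an assumption.

```python
import numpy as np

def standardize(x):
    """Rescale x to mean 0 and standard deviation 1 (population std, ddof=0)."""
    x = np.asarray(x, dtype=float)
    sigma = x.std()
    if sigma == 0:
        # Constant feature: only centering is possible.
        return np.zeros_like(x)
    return (x - x.mean()) / sigma

def skewness(a):
    """Third standardized moment, used here as a simple measure of shape."""
    a = np.asarray(a, dtype=float)
    return np.mean(((a - a.mean()) / a.std()) ** 3)

rng = np.random.default_rng(0)                    # fixed seed, illustrative only
skewed = rng.exponential(scale=2.0, size=10_000)  # right-skewed synthetic data
z = standardize(skewed)

print("mean:", z.mean(), "std:", z.std())
print("skewness before:", skewness(skewed), "after:", skewness(z))
```

The mean and standard deviation change, but the skewness is the same before and after: the rescaling shifts and stretches the data without altering its shape.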

Drawbacks :

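Normalization is sensitive to outliers: a single extreme value sets the minimum or maximum and compresses all other values into a narrow part of the range, so it is a poor choice when the data set contains outliers. Standardization produces values that are not bounded to a fixed range (unlike normalization).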

In short, normalization is sensitive to outliers and should be avoided if the dataset contains extreme values, while standardization is less sensitive to them (although the mean and standard deviation are still affected by extreme values) and does not squeeze the data into a fixed range.
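
To make the effect of an outlier concrete, the following sketch applies both scalings to a tiny made-up data set with one extreme value. The numbers and the helper names `min_max` and `z_score` are assumptions for illustration.

```python
import numpy as np

def min_max(x):
    """Min-Max scaling to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Z-score standardization (population std)."""
    return (x - x.mean()) / x.std()

inliers = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(inliers, 100.0)      # one extreme value added

mm = min_max(with_outlier)
zs = z_score(with_outlier)

# How much of the output scale the five ordinary values still occupy.
inlier_span_mm = mm[:5].max() - mm[:5].min()
inlier_span_z = zs[:5].max() - zs[:5].min()

print("min-max: inliers span", inlier_span_mm, "of the [0, 1] range")
print("z-score: inliers span", inlier_span_z, "standard deviations")
print("z-score of the outlier:", zs[-1])
```

With Min-Max scaling the five ordinary values end up in roughly the bottom 4% of the [0, 1] range. The z-scores are also pulled together by the outlier, since it inflates the mean and standard deviation, but the result is not forced into a fixed interval.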

Non-normally distributed data with linear correlations

When dealing with non-normally distributed data with linear correlations between features, two main dimensionality reduction techniques are suitable:

1. Principal Component Analysis (PCA):

Assumptions: While PCA works best with normally distributed data, it can still be effectively used for dimensionality reduction even if the data doesn't follow a normal distribution, especially if the primary goal is to capture linear relationships between features. This is because PCA focuses on maximizing variance, which often aligns with linear correlations.
Process: PCA identifies the principal components (PCs) that capture the most variance in the data. These PCs represent uncorrelated directions of maximum variance. By selecting a subset of the top PCs, you can achieve dimensionality reduction while preserving the information about linear correlations.
Limitations: PCA is sensitive to scaling. Features with larger scales will have a greater impact on PCs, regardless of their underlying relationships. Therefore, standardization or normalization is crucial before applying PCA to ensure each feature contributes equally.

2. Partial Least Squares (PLS) Regression:

Assumptions: PLS is specifically designed to handle situations where the data might not be normally distributed and there are linear relationships between features and a target variable (regression setting). It focuses on finding latent variables (LVs) that are maximally correlated with the target variable while explaining variance in the features.
Process: PLS identifies LVs that explain both the variance in the features and their relationship with the target variable. By selecting a subset of the top LVs, you can reduce dimensionality while preserving the information relevant to the target variable and linear relationships between features.
Advantages: PLS makes no normality assumption and uses the target variable to choose its components, making it a good choice when dealing with non-normal data and linear correlations in a regression context. Like PCA, it is sensitive to feature scale, so features are usually standardized first.
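
As a rough illustration of how PLS extracts latent variables, here is a minimal NumPy sketch of the single-target case (often called PLS1) using NIPALS-style deflation. It is a simplified sketch, not a library implementation: the function name `pls1_scores` and the synthetic data are hypothetical, the inputs are only centered (not scaled), and no prediction step is included.

```python
import numpy as np

def pls1_scores(X, y, n_components):
    """Return latent-variable scores T and weight vectors W for a single target y."""
    Xk = np.asarray(X, dtype=float)
    yk = np.asarray(y, dtype=float)
    Xk = Xk - Xk.mean(axis=0)                 # center features
    yk = yk - yk.mean()                       # center target
    n_samples, n_features = Xk.shape
    T = np.zeros((n_samples, n_components))
    W = np.zeros((n_features, n_components))
    for k in range(n_components):
        w = Xk.T @ yk                         # direction of maximal covariance with y
        w = w / np.linalg.norm(w)             # assumes X and y are not yet fully deflated
        t = Xk @ w                            # scores of the k-th latent variable
        tt = t @ t
        p_load = Xk.T @ t / tt                # feature loadings
        q = (yk @ t) / tt                     # target loading
        Xk = Xk - np.outer(t, p_load)         # remove what this LV explains from X
        yk = yk - q * t                       # ... and from y
        T[:, k] = t
        W[:, k] = w
    return T, W

rng = np.random.default_rng(0)                # fixed seed, synthetic data only
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

T, W = pls1_scores(X, y, n_components=2)
print("reduced shape:", T.shape)
print("correlation of first LV with y:", np.corrcoef(T[:, 0], y)[0, 1])
```

Unlike PCA, the weight vector of each component is chosen using the target `y`, so the first latent variable is strongly correlated with it. The deflation step makes successive score vectors orthogonal to each other.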

Choosing the right technique:

If you don't have a target variable and solely focus on capturing linear relationships between features: PCA is a suitable choice, even with non-normal data, as long as you standardize or normalize your features.
If you have a target variable and want to reduce dimensionality while preserving information relevant to the target variable and linear relationships between features: PLS regression is the preferred option due to its robustness to non-normality and ability to handle the target variable.

Remember to evaluate the effectiveness of your chosen technique based on the specific characteristics of your data and the intended use of the reduced dimensionality representation.