Principal Component Analysis

Datasets for principal component analysis

What kind of dataset should PCA be used on? PCA is an unsupervised learning algorithm, which means it does not require a specific outcome variable that you are trying to predict. Instead, PCA is used when you have a set of features and want to reduce the dimensionality of your feature set, condensing as much of the information in your input features as possible into a smaller set of transformed features. In particular, PCA is intended for situations where you have a set of numeric features you want to condense.
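
To make this concrete, here is a minimal sketch of condensing a small set of numeric features with scikit-learn's PCA. The toy feature matrix and the choice of two components are placeholders rather than values from any particular dataset.

Python

import numpy as np
from sklearn.decomposition import PCA

# A small numeric feature matrix standing in for your own data
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.9],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 1.3],
              [3.1, 3.0, 0.8]])

# Condense the three input features into two transformed features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # shape (5, 2)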

Advantages and disadvantages of PCA

What are the main advantages and disadvantages of PCA? Here are some of the most important ones to keep in mind when deciding whether to use it.

Advantages of principal component analysis

  • Guaranteed to produce uncorrelated features. No matter how highly correlated the input features that go into your PCA model are, the transformed features that come out of the model are guaranteed to be uncorrelated. This is a big advantage because correlated features tend to cause problems for many machine learning algorithms (see the sketch after this list).
  • Relatively fast. Another advantage of PCA is that it is relatively fast compared to other dimensionality reduction techniques. PCA makes use of simple linear algebra computations that are easy for computers to handle. That means it is a good option when you have a large dataset with many observations.
  • Not sensitive to choice of seed. Another advantage of PCA is that it is not sensitive to the choice of seed or any other initialization conditions. PCA is a deterministic algorithm, which means that it will always produce the same result when applied to the same dataset.
  • No hyperparameters. Another advantage of PCA is that there are no hyperparameters that need to be tuned. This means that you do not have to go through the additional step of hyperparameter tuning when applying PCA to your data.
  • Popular and well studied. PCA is one of the most common dimensionality reduction techniques out there, which means that many data scientists are familiar with it. This means that it will be easier for collaborators to contribute to projects that use PCA than it would be for them to contribute to projects that use more obscure algorithms.
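
As a quick check of the first and third points, the sketch below builds two highly correlated synthetic features, confirms that the PCA-transformed features are uncorrelated, and confirms that refitting on the same data reproduces the same result. The synthetic data is purely illustrative.

Python

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)  # strongly correlated with x1
X = np.column_stack([x1, x2])

Z = PCA(n_components=2).fit_transform(X)

print(np.corrcoef(X.T)[0, 1])  # close to 1: the inputs are highly correlated
print(np.corrcoef(Z.T)[0, 1])  # close to 0: the transformed features are uncorrelated

# Deterministic: fitting again on the same data reproduces the same result
print(np.allclose(Z, PCA(n_components=2).fit_transform(X)))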

Disadvantages of principal component analysis

  • Assumes relationships between features are linear. One of the main disadvantages of PCA is that it assumes the relationships between the different features in the input data are linear. This means that it may not perform well in situations where the relationships between features are non-linear.
  • Does not necessarily preserve local structure of data. PCA does not necessarily preserve the local structure of your data. This means that observations that are close together in the original feature space will not necessarily be close together in the transformed feature space. This can be a problem if you want to apply something like clustering or visualization techniques to the data.
  • Need to rescale features. Another disadvantage of PCA is that it is sensitive to scale. That means you may need to rescale your features, for example by standardizing them, before you apply PCA (see the sketch after this list).
  • Sensitive to outliers. Another disadvantage of PCA is that it is sensitive to outliers. If there are outliers in your dataset, they may have an outsized effect on the model, and you will end up with transformed features that are more representative of a few outlying points than of the bulk of the data.
  • Cannot handle missing values. Another disadvantage of traditional PCA is that it cannot handle missing data. This means you may have to preprocess your data to handle any missing values. There are some extensions of PCA that can handle missing values, but they may or may not be available in common machine learning libraries.
  • Only suitable for continuous data. Another disadvantage of PCA is that it is only suitable for continuous variables. If you have a mixture of continuous and categorical variables in your dataset, you may want to consider other dimensionality reduction methods.
  • Does not perform well when input features are not correlated. PCA does not perform well in situations where none of the input features are correlated with one another. If there is no information that is shared between features, the algorithm will not be able to compress shared information into fewer features.
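
A common way to handle the scale sensitivity noted above is to standardize the features before fitting PCA. Here is a minimal sketch, with a synthetic feature matrix standing in for your own data.

Python

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: five numeric features on wildly different scales
X = np.random.default_rng(0).normal(size=(100, 5)) * [1.0, 10.0, 100.0, 1000.0, 10000.0]

# Standardize each feature to zero mean and unit variance before PCA,
# so that no single large-scale feature dominates the components
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = scaled_pca.fit_transform(X)
print(X_reduced.shape)  # (100, 2)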

When to use principal component analysis

When should you use principal component analysis rather than another dimensionality reduction technique? Here are some examples of situations where you should use principal component analysis.

  • Many correlated features. If you have many correlated features in your dataset and you want to apply an algorithm that does not perform well on correlated features, this is a great use case for PCA. All you have to do is apply PCA to the set of correlated features and replace the input features with the transformed features produced by the PCA model. The transformed features are guaranteed to be uncorrelated with one another no matter how highly correlated the input features were (see the sketch after this list).
  • Quick and easy dimension reduction. PCA is a great model to use if you need to apply a quick and easy dimension reduction technique for something like a prototype or proof-of-concept. The model is deterministic and there are no hyperparameters to tune, so you only have to apply the model to your data one time and you are done.
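
Here is a sketch of the first pattern: chaining PCA with a downstream estimator so the model sees a few uncorrelated transformed features instead of the raw correlated inputs. The toy data, the choice of three components, and the logistic regression classifier are all illustrative assumptions rather than a fixed recipe.

Python

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data: 200 observations, 10 strongly correlated numeric features, binary target
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))
y = (latent[:, 0] > 0).astype(int)

# PCA replaces the correlated inputs with a few uncorrelated features
# before they reach the downstream classifier
model = make_pipeline(PCA(n_components=3), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))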

When not to use principal component analysis

When should you avoid using principal component analysis? Here are some examples of situations where you should avoid using principal component analysis.

  • Features are not linearly related. Principal component analysis performs best when it is applied to a dataset where the features are linearly related. If you do not think the features in your dataset are linearly related, you may be better off using a dimensionality reduction technique that makes fewer assumptions about the data. t-SNE, for example, is a non-parametric algorithm that makes fewer assumptions about the structure of the data.
  • Visualizing data. If the primary reason you want to reduce the number of dimensions in your data is so that you can visualize it, you are generally better off using an algorithm like t-SNE that preserves local relationships in the data (see the sketch after this list). Algorithms that preserve local relationships try to ensure that observations that are close together in the input feature space are also close together in the transformed feature space, which is what you want if you are trying to visualize data. PCA focuses more on preserving global trends in the data and less on preserving local relationships between specific points.
  • Need interpretable features. Most dimension reduction techniques produce features that do not have a straightforward interpretation. If you need all of the features in your dataset to be directly interpretable, you may be better off using feature selection techniques instead of traditional dimensionality reduction techniques to reduce the size of your dataset.
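
For the visualization case, here is a rough sketch using scikit-learn's t-SNE implementation; the digits dataset and the perplexity setting are arbitrary choices for illustration.

Python

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Load a small image dataset with 64 numeric features per observation
X, _ = load_digits(return_X_y=True)

# Embed into 2 dimensions for plotting; t-SNE tries to keep observations that
# are close together in the original space close together in the embedding
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)  # (1797, 2)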

Explained variance

1. Explained Variance (explained_variance_)

  • Description: This attribute is an array of shape (n_components,) that stores the amount of variance explained by each of the selected principal components (PCs).
  • Access: You can access it using the following code:

Python

from sklearn.decomposition import PCA

pca_model = PCA().fit(X)  # Fit PCA to your numeric feature matrix X; all components are kept by default
explained_variance = pca_model.explained_variance_

  • Interpretation: Each element in the explained_variance array represents the variance captured by the corresponding PC. Higher values indicate that the PC captures more of the variability in the original data.

2. Explained Variance Ratio (explained_variance_ratio_)

  • Description: This attribute is an array of shape (n_components,) that stores the proportion of the total variance explained by each of the selected PCs.
  • Access: You can access it using the following code:

Python

from sklearn.decomposition import PCA

pca_model = PCA().fit(X)  # Fit PCA to your numeric feature matrix X; all components are kept by default
explained_variance_ratio = pca_model.explained_variance_ratio_

  • Interpretation: Each element in the explained_variance_ratio array represents the fraction of the total variance captured by the corresponding PC. These values are between 0 and 1, and they sum to 1 when all components are kept (and to less than 1 when only a subset is kept).

3. Connection to Compression Rate

Both explained_variance_ and explained_variance_ratio_ help you assess the compression achieved by PCA. To understand how:

  • Compression in PCA: PCA aims to capture the most significant variations in the data using a smaller number of dimensions (PCs). This essentially compresses the data by discarding directions that contribute less to the overall variance.
  • Using explained_variance_: By summing the values in the explained_variance_ array, you get the total variance captured by the selected PCs. Dividing this sum by the original data's total variance (often estimated using n_samples - 1 degrees of freedom) gives you the proportion of variance retained after applying PCA.
  • Using explained_variance_ratio_: Directly sum the elements in the explained_variance_ratio_ array to get the cumulative proportion of variance explained by the chosen PCs. This value reflects the combined share of the data's variability captured by those PCs. A sketch of this calculation follows below.
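
As a sketch of how this looks in practice, the code below fits PCA with all components, computes the cumulative explained variance ratio, and finds the smallest number of components that retains at least 95% of the variance. The dataset and the 95% threshold are illustrative choices.

Python

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# Fit with all components so we can inspect the full variance breakdown
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95% of the variance
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components_95, cumulative[n_components_95 - 1])

# scikit-learn can also pick this count for you: pass a float in (0, 1) as n_components
X_reduced = PCA(n_components=0.95).fit_transform(X)
print(X_reduced.shape)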