Principal Component Analysis (PCA)
Principal Component Analysis projects the data onto a lower-dimensional subspace while retaining as much of the data's variance as possible.
Algorithm
- Compute the covariance matrix $\Sigma$ of the data
- Find the eigenvalues of $\Sigma$ and sort them in decreasing order
- Compute an orthonormal basis of eigenvectors
- Keep the first $k$ "principal components" of this basis (see the sketch below)
Pros
- Dimensionality Reduction: PCA reduces the number of features in a dataset while retaining as much variance as possible. This simplifies the dataset and makes it easier to visualize (see the example after this list).
- Feature Extraction: PCA identifies patterns in data by finding the principal components, which are linear combinations of the original features. These components capture the most significant variation in the data.
- Redundancy Removal: Because the principal components are uncorrelated, PCA removes linear redundancy between features and discards low-variance directions that often carry mostly noise. This can lead to better generalization and improved model performance.
- Speed: Working with a reduced set of dimensions can significantly speed up the training of machine learning algorithms and reduce the computational load. PCA itself is also cheap to compute.
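As a concrete usage example, scikit-learn's `PCA` covers the common case of reducing a dataset for visualization; the Iris dataset and `n_components=2` below are illustrative choices, not prescriptions from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Reduce the four Iris features to two components for plotting
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance each kept component explains
print(pca.explained_variance_ratio_)
```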
Cons
- Loss of Interpretability: The principal components derived from PCA are linear combinations of the original features, and their interpretation might not be straightforward in terms of the original variables.
- Assumption of Linearity: PCA assumes that the relationships between variables are linear. If the underlying relationships are nonlinear, PCA may not capture them effectively.
- Sensitivity to Scale: PCA is sensitive to the scale of the variables. If variables are on different scales, those with larger variances will dominate the principal components. It is important to standardize the data beforehand, at least when the features are not measured in the same units (see the sketch after this list).
- Non-Gaussian Data: PCA only uses second-order statistics (variances and covariances), which fully describe Gaussian data. If the data is highly non-Gaussian, the directions of maximum variance may miss important structure, and PCA might not be the most suitable technique.
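To illustrate the scaling point above, a common remedy is to standardize each feature to zero mean and unit variance before applying PCA. The scikit-learn pipeline below is one standard way to do this; the synthetic two-feature data is made up purely for the demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on wildly different scales: without standardization, the
# second feature's much larger variance would dominate the first component
X = np.column_stack([
    rng.normal(0.0, 1.0, 200),     # unit-scale feature
    rng.normal(0.0, 1000.0, 200),  # ~1000x larger scale
])

# StandardScaler rescales each feature to zero mean and unit variance
pipeline = make_pipeline(StandardScaler(), PCA(n_components=1))
X_reduced = pipeline.fit_transform(X)
```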