Basic Introduction to PCA?

Richard Mei
3 min read · Nov 30, 2020

Principal Component Analysis, commonly known as PCA, is widely used for dimensionality reduction. If we had a data set with maybe four or five features, we could easily plot all the variables against each other in pair plots. We’d be able to observe the pairs and see if there are any possible correlations, but what if we had hundreds of features? What if we wanted to compare combinations of variables? It would be much harder, which is why we need some form of dimensionality reduction.

The simplest explanation of PCA that I can think of is this: look for the principal components of the data set, the directions that explain the most variance.

Principal Components in 2D

In two dimensions, a principal component is a linear combination of two non-categorical (numeric) variables. To find that linear combination, we

  1. First take the two variables and plot them. With this plot of the data, we may or may not see patterns like groupings, but we can certainly see the spread.
  2. We then center the data on the origin, which keeps the relative positions of all the points the same.
  3. We draw a line through the origin, project every point onto it, and measure the sum of the squared distances from the projected points to the origin.
  4. We repeat this process until we find the line that maximizes this value. The reasoning is that this line explains the majority of the variability between the two variables, making it our first principal line.
  5. Next we find the line through the origin that is orthogonal to our first principal line; that is our second principal line. (A code sketch of these steps follows below.)
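To make the steps concrete, here is a minimal NumPy sketch of steps 1 through 5. The two variables and the angle-sweep resolution are made up purely for illustration:

import numpy as np

# Toy 2-D data: two correlated, non-categorical variables (made up for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.4, size=200)
data = np.column_stack([x, y])

# Step 2: center the data on the origin
centered = data - data.mean(axis=0)

# Steps 3-4: try many lines through the origin, keeping the one that maximizes
# the sum of squared distances of the projected points from the origin
best_angle, best_score = 0.0, -np.inf
for angle in np.linspace(0, np.pi, 1800, endpoint=False):
    direction = np.array([np.cos(angle), np.sin(angle)])
    projections = centered @ direction    # signed distance of each projected point from the origin
    score = np.sum(projections ** 2)      # sum of squared distances
    if score > best_score:
        best_angle, best_score = angle, score

pc1 = np.array([np.cos(best_angle), np.sin(best_angle)])

# Step 5: the second principal line is orthogonal to the first
pc2 = np.array([-pc1[1], pc1[0]])
print("PC1 direction:", pc1)
print("PC2 direction:", pc2)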

In this terminology, the directions of these lines are called the eigenvectors, and their magnitudes (how much variance each line captures) are called the eigenvalues. A video by StatQuest gives an amazing and thorough explanation of PCA (link).
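In code, those eigenvectors and eigenvalues fall straight out of the covariance matrix. A short sketch, reusing centered from the sketch above, shows they match the line we found by searching:

import numpy as np

# Eigendecomposition of the covariance matrix: eigenvectors are the directions
# of the principal lines, eigenvalues their magnitudes (the variance along each)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# np.linalg.eigh returns eigenvalues in ascending order, so the last column is PC1
print("eigenvalues (largest first):", eigenvalues[::-1])
print("PC1 (eigenvector with largest eigenvalue):", eigenvectors[:, -1])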

Takeaways

When doing a full PCA we end up with components that each explain a percentage of our variance. Each component has a different combination of all the features, along with its own eigenvector and eigenvalue.
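In sklearn, these pieces are exposed directly on a fitted PCA object. A minimal sketch; the feature matrix X here is a random stand-in, not an actual data set:

from sklearn.decomposition import PCA
import numpy as np

X = np.random.default_rng(1).normal(size=(100, 5))  # stand-in for a numeric feature matrix

pca = PCA()
pca.fit(X)
print(pca.explained_variance_ratio_)  # share of the variance each component explains
print(pca.components_)                # each row: one component's weights on the features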

Each component can tell a story about the data. For example, take the first principal component of a fuel economy data set: we see high loadings on cylinders, displacement, horsepower, and weight. This seems to me like a really good car that can go fast and has enough weight. Given the acceleration, I would further think this component describes a fast car, since acceleration has a negative loading here; a more negative acceleration, despite how it seems, does not actually mean slowing down.
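To read a component this way in code, you might pair each loading with its feature name. A hypothetical sketch, reusing the pca object fitted above; the names are the fuel economy features mentioned here, and the values are whatever that fit produced:

feature_names = ["cylinders", "displacement", "horsepower", "weight", "acceleration"]
loadings = pca.components_[0]  # first component's weight on each feature
for name, weight in zip(feature_names, loadings):
    print(f"{name:>12}: {weight:+.2f}")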

Conclusion

Overall, that was a basic overview of PCA. To summarize, PCA is one way to do dimensionality reduction for very large data sets. It produces multiple principal components, each made up of a different weighting of our data’s features. Each principal component can tell a story, potentially by helping to categorize a target.

There’s a lot more math behind PCA, so definitely look for more resources if you’re interested in that. PCA is a great tool for working with data, and it is very simple to implement in Python through the sklearn library. If you’re interested, here’s something to point you in the right direction:

from sklearn.decomposition import PCA
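And a slightly fuller sketch of how that import gets used, with placeholder data (scaling first, since PCA is sensitive to feature scale):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.random.default_rng(42).normal(size=(200, 6))  # stand-in for your numeric features

X_scaled = StandardScaler().fit_transform(X)  # standardize so no feature dominates
pca = PCA(n_components=2)                     # keep the two strongest components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # variance explained by the kept components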
