Analyzing large data sets comes with multiple challenges. One of the challenges is to get data in the right structure for the analysis. Without preprocessing the data, your algorithms might have difficult time converging and/or take a long time execute. One of the techniques that we used at TCinc is Principal Component Analysis (PCA).
The official definition of PCA from Wikipediai is “Principal component analysis (PCA) is a statistical procedure that uses orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.”
From an engineering point of view, PCA is defined as a dimensionality reduction algorithm. Two straightforward examples of where the dimension might need to be reduced are data visualization and data compression. For example in data visualization, spreadsheets with a lot of columns are difficult to visualize. By reducing the data to 3 or 2 dimensions, you can chart/graph it. An example of data compression is an image compression. Processing large images is computationally expensive. By reducing a large image and still preserving the critical data in the image that is required for processing, you can speed up the processing and use less resources.
Data always has some redundancies, noise, and echoes. With PCA you are looking for the component of the data that has the largest variation. In terms of math, you are looking to maximize the diagonal entries in a covariance matrix and reduce the off-diagonal entries. Here is the formula to calculate the covariance.
(I’m dividing here by m and not (m-1), both are valid. Dividing by m gives you the population covariance matrix and dividing by (m-1) gives you the sample covariance matrix.)
The above formula says that the mean normalized data is multiplied by its transpose. Since we want our code to run fast we should vectorize all of our operations. Here is the vector representation of the above formula that has been mean normalized ahead of time.
Before we start solving for the principal components, here are assumptions that we are making about the data and the process. Linearity is the main assumption. We are assuming that the data is linearly related. We are also assuming that data has high signal to noise ratio. This would mean that the larger principal components represent signal and the smaller components represent the noise. The last assumption that will help us in developing the PCA algorithm is that the principal components are orthogonal. We also define dimensionality reduction as transformation.
We are trying to find a orthonormal matrix WD , the columns of which are the principal components that map the data “X” to a lower dimension and where D represents the dimension.
We have used two methods to find the principal components, eigenvector decomposition and singular value decomposition (SVD). Both methods are dependent on the libraries available in the programming languages used and performance required. If the data set is larger it can be computationally expensive to calculate the covariance matrix that we need in eigenvector decomposition. In SVD you don’t have to calculate the covariance matrix. Here are the two algorithms. We will leave off the algebraic proof for both algorithms for a different blog.
The PCA eigenvector decomposition algorithm :- Mean normalize the data (feature scale if needed)- Calculate the covariance matrix – Find the eigenvectors in the covariance matrix.- Sort the eigenvectors from largest to smallest- Use the largest ones as the principal components- Multiply the data by the largest components to transform the data
The PCA SVD algorithm :- Mean normalize the data (feature scale if needed)- Use a library and call the SVD function on the data- Depending on data structure, use either the left or right singular vectors- SVD orders the principal components, use the top entries in singular vector- multiple the data by the singular vector
We have both methods implemented in multiple programming languages. I think the easiest one to understand is the R version. Here is the R version of PCA implement using SVD.
library("biOps", lib.loc="/Library/Frameworks/R.framework/Versions/3.0/Resources/library") ## load image img <- readJpeg("hadoop.jpg") size <- dim(img) x <- size y <- size grey <- matrix(0,x,y) k <- 285 #was calculated with sum(s$d[1:285])/sum(s$d) = 0.9919 -> covers 99% of the data ## convert image to grey grey <- imgRGB2Grey(img) ## save grey image writeJpeg("grey.jpg",grey) ## calculate mean columnMeans <- colMeans(grey) ## subtract mean greyNoMean = t(apply(grey,1,columnMeans,FUN="-")) s <- svd(greyNoMean) Ureduce = s$v[,1:k] reduced = greyNoMean %*% Ureduce ## save reduced image writeJpeg("grey-reduced.jpg",imagedata(reduced)) recovred =reduced %*% t(Ureduce) ## add mean recovredWithMean = t(apply(recovred,1,columnMeans,FUN="+")) ## save recovered image writeJpeg("grey-recovred.jpg",imagedata(recovredWithMean))
We have used the biOps package for saving, loading and converting the image from RGB to grey. We chose to run the algorithm on an image to show you image compression and help with visualizing what the algorithm is doing.Here is the original image, 495x500x3:
We converted the image from RGB to grey to simplify the algorithm. In practice you often convert the image to grey since you still retain the data and you simplify the processing. Here is the grey image, 495×500:
Here is the reduced image, 285×500:
As you can see, it does not mean much to the human eye, but the important data is there. We can now apply other machine learning algorithms to the data. We wanted to preserve 99% of the data. We used the output from SVD to calculate the error. The entries in the diagonal matrix returned from SVD are sorted from the largest to smallest. Here is the formula used to calculate the error:
In this example 285 out of 485 principal components retain 99% of the data. We have reduced the data by 200 dimensions and still retain most of the critical data. PCA is a lossy compression. Here is how the image looks after it was recover from the reduced image, 495×500:
As you can see we were able to recover the image. There are artifacts in the image but the important data is there, you can see that the image is similar to the original image.
PCA is a powerful data preprocessing tool and it can be applied to multiple machine learning processing pipelines.