| Principal Component Analysis (PCA) |
|
Mass Frontier offers a classification method called Principal Component Analysis (PCA). The central idea of PCA is to reduce the dimensionality of a data set that contains a large number of interrelated (i.e., correlated) variables while retaining as much as possible of the variation present in the data set. In mass spectrometry, the data set consists of the mass spectra of different compounds, and each spectrum is expressed as the intensities at individual m/z values (the variables). The aim of PCA is to find a new coordinate system whose axes are linear combinations of the original variables (the mass-to-charge ratios, m/z) and that describes the major trends in the data.
Mathematically, PCA relies on the eigenvalue/eigenvector decomposition of the covariance or correlation matrix of the original variables. PCA decomposes the data matrix X into the product of two matrices: P, the matrix of the new coordinates of the data points, and T′, the transpose of the matrix T of coefficients of the linear combinations of the original variables:
X = P × T′
Generally, the data can be adequately described using far fewer coordinates (the principal components) than original variables. PCA therefore serves both as a data reduction method and as a very good visualization tool: when the data points are plotted in the new coordinate system, relationships and clusters are often more apparent than when the points are plotted in the original coordinates.
Geometrical interpretation of PCA. The axes of the new coordinate system, the principal components p1 and p2, are created as linear combinations of the original axes. The new coordinates (PCs, principal components) are orthogonal (perpendicular) to each other. There is clearly greater variation in the direction of p1 than in either of the original variables, but very little variation in the direction of p2. The first PC describes the direction of the greatest variation in the data set, the second PC describes the direction of the second greatest variation, and so on for data sets with more than two variables. |
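
The decomposition and its geometrical interpretation can be illustrated with a minimal NumPy sketch. This is not Mass Frontier's implementation. The function name `pca`, the toy two-variable data set, and the choice of the covariance (rather than the correlation) matrix are assumptions made for illustration only. The sketch also assumes that X is mean-centered before decomposition, so that X = P × T′ holds exactly for the centered data when all components are kept.

```python
import numpy as np


def pca(X, n_components):
    """PCA by eigendecomposition of the covariance matrix (illustrative sketch).

    X has one row per spectrum and one column per variable (m/z intensity).
    Returns the scores P, the coefficient matrix T, the fraction of variance
    explained by each retained component, and the column means of X.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                            # center each variable
    cov = np.cov(Xc, rowvar=False)           # covariance of the variables
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    T = eigvecs[:, :n_components]            # coefficients of the linear combinations
    P = Xc @ T                               # new coordinates of the data points
    explained = eigvals[:n_components] / eigvals.sum()
    return P, T, explained, mean


# Hypothetical toy data: two strongly correlated variables.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2])

P, T, explained, mean = pca(X, n_components=2)

print("fraction of variance per component:", explained)
print("axes orthogonal:", np.allclose(T.T @ T, np.eye(2)))
print("X recovered from P x T':", np.allclose(P @ T.T + mean, X))
```

With data of this kind, almost all of the variation lies along the first component, so keeping only the first column of `P` would already describe the data set well. For real spectra, whether to use the covariance or the correlation matrix depends on whether the intensity variables should be scaled to unit variance first.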

