Data mining in mass spectrometry using higher dimensions of Self-Organising Maps for identification of unknowns with NIST library

Alexej Nikiforov1; Julia E. Wingate2; Robert Mistrik3
1Ins.Org.Chem. Univ.Vienna, Vienna, Austria

Introduction
Data Mining (DM), together with Knowledge Discovery in Databases (KDD), emerged around 1984 as a new scientific discipline for the "nontrivial extraction of implicit, previously unknown, and potentially useful information from data" (1). As MS libraries and MS data sets grow continuously in terms of both order and extent, it is tempting to apply DM concepts and methodology to MS data in search of improved interpretation tools. Self-Organizing Maps (SOM) are of interest in this respect because they perform unsupervised clustering and can map an N-dimensional space onto 2D.

(1) W. Frawley, G. Piatetsky-Shapiro and C. Matheus: AI Magazine; 213-228 (1992)

Methods
The software used for the calculations was Mass Frontier 3.0, developer's version. All calculations were performed on a desktop PC. The NIST 98 library installed under Mass Frontier 3.0 was used.

Results
When we initially studied the ability of PCA and SOM to handle larger numbers of spectra (in the range of 10^4 to 10^5) from the NIST library, the limitations of PCA soon became clear. SOM, however, was able to handle such large data sets because its resolution can be raised by increasing the map dimension.
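To illustrate how a SOM maps N-dimensional spectra onto a 2D grid, here is a minimal pure-Python sketch. It is not the Mass Frontier implementation, which this abstract does not describe. The function names `train_som` and `best_matching_unit`, the linear decay of learning rate and neighbourhood radius, and the four-channel toy "spectra" are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the map node whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols SOM on equal-length vectors; returns weights[r][c]."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0          # assumed initial neighbourhood radius
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                   # assumed linear decay
            sigma = max(sigma0 * (1.0 - frac), 0.5)   # radius shrinks, floor at 0.5
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * sigma * sigma))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


# Toy data: two artificial groups of 4-channel "spectra" (not real MS data).
group_a = [[1.0, 0.0, 0.0, 0.0] for _ in range(10)]
group_b = [[0.0, 0.0, 0.0, 1.0] for _ in range(10)]
som = train_som(group_a + group_b, rows=4, cols=4, epochs=30)
print(best_matching_unit(som, group_a[0]), best_matching_unit(som, group_b[0]))
```

Enlarging `rows` and `cols` gives the map more nodes over which to spread the spectra, which is the sense in which a larger map dimension raises resolution. The cost is that every training step touches more nodes.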
To investigate the concept of DM in MS, structure and elemental composition were used as independently monitored control parameters of the classified spectra. Because direct classification of all library spectra was not an option, given their huge number, several groups of spectra (sub-libraries) were built: element-specific ones (CxHx, CxHxOx, CxHxNx, CxHxSx, CxHx(Halogen)x) and some specific to functional subgroups (COOR, CO, etc.). Although the individual sub-libraries still contained several thousand spectra each, SOMs with dimensions higher than 80x80 separated most of these groups when they were analyzed in combinations of two or three at a time. In one approach, the unknown was first compared successively with pairs of sub-libraries (e.g. CxHx/CxHxOx, CxHx/CxHxNx, etc.) and then, in a second step, with the best-fitting sub-library. For a general unknown not contained in the MS library, analysis as part of such combinations opens a way to determine the presence of heteroatoms from low-resolution spectra alone. An additional aspect is the determination of relevant similarities or sub-structures for the unknown.
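The two-step comparison could be organised roughly as in the following sketch. The abstract does not say how the "best-fitting" sub-library is chosen, so the majority vote over pair maps used here is an assumption. The helper names `best_node` and `screen_unknown`, the flat codebook layout, and the three-channel toy vectors are likewise hypothetical.

```python
from collections import Counter


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_node(codebook, x):
    """Index of the codebook (trained SOM node) vector closest to x."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], x))


def screen_unknown(pair_maps, single_maps, unknown):
    """Step 1: vote over two-class maps. Step 2: place the unknown on the winner's map."""
    votes = Counter()
    for codebook, node_labels in pair_maps:
        label = node_labels[best_node(codebook, unknown)]
        if label is not None:        # nodes hit by no library spectrum carry no label
            votes[label] += 1
    if not votes:
        return None, None
    best_class = votes.most_common(1)[0][0]   # assumed "best fitting" criterion
    return best_class, best_node(single_maps[best_class], unknown)


# Toy codebooks (hand-written, not trained on NIST data); each node has a class label.
pair_maps = [
    ([[1, 0, 0], [0, 1, 0]], ["CxHx", "CxHxOx"]),
    ([[1, 0, 0], [0, 0, 1]], ["CxHx", "CxHxNx"]),
    ([[0, 1, 0], [0, 0, 1]], ["CxHxOx", "CxHxNx"]),
]
single_maps = {
    "CxHx": [[1, 0, 0], [0.8, 0.2, 0]],
    "CxHxOx": [[0, 1, 0], [0.5, 0.5, 0]],
    "CxHxNx": [[0, 0, 1]],
}
result = screen_unknown(pair_maps, single_maps, [0.2, 0.9, 0.0])
print(result)
```

In the second step, the library spectra that map to the winning node, or to its neighbours, would be the natural candidates for similar structures or sub-structures.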
Calculating higher SOM dimensions with tens of thousands of spectra can sometimes take several days, even on the fastest PCs (2 GHz AMD or Pentium 4). However, once the SOMs for the element- and functional-subgroup-specific sub-sets of spectra have been calculated, the unknown spectrum only needs to be compared with these previously calculated SOMs. Such subsequent comparisons are relatively fast and may even be performed on-line with GC/MS data sets. The approach is of limited use for EI spectra with very few signals and for spectra from soft ionization methods. Several other applications of DM techniques, such as profiling and the analysis of single GC(LC)-MS data sets, are shown.
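A rough operation count suggests why training is slow while a later comparison is fast. The figures below are illustrative assumptions rather than values from this work: 500 intensity channels per spectrum, 20,000 training spectra, 100 training passes, and a worst-case update that touches every node. Only the 80x80 map size comes from the text.

```python
def som_costs(rows, cols, dim, n_spectra, epochs):
    """Approximate multiply-add counts for one lookup and for full training."""
    nodes = rows * cols
    lookup = nodes * dim                          # one best-matching-unit search
    # Each training step does one search plus (at worst) an update of every node.
    training = epochs * n_spectra * 2 * lookup
    return lookup, training


# Assumed figures for illustration only (see the note above).
lookup, training = som_costs(rows=80, cols=80, dim=500, n_spectra=20000, epochs=100)
print(lookup, training, training // lookup)
```

Under these assumptions a single lookup is several million times cheaper than training the map. That is consistent with comparing an unknown against precomputed SOMs being feasible during a GC/MS run.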





