Mass Spectral Database Initiative Meeting

ASMS 2003 Montreal, Canada:
June 10, 2003

Robert Mistrik: Good morning and thank you for coming to our Mass Spectral Database Initiative meeting.

Not long ago, none of us paid much attention to MS database software platforms. Most people used library search only for EI spectra, and we were satisfied with the tools provided by data vendors like NIST or Wiley. But the amazing growth of new mass spectral techniques has transformed our view of data management. As we move from gas chromatography, single-stage, unit-resolution and electron impact records to a much broader spectrum of instrument outcomes, we face a number of new demands. The most important is the need for a database platform that is able to manage complex mass spectral and chromatographic data of a reference or non-reference nature. At Highchem, we were confronted with growing pressure from our customer base to create a database for high-resolution and tandem spectra, and so we decided, more than a year ago, to develop a new database concept. I think many of you have experienced the same pressure, either as an end user, a reference data producer or a hardware or software vendor. And so it seemed to us a good idea to organize a meeting where we could present our database concept, share our ideas and, hopefully, coordinate efforts leading to a generally accepted format.

I’d like to start with a few comments about the NIST database software. NIST not only produces reference spectra but has also developed database management software. This database is available in their search software and also as a DLL, which many of us have built into our software products. The DLL database is very fast, I have never experienced any run-time problems with it, and the code is of high quality. Implementing the DLL is not a problem for a skilled programmer. The DLL allows library searches and contains a number of useful tools. It has been implemented in a number of software products and, of particular interest, it has been free for the last year. Despite the popularity of the DLL database, there are some drawbacks. The most significant is its inability to manage high-resolution and tandem spectra. The second problem is that the DLL is not based on a common database format. The FairCom platform might not be inferior to the giant database products, but higher development and integration costs could become an issue when dealing with such databases. In addition, transporting user data collections across large company servers is often unexplored territory. Even though NIST offers the database source code to selected individuals to promote further development, in my experience it is better to develop a new system from scratch than to make extensive conceptual changes to an existing one. A year ago, when I heard that NIST was freezing software development, I realized we had to start creating a new database concept.

The NIST database software represents pioneering work that we can all draw inspiration from. A new database should allow simple and combined searches and should be broadly acceptable, but it should not be a superformat ranging from UV to NMR, although we sometimes see such attempts. The world is so dynamic that it is not practical to endlessly discuss a format for all possible cases, but the new concept should last for at least a decade. In contrast to the NIST software, the new database should be flexible enough to allow modification without affecting compatibility. Modern database platforms allow such functionality. One of the most difficult questions is which database system should be chosen. We could discuss the pros and cons of this issue for hours, but I will try to convince you that we were correct in choosing the Microsoft SQL Server Desktop Engine.

This engine is a restricted version of a commonly used database platform, which makes it easier to find database specialists. The availability of information resources and the exchange of experience are important factors, because database design is a science in itself. Microsoft SQL Server is a relational database, which allows scripts to be written if there is a need to move or exchange data between different database platforms. Procedures can be stored directly in this database, which is not possible with smaller database products such as Microsoft Access. We have taken advantage of this feature by adding a NIST spectra search algorithm for single-stage spectra to the database. One of the driving forces behind our choice of platform was price. The Microsoft SQL Server Desktop Engine is the only database among the three biggest database players that is free under certain conditions. Neither Oracle nor IBM, with its DB2, is so generous. Using the SQL system, developers are only required to buy database development tools from Microsoft that cost a couple of hundred dollars, but the database engine may be distributed for free, even if the programming tools for the user interface do not come from Microsoft. By comparison, a comparable version from Oracle costs 400 dollars. Of course such a restricted database has some limitations, but for the extent of our mass spectral applications these limitations will not be an obstacle. There are some disadvantages of this system that I have to mention. Like most Microsoft software, this database platform is currently limited to the Windows environment. The overhead problem should also be mentioned, but this affects all complex databases. In our ever more insecure world, databases must be installed and configured with a number of security and access levels that make the installation quite difficult, even if we do not need these features. After the proprietary issues have been solved, it may be possible to address this problem by providing a free database installation kit. Or maybe Microsoft will surprise us all and the SQL Desktop Engine will come preinstalled with future Windows versions.

I want to stress that it is not in our interest to develop a database that will be restricted to Mass Frontier. We would like to keep this database open to anyone who is interested. What are we offering? We will publish the database structure with all the tables on our web site. We will provide a free empty library with stored procedures for efficient data retrieval. With this empty database we will also be effectively giving away hundreds of hours of development for free. We will also set up a web page dedicated to this initiative and provide a limited degree of support, as far as we can afford it. Our only condition is that individuals or companies using this database should acknowledge our copyright and database logo. We will not, however, provide the user interface. This must be developed individually, which I think is natural. A special search algorithm will not be included either.

The suggested database format will be vendor, instrument and software independent. Even though we work closely with Thermo Finnigan, the general database concept will not be directed towards their instruments. However, certain instrument specific data may be important in some searches or in calculations of parameter matching values. In these cases, the developers will be able to add these data fields which, of course, will only be accessible using their software, but will not cause incompatibility. We also intend to add some specific links and data fields such as fragmentation patterns or Finnigan LTQ FT high resolution parameters.

We may have different views on a future database concept, but I would be happy if we could at least all agree on the formats for three things: spectra, chromatograms and structures. The other information that needs to be stored is of either text or numerical type and will not affect compatibility.

One of the main reasons why a new database concept is needed is tandem spectra. The logical data structure that best reflects spectral dependencies is a tree, and so we have implemented a hierarchically consistent spectral tree structure in our database. This structure is very flexible and allows parallel spectra to be stored at each node. Parallel spectra can be spectra acquired at various collision energies and isolation widths or with wide band activation, or they can be zoom spectra or source CID spectra. In each node, average and composite spectra are calculated automatically. The user can specify which spectrum type should be searched. The practical implementation of a tree is quite simple: each spectrum contains a vector of precursor ions. If the vector contains zero, the spectrum is a first-stage full scan. If two spectra contain identical vectors, they are parallel and should be placed in the same node.
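
The precursor-vector rule can be illustrated with a minimal Python sketch. The class and function names (Spectrum, Node, build_tree) are hypothetical and do not come from the database itself, and precursor m/z values are matched exactly here, whereas a real implementation would presumably need a tolerance.

```python
from dataclasses import dataclass, field


@dataclass
class Spectrum:
    name: str
    precursors: tuple  # precursor m/z path; (0,) marks a first-stage full scan


@dataclass
class Node:
    path: tuple                                   # precursor path leading to this node
    spectra: list = field(default_factory=list)   # parallel spectra stored at this node
    children: dict = field(default_factory=dict)  # precursor m/z -> child Node


def normalize(precursors):
    """Map a vector containing only zero to the empty path (full scan)."""
    return tuple(p for p in precursors if p != 0)


def build_tree(spectra):
    """Place each spectrum at the node addressed by its precursor vector."""
    root = Node(path=())
    for s in spectra:
        node = root
        path = normalize(s.precursors)
        for depth, mz in enumerate(path, start=1):
            # Create the intermediate node on first use (assumed exact m/z match).
            node = node.children.setdefault(mz, Node(path=path[:depth]))
        node.spectra.append(s)
    return root


tree = build_tree([
    Spectrum("full scan", (0,)),
    Spectrum("MS2, collision energy 25", (455.3,)),
    Spectrum("MS2, collision energy 35", (455.3,)),
    Spectrum("MS3", (455.3, 165.1)),
])
```

In this example the two MS2 spectra carry identical vectors, so they end up as parallel spectra in the same node, and the MS3 spectrum becomes a child of that node.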

For a correct library search in LC/MSn spectra that contain uncharacteristic peaks, it is important to have annotation capabilities to assign flags for cluster, adduct, dimer and doubly charged ions. We need not delete those peaks from the spectra; the search algorithm will simply ignore them or process them with different logic. In addition, we can choose a precursor ion that is doubly charged without wondering why the product spectrum contains peaks with higher m/z values than the precursor ion.
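
A minimal Python sketch of such peak flags follows. The flag names, the (m/z, intensity, flags) tuple layout and the helper functions are assumptions made for illustration, not the database's actual fields, and the product-ion bound ignores the small mass of the charge carrier.

```python
from enum import Flag, auto


class PeakFlag(Flag):
    NONE = 0
    CLUSTER = auto()
    ADDUCT = auto()
    DIMER = auto()
    DOUBLY_CHARGED = auto()


# Flags that a search is assumed to skip by default (hypothetical choice).
DEFAULT_IGNORE = PeakFlag.CLUSTER | PeakFlag.ADDUCT | PeakFlag.DIMER


def searchable_peaks(peaks, ignore=DEFAULT_IGNORE):
    """Return (m/z, intensity) pairs whose flags do not intersect `ignore`.

    The flagged peaks stay in the stored spectrum; they are only skipped here.
    """
    return [(mz, inten) for mz, inten, flags in peaks if not (flags & ignore)]


def max_product_mz(precursor_mz, charge):
    """Rough upper bound on the m/z of a singly charged product ion."""
    return precursor_mz * charge


peaks = [
    (152.1, 40.0, PeakFlag.NONE),
    (303.2, 100.0, PeakFlag.NONE),
    (325.2, 15.0, PeakFlag.ADDUCT),
    (605.4, 8.0, PeakFlag.DIMER),
]
kept = searchable_peaks(peaks)
bound = max_product_mz(400.0, 2)
```

With the default settings only the two unflagged peaks are passed to the search, and a doubly charged precursor at m/z 400 may legitimately give product peaks up to about m/z 800.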

I think there is not much room for discussion over the chemical structure format. The most common formats are the MOL format for small molecules and the PDB format for proteins. I am not too sure about formats for other kinds of structures. A structure can be assigned to every record, spectrum or peak. It is not a problem to introduce this option in the database, but it will be more challenging to implement this feature in a graphical user interface.

In our database concept, we have selected the supplementary information that should accompany a reference spectrum. We have tried not to over-annotate the library, so as to keep the most important information easily accessible. The complete list of variables with text and numerical types is displayed on our poster and will be posted on our web site. As I mentioned earlier, the database allows either the pre-programming of additional data fields or their dynamic extension on the fly. The supplementary information can be connected to a record, node or individual spectrum.
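
One possible way to picture fields that are added on the fly is a name/value table keyed by the kind of object it belongs to. The Python sketch below uses SQLite from the standard library purely as a stand-in for the SQL Server Desktop Engine, and the table and column names are hypothetical, not the published schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE annotation (
        owner_type TEXT NOT NULL
            CHECK (owner_type IN ('record', 'node', 'spectrum')),
        owner_id   INTEGER NOT NULL,
        name       TEXT NOT NULL,
        value      TEXT,
        PRIMARY KEY (owner_type, owner_id, name)
    );
""")


def annotate(owner_type, owner_id, name, value):
    """Attach (or overwrite) one supplementary field on a record, node or spectrum."""
    conn.execute(
        "INSERT OR REPLACE INTO annotation VALUES (?, ?, ?, ?)",
        (owner_type, owner_id, name, str(value)),
    )


def annotations(owner_type, owner_id):
    """Return all supplementary fields of one object as a dict of strings."""
    rows = conn.execute(
        "SELECT name, value FROM annotation WHERE owner_type = ? AND owner_id = ?",
        (owner_type, owner_id),
    )
    return dict(rows.fetchall())


annotate("spectrum", 7, "collision_energy", 35)
annotate("record", 1, "compound_name", "example compound")
spectrum_fields = annotations("spectrum", 7)
```

Because a new field is just a new row, software that does not know a given field name can ignore it without losing compatibility with the rest of the data.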

We didn’t want to create a library concept that would be based on a theoretical perception which was a long way from reality. And so we began collecting ESI tandem spectra to fine-tune the overall concept. As we progressed with our sample library, we became more and more enthusiastic about this work and we took a leap into the data collection business. We specialize in ion trap tandem spectra.

To conclude, then, these are our goals. We have developed a database concept that we want to share with the MS community, and we want to provide the tools needed for its implementation. We are open to any ideas, suggestions or cooperation proposals. However, we would like to finalize the general concept by October, to meet the deadlines we all have. Thank you for listening.

 
