Chemometrics approaches for the automatic analysis of metabolomics GC-MS data

Research output: Book/Report › Ph.D. thesis › Research

  • Giacomo Baccolo
Metabolomics, which aims at the comprehensive characterization of the metabolites in a sample, is an appealing and widely used scientific approach applied in several research fields, such as biomarker identification, drug discovery, food science and environmental science. Often, the analyzed samples are characterized by complex matrices, such as biological tissues, foods or soil samples. Metabolomics relies heavily on analytical techniques, and one of the most widely applied is gas chromatography coupled to mass spectrometry (GC-MS). Modern analytical platforms can generate hundreds of thousands of spectra, detecting a very large number of distinct molecules. Despite, or because of, the technical progress achieved on the experimental side, the translation of the signals measured by the instruments into readily usable information is still a major bottleneck in metabolomics. For each identified compound, the desired output is the relative concentration across the analyzed samples together with the associated mass spectrum. The signals from GC-MS instruments are complex, and the software available for the analysis of the experimental data has been repeatedly indicated as a major source of uncertainty, strongly limiting both the quantity and the quality of the extracted information. Most approaches, or at least the most widely applied ones, are based on univariate analysis of the data, considering each sample separately from the others and requiring laborious manual effort to set several parameters that affect the result of the analysis. This thesis presents a new approach, called AutoDise, for the analysis of GC-MS data. The most critical step, which consists of extracting the pure contributions from the experimental data, is performed by means of PARAFAC2. The PARAFAC2 modelling approach decomposes the multilinear data, discriminating among the different signals across the samples.
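As a point of reference, the PARAFAC2 model mentioned above is usually written as follows in the GC-MS literature. This is a sketch of the standard formulation, not a statement of how AutoDise implements it; the symbols and the assignment of modes (elution time, m/z, samples) are assumptions, since the abstract does not spell them out.

```latex
\mathbf{X}_k \approx \mathbf{B}_k \mathbf{D}_k \mathbf{A}^{\mathsf{T}},
\qquad k = 1, \dots, K,
\qquad \text{subject to } \mathbf{B}_k^{\mathsf{T}} \mathbf{B}_k = \boldsymbol{\Phi} \text{ for all } k
```

Here $\mathbf{X}_k$ is the (elution time × m/z) data matrix of sample $k$, the columns of $\mathbf{A}$ are the mass spectra of the $R$ components, $\mathbf{D}_k$ is an $R \times R$ diagonal matrix holding the relative amounts of each component in sample $k$, and the columns of $\mathbf{B}_k$ are the sample-specific elution profiles. Because only the cross-product $\mathbf{B}_k^{\mathsf{T}} \mathbf{B}_k$ is required to be constant, the elution profiles may shift or change shape from sample to sample, which is what lets the model deliver one mass spectrum and one set of relative concentrations per compound.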
The efficacy of PARAFAC2 in extracting meaningful chemical signals from GC-MS data has been widely demonstrated. Because of its intrinsic properties, PARAFAC2 needs almost no data preprocessing and no critical settings, whereas other approaches require several parameters to be set and laborious pretreatment of the data, which demands an extremely skilled user and dramatically reduces the reproducibility of the results. However, fitting PARAFAC2 models involves several phases, and a skilled tensor analyst is required for the analysis and interpretation of the models. AutoDise is an expert system based on statistical diagnostics and artificial intelligence, which takes care of all the modeling aspects and generates a peak table in which each compound is unambiguously identified, giving fully reproducible results. The performance of the approach has been tested on a complex dataset of virgin olive oils of different quality grades, whose volatile profiles were obtained by solid-phase microextraction GC-MS. The data were analyzed both with commercial software, by experienced users, and automatically with the proposed AutoDise method, and the resulting peak tables were compared. The results showed that AutoDise outperformed the manual analysis both in the number of identified compounds and in the quality of identification and quantification. Moreover, a dedicated GUI has been developed to make the algorithm more accessible to users without programming skills. A tutorial is included in the thesis, showing the main features and how to use the graphical interface. Another important part of this thesis is devoted to the development and testing of new artificial neural networks, to be implemented in the AutoDise software, for detecting which PARAFAC2 components provide chemically useful information.
To this end, more than 170,000 profiles have been manually labeled in order to train, validate and test a convolutional neural network, a bilinear long short-term memory network and a k-nearest neighbour model. The results suggest that deep learning networks can be applied effectively to the automatic classification of chromatographic profiles.
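To make the classification task concrete, the following is a minimal sketch of the simplest of the three model types, a k-nearest neighbour classifier, applied to toy elution profiles. Everything in it is an assumption made for illustration: the synthetic Gaussian and drift profiles, the labels `peak` and `not_peak`, the profile length, the Euclidean distance and `k = 3` are hypothetical and are not taken from the thesis or from the AutoDise software.

```python
import math
from collections import Counter

N_POINTS = 50  # hypothetical number of scans per elution profile


def gaussian_profile(center, width):
    """Toy peak-shaped elution profile with unit maximum."""
    return [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(N_POINTS)]


def drift_profile(exponent):
    """Toy baseline-drift profile rising from 0 to 1 (not a peak)."""
    return [(i / (N_POINTS - 1)) ** exponent for i in range(N_POINTS)]


# Hand-made labelled examples standing in for manually labelled profiles.
TRAINING = [(gaussian_profile(c, 4.0), "peak") for c in (15, 20, 25, 30, 35)]
TRAINING += [
    (drift_profile(1.0), "not_peak"),                  # rising baseline
    (list(reversed(drift_profile(1.0))), "not_peak"),  # falling baseline
    ([0.5] * N_POINTS, "not_peak"),                    # flat offset
]


def classify(profile, training=TRAINING, k=3):
    """Label a profile by majority vote among its k nearest training profiles."""
    nearest = sorted(training, key=lambda item: math.dist(profile, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Classify two unseen toy profiles: a shifted peak and a curved baseline.
print(classify(gaussian_profile(22, 4.0)))
print(classify(drift_profile(1.2)))
```

A distance computed on raw profiles is sensitive to retention-time shifts, which is one plausible reason to also consider convolutional and recurrent networks for the same task.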
Original language: English
Publisher: Department of Food Science, Faculty of Science, University of Copenhagen
Number of pages: 161
Publication status: Published - 2022

ID: 337582180