TY - JOUR
T1 - A new pipeline for the normalization and pooling of metabolomics data
AU - Viallon, Vivian
AU - His, Mathilde
AU - Rinaldi, Sabina
AU - Breeur, Marie
AU - Gicquiau, Audrey
AU - Hemon, Bertrand
AU - Overvad, Kim
AU - Tjønneland, Anne
AU - Rostgaard-Hansen, Agnetha Linn
AU - Rothwell, Joseph A.
AU - Lecuyer, Lucie
AU - Severi, Gianluca
AU - Kaaks, Rudolf
AU - Johnson, Theron
AU - Schulze, Matthias B.
AU - Palli, Domenico
AU - Agnoli, Claudia
AU - Panico, Salvatore
AU - Tumino, Rosario
AU - Ricceri, Fulvio
AU - Verschuren, W. M. Monique
AU - Engelfriet, Peter
AU - Onland-Moret, Charlotte
AU - Vermeulen, Roel
AU - Nøst, Therese Haugdahl
AU - Urbarova, Ilona
AU - Zamora-Ros, Raul
AU - Rodriguez-Barranco, Miguel
AU - Amiano, Pilar
AU - Huerta, José Maria
AU - Ardanaz, Eva
AU - Melander, Olle
AU - Ottoson, Filip
AU - Vidman, Linda
AU - Rentoft, Matilda
AU - Schmidt, Julie A.
AU - Travis, Ruth C.
AU - Weiderpass, Elisabete
AU - Johansson, Mattias
AU - Dossus, Laure
AU - Jenab, Mazda
AU - Gunter, Marc J.
AU - Bermejo, Justo Lorenzo
AU - Scherer, Dominique
AU - Salek, Reza M.
AU - Keski-Rahkonen, Pekka
AU - Ferrari, Pietro
N1 - Publisher Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland.
PY - 2021/9/1
Y1 - 2021/9/1
AB - Pooling metabolomics data across studies is often desirable to increase the statistical power of the analysis. However, this can raise methodological challenges as several preanalytical and analytical factors could introduce differences in measured concentrations and variability between datasets. Specifically, different studies may use variable sample types (e.g., serum versus plasma) collected, treated, and stored according to different protocols, and assayed in different laboratories using different instruments. To address these issues, a new pipeline was developed to normalize and pool metabolomics data through a set of sequential steps: (i) exclusions of the least informative observations and metabolites and removal of outliers; imputation of missing data; (ii) identification of the main sources of variability through principal component partial R-square (PC-PR2) analysis; (iii) application of linear mixed models to remove unwanted variability, including samples’ originating study and batch, and preserve biological variations while accounting for potential differences in the residual variances across studies. This pipeline was applied to targeted metabolomics data acquired using Biocrates AbsoluteIDQ kits in eight case-control studies nested within the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort. Comprehensive examination of metabolomics measurements indicated that the pipeline improved the comparability of data across the studies. Our pipeline can be adapted to normalize other molecular data, including biomarkers as well as proteomics data, and could be used for pooling molecular datasets, for example in international consortia, to limit biases introduced by inter-study variability. This versatility of the pipeline makes our work of potential interest to molecular epidemiologists.
KW - Cancer epidemiology
KW - Metabolites
KW - Metabolomics
KW - Normalization
KW - Pooling
KW - Technical variability
UR - http://www.scopus.com/inward/record.url?scp=85115861814&partnerID=8YFLogxK
U2 - 10.3390/metabo11090631
DO - 10.3390/metabo11090631
M3 - Article
AN - SCOPUS:85115861814
SN - 2218-1989
VL - 11
JO - Metabolites
JF - Metabolites
IS - 9
M1 - 631
ER -