TY - JOUR
T1 - Statistical analysis of high-dimensional biomedical data
T2 - a gentle introduction to analytical goals, common approaches and challenges
AU - Topic group “High-dimensional data” (TG9) of the STRATOS initiative
AU - Rahnenführer, Jörg
AU - De Bin, Riccardo
AU - Benner, Axel
AU - Ambrogi, Federico
AU - Lusa, Lara
AU - Boulesteix, Anne-Laure
AU - Migliavacca, Eugenia
AU - Binder, Harald
AU - Michiels, Stefan
AU - Sauerbrei, Willi
AU - McShane, Lisa
N1 - Publisher Copyright: © 2023, This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply.
PY - 2023/12/1
Y1 - 2023/12/1
N2 - Background: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. Methods: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 “High-dimensional data” of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. Results: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. Conclusions: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
KW - Analytical goals
KW - Clustering
KW - Exploratory data analysis
KW - High-dimensional data
KW - Initial data analysis
KW - Multiple testing
KW - Omics data
KW - Prediction
KW - STRATOS initiative
UR - http://www.scopus.com/inward/record.url?scp=85159394357&partnerID=8YFLogxK
U2 - 10.1186/s12916-023-02858-y
DO - 10.1186/s12916-023-02858-y
M3 - Review article
C2 - 37189125
AN - SCOPUS:85159394357
SN - 1741-7015
VL - 21
JO - BMC Medicine
JF - BMC Medicine
IS - 1
M1 - 182
ER -