TY - JOUR
T1 - Variable selection for generalized canonical correlation analysis
AU - Tenenhaus, Arthur
AU - Philippe, Cathy
AU - Guillemot, Vincent
AU - Le Cao, Kim Anh
AU - Grill, Jacques
AU - Frouin, Vincent
PY - 2014/1/1
Y1 - 2014/1/1
N2 - Regularized generalized canonical correlation analysis (RGCCA) is a generalization of regularized canonical correlation analysis to 3 or more sets of variables. RGCCA is a component-based approach which aims to study the relationships between several sets of variables. The quality and interpretability of the RGCCA components are likely to be affected by the usefulness and relevance of the variables in each block. Therefore, it is an important issue to identify within each block which subsets of significant variables are active in the relationships between blocks. In this paper, RGCCA is extended to address the issue of variable selection. Specifically, sparse generalized canonical correlation analysis (SGCCA) is proposed to combine RGCCA with an $\ell_1$-penalty in a unified framework. Within this framework, blocks are not necessarily fully connected, which makes SGCCA a flexible method for analyzing a wide variety of practical problems. Finally, the versatility and usefulness of SGCCA are illustrated on a simulated dataset and on a 3-block dataset which combines gene expression, comparative genomic hybridization, and a qualitative phenotype measured on a set of 53 children with glioma. SGCCA is available on CRAN as part of the RGCCA package.
AB - Regularized generalized canonical correlation analysis (RGCCA) is a generalization of regularized canonical correlation analysis to 3 or more sets of variables. RGCCA is a component-based approach which aims to study the relationships between several sets of variables. The quality and interpretability of the RGCCA components are likely to be affected by the usefulness and relevance of the variables in each block. Therefore, it is an important issue to identify within each block which subsets of significant variables are active in the relationships between blocks. In this paper, RGCCA is extended to address the issue of variable selection. Specifically, sparse generalized canonical correlation analysis (SGCCA) is proposed to combine RGCCA with an $\ell_1$-penalty in a unified framework. Within this framework, blocks are not necessarily fully connected, which makes SGCCA a flexible method for analyzing a wide variety of practical problems. Finally, the versatility and usefulness of SGCCA are illustrated on a simulated dataset and on a 3-block dataset which combines gene expression, comparative genomic hybridization, and a qualitative phenotype measured on a set of 53 children with glioma. SGCCA is available on CRAN as part of the RGCCA package.
KW - Generalized canonical correlation analysis
KW - Multiblock data analysis
KW - Variable selection
UR - http://www.scopus.com/inward/record.url?scp=84902252378&partnerID=8YFLogxK
U2 - 10.1093/biostatistics/kxu001
DO - 10.1093/biostatistics/kxu001
M3 - Article
C2 - 24550197
AN - SCOPUS:84902252378
SN - 1465-4644
VL - 15
SP - 569
EP - 583
JO - Biostatistics
JF - Biostatistics
IS - 3
ER -