TY - JOUR
T1 - A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers
AU - GEMO Study Collaborators
AU - GENEPSO Study Collaborators
AU - Jiao, Yue
AU - Lesueur, Fabienne
AU - Azencott, Chloé Agathe
AU - Laurent, Maïté
AU - Mebirouk, Noura
AU - Laborde, Lilian
AU - Beauvallet, Juana
AU - Dondon, Marie Gabrielle
AU - Eon-Marchais, Séverine
AU - Laugé, Anthony
AU - Boutry-Kryza, Nadia
AU - Calender, Alain
AU - Giraud, Sophie
AU - Léone, Mélanie
AU - Bressac-de-Paillerets, Brigitte
AU - Caron, Olivier
AU - Guillaud-Bataille, Marine
AU - Bignon, Yves Jean
AU - Uhrhammer, Nancy
AU - Bonadona, Valérie
AU - Lasset, Christine
AU - Berthet, Pascaline
AU - Castera, Laurent
AU - Vaur, Dominique
AU - Bourdon, Violaine
AU - Noguchi, Tetsuro
AU - Popovici, Cornel
AU - Remenieras, Audrey
AU - Sobol, Hagay
AU - Coupier, Isabelle
AU - Harmand, Pierre Olivier
AU - Pujol, Pascal
AU - Vilquin, Paul
AU - Dumont, Aurélie
AU - Révillion, Françoise
AU - Muller, Danièle
AU - Barouk-Simonet, Emmanuelle
AU - Bonnet, Françoise
AU - Bubien, Virginie
AU - Longy, Michel
AU - Sévenet, Nicolas
AU - Gladieff, Laurence
AU - Guimbaud, Rosine
AU - Feillel, Viviane
AU - Toulas, Christine
AU - Dreyfus, Hélène
AU - Leroux, Dominique
AU - Peysselon, Magalie
AU - Rebischung, Christine
AU - Caron, Olivier
N1 - Publisher Copyright: © 2021, The Author(s).
PY - 2021/12/1
Y1 - 2021/12/1
N2 - Background: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies, GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on the identification of genetic factors modifying cancer risk in BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors. Methods: To identify as many as possible of the individuals participating in both studies but not registered under a shared identifier, we combined probabilistic record linkage (PRL) and supervised machine learning (ML). This approach (named “PRL + ML”) combined the candidate matches identified by the two methods. We built the ML model using, as the training dataset, a gold standard established on a first version of the two databases. This gold standard was obtained from PRL-derived matches verified by an exhaustive manual review. Results: The Random Forest (RF) algorithm showed the highest recall (0.985) among six widely used ML algorithms: RF, Bagged trees, AdaBoost, Support Vector Machine, Neural Network. RF was therefore selected to build the ML model, since our goal was to identify the maximum number of true matches. Our combined PRL + ML linkage showed a higher recall (range 0.988–0.992) than either PRL (range 0.916–0.991) or ML (0.981) alone. It identified 1995 individuals participating in both GEMO (6375 participants) and GENEPSO (4925 participants). Conclusions: Our hybrid linkage process is an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries.
KW - Hybrid process
KW - Probabilistic linkage
KW - Record linkage
KW - Supervised machine learning
UR - http://www.scopus.com/inward/record.url?scp=85112306435&partnerID=8YFLogxK
U2 - 10.1186/s12874-021-01299-6
DO - 10.1186/s12874-021-01299-6
M3 - Article
C2 - 34325649
AN - SCOPUS:85112306435
SN - 1471-2288
VL - 21
JO - BMC Medical Research Methodology
JF - BMC Medical Research Methodology
IS - 1
M1 - 155
ER -