TY - JOUR
T1 - Natural Language Processing for Patient Selection in Phase I or II Oncology Clinical Trials
AU - Delorme, Julie
AU - Charvet, Valentin
AU - Wartelle, Muriel
AU - Lion, François
AU - Thuillier, Bruno
AU - Mercier, Sandrine
AU - Soria, Jean Charles
AU - Azoulay, Mikael
AU - Besse, Benjamin
AU - Massard, Christophe
AU - Hollebecque, Antoine
AU - Verlingue, Loic
N1 - Publisher Copyright: © 2021 by American Society of Clinical Oncology.
PY - 2021/1/1
Y1 - 2021/1/1
AB - PURPOSE Early discontinuation affects more than one third of patients enrolled in early-phase oncology clinical trials. It is deleterious for both the patient and the study, because it inflates the study's duration and associated costs. We aimed to predict successful screening and dose-limiting toxicity period completion (SSD) from automatic analysis of consultation reports. MATERIALS AND METHODS We retrieved the consultation reports of patients included in phase I and/or phase II oncology trials for any tumor type at Gustave Roussy, France. We designed a preprocessing pipeline that transformed free text into numerical vectors and grouped them into semantic clusters. These document-based semantic vectors were then fed into a machine learning model trained to output a binary prediction of SSD status. RESULTS Between September 2012 and July 2020, 56,924 consultation reports were used to build the dictionary, and 1,858 phase I or II inclusion reports were used to train (72%), validate (14%), and test (14%) a random forest model. The preprocessing step efficiently clustered words with semantic proximity. On the unseen test cohort of 264 consultation reports, the model achieved an F1 score of 0.80, a recall of 0.81, and an area under the curve of 0.88. Using this model, we could have reduced the screen fail rate (including the dose-limiting toxicity period) from 39.8% to 12.8% (relative risk, 0.322; 95% CI, 0.209 to 0.498; P < .0001) within the test cohort. The most important semantic clusters for prediction comprised words related to hematologic malignancies, anatomopathologic features, and laboratory and imaging interpretation. CONCLUSION Machine learning with semantic conservation is a promising tool to assist physicians in selecting patients likely to achieve SSD in early-phase oncology clinical trials.
UR - http://www.scopus.com/inward/record.url?scp=85110344091&partnerID=8YFLogxK
U2 - 10.1200/CCI.21.00003
DO - 10.1200/CCI.21.00003
M3 - Article
C2 - 34197179
AN - SCOPUS:85110344091
SN - 2473-4276
VL - 5
SP - 709
EP - 718
JO - JCO Clinical Cancer Informatics
JF - JCO Clinical Cancer Informatics
ER -