TY - JOUR
AU - F. R. Chen
AU - J. L. Huang
AU - D. L. Wilson
AU - W. J. Lo-Ciganic
AB - INTRODUCTION: Early detection and intervention are crucial for reducing the impacts of depression and associated healthcare costs. Few studies have used electronic health records (EHR) and machine learning (ML) with a longitudinal design to predict depression onset. We developed and validated ML algorithms using EHR to identify patients at high risk for the onset of diagnosis-based major depressive disorder (MDD) in primary care settings. METHODS: Using a prognostic modeling approach with a retrospective cohort study design, we identified patient visits in primary care settings for individuals aged ≥18 years from the Accelerating Data Value Across a National Community Health Center Network Clinical Research Network 2015-2021 data. We measured 267 features at six-month intervals starting six months prior to the first encounter. We developed algorithms using Least Absolute Shrinkage and Selection Operator (LASSO), random forest, and XGBoost with 10-fold cross-validation. Using hold-out testing data, we measured prediction performance (e.g., C-statistics), stratified patients into decile risk subgroups, and assessed model biases. RESULTS: Among 1,965,399 eligible individuals (mean age = 43.52 ± 16.04 years; male = 35%; African American = 20%) with 4,985,280 person-periods, the MDD onset rate was 1% during the study period. XGBoost performed similarly to other models and had the fewest predictors (C-statistic = 0.763, 95% CI = [0.760, 0.767]). XGBoost had a 66.78% sensitivity, 74.19% specificity, and 2.55% positive predictive value at the balanced threshold identified using the Youden Index. The top three risk decile subgroups captured ∼70% of MDD cases, without significant racial or sex biases.
CONCLUSIONS: An ML algorithm using EHR data can effectively identify individuals at high risk of depression onset within the subsequent six months, without exacerbating racial or sex biases, providing a valuable tool for targeted early interventions.
AD - Georgia State University Andrew Young School of Policy Studies, Atlanta, GA USA.; University of Florida, College of Pharmacy, Gainesville, FL USA.; University of Pittsburgh School of Medicine, Pittsburgh, PA USA.; North Florida/South Georgia Veterans Health System, Geriatric Research Education and Clinical Center, Gainesville, FL USA.
AN - 40776007
BT - Stud Health Technol Inform
C5 - HIT & Telehealth
DA - Aug 7
DO - 10.3233/shti250989
DP - NLM
JF - Stud Health Technol Inform
LA - eng
N2 - INTRODUCTION: Early detection and intervention are crucial for reducing the impacts of depression and associated healthcare costs. Few studies have used electronic health records (EHR) and machine learning (ML) with a longitudinal design to predict depression onset. We developed and validated ML algorithms using EHR to identify patients at high risk for the onset of diagnosis-based major depressive disorder (MDD) in primary care settings. METHODS: Using a prognostic modeling approach with a retrospective cohort study design, we identified patient visits in primary care settings for individuals aged ≥18 years from the Accelerating Data Value Across a National Community Health Center Network Clinical Research Network 2015-2021 data. We measured 267 features at six-month intervals starting six months prior to the first encounter. We developed algorithms using Least Absolute Shrinkage and Selection Operator (LASSO), random forest, and XGBoost with 10-fold cross-validation. Using hold-out testing data, we measured prediction performance (e.g., C-statistics), stratified patients into decile risk subgroups, and assessed model biases.
RESULTS: Among 1,965,399 eligible individuals (mean age = 43.52 ± 16.04 years; male = 35%; African American = 20%) with 4,985,280 person-periods, the MDD onset rate was 1% during the study period. XGBoost performed similarly to other models and had the fewest predictors (C-statistic = 0.763, 95% CI = [0.760, 0.767]). XGBoost had a 66.78% sensitivity, 74.19% specificity, and 2.55% positive predictive value at the balanced threshold identified using the Youden Index. The top three risk decile subgroups captured ∼70% of MDD cases, without significant racial or sex biases. CONCLUSIONS: An ML algorithm using EHR data can effectively identify individuals at high risk of depression onset within the subsequent six months, without exacerbating racial or sex biases, providing a valuable tool for targeted early interventions.
PY - 2025
SN - 0926-9630
SP - 997
EP - 1001+
ST - Development and Validation of Machine-Learning Algorithms to Predict the Onset of Depression Using Electronic Health Record Data: A Prognostic Modeling Study
T1 - Development and Validation of Machine-Learning Algorithms to Predict the Onset of Depression Using Electronic Health Record Data: A Prognostic Modeling Study
T2 - Stud Health Technol Inform
TI - Development and Validation of Machine-Learning Algorithms to Predict the Onset of Depression Using Electronic Health Record Data: A Prognostic Modeling Study
U1 - HIT & Telehealth
U3 - 10.3233/shti250989
VL - 329
VO - 0926-9630
Y1 - 2025
ER -