Study design and participants
This cross-sectional study aimed to examine the association between dry eye disease (DED) and depressive symptoms in adolescents and to develop predictive models for classifying depression severity using machine learning (ML) techniques. Participants were middle school students aged 12–18 years, recruited using cluster sampling from 20 randomly selected middle schools in Beijing, China. This sampling approach ensured demographic representativeness across age, gender, and geographic region, enhancing the generalizability of the findings.
Prior to data collection, informed consent was obtained from all participants and their legal guardians. Ethical approval was granted by the Ethics Committee of Beijing Tongren Hospital, Capital Medical University, Beijing, China (Ethics Approval Number: TRECKY2021-204), and all procedures adhered to the principles of the Declaration of Helsinki. Data were collected anonymously through a standardized electronic questionnaire administered between March and June 2023.
Data preprocessing
A total of 2,197 questionnaires were distributed (see Table 1). After excluding responses with missing information in key variables or implausible values, 2,076 valid questionnaires remained for analysis (valid response rate: 94.5%); no imputation algorithms were applied, so only complete and valid cases were included. We chose not to impute because doing so would require unverifiable missing-at-random assumptions for several psychosocial and behavioral predictors, and imputation outside the resampling loop can inflate apparent performance via information leakage. To gauge whether complete-case exclusion altered the composition of the analytic sample relative to the original survey frame, we compared the distributions of sex and location between the initial dataset (n = 2,197) and the retained complete-case dataset (n = 2,076) using percentage-point differences. The complete-case sample showed minimal shifts: female 52.39% → 51.83% (Δ = −0.56 pp); location: loc = 1, 53.07% → 52.41% (Δ = −0.66 pp); loc = 2, 11.52% → 11.18% (Δ = −0.34 pp); loc = 3, 35.86% → 36.37% (Δ = +0.51 pp). Continuous variables were standardized (z-score normalization) within the training folds of nested cross-validation to avoid data leakage, and SMOTE was applied only within training folds to address class imbalance across depression-severity categories. Finally, data were split using stratified sampling (80% training, 20% testing), with all preprocessing performed inside the training pipeline of nested cross-validation.
Data collection
The dataset encompassed demographic, behavioral, clinical, and psychological variables:
Demographic variables
- Age and Gender.
- Parental Education Level: Assessed as a proxy for socioeconomic status (categorized into primary, secondary, or tertiary education levels).
Behavioral variables
- Electronic Device Usage: Self-reported average daily hours spent using electronic devices for academic and recreational activities.
- Physical Exercise: Weekly frequency of moderate-to-vigorous physical activity.
Clinical variables
Dry Eye Symptoms: Measured using the validated Ocular Surface Disease Index (OSDI) [32], a 12-item instrument that assesses DED severity across three subscales: ocular symptoms, vision-related function, and environmental triggers. A total score (OSDI_total) was calculated to quantify overall DED severity, with higher scores indicating more severe symptoms.
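For reference, the standard OSDI scoring rule multiplies the sum of answered item scores (each 0–4) by 25 and divides by the number of items answered, yielding a 0–100 scale. The helper below is a hypothetical illustration, not the study's implementation:

```python
# Standard OSDI scoring: total = (sum of item scores * 25) / items answered.
def osdi_total(item_scores):
    """item_scores: scores of the answered items, each in 0-4;
    unanswered items are simply omitted from the list."""
    answered = len(item_scores)
    if answered == 0:
        raise ValueError("at least one item must be answered")
    return sum(item_scores) * 25.0 / answered  # yields 0-100

# All 12 items answered at the maximum of 4 gives the most severe score.
print(osdi_total([4] * 12))  # -> 100.0
```

Dividing by the number of answered items keeps the score comparable when respondents mark some items as not applicable.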
Sleep quality
- Sleep Duration (Sleep_quality1): Self-reported average number of sleep hours per night.
- Sleep Disturbances (Sleep_quality2): Frequency of sleep disturbances, such as difficulty initiating or maintaining sleep, measured on a Likert scale.
Depression Levels: Measured using the Patient Health Questionnaire-9 (PHQ-9) [33], a validated tool for assessing depression severity. PHQ-9 scores (PHQ_total) were classified into four categories:
- No Depression: PHQ_total ≤ 4.
- Mild Depression: 5 ≤ PHQ_total ≤ 9.
- Moderate Depression: 10 ≤ PHQ_total ≤ 14.
- Severe Depression: PHQ_total ≥ 15.
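A minimal helper for this banding might look as follows. It assumes the standard PHQ-9 cutoffs on the 0–27 total score (the instrument's maximum is 27), collapsed into the four categories named above; the function name is invented for illustration:

```python
# Four-level banding of the PHQ-9 total score (valid range 0-27),
# using the standard cutoffs collapsed into four categories.
def phq9_category(phq_total):
    if not 0 <= phq_total <= 27:
        raise ValueError("PHQ-9 total must be within 0-27")
    if phq_total <= 4:
        return "none"
    if phq_total <= 9:
        return "mild"
    if phq_total <= 14:
        return "moderate"
    return "severe"

print(phq9_category(12))  # -> "moderate"
```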
Feature selection
To ensure robust predictor inclusion, we first assessed multicollinearity, removing one variable from each pair with a pairwise Spearman correlation of |ρ| > 0.80 and then excluding variables with a variance inflation factor (VIF) > 5. To address residual collinearity among the sleep-related variables, we applied principal component analysis (PCA) to derive a composite sleep score. For embedded feature selection, we used LASSO (L1-regularized logistic regression), which shrinks non-informative coefficients to zero. To quantify global and local feature effects and improve interpretability, we computed SHapley Additive exPlanations (SHAP) values for the trained models.
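The two collinearity screens can be sketched as follows. This is a simplified illustration with invented helper names; VIF is computed directly from its regression definition (1 / (1 − R²)) rather than via statsmodels:

```python
import numpy as np
import pandas as pd

def vif(df):
    """Variance inflation factor per column: 1 / (1 - R^2), where R^2
    comes from regressing the column on all remaining columns."""
    X = df.to_numpy(dtype=float)
    out = {}
    for j, col in enumerate(df.columns):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])   # add intercept
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        r2 = 1.0 - resid.var() / X[:, j].var()
        out[col] = 1.0 / max(1e-12, 1.0 - r2)
    return pd.Series(out)

def high_spearman_pairs(df, cutoff=0.80):
    """Pairs of columns whose absolute Spearman correlation exceeds
    the cutoff; one member of each pair is a candidate for removal."""
    corr = df.corr(method="spearman").abs()
    cols = list(df.columns)
    return [(a, b) for i, a in enumerate(cols)
            for b in cols[i + 1:] if corr.loc[a, b] > cutoff]
```

In practice the Spearman filter runs first, so the VIF step only sees variables that survived the pairwise screen.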
Exploratory model screening
To provide a comprehensive overview of algorithmic performance, we initially screened 15 machine learning models spanning diverse methodological families:
- Linear models: logistic regression (L1, L2, elastic net)
- Tree-based models: decision tree, random forest
- Boosted ensembles: gradient boosting, AdaBoost, LightGBM, XGBoost, CatBoost
- Probabilistic classifier: naïve Bayes
- Support vector machines: linear and RBF kernels
- Neural network: multilayer perceptron (MLP)
- Ensemble meta-learners: bagging, stacking
- Custom algorithm: extreme learning machine (ELM)
This broad comparison was intended as an exploratory phase to cover linear, non-linear, kernel-based, probabilistic, ensemble, and neural paradigms. Such algorithmic diversity has been widely recommended in biomedical informatics to ensure transparency and identify promising paradigms for further validation [25,26,27,28]. Full exploratory results for all 15 algorithms are provided in the Supplementary Materials.
Representative algorithms for nested cross-validation
For rigorous yet computationally feasible evaluation, we restricted the nested cross-validation analysis to five representative algorithms. These were chosen to capture the principal paradigms of machine learning while ensuring interpretability and efficiency: logistic regression with elastic net regularization (linear baseline), support vector classifier with RBF kernel (kernel-based learner), random forest (bagging-based ensemble), gradient boosting (boosting-based ensemble), and multilayer perceptron (MLP, neural network). This representative selection balanced methodological diversity and computational feasibility, given that all experiments were conducted in a Google Colab environment with limited computing resources. By spanning linear vs. non-linear, bagging vs. boosting, and kernel vs. neural approaches, these five models provided broad methodological coverage while allowing efficient implementation of nested cross-validation (5 outer folds × 5 inner folds). Such representative selection strategies are consistent with prior recommendations for methodological diversity in biomedical machine learning research [26,27,28].
Rationale for model selection
The 15 screened algorithms were chosen for complementary reasons:
- Linear models (e.g., logistic regression) for high interpretability and baseline benchmarking.
- Tree-based models (Decision Tree, Random Forest, Gradient Boosting) for capturing nonlinear relationships and feature interactions.
- Boosted ensemble models (LightGBM, XGBoost, CatBoost) for strong predictive power on structured data.
- Support Vector Machines for robustness in high-dimensional feature spaces.
- Multilayer Perceptron for complex nonlinear transformations.
- Ensemble meta-learning approaches (AdaBoost, Bagging, Stacking) to leverage the strengths of multiple base learners.
- Extreme Learning Machine (ELM) for rapid training and scalability to large datasets.
Hyperparameter tuning and overfitting prevention
All models were trained and evaluated within a stratified 5 × 5 nested cross-validation framework. Hyperparameters were optimized in the inner loop using RandomizedSearchCV (50 iterations, optimizing macro-F1), with preprocessing steps (scaling, SMOTE, feature selection) performed strictly inside the inner loop to prevent data leakage. Model-specific regularization and constraints were applied to mitigate overfitting. Class imbalance was handled via class weighting in training folds, while outer test folds remained untouched. Performance was summarized from outer folds (macro-AUC, macro-F1, balanced accuracy, Brier score, ECE). Full hyperparameter ranges, optimal configurations, and pre- versus post-tuning results are provided in Supplementary Table S1.
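The nested scheme above can be condensed to the following sketch, with an invented toy dataset and an illustrative hyperparameter grid (the study's actual grids are in Supplementary Table S1):

```python
# 5x5 nested cross-validation: RandomizedSearchCV tunes in the inner
# loop; cross_val_score scores the tuned model on untouched outer folds.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     cross_val_score)

X, y = make_classification(n_samples=400, n_classes=3, n_informative=6,
                           random_state=0)
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: random search over an illustrative grid, optimizing macro-F1;
# class weighting handles imbalance within the training folds.
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions={"n_estimators": [50, 100],
                         "max_depth": [3, 5, None]},
    n_iter=5, scoring="f1_macro", cv=inner, random_state=0,
)
# Outer loop: each outer test fold evaluates a model that was tuned only
# on the corresponding outer training fold, preventing selection bias.
outer_scores = cross_val_score(search, X, y, cv=outer, scoring="f1_macro")
```

Because the search object is itself an estimator, wrapping it in `cross_val_score` is all that is needed to keep tuning strictly inside the outer training folds.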
Machine learning models
- Linear Model: Logistic Regression.
- Tree-Based Models: Decision Tree, Random Forest, Gradient Boosting.
- Boosted Ensembles: LightGBM, XGBoost, CatBoost.
- Probabilistic Model: Naive Bayes.
- Support Vector Machine (SVM): linear and RBF kernels.
- Neural Network: Multilayer Perceptron (MLP).
- Ensemble Methods: AdaBoost, Bagging, and Stacking.
- Custom Algorithm: Extreme Learning Machine (ELM).
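One way to register the scikit-learn members of this model zoo for a screening loop is a simple name-to-estimator dictionary (a sketch: ELM, Stacking, and the external boosting libraries XGBoost, LightGBM, and CatBoost are omitted here, and all settings are defaults rather than the study's tuned values):

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Registry of candidate classifiers; a screening loop can iterate over
# this dict and fit/evaluate each model under the same protocol.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "rf": RandomForestClassifier(random_state=0),
    "gb": GradientBoostingClassifier(random_state=0),
    "nb": GaussianNB(),
    "svm_rbf": SVC(kernel="rbf", probability=True, random_state=0),
    "mlp": MLPClassifier(max_iter=500, random_state=0),
    "ada": AdaBoostClassifier(random_state=0),
    "bag": BaggingClassifier(random_state=0),
}
```

Keeping every candidate behind the same `fit`/`predict` interface lets one evaluation loop serve all families without per-model special cases.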
Model evaluation
- Accuracy: The proportion of correctly classified samples.
- Precision: The proportion of true positive predictions among all positive predictions.
- Recall (Sensitivity): The proportion of actual positive cases correctly identified.
- F1 Score: The harmonic mean of precision and recall, providing a balanced measure for imbalanced datasets.
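For the multi-class severity labels, these metrics are macro-averaged (computed per class, then averaged), which is how scikit-learn exposes them; the labels below are purely illustrative:

```python
# Macro-averaged classification metrics on toy multi-class labels.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
```

Macro averaging weights every class equally, so rare severe-depression cases count as much as the majority class; micro averaging would instead be dominated by the most frequent category.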
Receiver operating characteristic (ROC) and AUC analysis
Models capable of producing probability estimates were further evaluated using ROC curves and Area Under the Curve (AUC) metrics to assess their ability to discriminate across depression severity levels. Non-probabilistic models (e.g., ELM and Stacking) were excluded from this analysis.
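Macro one-vs-rest AUC for a multiclass problem can be computed directly from the predicted class probabilities; the probability matrix below is illustrative (each row sums to 1):

```python
# Macro one-vs-rest ROC-AUC across three classes from predicted
# probabilities (one column per class, one row per sample).
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 2, 1, 0, 2])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.3, 0.5, 0.2],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.3, 0.6]])
macro_auc = roc_auc_score(y_true, proba, multi_class="ovr", average="macro")
```

Each class is scored against the rest using its probability column, and the per-class AUCs are averaged; this is why models without probability outputs had to be excluded from the analysis.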
Analytical tools
All analyses were conducted using Python programming in a Google Colab environment, which provided scalable computational resources. Key libraries included:
- Data Handling and Preprocessing: Pandas, NumPy.
- Machine Learning Algorithms: Scikit-learn, XGBoost, LightGBM, CatBoost.
- Feature Interpretability: SHAP.
- Visualization: Matplotlib, Seaborn.
- Class Imbalance Correction: Imbalanced-learn (SMOTE).