Developing personalized algorithms for sensing mental health symptoms in daily life
This pilot study developed and tested generalized and personalized ML algorithms for detecting individual and family mental health symptoms using widely available mobile devices in 35 families over a 60-day period. Performance of the generalized models showed sensitivity ranging from poor to moderate, specificity from poor to excellent, and F1 scores from poor to good (HO1). Personalized models demonstrated poor to moderate sensitivity, moderate to excellent specificity, and poor to moderate F1 scores, depending on the specific symptom states examined. When aggregating across all target states, personalized models outperformed generalized models on both sensitivity and F1 score. Additional comparisons by specific symptom states revealed higher performance for personalized models in sensitivity for sadness, anxiety, anger, and stress; in specificity for happiness, quality time, closeness, and positive interactions; and in F1 for happiness, anxiety, stress, quality time, and positive interactions (HO2). Model performance also varied based on individual user characteristics. Symptom endorsement (HO3a) and survey adherence (HO3b) were significantly related to performance, whereas symptom variability (HO3c) was not. Finally, we found that the most informative feature sets varied by target state. However, overall, models using activity and sleep features—or the combination of activity, sleep, and environment features—outperformed models using speech or interaction features alone (HO4). Developing algorithms to detect mental health symptoms and family interactional processes using passive sensing data from common commercial mobile devices represents a critical preliminary step toward building usable, scalable JITAI systems. Our findings provide crucial insights into the conditions that enhance model performance and highlight key directions for optimizing future real-world deployment.
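The sensitivity, specificity, and F1 metrics reported throughout can be computed directly from a binary detector's confusion matrix. A minimal sketch for reference (the function name and toy labels are illustrative, not taken from our pipeline):

```python
import numpy as np

def detection_metrics(y_true, y_pred):
    """Sensitivity, specificity, and F1 for a binary symptom detector."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # symptom present, detected
    tn = np.sum((y_true == 0) & (y_pred == 0))   # symptom absent, no alarm
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false alarm
    fn = np.sum((y_true == 1) & (y_pred == 0))   # missed detection
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return sensitivity, specificity, f1
```

For JITAI use cases, sensitivity governs how many intervention-worthy moments are caught, while specificity governs how many unnecessary prompts users receive; both matter for sustained engagement.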
The results from our generalized models indicated poor to moderate sensitivity, poor to excellent specificity, and poor to good F1 scores across target states. The methods employed in this study were intentionally designed to approximate real-world use cases by leveraging widely available commercial devices, rather than research-grade equipment, to enhance generalizability and future scalability. However, this real-world approach also introduced considerable heterogeneity into the data, leading to substantial variability in model performance across users. These findings underscore the inherent challenge of using passive sensing data collected from commercial devices to detect in-the-moment emotional states in uncontrolled, everyday environments—the exact conditions under which JITAIs must ultimately operate to achieve meaningful public health impact. Despite generally moderate performance, model specificity exceeded chance levels across conditions, and strong performance was achieved for certain target states and subgroups. These results demonstrate that good model performance is attainable and offer encouraging early evidence supporting the feasibility of real-world passive sensing for mental health. Importantly, the models developed here serve as critical benchmarks for future algorithm development, both by our own research group and by others who may access our dataset via our public data repository. Finally, our findings highlight key future directions for improving model performance, including the need for systems explicitly designed for real-world variability and systematic investigations of the conditions under which models perform successfully. Building robust, scalable models for JITAI systems will require embracing the complexity of real-world data rather than attempting to avoid it, and carefully tailoring algorithm development to optimize performance under these conditions.
In addition to building generalized models across the sample, we developed personalized models for each individual participant and conducted statistical tests to compare model performance. Overall, personalized models demonstrated significantly higher sensitivity and F1 scores compared to generalized models. Further distinctions emerged when examining specific emotional states: personalized models showed enhanced sensitivity for sadness, anxiety, anger, and stress; higher specificity for happiness, quality time, closeness, and positive interactions; and higher F1 scores for happiness, anxiety, stress, quality time, and positive interactions. These findings suggest that individualized model development may be a particularly effective strategy for building high-performing algorithms intended for JITAI systems. Given the substantial heterogeneity in device types, data streams, and symptom profiles across users, it is logical that personalized models designed around an individual’s unique characteristics would outperform generalized models trained across the broader sample. However, translating personalized models into real-world JITAI deployments would require overcoming several practical challenges. Successful implementation would necessitate sufficient data collection from each user to build stable baseline models, as well as the creation of sophisticated software systems capable of adjusting and updating model parameters in real time as new data are collected. Future JITAI systems would likely need an initial calibration period to gather baseline data and personalize model parameters before delivering tailored interventions. While this represents an exciting and promising direction, building systems capable of real-time personalization, deployment, and adaptation will require advances in research and technology before they can be widely deployed in everyday settings.
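To illustrate why per-user heterogeneity favors personalization, the sketch below trains one pooled classifier and one per-user classifier on synthetic data in which each user has a different feature-to-symptom mapping. The classifier choice (logistic regression) and all data are hypothetical stand-ins, not our actual models or dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 5 users, each with 100 labeled sensing-feature windows.
# Each user gets a different feature-to-symptom weighting, mimicking the
# heterogeneity in devices, data streams, and symptom profiles noted above.
users = []
for _ in range(5):
    X = rng.normal(size=(100, 8))
    w = rng.normal(size=8)                       # user-specific weights
    y = (X @ w + rng.normal(scale=0.5, size=100) > 0).astype(int)
    users.append((X, y))

# Generalized model: pooled training on the first half of every user's data.
pooled = LogisticRegression().fit(
    np.vstack([X[:50] for X, _ in users]),
    np.concatenate([y[:50] for _, y in users]),
)

f1_generalized, f1_personalized = [], []
for X, y in users:
    personal = LogisticRegression().fit(X[:50], y[:50])  # personalized model
    f1_generalized.append(f1_score(y[50:], pooled.predict(X[50:])))
    f1_personalized.append(f1_score(y[50:], personal.predict(X[50:])))

print("mean generalized F1: ", round(float(np.mean(f1_generalized)), 2))
print("mean personalized F1:", round(float(np.mean(f1_personalized)), 2))
```

The "first half / second half" split here also mirrors the calibration period discussed above: a deployed system would need an analogous window of baseline data per user before personalized predictions become reliable.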
We further examined whether individual factors and symptom profiles were associated with model performance. We found partial support for Hypothesis 3, with significant effects for symptom endorsement (HO3a) and survey adherence (HO3b), but not symptom variability (HO3c). Higher symptom endorsement was associated with increased sensitivity but decreased specificity, while greater survey adherence was linked to improved sensitivity and F1 scores. On one hand, these results align with expectations: a greater proportion of positive samples was associated with greater sensitivity and lower specificity, and a higher number of training samples was related to stronger F1 scores. However, these basic statistical properties must be interpreted within their clinical context. Different users’ symptom profiles and engagement behaviors inherently shape the data available for model development. For example, an individual experiencing daily sadness but exhibiting low survey response rates would present very different data challenges compared to an individual with fluctuating anxiety symptoms and high survey compliance. Taken together, these findings suggest that the characteristics of users—specifically their symptom profiles and engagement patterns—play an important role in model success. Passive sensing models may need to be tailored not only to individual users, but also to the clinical characteristics of the symptoms being targeted. Future systems might benefit from first mapping individuals’ symptom profiles during a baseline period and then dynamically matching users to algorithms most likely to perform well based on their unique patterns. Such approaches could increase the precision and scalability of real-world JITAI systems.
Our final hypothesis (HO4) explored whether the best-performing feature sets differed depending on the mental health symptoms being detected. Overall, the three highest-performing feature sets were activity, sleep, and environment. However, feature set performance varied across target states.
Activity features—which included both physical activity and physiological activation—were especially important for detecting states characterized by bodily or emotional arousal, such as stress, anger, and aggression. Speech emerged among the top feature sets for anger, conflict, and aggression, suggesting that linguistic features, such as the frequency of offensive words, may be informative markers for detecting interpersonal difficulties. Sleep features, which are closely tied to emotional regulation and mental health outcomes, were especially predictive for anxiety, reflecting the known bidirectional relationship between sleep disruption and anxious symptoms42,43. Interestingly, speech was the least informative feature set for sadness, whereas environment features were the most predictive. This may reflect the more inward, less externally expressed nature of sadness, suggesting that other environmental factors—such as time spent at home or specific ambient conditions—may provide better contextual cues for sadness than speech alone.
Our mutual information gain analyses further identified the individual features contributing most strongly to model performance across all target states. The top features included lightly active minutes, number of awakenings, total minutes asleep, neutral sentiment, sleep efficiency, sedentary minutes, total minutes in bed, polarity, fairly active minutes, and proximity to home. When aggregating across emotion states, models combining (1) activity and sleep features and (2) activity, sleep, and environment features outperformed models based solely on speech or interaction data. Although speech features did not emerge as top contributors in this study, it is important to note that speech was only sampled during 30-min windows twice per day after participant prompts, limiting its temporal overlap with symptom reporting. Future studies with more continuous speech sampling may better harness its potential as a rich source of emotional information. Overall, these findings highlight that careful feature selection—tailored to the specific symptom or emotional state of interest—is critical for building effective passive sensing models. Foundational work to map which features best predict which mental health symptoms will be crucial for the next generation of personalized, scalable mental health interventions.
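A mutual information ranking of this kind can be sketched with scikit-learn's mutual_info_classif. The feature names below echo our top features, but the data are synthetic, with the first two features deliberately constructed to carry signal:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
features = ["lightly_active_min", "awakenings", "minutes_asleep",
            "neutral_sentiment", "sleep_efficiency", "sedentary_min"]

# Synthetic stand-in: 500 daily windows; only the first two features are
# constructed to be informative about the binary symptom label.
X = rng.normal(size=(500, len(features)))
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.7, size=500) > 0).astype(int)

# Estimate mutual information between each feature and the label,
# then rank features from most to least informative.
mi = mutual_info_classif(X, y, random_state=0)
ranking = sorted(zip(features, mi), key=lambda kv: -kv[1])
for name, score in ranking:
    print(f"{name:20s} MI = {score:.3f}")
```

Unlike correlation, mutual information also captures nonlinear feature-label dependence, which is one reason it is a common choice for screening heterogeneous sensing features.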
This pilot study offers several important strengths as a foundational step toward developing personalized mental health sensing algorithms to inform future JITAIs. Using a pervasive computing system, we collected ~14 million data points across 52 passive and active data streams over a 60-day period in a highly diverse sample. This intensive, naturalistic data collection enabled the detection of a wide range of moods and interpersonal states, supporting a multifaceted and ecologically valid assessment of mental health processes. The resulting dataset is uniquely rich, consisting of continuous, multimodal data streams captured in participants’ daily lives over an extended period. Our methodology approximated real-world implementation by leveraging widely available commercial devices and accommodating heterogeneity in operating systems and sensors, which increases the future scalability and generalizability of the algorithms developed. The study also focused on both individual and family mental health symptoms, providing a logical and impactful use case for eventual deployment of personalized algorithms in JITAI frameworks. Importantly, we moved beyond simply benchmarking model performance by systematically examining factors that influence success. We compared generalized and personalized approaches, explored how user symptom profiles and behavioral engagement impacted model outcomes, and evaluated which data streams contributed most effectively to symptom detection. This multifactorial approach advances the field of AI sensing by moving toward real-world applications and by identifying specific factors that may optimize model performance.
Despite these strengths, several limitations warrant consideration. First, although the data collection was intensive and multimodal, the number of participating families was modest. Given the relatively small sample size, this study should be interpreted as an initial proof-of-concept rather than a definitive evaluation of model performance. To mitigate limitations associated with small samples, we employed non-parametric statistical tests, which are robust to distributional assumptions, and applied Bonferroni corrections to adjust for multiple comparisons. Nevertheless, future studies are needed to validate and extend these findings using larger samples, longer longitudinal data collection windows, and real-time implementation. These efforts will be critical for translating the foundational ML algorithms developed here into scalable, deployable JITAI systems for mental health intervention. Second, although personalized models consistently outperformed generalized models, overall model performance remained moderate. This finding is consistent with other research showing that emotion detection in ambulatory settings is significantly more challenging than in controlled laboratory environments44,45. Additional development work, including model optimization and algorithm refinement, will be necessary to achieve the level of reliability needed for clinical applications.
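The comparison strategy described above (non-parametric paired tests with a Bonferroni adjustment) can be sketched as follows, using Wilcoxon signed-rank tests on hypothetical per-user metric scores; the sample size and effect sizes are invented for illustration:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
n_users = 30
metrics = ["sensitivity", "specificity", "F1"]
alpha = 0.05
bonferroni_alpha = alpha / len(metrics)   # correct for 3 paired comparisons

# Hypothetical per-user scores: personalized models shifted upward on average.
generalized = rng.uniform(0.4, 0.7, size=(n_users, len(metrics)))
personalized = generalized + rng.normal(0.05, 0.05, size=(n_users, len(metrics)))

p_values = []
for m, name in enumerate(metrics):
    # Paired, rank-based test: no normality assumption on the score differences.
    stat, p = wilcoxon(personalized[:, m], generalized[:, m])
    p_values.append(p)
    verdict = "significant" if p < bonferroni_alpha else "n.s."
    print(f"{name:12s} W={stat:6.1f}  p={p:.4f}  ({verdict})")
```

Because each user contributes one paired score per metric, the signed-rank test respects the within-user pairing while remaining robust to the skewed, bounded distributions typical of performance metrics.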
Third, technical challenges likely impacted data completeness and model performance. Most notably, a server overload occurred during data collection, caused by the high volume and velocity of incoming data, which exceeded processing capacity and resulted in permanent loss of some scheduled survey notifications. Although the system was upgraded mid-study to expand capacity and prevent future overloads, we were unable to distinguish between missing surveys due to technical errors and participant non-response. These challenges reflect the real-world complexities of creating scalable and reliable pervasive computing systems, particularly when managing high-frequency, high-volume data streams. They underscore the need for future research to implement more robust system monitoring, proactive technical support, and stress testing to minimize missingness and improve model reliability in real-world conditions. Fourth, the eligibility criteria excluded families experiencing acute safety concerns (e.g., suicidal ideation, child maltreatment). Though this recruitment strategy was appropriate for the study’s initial scope, it may have introduced selection bias and limited generalizability to higher-risk populations. Future research should broaden eligibility criteria to include a wider range of clinical presentations and consider extending the age range of participants to better capture developmental variations in family mental health dynamics and symptom expression.
Fifth, while personalized models showed strong promise, they may be vulnerable to overfitting due to the limited number of labeled samples per individual. Future directions should explore sub-population modeling strategies46,47, where participants are grouped by common behavioral or clinical characteristics, allowing for greater model stability and scalability. Alternatively, researchers might use hybrid models that begin with sub-population predictions and gradually personalize over time as individual-level data accumulates. Finally, although our sample was demographically diverse, we did not formally examine the relationship between sociodemographic characteristics and model performance. Given growing evidence that ML models may exhibit bias across racial, ethnic, gender, and socioeconomic lines48, future studies should explicitly evaluate fairness across groups and assess whether personalization strategies can mitigate potential disparities. Together, these findings underscore the significant promise of personalized mental health sensing while also highlighting critical challenges to be addressed through future research. Continued work building on this pilot study will be essential for creating scalable, equitable, and effective AI-driven systems to promote mental health in everyday life.
This pilot study provides critical foundational evidence for the development of personalized machine learning models to detect mental health symptoms and interpersonal processes using pervasive mobile sensing in real-world settings. Overall, models achieved moderate performance, with personalized models consistently outperforming generalized models across key metrics. Importantly, model success varied systematically based on individual user factors and symptom profiles, underscoring the need for algorithms that adapt to user-specific characteristics. The study also highlighted the complexity of real-world system development, as reflected in challenges such as missing data, device variability, and server capacity limitations. Crucially, these challenges are not obstacles to be avoided but inherent realities that must be integrated into system design for scalable, reliable deployment. Future efforts must build on these early findings by conducting research under real-world conditions and designing adaptive systems that are robust to the heterogeneity and technical constraints of daily life. While preliminary, this work lays essential groundwork for advancing personalized JITAIs and for realizing the potential of AI-driven mental health support in everyday contexts.
