Personalized prediction of negative affect in individuals with serious mental illness followed using long-term multimodal mobile phenotyping

Participants and procedure
Study participants were 70 adults diagnosed with a primary affective disorder (e.g., bipolar disorder or major depressive disorder) or psychotic disorder (i.e., schizophrenia or schizoaffective disorder). Among participants enrolled between 2015 and 2019, passive smartphone signals and wrist-worn actigraphy were available for subsets of 68 and 31 participants, respectively (see Table 1 for demographic and clinical characteristics of the final sample). Participants were recruited via study advertisement through the divisions for Depression and Anxiety Disorders and Psychotic Disorders at McLean Hospital and via the Rally platform through Mass General Brigham (MGB). The duration of the study per participant was set at one year, with the option of continuing participation depending on the severity and fluctuation of a participant’s symptoms as assessed during monthly clinic visits (M = 465 days, SD = 426 days, range = 3–2044 days; total daily emotion surveys collected = 12,959). Participants installed the Android- and iOS-compatible smartphone application Beiwe [38, 39] on their own devices to provide semi-continuous, passive collection of screen usage, accelerometry, and GPS data. Using Beiwe, participants were also invited to complete ecological surveys (see Measures: Outcomes). Participants were additionally given the option to wear a GENEActiv (ActivInsights, Inc.) actigraphy device on their wrist continuously and to return or swap out the watch at their monthly clinic visit. This wrist-worn device measures movement, from which metrics of a participant’s sleep and physical activity were derived.
Ethics approval and consent to participate
The study protocol was approved by the Mass General Brigham (MGB) Institutional Review Board (2015P002189) and informed consent was obtained from all subjects. All procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration.
Measures
Outcomes
At enrollment, each participant’s demographic information (age, gender, and race) and diagnoses were recorded. The daily 31-item survey consisted of questions on emotion, psychotic symptoms, social behavior, physical activity, sleep, and alcohol and coffee consumption. The primary outcomes for our study were the daily negative affect items, which included self-report questions on feeling anxious, irritable, upset, and lonely (ordinal, 1 = not at all, 2 = a little, 3 = moderately, 4 = extremely; e.g., “How anxious have you been feeling today?”).
Predictors
Primary predictors include variables extracted from passive smartphone data. Using smartphone accelerometry and screen usage data, we performed weekly estimations of sleep epochs to derive features reflecting (1) phone usage during the sleep epoch, (2) sleep onset, (3) sleep offset, (4) sleep duration, (5) phone usage during the wake period, (6) the difference in phone usage between wake and sleep periods, and (7) missing phone data. All of these variables are defined in Supplementary Table S1. GPS coordinates collected by the smartphones were analyzed with our open-source Deep Phenotyping of Location (DPLocate) pipeline [40]. This pipeline, which was designed and validated on multiple data sets, applies temporal and spatial filters to clean the GPS data, detects frequently visited places, called points of interest (POIs), using density-based clustering, and assigns those POIs to the corresponding minutes of the day [40]. The resulting daily location maps are then used to estimate features for (1) distance from home location, (2) radius of mobility, (3) percentage of time spent at home location, (4) number of locations visited, and (5) GPS missing time. These variables, together with the daily measures of (1) phone accelerometer activity and (2) phone use that serve as additional daily features in the analysis, are likewise defined in Supplementary Table S1.
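To make these location features concrete, the following is a minimal sketch in R of two of them, distance from home and percentage of time at home, computed from raw coordinates. It is an illustration only, not the DPLocate pipeline itself; the `gps` and `home` inputs and the 50 m home radius are assumptions.

```r
# Illustrative sketch only (not the DPLocate pipeline): two daily mobility
# features, assuming `gps` holds one day's coordinates with columns lat/lon
# in degrees and `home` is the inferred home POI.
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(sqrt(pmin(a, 1)))  # Earth radius ~6371 km
}

daily_mobility <- function(gps, home) {
  d_home <- haversine_km(gps$lat, gps$lon, home$lat, home$lon)
  c(dist_from_home = max(d_home),          # feature (1)
    pct_time_home  = mean(d_home < 0.05))  # feature (3); 50 m radius assumed
}
```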
Using data from the subset of the cohort (n = 31) with available wrist-worn actigraphy, we derived additional daily features related to sleep and activity using a previously published open-source pipeline (DPSleep [41]). Primary predictors derived from wrist-worn actigraphy include (1) accepted days of watch data after quality control, (2) sleep onset, (3) sleep offset, (4) sleep duration, (5) activity during the sleep epoch, (6) activity during the wake period, (7) the difference in activity between wake and sleep periods, and (8) sleep fragmentation (Supplementary Table S2). Due to a high level of missingness, only 8 participants had a sufficient number of observations of both the wrist-worn actigraphy measures and the negative emotion outcomes. Thus, the results presented below focus on smartphone features as predictors; for the wrist-worn actigraphy results, see the supplementary document.
Analytic approach
Missing data imputation
Missingness in the smartphone predictor variables described above was imputed using multiple imputation (the mice package in R) [42]; outcome variables were not imputed, to avoid overfitting of the prediction models. The imputation was performed jointly for all passive variables, excluding the outcome variables (i.e., negative emotional states). We produced 5 imputations with 5 iterations each, the default settings in mice [42]. Because the goal of our analysis is prediction, we used the mean of the imputed passive variables across the five imputations when training the predictive algorithms [43].
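As an illustration, a minimal sketch of this step using the mice defaults, assuming a data frame `passive` that holds only the numeric smartphone predictors:

```r
library(mice)

# `passive` is assumed to contain only the (numeric) smartphone predictor
# variables; outcome variables are excluded before imputation.
imp <- mice(passive, m = 5, maxit = 5, seed = 1)  # 5 imputations, 5 iterations

# Average the five completed data sets so each predictor has a single
# imputed value per observation for training the predictive algorithms.
completed <- lapply(1:5, function(k) complete(imp, k))
passive_imputed <- Reduce(`+`, completed) / 5
```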
Definition of high negative affect (HNA) states
This study is focused on the prediction of states of heightened negative emotions (specifically, anxious, irritable, upset, and lonely). For brevity, we refer to these below as high negative affect (HNA) states. Consistent with Ren et al. [15], HNA states were defined as elevations above the person-specific average emotion score for a given individual. Specifically, for each emotion, if a participant’s observed emotion score at a particular timepoint exceeded their mean score on that emotion by at least 1/2 point, we defined this as an HNA state for that emotion (see Table 2). This cutoff value is identical to that used in a previous study [15] and was selected by balancing the tradeoff between the stringency of the cutoff used to define an HNA state and the resulting proportion of such states. With a higher cutoff value, we can be more confident that HNA states are in fact instances of “high” negative affect; however, an overly high cutoff may yield proportions of HNA states that are too low to successfully train a classification algorithm. Conversely, although low cutoff values provide more HNA states for statistical modeling, they also increase the chance that some of the identified states are questionable (i.e., too mild to be considered true states of “high” negative affect). For each of the negative affect items, participants were excluded from the analyses if they had fewer than 10 total observations, or if their number of HNA events was below the larger of 4 events and 10% of their total observations. For example, a participant with 100 total observations and 4 HNA events would be removed because the proportion of HNA events is less than 10%. Using this threshold, the number of participants in the analyses predicting anxious, irritable, upset, and lonely states was 35, 36, 40, and 42, respectively.
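A minimal sketch of the HNA definition and the exclusion rule for a single emotion item, assuming a long-format data frame `df` with columns `id` (participant) and `score` (daily 1–4 rating):

```r
library(dplyr)

hna_data <- df %>%
  group_by(id) %>%
  mutate(hna = as.integer(score >= mean(score) + 0.5)) %>%   # person mean + 1/2
  filter(n() >= 10,                          # at least 10 total observations
         sum(hna) >= max(4, 0.10 * n())) %>% # at least max(4, 10% of n) HNA events
  ungroup()
```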
Personalized predictions of HNA states
Passive smartphone features and wrist-worn actigraphy data were used to predict HNA states from the same day. We used two approaches to model the relation between these passive data features and HNA states. First, we used a generalized linear mixed effects regression (GLMER) with logit link and subject-specific random intercept to model the heterogeneity between participants. The GLMER model combines fixed effects and the best linear unbiased predictions of the random effects [44] to predict the person-specific outcomes (HNA states). The GLMER is a simpler and commonly used approach that we implemented for comparison against our machine learning ensemble approach described below.
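A minimal sketch of this benchmark using lme4, with hypothetical feature names standing in for the smartphone predictors:

```r
library(lme4)

# `dat` is assumed to hold the binary HNA indicator plus daily smartphone
# features; the feature names below are hypothetical placeholders.
fit <- glmer(hna ~ sleep_duration + pct_time_home + phone_use + (1 | id),
             data = dat, family = binomial(link = "logit"))

# Person-specific predicted probabilities combine the fixed effects with
# the BLUPs of the subject-specific random intercepts.
pred <- predict(fit, type = "response")
```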
Next, we used a recently developed ensemble machine learning approach [15, 45] that builds a unique predictive model for each individual while borrowing information from other individuals’ models in an effort to improve predictive performance. Specifically, this approach develops an ensemble of idiosyncratic prediction models \(f_i^l(x)\), \(i = 1, \ldots, K\), \(l = 1, \ldots, L\), where K is the number of individuals and L is the number of different learning algorithms (e.g., logistic regression). \(f_i^l(x)\) is trained on data from participant i with algorithm l. The “personalized ensemble model” (PEM) \(f_i(x; w^i)\) for participant i is a linear combination of all idiosyncratic models (IMs):
$$f_i(x; w^i) = \sum_{i'=1}^{K} \sum_{l=1}^{L} w_{i',l}^{i} \, f_{i'}^{l}(x),$$
where the combination weights \(w^i = (w_{i',l}^{i};\, i' = 1, \ldots, K,\, l = 1, \ldots, L)\), subject to the constraints \(\sum_{i',l} w_{i',l}^{i} = 1\) and \(w_{i',l}^{i} \ge 0\) for all \(i, i' \in \{1, \ldots, K\}\) and \(l \in \{1, \ldots, L\}\), are selected to minimize a cross-validated loss function:
$$\hat{w}^{i} = \mathop{\arg\min}_{w} \sum_{j=1}^{n_i} \mathcal{L}\left(y_{i,j};\, \sum_{i' \ne i,\, l} w_{i',l}^{i} f_{i'}^{l}(x_{i,j}) + \sum_{l} w_{i,l}^{i} f_{i,-j}^{l}(x_{i,j})\right),$$
where \(n_i\) is the number of observations for participant i, \(f_{i,-j}^{l}(x)\) is the IM trained with algorithm l on all data from participant i except the j-th observation (or the fold containing it), and \(\mathcal{L}\) is a loss function (log-loss for binary outcomes, since we are predicting whether or not an HNA state is present). In summary, this approach develops a personalized model for each individual via a data-driven search for the optimal weighting of IMs for that individual (i.e., “borrowing” information from the prediction models for other individuals in an effort to improve predictive performance). See Ren et al. [15] for additional details.
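As an illustration (the authors’ full implementation is in the R code linked below), a minimal sketch of the simplex-constrained weight estimation for one participant, assuming a matrix `P` of cross-validated IM predictions:

```r
# `P` is assumed to be an n_i x (K*L) matrix of IM predicted probabilities
# for participant i (the participant's own columns computed leave-one-out,
# i.e., f^l_{i,-j}); `y` is the binary HNA outcome vector.
log_loss <- function(y, p) {
  p <- pmin(pmax(p, 1e-8), 1 - 1e-8)  # clamp away from 0/1 before taking logs
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# A softmax parametrization enforces the simplex constraints (non-negative
# weights summing to one) while allowing unconstrained optimization.
fit_weights <- function(P, y) {
  obj <- function(theta) {
    w <- exp(theta) / sum(exp(theta))
    log_loss(y, as.vector(P %*% w))
  }
  theta_hat <- optim(rep(0, ncol(P)), obj, method = "BFGS")$par
  exp(theta_hat) / sum(exp(theta_hat))
}
```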
For the PEM statistical approach, we conducted 10-fold cross-validation (CV) to estimate the combination weights \(\hat{w}^{i}\) and examined three different learning algorithms (see supplement for a time-series CV, which yielded a similar pattern of findings): GLM with elastic net penalty (ENet), support vector machine (SVM), and random forest (RF). We used these algorithms individually \((L=1)\) in three separate ensemble models (PEM-ENet, PEM-SVM, and PEM-RF) and in combination \((L=3)\), for a total of 4 ensemble models. We refer to the PEM with \(L=3\) as the personalized double ensemble model (PDEM). Note that the cross-validation procedure partitions time points within a subject into different folds. See supplement for additional details. R code for all analyses is available online (link). We assume that the pattern of missingness in our data is Missing at Random (MAR; see Supplement for details).
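A minimal sketch of the three learners used as IMs; the specific package choices (glmnet, e1071, randomForest) are assumptions, as the authors’ own implementation is in the linked R code:

```r
library(glmnet); library(e1071); library(randomForest)

# Fit one IM per learning algorithm for a single participant.
# x: daily feature matrix; y: binary HNA outcome vector.
fit_ims <- function(x, y) {
  list(
    enet = cv.glmnet(x, y, family = "binomial", alpha = 0.5),
    svm  = svm(x, factor(y), probability = TRUE),
    rf   = randomForest(x, factor(y))
  )
}
```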
Feature engineering
The original PEM approach in Ren et al. [15] used principal component analysis (PCA) to first reduce the raw smartphone features to several PCs, which yielded better prediction performance than models using the raw features as predictors. In this study, we used a modified version of the PEM approach that combines the generalizability of PCA-based models with the specificity of raw-feature-based models, in an effort to improve personalized prediction performance. This was achieved by including two IMs for each learning algorithm and participant: one using the raw features as predictors and the other using the PCs of the raw features. Operationally, with L learning algorithms, 2L IMs were generated, where \(f_i^l\) and \(f_i^{L+l}\) are the raw-feature-based and PCA-based models, respectively, for participant i using learning algorithm l.
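A minimal sketch of constructing the two predictor sets for one participant, assuming a daily feature matrix `x_raw`; the 90% variance rule for choosing the number of PCs is an assumption, as the paper does not specify the retention criterion:

```r
# PCA on a participant's standardized daily smartphone features.
pca  <- prcomp(x_raw, center = TRUE, scale. = TRUE)
n_pc <- which(cumsum(pca$sdev^2) / sum(pca$sdev^2) >= 0.90)[1]  # assumed rule
x_pc <- pca$x[, 1:n_pc, drop = FALSE]
# Each algorithm l then yields two IMs: f_i^l fit on x_raw, f_i^(L+l) on x_pc.
```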
Clustering based on feature importance
We visualized the feature importance of each smartphone variable under the best-performing PEMs by linearly combining the feature importance of the IMs with the ensemble weights \(w\). We normalized these measures such that the sum of their absolute values is one per subject, and the sign of each importance value reflects the direction of the conditional relationship between the feature and the outcome implied by the PEMs. We then performed K-medoids clustering on the feature importance matrix to identify subgroups of patients with similar feature importance signatures. Euclidean distance was used to compute the pairwise dissimilarity between participants. The number of clusters was selected to maximize the average silhouette width, with an upper limit of five imposed to enhance the robustness and reliability of the clustering results, given the small between-subject sample size in our dataset. The pam function in the R package cluster was used for this analysis.
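A minimal sketch of this clustering step, assuming `imp_mat` is the participants × features matrix of signed, normalized feature importances:

```r
library(cluster)

# Pairwise dissimilarity between participants' importance signatures.
d <- dist(imp_mat, method = "euclidean")

# Choose k (2 to 5) to maximize the average silhouette width, then cluster.
sil    <- sapply(2:5, function(k) pam(d, k)$silinfo$avg.width)
k_best <- (2:5)[which.max(sil)]
groups <- pam(d, k_best)$clustering
```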