Feature Studies to Inform the Classification of Depressive Symptoms from Twitter Data for Population Health

Danielle Mowery, Biomedical Informatics, University of Utah, 421 Wakara Way, Salt Lake City, UT, USA, [email protected]
Craig Bryan, Psychology, University of Utah, 380 S 1530 E BEHS 502, Salt Lake City, UT, USA, [email protected]
Mike Conway, Biomedical Informatics, University of Utah, 421 Wakara Way, Salt Lake City, UT, USA, [email protected]

ABSTRACT

The utility of Twitter data as a medium to support population-level mental health monitoring is not well understood. In an effort to better understand the predictive power of supervised machine learning classifiers and the influence of feature sets for efficiently classifying depression-related tweets on a large scale, we conducted two feature study experiments. In the first experiment, we assessed the contribution of feature groups such as lexical information (e.g., unigrams) and emotions (e.g., strongly negative) using a feature ablation study. In the second experiment, we determined the percentile of top ranked features that produced the optimal classification performance by applying a three-step feature elimination approach. In the first experiment, we observed that lexical features are critical for identifying depressive symptoms, specifically for depressed mood (-35 points) and for disturbed sleep (-43 points). In the second experiment, we observed that the optimal F1-score performance of top ranked features in percentiles varied across classes, e.g., from fatigue or loss of energy (5th percentile, 288 features) to depressed mood (55th percentile, 3,168 features), suggesting there is no consistent count of features for predicting depression-related tweets. We conclude that simple lexical features and reduced feature sets can produce results comparable to larger feature sets.

CCS Concepts

• Computing methodologies → Feature selection; Supervised learning by classification; Support vector machines; Natural language processing

Keywords

depression; natural language processing; social media

1. INTRODUCTION

In recent years, there has been a movement to leverage social media data to detect, estimate, and track changes in the prevalence of disease, for example, eating disorders in Spanish-language Twitter tweets [17] and influenza surveillance [5]. More recently, social media has been leveraged to monitor social risks such as prescription drug and smoking behaviors [15, 11, 4] as well as a variety of mental health disorders including suicidal ideation [10], attention deficit hyperactivity disorder [8], and major depressive disorder [9]. In the case of major depressive disorder, recent efforts range from characterizing linguistic phenomena associated with depression [7] and its subtypes, e.g., postpartum depression [10], to identifying specific depressive symptoms [3, 12], e.g., depressed mood. However, more research is needed to better understand the predictive power of supervised machine learning classifiers and the influence of feature groups and feature sets for efficiently classifying depression-related tweets to support mental health monitoring at the population level [6].

This paper builds upon related works toward classifying Twitter tweets representing symptoms of major depressive disorder by assessing the contribution of lexical features (e.g., unigrams) and emotion (e.g., strongly negative) to classification performance, and by applying methods to eliminate low-value features.

2. METHODS

Specifically, we conducted a feature ablation study to assess the informativeness of each feature group and a feature elimination study to determine the optimal feature sets for classifying Twitter tweets. We leveraged an existing, annotated Twitter dataset that was constructed based on a hierarchical model of depression-related symptoms [13, 14]. The dataset contains 9,473 annotations for 9,300 tweets. Each tweet is annotated as no evidence of depression (e.g., "Citizens fear an economic depression") or evidence of depression (e.g., "depressed over disappointment"). If a tweet is annotated evidence of depression, then it is further annotated with one or more depressive symptoms, for example, depressed mood (e.g., "feeling down in the dumps"), disturbed sleep (e.g., "another restless night"), or fatigue or loss of energy (e.g., "the fatigue is unbearable") [12]. For each class, every annotation (9,473 tweets) is binarized as the positive class, e.g., depressed mood=1, or the negative class, e.g., not depressed mood=0.

2.1 Features

Furthermore, this dataset was encoded with 7 feature groups with associated feature values binarized (i.e., present=1 or absent=0) to represent potentially informative features for classifying depression-related classes.
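The binarized encoding described above, one binary vector per tweet plus one-vs-rest labels per class in the hierarchy, can be sketched as follows. This is a minimal illustration; the feature names, tweets, and labels below are invented examples, not the study's dataset or code.

```python
# Illustrative sketch of binarized feature and label encoding (toy data).

def encode_tweet(tweet_features, feature_space):
    # One binary value per feature in a fixed feature space: present=1, absent=0.
    return [1 if f in tweet_features else 0 for f in feature_space]

def binarize_labels(annotations, target_class):
    # One-vs-rest labels for a single class in the hierarchy:
    # positive class (e.g., depressed mood)=1, negative class=0.
    return [1 if target_class in a else 0 for a in annotations]

feature_space = ["lexical:depressed", "syntactic:V", "emotion:SAD"]
tweets = [{"lexical:depressed", "emotion:SAD"},   # e.g., "depressed ... :("
          {"syntactic:V"}]                        # e.g., "cried"
X = [encode_tweet(t, feature_space) for t in tweets]
y = binarize_labels([{"evidence", "depressed_mood"}, {"no_evidence"}],
                    "depressed_mood")
# X == [[1, 0, 1], [0, 1, 0]]; y == [1, 0]
```

In this one-vs-rest setup, the same feature matrix is reused for every class; only the label vector changes per class.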
We describe the feature groups by type and subtype, and provide one or more examples of words representing the feature subtype from a tweet:

• lexical features, unigrams, e.g., "depressed";

• syntactic features, parts of speech, e.g., "cried" encoded as V for verb;

• emotion features, emoticons, e.g., :( encoded as SAD;

• demographic features, age and gender, e.g., "this semester" encoded as an indicator of 19-22 years of age and "my girlfriend" encoded as an indicator of male gender, respectively;

• sentiment features, polarity and subjectivity terms with strengths, e.g., "terrible" encoded as strongly negative and strongly subjective;

• personality traits, e.g., "pissed off" implies neuroticism;

• LIWC features (Linguistic Inquiry and Word Count [16]), indicators of an individual's thoughts, feelings, personality, and motivations, e.g., "feeling" suggests perception, feeling, insight, and cognitive mechanisms experienced by the Twitter user.

A more detailed description of the leveraged features and their values, including LIWC categories, can be found in [12]. Based on our prior initial experiments using these feature groups [12], we learned that support vector machines perform with the highest F1-score compared to other supervised approaches. For this study, we aim to build upon this work by conducting two experiments: 1) to assess the contribution of each feature group and 2) to determine the optimal percentile of top ranked features for classifying Twitter tweets in the depression schema hierarchy.

[Figure 1: six panels, one per class, showing the change in average F1-score when each feature group (lexical, syntactic, emotion, demographics, sentiment, personality, LIWC) is ablated; panel baselines: no evidence of clinical depression, F1: 82; evidence of clinical depression, F1: 52; depressive symptoms, F1: 49; depressed mood, F1: 35; disturbed sleep, F1: 43; fatigue or loss of energy, F1: 70.]

Figure 1: Feature ablation study: for each class, we plotted the change of average F1-scores from the baseline reported in the titles by ablating each feature set. Black = point gains in F1; Purple = point losses in F1.

2.2 Feature Contribution

Feature ablation studies are conducted to assess the informativeness of a feature group by quantifying the change in predictive power when comparing the performance of a classifier trained with all feature groups versus the performance without a particular feature group. We conducted a feature ablation study by holding out (sans) each feature group and training and testing the support vector model using a linear kernel and 5-fold, stratified cross-validation. We report the average F1-score from our baseline approach (all feature groups) and the point difference (+ or -) in F1-score performance observed by ablating each feature set.

2.3 Feature Elimination

Feature elimination strategies are often taken 1) to remove irrelevant or noisy features, 2) to improve classifier performance, and 3) to reduce training and run times. We conducted an experiment to determine whether we could maintain or improve classifier performance by applying the following three-tiered feature elimination approach:

• Reduction: we reduced the dataset encoded for each class by eliminating features that occur less than twice in the full dataset.

• Selection: we iteratively applied Chi-Square feature selection on the reduced dataset, selecting the top percentile of highest ranked features in increments of 5 percent to train and test the support vector model using a linear kernel and 5-fold, stratified cross-validation.

• Rank: we cumulatively plotted the average F1-score performances of each incrementally added percentile of top ranked features. We report the percentile and count of features resulting in the first occurrence of the highest average F1-score for each class.

All experiments were programmed using scikit-learn 0.18 (http://scikit-learn.org/stable/).

3. RESULTS

From our annotated dataset of Twitter tweets (n=9,300 tweets), we conducted two feature studies to better understand the predictive power of several feature groups for classifying whether or not a tweet contains no evidence of depression (n=6,829 tweets) or evidence of depression (n=2,644 tweets). If there was evidence of depression, we determined whether the tweet contained one or more depressive symptoms (n=1,656 tweets) and further classified the symptom subtype of depressed mood (n=1,010 tweets), disturbed sleep (n=98 tweets), or fatigue or loss of energy (n=427 tweets) using support vector machines. From our prior work [12] and in Figure 1, we report the performance for prediction models built by training a support vector machine using 5-fold, stratified cross-validation with all feature groups as a baseline for each class. We observed high performance for no evidence of depression and fatigue or loss of energy and moderate performance for all remaining classes.

3.1 Feature Contribution

By ablating each feature group from the full dataset, we observed the following counts of features - sans lexical: 185, sans syntactic: 16,935, sans emotion: 16,954, sans demographics: 16,946, sans sentiment: 16,950, sans personality: 16,946, and sans LIWC: 16,832. In Figure 1, compared to the baseline performance, significant drops in F1-scores resulted from sans lexical for depressed mood (-35 points), disturbed sleep (-43 points), and depressive symptoms (-45 points). Less extensive drops also occurred for evidence of depression (-14 points) and fatigue or loss of energy (-3 points). In contrast, a 3 point gain in F1-score was observed for no evidence of depression. We also observed notable drops in F1-scores for disturbed sleep by ablating demographics (-7 points), emotion (-5 points), and sentiment (-5 points) features. These F1-score drops were accompanied by drops in both recall and precision. We found equal or higher F1-scores by removing non-lexical feature groups for no evidence of depression (0-1 points), evidence of depression (0-1 points), and depressive symptoms (2 points).

3.2 Feature Elimination

The initial matrices of almost 17,000 features were reduced by eliminating features that occurred only once in the full dataset, resulting in 5,761 features. We applied Chi-Square feature selection, plotted the top-ranked subset of features for each percentile (at 5 percent intervals, cumulatively added), and evaluated their predictive contribution using the support vector machine with a linear kernel and stratified, 5-fold cross-validation.

In Figure 2, we observed optimal F1-score performance using the following top feature counts: no evidence of depression: F1: 87 (15th percentile, 864 features); evidence of depression: F1: 59 (30th percentile, 1,728 features); depressive symptoms: F1: 55 (15th percentile, 864 features); depressed mood: F1: 39 (55th percentile, 3,168 features); disturbed sleep: F1: 46 (10th percentile, 576 features); and fatigue or loss of energy: F1: 72 (5th percentile, 288 features). We note F1-score improvements for depressed mood from F1: 13 at the 1st percentile to F1: 33 at the 20th percentile.

[Figure 2: average F1-score (y-axis, 0-100) plotted against feature percentile (x-axis, 1st-95th) for each of the six classes.]

Figure 2: Feature elimination study: for each class, we plotted the change of average F1-scores for top features of percentiles by adding top-ranked features at 5% increments to the prediction model.

4. DISCUSSION

We conducted two feature study experiments: 1) a feature ablation study to assess the contribution of feature groups and 2) a feature elimination study to determine the optimal percentile of top ranked features for classifying Twitter tweets in the depression schema hierarchy.

4.1 Feature Contribution

Unsurprisingly, lexical features (unigrams) were the largest contributor to feature counts in the dataset. We observed that lexical features are also critical for identifying depressive symptoms, specifically for depressed mood and for disturbed sleep. For the classes higher in the hierarchy - no evidence of depression, evidence of depression, and depressive symptoms - the classifier produced consistent F1-scores, even slightly above the baseline for depressive symptoms, and minor fluctuations in recall and precision when removing other feature groups, suggesting that the contribution of non-lexical features to classification performance was limited. However, notable changes in F1-score were observed for the classes lower in the hierarchy, including disturbed sleep and fatigue or loss of energy. For instance, changes in F1-scores driven by both recall and precision were observed for disturbed sleep by ablating demographics, emotion, and sentiment features, suggesting that age or gender ("mid-semester exams have me restless"), polarity and subjective terms ("lack of sleep is killing me"), and emoticons ("wide awake :(") could be important for both identifying and correctly classifying a subset of these tweets.

4.2 Feature Elimination

We observed peak F1-score performances at low percentiles for fatigue or loss of energy (5th percentile) and disturbed sleep (10th percentile) as well as depressive symptoms and no evidence of depression (both 15th percentile), suggesting fewer features are needed to reach optimal performance. In contrast, peak F1-score performances occurred at moderate percentiles for evidence of depression (30th percentile) and depressed mood (55th percentile), suggesting that more features are needed to reach optimal performance. However, one notable difference between these two classes is the dramatic F1-score improvement for depressed mood, i.e., a 20 point increase from the 1st percentile to the 20th percentile, compared to the more gradual F1-score improvement for evidence of depression, i.e., an 11 point increase from the 1st percentile to the 20th percentile. This finding suggests that for identifying depressed mood a variety of features are needed before incremental gains are observed.

5. FUTURE WORK

Our next step is to address the classification of rarer depressive symptoms suggestive of major depressive disorder from our dataset and hierarchy, including inappropriate guilt, difficulty concentrating, psychomotor agitation or retardation, weight loss or gain, and anhedonia [1, 2]. We are developing a population-level monitoring framework designed to estimate the prevalence of depression (and depression-related symptoms and psycho-social stressors) over millions of United States-geocoded tweets. Identifying the most discriminating feature sets and natural language processing classifiers for each depression symptom is vital for this goal.
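The chi-square ranking at the heart of the Selection step (Section 2.3) can be sketched in pure Python. This is an illustrative re-implementation on an invented toy matrix, not the study's code; the study used scikit-learn's Chi-Square feature selection, whose scoring for sparse count data differs in detail from the 2x2 contingency formulation below.

```python
# Illustrative chi-square feature ranking for one binarized class (toy data).

def chi2_score(x_col, y):
    # Chi-square statistic from the 2x2 contingency table of one binary
    # feature column against binary class labels.
    n = len(y)
    obs = [[0, 0], [0, 0]]                    # obs[feature value][label]
    for xi, yi in zip(x_col, y):
        obs[xi][yi] += 1
    row = [sum(obs[0]), sum(obs[1])]
    col = [obs[0][0] + obs[1][0], obs[0][1] + obs[1][1]]
    score = 0.0
    for i in (0, 1):
        for j in (0, 1):
            expected = row[i] * col[j] / n    # expected count if independent
            if expected:
                score += (obs[i][j] - expected) ** 2 / expected
    return score

def top_percentile(X, y, pct):
    # Indices of the top pct% of features, ranked by chi-square score.
    scores = [chi2_score([r[f] for r in X], y) for f in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    keep = max(1, round(len(ranked) * pct / 100))
    return ranked[:keep]

# Feature 0 tracks the label perfectly; feature 1 is noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
# top_percentile(X, y, 50) == [0]
```

Sweeping `pct` from 5 to 95 in 5-point steps and re-training a classifier on each kept subset reproduces the shape of the percentile experiment described in Sections 2.3 and 3.2.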
From trast,peakF1-scoreperformancesoccurredatmoderateper- these experiments, we conclude that simple lexical features centiles for evidence of depression (30th percentile) and de- andreducedfeaturesets canproduce comparable resultsto pressedmood (55thpercentile)suggestingthatmorefeatures the much larger feature dataset. areneededtoreachoptimalperformance. However,oneno- tabledifferencebetweenthesetwoclassesisthedramaticF1- scoreimprovementsfordepressedmoodi.e.,20pointincrease 7. ACKNOWLEDGMENTS from the 1st percentile to the 20th percentile compared to Research reported in this publication was supported by themoregradualF1-scoreimprovementsforevidence ofde- theNationalLibraryofMedicineofthe[UnitedStates]Na- pression i.e., 11 point increase from the 1st percentile to tionalInstitutesofHealthunderawardnumbersK99LM011393 the 20th percentile. This finding suggests that for identify- and R00LM011393. This study was granted an exemption ing depressed mood a variety of features are needed before from review by the University of Utah Institutional Review incremental gains are observed. Board(IRB00076188). Notethatinordertoprotecttweeter anonymity, we have not reproduced tweets verbatim. Ex- 5. FUTUREWORK ample tweets shown were generated by the researchers as Our next step is to address the classification of rarer de- exemplars only. Finally, we would like to thank the anony- pressive symptoms suggestive of major depressive disorder mous reviewers of this paper for their valuable comments. fromourdatasetandhierarchyincludinginappropriateguilt, difficulty concentrating, psychomotor agitation or retarda- tion, weight loss or gain, and anhedonia [1, 2]. We are de- veloping a population-level monitoring framework designed 8. REFERENCES Emotions in Social Media, pages 182–191, Osaka, Japan, December 2016. [1] American Psychiatric Association. Diagnostic and [13] D. L. Mowery, C. Bryan, and M. Conway. 
Toward Statistical Manual of Mental Disorders, 4th Edition, developing an annotation scheme for depressive Text Revision (DSM-IV-TR). American Psychiatric disorder symptoms: A preliminary study using Association, Washington, DC, 2000. Twitter data. In Proceeding of 2nd Workshop on [2] American Psychiatric Association. Diagnostic and Computational Linguistics and Clinical Psychology - Statistical Manual of Mental Disorders, Fifth Edition From Linguistic Signal to Clinical Reality, pages (DSM-5). American Psychiatric Association, 89–98. Association for Computational Linguistics, Washington, DC, 2013. 2015. [3] P. A. Cavazos-Rehg, M. J. Krauss, S. Sowles, [14] D. L. Mowery, H. A. Smith, T. Cheney, G. Stoddard, S. Connolly, C. Rosa, M. Bharadwaj, and L. J. Bierut. C. Glen, C. Bryan, and M. Conway. Understanding A content analysis of depression-related tweets. depressive symptoms and psychosocial stressors on Computers in Human Behavior., 54:351–357, 2016. Twitter: A corpus-based study. J Med Internet Res, [4] A. Chen, S. Zhu, and M. Conway. What online [in press]. communities can tell us about electronic cigarettes [15] M. Mysl´ın, S.-H. Zhu, W. W. Chapman, and and hookah use: A study using text mining and M. Conway. Using Twitter to examine smoking visualization techniques. J Med Internet Res, behavior and perceptions of emerging tobacco 17(9):e220, 2015 Sep 29. products. J Med Internet Res, 15(8):e174, 2013. [5] N. Collier, N. T. Son, and N. M. Nguyen. Omg u got [16] J. Pennebaker, M. Francis, and R. Booth. Linguistic flu? analysis of shared health messages for Inquiry and Word Count [computer software]. bio-surveillance. J Biomed Semantics, 2 Suppl 5:S9, Mahwah, NJ: Erlbaum Publishers, 2001. Oct 2011. [17] V.Prieto,S.Matos,M.A´lvarez,F.Cacheda,andJ.L. [6] M. Conway and D. O’Conner. Social media, big data, Oliveira. Twitter: a good place to detect health and mental health: Current advances and ethical conditions. PLoS One, 9(1):e86191, 2014. implications. 
Current Opinion in Psychology., 9:77–82, 2016. [7] G. Coppersmith, M. Dredze, and C. Harman. Quantifying mental health signals in Twitter. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 51–60, Baltimore, Maryland, USA, June 27th 2014 2014. Association for Computational Linguistics. [8] G. Coppersmith, M. Dredze, C. Harman, and K. Hollingshead. From ADHD to SAD: Analyzing the language of mental health on Twitter through self-reported diagnoses. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 1–10, Denver, CO, USA, June 5th 2015 2015. [9] M. De Choudhury, S. Counts, E. J. Horvitz, and A. Hoff. Characterizing and predicting postpartum depression from shared Facebook data. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing - CSCW ’14, pages 626–638, New York, New York, USA, 2014. ACM Press. [10] M. De Choudhury, E. Kiciman, M. Dredze, G. Coppersmith, and M. Kumar. Discovering shifts to suicidal ideation from mental health content in social media.Inthe 2016 CHI Conference on Human Factors in Computing Systems, pages 2098–2110, San Jose, CA, USA, 2016. ACM Press. [11] C. Hanson, B. Cannon, S. Burton, and C. Giraud-Carrier. An exploration of social circles and prescription drug abuse through Twitter. J Med Internet Res, 15(9):e189, 2013. [12] D. Mowery, A. Park, C. Bryan, and M. Conway. Towards automatically classifying depressive symptoms from Twitter data for population health. In Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and