Studies in Computational Intelligence, Volume 1018

Series Editor: Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

Fabio Crestani · David E. Losada · Javier Parapar (Editors)

Early Detection of Mental Health Disorders by Social Media Monitoring
The First Five Years of the eRisk Project

The series "Studies in Computational Intelligence" (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output.

Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

Editors

Fabio Crestani
Faculty of Informatics
Università della Svizzera Italiana (USI)
Lugano, Switzerland

David E. Losada
Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS)
Universidade de Santiago de Compostela
Santiago de Compostela, Spain

Javier Parapar
Centro de Investigación en Tecnoloxías da Información e as Comunicacións (CITIC)
Universidade da Coruña
A Coruña, Spain

ISSN 1860-949X
ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-031-04430-4
ISBN 978-3-031-04431-1 (eBook)
https://doi.org/10.1007/978-3-031-04431-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Foreword

Early in 2013, my wife Rebecca and I were out on a "date night" having food and wine at a restaurant named—pragmatically, but uninspiringly—Food And Wine. We found ourselves discussing some work being done recently by my tribe, computational linguists, that might potentially be interesting to her tribe, clinical psychologists. By the time dessert arrived we had hatched the idea of a workshop that would bring the two tribes together. Not too long afterward I pitched the idea to Meg Mitchell, who had been involved in a number of interesting studies on language analysis in connection with mental health conditions like autism and cognitive impairment with Brian Roark, Emily Prud'hommeaux, and others.1 The three of us led the organization of the first Workshop on Computational Linguistics and Clinical Psychology, at the 2014 ACL conference. CLPsych, as it is known, has taken place every year since then, except for what I sincerely hope will be just a single break during a global pandemic.2

We organized CLPsych because we could see that there were an increasing number of people interested in bringing the tools of natural language processing and machine learning to the problems of mental health and well-being, but they were scattered, presenting their work in a wide range of different journals, conferences, and workshops. There was critical mass, and it was time to start building a community.

The book you're holding (or, these days, viewing, or perhaps listening to) emerges from a similar recognition—a recognition of real-world needs, of a range of capable people with the right skills to tackle those needs, and of an opportunity to bring that body of intelligence and enthusiasm and energy together. The organizers of eRisk identified a common thread running through a range of concerns for human well-being: the value of identifying risks and threats sooner rather than later. This has given rise to the concept of early risk detection systems as a coherent, interdisciplinary research area. The application areas for early risk detection are numerous. These include risks directly associated with mental health, such as major depression and suicidality, but they also go further, to include risks from sexual predators, for example, or cyberbullying, or political radicalization, or addiction.

1 See for example:
• Prud'hommeaux, E. T., Mitchell, M., & Roark, B. (2011). Using patterns of narrative recall for improved detection of mild cognitive impairment. International Conference on Technology and Aging (ICTA) 2011, Toronto, Ontario.
• Roark, B., Mitchell, M., Hosom, J.-P., Hollingshead, K., & Kaye, J. (2011). Spoken language derived measures for detecting mild cognitive impairment. IEEE Transactions on Audio, Speech, and Language Processing, 19(7), 2081–2090.
• Van Santen, J. P., Prud'hommeaux, E. T., Black, L. M., & Mitchell, M. (2010). Computational prosodic markers for autism. Autism, 14(3), 215–236.
• Roark, B., Mitchell, M., & Hollingshead, K. (2007). Syntactic complexity measures for detecting mild cognitive impairment. In Proceedings of the ACL 2007 Workshop on Biomedical Natural Language Processing (BioNLP), Prague, pp. 1–8.
2 Proceedings can be found at CLPsych.org.
What makes all this possible is the ever-increasing availability of online evidence about people's thoughts and behaviors—what Lazer and colleagues, introducing the field of computational social science, referred to as "digital traces that can be compiled into comprehensive pictures".3 Similarly, Coppersmith and his colleagues talk about the "clinical whitespace" in the context of mental health; this describes the long periods of time that intervene between attention from clinical providers, periods that used to be entirely opaque to providers of care, but which now hold the potential for yielding valuable or even crucial signal about people's mental state and, as such, new opportunities for early intervention.4

This notion of time is really central, and one of eRisk's most distinguishing and innovative features. It's very easy for us, as technologists, to define "tasks" and their associated evaluation metrics in whatever way we find intuitively plausible or easy to execute. But tasks and metrics, if they are to be relevant, are not stand-alone things. Rather, progress needs to be made using faithful abstractions of the real-world problems the technology is eventually intended to help address, and that means thinking about real-world use cases. One obvious example of this is in speech recognition, where the tasks worked on by the community have progressed, over the decades, from isolated words, to speech read by broadcasters, to conversations, to meetings, all with an increased awareness of the need for tolerance in the face of noise and variation among speakers. Another is information retrieval, where standard thinking about evaluation in terms of precision and recall has now been updated with more relevant metrics in the context of how people actually search for information on a day-to-day basis. What sense does it make to evaluate systems as if we care about every single relevant document they return, for example, in a world where most searches are done on the web and almost nobody looks past the first ten hits? Measuring precision-at-10, the number of relevant hits among the first ten returned, is a better abstraction for assessing the extent to which a system will satisfy that real-world need.

Correspondingly, in settings where proactive intervention is the ultimate real-world goal, data don't arrive all at once. Much of the time, data points arrive in a stream over time. The question, therefore, is not just how to do good classification; it's about when a system should raise a flag saying that an individual or a situation merits attention. The eRisk tasks are, therefore, designed from the ground up with chronological considerations in mind, from an incremental data release strategy to evaluation metrics that consider not only the accuracy but also the speed with which choices are made.

As another key thing to consider, better technology is not a goal in and of itself. Even if a technology genuinely performs well in formal evaluations, that doesn't necessarily mean that it can be deployed in a way that actually represents progress. Most technologists are familiar with the distinction between intrinsic and extrinsic evaluations, the latter category meaning that what is measured is the performance of the entire system in which the to-be-evaluated technology is embedded.

3 Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., ... & Van Alstyne, M. (2009). Computational social science. Science, 323(5915), 721–723.
4 Coppersmith, G., Leary, R., Crutchley, P., & Fine, A. (2018). Natural language processing of social media as screening for suicide risk. Biomedical Informatics Insights, 10, 117822261879286.
For example, you can evaluate a natural language parser intrinsically, by looking at the parse trees it produces, or you can consider it extrinsically in the context of a machine translation system, looking at how different parsing approaches improve, or don't improve, the overall quality of the translations being produced.5

When it comes to risk, systems like the ones described in this volume are being evaluated intrinsically in the eRisk tasks, but it's important to remember that ultimately what matters most is how they will perform extrinsically—embedded not just in some other automatic system, like machine translation, but in an entire ecosystem that involves not just technologies but also human beings providing care or assistance.6 This gives rise to some challenges for the typical ways technologists gather data for technological development and evaluation; for example, Ernala and colleagues have pointed out some of the downsides of using online self-reports and other proxies as a stand-in for clinically valid diagnoses.7 It's notable and important, therefore, that eRisk's criteria go further and require explicit mention of a diagnosis of a condition like depression, not just a statement about having depression; this is a credible way of balancing the goal of collecting data at reasonable scale against the goal of defining ground-truth categories that will actually matter to clinicians or other relevant subject matter experts.

The chapters in this volume do a great job of providing a comprehensive understanding of these and other considerations. The technology-oriented reader will find clear rationales for the choices being made, detailed discussions of the best work so far in eRisk tasks, and thorough contemplation of findings. Even more important, though, this volume is a milestone setting the stage for the next generation of eRisk. In looking at the future of eRisk, the discussion of expanding the range of problems is exciting, particularly the idea of selecting new challenges to focus on based on broad social relevance—for example, anxiety disorders are trending upward as a result of the COVID-19 pandemic, and digital evidence may be one of the most important ways to help clinical professionals handle the mental health tsunami following in its wake.8 Similarly, the ability to improve early detection of dementia may have a dramatic impact on well-being for large swaths of the world's aging population. Equally exciting is eRisk's more recent and continuing push in the direction of prediction not just for a single class (at-risk or not), but for symptom-level responses of the kind reported in clinical questionnaires.

5 Resnik, P., & Lin, J. Evaluation of NLP systems. In Clark, A., Fox, C., & Lappin, S. (Eds.) (2010), The Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell.
6 Resnik, P., Foreman, A., Kuchuk, M., Musacchio Schafer, K., & Pinkham, B. (2021). Naturally occurring language as a source of evidence in suicide prevention. Suicide and Life-Threatening Behavior, 51(1), 88–96.
7 Ernala, S. K., Birnbaum, M. L., Candan, K. A., Rizvi, A. F., Sterling, W. A., Kane, J. M., & De Choudhury, M. (2019). Methodological gaps in predicting mental health states from social media: Triangulating diagnostic signals. In Proceedings of the 2019 ACM CHI Conference on Human Factors in Computing Systems (p. 134).
Particularly for mental health providers, the yes-or-no question of a diagnosis matters, but it's far from the end of the story: finer-grained information is needed in order to really understand what's going on with patients, and in fact there are efforts underway in the mental health community to improve the organization, description, and measurement of psychopathology via data-driven, empirically based organization of signs and symptoms.9 The move toward obtaining ground-truth data from validated questionnaires, and defining the goals of prediction tasks accordingly, is a move toward bridging more effectively between the things technologies can provide and the things clinicians need.

In the end, whether we are talking about mental health risks in particular, or other risks more generally, that kind of bridging—creating momentum and community among technologists, combined with relevance for the non-technologists—is the way real progress is going to be made. The work in this book is an important contribution to that enterprise.

October 2021

Philip Resnik
Department of Linguistics and Institute for Advanced Computer Studies
University of Maryland
College Park, Maryland, USA

8 Inkster, B., & Digital Mental Health Data Insights Group (DMHDIG) (2021). Early warning signs of a mental health tsunami: A coordinated response to gather initial data insights from multiple digital services providers. Frontiers in Digital Health, 2, 64.
9 Kotov, R., Krueger, R. F., Watson, D., Achenbach, T. M., Althoff, R. R., Bagby, R. M., ... & Zimmerman, M. (2017). The Hierarchical Taxonomy of Psychopathology (HiTOP): A dimensional alternative to traditional nosologies. Journal of Abnormal Psychology, 126(4), 454–477.

Preface

This book presents the best of the first five years of eRisk.
eRisk,10 which is a shortening of Early Risk Prediction on the Internet, is concerned with the exploration of techniques for the early detection of certain health problems which manifest in the way people write and communicate on the internet, in particular in user-generated content. eRisk is also concerned with the evaluation methodology for such techniques, in particular with effectiveness metrics where early detection is an important issue. The eRisk project was proposed as part of the Conference and Labs of the Evaluation Forum (CLEF) in 2016 and started as a lab in 2017. Thus, 2020 was the fifth year of eRisk. Proud of what we achieved after five years and excited for what the future will bring to this endeavor, we decided to compile this book to present the best of what was achieved in these first five years of this interesting and challenging task. We hope this book will be of interest not just to the many past and future participants of the lab but also to other researchers and practitioners involved in similarly challenging endeavors.

Fabio Crestani, Lugano, Switzerland
David E. Losada, Santiago de Compostela, Spain
Javier Parapar, A Coruña, Spain

October 2021

10 See: https://erisk.irlab.org/.