DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Deep Neural Networks for Inverse De-Identification of Medical Case Narratives in Reports of Suspected Adverse Drug Reactions

EVA-LISA MELDAU

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: February 24, 2018
Supervisor at KTH: Joel Brynielsson
Supervisor at the Uppsala Monitoring Centre: Niklas Norén
Examiner: Olle Bälter
Swedish title: Djupa neuronnät för omvänd avidentifiering av medicinska fallbeskrivningar i biverkningsrapporter

Abstract

Medical research requires detailed and accurate information on individual patients. This is especially so in pharmacovigilance, which among other things seeks to identify previously unknown adverse drug reactions. Here, the clinical stories are often the starting point for assessing whether there is a causal relationship between the drug and the suspected adverse reaction. Reliable automatic de-identification of medical case narratives could make it possible to share this patient data without compromising the patient’s privacy. Current research on de-identification has focused on labelling the tokens in a narrative with the class of sensitive information they belong to. In this Master’s thesis project, we explore an inverse approach to the task of de-identification. This means that de-identification of medical case narratives is instead understood as identifying tokens which do not need to be removed from the text in order to ensure patient confidentiality. Our results show that this approach can lead to a more reliable method in terms of higher recall.
We achieve a recall of sensitive information of 99.1% while the precision is kept above 51% for the 2014-i2b2 benchmark data set. The model was also fine-tuned on case narratives from reports of suspected adverse drug reactions, where a recall of sensitive information of more than 99% was achieved. Although the precision was only 55%, which is lower than in comparable systems, an expert could still identify information which would be useful for causality assessment in pharmacovigilance in most of the case narratives which were de-identified with our method. In more than 50% of the case narratives, no information useful for causality assessment was missing at all.

Summary (Sammanfattning)

Access to detailed clinical data is a prerequisite for conducting medical research and, by extension, for helping patients. Reliable de-identification of medical case narratives can make it possible to share such information without jeopardising the protection of patients’ personal data. Previous research in the field has approached the problem by labelling the words in a text with the type of sensitive information they convey. In this degree project, we explore the possibility of approaching the problem inversely, by identifying the words that do not need to be removed in order to protect sensitive patient information. Our results show that this can de-identify a larger share of the sensitive information: 99.1% of all sensitive information is de-identified with our method, while 51% of all excluded words actually convey sensitive information, as evaluated on the 2014-i2b2 benchmark data set. The algorithm was also adapted to case narratives from adverse drug reaction reports, and in this case 99.1% of all sensitive information was de-identified, while 55% of all excluded words convey sensitive information. Although this latter share is lower than for comparable systems, an expert could find information useful for causality assessment in the majority of the de-identified reports; in more than half of the de-identified case narratives, no information of value for causality assessment was missing.

Contents

1 Introduction
  1.1 Purpose and Problem Statement
2 Background
  2.1 Pharmacovigilance
    2.1.1 Causality Assessment
    2.1.2 World Health Organization (WHO) International Drug Monitoring Programme
    2.1.3 VigiBase
  2.2 Protected Health Information
    2.2.1 Health Insurance Portability and Accountability Act
    2.2.2 European Union
    2.2.3 Comparison Between Countries
  2.3 Related Work: De-Identification Systems
    2.3.1 Systems Using Hand-Engineered Features
    2.3.2 Feature Learning Neural Network Systems
    2.3.3 Inverse Approach Systems
3 Theory
  3.1 Artificial Neural Networks
  3.2 Deep Learning
    3.2.1 Feature Learning
    3.2.2 Pre-Training and Fine-Tuning
  3.3 Deep Feed Forward Neural Networks
    3.3.1 Training
  3.4 Recurrent Neural Networks
    3.4.1 Structure
    3.4.2 Training
    3.4.3 Bidirectional Recurrent Neural Network
    3.4.4 Long Short-Term Memory
  3.5 Word Vectors
  3.6 Linear-Chain Conditional Random Field
  3.7 Evaluation Measures
4 Methodology
  4.1 Data Sets
    4.1.1 2014-i2b2 Data Set
    4.1.2 VigiBase Data Set
  4.2 Dictionaries
    4.2.1 WHODrug
    4.2.2 Medical Dictionary for Regulatory Activities
  4.3 De-Identification Methods
    4.3.1 Overview
    4.3.2 Annotating and Pre-Processing
    4.3.3 Rule-Based Approach Using Dictionary Look-ups
    4.3.4 Deep Learning Approach Using Long Short-Term Memory
    4.3.5 Combination Strategy
    4.3.6 Model Selection
  4.4 Evaluation
    4.4.1 Recall and Precision for Protected Health Information
    4.4.2 Retainment of Valuable Information
5 Results
  5.1 2014-i2b2 Data Set
    5.1.1 Evaluation of the Hybrid De-Identifier
    5.1.2 Evaluation of the Deep De-Identifier
    5.1.3 Comparisons
    5.1.4 Results Per Category
    5.1.5 Example Outputs
  5.2 VigiBase Data Set
    5.2.1 General Results
    5.2.2 Results Per Category
    5.2.3 Examples of Leaked Protected Health Information
    5.2.4 Valuable Information for Causality Assessment
6 Discussion
7 Conclusion
Bibliography