Detecting Adverse Drug Reactions in Electronic Health Records by using the Food and Drug Administration’s Adverse Event Reporting System

A thesis submitted to the Division of Research and Advanced Studies of the University of Cincinnati in partial fulfillment of the requirements for the degree of

Master of Science

for the Major of Computer Science in the Department of Electrical Engineering and Computing Systems of the College of Engineering and Applied Science

2016

By Huaxiu Tang

Committee Members:
Jaroslaw Meller, Ph.D., Chair
Raj Bhatnagar, Ph.D.
Yizhao Ni, Ph.D.

Abstract

The objective of this study was to detect adverse drug reactions (ADRs) in Electronic Health Records (EHR) by using the Food and Drug Administration’s (FDA) Adverse Event Reporting System (FAERS). The null hypothesis stated that “leveraging drug-reaction pairs from the FAERS will not improve the performance of EHR-based ADR detection”. Both structured data and unstructured clinical notes (narrative text) were used in this study. Natural Language Processing (NLP) was used to identify medical conditions in clinical notes.

First, I identified the 41 most frequently used medications by calculating the frequencies of medications in the EHR data and FDA ADR reports. Focusing on the 41 medications, I used the FDA drug-reaction pairs as candidate ADRs and searched for them in the clinical notes to identify Drug-Reaction Pair Sentences (DRPSs). Then, NLP techniques were applied to the DRPSs to identify ADR-related DRPSs. Finally, a clinical informatics expert with a medical degree and I performed a manual review to evaluate the algorithm outputs.

A total of 2475 ADR-related DRPSs were identified, of which 1492 were unique. Five hundred twenty-eight sentences were randomly sampled from the 1492 unique DRPSs for manual review. The positive predictive value (PPV) on the reviewed cases was 42.4% (224 out of 528).
The performance of ADR identification leveraging FDA ADE reports (PPV = 42.4%) was significantly better than that of identification without FDA ADE reports (PPV = 8.7%); the p-value was 3.67 × 10⁻³⁶. The null hypothesis was rejected. These promising results could have a significant impact on automated or semi-automated ADR detection, particularly as the growing adoption of EHRs greatly expands the size and increases the complexity of EHR data.

Keywords: adverse drug reaction; clinical notes; electronic health records; natural language processing

Acknowledgement

This thesis was completed with the help of many people.

Firstly, I would like to thank my advisors, Dr. Imre Solti and Dr. Jaroslaw Meller, for their great help and insightful guidance. Dr. Solti suggested the overall scope of this thesis and guided me until the thesis was finished. I appreciate the resources, mentoring and opportunity given to me by my advisors, Dr. Solti and Dr. Meller.

I would like to especially thank Dr. Yizhao Ni for his mentoring and guidance during my three years of study and for the great effort he made on my thesis project. He also provided tremendous technical help, which made this study possible.

I would like to give my special thanks to Dr. Eric Kirkendall and Lisa Garrity, who gave me helpful clinical suggestions and comments when I hit roadblocks during the study. I would also like to give my special thanks to Todd G. Lingren and Dr. Qi Li, who shared their experience and resources related to this study with me.

I also want to thank Dr. Raj Bhatnagar, who was kind enough to agree to serve on my thesis committee and provided many valuable suggestions. I want to thank Ms Julie Muenchen, Ms Vicki Baker, Ms Jennifer Diallo and Ms Melissa Hogan for their administrative help. I would also like to thank my classmates Lirong Tan and Xiaoting Zhu, who provided help while I was not in Cincinnati.

Lastly and most importantly, I want to thank my dear husband and daughter.
They were always there for me, and I cannot thank them enough for their support and love.

Table of Contents

Abstract
Acknowledgement
Table of Contents
List of Tables
List of Figures
Abbreviations
1 Introduction
2 Related work
  2.1 Definitions related to ADR and scope of the study
  2.2 Adverse drug reaction detection
  2.3 Natural language processing on clinical notes
    2.3.1 Medical condition identification
    2.3.2 Contextual property detection
3 Data
  3.1 Electronic Health Records (EHR)
  3.2 FDA FAERS Reports
4 Methods
  4.1 Most frequently used medications
    4.1.1 Data preprocessing on FDA FAERS reports
    4.1.2 Medication name mapping
    4.1.3 Most frequently used medications
  4.2 Preparation for DDSS/reaction-CUI matching
    4.2.1 Preprocessing the FDA drug-reaction pairs
    4.2.2 FDA reaction-CUI mapping
    4.2.3 Preprocessing the clinical notes
    4.2.4 NLP of the clinical notes
  4.3 Identifying ADR-related DRPSs
    4.3.1 ADR time window
    4.3.2 Identifying ADR-related DRPSs
    4.3.3 Thresholding reactions for a medication
  4.4 Identifying baseline ADR-related DRPSs
5 Results
  5.1 Evaluation metrics
  5.2 Performance of ADR detection with FAERS reports
  5.3 Performance of ADR detection using the baseline
6 Discussion
7 Conclusion
8 References
9 Appendix

List of Tables

Table 1: EHR data
Table 2: FDA ADE Reports
Table 3: Frequencies and cumulative percentages for the 41 most frequently used medications
Table 4: Criteria to skip reactions
Table 5: Number of reactions reported by FDA for the 41 most frequently used medications
Table 6: Types of notes used in this study
Table 7: Complicated encounters to be filtered out
Table 8: Excluded sections
Table 9: Key words and captures of plan/hypothetical/instructions/therapy
Table 10: Patterns indicating relations between DDSSs and medications and their indication
Table 11: ADR-related DRPSs identified for each medication
Table 12: ADR-related DRPSs identified for each note type
Table 13: ADR-related DRPSs identified for each medication and each note type
Table 14: Number of ADR-related DRPSs at each cut-off percentage
Table 15: Baseline ADR-related DRPSs identified for each medication
Table 16: Results for each medication
Table 17: Results for each note type
Table 18: Results for each cut-off percentage
Table 19: Results for each medication in each note type
Table 20: Number of medication-note type pairs and the corresponding percentages at each PPV level
Table 21: Identified ADRs and their frequencies
Table 22: Baseline review results for each medication
Appendix Table 1: Frequencies and cumulative percentages for the top 50 most frequently used medications in EHR data
Appendix Table 2: Excluded reactions
Appendix Table 3: Excluded FDA drug-reaction pairs
Appendix Table 4: Included FDA drug-reaction pairs

List of Figures

Figure 1: cTAKES components and their dependencies (Masanz, 2013)
Figure 2: Overview of the study
Figure 3: Example of one single one-drug-ADE and two multi-drug-ADE cases
Figure 4: Age code and report source code (a. Age code; b. Report source code)
Figure 5: Data preprocessing on the FDA reports
Figure 6: Brand/generic name searching in the Lexicomp Online Database
Figure 7: Preprocessing of medication names
Figure 8: Medication name tokenization and the corresponding searching of the candidates
Figure 9: Example of search queries
Figure 10: Examples of returned pages
Figure 11: Parsing rules for identifying generic name from the returned HTML page
Figure 12: Parsing rules for identifying the FDA generic names from the returned HTML pages
Figure 13: User interfaces for the manual review tool
Figure 14: Medication names (left column) mapping to generic names (right column)
Figure 15: The UMLS Terminology Browser
Figure 16: Data preprocessing for clinical notes
Figure 17: Workflow of the NLP pipeline
Figure 18: Example sentences and the cTAKES output of DDSS terms and CUIs
Figure 19: Pipeline for identifying ADR-related DRPSs
Figure 20: Plots for ADR-related DRPSs at each cut-off percentage
Figure 21: Pipeline for identifying baseline ADR-related DRPSs

Abbreviations

ADE ------ Adverse Drug Event
ADR ------ Adverse Drug Reaction
CCHMC ------ Cincinnati Children’s Hospital Medical Center
cTAKES ------ clinical Text Analysis and Knowledge Extraction System
CUI ------ Concept Unique Identifier
DDSSs ------ Diseases/Disorders and Signs/Symptoms
DRPS ------ Drug-Reaction Pair Sentence
EHR ------ Electronic Health Record
FAERS ------ Food and Drug Administration’s Adverse Event Reporting System
FDA ------ Food and Drug Administration
NLP ------ Natural Language Processing
UIMA ------ Unstructured Information Management Architecture