Yueyang Alice Li PDF

92 Pages·2011·0.31 MB·English


Medical Data Mining: Improving Information Accessibility using Online Patient Drug Reviews

by

Yueyang Alice Li

S.B., Massachusetts Institute of Technology (2010)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2011

© Massachusetts Institute of Technology 2011. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, January 4, 2011

Certified by: Dr. Stephanie Seneff, Senior Research Scientist, Thesis Supervisor

Accepted by: Dr. Christopher J. Terman, Chairman, Masters of Engineering Thesis Committee

Submitted to the Department of Electrical Engineering and Computer Science on January 4, 2011, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

We address the problem of information accessibility for patients concerned about pharmaceutical drug side effects and experiences. We create a new corpus of online patient-provided drug reviews and present our initial experiments on that corpus. We detect biases in term distributions that show a statistically significant association between a class of cholesterol-lowering drugs called statins, and a wide range of alarming disorders, including depression, memory loss, and heart failure. We also develop an initial language model for speech recognition in the medical domain, with transcribed data on sample patient comments collected with Amazon Mechanical Turk.
Our findings show that patient-reported drug experiences have great potential to empower consumers to make more informed decisions about medical drugs, and our methods will be used to increase information accessibility for consumers.

Thesis Supervisor: Dr. Stephanie Seneff
Title: Senior Research Scientist

Acknowledgments

I would like to express my sincere gratitude to Stephanie Seneff for acting as my advisor. Her invaluable expertise and generous guidance were instrumental to the completion of this thesis, and her eternal enthusiasm kept me motivated throughout the year.

It has been a pleasure being part of the Spoken Language Systems group. Special thanks go to JingJing Liu for her knowledgeable insight and collaboration in the classification experiments, to Jim Glass for his kind encouragement, and to Victor Zue for his advice on grad school and life beyond. I would especially like to thank Scott Cyphers, who was always willing to answer my endless questions about the Galaxy system. Many thanks to everyone in the group for making it such an enjoyable and welcome place to work.

I would also like to acknowledge Tommi Jaakkola for his patient and illuminating instruction on machine learning, and Regina Barzilay for first introducing me to NLP. This work would not have been possible without Victor Costan, who gave me massive help whenever I ran into difficulties with Ruby on Rails. I also deeply appreciate my friends and colleagues at CSAIL, for most enjoyable discussions and treasured memories.

Finally, I am indebted to my wonderful family for their unconditional love and support.

Bibliographic Note

Portions of this thesis are based on the paper entitled “Automatic Drug Side Effect Discovery from Online Patient-Submitted Reviews - Focus on Statin Drugs” with Stephanie Seneff and JingJing Liu, which was submitted to the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics.

Contents

1 Introduction
    1.1 Vision
    1.2 Contributions
    1.3 Thesis Overview
2 Related Work
    2.1 Term Identification
        2.1.1 Medical Knowledge Resources
        2.1.2 Statistical Approaches
    2.2 Medical Applications
        2.2.1 Dialogue Systems
        2.2.2 Health Surveillance
    2.3 Summary
3 Data
    3.1 Data Collection
        3.1.1 Data Sources
        3.1.2 Data Coverage
    3.2 Example Comments
    3.3 Spelling Correction
4 Automatic Discovery of Side Effects: Focus on Cholesterol-Lowering Drugs
    4.1 Side Effects of Cholesterol-lowering Drugs: Brief Literature Review
        4.1.1 Statin Drugs
        4.1.2 Non-Statin Cholesterol-Lowering Drugs
    4.2 Data
    4.3 Methods
        4.3.1 Log Likelihood Statistic
        4.3.2 Pointwise Mutual Information
        4.3.3 Set Operations
    4.4 Results
        4.4.1 Cholesterol-lowering vs Blood-pressure-lowering Drugs
        4.4.2 Statins vs Non-statins
        4.4.3 Gender Differences
        4.4.4 Lipophilic vs Hydrophilic Statins
    4.5 Discussion
        4.5.1 Limitations
    4.6 Summary
5 Speech Recognition Experiments
    5.1 Collection of Spoken Questions Data
    5.2 Methods
        5.2.1 Trigram Language Model
        5.2.2 Data Sparsity
    5.3 Results and Discussion
    5.4 Summary
6 Additional Preliminary Experiments
    6.1 Multi-word Term Identification
        6.1.1 Term Frequency
        6.1.2 Part of Speech Filter
        6.1.3 Association Measures
        6.1.4 Discussion

