
Comparison of Missing Data Imputation Methods for Improving Detection of Obstructive Sleep Apnea PDF

218 Pages·2017·1.41 MB·English
by Marit Iren Rognli Tokle

Preview Comparison of Missing Data Imputation Methods for Improving Detection of Obstructive Sleep Apnea

Comparison of Missing Data Imputation Methods for Improving Detection of Obstructive Sleep Apnea

Marit Iren Rognli Tokle

Thesis submitted for the degree of Master in Informatics: Programming and Networks
60 credits

Department of Informatics
Faculty of Mathematics and Natural Sciences
UNIVERSITY OF OSLO

Autumn 2017

© 2017 Marit Iren Rognli Tokle
http://www.duo.uio.no/
Printed: Reprosentralen, University of Oslo

Abstract

Sleep apnea is a common sleep disorder where the breathing is paused or reduced during sleep, which forces awakening due to less oxygen in the blood. We employ the four data mining methods K-Nearest Neighbor, Support Vector Machine, Artificial Neural Network, and Decision Tree to analyze datasets containing the four non-invasive sensor signals chest respiration, abdominal respiration, nasal respiration, and oxygen saturation. Good results for sleep apnea analysis using these signals as input data for the data mining methods already exist. We examine how using the European Data Format Plus (EDF+) affects the data mining results, because it is a standardised data format used for storing sleep data, and is used by the sensors and Resmed tool NOX which we use for data acquisition. We also examine how pre-processing input data with imputation methods to handle missing data affects the data mining results, as we want to support usage of sensors of all qualities, in which we have to assume missing data will occur.

There are two tasks in this thesis. First, we check how well the data mining algorithms work with our signals in the EDF+ data format. We conclude that EDF+ is as good as the most common data format used in PhysioNet, as we could store and read data without any problems, or it might be even better since the information is stored in a single file instead of several files. By converting data to EDF+, we confirmed that signals and annotations may be stored in the same file. We confirm that our data mining algorithms and all signal combinations, except the sole use of respiration from the chest, may be used for off-line classification of sleep apnea.

In the second task, we examine how imputation methods work and how pre-processing our signal data with imputation methods affects our data mining methods. For the missing data challenge, we discovered that the only imputation method that may be used for all percentages of missing values (5%, 10%, 20%, 30% and 50%) and all four data mining methods is Self-Organising Maps. This is the overall best method, and the only method that should be used for datasets containing 30% or more missing data, because the others do not maintain the data structure of the dataset. Imputing with the mean and median of each class of normal or disrupted breathing should not be applied as an imputation method, and we assume that separating between classes when imputing is bad practice. Multiple Linear Regression and K-Nearest Neighbor are better at maintaining the data structure than mean and median imputation, but both have a deviation of about 8% compared to the results of the complete dataset. Self-Organising Maps has at most a deviation of 1.25% from the classification of the complete dataset. Mean and median imputation may be used if the imputation time is important: they are the fastest methods, using only a few milliseconds when imputing, and they are better than handling the missing values by replacing them with zeros.

Acknowledgements

It has been an intensive and educational period of time.
Now it is time to give a special note to and reflect on the people who helped me and supported me throughout this period.

First, I would like to thank my supervisor, Professor Dr. Vera Goebel, for her guidance throughout the work on this thesis. I am exceptionally happy to have such a dedicated supervisor helping me through any unclear situations and always giving good answers to my questions.

I would also like to thank my partner, Christian, and parents, Kate and Atle, for their support, encouragement and patience. At last, I would like to thank my family and friends for their encouragement.

Contents

1 Introduction
  1.1 Background and Motivation
  1.2 Problem Description
  1.3 Outline
2 Sleep apnea
  2.1 Obstructive Sleep Apnea
  2.2 Central Sleep Apnea
  2.3 Mixed or Complex Sleep Apnea
  2.4 Physiological Signals
  2.5 Sleep Apnea Diagnosis Tools
3 Data mining
  3.1 Data types
  3.2 Data Challenges
  3.3 Data mining tasks
  3.4 Classification methods used in this thesis
    3.4.1 Artificial Neural Network
    3.4.2 Support Vector Machine
    3.4.3 K-Nearest Neighbor
    3.4.4 Decision Trees
  3.5 Performance Evaluation
    3.5.1 Metrics
    3.5.2 Holdout Method
    3.5.3 Cross-Validation
  3.6 Conclusions
4 European Data Format
  4.1 EDF Specifications
    4.1.1 Header Record
    4.1.2 Data Record
    4.1.3 Annotations
    4.1.4 Labels for Sleep Apnea Detection
  4.2 WFDB Software Package
    4.2.1 mit2edf
5 Missing Data
  5.1 Types of Missing Data
  5.2 Handling Missing Data
  5.3 Imputation Methods
    5.3.1 Mean, median and mode
    5.3.2 Hot-Deck
    5.3.3 Regression
    5.3.4 Multiple Imputation
    5.3.5 Maximum Likelihood
    5.3.6 K-Nearest Neighbor
    5.3.7 Self-Organising Maps
6 Requirement analysis
  6.1 Input Data
  6.2 PhysioNet EDF Databases
    6.2.1 CAP Sleep Database
    6.2.2 Sleep-EDF Database
    6.2.3 Non-Invasive Fetal Electrocardiogram Database
    6.2.4 SHHS Polysomnography Database
    6.2.5 St. Vincent's University Hospital Database
    6.2.6 Apnea-ECG Database
    6.2.7 MIT-BIH Polysomnography Database
    6.2.8 Analysis of Database Suitability
  6.3 Conversion of Databases
  6.4 Missing Data
  6.5 Data Mining
  6.6 Related Work
7 Design and Implementation
  7.1 System Environment
  7.2 Input Data
    7.2.1 Apnea-ECG Database
    7.2.2 MIT-BIH Polysomnography Database
    7.2.3 St. Vincent's Hospital Database
  7.3 Class Imbalance and Data Distribution
  7.4 Performance Evaluation
  7.5 Conversion of Databases
    7.5.1 Header Record Design
    7.5.2 Data Record Design
    7.5.3 Implementation
  7.6 Missing Data
    7.6.1 Generating Missing Datasets
    7.6.2 Mean and Median Imputation
    7.6.3 Multiple Linear Regression
    7.6.4 K-Nearest Neighbor
    7.6.5 Self-Organising Maps

Description:
required if separation of obstructive and central sleep apnea is desirable. In previous work, the signal from the abdomen scored highest of the abdomen and chest, with an accuracy of 92.9%. The performance of the data mining methods varied little in previous work. The K-Nearest Neighbor performed