A New Intelligence Based Approach for Computer-Aided Diagnosis of Dengue Fever

© 20XX IEEE. Personal use of this material is permitted. Published in IEEE. DOI: 10.1109/TITB.2011.2171978
arXiv:1502.00062v1 [stat.ML] 31 Jan 2015

Vadrevu Sree Hari Rao, Senior Member, IEEE, and Mallenahalli Naresh Kumar

Manuscript received May 13, 2011; revised August 24, 2011 and September 30, 2011; accepted October 7, 2011.
Vadrevu Sree Hari Rao is with the Department of Mathematics, Jawaharlal Nehru Technological University, Hyderabad, Andhra Pradesh, 500 085, India. Also, he is an advisor for International Centre for Interdisciplinary Research and Innovation, VNRVJIET Campus, Hyderabad, India. e-mail: [email protected]
Mallenahalli Naresh Kumar is with the Software and Database Systems Group, National Remote Sensing Center (ISRO), Hyderabad, Andhra Pradesh, 500 625, India. e-mail: nareshkumar [email protected]
Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract: Identification of the influential clinical symptoms and laboratory features that help in the diagnosis of dengue fever in the early phase of the illness would aid in designing effective public health management and virological surveillance strategies. Keeping this as our main objective, we develop in this paper a new computational intelligence based methodology that predicts the diagnosis in real time, minimizing the number of false positives and false negatives. Our methodology consists of three major components: (i) a novel missing value imputation procedure that can be applied on any data set consisting of categorical (nominal) and/or numeric (real or integer) attributes; (ii) a wrapper based features selection method with genetic search for extracting a subset of the most influential symptoms that can diagnose the illness; and (iii) an alternating decision tree method that employs boosting for generating highly accurate decision rules. The predictive models developed using our methodology are found to be more accurate than the state-of-the-art methodologies used in the diagnosis of dengue fever.

Index Terms: dengue fever, classification, clinical diagnosis, prediction, imputation, features selection, genetic search, alternating decision trees

I. INTRODUCTION

Dengue fever (DF) is a mosquito-borne infectious disease caused by viruses of the family Flaviviridae, genus Flavivirus. The disease is transmitted through the bites of vectors (Aedes aegypti, Aedes albopictus) carrying viruses belonging to the genus Flavivirus [1]. From its first appearance in the Philippines in 1953, the disease has been identified as one of the most important arthropod-borne viral diseases in humans [2]. Dengue virus infection has been reported in more than 100 countries, with 2.5 billion people living in areas where dengue is endemic. The annual occurrence is estimated to be around 100 million cases of DF and 250,000 cases of dengue hemorrhagic fever (DHF).

The diagnosis of dengue fever presents great challenges as the symptoms overlap with other febrile illnesses. Accurate diagnosis is possible only after conducting definitive tests such as enzyme-linked immunosorbent assays (ELISA) and real-time polymerase-chain reaction (RT-PCR), which are based on nucleic acid hybridization [3]. A recent study [4] on the behavior of the C-type lectin domain family 5, member A (CLEC5A) gene may result in a strategy for reducing tissue damage, which would help improve the odds of survival of patients suffering from DHF and dengue shock syndrome (DSS). A multivariate model was developed in [5] for predicting hemoglobin (Hb) using predictors such as reactance obtained from a single frequency bioelectrical impedance analysis, sex, nausea/vomiting sensation and weight. These strategies can be employed only 2–12 days after the onset of the illness and require state-of-the-art laboratory facilities.

The World Health Organization (WHO) has arrived at a classification scheme for identifying infected individuals based on clinical symptoms and laboratory features. The development of predictive models for diagnosis of dengue fever based on these schemes is affected by missing or incomplete data records in the clinical databases [6], which may arise due to any or all of the following reasons: (i) a value being lost (erased or deleted); (ii) a value not being recorded; (iii) incorrect measurements; (iv) equipment errors; and (v) an expert not attaching any importance to a particular clinical procedure. Usually data is not collected from an organized research point of view [7]. The presence of a large number of clinical symptoms and laboratory features requires one to search large subspaces for optimal feature subsets. These issues, unless addressed appropriately, would hinder the development of an accurate and computationally effective diagnostic system.

In view of the above challenges, we present the following novel features of our work:
• to identify the missing values (MV) in the data set and impute them by using a newly developed imputation procedure;
• to identify a set of clinical symptoms that would enable early detection of suspected dengue in children and adults, which reduces the risk of transmission of dengue fever in the community;
• to identify the laboratory features and clinical symptoms that would enable better diagnosis and understanding of the disease in suspected dengue individuals. This renders optimal utilization of the laboratory resources required for confirmed diagnosis;
• to build a predictive model that is capable of rendering effective diagnosis in real time.
  Further, we compare its performance with other state-of-the-art methods used in the diagnosis of dengue fever.

The present paper is organized as follows: A survey of the state-of-the-art techniques for the diagnosis of dengue fever is presented in Section II, while in Section III we describe our novel methodology for computer-aided clinical diagnosis of dengue. The performance evaluation of the methodologies is described in Section IV. The description of the data sets and the experimental results are presented in Section V. We present a comparison of our new imputation methodology with other imputation methods in Section VI. In Section VII we discuss the computational complexity of our new method. Comparison of our new methodology with other state-of-the-art methods forms the subject of Section VIII. Conclusions and discussion are deferred to Section IX.

II. SURVEY OF THE STATE-OF-THE-ART TECHNIQUES FOR DIAGNOSIS OF DENGUE FEVER

A logistic regression method was employed to identify clinical symptoms and laboratory features in 381 individuals, out of which 148 were confirmed dengue [8]. The data records with missing values (MV) were ignored and deleted from the data set. In [9], the study was conducted on clinical records comprising 341 children and 597 adults, out of which 38 and 107 respectively were laboratory-confirmed positive dengue cases. In this study the data fields that were incomplete or inaccurate for all suspected dengue cases were replaced with the known values corresponding to the information in the medical charts. A C4.5 decision tree, which has an in-built mechanism for handling MV, was employed in [10] to develop a diagnostic algorithm to differentiate dengue from non-dengue illness on a data set comprising 1200 patients of which 173 had DF, 171 had DHF and 20 had DSS. A support vector machine (SVM) based methodology was employed in [11] to analyze the expression pattern of 12 genes of 28 dengue patients of which 13 were DHF and 15 were DF cases. A set of seven influential genes was identified through selective removal of expression data of these twelve genes.

In the above studies the MV were either removed [8], or filled with approximate values based on medical charts [9]. These approaches would lead to biased estimates and may either reduce or exaggerate the statistical power. Methods such as logistic regression, maximum likelihood and expectation maximization have been employed for imputation of MV, but they can be applied only on data sets that are either nominal or numeric. There are other imputation methods such as k-nearest neighbor imputation (KNNI) [12], k-means clustering imputation (KMI) [13], weighted k-nearest neighbor imputation (WKNNI) [14] and fuzzy k-means clustering imputation (FKMI) [13] that have been applied on other data sets but not on dengue fever data sets. Moreover, the authors in [8], [9], [11] have employed methods such as odds ratio (OR) and selective inclusion or exclusion of attributes for obtaining features subsets of dengue fever data sets. These methods do not yield effective diagnosis, as not all interactions or correlations between the features and the diagnosis are considered in these studies.

III. A NEW METHODOLOGY FOR COMPUTER-AIDED DIAGNOSIS OF DENGUE FEVER

Motivated by the above issues we propose a new methodology comprising a novel non-parametric missing value imputation method that can be applied on data sets consisting of attributes that are of the type categorical (nominal) and/or numeric (integer or real). The methodology proposed in [15] ignores missing values while generating the decision tree, which renders lower prediction accuracies. We have embedded the new imputation strategy (Section III-B) before generating the alternating decision tree, which results in improved performance of the classifier on data sets having missing values. Also, we develop an effective wrapper based features selection algorithm in order to identify the most influential features subset. The present methodology thus consists in utilizing the new imputation embedded alternating decision tree and the wrapper based features subset selection algorithm. This methodology can predict the diagnosis of dengue in real time. In fact the machine knowledge acquired by utilizing this novel methodology will be useful to diagnose other individuals based on clinical symptoms and laboratory features where the clinical decision is unavailable. We designate this novel methodology as NM throughout this work.

A. Data representation

A clinical data set can be represented as a set $S$ having row vectors $(R_1,R_2,\ldots,R_m)$ and column vectors $(C_1,C_2,\ldots,C_n)$. Each record can be represented as an ordered $n$-tuple of clinical and laboratory attributes $(A_{i1},A_{i2},\ldots,A_{i(n-1)},A_{in})$ for each $i=1,2,\ldots,m$, where the last attribute $A_{in}$ represents the physician's diagnosis to which the record $(A_{i1},A_{i2},\ldots,A_{i(n-1)})$ belongs. Without loss of generality we assume that there are no missing elements in this set of diagnoses. Each attribute $A_{ij}$ of an element in $S$, for $i=1,2,\ldots,m$ and $j=1,2,\ldots,n-1$, can be either of categorical (nominal) or numeric (real or integer) type. Clearly all the sets considered are finite sets.

B. A new non-parametric imputation strategy

The first step in any imputation algorithm is to compute the proximity measure in the feature space between the clinical records to identify the nearest neighbors from which the values can be imputed. The most popular metric for quantifying the similarity between any two records is the Euclidean distance. Even though this metric is simple to compute, it is sensitive to the scales of the features involved. Further, it does not account for correlation between the features. Also, categorical variables can only be quantified by counting measures, which calls for the development of effective strategies for computing the similarity [16]. Considering these factors we first propose a new indexing measure $I_{C_l}(R_i,R_k)$ between two typical elements $R_i$, $R_k$ for $i,k=1,2,\ldots,m$ and $l=1,2,\ldots,n-1$ belonging to the column $C_l$ of $S$, which can be applied on any type of data, be it categorical (nominal) and/or numeric (real or integer). We consider the following cases:

Case I: $A_{in}=A_{kn}$
Let $A$ denote the collection of all members of $S$ that belong to the same decision class to which $R_i$ and $R_k$ belong and do not have MV. Based on the type of the attribute to which the column $C_l$ belongs, the following situations arise:

(i) Elements of the column $C_l$ of $S$ are of categorical (nominal) type:
We express $A$ as a disjoint union of non-empty subsets of $A$, say $B_{\gamma_{p1}},B_{\gamma_{p2}},\ldots,B_{\gamma_{ps}}$, obtained in such a manner that every element of $A$ belongs to one of these subsets and no element of $A$ is a member of more than one subset. That is, $A=B_{\gamma_{p1}}\cup B_{\gamma_{p2}}\cup\cdots\cup B_{\gamma_{ps}}$, in which $\gamma_{p1},\gamma_{p2},\ldots,\gamma_{ps}$ denote the cardinalities of the respective subsets, with the property that each member of the same subset has the same first co-ordinate and members of no two different subsets have the same first co-ordinate. We define an index

$$I_{C_l}(R_i,R_k)=\begin{cases}\min\left\{\dfrac{\gamma_{pi}}{\gamma_{qk}},\dfrac{\gamma_{qk}}{\gamma_{pi}}\right\}, & \text{for } i\neq k;\\ 0, & \text{otherwise,}\end{cases}$$

where $\gamma_{pi}$ represents the cardinality of the subset $B_{\gamma_{pi}}$, all of whose elements have first co-ordinate $A_{il}$, and $\gamma_{qk}$ represents the cardinality of the subset $B_{\gamma_{qk}}$, all of whose elements have first co-ordinate $A_{kl}$.

(ii) Elements of the column $C_l$ of $S$ are of numeric type:
Numeric types can be classified further as integers (whole numbers) or reals (fractional numbers). If the attribute is of integer type then we follow the procedure discussed in Case I item (i). For fractional numbers we construct the index $I_{C_l}(R_i,R_k)$ based on the ratio of the values of the elements $A_{il}$, $A_{kl}$ of the $l$th column to the mean of the set of elements belonging to $A$ that do not have MV. It is given by

$$I_{C_l}(R_i,R_k)=\begin{cases}\min\left\{\dfrac{A_{il}}{A^{\#}},\dfrac{A_{kl}}{A^{\#}}\right\}, & \text{for } i\neq k;\\ 0, & \text{otherwise.}\end{cases}$$

In the above definition $A^{\#}$ denotes the average of the $l$th column entries of all the elements of the set $A$ excluding those with MV in the $l$th column.

Case II: $A_{in}\neq A_{kn}$
Clearly $R_i$ and $R_k$ belong to two different decision classes. Consider the subsets $P$ and $Q$ consisting of members of $S$ that share the same decision with $R_i$ and $R_k$ respectively and do not have MV. Clearly $P\cap Q=\emptyset$. Based on the type of the attribute to which the column $C_l$ belongs, the following situations arise:

(i) Elements of the column $C_l$ of $S$ are of nominal or categorical type:
Following the procedure discussed in Case I item (i) we write $P$ and $Q$ as disjoint unions of non-empty subsets $P_{\beta_1},P_{\beta_2},\ldots,P_{\beta_r}$ and $Q_{\delta_1},Q_{\delta_2},\ldots,Q_{\delta_s}$ respectively, in which $\beta_1,\beta_2,\ldots,\beta_r$ and $\delta_1,\delta_2,\ldots,\delta_s$ indicate the cardinalities of the respective subsets. We define the indexing measure between the two records $R_i$ and $R_k$ as

$$I_{C_l}(R_i,R_k)=\begin{cases}\max\left\{\dfrac{\beta_r}{\delta_s},\dfrac{\delta_s}{\beta_r}\right\}, & \text{for } i\neq k;\\ 0, & \text{otherwise,}\end{cases}$$

where $\beta_r$ represents the cardinality of the subset $P_{\beta_r}$ of $P$, all of whose elements have first co-ordinate $A_{il}$, and $\delta_s$ represents the cardinality of the subset $Q_{\delta_s}$ of $Q$, all of whose elements have first co-ordinate $A_{kl}$.

(ii) Elements of the column $C_l$ of $S$ are of numeric type:
If the type of the attribute is integer we follow the procedure discussed in Case II item (i). For fractional numbers we define the index $I_{C_l}(R_i,R_k)$ between the two records $R_i$ and $R_k$ as

$$I_{C_l}(R_i,R_k)=\begin{cases}\max\left\{\dfrac{A_{il}}{\Lambda},\dfrac{A_{kl}}{\Lambda}\right\}, & \text{for } i\neq k;\\ 0, & \text{otherwise.}\end{cases}$$

In the above definition $\Lambda=\min\{P^{\#},Q^{\#}\}$, where $P^{\#}$ and $Q^{\#}$ denote the averages of the $l$th column entries of all the elements of the sets $P$ and $Q$ respectively, excluding those with MV in the $l$th column.

The proximity or distance scores between the clinical records in the data set $S$ can be represented as $D=\{\{0,d_{12},\ldots,d_{1m}\};\{d_{21},0,\ldots,d_{2m}\};\ldots;\{d_{m1},d_{m2},\ldots,0\}\}$, where $d_{ik}=\sqrt{\sum_{l=1}^{n-1}I_{C_l}^{2}(R_i,R_k)}$. For each of the missing value instances in a record $R_i$ our imputation procedure first computes the score

$$z(d_{ij})=\frac{d_{ij}-\bar{d}}{\sqrt{\dfrac{1}{m-1}\sum_{k=1}^{m}\left(d_{ik}-\bar{d}\right)^{2}}},$$

where $j=1,2,\ldots,m$ and $\bar{d}$ denotes the mean distance. We then pick only those records (nearest neighbors) which satisfy the condition $z(d_{ij})\leq 0$, where $\{d_{i1},d_{i2},\ldots,d_{im}\}$ denote the distances of the current record $R_i$ to all other records in the data set $S$. If the type of the attribute is categorical or integer, then the data value that has the highest frequency (mode) of occurrence in the corresponding column of the nearest records is imputed. For data values of type real we impute the mean of the data values in the corresponding column of the nearest records.

Illustrative example: The following example illustrates the spirit of the new imputation algorithm. Consider a data set represented by the matrix $S$ consisting of rows $R_1$ = (?, 12.0, positive), $R_2$ = (yes, 10.5, positive), $R_3$ = (no, 14.0, positive) and $R_4$ = (no, 13.0, negative). These rows correspond to the data records of four individuals. The missing value instance ('?') in this data set is present in record $R_1$ and column $C_1$. Clearly Case I item (i) of the imputation algorithm applies to this data set for determining the missing value. The matrix of the indexing measure $I$ has the following rows: (0, 0.86) and (0, 0.99), in which $\gamma_p=0$, $\gamma_q=1$ and $A^{\#}=12.17$. The relative distances between $R_1$ and the records $R_2$, $R_3$, $R_4$ are computed as {0, 0, 0.93} and the corresponding z-scores are obtained as {−0.57, −0.57, 1.154}. Since $z\leq 0$ for the distances between $R_1$ and $R_2$ and also between $R_1$ and $R_3$, we conclude that the records $R_2$ and $R_3$ are nearer to $R_1$, and the highest frequency (mode) data value in column $C_1$ of these records is taken to be 'yes'.
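The arithmetic of the illustrative example can be retraced with a short Python sketch. It is only an illustration under stated assumptions, not the authors' Matlab implementation: the function names are hypothetical, the relative distances {0, 0, 0.93} are taken as given from the example rather than recomputed, and ties in the mode are resolved by first occurrence, which is an assumption of this sketch since $R_2$ and $R_3$ carry different values in $C_1$.

```python
from collections import Counter
from math import sqrt
from statistics import mean


def numeric_index_case1(a_il, a_kl, class_values):
    """Case I (ii) index for real-valued attributes: the smaller of the two
    values after dividing each by A#, the class mean of the column."""
    a_sharp = mean(class_values)
    return min(a_il / a_sharp, a_kl / a_sharp)


def z_scores(distances):
    """Standardize distances using the sample standard deviation (divisor len - 1)."""
    d_bar = mean(distances)
    spread = sqrt(sum((d - d_bar) ** 2 for d in distances) / (len(distances) - 1))
    if spread == 0:  # all records equally distant: treat every record as a neighbour
        return [0.0 for _ in distances]
    return [(d - d_bar) / spread for d in distances]


def impute(distances, neighbour_values, categorical=True):
    """Impute from the records whose z-scored distance is <= 0."""
    nearest = [v for v, z in zip(neighbour_values, z_scores(distances)) if z <= 0]
    if categorical:
        # Ties are resolved by first occurrence: an assumption of this sketch.
        return Counter(nearest).most_common(1)[0][0]
    return mean(nearest)


# Column C2 of the positive class (R1, R2, R3), used for A#.
c2_positive = [12.0, 10.5, 14.0]
index_r1_r2 = numeric_index_case1(12.0, 10.5, c2_positive)
index_r1_r3 = numeric_index_case1(12.0, 14.0, c2_positive)

# Relative distances from R1 to R2, R3, R4, taken as given from the example.
distances = [0.0, 0.0, 0.93]
c1_values = ["yes", "no", "no"]  # C1 entries of R2, R3, R4
scores = z_scores(distances)
imputed = impute(distances, c1_values)

print(round(index_r1_r2, 2), round(index_r1_r3, 2))
print([round(z, 3) for z in scores], imputed)
```

Under these assumptions the two index values round to 0.86 and 0.99, the z-scores agree with the quoted ones up to rounding, and only $R_2$ and $R_3$ pass the $z\leq 0$ filter.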
Accordingly, this value is a suitable candidate for imputation.

C. Identification of influential features

In situations presented by real world processes, influential features are often unknown a priori; hence features that are redundant or that participate only weakly in decision making must be identified and appropriately handled. Features selection procedures can be categorized as random or sequential. Sequential methods such as forward selection, backward elimination and bidirectional selection employ greedy strategies and hence may often fail to find the optimal features subsets. In contrast, stochastic optimization methods such as genetic algorithms (GAs) perform global search and are capable of effectively exploring large search spaces [17]. In our approach we adopt a wrapper subset based feature evaluation model [18], where the method of classification itself is used to measure the importance of the features subset identified by the GA.

D. Predictive modeling using decision trees

An alternating decision tree (ADT) consists of decision nodes (splitter nodes) and prediction nodes, which can be either interior nodes or leaf nodes. The tree generates a prediction node at the root and then alternates between decision nodes and further prediction nodes. Decision nodes specify a predicate condition and prediction nodes contain a single number denoting the predictive value. An instance is classified by following all paths for which all decision nodes are true and summing the relevant prediction nodes that are traversed. A positive sum implies membership of one class and a negative sum indicates membership of the opposite class.

Algorithm 1 The NM Methodology
Input:
(a) Data set for the purpose of decision making $S(m,n)$, where $m$ and $n$ are the number of records and attributes respectively. The members of $S$ may have MV in any of the attributes except the decision attribute, which is the last attribute in the record.
(b) The type of attribute $C$ of the columns in the data set.
Output:
(a) Classification accuracy for a given data set $S$.
(b) Performance metrics AUC, SE, SP.
Algorithm
(1) Identify and collect all records in a data set $S$.
(2) Impute the MV in the data set $S$ using the procedure discussed in Section III-B.
(3) Extract the influential features using a wrapper based approach with genetic search for identifying features subsets and an alternating decision tree for their evaluation, as discussed in Section III-C.
(4) Split the data set into training and testing sets using a stratified k-fold cross validation procedure. Denote each training and testing data set by $T_k$ and $R_k$ respectively.
(5) For each $k$ compute the following:
    (i) Build the ADT using the records obtained from $T_k$.
    (ii) Compute the predicted probabilities (scores) for both positive and negative diagnosis of dengue from the ADT built in Step (5)-(i) using the test data set $R_k$. Designate the set consisting of all these scores by $P$.
    (iii) Identify and collect the actual diagnosis from the test data set $R_k$ into a set denoted by $L$.
(6) Repeat Steps (5)-(i) to (5)-(iii) for each fold.
(7) Obtain the performance metrics AUC, SE and SP utilizing the sets $L$ and $P$.
(8) RETURN AUC, SE, SP.
(9) END.

IV. PERFORMANCE EVALUATION METHODS

The standard definitions of performance measures such as the specificity (SP), sensitivity (SE), receiver operating characteristic (ROC) and area under the ROC curve (AUC), based on the number of true positives, true negatives, false positives and false negatives, are utilized in our experimental analysis. We employed a stratified k-fold cross validation for estimating the test error of the classification algorithms. We randomly divided the given data set into k disjoint subsets. Each subset is of roughly equal size and has the same class proportions as the original data set. The classification model is built by setting aside one of the subsets as the test data set and training the classifier using the remaining k−1 subsets (nine when k = 10). The trained model is then employed in classifying the test data set. The experiment is repeated by setting aside each of the k subsets as the test data set one at a time. To compute the ROC for k folds we first train a classifier using the training data set of a fold and then obtain the scores, in terms of the predicted probabilities for positives and negatives, from the trained classifier using the test data set corresponding to the same fold. Once all the probabilities and the corresponding actual decisions are collected, the ROC is obtained by first computing the thresholds using the quartiles of the cumulative predictive probabilities of all the k folds. For each threshold value the measures SE and SP are computed. The false positive rate and true positive rate values of the ROC are taken as (1−SP) and SE respectively. The AUC is computed by applying a trapezoidal rule on the data points of the ROC curve. The optimal cut off or operating point is the threshold whose point on the ROC curve is closest to (0,1), which gives the equal error rate. The optimal values of AUC, SE and SP are computed for this cut off point.

V. EXPERIMENTS AND RESULTS

In our methodology we have employed a stratified ten-fold cross validation (k = 10) procedure. We applied a standard implementation of SVM with radial basis function kernel [11] using the LibSVM package [19]. The GA for features selection has been run using the parameter values: cross over probability = 1.0 and mutation probability = 0.001. The standard implementations of the C4.5 and logistic regression (LOR) algorithms in Weka [20] are considered for evaluating the performance of our algorithm. We have implemented the NM algorithm and the performance evaluation methods in Matlab. A non-parametric statistical test proposed by Wilcoxon [21] is used to compare the performances of the algorithms. We compared the NM with the state-of-the-art methodologies employed in the diagnosis of dengue fever using the different performance measures discussed in Section IV.

A. Data sets

We have obtained four surveillance data sets from case-patients admitted into hospitals located in central and western States of India. Standard procedures were adopted in collecting the clinical and demographic attributes of the patients. The probable cases of dengue fever are arrived at through definitive laboratory tests such as ELISA. The patient records include the clinical symptoms fever, fever duration, headache, retro-orbital pain (eye pain), myalgia (body pain), arthralgia (joint pain), nausea or vomiting, bleeding gums, rash, bleeding sites, restlessness and abdominal pain, and the laboratory features haemoglobin (Hb), white blood cell count (WBC), packed cell volume (PCV) and platelets. The last attribute in each data set is the decision attribute. The clinical records are then re-grouped into four data sets. The first data set (DS1) comprises 646 adults (age ≥ 16 years) with clinical symptoms and laboratory features, out of which 256 were dengue positive and 390 were dengue negative. The second data set (DS2) is a part of DS1 consisting of only clinical symptoms (ignoring the laboratory features) and has the same number of records as DS1. The third data set (DS3) consists of 398 children (aged 5 to 15 years) [9] with clinical symptoms and laboratory features, out of which 93 were dengue positive and 305 were dengue negative. The fourth data set (DS4) is a part of DS3 with only clinical symptoms and has the same number of records as DS3.

B. Results

The performance of the NM is compared with other methodologies (C4.5, SVM and LOR) on the data sets used in the present study and the classification accuracies are presented in Table I. A hundred percent accuracy is reported by NM on both DS1 and DS3. The Wilcoxon matched-pairs rank sum test results comparing the accuracies of NM with other methodologies are shown in Table II. For example, the positive rank sum of 55.0 and negative rank sum of 0.0 with a p-value < 0.01 for C4.5 using data set DS1 (first row of Table II) indicates the superior performance of the new methodology over C4.5, and similarly in respect of the other methods.

TABLE I Performance comparison of the NM with other methodologies (C4.5, SVM and LOR) on the data sets used in the present study

| Dataset | Method | Accuracy (%) | SE | SP | AUC |
|---|---|---|---|---|---|
| DS1 | NM | 100.00 | 100.00 | 100.00 | 1.00 |
| DS1 | C4.5 | 96.44 | 95.90 | 97.27 | 1.00 |
| DS1 | LOR | 91.02 | 89.49 | 93.36 | 0.96 |
| DS1 | SVM | 96.75 | 97.18 | 96.09 | 0.97 |
| DS2 | NM | 86.53 | 88.97 | 82.81 | 0.93 |
| DS2 | C4.5 | 82.35 | 87.18 | 75.00 | 0.84 |
| DS2 | LOR | 72.91 | 74.36 | 70.70 | 0.78 |
| DS2 | SVM | 78.17 | 89.49 | 60.94 | 0.75 |
| DS3 | NM | 100.00 | 100.00 | 100.00 | 1.00 |
| DS3 | C4.5 | 94.97 | 95.41 | 93.55 | 0.99 |
| DS3 | LOR | 92.71 | 92.79 | 92.47 | 0.96 |
| DS3 | SVM | 98.99 | 98.69 | 100.00 | 0.99 |
| DS4 | NM | 95.48 | 98.03 | 87.10 | 0.95 |
| DS4 | C4.5 | 90.20 | 91.48 | 86.02 | 0.91 |
| DS4 | LOR | 88.44 | 89.84 | 83.87 | 0.90 |
| DS4 | SVM | 92.71 | 98.03 | 75.27 | 0.87 |

TABLE II Wilcoxon matched-pairs rank sum test for comparing the performance of NM with other methodologies used in diagnosis of dengue fever

| Dataset | Method | Rank sum (+, −) | p-value |
|---|---|---|---|
| DS1 | C4.5 | 55.0, 0.0 | 0.002 |
| DS1 | LOR | 55.0, 0.0 | 0.002 |
| DS1 | SVM | 45.0, 0.0 | 0.004 |
| DS2 | C4.5 | 55.0, 0.0 | 0.002 |
| DS2 | LOR | 55.0, 0.0 | 0.002 |
| DS2 | SVM | 55.0, 0.0 | 0.002 |
| DS3 | C4.5 | 36.0, 0.0 | 0.008 |
| DS3 | LOR | 36.0, 0.0 | 0.008 |
| DS3 | SVM | 10.0, 0.0 | 0.125 |
| DS4 | C4.5 | 38.5, 6.5 | 0.074 |
| DS4 | LOR | 37.0, 8.0 | 0.098 |
| DS4 | SVM | 27.0, 9.0 | 0.25 |

The above comparisons and statistical tests support the effectiveness of our methodology in identifying suspected dengue both in children and adults. At the 0.05 level the differences in Table II are significant on DS1 and DS2 and against C4.5 and LOR on DS3, while the remaining comparisons (SVM on DS3 and all three methods on DS4) have p-values above 0.05. The imputation strategy employed in our methodology has improved the classification accuracies when compared with C4.5, which uses a modified information gain measure to generate the decision tree in the presence of MV. The mean imputation strategies adopted in SVM and LOR could not render classification accuracies higher than NM.

The features subsets identified by the NM are shown in Table III. The application of the features selection method reduced the number of attributes by 68.75% in DS1 (from 16 to 5) and by 87.5% in DS3 (from 16 to 2). Our methodology identified some clinical symptoms and laboratory features in adults (vomiting and abdominal pain) that differ from those in children, which is in concurrence with earlier studies [22], [23]. The clinical attribute rash was identified as an important feature in adults but not in children. This may be explained by the relative frequency of secondary infections in adults [24]. Arthralgia was found to influence the final diagnosis of dengue both in children and adults. The ROC curves comparing the performance of NM with other methodologies are shown in Figs. 1a-1d. The operating point or cut off point (p<0.001) is shown as a pentagon on each of the ROC curves. The ROC curves show the superior performance of NM over the other methods used in the diagnosis of dengue.

TABLE III Influential features subsets identified by NM

| Data set | #Original features | #Influential features | Accuracy (%) | Features identified |
|---|---|---|---|---|
| DS1 | 16 | 5 | 100.00 | retro-orbital pain, arthralgia, fever duration, platelet, fever |
| DS2 | 9 | 6 | 86.53 | vomiting or nausea, myalgia, rash, bleeding sites, abdominal pain, arthralgia |
| DS3 | 16 | 2 | 100.00 | Hb, fever |
| DS4 | 9 | 2 | 95.48 | retro-orbital pain, arthralgia |

VI. PERFORMANCE COMPARISON OF NEW IMPUTATION ALGORITHM WITH BENCHMARKING DATA SETS

Since there are no specific studies on the imputation of missing values in dengue data sets, we have utilized some benchmark data sets obtained from the KEEL and University of California, Irvine (UCI) machine learning data repositories [25], [26] to test the performance of the new imputation algorithm.
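The rank sums and p-values of the kind reported in Table II can be sanity-checked with a small exact computation. The sketch below is an assumption about how such figures arise, not the authors' code: it implements a two-sided exact Wilcoxon signed-rank test by enumerating all sign patterns, and the ten paired differences are hypothetical placeholders (any ten distinct positive per-fold accuracy gains give the same ranks).

```python
from itertools import product


def signed_rank_sums(diffs):
    """Return (W+, W-, ranks) for paired differences; zero differences are dropped."""
    nonzero = sorted((d for d in diffs if d != 0), key=abs)
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(nonzero):
        j = i
        while j < len(nonzero) and abs(nonzero[j]) == abs(nonzero[i]):
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2  # average of ranks i+1 .. j for tied values
        i = j
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    return w_plus, w_minus, ranks


def exact_two_sided_p(ranks, w_plus, w_minus):
    """Share of sign patterns whose smaller rank sum is at most the observed one."""
    observed = min(w_plus, w_minus)
    total = sum(ranks)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= observed:
            hits += 1
    return hits / 2 ** len(ranks)


# Hypothetical per-fold accuracy differences (NM minus a competitor), all positive.
diffs = [0.5 * fold for fold in range(1, 11)]
w_plus, w_minus, ranks = signed_rank_sums(diffs)
p_value = exact_two_sided_p(ranks, w_plus, w_minus)
print(w_plus, w_minus, p_value)
```

With ten folds and every difference in favor of one method, the rank sums are 55 and 0 and the exact two-sided p-value is 2/1024, which rounds to 0.002. The same reasoning with 9, 8 and 4 non-zero differences gives rank sums of 45, 36 and 10 and p-values of about 0.004, 0.008 and 0.125, in line with the corresponding rows of Table II. The enumeration is exponential in the number of pairs, so it is suitable only for small samples such as these.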
The Wilcoxon statistics in Table IV are computed by comparing the accuracies obtained with the new imputation algorithm against those obtained with other imputation algorithms, using a C4.5 decision tree in each case. The results in Table IV demonstrate that our algorithm is superior to the other imputation algorithms, as the positive rank sums are higher than the negative rank sums (p < 0.05) in all the cases.

Fig. 1. ROC curves: (a) DS1, (b) DS2, (c) DS3, (d) DS4.

TABLE IV
Wilcoxon signed rank statistics for matched pairs comparing the new imputation algorithm with other imputation methods using C4.5 decision tree

Method   Rank sums (+, -)   Test statistic   Critical value   p-value
FKMI     78.5, 12.5         12.5             18               0.021
KMI      85.0, 6.0          6                18               0.003
KNNI     76.0, 15.0         15               18               0.032
WKNNI    83.0, 8.0          8                18               0.006

VII. COMPUTATIONAL COMPLEXITY

The computational complexity is a measure of the performance of the algorithm. For each data set having $n$ attributes and $m$ records, we select only the subset of $m_1 \le m$ records in which missing values are present. The distances are computed for all $n$ attributes excluding the decision attribute. So, the time complexity for computing the distance would be $O(m_1 (n-1))$. The time complexity for selecting the nearest records is of order $O(m_1)$. For computing the frequency of occurrences for nominal attributes and the average for numeric attributes, the time taken would be of the order $O(m_1)$. Therefore, for a given data set with $k$-fold cross validation having $n$ attributes and $m$ records, the time complexity of our new imputation algorithm would be $k \, (O(m_1 (n-1) \, m) + 2 \, O(m_1))$, which is asymptotically linear. Our experiments were conducted on a personal computer having an Intel(R) Core(TM) 2 Duo CPU @ 2.93 GHz processor with 4 GB RAM. For each data set the computational time for imputation and features selection is measured as the elapsed CPU time in seconds. Based on the results, we obtain a scatter plot (red line in Fig. 2) between the varying database sizes and the time taken by NM. Also, we employed a linear regression on our results and obtained the relation between the time taken ($T$) and the data size ($D$) as $T = 0.96D + 5.54$, with $\alpha = 0.05$, $p < 0.05$, $r^2 = 0.98$. The presence of the linear trend between the time taken and the varying database sizes ensures the numerical scalability of the performance of NM in terms of asymptotic linearity.

Fig. 2. Computational complexity of the NM.

VIII. COMPARISON OF RELATED METHODOLOGIES ON DENGUE STUDIES

In this section we compare the results (Table V) obtained in [8]–[10] with the results of our new methodology on the current data set of 1044 individuals including children and adults. As compared to [9], where children with rash had an SE of 41.2% and SP of 95.5%, our methodology when applied on the data set DS2 resulted in an accuracy of 86.53%, SE of 88.97% and SP of 82.81%, which is considered to be a good classification model as both SE and SP are higher than 80%. In [10] both clinical and laboratory features were utilized to develop decision rules using a C4.5 decision tree, and the authors reported an SE of 87.8% and SP of 75.7%. In comparison to [10], our methodology when applied on DS1 and DS3 resulted in SE of 100% and SP of 100%. From these comparisons we conclude that the new methodology presented in this study, if applied on the data sets used in [8]–[10], would yield more accurate results.

TABLE V
Evaluation of NM with other related methodologies on dengue studies

Study (feature set)                          # Patients (DF)  Records with MV          Methods  Accuracy (%)  SE (%)  SP (%)
Chadwick et al. [8] (clinical)               381 (148)        deleted                  LOR, OR  84.5          84      85
Chadwick et al. [8] (laboratory)             381 (148)        deleted                  -do-     76.5          74      79
Ramos et al. [9] (clinical, children)        938 (38)         manual update            -do-     68.95         41.2    95.5
Tanner et al. [10] (laboratory)              1200 (173)       deleted                  C4.5     81.75         87.8    75.7
Gomes et al. [11] (gene database)            20 (15)          -                        SVM      85            -       -
NM (DS1) (adults, clinical & laboratory)     1044 (256)       imputed (new algorithm)  ADT, GA  100           100     100
NM (DS2) (adults, clinical)                  1044 (256)       -do-                     -do-     86.53         88.97   82.81
NM (DS3) (children, clinical & laboratory)   1044 (93)        -do-                     -do-     100           100     100
NM (DS4) (children, clinical)                1044 (305)       -do-                     -do-     95.48         98.03   87.10

IX. CONCLUSIONS AND DISCUSSION

A new methodology (NM) with built-in features for imputation of missing values and identification of influential attributes is discussed. The NM has outperformed the state-of-the-art methodologies in diagnosis of dengue fever on all the four data sets considered in our experiments. The NM has generated a decision tree with an accuracy of 100.0% in children and adults using both clinical and laboratory features. Based on the performance measures we conclude that the use of the new imputation strategy and features selection methods with wrapper based subset evaluation using genetic search has improved the accuracies of the predictions. Though the new methodology discussed in this paper may be taken as a universal tool for the effective diagnosis of this disease, it remains to be seen whether or not this methodology is geographically independent. However, we are willing to share our predictive methodologies and strategies with the researchers working on dengue fever all over the globe. We hold the view that more intensive and introspective studies of this kind will pave the way for better clinical management and virological surveillance of dengue fever.

ACKNOWLEDGMENTS

We thank the Associate Editor and the anonymous reviewers for their constructive suggestions on our paper. This research is supported by the Foundation for Scientific Research and Technological Innovation (FSRTI) - A Constituent Division of Sri Vadrevu Seshagiri Rao Memorial Charitable Trust, Hyderabad - 500 035, India.

REFERENCES

[1] D. Gubler, “Dengue and dengue hemorrhagic fever,” Clinical Microbiology Reviews, vol. 11, pp. 480–96, 1998.
[2] T. P. Monath, “Dengue: The risk to developed and developing countries,” Proceedings of the National Academy of Sciences of the United States of America, vol. 91(7), pp. 2395–2400, 1994.
[3] S. De Paula and B. Fonseca, “Dengue: a review of the laboratory tests a clinician must know to achieve a correct diagnosis,” Braz J Infect Dis., vol. 8(6), pp. 390–398, 2004.
[4] S.-T. Chen, Y.-L. Lin, M.-T. Huang, M.-F. Wu, S.-C. Cheng, H.-Y. Lei, C.-K. Lee, T.-W. Chiou, C.-H. Wong, and S.-L. Hsieh, “CLEC5A is critical for dengue-virus-induced lethal disease,” Nature, vol. 453(7195), pp. 672–676, 2008.
[5] F. Ibrahim, N. Ismail, M. Taib, and A. W. Wan, “Modeling of hemoglobin in dengue fever and dengue hemorrhagic fever using bioelectrical impedance,” Physiol Meas, vol. 25(3), pp. 607–15, 2004.
[6] M. N. Colleen, A. G. William, L. K. Merril, C. D. Naylor, and S. L. Duncan, “Dealing with missing data in observational health care outcome analyses,” Journal of Clinical Epidemiology, vol. 53(4), pp. 377–383, 2000.
[7] K. J. Cios and W. Moore, “Uniqueness of medical data mining,” Artificial Intelligence in Medicine, vol. 26, pp. 1–24, 2002.
[8] D. Chadwick, B. Arch, A. Wilder-Smith, and N. Paton, “Distinguishing dengue fever from other infections on the basis of simple clinical and laboratory features: application of logistic regression analysis,” J Clin Virology, vol. 35(2), pp. 147–153, 2006.
[9] M. M. Ramos, K. M. Tomashek, D. F. Arguello, C. Luxemburger, L. Quiñones, J. Lang, and J. L. Muñoz-Jordan, “Early clinical features of dengue infection in Puerto Rico,” Transactions of the Royal Society of Tropical Medicine and Hygiene, vol. 103(9), pp. 878–884, 2009.
[10] L. Tanner, M. Schreiber, J. Low, A. Ong, and T. Tolfvenstam, “Decision tree algorithms predict the diagnosis and outcome of dengue fever in the early phase of illness,” PLoS Negl Trop Dis, vol. 2(3), pp. 1–9, 2008.
[11] A. L. V. Gomes, L. J. K. Wee, A. M. Khan, L. H. V. G. Gil, E. T. A. Marques, Jr, C. E. Calzavara-Silva, and T. W. Tan, “Classification of dengue fever patients based on gene expression data using support vector machines,” PLoS ONE, vol. 5(6), pp. 1–7, 2010.
[12] G. Batista and M. Monard, “An analysis of four missing data treatment methods for supervised learning,” Applied Artificial Intelligence, vol. 17, no. 5, pp. 519–533, 2003.
[13] J. Deogun, W. Spaulding, B. Shuart, and D. Li, “Towards missing data imputation: A study of fuzzy k-means clustering method,” in 4th International Conference of Rough Sets and Current Trends in Computing (RSCTC'04), ser. Lecture Notes on Computer Science, vol. 3066. Lecture Notes In Computer Science, 2004, pp. 573–579.
[14] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. Altman, “Missing value estimation methods for DNA microarrays,” Bioinformatics, vol. 17, pp. 520–525, 2001.
[15] Y. Freund and L. Mason, “The alternating decision tree learning algorithm,” in Proceeding of the Sixteenth International Conference on Machine Learning, Bled, Slovenia. ACM, 1999.
[16] U. Tadashi, M. Yoshihide, K. Daichi, S. Masami, and K. Kenji, “Fast multidimensional nearest neighbor search algorithm based on ellipsoid distance,” International Journal of Advanced Intelligence, vol. 1(1), pp. 89–107, 2009.
[17] D. E. Goldberg, Genetic algorithms in search, optimization and machine learning. Addison-Wesley, 1989.
[18] K. Ron and H. J. George, “Wrappers for feature subset selection,” Artificial Intelligence, vol. 97, pp. 273–324, 1997.
[19] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2(3), pp. 1–27, 2001.
[20] I. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, 2005.
[21] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1(6), pp. 80–83, 1945.
[22] J. G.-R. Enid and G. R.-P. Jos, “Dengue severity in the elderly in Puerto Rico,” Pan Am J Public Health, vol. 13(6), pp. 362–368, 2003.
[23] W. Ole, H. Suchat, B. Chureeratana, C. Kesinee, S. Yoawalark, and P. Sasithon, “Risk factors and clinical features associated with severe dengue infection in adults and children during the 2001 epidemic in Chonburi, Thailand,” Tropical Medicine and International Health, vol. 9(9), pp. 1022–1029, 2004.
[24] C. Cobra, J. G. Rigau-Pérez, G. Kuno, and V. Vorndam, “Symptoms of dengue fever in relation to host immunologic response and virus serotype, Puerto Rico, 1990–1991,” American Journal of Epidemiology, vol. 142(11), pp. 1204–1211, 1995.
[25] A. Alcalá-Fdez, A. Fernandez, Luengo, J. Derrac, S. G. J., L. Sánchez, and F. Herrera, “Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic and Soft Computing, 2010.
[26] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010. [Online]. Available: http://archive.ics.uci.edu/ml
