Journal of Machine Learning Research 8 (2007) 891-933    Submitted 1/06; Revised 12/06; Published 5/07

Anytime Learning of Decision Trees

Saher Esmeir    [email protected]
Shaul Markovitch    [email protected]
Department of Computer Science
Technion—Israel Institute of Technology
Haifa 32000, Israel

Editor: Claude Sammut

Abstract

The majority of existing algorithms for learning decision trees are greedy: a tree is induced top-down, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Even the few non-greedy learners cannot learn good trees when the concept is difficult. Furthermore, they require a fixed amount of time and are not able to generate a better tree if additional time is available. We introduce a framework for anytime induction of decision trees that overcomes these problems by trading computation speed for better tree quality. Our proposed family of algorithms employs a novel strategy for evaluating candidate splits. A biased sampling of the space of consistent trees rooted at an attribute is used to estimate the size of the minimal tree under that attribute, and an attribute with the smallest expected tree is selected. We present two types of anytime induction algorithms: a contract algorithm that determines the sample size on the basis of a pre-given allocation of time, and an interruptible algorithm that starts with a greedy tree and continuously improves subtrees by additional sampling. Experimental results indicate that, for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available.

Keywords: anytime algorithms, decision tree induction, lookahead, hard concepts, resource-bounded reasoning

1. Introduction

Assume that a medical center has decided to use medical records of previous patients in order to build an automatic diagnostic system for a particular disease. The center applies the C4.5 algorithm to thousands of records, and after a few seconds receives a decision tree. During the coming months, or even years, the same induced decision tree will be used to predict whether patients have or do not have the disease. Obviously, the medical center is willing to wait much longer to obtain a better tree, either more accurate or more comprehensible.

Consider also a planning agent that has to learn a decision tree from a given set of examples, while the time at which the model will be needed by the agent is not known in advance. In this case, the agent would like the learning procedure to learn the best tree it can until it is interrupted and queried for a solution.

In both of the above scenarios, the learning algorithm is expected to exploit additional time allocation to produce a better tree. In the first case, the additional time is allocated in advance. In the second, it is not. Similar resource-bounded reasoning situations may occur in many real-life applications such as game playing, planning, stock trading and e-mail filtering. In this work, we introduce a framework for exploiting extra time, preallocated or not, in order to learn better models.
Despite the recent progress in advanced induction algorithms such as SVM (Vapnik, 1995), decision trees are still considered attractive for many real-life applications, mostly due to their interpretability (Hastie et al., 2001, chap. 9). Craven (1996) lists several reasons why the understandability of a model by humans is an important criterion for evaluating it. These reasons include, among others, the possibility for human validation of the model and generation of human-readable explanations for the classifier predictions. When classification cost is important, decision trees may be attractive in that they ask only for the values of the features along a single path from the root to a leaf. In terms of accuracy, decision trees have been shown to be competitive with other classifiers for several learning tasks.

The majority of existing methods for decision tree induction build a tree top-down and use local measures in an attempt to produce small trees, which, by Occam's Razor (Blumer et al., 1987), should have better predictive power. The well-known C4.5 algorithm (Quinlan, 1993) uses the gain ratio as a heuristic for predicting which attribute will yield a smaller tree. Several other alternative local greedy measures have been developed, among which are ID3's information gain, Gini index (Breiman et al., 1984), and chi-square (Mingers, 1989). Mingers (1989) reports an empirical comparison of several measures, and concludes that the predictive accuracy of the induced trees is not sensitive to the choice of split measure and even random splits do not significantly decrease accuracy. Buntine and Niblett (1992) present additional results on further domains and conclude that while random splitting leads to inferior trees, the information gain and Gini index measures are statistically indistinguishable.

The top-down methodology has the advantage of evaluating a potential attribute for a split in the context of the attributes associated with the nodes above it. The local greedy measures, however, consider each of the remaining attributes independently, ignoring potential interaction between different attributes (Mingers, 1989; Kononenko et al., 1997; Kim and Loh, 2001). We refer to the family of learning tasks where the utility of a set of attributes cannot be recognized by examining only subsets of it as tasks with a strong interdependency. When learning a problem with a strong interdependency, greedy measures can lead to a choice of non-optimal splits. To illustrate the above, let us consider the 2-XOR problem a1 ⊕ a2 with two additional irrelevant attributes, a3 and a4. Assume that the set of examples is as listed in Figure 1(a). We observe that gain_1 of the irrelevant attribute a4 is the highest:

    0.13 = gain_1(a_4) > gain_1(a_1) = gain_1(a_2) = gain_1(a_3) = 0.02,

and hence ID3 would choose attribute a4 first. Figure 1(b) gives the decision tree as produced by ID3. Any positive instance with value 0 for a4 would be misclassified by this decision tree. In the general case of parity concepts, the information gain measure is unable to distinguish between the relevant and irrelevant attributes because neither has a positive gain. Consequently, the learner will grow an overcomplicated tree with splits on irrelevant variables that come either in addition to or instead of the desired splits.

    a1  a2  a3  a4  label
     1   0   0   1    +
     0   1   0   0    +
     0   0   0   0    -
     1   1   0   0    -
     0   1   1   1    +
     0   0   1   1    -
     1   0   1   1    +

    (a) A set of training instances.    (b) ID3's performance (tree not reproduced).

Figure 1: Learning the 2-XOR concept a1 ⊕ a2, where a3 and a4 are irrelevant.
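For illustration, the gain values above can be recomputed from the table in Figure 1(a) with a few lines of Python. The sketch below is not the implementation used in the experiments reported here; it assumes base-2 logarithms and represents examples as (feature-tuple, label) pairs, a convention reused by the later sketches.

    import math
    from collections import Counter

    def entropy(labels):
        # class entropy I(P(c_1), ..., P(c_n)) of a list of labels
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain_1(examples, attr):
        # standard one-level information gain of splitting on attr
        labels = [label for _, label in examples]
        split_entropy = 0.0
        for value in {features[attr] for features, _ in examples}:
            subset = [label for features, label in examples if features[attr] == value]
            split_entropy += len(subset) / len(examples) * entropy(subset)
        return entropy(labels) - split_entropy

    # rows of Figure 1(a); features are (a1, a2, a3, a4)
    data = [((1, 0, 0, 1), '+'), ((0, 1, 0, 0), '+'), ((0, 0, 0, 0), '-'),
            ((1, 1, 0, 0), '-'), ((0, 1, 1, 1), '+'), ((0, 0, 1, 1), '-'),
            ((1, 0, 1, 1), '+')]

    for i in range(4):
        print('gain_1(a%d) = %.2f' % (i + 1, gain_1(data, i)))
    # prints approximately 0.02, 0.02, 0.02 and 0.13: a4 looks best to ID3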
The problem of finding the smallest consistent tree¹ is known to be NP-complete (Hyafil and Rivest, 1976; Murphy and McCraw, 1991). In many applications that deal with hard problems, we are ready to allocate many more resources than required by simple greedy algorithms, but still cannot afford algorithms of exponential complexity. One commonly proposed approach for hard problems is anytime algorithms (Boddy and Dean, 1994), which can trade computation speed for quality. Quinlan (1993, chap. 11) recognized the need for this type of anytime algorithm for decision tree learning: "What is wanted is a resource constrained algorithm that will do the best it can within a specified computational budget and can pick up threads and continue if this budget is increased. This would make a challenging thesis topic!"

1. A consistent decision tree is a tree that correctly classifies all training examples.

There are two main classes of anytime algorithms, namely contract and interruptible (Russell and Zilberstein, 1996). A contract algorithm is one that gets its resource allocation as a parameter. If interrupted at any point before the allocation is completed, it might not yield any useful results. An interruptible algorithm is one whose resource allocation is not given in advance and thus must be prepared to be interrupted at any moment. While the assumption of preallocated resources holds for many induction tasks, in many other real-life applications it is not possible to allocate the resources in advance. Therefore, in our work, we are interested both in contract and interruptible decision tree learners.

In this research, we suggest exploiting additional time resources by performing lookahead. Lookahead search is a well-known technique for improving greedy algorithms (Sarkar et al., 1994). When applied to decision tree induction, lookahead attempts to predict the profitability of a split at a node by estimating its effect on deeper descendants of the node. One of the main disadvantages of the greedy top-down strategy is that the effect of a wrong split decision is propagated down to all the nodes below it (Hastie et al., 2001, chap. 9). Lookahead search attempts to predict and avoid such non-contributive splits during the process of induction, before the final decision at each node is made.

Lookahead techniques have been applied to decision tree induction by several researchers. The reported results vary from "lookahead produces better trees" (Norton, 1989; Ragavan and Rendell, 1993; Dong and Kothari, 2001) to "lookahead does not help and can hurt" (Murthy and Salzberg, 1995). One problem with these works is their use of a uniform, fixed low-depth lookahead, which disqualifies the proposed algorithms from serving as anytime algorithms. Another problem is the data sets on which the lookahead methods were evaluated. For simple learning tasks, such as induction of conjunctive concepts, greedy methods perform quite well and no lookahead is needed. However, for more difficult concepts such as XOR, the greedy approach is likely to fail. A few other non-greedy decision tree learners have recently been introduced (Bennett, 1994; Utgoff et al., 1997; Papagelis and Kalles, 2001; Page and Ray, 2003). These works, however, are not capable of handling high-dimensional difficult concepts and are not designed to offer anytime behavior.² The main challenge we face in this work is to make use of extra resources to induce better trees for hard concepts.

2. In Section 5 we discuss related work in detail.

    Procedure TDIDT(E, A)
      If E = ∅
        Return Leaf(nil)
      If ∃c such that ∀e ∈ E: Class(e) = c
        Return Leaf(c)
      a ← CHOOSE-ATTRIBUTE(A, E)
      V ← domain(a)
      Foreach v_i ∈ V
        E_i ← {e ∈ E | a(e) = v_i}
        S_i ← TDIDT(E_i, A − {a})
      Return Node(a, {⟨v_i, S_i⟩ | i = 1 ... |V|})

Figure 2: Procedure for top-down induction of decision trees. E stands for the set of examples and A stands for the set of attributes.
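For concreteness, the schema of Figure 2 can be rendered in Python as follows. This is an illustrative sketch rather than the implementation used in the experiments: it splits only on attribute values observed in the data and assumes a consistent training set, so the recursion always reaches pure leaves. The choose_attribute argument is left pluggable; it is exactly where ID3, SID3 and LSID3 differ.

    def tdidt(examples, attributes, choose_attribute):
        # examples: list of (features, label); attributes: set of feature indices
        if not examples:
            return ('leaf', None)                       # Leaf(nil)
        labels = {label for _, label in examples}
        if len(labels) == 1:                            # all examples share one class
            return ('leaf', labels.pop())
        a = choose_attribute(examples, attributes)
        children = {}
        for value in {features[a] for features, _ in examples}:
            subset = [(f, l) for f, l in examples if f[a] == value]
            children[value] = tdidt(subset, attributes - {a}, choose_attribute)
        return ('node', a, children)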
Note that in contrast to incremental induction (Utgoff, 1989), we restrict ourselves in this paper to a batch setup where all the training instances are available beforehand.

2. Contract Anytime Induction of Decision Trees

TDIDT (top-down induction of decision trees) methods start from the entire set of training examples, partition it into subsets by testing the value of an attribute, and then recursively call the induction algorithm for each subset. Figure 2 formalizes the basic algorithm for TDIDT. We focus first on consistent trees, for which the stopping criterion for the top-down recursion is that all the examples have the same class label. Later, we consider pruning, which allows simpler trees at the cost of possible inconsistency with the training data (Breiman et al., 1984; Quinlan, 1993).

In this work we propose investing more time resources for making better split decisions. We first discuss tree size as a desired property of the tree to be learned and then we describe an anytime algorithm that uses sampling methods to obtain smaller trees.

2.1 Inductive Bias in Decision Tree Induction

The hypothesis space of TDIDT is huge and a major question is what strategy should be followed to direct the search. In other words, we need to decide what our preference bias (Mitchell, 1997, chap. 3) will be. This preference bias will be expressed in the CHOOSE-ATTRIBUTE procedure that determines which tree is explored next.

Ultimately, we would like to follow a policy that maximizes the accuracy of the tree on unseen examples. However, since these examples are not available, a heuristic should be used. Motivated by Occam's Razor, a widely adopted approach is to prefer smaller trees. The utility of this principle to machine learning algorithms has been the subject of a heated debate. Several studies attempted to justify Occam's razor with theoretical and empirical arguments (Blumer et al., 1987; Quinlan and Rivest, 1989; Fayyad and Irani, 1990). But a number of recent works have questioned the utility of Occam's razor, and provided theoretical and experimental evidence against it.

Quinlan and Cameron-Jones (1995) provided empirical evidence that oversearching might result in less accurate rules. Experimental results with several UCI data sets indicate that the complexity of the produced theories does not correlate well with their accuracy, a finding that is inconsistent with Occam's Razor. Schaffer (1994) proved that no learning bias can outperform another bias over the space of all possible learning tasks. This looks like theoretical evidence against Occam's razor. Rao et al. (1995), however, argued against the applicability of this result to real-world problems by questioning the validity of its basic assumption about the uniform distribution of possible learning tasks. Webb (1996) presented C4.5X, an extension to C4.5 that uses similarity considerations to further specialize consistent leaves. Webb reported an empirical evaluation which shows that C4.5X has a slight advantage in a few domains and argued that these results discredit Occam's thesis.

Murphy and Pazzani (1994) reported a set of experiments in which all the possible consistent decision trees were produced and showed that, for several tasks, the smallest consistent decision tree had higher predictive error than slightly larger trees. However, when the authors compared the likelihood of better generalization for smaller vs. more complex trees, they concluded that simpler hypotheses should be preferred when no further knowledge of the target concept is available. The small number of training examples relative to the size of the tree that perfectly describes the concept might explain why, in these cases, the smallest tree did not generalize best.
Another reason could be that only small spaces of decision trees were explored. To verify these explanations, we use a similar experimental setup where the data sets have larger training sets and attribute vectors of higher dimensionality. Because the number of all possible consistent trees is huge, we use a Random Tree Generator (RTG) to sample the space of trees obtainable by TDIDT algorithms. RTG builds a tree top-down and chooses the splitting attribute at random.

We report the results for three data sets: XOR-5, Tic-tac-toe, and Zoo (see Appendix A for detailed descriptions of these data sets). For each data set, the examples were partitioned into a training set (90%) and a testing set (10%), and RTG was used to generate a sample of 10 million consistent decision trees. The behavior in the three data sets is similar: the accuracy monotonically decreases with the increase in the size of the trees (number of leaves), confirming the utility of Occam's Razor. Further experiments with a variety of data sets indicate that the inverse correlation between size and accuracy is statistically significant (Esmeir and Markovitch, 2007).

It is important to note that smaller trees have several advantages aside from their probably greater accuracy, such as greater statistical evidence at the leaves, better comprehensibility, lower storage costs, and faster classification (in terms of total attribute evaluations).

Motivated by the above discussion, our goal is to find the smallest consistent tree. In TDIDT, the learner has to choose a split at each node, given a set of examples E that reach the node and a set of unused attributes A. For each attribute a, let T_min(a, A, E) be the smallest tree rooted at a that is consistent with E and uses attributes from A − {a} for the internal splits. Given two candidate attributes a1 and a2, we obviously prefer the attribute whose associated T_min(a) is the smaller. For a set of attributes A, we define Ã to be the set of attributes whose associated T_min(a) is the smallest. Formally, Ã = argmin_{a ∈ A} T_min(a). We say that a splitting attribute a is optimal if a ∈ Ã. Observe that if the learner makes an optimal decision at each node, then the final tree is necessarily globally optimal.

2.2 Fixed-depth Lookahead

One possible approach for improving greedy TDIDT algorithms is to look ahead in order to examine the effect of a split deeper in the tree. ID3 uses entropy to test the effect of using an attribute one level below the current node. This can be extended to allow measuring the entropy at any depth k below the current node. This approach was the basis for the IDX algorithm (Norton, 1989). The recursive definition minimizes the (k−1)-entropy for each child and computes their weighted average. Figure 3 describes an algorithm for computing entropy_k and its associated gain_k. Note that the gain computed by ID3 is equivalent to gain_k for k = 1. We refer to this lookahead-based variation of ID3 as ID3-k. At each tree node, ID3-k chooses the attribute that maximizes gain_k.³

3. If two attributes yield the same decrease in entropy, we prefer the one whose associated lookahead tree is shallower.

    Procedure ENTROPY-K(E, A, a, k)
      If k = 0
        Return I(P_E(c_1), ..., P_E(c_n))
      V ← domain(a)
      Foreach v_i ∈ V
        E_i ← {e ∈ E | a(e) = v_i}
        Foreach a' ∈ A
          A' ← A − {a'}
          h_i(a') ← ENTROPY-K(E_i, A', a', k − 1)
      Return Σ_{i=1..|V|} (|E_i| / |E|) · min_{a' ∈ A} h_i(a')

    Procedure GAIN-K(E, A, a, k)
      Return I(P_E(c_1), ..., P_E(c_n)) − ENTROPY-K(E, A, a, k)

Figure 3: Procedures for computing entropy_k and gain_k for attribute a.
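A direct Python transcription of the procedures in Figure 3 is sketched below, reusing entropy() and the example format of the earlier sketches. Two small liberties are taken for illustration: the attribute currently being tested is excluded from the lookahead set, and an exhausted attribute set falls back to the plain class entropy.

    def entropy_k(examples, attributes, a, k):
        # weighted entropy after splitting on a and looking ahead k levels
        if k == 0:
            return entropy([label for _, label in examples])
        total = 0.0
        for value in {f[a] for f, _ in examples}:
            subset = [(f, l) for f, l in examples if f[a] == value]
            below = attributes - {a}
            if below:
                best = min(entropy_k(subset, below - {b}, b, k - 1) for b in below)
            else:
                best = entropy([label for _, label in subset])
            total += len(subset) / len(examples) * best
        return total

    def gain_k(examples, attributes, a, k):
        return entropy([label for _, label in examples]) - entropy_k(examples, attributes, a, k)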
ID3-k can be viewed as a contract anytime algorithm parameterized by k. However, despite its ability to exploit additional resources when available, the anytime behavior of ID3-k is problematic. The runtime of ID3-k grows exponentially as k increases.⁴ As a result, the gap between the points of time at which the resulting tree can improve grows wider, limiting the algorithm's flexibility. Furthermore, it is quite possible that looking ahead to depth k will not be sufficient to find an optimal split. Entropy_k measures the weighted average of the entropy in depth k but does not give a direct estimation of the resulting tree size. Thus, when k < |A|, this heuristic is more informed than entropy_1 but can still produce misleading results. In most cases we do not know in advance what value of k would be sufficient for correctly learning the target concept. Invoking exhaustive lookahead, that is, lookahead to depth k = |A|, will obviously lead to optimal splits, but its computational costs are prohibitively high. In the following subsection, we propose an alternative approach for evaluating attributes that overcomes the above-mentioned drawbacks of ID3-k.

4. For example, in the binary case, ID3-k explores \prod_{i=0}^{k-1} (n-i)^{2^i} combinations of attributes.

2.3 Estimating Tree Size by Sampling

Motivated by the advantages of smaller decision trees, we introduce a novel algorithm that, given an attribute a, evaluates it by estimating the size of the minimal tree under it. The estimation is based on Monte-Carlo sampling of the space of consistent trees rooted at a. We estimate the minimum by the size of the smallest tree in the sample. The number of trees in the sample depends on the available resources, where the quality of the estimation is expected to improve with the increased sample size.

One way to sample the space of trees is to repeatedly produce random consistent trees using the RTG procedure. Since the space of consistent decision trees is large, such a sample might be a poor representative and the resulting estimation inaccurate. We propose an alternative sampling approach that produces trees of smaller expected size. Such a sample is likely to lead to a better estimation of the minimum.

Our approach is based on a new tree generation algorithm that we designed, called Stochastic ID3 (SID3). Using this algorithm repeatedly allows the space of "good" trees to be sampled semi-randomly. In SID3, rather than choose an attribute that maximizes the information gain, we choose the splitting attribute randomly. The likelihood that an attribute will be chosen is proportional to its information gain.⁵ However, if there are attributes that decrease the entropy to zero, then one of them is picked randomly. The attribute selection procedure of SID3 is listed in Figure 4.

5. We make sure that attributes with gain of zero will have a positive probability of being selected.

    Procedure SID3-CHOOSE-ATTRIBUTE(E, A)
      Foreach a ∈ A
        p(a) ← gain_1(E, a)
      If ∃a such that entropy_1(E, a) = 0
        a* ← choose attribute at random from {a ∈ A | entropy_1(E, a) = 0}
      Else
        a* ← choose attribute at random from A;
             for each attribute a, the probability of selecting it is proportional to p(a)
      Return a*

Figure 4: Attribute selection in SID3.
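The selection rule of Figure 4 translates almost line by line into Python. The sketch below reuses gain_1 and entropy from the earlier sketches; the small epsilon weight given to zero-gain attributes is one possible way to realize footnote 5, not necessarily the one used in our implementation.

    import random

    def sid3_choose_attribute(examples, attributes):
        attrs = list(attributes)
        base = entropy([label for _, label in examples])
        gains = {a: gain_1(examples, a) for a in attrs}
        # attributes whose one-level split brings the entropy down to zero
        perfect = [a for a in attrs if abs(base - gains[a]) < 1e-12]
        if perfect:
            return random.choice(perfect)
        weights = [max(gains[a], 1e-9) for a in attrs]   # keep zero-gain attributes selectable
        return random.choices(attrs, weights=weights, k=1)[0]

    # A single SID3 tree is then simply:
    #   tdidt(examples, attributes, sid3_choose_attribute)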
To show that SID3 is a better sampler than RTG, we repeated our sampling experiment (Section 2.1) using SID3 as the sample generator. Figure 5 compares the frequency curves of RTG and SID3. Relative to RTG, the graph for SID3 is shifted to the left, indicating that the SID3 trees are clearly smaller. Next, we compared the average minimum found for samples of different sizes. Figure 6 shows the results. For the three data sets, the minimal size found by SID3 is strictly smaller than the value found by RTG. Given the same budget of time, RTG produced, on average, samples that are twice as large as that of SID3. However, even when the results are normalized (dashed line), SID3 is still superior.

Figure 5: Frequency curves for the XOR-5 (left), Tic-tac-toe (center), and Zoo (right) data sets (plots not reproduced).

Figure 6: The minimum as estimated by SID3 and RTG as a function of the sample size. The data sets are XOR-5 (left), Tic-tac-toe (center) and Zoo (right). The dashed line describes the results for SID3 normalized by time (plots not reproduced).

Having decided about the sampler, we are ready to describe our proposed contract algorithm, Lookahead-by-Stochastic-ID3 (LSID3). In LSID3, each candidate split is evaluated by the estimated size of the subtree under it. To estimate the size under an attribute a, LSID3 partitions the set of examples according to the values a can take and repeatedly invokes SID3 to sample the space of trees consistent with each subset. Summing up the minimal tree size for each subset gives an estimation of the minimal total tree size under a.

LSID3 is a contract algorithm parameterized by r, the sample size. LSID3 with r = 0 is defined to choose the splitting attribute using the standard ID3 selection method. Figure 7 illustrates the choice of splitting attributes as made by LSID3. In the given example, SID3 is called twice for each subset and the evaluation of the examined attribute a is the sum of the two minima: min(4, 3) + min(2, 6) = 5. The method for choosing a splitting attribute is formalized in Figure 8.

Figure 7: Attribute evaluation using LSID3. The estimated subtree size for a is min(4, 3) + min(2, 6) = 5 (illustration not reproduced).

    Procedure LSID3-CHOOSE-ATTRIBUTE(E, A, r)
      If r = 0
        Return ID3-CHOOSE-ATTRIBUTE(E, A)
      Foreach a ∈ A
        Foreach v_i ∈ domain(a)
          E_i ← {e ∈ E | a(e) = v_i}
          min_i ← ∞
          Repeat r times
            min_i ← min(min_i, |SID3(E_i, A − {a})|)
        total_a ← Σ_{i=1..|domain(a)|} min_i
      Return a for which total_a is minimal

Figure 8: Attribute selection in LSID3.
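Continuing the running sketches, the procedure of Figure 8 might be written as follows. Counting tree size by the number of leaves is a choice made here for illustration, matching the size measure used in Section 2.1.

    def tree_size(tree):
        # number of leaves of a tree built by the tdidt sketch
        if tree[0] == 'leaf':
            return 1
        return sum(tree_size(child) for child in tree[2].values())

    def lsid3_choose_attribute(examples, attributes, r):
        if r == 0:                                      # fall back to plain ID3
            return max(attributes, key=lambda a: gain_1(examples, a))
        best_attr, best_total = None, float('inf')
        for a in attributes:
            total = 0
            for value in {f[a] for f, _ in examples}:
                subset = [(f, l) for f, l in examples if f[a] == value]
                total += min(tree_size(tdidt(subset, attributes - {a},
                                             sid3_choose_attribute))
                             for _ in range(r))
            if total < best_total:
                best_attr, best_total = a, total
        return best_attr

    # Usage: tdidt(examples, attrs, lambda E, A: lsid3_choose_attribute(E, A, r=5))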
To analyze the time complexity of LSID3, let m be the total number of examples and n be the total number of attributes. For a given node y, we denote by n_y the number of candidate attributes at y, and by m_y the number of examples that reach y. ID3, at each node y, calculates the gain for n_y attributes using m_y examples, that is, the complexity of choosing an attribute is O(n_y · m_y). At level i of the tree, the total number of examples is bounded by m and the number of attributes to consider is n − i. Thus, it takes O(m · (n − i)) to find the splits for all nodes in level i. In the worst case the tree will be of depth n and hence the total runtime complexity of ID3 will be O(m · n²) (Utgoff, 1989). Shavlik et al. (1991) reported for ID3 an empirically based average-case complexity of O(m · n).

It is easy to see that the complexity of SID3 is similar to that of ID3. LSID3(r) invokes SID3 r times for each candidate split. Recalling the above analysis for the time complexity of ID3, we can write the complexity of LSID3(r) as

    \sum_{i=0}^{n-1} r (n-i) \cdot O(m (n-i)^2) = \sum_{i=1}^{n} O(r \cdot m \cdot i^3) = O(r \cdot m \cdot n^4).

In the average case, we replace the runtime of SID3 by O(m · (n − i)), and hence we have

    \sum_{i=0}^{n-1} r (n-i) \cdot O(m (n-i)) = \sum_{i=1}^{n} O(r \cdot m \cdot i^2) = O(r \cdot m \cdot n^3).    (1)

According to the above analysis, the run-time of LSID3 grows at most linearly with r (under the assumption that increasing r does not result in larger trees). We expect that increasing r will improve the classifier's quality. To understand why, let us examine the expected behavior of the LSID3 algorithm on the 2-XOR problem used in Figure 1(b). LSID3 with r = 0, which is equivalent to ID3, prefers to split on the irrelevant attribute a4. LSID3 with r ≥ 1 evaluates each attribute a by calling SID3 to estimate the size of the trees rooted at a. The attribute with the smallest estimation will be selected. The minimal size for trees rooted at a4 is 6 and for trees rooted at a3 is 7. For a1 and a2, SID3 would necessarily produce trees of size 4.⁶ Thus, LSID3, even with r = 1, will succeed in learning the right tree. For more complicated problems such as 3-XOR, the space of SID3 trees under the relevant attributes includes trees other than the smallest. In that case, the larger the sample is, the higher the likelihood is that the smallest tree will be drawn.

6. Neither a3 nor a4 can be selected at the 2nd level since the remaining relevant attribute reduces the entropy to zero.

2.4 Evaluating Continuous Attributes by Sampling the Candidate Cuts

When attributes have a continuous domain, the decision tree learner needs to discretize their values in order to define splitting tests. One common approach is to partition the range of values an attribute can take into bins, prior to the process of induction. Dougherty et al. (1995) review and compare several different strategies for pre-discretization. Using such a strategy, our lookahead algorithm can operate unchanged. Pre-discretization, however, may be harmful in many domains because the correct partitioning may differ from one context (node) to another.

An alternative approach is to determine dynamically at each node a threshold that splits the examples into two subsets. The number of different possible thresholds is at most |E| − 1, and thus the number of candidate tests to consider for each continuous attribute increases to O(|E|). Such an increase in complexity may be insignificant for greedy algorithms, where evaluating each split requires a cheap computation (like information gain in ID3). In LSID3, however, the dynamic method may present a significant problem because evaluating the usefulness of each candidate split is very costly.

The desired method would reduce the number of splits to evaluate while avoiding the disadvantages of pre-discretization. We devised a method for controlling the resources devoted to evaluating a continuous attribute by Monte Carlo sampling of the space of splits. Initially, we evaluate each possible splitting test by the information gain it yields. This can be done in linear time, O(|E|). Next, we choose a sample of the candidate splitting tests, where the probability that a test will be chosen is proportional to its information gain. Each candidate in the sample is evaluated by a single invocation of SID3. However, since we sample with repetitions, candidates with high information gain may have multiple instances in the sample, resulting in several SID3 invocations.
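The sampling step just described can be sketched as follows. The midpoint thresholds, the binarize() helper and the sample size p · |E| (defined in the next paragraph) are choices made for illustration; each returned threshold would then be evaluated by a single SID3 invocation, as described above.

    import random

    def binarize(examples, attr, threshold):
        # recode the numeric attribute as a single boolean test "attr <= threshold"
        return [({0: features[attr] <= threshold}, label)
                for features, label in examples]

    def sample_cut_points(examples, attr, p):
        values = sorted({features[attr] for features, _ in examples})
        if len(values) < 2:                              # constant attribute: nothing to cut
            return []
        thresholds = [(lo + hi) / 2.0 for lo, hi in zip(values, values[1:])]
        gains = [gain_1(binarize(examples, attr, t), 0) for t in thresholds]
        weights = [max(g, 1e-9) for g in gains]          # keep zero-gain cuts selectable
        k = max(1, int(p * len(examples)))               # sample of size p * |E|, with repetition
        return random.choices(thresholds, weights=weights, k=k)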
The resources allocated for evaluating a continuous attribute using the above method are determined by the sample size. If our goal is to devote a similar amount of resources to all attributes, then we can use r as the size of the sample. Such an approach, however, does not take into account the size of the population to be sampled. We use a simple alternative approach of taking samples of size p · |E|, where p is a predetermined parameter set by the user according to the available resources.⁷ Note that p can be greater than one because we sample with repetition.

7. Similarly to the parameter r, a mapping from the available time to p is needed (see Section 2.7).

We name this variant of the LSID3 algorithm LSID3-MC, and formalize it in Figure 9. LSID3-MC can serve as an anytime algorithm that is parameterized by p and r.

2.5 Multiway vs. Binary Splits

The TDIDT procedure, as described in Figure 2, partitions the current set of examples into subsets according to the different values the splitting attribute can take. LSID3, which is a TDIDT algorithm, inherits this property. While multiway splits can yield more comprehensible trees (Kim and Loh, 2001)