Journal of Artificial Intelligence Research 33 (2008) 1-31. Submitted 05/08; published 09/08.

Anytime Induction of Low-cost, Low-error Classifiers: a Sampling-based Approach

Saher Esmeir ([email protected])
Shaul Markovitch ([email protected])
Computer Science Department
Technion - Israel Institute of Technology, Haifa 32000, Israel

Abstract

Machine learning techniques are gaining prevalence in the production of a wide range of classifiers for complex real-world applications with nonuniform testing and misclassification costs. The increasing complexity of these applications poses a real challenge to resource management during learning and classification. In this work we introduce ACT (anytime cost-sensitive tree learner), a novel framework for operating in such complex environments. ACT is an anytime algorithm that allows learning time to be increased in return for lower classification costs. It builds a tree top-down and exploits additional time resources to obtain better estimations for the utility of the different candidate splits. Using sampling techniques, ACT approximates the cost of the subtree under each candidate split and favors the one with a minimal cost. As a stochastic algorithm, ACT is expected to be able to escape local minima, into which greedy methods may be trapped. Experiments with a variety of datasets were conducted to compare ACT to the state-of-the-art cost-sensitive tree learners. The results show that for the majority of domains ACT produces significantly less costly trees. ACT also exhibits good anytime behavior with diminishing returns.

1. Introduction

Traditionally, machine learning algorithms have focused on the induction of models with low expected error. In many real-world applications, however, several additional constraints should be considered. Assume, for example, that a medical center has decided to use machine learning techniques to build a diagnostic tool for heart disease.
The comprehensibility of decision tree models (Hastie, Tibshirani, & Friedman, 2001, chap. 9) makes them the preferred choice on which to base this tool. Figure 1 shows three possible trees. The first tree (upper-left) makes decisions using only the results of cardiac catheterization (heart cath). This tree is expected to be highly accurate. Nevertheless, the high costs and risks associated with the heart cath procedure make this decision tree impractical. The second tree (lower-left) dispenses with the need for cardiac catheterization and reaches a decision based on a single, simple, inexpensive test: whether or not the patient complains of chest pain. Such a tree would be highly accurate: most people do not experience chest pain and are indeed healthy. The tree, however, does not distinguish between the costs of different types of errors. While a false positive prediction might result in extra treatments, a false negative prediction might put a person's life at risk. Therefore, a third tree (right) is preferred, one that attempts to minimize test costs and misclassification costs simultaneously.

(c) 2008 AI Access Foundation. All rights reserved.

Figure 1: Three possible decision trees for diagnosis of heart diseases. The upper-left tree bases its decision solely on heart cath and is therefore accurate but prohibitively expensive. The lower-left tree dispenses with the need for heart cath and reaches a decision using a single, simple, and inexpensive test: whether or not the patient complains of chest pain. Such a tree would be highly accurate but does not distinguish between the costs of the different error types.
The third (right-hand) tree is preferable: it attempts to minimize test costs and misclassification costs simultaneously.

Figure 2: Left: an example of a difficulty greedy learners might face (cost(a1-8) = $$, cost(a9,a10) = $$$$$$). Right: an example of the importance of context-based feature evaluation (cost(a1-10) = $$).

Finding a tree with the lowest expected total cost is at least NP-complete.(1) As in the cost-insensitive case, a greedy heuristic can be used to bias the search towards low-cost trees. Decision Trees with Minimal Cost (DTMC), a greedy method that attempts to minimize both types of costs simultaneously, has recently been introduced (Ling, Yang, Wang, & Zhang, 2004; Sheng, Ling, Ni, & Zhang, 2006). A tree is built top-down, and a greedy split criterion that takes into account both testing and misclassification costs is used. The basic idea is to estimate the immediate reduction in total cost after each split, and to prefer the split with the maximal reduction. If no split reduces the cost on the training data, the induction process is stopped.

1. Finding the smallest consistent tree, which is an easier problem, is NP-complete (Hyafil & Rivest, 1976).

Although efficient, the DTMC approach can be trapped in a local minimum and produce trees that are not globally optimal. For example, consider the concept and costs described in Figure 2 (left). There are 10 attributes, of which only a9 and a10 are relevant. The cost of a9 and a10, however, is significantly higher than that of the others. Such high costs may hide the usefulness of a9 and a10, and mislead the learner into repeatedly splitting on a1-8, which would result in a large, expensive tree. The problem would be intensified if a9 and a10 were interdependent, with a low immediate information gain (e.g., a9 XOR a10). In that case, even if the costs were uniform, a local measure might fail to recognize the relevance of a9 and a10.

DTMC is appealing when learning resources are very limited. However, it requires a fixed runtime and cannot exploit additional resources to escape local minima. In many real-life applications, we are willing to wait longer if a better tree can be induced (Esmeir & Markovitch, 2006). For example, the importance of the model in saving patients' lives may convince the medical center to allocate 1 month to learn it. Algorithms that can exploit additional time to produce better solutions are called anytime algorithms (Boddy & Dean, 1994).

The ICET algorithm (Turney, 1995) was a pioneer in non-greedy search for a tree that minimizes test and misclassification costs. ICET uses genetic search to produce a new set of costs that reflects both the original costs and the contribution of each attribute in reducing misclassification costs. It then builds a tree using the EG2 algorithm (Nunez, 1991), but with the evolved costs instead of the original ones. EG2 is a greedy cost-sensitive algorithm that builds a tree top-down and evaluates candidate splits by considering both the information gain they yield and their measurement costs. It does not, however, take into account the misclassification cost of the problem. ICET was shown to significantly outperform greedy tree learners, producing trees of lower total cost.

ICET can use additional time resources to produce more generations and hence widen its search in the space of costs. Because the genetic operations are randomized, ICET is more likely to escape local minima, into which EG2 with the original costs might be trapped. Nevertheless, two shortcomings limit ICET's ability to benefit from extra time. First, after the search phase, it uses the greedy EG2 algorithm to build the final tree. But because EG2 prefers attributes with high information gain (and low test cost), the usefulness of highly relevant attributes may be underestimated by the greedy measure in the case of hard-to-learn concepts where attribute interdependency is hidden.
This will result in more expensive trees. Second, even if ICET overcomes the above problem by randomly reweighting the attributes, it searches the space of parameters globally, regardless of the context in the tree. This poses a problem if an attribute is important in one subtree but useless in another. To better understand these shortcomings, consider the concept described by the tree in Figure 2 (right). There are 10 attributes with similar costs. The value of a1 determines whether the target concept is a7 XOR a9 or a4 XOR a6. The interdependencies result in a low gain for all attributes. Because ICET assigns costs globally, the attributes will have similar costs as well. Therefore, ICET will not be able to recognize which one is relevant in which context. If the irrelevant attributes are cheaper, the problem is intensified and the model might end up relying on irrelevant attributes.

Recently, we have introduced the cost-insensitive LSID3 algorithm, which can induce more accurate trees when allocated more time (Esmeir & Markovitch, 2007a). The algorithm evaluates a candidate split by estimating the size of the smallest consistent tree under it. The estimation is based on sampling the space of consistent trees, where the size of the sample is determined in advance according to the allocated time. LSID3 is not designed, however, to minimize test and misclassification costs.

In this work we build on LSID3 and propose ACT, an anytime cost-sensitive tree learner that can exploit additional time to produce lower-cost trees. Applying the sampling mechanism in the cost-sensitive setup, however, is not trivial and poses three major challenges: (1) how to produce the sample, (2) how to evaluate the sampled trees, and (3) how to prune the induced trees. In Section 3 we show how these obstacles may be overcome.
In Section 4 we report an extensive set of experiments that compare ACT to several decision tree learners using a variety of datasets with costs assigned either by human experts or automatically. The results show that ACT is significantly better for the majority of problems. In addition, ACT is shown to exhibit good anytime behavior with diminishing returns.

2. Cost-Sensitive Classification

Offline concept learning consists of two stages: the learning stage, where a set of labeled examples is used to induce a classifier; and the classification stage, where the induced classifier is used to classify unlabeled instances. These two stages involve different types of costs (Turney, 2000). Our primary goal in this work is to trade learning speed for a reduction in test and misclassification costs. To make the problem well defined, we need to specify: (1) how misclassification costs are represented, (2) how test costs are calculated, and (3) how we should combine both types of cost.

To answer these questions, we adopt the model described by Turney (1995). In a problem with |C| different classes, a misclassification cost matrix M is a |C| x |C| matrix whose M_{i,j} entry defines the penalty of assigning the class c_i to an instance that actually belongs to the class c_j. Typically, entries on the main diagonal of a classification cost matrix (no error) are all zero.

When classifying an example e using a tree T, we propagate e down the tree along a single path from the root of T to one of its leaves. Let Theta(T,e) be the set of tests along this path. We denote by cost(theta) the cost of administering the test theta. The testing cost of e in T is therefore

    tcost(T,e) = Sum_{theta in Theta(T,e)} cost(theta).

Note that we use set notation because tests that appear several times along the path are charged for only once. In addition, the model described by Turney (1995) handles two special test types, namely grouped and delayed tests.

Grouped Tests. Some tests share a common cost, for which we would like to charge only once.
Typically, the test also has an extra (possibly different) cost of its own. For example, consider a tree path with tests like cholesterol level and glucose level. For both values to be measured, a blood test is needed. Taking blood samples to measure the cholesterol level clearly lowers the cost of measuring the glucose level. Formally, each test possibly belongs to a group.(2) If it is the first test from the group to be administered, we charge the full cost. If another test from the same group has already been administered earlier in the decision path, we charge only the marginal cost.

Delayed Tests. Sometimes the outcome of a test cannot be obtained immediately, e.g., lab test results. Such tests, called delayed tests, force us to wait until the outcome is available. Alternatively, Turney (1995) suggests taking into account all possible outcomes: when a delayed test is encountered, all the tests in the subtree under it are administered and charged for. Once the result of the delayed test is available, the prediction is at hand. One problem with this setup is that it follows all paths in the subtree, regardless of the outcomes of the non-delayed tests. Moreover, it is not possible to distinguish between the delays different tests impose: for example, one result might be ready after several minutes, another only after a few days. In this work we do not handle delayed tests, but we do explain how ACT can be modified to take them into account.

After the test and misclassification costs have been measured, an important question remains: how should we combine them? Following Turney (1995), we assume that both cost types are given on the same scale. A more general model would require a utility function that combines both types. Qin, Zhang, and Zhang (2004) presented a method to handle the two kinds of cost scales by setting a maximal budget for one kind of cost and minimizing the other.
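The charging rules for repeated and grouped tests described above can be sketched in a few lines. This is a minimal illustration of the model, not the paper's implementation; the function name, data structures, and dollar figures are ours:

```python
def path_test_cost(tests, full_cost, marginal_cost, group_of):
    """Cost of administering a sequence of tests along one decision path.

    Each test is charged at most once. The first test from a group pays
    its full cost; later tests from the same group pay only their marginal
    cost, since the group's shared cost has already been paid.
    """
    seen_tests, seen_groups, total = set(), set(), 0.0
    for t in tests:
        if t in seen_tests:
            continue                      # repeated test: already charged
        seen_tests.add(t)
        g = group_of.get(t)               # None if the test has no group
        if g is not None and g in seen_groups:
            total += marginal_cost[t]     # group discount applies
        else:
            total += full_cost[t]
            if g is not None:
                seen_groups.add(g)
    return total
```

For instance, if cholesterol (full cost 50, marginal 10) and glucose (full cost 48, marginal 8) both belong to a "blood" group, the path [cholesterol, glucose, cholesterol] is charged 50 + 8 = 58: the repeated test is free, and glucose enjoys the group discount.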
Alternatively, patient preferences can be elicited and summarized as a utility function (Lenert & Soetikno, 1997). Note that the algorithm we introduce in this paper can be adapted to any cost model. An important property of our cost-sensitive setup is that maximizing generalization accuracy, which is the goal of most existing learners, can be viewed as a special case: when accuracy is the only objective, test costs are ignored and the misclassification cost is uniform.

3. The ACT Algorithm

ACT, our proposed anytime framework for induction of cost-sensitive decision trees, builds on the recently introduced LSID3 algorithm. LSID3 adopts the general top-down scheme for induction of decision trees (TDIDT): it starts from the entire set of training examples, partitions it into subsets by testing the value of an attribute, and then recursively builds subtrees. Unlike greedy inducers, LSID3 invests more time resources in making better split decisions. For every candidate split, LSID3 attempts to estimate the size of the resulting subtree were the split to take place. Following Occam's razor (Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987; Esmeir & Markovitch, 2007b), it favors the split with the smallest expected size. The estimation is based on a biased sample of the space of trees rooted at the evaluated attribute. The sample is obtained using a stochastic version of ID3 (Quinlan, 1986), which we call SID3. In SID3, rather than choosing an attribute that maximizes the information gain DeltaI (as in ID3), we choose the splitting attribute semi-randomly. The likelihood that an attribute will be chosen is proportional to its information gain.

2. In this model, each test may belong to a single group. However, it is easy to extend our work to allow tests that belong to several groups.
Figure 3: Attribute selection in LSID3.

    Procedure LSID3-Choose-Attribute(E, A, r)
      If r = 0
        Return ID3-Choose-Attribute(E, A)
      Foreach a in A
        Foreach v_i in domain(a)
          E_i <- {e in E | a(e) = v_i}
          min_i <- infinity
          Repeat r times
            T <- SID3(E_i, A - {a})
            min_i <- min(min_i, Size(T))
        total_a <- Sum_{i=1}^{|domain(a)|} min_i
      Return a for which total_a is minimal

Due to its randomization, repeated invocations of SID3 result in different trees. For each candidate attribute a, LSID3 invokes SID3 r times to form a sample of r trees rooted at a, and uses the size of the smallest tree in the sample to evaluate a. Obviously, when r is larger, the resulting size estimations are expected to be more accurate, improving the final tree. Consider, for example, a 3-XOR concept with several additional irrelevant attributes. For LSID3 to prefer one of the relevant attributes at the root, one of the trees in the samples of the relevant attributes must be the smallest. The probability of this event increases with the sample size.

LSID3 is a contract anytime algorithm parameterized by r, the sample size. Additional time resources can be utilized by forming larger samples. Figure 3 lists the procedure for attribute selection as applied by LSID3. Let m = |E| be the number of examples and n = |A| the number of attributes. The runtime complexity of LSID3 is O(rmn^3). LSID3 was shown to exhibit good anytime behavior with diminishing returns. When applied to hard concepts, it produced significantly better trees than ID3 and C4.5.

ACT takes the same sampling approach as LSID3. The three major components of LSID3 that need to be replaced in order to adapt it to cost-sensitive problems are: (1) sampling the space of trees, (2) evaluating a tree, and (3) pruning a tree.

3.1 Obtaining the Sample

LSID3 uses SID3 to bias the samples towards small trees. In ACT, however, we would like to bias our sample towards low-cost trees. For this purpose, we designed a stochastic version of the EG2 algorithm, which attempts to build low-cost trees greedily.
In EG2, a tree is built top-down, and the test that maximizes ICF is chosen for splitting a node, where

    ICF(theta) = (2^{DeltaI(theta)} - 1) / (cost(theta) + 1)^w.

DeltaI is the information gain (as in ID3). The parameter w in [0,1] controls the bias towards lower-cost attributes. When w = 0, test costs are ignored and ICF relies solely on the information gain. Larger values of w strengthen the effect of test costs on ICF. We discuss setting the value of w in Section 3.5.

In stochastic EG2 (SEG2), we choose splitting attributes semi-randomly, proportionally to their ICF. Because SEG2 is stochastic, we expect it to escape local minima for at least some of the trees in the sample. Figure 4 formalizes the attribute selection component of SEG2.

Figure 4: Attribute selection in SEG2.

    Procedure SEG2-Choose-Attribute(E, A)
      Foreach a in A
        DeltaI(a) <- Information-Gain(E, a)
        c(a) <- Cost(a)
        p(a) <- (2^{DeltaI(a)} - 1) / (c(a) + 1)^w
      a* <- Choose an attribute at random from A, where the probability of
            selecting a is proportional to p(a)
      Return a*

To obtain a sample of size r, ACT uses EG2 once and SEG2 r - 1 times. EG2 and SEG2 are given direct access to context-based costs: if an attribute has already been tested, its cost is zero, and if another attribute that belongs to the same group has been tested, a group discount is applied.

3.2 Evaluating a Subtree

LSID3 is a cost-insensitive learning algorithm. As such, its main goal is to maximize the expected accuracy of the learned tree. Occam's razor states that given two consistent hypotheses, the smaller one is likely to be more accurate. Following Occam's razor, LSID3 uses the tree size as a preference bias and favors splits that are expected to reduce the final size. In a cost-sensitive setup, however, our goal is to minimize the expected total cost of classification. Therefore, rather than choosing an attribute that minimizes the size, we would like to choose one that minimizes the total cost.
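The SEG2 selection rule of Section 3.1 can be made concrete with a short roulette-wheel sketch. It assumes the information gains and context-based costs have already been computed and are passed in as plain dictionaries; the interface and names are ours, not the paper's:

```python
import random

def seg2_choose_attribute(gains, costs, w=0.5, rng=random):
    """Draw a splitting attribute with probability proportional to its ICF,
    where ICF(a) = (2**gain(a) - 1) / (cost(a) + 1)**w.

    `gains` and `costs` map attribute names to precomputed values; `rng`
    is anything exposing random() and choice() (e.g. random.Random(seed)).
    """
    icf = {a: (2 ** gains[a] - 1) / (costs[a] + 1) ** w for a in gains}
    total = sum(icf.values())
    if total == 0:
        return rng.choice(sorted(gains))   # no informative attribute: uniform
    r = rng.random() * total               # roulette-wheel selection
    for a, weight in icf.items():
        r -= weight
        if r <= 0:
            return a
    return a                               # guard against float leftovers
```

An attribute with zero information gain has ICF = 0 and is never drawn while an informative alternative exists, which is how SEG2 retains EG2's bias while still randomizing among informative splits.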
Given a decision tree, we need a procedure that estimates the expected cost of using the tree to classify a future case. This cost has two components: the test cost and the misclassification cost.

3.2.1 Estimating Test Costs

Assuming that the distribution of future cases is similar to that of the learning examples, we can estimate the test costs using the training data. Given a tree, we calculate the average test cost of the training examples and use it to estimate the test cost of new cases. For a (sub)tree T built from E, a set of m training examples, we denote the average cost of traversing T for an example from E by

    tcost(T,E) = (1/m) Sum_{e in E} tcost(T,e).

The estimated test cost for an unseen example e* is therefore tcost(T,e*) = tcost(T,E). Observe that costs are calculated in the relevant context. If an attribute a has already been tested in upper nodes, we will not charge for testing it again. Similarly, if an attribute from a group g has already been tested, we will apply a group discount to the other attributes from g. If a delayed attribute is encountered, we sum the cost of the entire subtree.

3.2.2 Estimating Misclassification Costs

How to estimate the cost of misclassification is not obvious. The tree size can no longer be used as a heuristic for predictive errors: Occam's razor allows the comparison of two consistent trees but provides no means for estimating accuracy. Moreover, tree size is measured in a different currency than accuracy and hence cannot be easily incorporated into the cost function. Rather than the tree size, we propose a different estimator: the expected error (Quinlan, 1993). For a leaf with m training examples, of which s are misclassified, the expected error is defined as the upper limit on the probability of error, i.e., EE(m,s,cf) = U_bin^cf(m,s), where cf is the confidence level and U_bin^cf is the upper limit of the confidence interval for the binomial distribution.
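This upper limit can be computed by inverting the binomial CDF. The following is a minimal pure-Python sketch using bisection; note that C4.5 itself uses its own approximation of the bound, so small numerical differences from its reported figures are possible:

```python
from math import comb

def binom_cdf(s, n, p):
    """P(X <= s) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s + 1))

def expected_error(m, s, cf=0.25):
    """EE(m, s, cf): m examples in the leaf, s of them misclassified.

    Returns m times the upper limit of the one-sided binomial confidence
    interval for the true error rate at confidence level cf.
    """
    lo, hi = s / m, 1.0
    for _ in range(60):                  # bisection: the CDF decreases in p
        mid = (lo + hi) / 2
        if binom_cdf(s, m, mid) > cf:
            lo = mid
        else:
            hi = mid
    return m * lo
```

With cf = 0.25 this yields expected errors close to the 7.3 (for a leaf with 5 errors in 100 examples) and 1.4 (for a pure leaf of 50 examples) used in the example of Section 3.3.1.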
The expected error of a tree is the sum of the expected errors in its leaves.

Originally, the expected error was used by C4.5's error-based pruning to predict whether a subtree performs better than a leaf. Although it lacks a theoretical basis, it was shown experimentally to be a good heuristic. In ACT we use the expected error to approximate the misclassification cost. Assume a problem with |C| classes and a misclassification cost matrix M. Let c be the class label in a leaf l. Let m_l be the total number of examples in l and m_l^i the number of examples in l that belong to class i. When the penalties for predictive errors are uniform (M_{i,j} = mc), the estimated misclassification cost in l is

    mcost(l) = EE(m_l, m_l - m_l^c, cf) * mc.

In a problem with nonuniform misclassification costs, mc should be replaced by the cost of the actual errors the leaf is expected to make. These errors are obviously unknown to the learner. One solution is to estimate each error type separately, using confidence intervals for the multinomial distribution, and multiply it by the associated cost:

    mcost(l) = Sum_{i != c} U_mul^cf(m_l, m_l^i, |C|) * M_{c,i}.

Such an approach, however, would result in an overly pessimistic approximation, mainly when there are many classes. Alternatively, we compute the expected error as in the uniform case and propose replacing mc by the weighted average of the penalty for classifying an instance as c while it belongs to another class. The weights are derived from the proportions m_l^i / (m_l - m_l^c) using a generalization of Laplace's law of succession (Good, 1965, chap. 4):

    mcost(l) = EE(m_l, m_l - m_l^c, cf) * Sum_{i != c} [ (m_l^i + 1) / (m_l - m_l^c + |C| - 1) ] * M_{c,i}.

Note that in a problem with |C| classes, the average is over |C| - 1 possible penalties because M_{c,c} = 0. Hence, in a problem with two classes c_1, c_2, if a leaf is marked as c_1, mc would be replaced by M_{1,2}.

When classifying a new instance, the expected misclassification cost of a tree T built from m examples is the sum of the expected misclassification costs in the leaves divided by m:

    mcost(T) = (1/m) Sum_{l in L} mcost(l),

where L is the set of leaves in T. Hence, the expected total cost of T when classifying a single instance is

    total(T,E) = tcost(T,E) + mcost(T).

An alternative approach, which we intend to explore in future work, is to estimate the cost of the sampled trees using the cost on a set-aside validation set. This approach is attractive mainly when the training set is large and one can afford to set aside a significant part of it.

3.3 Choosing a Split

Having decided on the sampler and the tree utility function, we are ready to formalize the tree-growing phase of ACT. A tree is built top-down. The procedure for selecting a splitting test at each node is listed in Figure 5 and illustrated in Figure 6. We give a detailed example of how ACT chooses splits and explain how the split selection procedure is modified for numeric attributes.

Figure 5: Attribute selection in ACT.

    Procedure ACT-Choose-Attribute(E, A, r)
      If r = 0
        Return EG2-Choose-Attribute(E, A)
      Foreach a in A
        Foreach v_i in domain(a)
          E_i <- {e in E | a(e) = v_i}
          T <- EG2(a, E_i, A - {a})
          min_i <- Total-Cost(T, E_i)
          Repeat r - 1 times
            T <- SEG2(a, E_i, A - {a})
            min_i <- min(min_i, Total-Cost(T, E_i))
        total_a <- Cost(a) + Sum_{i=1}^{|domain(a)|} min_i
      Return a for which total_a is minimal

3.3.1 Choosing a Split: Illustrative Examples

ACT's evaluation is cost-sensitive both in that it considers test and error costs simultaneously and in that it can take into account different error penalties. To illustrate this, let us consider a two-class problem with mc = 100$ (uniform) and 6 attributes, a_1, ..., a_6, whose costs are 10$.
The training data contains 400 examples, of which 200 are positive and 200 are negative.

Figure 6: Attribute evaluation in ACT. Assume that the cost of a in the current context is 0.4. The estimated cost of a subtree rooted at a is therefore 0.4 + min(4.7, 5.1) + min(8.9, 4.9) = 10.

Assume that we have to choose between a_1 and a_2, and that r = 1. Let the trees in Figure 7, denoted T1 and T2, be those sampled for a_1 and a_2, respectively. The expected error costs of T1 and T2 are:(3)

    mcost(T1) = (1/400) * (4 * EE(100, 5, 0.25)) * 100$ = (4 * 7.3 / 400) * 100$ = 7.3$

    mcost(T2) = (1/400) * (2 * EE(50, 0, 0.25) * 100$ + 2 * EE(150, 50, 0.25) * 100$)
              = ((2 * 1.4 + 2 * 54.1) / 400) * 100$ = 27.7$

3. In this example we set cf to 0.25, as in C4.5. In Section 3.5 we discuss how to tune cf.

Figure 7: Evaluation of tree samples in ACT. The leftmost column defines the costs: 6 attributes with identical test cost (10$ each) and uniform error penalties (FP = FN = 100$). T1 was sampled for a_1 and T2 for a_2; EE stands for the expected error in each leaf. Because the total cost of T1 (27.3$) is lower than that of T2 (47.7$), ACT would prefer to split on a_1.

When both test and error costs are involved, ACT considers their sum. Since the test cost of both trees is identical (20$), ACT would prefer to split on a_1. If, however, the cost
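The arithmetic of this example can be replayed directly. The leaf expected errors 7.3, 1.4, and 54.1 are taken from Figure 7; everything else follows the formulas of Section 3.2 with mc = 100$ and 10$ per test:

```python
# Leaf expected errors read off Figure 7 (cf = 0.25, as in C4.5)
ee_T1 = [7.3, 7.3, 7.3, 7.3]        # T1: four leaves, 5 errors per ~100 examples
ee_T2 = [1.4, 1.4, 54.1, 54.1]      # T2: two pure leaves, two very mixed ones
m, mc = 400, 100.0                  # training-set size, uniform error penalty

mcost_T1 = sum(ee_T1) * mc / m      # misclassification cost estimate for T1
mcost_T2 = sum(ee_T2) * mc / m      # ... and for T2 (printed as 27.7$ above)
tcost = 2 * 10.0                    # every path administers two 10$ tests

total_T1 = tcost + mcost_T1         # total expected cost of T1
total_T2 = tcost + mcost_T2         # total expected cost of T2
```

The totals come out to 27.3$ for T1 against roughly 47.7$ for T2, so ACT prefers the split that produced T1.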
