Journal of Machine Learning Research 14 (2013) 1947-1988    Submitted 6/12; Revised 1/13; Published 7/13

Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting

Nayyar A. Zaidi    [email protected]
Faculty of Information Technology
Monash University
VIC 3800, Australia

Jesús Cerquides    [email protected]
IIIA-CSIC, Artificial Intelligence Research Institute
Spanish National Research Council
Campus UAB, 08193 Bellaterra, Spain

Mark J. Carman    [email protected]
Geoffrey I. Webb    [email protected]
Faculty of Information Technology
Monash University
VIC 3800, Australia

Editor: Russ Greiner

Abstract

Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE.

Keywords: classification, naive Bayes, attribute independence assumption, weighted naive Bayes classification

1. Introduction

Naive Bayes (also known as simple Bayes and Idiot's Bayes) is an extremely simple and remarkably effective approach to classification learning (Lewis, 1998; Hand and Yu, 2001). It infers the probability of a class label given data using a simplifying assumption that the attributes are independent given the label (Kononenko, 1990; Langley et al., 1992). This assumption is motivated by the need to estimate high-dimensional multi-variate probabilities from the training data. If there is sufficient data present for every possible combination of attribute values, direct estimation of each relevant multi-variate probability will be reliable. In practice, however, this is not the case and most combinations are either not represented in the training data or not present in sufficient numbers. Naive Bayes circumvents this predicament by its conditional independence assumption. Surprisingly, it has been shown that the prediction accuracy of naive Bayes compares very well with other more complex classifiers such as decision trees, instance-based learning and rule learning, especially when the data quantity is small (Hand and Yu, 2001; Cestnik et al., 1987; Domingos and Pazzani, 1996; Langley et al., 1992).

In practice, naive Bayes' attribute independence assumption is often violated, and as a result its probability estimates are often suboptimal. A large literature addresses approaches to reducing the inaccuracies that result from the conditional independence assumption. Such approaches can be placed into two categories. The first category comprises semi-naive Bayes methods. These methods are aimed at enhancing naive Bayes' accuracy by relaxing the assumption of conditional independence between attributes given the class label (Langley and Sage, 1994; Friedman and Goldszmidt, 1996; Zheng et al., 1999; Cerquides and De Mántaras, 2005a; Webb et al., 2005, 2011; Zheng et al., 2012).
The second category comprises attribute weighting methods and has received relatively little attention (Hilden and Bjerregaard, 1976; Ferreira et al., 2001; Hall, 2007). There is some evidence that attribute weighting appears to have primarily been viewed as a means of increasing the influence of highly predictive attributes and discounting attributes that have little predictive value. This is not so much evident from the explicit motivation stated in the prior work, but rather from the manner in which weights have been assigned. For example, weighting by mutual information between an attribute and the class is directly using a measure of how predictive each individual attribute is (Zhang and Sheng, 2004). In contrast, we argue that the primary value of attribute weighting is its capacity to reduce the impact on prediction accuracy of violations of the assumption of conditional attribute independence.

The contributions of this paper are two-fold:

• This paper reviews the state of the art in weighted naive Bayesian classification. We provide a compact survey of existing techniques and compare them using the bias-variance decomposition method of Kohavi and Wolpert (1996). We also use the Friedman test and Nemenyi statistics to analyze error, bias, variance and root mean square error.

• We present novel algorithms for learning attribute weights for naive Bayes. It should be noted that the motivation of our work differs from most previous attribute weighting methods. We view weighting as a way to reduce the effects of the violations of the attribute independence assumption on which naive Bayes is based. Also, our work differs from semi-naive Bayes methods, as we weight the attributes rather than modifying the structure of naive Bayes.

We propose a weighted naive Bayes algorithm, Weighting attributes to Alleviate Naive Bayes' Independence Assumption (WANBIA), that introduces weights in naive Bayes and learns these weights in a discriminative fashion, that is, by minimizing either the negative conditional log likelihood or the mean squared error objective functions. Naive Bayes probabilities are set to be their maximum a posteriori (MAP) estimates.

The paper is organized as follows: we provide a formal description of the weighted naive Bayes model in Section 2. Section 3 provides a survey of related approaches. Our novel techniques for learning naive Bayes weights are described in Section 4, where we also discuss their connection with naive Bayes and Logistic Regression in terms of parameter optimization. Section 5 presents experimental evaluation of our proposed methods and their comparison with related approaches. Section 6 presents conclusions and directions for future research.

Notation                      Description
P(e)                          the unconditioned probability of event e
P(e | g)                      the conditional probability of event e given g
\hat{P}(·)                    an estimate of P(·)
a                             the number of attributes
n                             the number of data points in D
x = ⟨x_1, ..., x_a⟩           an object (a-dimensional vector), x ∈ D
y ∈ Y                         the class label for object x
|Y|                           the number of classes
D = {x^(1), ..., x^(n)}       data consisting of n objects
L = {y^(1), ..., y^(n)}       labels of the data points in D
X_i                           the discrete set of values for attribute i
|X_i|                         the cardinality of attribute i
v = (1/a) \sum_i |X_i|        the average cardinality of the attributes

Table 1: List of symbols used.

2. Weighted Naive Bayes

We wish to estimate, from a training sample D consisting of n objects, the probability P(y | x) that an example x ∈ D belongs to a class with label y ∈ Y. All the symbols used in this work are listed in Table 1. From the definition of conditional probability we have

    P(y | x) = P(y, x) / P(x).    (1)

As P(x) = \sum_{i=1}^{|Y|} P(y_i, x), we can always estimate P(y | x) in Equation 1 from the estimates of P(y, x) for each class as:

    P(y, x) / P(x) = \frac{P(y, x)}{\sum_{i=1}^{|Y|} P(y_i, x)}.    (2)
In consequence, in the remainder of this paper we consider only the problem of estimating P(y, x).

Naive Bayes estimates P(y, x) by assuming the attributes are independent given the class, resulting in the following formula:

    \hat{P}(y, x) = \hat{P}(y) \prod_{i=1}^{a} \hat{P}(x_i | y).    (3)

Weighted naive Bayes extends the above by adding a weight to each attribute. In the most general case, this weight depends on the attribute value:

    \hat{P}(y, x) = \hat{P}(y) \prod_{i=1}^{a} \hat{P}(x_i | y)^{w_{i,x_i}}.    (4)

Doing this results in \sum_{i=1}^{a} |X_i| weight parameters (and is in some cases equivalent to a "binarized logistic regression model"; see Section 4 for a discussion). A second possibility is to give a single weight per attribute:

    \hat{P}(y, x) = \hat{P}(y) \prod_{i=1}^{a} \hat{P}(x_i | y)^{w_i}.    (5)

One final possibility is to set all weights to a single value:

    \hat{P}(y, x) = \hat{P}(y) \left( \prod_{i=1}^{a} \hat{P}(x_i | y) \right)^{w}.    (6)

Equation 5 is a special case of Equation 4, where w_{i,j} = w_i ∀j, and Equation 6 is a special case of Equation 5, where w_i = w ∀i. Unless explicitly stated, in this paper we intend the intermediate form when we refer to attribute weighting, as we believe it provides an effective trade-off between computational complexity and inductive power.

Appropriate weights can reduce the error that results from violations of naive Bayes' conditional attribute independence assumption. Trivially, if data include a set of a attributes that are identical to one another, the error due to the violation of the conditional independence assumption can be removed by assigning weights that sum to 1.0 to the attributes in the set. For example, the weight for one of the attributes, x_i, could be set to 1.0, and that of the remaining attributes that are identical to x_i set to 0.0. This is equivalent to deleting the remaining attributes. Note that any assignment of weights such that their sum is 1.0 for the a attributes will have the same effect; for example, we could set the weights of all a attributes to 1/a.

Attribute weighting is strictly more powerful than attribute selection, as it is possible to obtain identical results to attribute selection by setting the weights of selected attributes to 1.0 and of discarded attributes to 0.0, and assignment of other weights can create classifiers that cannot be expressed using attribute selection.
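To make Equations 2 and 5 concrete, the following minimal sketch (an illustration, not code from the paper) scores a discretized instance under fixed probability estimates and a per-attribute weight vector; the probability tables and weight values are invented for the example.

```python
import numpy as np

def weighted_nb_joint(prior, cond, x, w):
    """P_hat(y, x) = P_hat(y) * prod_i P_hat(x_i | y)^{w_i}   (Equation 5)."""
    # prior: shape (|Y|,) class priors; cond[i][v] gives P_hat(X_i = v | y) for every class y;
    # x: attribute-value indices of one instance; w: one weight per attribute.
    log_joint = np.log(prior).copy()
    for i, v in enumerate(x):
        log_joint += w[i] * np.log(cond[i][v])      # w_i scales the log-conditional of attribute i
    return np.exp(log_joint)

def weighted_nb_posterior(prior, cond, x, w):
    """Normalise the joint over classes, as in Equation 2."""
    joint = weighted_nb_joint(prior, cond, x, w)
    return joint / joint.sum()

# Hypothetical two-class, two-attribute model.
prior = np.array([0.6, 0.4])                        # P_hat(y)
cond = [np.array([[0.7, 0.2], [0.3, 0.8]]),         # P_hat(x_1 | y); rows index values, columns classes
        np.array([[0.5, 0.1], [0.5, 0.9]])]         # P_hat(x_2 | y)
print(weighted_nb_posterior(prior, cond, x=(0, 1), w=np.array([1.0, 1.0])))   # standard naive Bayes
print(weighted_nb_posterior(prior, cond, x=(0, 1), w=np.array([0.5, 1.0])))   # x_1 down-weighted
```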
2.1 Dealing with Dependent Attributes by Weighting: A Simple Example

This example shows the relative performance of naive Bayes and weighted naive Bayes as we vary the conditional dependence between attributes. In particular, it demonstrates how optimal assignment of weights will never result in higher error than attribute selection or standard naive Bayes, and that for certain violations of the attribute independence assumption it can result in lower error than either.

We will constrain ourselves to a binary class problem with two binary attributes. We quantify the conditional dependence between the attributes using the Conditional Mutual Information (CMI):

    I(X_1, X_2 | Y) = \sum_y \sum_{x_2} \sum_{x_1} P(x_1, x_2, y) \log \frac{P(x_1, x_2 | y)}{P(x_1 | y) P(x_2 | y)}.

The results of varying the conditional dependence between the attributes on the performance of the different classifiers, in terms of their Root Mean Squared Error (RMSE), are shown in Figure 1.

To generate these curves, we varied the probabilities P(y | x_1, x_2) and P(x_1, x_2) and plotted average results across distinct values of the Conditional Mutual Information. For each of the 4 possible attribute value combinations (x_1, x_2) ∈ {(0,0), (0,1), (1,0), (1,1)}, we selected values for the class probability given the attribute value combination from the set P(y | x_1, x_2) ∈ {0.25, 0.75}. Note that P(¬y | x_1, x_2) = 1 − P(y | x_1, x_2), so this process resulted in 2^4 possible assignments to the vector P(y | ·, ·).

We then set the values for the attribute value probabilities P(x_1, x_2) by fixing the marginal distributions to a half, P(x_1) = P(x_2) = 1/2, and varying the correlation between the attributes using Pearson's correlation coefficient, denoted ρ, as follows:¹

    P(X_1 = 0, X_2 = 0) = P(X_1 = 1, X_2 = 1) = (1 + ρ)/4,
    P(X_1 = 0, X_2 = 1) = P(X_1 = 1, X_2 = 0) = (1 − ρ)/4,

where −1 ≤ ρ ≤ 1. Note that when ρ = −1 the attributes are perfectly anti-correlated (x_1 = ¬x_2), when ρ = 0 the attributes are independent (since the joint distribution P(x_1, x_2) is uniform), and when ρ = 1 the attributes are perfectly correlated.

1. Note that from the definition of Pearson's correlation coefficient we have:

    ρ = \frac{E[(X_1 - E[X_1])(X_2 - E[X_2])]}{\sqrt{E[(X_1 - E[X_1])^2]\, E[(X_2 - E[X_2])^2]}} = 4 P(X_1 = 1, X_2 = 1) - 1,

since E[X_1] = E[X_2] = 1/2 and E[X_1 X_2] = P(X_1 = 1, X_2 = 1).

For the graph, we increased values of ρ in increments of 0.00004, resulting in 50000 distributions (vectors) for P(·, ·) for each vector P(y | ·, ·). Near-optimal weights (w_1, w_2) for the weighted naive Bayes classifier were found using grid search over the range {0.0, 0.1, 0.2, ..., 0.9, 1.0} × {0.0, 0.1, 0.2, ..., 0.9, 1.0}. Results in Figure 1 are plotted by taking the average across conditional mutual information values, with a window size of 0.1.

[Figure 1 shows RMSE (0 to 0.05) of the prior classifier, naive Bayes, selective naive Bayes and weighted naive Bayes against conditional dependence (0 to 0.7).]

Figure 1: Variation of error of naive Bayes, selective naive Bayes, weighted naive Bayes and a classifier based only on prior probabilities of the class as a function of conditional dependence (conditional mutual information) between the two attributes.

We compare the expected RMSE of naive Bayes (w_1 = 1, w_2 = 1), weighted naive Bayes, naive Bayes based on feature 1 only (selective Bayes with w_1 = 1, w_2 = 0), naive Bayes based on feature 2 only (selective Bayes with w_1 = 0, w_2 = 1), and naive Bayes using only the prior (equivalent to weighted naive Bayes with both weights set to 0.0). It can be seen that when conditional mutual information (CMI) is small, naive Bayes performs better than selective naive Bayes and the prior classifier. Indeed, when CMI is 0.0, naive Bayes is optimal. As CMI is increased, naive Bayes' performance deteriorates compared to selective naive Bayes. Weighted naive Bayes, on the other hand, has the best performance in all circumstances. Due to the symmetry of the problem, the two selective Bayes classifiers give exactly the same results.

Note that in this experiment we have used the optimal weights to calculate the results. We have shown that weighted naive Bayes is capable of expressing more accurate classifiers than selective naive Bayes. In the remaining sections we will examine and evaluate techniques for learning from data the weights those models require.
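The grid search above can be approximated with a short script. The sketch below is a simplified reconstruction under our own assumptions, not the experimental code used for Figure 1; in particular, the expected RMSE is computed analytically from the true joint distribution rather than estimated from sampled data, and the single ρ and class-probability vector shown are one hypothetical configuration.

```python
import itertools
import numpy as np

def joint_from(rho, p_y_given_x):
    """Return P[y, x1, x2] for the two-attribute, binary-class example."""
    p_x = np.array([[(1 + rho) / 4, (1 - rho) / 4],
                    [(1 - rho) / 4, (1 + rho) / 4]])          # P(x1, x2) with uniform marginals
    joint = np.empty((2, 2, 2))
    for x1, x2 in itertools.product((0, 1), repeat=2):
        joint[1, x1, x2] = p_x[x1, x2] * p_y_given_x[x1, x2]
        joint[0, x1, x2] = p_x[x1, x2] * (1 - p_y_given_x[x1, x2])
    return joint

def expected_rmse(joint, w1, w2):
    """Root of the expected squared gap between true and weighted-NB posteriors."""
    p_y = joint.sum(axis=(1, 2))                     # P(y)
    p_x1_y = joint.sum(axis=2) / p_y[:, None]        # P(x1 | y), indexed [y, x1]
    p_x2_y = joint.sum(axis=1) / p_y[:, None]        # P(x2 | y), indexed [y, x2]
    se = 0.0
    for x1, x2 in itertools.product((0, 1), repeat=2):
        p_x = joint[:, x1, x2].sum()
        true_post = joint[:, x1, x2] / p_x
        score = p_y * p_x1_y[:, x1] ** w1 * p_x2_y[:, x2] ** w2
        se += p_x * np.sum((true_post - score / score.sum()) ** 2)
    return np.sqrt(se)

# Strongly correlated attributes, P(y | x1, x2) drawn from {0.25, 0.75}.
joint = joint_from(rho=0.8, p_y_given_x=np.array([[0.25, 0.75], [0.75, 0.25]]))
grid = np.arange(0.0, 1.01, 0.1)                     # the 11-point grid per weight
best = min((expected_rmse(joint, w1, w2), w1, w2) for w1 in grid for w2 in grid)
print("best (RMSE, w1, w2):", best)
print("naive Bayes RMSE   :", expected_rmse(joint, 1.0, 1.0))
```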
3. Survey of Attribute Weighting and Selecting Methods for Naive Bayes

Attribute weighting is well understood in the context of nearest-neighbor learning methods and is used for reducing bias in high-dimensional problems due to the presence of redundant or irrelevant features (Friedman, 1994; Guyon et al., 2004). It is also used for mitigating the effects of the curse of dimensionality, which results in an exponential increase in the required training data as the number of features is increased (Bellman, 1957). Attribute weighting for naive Bayes is comparatively less explored.

Before discussing these techniques, however, it is useful to briefly examine the closely related area of feature selection for naive Bayes. As already pointed out, weighting can achieve feature selection by setting weights to either 0.0 or 1.0, and so can be viewed as a generalization of feature selection.

Langley and Sage (1994) proposed the Selective Bayes (SB) classifier, using feature selection to accommodate redundant attributes in the prediction process and to augment naive Bayes with the ability to exclude attributes that introduce dependencies. The technique is based on searching through the entire space of all attribute subsets. For that, they use a forward sequential search with a greedy approach to traverse the search space. That is, the algorithm initializes the subset of attributes to an empty set, and the accuracy of the resulting classifier, which simply predicts the most frequent class, is saved for subsequent comparison. On each iteration, the method considers adding each unused attribute to the subset on a trial basis and measures the performance of the resulting classifier on the training data. The attribute that most improves the accuracy is permanently added to the subset. The algorithm terminates when addition of any attribute results in reduced accuracy, at which point it returns the list of current attributes along with their ranks. The rank of an attribute is based on the order in which it was added to the subset.

Similar to Langley and Sage (1994), Correlation-based Feature Selection (CFS) used a correlation measure as a metric to determine the relevance of the attribute subset (Hall, 2000). It uses a best-first search to traverse the feature subset space. Like SB, it starts with an empty set and generates all possible single-feature expansions. The subset with the highest evaluation is selected and expanded in the same manner by adding single features. If expanding a subset results in no improvement, the search drops back to the next best unexpanded subset and continues from there. The best subset found is returned when the search terminates. CFS uses a stopping criterion of five consecutive fully expanded non-improving subsets.

There has been a growing trend in the use of decision trees to improve the performance of other learning algorithms, and naive Bayes classifiers are no exception. For example, one can build a naive Bayes classifier by using only those attributes appearing in a C4.5 decision tree. This is equivalent to giving zero weights to attributes not appearing in the decision tree. The Selective Bayesian Classifier (SBC) of Ratanamahatana and Gunopulos (2003) also employs decision trees for attribute selection for naive Bayes. Only those attributes appearing in the top three levels of a decision tree are selected for inclusion in naive Bayes. Since decision trees are inherently unstable, five decision trees (C4.5) are generated on samples generated by bootstrapping 10% from the training data. Naive Bayes is trained on an attribute set which comprises the union of the attributes appearing in all five decision trees.
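The forward sequential search used by the Selective Bayes classifier described above can be outlined in a few lines. The sketch below is illustrative only: the `train_accuracy` callback, which would train naive Bayes on a candidate attribute subset and return its accuracy on the training data, is a hypothetical helper rather than part of any published implementation.

```python
def selective_bayes_forward_search(n_attributes, train_accuracy):
    """Greedy forward selection: repeatedly add the attribute that most improves
    training-set accuracy; stop when no addition helps.  The order of addition
    also serves as the attribute ranking."""
    selected = []
    best_acc = train_accuracy(selected)      # empty subset: classifier predicts the majority class
    while True:
        candidates = [i for i in range(n_attributes) if i not in selected]
        if not candidates:
            break
        # Trial-add each unused attribute and keep the most accurate resulting classifier.
        acc, best_i = max((train_accuracy(selected + [i]), i) for i in candidates)
        if acc <= best_acc:                  # no attribute improves accuracy: terminate
            break
        selected.append(best_i)
        best_acc = acc
    return selected
```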
One of the earliest works on weighted naive Bayes is by Hilden and Bjerregaard (1976), who used weighting of the form of Equation 6. This strategy uses a single weight and therefore is not strictly performing attribute weighting. Their approach is motivated as a means of alleviating the effects of violations of the attribute independence assumption. Setting w to unity is appropriate when the conditional independence assumption is satisfied. However, on their data set (an acute abdominal pain study in Copenhagen by Bjerregaard et al. 1976), improved classification was obtained when w was small, with an optimum value as low as 0.3. The authors point out that if the symptom variables of a clinical field trial are not independent, but pair-wise correlated with independence between pairs, then w = 0.5 will be the correct choice, since using w = 1 would make all probabilities the square of what they ought to be. Looking at the optimal value of w = 0.3 for their data set, they suggested that out of ten symptoms, only three are providing independent information. The value of w was obtained by maximizing the log-likelihood over the entire testing sample.

Zhang and Sheng (2004) used the gain ratio of an attribute with the class labels as its weight. Their formula is shown in Equation 7. The gain ratio is a well-studied attribute weighting technique and is generally used for splitting nodes in decision trees (Duda et al., 2006). The weight of each attribute is set to the gain ratio of the attribute relative to the average gain ratio across all attributes. Note that, as a result of the definition, at least one (possibly many) of the attributes have weights greater than 1, which means that they are not only attempting to lessen the effects of the independence assumption; otherwise they would restrict the weights to be no more than one.

    w_i = \frac{GR(i)}{\frac{1}{a} \sum_{i=1}^{a} GR(i)}.    (7)

The gain ratio of an attribute is simply the Mutual Information between that attribute and the class label divided by the entropy of that attribute:

    GR(i) = \frac{I(X_i, Y)}{H(X_i)} = \frac{\sum_y \sum_{x_i} P(x_i, y) \log \frac{P(x_i, y)}{P(x_i) P(y)}}{\sum_{x_i} P(x_i) \log \frac{1}{P(x_i)}}.

Several other wrapper-based methods are also proposed in Zhang and Sheng (2004). For example, they use a simple hill climbing search to optimize the weights, using Area Under the Curve (AUC) as an evaluation metric. Another Markov-Chain-Monte-Carlo (MCMC) method is also proposed.

An attribute weighting scheme based on differential evolution algorithms for naive Bayes classification has been proposed in Wu and Cai (2011). First, a population of attribute weight vectors is randomly generated; weights in the population are constrained to be between 0 and 1. Second, typical genetic algorithmic steps of mutation and cross-over are performed over the population. They define a fitness function which is used to determine whether mutation can replace the current individual (weight vector) with a new one. Their algorithm employs a greedy search strategy, where mutated individuals are selected as offspring only if their fitness is better than that of the target individual. Otherwise, the target is maintained in the next iteration.

A scheme used in Hall (2007) is similar in spirit to SBC, where the weight assigned to each attribute is inversely proportional to the minimum depth at which it was first tested in an unpruned decision tree. Weights are stabilized by averaging across 10 decision trees learned on data samples generated by bootstrapping 50% from the training data. Attributes not appearing in the decision trees are assigned a weight of zero. For example, one can assign weight to an attribute i as:

    w_i = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{\sqrt{d_{ti}}},    (8)

where d_{ti} is the minimum depth at which attribute i appears in decision tree t, and T is the total number of decision trees generated. To understand whether the improvement in naive Bayes accuracy was due to attribute weighting or selection, a variant of the above approach was also proposed where all non-zero weights are set to one. This is equivalent to SBC except using a bootstrap size of 50% with 10 iterations.
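As a concrete illustration of the gain-ratio weighting of Equation 7, the following sketch (an illustration operating on hypothetical discretized data, not code from the paper) estimates GR(i) from empirical frequencies and normalizes by the average gain ratio across attributes.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gain_ratio(x_col, y):
    """GR(i) = I(X_i; Y) / H(X_i), estimated from empirical frequencies."""
    _, x_idx = np.unique(x_col, return_inverse=True)
    _, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1.0)
    joint /= joint.sum()
    p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
    mutual_info = entropy(p_x) + entropy(p_y) - entropy(joint.ravel())
    return mutual_info / entropy(p_x)

def gain_ratio_weights(X, y):
    """w_i = GR(i) / ((1/a) * sum_i GR(i)), as in Equation 7."""
    gr = np.array([gain_ratio(X[:, i], y) for i in range(X.shape[1])])
    return gr / gr.mean()

# Hypothetical discretized data: a noisy copy of the class, an irrelevant attribute, an exact copy.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = np.column_stack([y ^ (rng.random(500) < 0.2).astype(int),   # noisy copy of the class
                     rng.integers(0, 3, size=500),              # irrelevant attribute
                     y])                                        # exact copy of the class
print(gain_ratio_weights(X, y))   # weights above 1 flag the more predictive attributes
```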
Both SB and CFS are feature selection methods. Since selecting an optimal number of features is not trivial, Hall (2007) proposed to use SB and CFS for feature weighting in naive Bayes. For example, the weight of an attribute i can be defined as:

    w_i = \frac{1}{\sqrt{r_i}},    (9)

where r_i is the rank of the feature based on SB or CFS feature selection.

The feature weighting method proposed in Ferreira et al. (2001) is the only one to use Equation 4, weighting each attribute value rather than each attribute. They used entropy-based discretization for numeric attributes and assigned a weight to each partition (value) of the attribute that is proportional to its predictive capability of the class. Different weight functions are proposed to assign weights to the values. These functions measure the difference between the distribution over classes for the particular attribute-value pair and a "baseline class distribution". The choice of weight function reduces to a choice of baseline distribution and the choice of measure quantifying the difference between the distributions. They used two simple baseline distribution schemes. The first assumes equiprobable classes, that is, uniform class priors. In that case the weight for value j of attribute i can be written as:

    w_{ij} ∝ \left( \sum_y \left| P(y | X_i = j) - \frac{1}{|Y|} \right|^{\alpha} \right)^{1/\alpha},    (10)

where P(y | X_i = j) denotes the probability that the class is y given that the i-th attribute of a data point has value j. Alternatively, the baseline class distribution can be set to the class probabilities across all values of the attribute (i.e., the class priors). The weighting function will take the form:

    w_{ij} ∝ \left( \sum_y \left| P(y | X_i = j) - P(y | X_i \neq \text{miss}) \right|^{\alpha} \right)^{1/\alpha},    (11)

where P(y | X_i ≠ miss) is the class prior probability across all data points for which attribute i is not missing. Equations 10 and 11 assume an L_α distance metric, where α = 2 corresponds to the L_2 norm. Similarly, they have also proposed to use a distance based on the Kullback-Leibler divergence between the two distributions to set weights.
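A minimal sketch of the uniform-baseline weight function in Equation 10 follows; it assumes the per-value class distributions P(y | X_i = j) have already been estimated, and the example inputs are hypothetical.

```python
import numpy as np

def per_value_weights_uniform_baseline(p_y_given_value, alpha=2.0):
    """w_ij proportional to (sum_y |P(y | X_i = j) - 1/|Y||^alpha)^(1/alpha)   (Equation 10)."""
    # p_y_given_value: array of shape (n_values, n_classes); row j holds P(y | X_i = j).
    baseline = 1.0 / p_y_given_value.shape[1]            # equiprobable-class baseline
    diff = np.abs(p_y_given_value - baseline) ** alpha
    return diff.sum(axis=1) ** (1.0 / alpha)             # one (unnormalised) weight per value

# Hypothetical attribute with three discretized values and two classes.
p_y_given_value = np.array([[0.5, 0.5],    # value 0 carries no class information -> weight 0
                            [0.9, 0.1],    # value 1 is highly predictive
                            [0.3, 0.7]])
print(per_value_weights_uniform_baseline(p_y_given_value))
```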
Many researchers have investigated techniques for extending the basic naive Bayes independence model with a small number of additional dependencies between attributes in order to improve classification performance (Zheng and Webb, 2000). Popular examples of such semi-naive Bayes methods include Tree-Augmented Naive Bayes (TAN) (Friedman et al., 1997) and ensemble methods such as Averaged n-Dependence Estimators (AnDE) (Webb et al., 2011). While detailed discussion of these methods is beyond the scope of this work, we will describe both TAN and AnDE in Section 5.10 for the purposes of empirical comparison.

Semi-naive Bayes methods usually limit the structure of the dependency network to simple structures such as trees, but more general graph structures can also be learnt. Considerable research has been done in the area of learning general Bayesian Networks (Greiner et al., 2004; Grossman and Domingos, 2004; Roos et al., 2005), with techniques differing on whether the network structure is chosen to optimize a generative or discriminative objective function, and whether the same objective is also used for optimizing the parameters of the model. Indeed, optimizing network structure using a discriminative objective function can quickly become computationally challenging, and thus recent work in this area has looked at efficient heuristics for discriminative structure learning (Pernkopf and Bilmes, 2010) and at developing decomposable discriminative objective functions (Carvalho et al., 2011).

In this paper we are interested in improving the performance of the NB classifier by reducing the effect of attribute independence violations through attribute weighting. We do not attempt to identify the particular dependencies between attributes that cause the violations and thus are not attempting to address the much harder problem of inducing the dependency network structure. While it is conceivable that semi-naive Bayes methods and more general Bayesian Network classifier learning could also benefit from attribute weighting, we leave its investigation to future work.

A summary of the different methods compared in this research is given in Table 2.

Name        Description

Naive Bayes.
NB          Naive Bayes classifier.

Weighted naive Bayes (using typical feature weighting methods).
GRW         Use gain ratio as attribute weights in naive Bayes, as shown in Equation 7 (Zhang and Sheng, 2004).
SBC         Assign weight to attribute i as given in Equation 8, where T = 5 with a bootstrap size of 10%, and attributes appearing below the top three levels of a tree contribute zero (Ratanamahatana and Gunopulos, 2003).
MH          Assign weight to attribute i as given in Equation 8, where T = 10 with a bootstrap size of 50% (Hall, 2007).
SB          Use the Selective Bayes method to determine the rank of individual features and assign weights according to Equation 9 (Langley and Sage, 1994).
CFS         Use correlation-based feature selection to determine the rank of individual features and assign weights according to Equation 9 (Langley and Sage, 1994; Hall, 2007).

Selective naive Bayes (using typical feature selection methods).
SBC-FS      Similar to SBC except w_i = 1 if w_i > 0.
MH-FS       Similar to MH except w_i = 1 if w_i > 0 (Hall, 2007).

Weighted naive Bayes (Ferreira et al., 2001).
FNB-d1      Weights computed per attribute value using Equation 10 with α = 2.
FNB-d2      Weights computed per attribute value using Equation 11 with α = 2.

Semi-naive Bayes classifiers.
AnDE        Averaged n-Dependence Estimators (Webb et al., 2011).
TAN         Tree-Augmented Naive Bayes (Friedman et al., 1997).

State of the art classification techniques.
RF          Random Forests (Breiman, 2001).
LR          Logistic Regression (Roos et al., 2005).

Weighted naive Bayes (proposed methods, discussed in Section 4).
WANBIA-CLL  Naive Bayes weights obtained by maximizing the conditional log-likelihood.
WANBIA-MSE  Naive Bayes weights obtained by minimizing the mean squared error.

Table 2: Summary of techniques compared in this research.

4. Weighting to Alleviate the Naive Bayes Independence Assumption

In this section, we will discuss our proposed methods to incorporate weights in naive Bayes.

4.1 WANBIA

Many previous approaches to attribute weighting for naive Bayes have found weights using some form of mechanism that increases the weights of attributes that are highly predictive of the class and decreases the weights of attributes that are less predictive of the class. We argue that this is not appropriate. Naive Bayes delivers Bayes optimal classification if the attribute independence assumption holds. Weighting should only be applied to remedy violations of the attribute independence assumption. For example, consider the case where there are three attributes, x_1, x_2 and x_3, such that x_1 and x_2 are conditionally independent of one another given the class and x_3 is an exact copy of x_1 (and hence violates the independence assumption). Irrespective of any measure of how well these three attributes each predict the class, Bayes optimal classification will be obtained by setting the weights of x_1 and x_3 to sum to 1.0 and setting the weight of x_2 to 1.0. In contrast, a method that uses a measure such as mutual information with the class to weight the attributes will reduce the accuracy of the classifier relative to using uniform weights in any situation where x_1 and x_3 receive higher weights than x_2.
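This argument can be checked numerically. The sketch below uses hypothetical probability tables: with x_3 an exact copy of x_1, the weights (0.5, 1.0, 0.5) reproduce the posterior of the correct model that uses x_1 and x_2 once each, whereas uniform weights double-count x_1.

```python
import numpy as np

p_y = np.array([0.6, 0.4])                      # hypothetical P(y)
p_x1_y = np.array([[0.8, 0.3], [0.2, 0.7]])     # P(x_1 | y); rows index values, columns classes
p_x2_y = np.array([[0.4, 0.9], [0.6, 0.1]])     # P(x_2 | y)
x1, x2 = 0, 1
x3 = x1                                         # x_3 is an exact copy of x_1

def posterior(w1, w2, w3):
    score = p_y * p_x1_y[x1] ** w1 * p_x2_y[x2] ** w2 * p_x1_y[x3] ** w3
    return score / score.sum()

correct  = posterior(1.0, 1.0, 0.0)    # the correct model: x_1 and x_2 used once each
naive    = posterior(1.0, 1.0, 1.0)    # unweighted naive Bayes double-counts x_1
weighted = posterior(0.5, 1.0, 0.5)    # weights on x_1 and x_3 sum to 1.0
print(correct, naive, weighted)        # `weighted` matches `correct`; `naive` does not
```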
Rather than selecting weights based on measures of predictiveness, we suggest it is more profitable to pursue approaches such as those of Zhang and Sheng (2004) and Wu and Cai (2011) that optimize the weights to improve the prediction performance of the weighted classifier as a whole.

Following from Equations 1, 2 and 5, let us re-define the weighted naive Bayes model as:

    \hat{P}(y | x; \pi, \Theta, w) = \frac{\pi_y \prod_i \theta_{X_i = x_i | y}^{w_i}}{\sum_{y'} \pi_{y'} \prod_i \theta_{X_i = x_i | y'}^{w_i}},    (12)

with constraints:

    \sum_y \pi_y = 1  and  ∀ y, i: \sum_j \theta_{X_i = j | y} = 1,

where

• {\pi_y, \theta_{X_i = x_i | y}} are naive Bayes parameters.
• \pi ∈ [0, 1]^{|Y|} is a class probability vector.
• The matrix Θ consists of class- and attribute-dependent probability vectors \theta_{i,y} ∈ [0, 1]^{|X_i|}.
• w is a vector of class-independent weights, w_i for each attribute i.
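To illustrate how weights of this form can be learned discriminatively, the sketch below minimizes the negative conditional log-likelihood of the model over w while holding π and Θ at smoothed frequency estimates. It is a simplified illustration rather than the WANBIA implementation described later in this section; the additive smoothing, the synthetic data, and the use of scipy's L-BFGS-B optimizer are choices made for this example only.

```python
import numpy as np
from scipy.optimize import minimize

def fit_nb_parameters(X, y, n_classes, cards, smoothing=1.0):
    """Smoothed frequency estimates of pi_y and theta_{X_i = v | y}."""
    n, a = X.shape
    pi = np.array([(np.sum(y == c) + smoothing) / (n + smoothing * n_classes)
                   for c in range(n_classes)])
    theta = []
    for i in range(a):
        t = np.empty((cards[i], n_classes))
        for c in range(n_classes):
            counts = np.bincount(X[y == c, i], minlength=cards[i]) + smoothing
            t[:, c] = counts / counts.sum()
        theta.append(t)
    return pi, theta

def neg_cll(w, X, y, pi, theta):
    """Negative conditional log-likelihood of the weighted model in Equation 12."""
    log_scores = np.tile(np.log(pi), (X.shape[0], 1))              # shape (n, |Y|)
    for i in range(X.shape[1]):
        log_scores += w[i] * np.log(theta[i][X[:, i], :])          # w_i * log theta_{X_i = x_i | y}
    log_norm = np.logaddexp.reduce(log_scores, axis=1)             # log of the denominator
    return -np.sum(log_scores[np.arange(len(y)), y] - log_norm)

def learn_weights(X, y, n_classes, cards):
    pi, theta = fit_nb_parameters(X, y, n_classes, cards)
    w0 = np.ones(X.shape[1])                                       # start from standard naive Bayes
    res = minimize(neg_cll, w0, args=(X, y, pi, theta), method="L-BFGS-B")
    return res.x

# Hypothetical discretized data: four binary attributes, the last a duplicate of the first.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000)
a1 = y ^ (rng.random(1000) < 0.2).astype(int)
a2 = rng.integers(0, 2, size=1000)
a3 = y ^ (rng.random(1000) < 0.4).astype(int)
X = np.column_stack([a1, a2, a3, a1])
w = learn_weights(X, y, n_classes=2, cards=[2, 2, 2, 2])
print("learned weights:", np.round(w, 2))   # the duplicated pair typically ends up down-weighted
```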
