Combining active learning suggestions

Alasdair Tran 1,2, Cheng Soon Ong 1,3 and Christian Wolf 4,5

1 Research School of Computer Science, Australian National University, Canberra, ACT, Australia
2 Data to Decisions Cooperative Research Centre, Adelaide, SA, Australia
3 Machine Learning Research Group, Data61, CSIRO, Canberra, ACT, Australia
4 Research School of Astronomy and Astrophysics, Australian National University, Canberra, ACT, Australia
5 ARC Centre of Excellence for All-sky Astrophysics (CAASTRO), Sydney, NSW, Australia

ABSTRACT
We study the problem of combining active learning suggestions to identify informative training examples by empirically comparing methods on benchmark datasets. Many active learning heuristics for classification problems have been proposed to help us pick which instance to annotate next. But what is the optimal heuristic for a particular source of data? Motivated by the success of methods that combine predictors, we combine active learners with bandit algorithms and rank aggregation methods. We demonstrate that a combination of active learners outperforms passive learning on large benchmark datasets and removes the need to pick a particular active learner a priori. We discuss challenges to finding good rewards for bandit approaches and show that rank aggregation performs well.

Subjects: Data Mining and Machine Learning
Keywords: Active learning, Bandit, Rank aggregation, Benchmark, Multiclass classification

INTRODUCTION
Recent advances in sensors and scientific instruments have led to an increasing use of machine learning techniques to manage the data deluge. Supervised learning has become a widely used paradigm in many big data applications. This relies on building a training set of labeled examples, which is time-consuming as it requires manual annotation from human experts.

The most common approach to producing a training set is passive learning, where we randomly select an instance from a large pool of unlabeled data to annotate, and we continue doing this until the training set reaches a certain size or until the classifier makes sufficiently good predictions. Depending on how the underlying data is distributed, this process can be quite inefficient. Alternatively, we can exploit the current set of labeled data to identify more informative unlabeled examples to annotate. For instance, we can pick examples near the decision boundary of the classifier, where the class probability estimates are uncertain (i.e., we are still unsure which class the example belongs to).

Many active learning heuristics have been developed to reduce the labeling bottleneck without sacrificing the classifier performance. These heuristics actively choose the most informative examples to be labeled based on the predicted class probabilities. "Overview of Active Learning" describes two families of algorithms in detail: uncertainty sampling and version space reduction.

In this paper, we present a survey of how we can combine suggestions from various active learning heuristics. In supervised learning, combining predictors is a well-studied problem.
Many techniques, such as AdaBoost (Freund & Schapire, 1996) (which averages predictions from a set of models) and decision trees (Breiman et al., 1984) (which select one model for making predictions in each region of the input space), have been shown to perform better than just using a single model. Inspired by this success, we propose to combine active learning suggestions with bandit and rank aggregation methods in "Combining Suggestions."

The use of bandit algorithms to combine active learners has been studied before (Baram, El-Yaniv & Luz, 2004; Hsu & Lin, 2015). Borda count, a simple rank aggregation method, has been used in the context of multi-task learning for linguistic annotations (Reichart et al., 2008), where we have one active learner selecting examples to improve the performance of multiple related tasks (e.g., part-of-speech tagging and named entity recognition). Borda count has also been used in multi-label learning (Reyes, Morell & Ventura, 2018) to combine uncertainty information from multiple labels. As far as we know, other aggregation methods have not been explored, and our work is the first time that social choice theory is used to rank and aggregate suggestions from multiple active learners.

This paper makes the following two main contributions:
1. We empirically compare four bandit and three rank aggregation algorithms in the context of combining active learning heuristics. We apply these algorithms to 11 benchmark datasets from the UCI Machine Learning Repository (Lichman, 2013) and a large dataset from the Sloan Digital Sky Survey (SDSS) (Alam et al., 2015). The experimental setup and discussion are described in "Experimental Protocol, Results, and Discussion."
2. We propose two metrics for evaluation: the mean posterior balanced accuracy (MPBA) and the strength of an algorithm. The MPBA extends the metric proposed in Brodersen et al. (2010) from the binary to the multi-class setting. This is an accuracy measure that takes class imbalance into account. The strength measure is a variation on the deficiency measure used in Baram, El-Yaniv & Luz (2004), which evaluates the performance of an active learner or combiner relative to passive learning. The main difference between our measure and that of Baram, El-Yaniv & Luz (2004) is that ours assigns a higher number to better active learning methods and ensures that it is upper-bounded by 1 for easier comparison across datasets.

OVERVIEW OF ACTIVE LEARNING
In this paper we consider the binary and multiclass classification settings, where we would like to learn a classifier h, which is a function that maps some feature space $\mathcal{X} \subseteq \mathbb{R}^d$ to a probability distribution over a finite label space $\mathcal{Y}$:

$h : \mathcal{X} \to p(\mathcal{Y})$    (1)

In other words, we require that the classifier produces class probability estimates for each unlabeled example. For instance, in logistic regression with only two classes, i.e., $\mathcal{Y} = \{0, 1\}$, we can model the probability that an object with feature vector x belongs to the positive class with

$h(x; \theta) = P(y = 1 \mid x; \theta) = \frac{1}{1 + e^{-\theta^\top x}}$    (2)

and the optimal weight vector θ is learned in training.
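To make Eqs. (1) and (2) concrete, the snippet below is a minimal sketch (ours, not from the paper) showing how a logistic regression classifier exposes the class probability estimates p(Y) that the heuristics below rely on; the toy data and variable names are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data standing in for the feature space X in R^d.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Fitting learns the weight vector theta of Eq. (2).
h = LogisticRegression().fit(X, y)

# Each row of predict_proba is the class distribution p(Y | x) required by Eq. (1).
print(h.predict_proba(X[:3]))
```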
We can further consider kernel logistic regression, where the feature space X corresponds to a given kernel, allowing for non-linear decision functions.

In active learning, we use the class probability estimates from a trained classifier to estimate a score of informativeness for each unlabeled example. In pool-based active learning, where we select an object from a pool of unlabeled examples at each time step, we require that some objects have already been labeled. In practice this normally means that we label a small random sample at the beginning. These become the labeled training set $\mathcal{L}_T \subseteq \mathcal{X} \times \mathcal{Y}$ and the rest form the unlabeled set $\mathcal{U} \subseteq \mathcal{X}$.

Now consider the problem of choosing the next example in U for querying. Labeling can be a very expensive task, because it requires using expensive equipment or human experts to manually examine each object. Thus, we want to be smart in choosing the next example. This motivates us to come up with a rule s(x; h) that gives each unlabeled example a score based only on its feature vector x and the current classifier h. Recall that the classifier produces p(Y), a probability estimate for each class. We use these probability estimates from the classifier over the unlabeled examples to calculate the scores:

$s : p(\mathcal{Y}) \to \mathbb{R}$    (3)

The value of s(x; h) indicates the informativeness of example x, where bigger is better. We would then label the example with the largest value of s(x; h). This will be our active learning rule r:

$r(\mathcal{U}, h) = \operatorname*{argmax}_{x \in \mathcal{U}} s(x; h)$    (4)

Algorithm 1 outlines the standard pool-based active learning setting.

Algorithm 1: The pool-based active learning algorithm.
Input: unlabeled set U, labeled training set L_T, classifier h(x), and active learner r(U, h).
repeat
    Select the most informative candidate x* from U using the active learning rule r(U, h).
    Ask the expert to label x*. Call the label y*.
    Add the newly labeled example to the training set: L_T ← L_T ∪ {(x*, y*)}.
    Remove the newly labeled example from the unlabeled set: U ← U \ {x*}.
    Retrain the classifier h(x) using L_T.
until we have enough training examples.

Coming up with an optimal rule is itself a difficult problem, but there have been many attempts to derive good heuristics. Five common ones, which we shall use in our experiments, are described in "Uncertainty Sampling" and "Version Space Reduction." There are also heuristics that involve minimizing the variance or maximizing the classifier certainty of the model (Schein & Ungar, 2007), but they are computationally expensive. For example, in the variance minimization heuristic, the score of a candidate example is the expected reduction in the model variance if that example were in the training set. To compute this reduction, we first need to give the example each of the possible labels, add it to the training set, and update the classifier. This is expensive to run since in each iteration, the classifier needs to be retrained k × U times, where k is the number of classes and U is the size of the unlabeled pool. There are techniques to speed this up, such as using online training or assigning a score to only a small subset of the unlabeled pool. Preliminary experiments showed that these heuristics do not perform as well as the simpler ones (Tran, 2015), so we do not consider them in this paper. A more comprehensive treatment of these active learning heuristics can be found in Settles (2012).

Uncertainty sampling
Lewis & Gale (1994) introduced uncertainty sampling, where we select the instance whose class membership the classifier is least certain about. These tend to be points that are near the decision boundary of the classifier. Perhaps the simplest way to quantify uncertainty is the least confidence heuristic (Culotta & McCallum, 2005), where we pick the candidate whose most likely label the classifier is most uncertain about:

$r_{LC}(\mathcal{U}, h) = \operatorname*{argmax}_{x \in \mathcal{U}} \left( -\max_{y \in \mathcal{Y}} p(y \mid x; h) \right)$    (5)

where p(y | x; h) is the probability that the object with feature vector x belongs to class y under classifier h. For consistency, we have flipped the sign of the score function so that the candidate with the highest score is picked.
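As an illustration, here is a minimal sketch (ours, not the authors' code) of the pool-based loop of Algorithm 1 driven by the least confidence rule of Eq. (5). The `label_oracle` callback standing in for the human expert, the initial labeled seed indices, and the assumption that `X_pool` is a NumPy array are all ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confidence_scores(proba):
    # s(x; h) = -max_y p(y | x; h): the most uncertain candidate gets the largest score (Eq. 5).
    return -proba.max(axis=1)

def pool_based_active_learning(X_pool, label_oracle, seed_idx, n_queries=50):
    """Minimal sketch of Algorithm 1 using the least confidence rule."""
    labeled = list(seed_idx)                          # indices forming L_T
    labels = {i: label_oracle(i) for i in labeled}    # labels obtained from the expert
    unlabeled = [i for i in range(len(X_pool)) if i not in labels]  # indices forming U
    h = LogisticRegression().fit(X_pool[labeled], [labels[i] for i in labeled])
    for _ in range(n_queries):
        scores = least_confidence_scores(h.predict_proba(X_pool[unlabeled]))
        x_star = unlabeled[int(np.argmax(scores))]    # r(U, h) = argmax of Eq. (4)
        labels[x_star] = label_oracle(x_star)         # ask the expert for y*
        labeled.append(x_star)                        # L_T <- L_T u {(x*, y*)}
        unlabeled.remove(x_star)                      # U <- U \ {x*}
        h = LogisticRegression().fit(X_pool[labeled], [labels[i] for i in labeled])
    return h
```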
A second option is to calculate the entropy (Shannon, 1948), which measures the amount of information needed to encode a distribution. Intuitively, the closer the class probabilities of an object are to a uniform distribution, the higher its entropy will be. This gives us the heuristic of picking the candidate with the highest entropy of the distribution over the classes:

$r_{HE}(\mathcal{U}, h) = \operatorname*{argmax}_{x \in \mathcal{U}} \left( -\sum_{y \in \mathcal{Y}} p(y \mid x; h) \log\left[ p(y \mid x; h) \right] \right)$    (6)

As a third option we can pick the candidate with the smallest margin, which is defined as the difference between the two highest class probabilities (Scheffer, Decomain & Wrobel, 2001):

$r_{SM}(\mathcal{U}, h) = \operatorname*{argmax}_{x \in \mathcal{U}} \left( -\left[ \max_{y \in \mathcal{Y}} p(y \mid x; h) - \max_{z \in \mathcal{Y} \setminus \{y^*\}} p(z \mid x; h) \right] \right)$    (7)

where $y^* = \operatorname*{argmax}_{y \in \mathcal{Y}} p(y \mid x; h)$ and we again flip the sign of the score function. Since the sum of all probabilities must be 1, the smaller the margin is, the harder it is to differentiate between the two most likely labels.

An extension to the above three heuristics is to weight the score with the information density so that we give more importance to instances in regions of high density:

$s_{ID}(x; h) = \left( \frac{1}{U} \sum_{k=1}^{U} \operatorname{sim}(x, x^{(k)}) \right) s(x; h)$    (8)

where h is the classifier, s(x; h) is the original score function of the instance with feature vector x, U is the size of the unlabeled pool, and sim(x, x^(k)) is the similarity between x and another instance x^(k) using the Gaussian kernel with parameter γ:

$\operatorname{sim}(x, x^{(k)}) = \exp\left( -\gamma \, \lVert x - x^{(k)} \rVert^2 \right)$    (9)

The information density weighting was proposed by Settles & Craven (2008) to discourage the active learner from picking outliers. Although the class membership of outliers might be uncertain, knowing their labels would probably not affect the classifier performance on the data as a whole.

Version space reduction
Instead of focusing on the uncertainty of individual predictions, we could instead try to constrain the size of the version space, thus allowing the search for the optimal classifier to be more precise. The version space is defined as the set of all possible classifiers that are consistent with the current training set. To quantify the size of this space, we can train a committee of B classifiers, $\mathcal{B} = \{h_1, h_2, \ldots, h_B\}$, and measure the disagreement among the members of the committee about an object's class membership. Ideally, each member should be as different from the others as possible but still be in the version space (Melville & Mooney, 2004). In order to have this diversity, we give each member only a subset of the training examples. Since there might not be enough training data, we need to use bootstrapping and select samples with replacement. Hence this method is often called Query by Bagging (QBB).

One way to measure the level of disagreement is to calculate the margin using the class probabilities estimated by the committee (Melville & Mooney, 2004):

$r_{QBBM}(\mathcal{U}, h) = \operatorname*{argmax}_{x \in \mathcal{U}} \left( -\left[ \max_{y \in \mathcal{Y}} p(y \mid x; \mathcal{B}) - \max_{z \in \mathcal{Y} \setminus \{y^*\}} p(z \mid x; \mathcal{B}) \right] \right)$    (10)

where

$y^* = \operatorname*{argmax}_{y \in \mathcal{Y}} p(y \mid x; \mathcal{B})$    (11)

$p(y \mid x; \mathcal{B}) = \frac{1}{B} \sum_{h_b \in \mathcal{B}} p(y \mid x; h_b)$    (12)

This looks similar to one of the uncertainty sampling heuristics, except now we use p(y | x; B) instead of p(y | x; h). That is, we first average out the class probabilities predicted by the members before minimizing the margin.
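The committee construction and the margin of Eqs. (10)-(12) could be sketched as follows. This is our own illustration, assuming NumPy feature matrices and that the stratified bootstrap keeps every class present in each member's sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def qbb_margin_scores(X_train, y_train, X_unlabeled, n_members=5, seed=0):
    """Sketch of the QBB-Margin score: bag a committee, average its class
    probabilities (Eq. 12), and negate the margin (Eq. 10)."""
    rng = np.random.RandomState(seed)
    member_probas = []
    for _ in range(n_members):
        # Bootstrap so that each member sees a different subset of L_T;
        # stratifying keeps every class represented in each sample.
        Xb, yb = resample(X_train, y_train, stratify=y_train, random_state=rng)
        member = LogisticRegression().fit(Xb, yb)
        member_probas.append(member.predict_proba(X_unlabeled))
    p_B = np.mean(member_probas, axis=0)       # p(y | x; B), averaged over the committee
    top_two = np.sort(p_B, axis=1)[:, -2:]     # the two largest class probabilities
    return -(top_two[:, 1] - top_two[:, 0])    # flipped sign: argmax picks the smallest margin
```

Taking the argmax of these scores over the unlabeled pool recovers the rule r_QBBM.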
McCallum & Nigam (1998) offered an alternative disagreement measure, which involves picking the candidate with the largest mean Kullback-Leibler (KL) divergence from the average:

$r_{QBBKL}(\mathcal{U}, h) = \operatorname*{argmax}_{x \in \mathcal{U}} \left( \frac{1}{B} \sum_{b=1}^{B} D_{KL}(p_b \,\|\, p_{\mathcal{B}}) \right)$    (13)

where $D_{KL}(p_b \,\|\, p_{\mathcal{B}})$ is the KL divergence from $p_{\mathcal{B}}$ (the probability distribution that is averaged across the committee B) to $p_b$ (the distribution predicted by a member b ∈ B):

$D_{KL}(p_b \,\|\, p_{\mathcal{B}}) = \sum_{y \in \mathcal{Y}} p(y \mid x; h_b) \, \ln \frac{p(y \mid x; h_b)}{p(y \mid x; \mathcal{B})}$    (14)

For convenience, we summarize the five heuristics discussed above in Table 1.

Table 1: Summary of active learning heuristics used in our experiments.

Abbreviation | Heuristic | Objective function
CONFIDENCE | Least confidence | $\operatorname*{argmax}_{x \in \mathcal{U}} \left( -\max_{y \in \mathcal{Y}} p(y \mid x; h) \right)$
ENTROPY | Highest entropy | $\operatorname*{argmax}_{x \in \mathcal{U}} \left( -\sum_{y \in \mathcal{Y}} p(y \mid x; h) \log[p(y \mid x; h)] \right)$
MARGIN | Smallest margin | $\operatorname*{argmax}_{x \in \mathcal{U}} \left( -\left[ \max_{y \in \mathcal{Y}} p(y \mid x; h) - \max_{z \in \mathcal{Y} \setminus \{y^*\}} p(z \mid x; h) \right] \right)$
QBB-MARGIN | Smallest QBB margin | $\operatorname*{argmax}_{x \in \mathcal{U}} \left( -\left[ \max_{y \in \mathcal{Y}} p(y \mid x; \mathcal{B}) - \max_{z \in \mathcal{Y} \setminus \{y^*\}} p(z \mid x; \mathcal{B}) \right] \right)$
QBB-KL | Largest QBB KL | $\operatorname*{argmax}_{x \in \mathcal{U}} \left( \frac{1}{B} \sum_{b=1}^{B} D_{KL}(p_b \,\|\, p_{\mathcal{B}}) \right)$

Note: Notations: p(y|x; h) is the probability that an object with feature vector x has label y under classifier h, B is the set of B classifiers {h_1, h_2, ..., h_B}, Y is the set of possible labels, y* is the most certain label, U is the set of unlabeled instances, D_KL(p||q) is the Kullback-Leibler divergence of p from q, and p_B is the class distribution averaged across classifiers in B. For consistency, with heuristics that use minimization, we flip the sign of the score so that we can always take the argmax to get the best candidate.

COMBINING SUGGESTIONS
Out of the five heuristics discussed, which one should we use in practice when we would like to apply active learning to a particular problem? There have been some attempts in the literature to do a theoretical analysis of their performance. Proofs are however scarce, and when one is available, it normally only works under restrictive assumptions. For example, Freund et al. (1997) showed that the query by committee algorithm (a slight variant of our two QBB heuristics) guarantees an exponential decrease in the prediction error with the training size, but only when there is no noise. In general, whether any of these heuristics is guaranteed to beat passive learning is still an open question.

Even though we do not know which one is the best, we can still combine suggestions from all of the heuristics. This can be thought of as the problem of prediction with expert advice, where each expert is an active learning heuristic. In this paper we explore two different approaches: we can either consider the advice of only one expert at each time step (with bandit algorithms), or we can aggregate the advice of all the experts (with social choice theory).

Combining suggestions with bandit theory
First let us turn our attention to the multi-armed bandit problem in probability theory (Berry & Fristedt, 1985). The colorful name originates from the situation where a gambler
stands in front of a slot machine with R levers. When pulled, each lever gives out a reward according to some unknown distribution. The goal of the game is to come up with a strategy that can maximize the gambler's lifetime rewards. In the context of active learning, each lever is a heuristic with a different ability to identify the candidate whose labeling information is most valuable.

The main problem in multi-armed bandits is the trade-off between exploring random heuristics and exploiting the best heuristic so far. There are many situations in which we find our previously held beliefs to be completely wrong. By always exploiting, we could miss out on the best heuristic. On the other hand, if we explore too much, it could take us a long time to reach the desired accuracy.

Bandit algorithms do not need to know the internal workings of the heuristics, but only the reward received from using any of them. At each time step, we receive a reward from a heuristic, and based on the history of all the rewards, the bandit algorithm can decide on which heuristic to pick next. Formally, we need to learn the function

$b : (J_{\mathcal{R}} \times [0, 1])^n \to J_{\mathcal{R}}$    (15)

where b is the bandit algorithm, the reward is normalized between 0 and 1, $J_{\mathcal{R}}$ is the index set over the set of heuristics R, and n is the time horizon.

What would be an appropriate reward w in this setting? We propose using the incremental increase in the performance on the test set after the candidate is added to the training set. This, of course, means that we need to keep a separate labeled test set around, just for the purpose of computing the rewards. We could, as is common practice in machine learning, use cross validation or bootstrap on L_T to estimate the generalization performance. However, for simplicity of presentation we use a separate test set L_S.

Figure 1 and Algorithm 2 outline how bandits can be used in pool-based active learning. The only difference between the bandit algorithms lies in the SELECT function that selects which heuristic to use, and the UPDATE function that updates the algorithm's selection parameters when receiving a new reward.

Figure 1: Active learning pipeline with bandit algorithms. We need to collect rewards w from the test set L_S in order to decide which heuristic to choose at each time step. This routine is indicated by the red arrows. Notations: R is the set of heuristics {r_1, ..., r_R}, L_T is the training set, L_S is the test set, U is the unlabeled set, and p(Y) is the predicted class probabilities on the unlabeled data U.

Algorithm 2: Pool-based active learning with bandit theory. Note that in addition to the set of active learning heuristics R and the test set L_S, some bandit algorithms also need to know n, the maximum size of the training set, in advance.
Input: unlabeled set U, labeled training set L_T, labeled test set L_S, classifier h, desired training size n, set of active learning heuristics R, and bandit algorithm b with two functions SELECT and UPDATE.
while |L_T| < n do
    Select a heuristic r* ∈ R according to SELECT.
    Select the most informative candidate x* from U using the chosen heuristic r*(U, h).
    Ask the expert to label x*. Call the label y*.
    Add the newly labeled example to the training set: L_T ← L_T ∪ {(x*, y*)}.
    Remove the newly labeled example from the unlabeled set: U ← U \ {x*}.
    Retrain the classifier h(x) using L_T.
    Run the updated classifier on the test set L_S to compute the increase in the performance w.
    Update the parameters of b with UPDATE(w).
end
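A minimal sketch of this loop follows (our code, not the authors'). The `bandit` object with `select`/`update` methods and the `label_oracle` callback are hypothetical interfaces, each heuristic is assumed to be a function returning scores s(x; h) for the candidates (as in Table 1), and held-out accuracy on L_S stands in for the performance measure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bandit_active_learning(bandit, heuristics, X_pool, label_oracle, seed_idx,
                           X_test, y_test, n_train=200):
    """Sketch of Algorithm 2: a bandit picks a heuristic, the heuristic picks
    the candidate, and the reward is the gain in held-out performance."""
    labeled = list(seed_idx)
    labels = {i: label_oracle(i) for i in labeled}
    unlabeled = [i for i in range(len(X_pool)) if i not in labels]
    h = LogisticRegression().fit(X_pool[labeled], [labels[i] for i in labeled])
    prev_score = h.score(X_test, y_test)
    while len(labeled) < n_train:
        j = bandit.select()                              # SELECT a heuristic r* in R
        scores = heuristics[j](h, X_pool[unlabeled])     # s(x; h) for every candidate
        x_star = unlabeled[int(np.argmax(scores))]
        labels[x_star] = label_oracle(x_star)
        labeled.append(x_star)
        unlabeled.remove(x_star)
        h = LogisticRegression().fit(X_pool[labeled], [labels[i] for i in labeled])
        score = h.score(X_test, y_test)
        bandit.update(score - prev_score)                # UPDATE with reward w (may need rescaling to [0, 1])
        prev_score = score
    return h
```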
There have been some attempts to combine active learning suggestions in the literature. Baram, El-Yaniv & Luz (2004) used the EXP4 multi-armed bandit algorithm to automate the selection process. They proposed a reward called the classification entropy maximization, which can be shown to grow at a similar rate to the true accuracy in binary classification with support vector machines (SVMs). We will not compare our results directly with those in Baram, El-Yaniv & Luz (2004) since we would like to evaluate algorithms that can work with both binary and multi-class classification. Our experiments also use logistic regression, which produces probability estimates directly, rather than SVMs, which can only produce unnormalized scores. Hsu & Lin (2015) studied an improved version of EXP4, called EXP4.P, and used importance weighting to estimate the true classifier performance using only the training set. In this paper, we empirically compare the following four bandit algorithms: Thompson sampling, OC-UCB, kl-UCB, and EXP3++.

Thompson sampling
The oldest bandit algorithm is Thompson sampling (Thompson, 1933), which solves the exploration-exploitation trade-off from a Bayesian perspective.

Let $W_i$ be the reward of heuristic $r_i \in \mathcal{R}$. Observe that even with the best heuristic, we still might not score perfectly due to having a poor classifier trained on finite data. Conversely, a bad heuristic might be able to pick an informative candidate due to pure luck. Thus there is always a certain level of randomness in the reward received. Let us treat the reward $W_i$ as a normally distributed random variable with mean $\nu_i$ and variance $\tau_i^2$:

$(W_i \mid \nu_i) \sim \mathcal{N}(\nu_i, \tau_i^2)$    (16)

If we knew both $\nu_i$ and $\tau_i^2$ for all heuristics, the problem would become trivially easy, since we would just need to always use the heuristic that has the highest mean reward. In practice, we do not know the true mean of the reward $\nu_i$, so let us add a second layer of randomness and assume that the mean itself follows a normal distribution:

$\nu_i \sim \mathcal{N}(m_i, s_i^2)$    (17)

To make the problem tractable, let us assume that the variance $\tau_i^2$ in the first layer is a known constant. The goal now is to find a good algorithm that can estimate $m_i$ and $s_i^2$. We start with a prior on $m_i$ and $s_i^2$ for each heuristic $r_i$. The choice of prior does not usually matter in the long run. Since initially we do not have any information about the performance of each heuristic, the appropriate prior value for $m_i$ is 0, i.e., there is no evidence (yet) that any of the heuristics offers an improvement to the performance.

In each round, we draw a random sample $\nu_i'$ from the normal distribution $\mathcal{N}(m_i, s_i^2)$ for each i and select the heuristic $r_*$ that has the highest sampled value of the mean reward:

$r_* = \operatorname*{argmax}_{i} \nu_i'$    (18)

We then use this heuristic to select the object that is deemed to be the most informative, add it to the training set, and retrain the classifier. Next we use the updated classifier to predict the labels of objects in the test set. Let w be the reward observed. We now have a new piece of information that we can use to update our prior belief about the mean $m_*$ and the variance $s_*^2$ of the mean reward.
Using Bayes' theorem, we can show that the posterior distribution of the mean reward remains normal,

$(\nu_* \mid W_* = w) \sim \mathcal{N}(m_*', s_*'^2)$    (19)

with the following new mean and variance:

$m_*' = \frac{m_* \tau_*^2 + w s_*^2}{s_*^2 + \tau_*^2} \qquad s_*'^2 = \frac{s_*^2 \tau_*^2}{s_*^2 + \tau_*^2}$    (20)

Algorithm 3 summarizes the SELECT and UPDATE functions used in Thompson sampling.

Algorithm 3: Thompson sampling with normally distributed rewards. Notations: R is the set of R heuristics, m_i is the mean parameter of the average reward, s_i^2 is the variance parameter of the average reward, τ_i^2 is the known variance parameter of the reward, and w is the actual reward received.
function SELECT()
    for i ∈ {1, 2, ..., R} do
        ν_i' ← draw a sample from N(m_i, s_i^2)
    end
    Select the heuristic with the highest sampled value: r_* ← argmax_i ν_i'
function UPDATE(w)
    m_* ← (m_* τ_*^2 + w s_*^2) / (s_*^2 + τ_*^2)
    s_*^2 ← (s_*^2 τ_*^2) / (s_*^2 + τ_*^2)

Upper confidence bounds
Next we consider the Upper Confidence Bound (UCB) algorithms, which use the principle of "optimism in the face of uncertainty." In choosing which heuristic to use, we first estimate the upper bound of the reward (that is, we make an optimistic guess) and pick the one with the highest bound. If our guess turns out to be wrong, the upper bound of the chosen heuristic will decrease, making it less likely to get selected in the next iteration.

There are many different algorithms in the UCB family, e.g., UCB1-TUNED and UCB2 (Auer, Cesa-Bianchi & Fischer, 2002a), V-UCB (Audibert, Munos & Szepesvári, 2009), OC-UCB (Lattimore, 2015), and kl-UCB (Cappé et al., 2013). They differ only in the way the upper bound is calculated. In this paper, we only consider the last two.

In Optimally Confident UCB (OC-UCB), Lattimore (2015) suggests that we pick the heuristic that maximizes the following upper bound:

$r_* = \operatorname*{argmax}_{i} \left( \bar{w}_i + \sqrt{\frac{\alpha}{T_i(t)} \ln\left( \frac{c\,n}{t} \right)} \right)$    (21)

where $\bar{w}_i$ is the average of the rewards from $r_i$ that we have observed so far, t is the time step, $T_i(t)$ is the number of times we have selected heuristic $r_i$ before step t, and n is the maximum number of steps that we are going to take. There are two tunable parameters, α and c, which the author suggests setting to 3 and 2, respectively.

In kl-UCB, Cappé et al. (2013) suggest that we can instead consider the KL divergence between the distribution of the current estimated reward and that of the upper bound. In the case of normally distributed rewards with known variance σ², the chosen heuristic would be

$r_* = \operatorname*{argmax}_{i} \left( \bar{w}_i + \sqrt{\frac{2\sigma^2 \ln(t)}{T_i(t)}} \right)$    (22)

Algorithms 4 and 5 summarize these two UCB approaches. Note that the size of the reward w is not used in UPDATE(w) of UCB, except to select the best arm.

Algorithm 4: Optimally Confident UCB. Notations: n is the time horizon (maximum number of time steps), t is the current time step, T_i(t) counts how many times heuristic i has been selected before step t, w is the reward received, and w̄_i is the average of the rewards from r_i so far.
function SELECT()
    r_* ← argmax_i [ w̄_i + sqrt( (3 / T_i(t)) ln(2n / t) ) ]
function UPDATE(w)
    t ← t + 1
    T_*(t) ← T_*(t − 1) + 1

Algorithm 5: kl-UCB with normally distributed rewards. Notations: σ² is the variance of the rewards, t is the current time step, T_i(t) counts how many times heuristic i has been selected before step t, w is the reward received, and w̄_i is the average of the rewards from r_i so far.
function SELECT()
    r_* ← argmax_i [ w̄_i + sqrt( 2σ² ln(t) / T_i(t) ) ]
function UPDATE(w)
    t ← t + 1
    T_*(t) ← T_*(t − 1) + 1
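For concreteness, here is a sketch (ours, using a hypothetical select/update interface that matches the Algorithm 2 loop above) of the Thompson sampling updates of Eqs. (18)-(20) and the OC-UCB index of Eq. (21) with α = 3 and c = 2; rewards are assumed to be roughly in [0, 1].

```python
import numpy as np

class ThompsonSampling:
    """Sketch of Algorithm 3 with Gaussian rewards of known variance tau2."""
    def __init__(self, n_arms, tau2=1.0, prior_mean=0.0, prior_var=1.0):
        self.tau2 = tau2
        self.m = np.full(n_arms, prior_mean)   # m_i: mean of the mean reward
        self.s2 = np.full(n_arms, prior_var)   # s_i^2: variance of the mean reward
        self.last = None

    def select(self):
        sampled = np.random.normal(self.m, np.sqrt(self.s2))  # draw nu_i' (Eq. 18)
        self.last = int(np.argmax(sampled))
        return self.last

    def update(self, w):
        i = self.last  # posterior update of the chosen arm (Eq. 20)
        self.m[i] = (self.m[i] * self.tau2 + w * self.s2[i]) / (self.s2[i] + self.tau2)
        self.s2[i] = (self.s2[i] * self.tau2) / (self.s2[i] + self.tau2)

class OCUCB:
    """Sketch of Algorithm 4 (OC-UCB); the horizon n must be known in advance."""
    def __init__(self, n_arms, horizon):
        self.n = horizon
        self.t = 1
        self.counts = np.zeros(n_arms)   # T_i(t)
        self.means = np.zeros(n_arms)    # running reward average, only used in select()
        self.last = None

    def select(self):
        # Unpulled arms get an effectively infinite bonus, forcing initial exploration;
        # the bonus is well defined while t <= n.
        bonus = np.sqrt(3.0 / np.maximum(self.counts, 1e-12) * np.log(2.0 * self.n / self.t))
        self.last = int(np.argmax(self.means + bonus))
        return self.last

    def update(self, w):
        i = self.last
        self.counts[i] += 1
        self.means[i] += (w - self.means[i]) / self.counts[i]
        self.t += 1
```

Either object can be passed as the `bandit` argument of the `bandit_active_learning` sketch above; kl-UCB would only differ from OCUCB in the bonus term of select().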