A Contextual Bandit Approach for Stream-Based Active Learning

Linqi Song, Electrical Engineering Department, University of California, Los Angeles, USA. Email: [email protected]
Jie Xu, Electrical and Computer Engineering Department, University of Miami, USA. Email: [email protected]

arXiv:1701.06725v1 [cs.LG], 24 Jan 2017

Abstract—Contextual bandit algorithms – a class of multi-armed bandit algorithms that exploit the contextual information – have been shown to be effective in solving sequential decision making problems under uncertainty. A common assumption adopted in the literature is that the realized (ground truth) reward obtained by taking the selected action is observed by the learner at no cost, which, however, is not realistic in many practical scenarios. When observing the ground truth reward is costly, a key challenge for the learner is how to judiciously acquire the ground truth by assessing the benefits and costs in order to balance learning efficiency and learning cost. From the information theoretic perspective, a perhaps even more interesting question is how much efficiency might be lost due to this cost. In this paper, we design a novel contextual bandit-based learning algorithm and endow it with the active learning capability. The key feature of our algorithm is that, in addition to sending a query to an annotator for the ground truth, prior information about the ground truth learned by the learner is sent together, thereby reducing the query cost. We prove that by carefully choosing the algorithm parameters, the learning regret of the proposed algorithm achieves the same order as that of conventional contextual bandit algorithms in cost-free scenarios, implying that, surprisingly, the cost of acquiring the ground truth does not increase the learning regret in the long run. Our analysis shows that prior information about the ground truth plays a critical role in improving the system performance in scenarios where active learning is necessary.

[Fig. 1. Sequential decision making with active learning. The learner decides arm selection and whether to acquire the ground truth reward (active learning).]
I. INTRODUCTION

Contextual bandits [1][2][3] provide a powerful machine learning framework for modeling and solving a large class of sequential decision making problems under uncertainty, ranging from content recommendation, online advertising, and stream mining, to decision support for clinical diagnosis [4] and personalized education [5]. In a typical setting, a task arrives to the system with certain contextual information (e.g. the incoming user's age, gender, search and purchase history, etc. in online content recommendation); the system then pulls an arm from a possibly very large arm space (e.g. recommends a piece of online content from a large content pool). A reward is later realized depending on the context value and the selected arm. The objective of a learner (or a learning algorithm) is to make arm selection decisions based on the history of context-arm-reward realizations so as to minimize the learning regret (i.e. the gap in achievable reward compared with certain benchmarks). A common assumption made in the literature is that the reward of each task is observed by the learner at no cost, thereby allowing the learner to fully and freely utilize this information. While this assumption holds true in some application scenarios, it hardly captures the reality in many others, in which observing the ground truth reward requires substantial manpower, time, energy and/or other resources. For instance, to calculate the reward in stream mining systems, human experts are needed to manually annotate the ground truth labels of the mining tasks. Therefore, in addition to carefully deciding which arm to pull, the learner also has to actively and judiciously acquire the ground truth rewards from an annotator by assessing the benefits and costs of obtaining them in these application scenarios [6][7]. Figure 1 illustrates the considered active learning scenario.

In this paper, we design a learning algorithm, called Contextual Bandits with Active Learning (CB-AL), that accomplishes the aforementioned task. We prove that CB-AL is order-optimal in terms of the learning regret, which matches that of conventional contextual bandits in cost-free scenarios. The key to achieving the optimal regret order is that the query about the ground truth reward is sent to the annotator together with some prior information about this reward. Although the learner does not directly observe the reward realization by selecting an arm, it learns the distribution of the reward as it learns the optimal arm to pull, and this statistical information can be utilized to reduce the cost of acquiring the ground truth reward from an annotator. This is in stark contrast with the conventional active learning literature, where the cost of acquiring the ground truth is constant [8][9][10].

Our algorithm is able to effectively deal with large context and arm spaces. To this end, it divides time into epochs and adaptively partitions the context/arm spaces across epochs; the partitions become finer and finer as the epoch grows. Within each epoch, our algorithm first explores the various arm clusters (defined for each arm subspace) to learn their reward estimates for each context cluster (defined for each context subspace) and removes suboptimal arm clusters during the course of learning. When the remaining arm clusters are learned to be optimal or near-optimal, the algorithm enters an exploitation phase in which the arm cluster removal operation stops and acquiring the ground truth rewards is no longer needed for the remaining time slots of the current epoch, thereby maximizing the reward and minimizing the query cost concurrently. To optimize the overall long-term performance and minimize the long-term learning regret, our algorithm carefully designs control functions that determine which arm clusters are optimal, near-optimal and suboptimal.

The remainder of this paper is organized as follows. Section II formulates the problem and defines the learning regret. Section III describes our algorithm, whose regret performance is analyzed in Section IV. Section V provides illustrative numerical results, followed by conclusions in Section VI.
II. PROBLEM FORMULATION

A. System Model

We consider a discrete time system where time is divided into slots t = 1, 2, .... The arm space K is a bounded space with covering dimension d_K. The context space X is a bounded space with covering dimension d_X. For any context x ∈ X, the reward of selecting arm k ∈ K is r(x,k) ∈ [0,1], which is sampled according to some underlying but unknown distribution f(x,k). The expected value of r(x,k) is denoted by µ(x,k), which is unknown too. We assume that the reward value space is [0,1] for ease of exposition, but this assumption can be relaxed to account for any bounded interval.

In the conventional contextual bandits setting, the following events occur in sequence in each time slot t: (1) a context x_t ∈ X arrives; (2) an arm k_t ∈ K is selected; (3) the (ground-truth) reward r(x_t,k_t) is generated according to f(x_t,k_t) and is observed by the learner at no cost as feedback, which provides information for future arm selections.

In our considered setting, r(x_t,k_t) is not observed for free. Instead, there is a cost associated with requesting the ground truth reward. Therefore, there is a need to actively and judiciously decide when to request the ground truth reward so as to balance learning efficiency and cost minimization. Thus, in addition to deciding which arm k_t to choose, the learner also has to decide whether or not to query the ground truth reward at a cost, denoted by q_t ∈ {0,1}, where q_t = 1 stands for requesting and q_t = 0 stands for not requesting.

We consider that the query cost is not fixed, but a function of the prior information about the ground truth, which is updated as learning goes on. The intuition is that if the prior information is more informative, then the query cost should be smaller. In particular, we define the prior information about the reward r(x_t,k_t) as a tuple (a_t, b_t, δ_t), which represents that the expected reward µ(x_t,k_t) lies in the region [a_t, b_t] with probability at least 1 − δ_t. The query cost is then defined as a convex increasing function of the confidence interval b_t − a_t and the significance level δ_t of the following form:

    c_t = c[(b_t − a_t)^{β_1} + η δ_t^{β_2}]    (1)

where c > 0, β_1 ≥ 1, β_2 ≥ 1, η > 0 are constant parameters. Therefore, a larger confidence interval b_t − a_t and a smaller confidence level 1 − δ_t result in a higher query cost. We choose this form of query cost because it captures the reality to a large extent and is also amenable to our subsequent analysis. Let r̂_t denote the observed reward in time slot t, which is r̂_t = r(x_t, k_t) if q_t = 1, and r̂_t = ∅ if q_t = 0.
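For illustration, the query cost in (1) is a one-line computation from the prior-information tuple. The following minimal Python sketch shows it; the function name and the default parameter values are our own illustrative choices and are not prescribed by the paper.

```python
def query_cost(a_t, b_t, delta_t, c=1.0, beta1=1.0, beta2=1.0, eta=1.0):
    """Query cost c_t = c[(b_t - a_t)^beta1 + eta * delta_t^beta2], cf. eq. (1).

    (a_t, b_t, delta_t) is the prior information sent to the annotator: the
    expected reward lies in [a_t, b_t] with probability at least 1 - delta_t.
    The constants c > 0, beta1 >= 1, beta2 >= 1, eta > 0 are illustrative defaults.
    """
    return c * ((b_t - a_t) ** beta1 + eta * delta_t ** beta2)

# With no prior information, (a_t, b_t, delta_t) = (0, 1, 0), the cost is the full c;
# a tighter interval and a smaller delta_t reduce the cost.
assert query_cost(0.0, 1.0, 0.0) == 1.0
assert query_cost(0.4, 0.6, 0.0) < query_cost(0.0, 1.0, 0.0)
```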
Given the context arrival process, the selected arm sequence and the observed reward sequence, the history by time slot t is defined as

    h^{t−1} = {(x_1, k_1, r̂_1), ..., (x_{t−1}, k_{t−1}, r̂_{t−1})}, ∀t > 1    (2)

and h^0 = ∅ for t = 1. The set of all possible histories is denoted by H. An algorithm π is a mapping π : H × X → K × {0,1}, which selects an arm and decides whether or not to query, given the history and the current context. For ease of exposition, we separately write π_K^t = π_K(h^{t−1}, x_t) and π_q^t = π_q(h^{t−1}, x_t) for the arm selection component and the query decision component of the algorithm, respectively.

B. Learning Regret

We use the total expected payoff (i.e. the reward minus the query cost) to describe the performance of an algorithm π. The total expected payoff up to time slot T is thus

    U_π(T) = E[ Σ_{t=1}^{T} (r(x_t, k_t) − c_t q_t) ]    (3)

where the expectation is taken over the context arrival process and the reward distributions. We compare an algorithm with the static-best oracle policy π* which knows the reward distributions a priori. Therefore, in each time slot t, the oracle policy selects the arm k_t* = argmax_k µ(x_t, k) that maximizes the expected reward. Clearly, since the oracle knows the reward distributions, there is no need for it to query the ground truth to learn about them; hence q_t* = 0, ∀t. The learning regret of an algorithm π is defined as

    R_π(T) = U_{π*}(T) − U_π(T)    (4)

As a widely-adopted assumption in the contextual bandits literature [1][3], the reward function is assumed to satisfy a Lipschitz condition with respect to both the context and the arm. This assumption is formalized as follows.

Assumption 1. For any two contexts x, x′ ∈ X and any two arms k, k′ ∈ K, the expected rewards satisfy

    |µ(x,k) − µ(x′,k)| ≤ L_X ‖x − x′‖    (5)
    |µ(x,k) − µ(x,k′)| ≤ L_K ‖k − k′‖    (6)

where L_X, L_K are the Lipschitz constants for the context space and the arm space, respectively.
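For concreteness, the payoff in (3) and the regret in (4) can be estimated empirically in a simulation as sketched below. The environment and policy interfaces (env.draw_context, env.draw_reward, env.arms, policy.act, policy.update, oracle_mu) are hypothetical stand-ins of our own, not part of the paper; the expectations in (3)-(4) would be approximated by averaging over repeated runs.

```python
def run_and_score(policy, oracle_mu, env, T):
    """Accumulate the empirical payoff sum_t [r(x_t,k_t) - c_t*q_t] of `policy`
    and of the static-best oracle, and return their gap (empirical regret).

    All object interfaces here are hypothetical illustrations."""
    payoff, oracle_payoff = 0.0, 0.0
    for t in range(1, T + 1):
        x = env.draw_context()
        k, q, cost = policy.act(x)              # arm, query decision q_t, query cost c_t
        r = env.draw_reward(x, k)               # realized (ground-truth) reward
        payoff += r - cost * q
        k_star = max(env.arms, key=lambda arm: oracle_mu(x, arm))
        oracle_payoff += env.draw_reward(x, k_star)   # the oracle never queries (q*_t = 0)
        policy.update(x, k, r if q else None)         # reward is observed only if queried
    return oracle_payoff - payoff
```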
III. THE ALGORITHM

In this section, we describe the proposed contextual bandits algorithm with active learning (CB-AL).

A. Useful Notions

First, we introduce some useful notions for the algorithm.

Context/Arm Space Partition. Time slots are grouped into epochs. The i-th epoch lasts for T_i = 2^i time slots. At the beginning of each epoch, the context space and the arm space are partitioned into small subspaces. A context (arm) subspace is called a context (arm) cluster. The context/arm space partition is kept unchanged throughout the entire epoch. Formally, the partition of the context space for epoch i is denoted by P_X(i) = {X_1, X_2, ..., X_{M_i}}, consisting of M_i subspaces. Similarly, the partition of the arm space for epoch i is denoted by P_K(i) = {K_1, K_2, ..., K_{N_i}}, consisting of N_i subspaces. The radius of a context cluster X_m is half of the maximum distance between any two context points in the cluster, i.e.

    ρ_{X_m} = 0.5 sup_{x, x′ ∈ X_m} ‖x − x′‖    (7)

The radius of an arm cluster is defined similarly. The context/arm space partitioning is performed such that the context/arm clusters satisfy ρ_{X_m,i} = ρ_{X,i} = T_i^{−α} for all m ∈ {1, ..., M_i} and ρ_{K_n,i} = ρ_{K,i} = T_i^{−α} for all n ∈ {1, ..., N_i}, where α ∈ (0,1).

Active Arm Cluster. At the beginning of each epoch, all arm clusters according to the arm space partitioning are set to be active. We denote the set of active arm clusters with respect to context cluster X_m in epoch i by A_m(i). The active arm cluster set is updated as time goes by according to the learning outcome. Some arm clusters will be learned to be suboptimal and hence will be de-activated (i.e. removed from A_m(i)) and will not be selected by the algorithm in the remaining time slots of the current epoch.

Round. A round s_m(i) is defined for each context cluster X_m in each epoch i, and consists of |A_m(i)| time slots. Thus, in each round s_m(i), each active arm cluster in A_m(i) is selected once. Therefore, even within the same epoch, the length of a round s_m(i) may change due to the updating of the active arm cluster set A_m(i).

Control Functions. There are two important control functions in our algorithm. The first control function, denoted by D_1(i, s_m(i)), is used to de-activate arm clusters, with respect to each context cluster X_m, that are learned to be suboptimal, depending on the epoch index i and the round index s_m(i). In particular, D_1(i, s_m(i)) has the form

    D_1(i, s_m(i)) = ε(i) + [2D(s_m(i)) + 2L_X ρ_{X,i} + 2L_K ρ_{K,i}]

where ε(i) = L T_i^{−α} is a small positive value for epoch i, and D(s_m(i)) = sqrt( ln(2T_i^{1+γ}) / (2 s_m(i)) ). Here L = L(c) > 4L_X + 4L_K and γ ∈ (0,1) are constants.

The second control function, denoted by D_2(i, s_m(i)), is used to determine when to stop querying the ground truth reward. When the stopping condition is satisfied, the algorithm stops de-activating arm clusters and enters a pure exploitation phase for the remaining time slots of the current epoch for the context cluster X_m. In particular, D_2(i, s_m(i)) has the form

    D_2(i, s_m(i)) = 2ε(i) − [2D(s_m(i)) + 2L_X ρ_{X,i} + 2L_K ρ_{K,i}]

Sample Mean Reward. The sample mean reward of an arm cluster K_n with respect to a context cluster X_m by round s_m(i) in epoch i is denoted by r̄_{m,n}(s_m(i)). The sample mean reward of the empirically best arm cluster is r̄*_m(s_m(i)) = max_{K_n ∈ A_m(i)} r̄_{m,n}(s_m(i)).
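The following is a minimal sketch of the epoch-wise partitioning described above, under the simplifying assumption (ours, for concreteness) that contexts and arms live in the unit hypercube and that the partition is a uniform grid whose cells have sup-norm radius at most T_i^{−α}; the helper names are illustrative.

```python
import math

def epoch_length(i):
    """Length of the i-th epoch, T_i = 2^i."""
    return 2 ** i

def cells_per_dim(i, alpha):
    """Number of grid cells per dimension so that each cell of [0,1]^d has
    (sup-norm) radius at most rho_i = T_i^(-alpha)."""
    rho = epoch_length(i) ** (-alpha)
    return max(1, math.ceil(1.0 / (2.0 * rho)))

def cluster_of(point, n_cells):
    """Cluster index of a point of [0,1]^d under the uniform grid partition."""
    return tuple(min(int(coord * n_cells), n_cells - 1) for coord in point)

# Example: in epoch i = 6 (T_i = 64) with alpha = 0.25, rho_i ~ 0.354, so a
# 2-dimensional context space is split into a 2 x 2 grid of context clusters.
i, alpha = 6, 0.25
n = cells_per_dim(i, alpha)
print(n, cluster_of((0.3, 0.9), n))   # -> 2 (0, 1)
```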
B. The Algorithm

We now describe the proposed CB-AL algorithm, whose pseudo-code is provided in Algorithm 1. Figure 2 provides an illustration of the algorithm. The algorithm operates in epochs. At the beginning of each epoch, the context/arm space partitions are determined. As aforementioned, the radii of the context and arm clusters become smaller as the epoch grows and therefore the partitions of the spaces become finer and finer. All arm clusters for any context cluster are set to be active at the beginning of each epoch. In each time slot t, a context x_t arrives and the algorithm finds the context cluster X_m ∈ P_X(i) that it belongs to. Depending on which phase the algorithm is in (with respect to the context cluster X_m), different operations are carried out as follows.

[Fig. 2. Contextual bandits learning with active learning.]

Exploration. The goal of the algorithm in the exploration phase is to explore the various active arm clusters to learn their performance for X_m. Arm clusters that are learned to be suboptimal are de-activated over time, thereby improving the learning efficiency and system performance. In each round s_m(i), the algorithm selects an active arm cluster K_n ∈ A_m(i) that has not yet been selected in the current round for X_m. If all active arm clusters have been selected in the current round, then the current round s_m(i) ends and a new round begins. The algorithm then arbitrarily selects an arm k_t in the selected arm cluster K_n. In the exploration phase, a query is always sent, namely q_t = 1, together with the prior information (a_t, b_t, δ_t). The prior information is computed as follows: for round s_m(i) > 1, the prior information is

    a_t = r̄_{m,n} − 2L_X ρ_{X,i} − 2L_K ρ_{K,i} − 2D(s_m(i) − 1)    (8)
    b_t = r̄_{m,n} + 2L_X ρ_{X,i} + 2L_K ρ_{K,i} + 2D(s_m(i) − 1)    (9)
    δ_t = T_i^{−(1+γ)}    (10)

For s_m(i) = 1, the prior information is a_t = 0, b_t = 1, δ_t = 0. Once the ground truth reward r(x_t, k_t) is obtained from the annotator, the sample mean r̄_{m,n} is updated as follows:

    r̄_{m,n} ← (r̄_{m,n} · (s_m(i) − 1) + r(x_t, k_t)) / s_m(i)    (11)

At the end of a round s_m(i), the algorithm de-activates suboptimal arm clusters if necessary. Specifically, the algorithm first finds the empirically best arm cluster for X_m and calculates the sample mean reward difference between the empirically best arm cluster and any other active arm cluster, denoted by ∆_{m,n}(s_m(i)) = r̄*_m(s_m(i)) − r̄_{m,n}(s_m(i)), ∀K_n ∈ A_m(i). It then compares ∆_{m,n}(s_m(i)) with the current value of the control function D_1(i, s_m(i)). If the sample mean reward difference is greater than or equal to this value, then the corresponding arm cluster is suboptimal with high probability and hence is de-activated. Moreover, if the remaining active arm clusters have sufficiently similar sample mean reward estimates, then the algorithm stops the de-activation process for the remaining time slots of the current epoch and enters the exploitation phase. Specifically, if the reward difference of every active cluster is smaller than or equal to D_2(i, s_m(i)), then the de-activation process stops. We denote by S_m^i the number of rounds taken until the stopping condition is satisfied.

Exploitation. The goal of the algorithm in the exploitation phase is to exploit the best arms to maximize the reward. Since, in the exploitation phase, the remaining active arm clusters are the optimal arm cluster or near-optimal arm clusters for the corresponding context cluster with high probability, the algorithm simply selects an arbitrary active arm cluster K_n ∈ A_m(i) and then an arbitrary arm from K_n. Notably, the algorithm no longer requests the ground truth reward, i.e. q_t = 0, for all time slots in the exploitation phase.

Algorithm 1 Contextual Bandits with Active Learning (CB-AL)
1:  for epoch i = 0, 1, 2, ... do
2:    Initialization: Create context and arm space partitions P_X(i) and P_K(i). Set A_m(i) = P_K(i), ∀m. Set Stop_m = 0, ∀m. Set s_m(i) = 1, ∀m. Set r̄_{m,n} = 0, ∀m, n.
3:    for time slot t = 2^i to 2^{i+1} − 1 do
4:      Observe the context x_t and find m such that x_t ∈ X_m.
5:      switch Stop_m do
6:        case 0    ⊲ Exploration
7:          Select K_n ∈ A_m(i) that has not been selected in round s_m(i) and select any k_t ∈ K_n.
8:          Choose q_t = 1 and send the prior information (a_t, b_t, δ_t) to the annotator.
9:          (A query cost c_t is incurred, and the reward r̂_t = r(x_t, k_t) is observed.)
10:         Update r̄_{m,n}(s_m(i)).
11:         if round s_m(i) has finished then
12:           For any K_n ∈ A_m(i) such that ∆_{m,n}(s_m(i)) ≥ D_1(i, s_m(i)), remove K_n from A_m(i).
13:           If for all K_n ∈ A_m(i), ∆_{m,n}(s_m(i)) ≤ D_2(i, s_m(i)), then set Stop_m = 1.
14:           Update s_m(i) ← s_m(i) + 1.
15:         end if
16:       case 1    ⊲ Exploitation
17:         Select any K_n ∈ A_m(i) and any k_t ∈ K_n.
18:         Choose q_t = 0.
19:         (The reward r(x_t, k_t) is generated, but cannot be observed.)
20:   end for
21: end for
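To make the control flow of Algorithm 1 concrete, the following compact Python sketch simulates one epoch of CB-AL under simplifying assumptions of our own: finitely many pre-computed arm clusters, each represented by a single arm, a callable annotator, and the confidence radius D, the control functions D_1, D_2 and the query cost of (1) inlined so the block is self-contained. It illustrates the bookkeeping only and is not the authors' reference implementation.

```python
import math

def run_epoch(i, contexts, ctx_cluster, arm_clusters, annotate, reward,
              alpha=0.25, gamma=0.5, L=2.0, L_X=0.2, L_K=0.2,
              c=1.0, beta1=1.0, beta2=1.0, eta=1.0):
    """One simplified epoch of CB-AL. `contexts` is the stream of context arrivals,
    `ctx_cluster(x)` maps a context to its cluster id, `arm_clusters` is a dict
    {cluster_id: representative_arm}, `annotate(x, k)` returns the ground-truth
    reward, `reward(x, k)` draws the (unobserved) reward during exploitation.
    All names and constants are illustrative; note L > 4*L_X + 4*L_K here."""
    T_i = 2 ** i
    rho = T_i ** (-alpha)
    eps = L * T_i ** (-alpha)                                   # epsilon(i)
    D = lambda s: math.sqrt(math.log(2 * T_i ** (1 + gamma)) / (2 * s))
    D1 = lambda s: eps + 2 * D(s) + 2 * L_X * rho + 2 * L_K * rho
    D2 = lambda s: 2 * eps - (2 * D(s) + 2 * L_X * rho + 2 * L_K * rho)

    state = {}   # per context cluster: active set, sample means, round index, stop flag
    payoff = 0.0
    for x in contexts[:T_i]:
        m = ctx_cluster(x)
        st = state.setdefault(m, {"active": set(arm_clusters), "s": 1, "stop": False,
                                  "mean": {n: 0.0 for n in arm_clusters},
                                  "todo": set(arm_clusters)})
        if not st["stop"]:                                      # exploration phase
            n = st["todo"].pop()                                # active cluster not tried this round
            k = arm_clusters[n]                                 # any arm in the cluster
            s = st["s"]
            if s == 1:
                a, b, delta = 0.0, 1.0, 0.0                     # no prior information yet
            else:                                               # eqs. (8)-(10)
                w = 2 * L_X * rho + 2 * L_K * rho + 2 * D(s - 1)
                a, b, delta = st["mean"][n] - w, st["mean"][n] + w, T_i ** (-(1 + gamma))
            cost = c * ((b - a) ** beta1 + eta * delta ** beta2)   # eq. (1)
            r = annotate(x, k)                                     # query the annotator (q_t = 1)
            payoff += r - cost
            st["mean"][n] = (st["mean"][n] * (s - 1) + r) / s      # eq. (11)
            if not st["todo"]:                                     # round s finished
                best = max(st["mean"][n2] for n2 in st["active"])
                st["active"] = {n2 for n2 in st["active"] if best - st["mean"][n2] < D1(s)}
                if all(best - st["mean"][n2] <= D2(s) for n2 in st["active"]):
                    st["stop"] = True                              # enter exploitation
                st["s"] += 1
                st["todo"] = set(st["active"])
        else:                                                   # exploitation phase: no query (q_t = 0)
            n = next(iter(st["active"]))
            payoff += reward(x, arm_clusters[n])
    return payoff
```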
IV. REGRET ANALYSIS

To analyze the regret, we first introduce some notions.

• Cluster reward. We define the expected reward of selecting an arm cluster K_n for context cluster X_m as µ(m,n) = max_{x ∈ X_m, k ∈ K_n} µ(x,k). The reward of the optimal arm cluster with respect to the context cluster X_m is thus µ*_m = max_n µ(m,n). Furthermore, we define the reward difference as ∆_{m,n} = µ*_m − µ(m,n).

• ε-optimal arm cluster. We define the ε-optimal arm clusters with respect to the context cluster X_m as the arm clusters K_n that satisfy µ(m,n) ≥ µ*_m − ε. Similarly, the ε-suboptimal arm clusters are those that satisfy µ(m,n) < µ*_m − ε.

• Normal event and abnormal event. A normal event N_{m,n}(s_m(i)) is an event such that the reward of selecting arm cluster K_n for context cluster X_m in round s_m(i) satisfies |r̄_{m,n}(s_m(i)) − E[r_{m,n}(s_m(i))]| ≤ D(s_m(i)). An abnormal event N^c_{m,n}(s_m(i)) is an event such that |r̄_{m,n}(s_m(i)) − E[r_{m,n}(s_m(i))]| > D(s_m(i)). Further, we denote by N_{i,m,n} the event that no abnormal event N^c_{m,n}(s_m(i)) occurs with respect to arm cluster K_n and context cluster X_m during the entire epoch i.

To analyze the regret, we first provide the following lemmas.

Lemma 1. An abnormal event for arm cluster K_n in epoch i occurs with probability at most δ(i) = T_i^{−γ}.

Proof. According to the definition of an abnormal event and the Chernoff-Hoeffding bound, the probability that an abnormal event for an arm cluster occurs in round s_m(i) can be bounded by

    Pr{[N_{m,n}(s_m(i))]^C} ≤ 2 e^{−2[D(s_m(i))]^2 s_m(i)} ≤ 1 / T_i^{1+γ}    (12)

Hence, the probability that an abnormal event for arm cluster K_n occurs in epoch i is at most

    Pr{[N_{i,m,n}]^C} ≤ Σ_{s_m(i)=1}^{S_m^i} Pr{[N_{m,n}(s_m(i))]^C} ≤ Σ_{s_m(i)=1}^{S_m^i} 1/T_i^{1+γ} ≤ 1/T_i^{γ}    (13)
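For completeness, the second inequality in (12) follows immediately by substituting the definition of D(s_m(i)) into the Chernoff-Hoeffding bound; the step reads as follows.

```latex
% Substituting D(s) = \sqrt{\ln(2T_i^{1+\gamma})/(2s)} with s = s_m(i)
% into the Chernoff--Hoeffding bound:
2\exp\!\big(-2[D(s)]^2 s\big)
  = 2\exp\!\Big(-2\cdot\frac{\ln(2T_i^{1+\gamma})}{2s}\cdot s\Big)
  = 2\exp\!\big(-\ln(2T_i^{1+\gamma})\big)
  = \frac{2}{2T_i^{1+\gamma}}
  = \frac{1}{T_i^{1+\gamma}}.
```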
Lemma 2. (a) With probability at least 1 − N_i δ(i), the ε(i)-optimal arm clusters are not de-activated for context cluster X_m in epoch i. (b) With probability at least 1 − N_i δ(i), the active set A_m(i) in the exploitation phase contains only 2ε(i)-optimal arm clusters for context cluster X_m in epoch i.

Proof. If the normal event occurs, then for any de-activated arm cluster K_n we have

    r̄*_m(s_m(i)) − r̄_{m,n}(s_m(i))
      = (µ*_m − µ(m,n)) + (r̄*_m(s_m(i)) − µ*_m) + (µ(m,n) − r̄_{m,n}(s_m(i)))    (14)
      ≤ ∆_{m,n} + 2D(s_m(i)) + 2L_X ρ_{X,i} + 2L_K ρ_{K,i},

where the inequality follows from r̄*_m(s_m(i)) − µ*_m ≤ D(s_m(i)) and µ(m,n) − r̄_{m,n}(s_m(i)) ≤ D(s_m(i)) + 2L_X ρ_{X,i} + 2L_K ρ_{K,i}. Combining this with the de-activating rule, we have ∆_{m,n} > ε(i).

If the normal event occurs, then for any reserved active arm cluster K_n we also have

    r̄*_m(S_m^i) − r̄_{m,n}(S_m^i)
      = (µ*_m − µ(m,n)) + (r̄*_m(S_m^i) − µ*_m) + (µ(m,n) − r̄_{m,n}(S_m^i))    (15)
      ≥ ∆_{m,n} − 2(D(S_m^i) + L_X ρ_{X,i} + L_K ρ_{K,i}),

where the inequality follows from µ*_m − r̄*_m(S_m^i) ≤ D(S_m^i) + 2L_X ρ_{X,i} + 2L_K ρ_{K,i} and r̄_{m,n}(S_m^i) − µ(m,n) ≤ D(S_m^i). Combining this with the stopping rule, we have ∆_{m,n} ≤ 2ε(i). Since the normal event occurs with probability at least 1 − N_i δ(i), the results follow.

Now we are ready to prove the regret of CB-AL.

Theorem 1. The regret of the CB-AL algorithm can be upper-bounded by R(T) = O(T^{(d_X+d_K+1)/(d_X+d_K+2)}).

Proof. To bound the regret, we first consider the regret incurred in one epoch i, denoted by R_i. This regret can be decomposed into four terms: the regret R_i^a caused by abnormal events, the regret R_i^n caused by 2ε(i)-optimal arm cluster selections and the inaccuracy of clusters, the regret R_i^s caused by 2ε(i)-suboptimal arm cluster selections when no abnormal events occur, and the query cost R_i^q. We have

    R_i ≤ R_i^a + R_i^n + R_i^s + R_i^q    (16)

Let T_i denote the number of time slots in epoch i, T_{i,m} the number of context arrivals in context cluster X_m in epoch i, and T_{i,m,n} the number of query requests for arm cluster K_n and context cluster X_m in epoch i. We set α = 1/(d_X + d_K + 2) and γ = (d_X + d_K + 1)/(d_X + d_K + 2).

For the first term R_i^a in (16): when an abnormal event happens, the regret is at most T_i. According to Lemma 1, abnormal events happen with probability at most δ(i) for arm cluster K_n in epoch i. Therefore, the regret R_i^a can be bounded as

    R_i^a ≤ Σ_{n=1}^{N_i} δ(i) T_i ≤ N_i δ(i) T_i    (17)

For the second term R_i^n in (16): the regret of 2ε(i)-optimal arm cluster selection in each time slot is at most 2ε(i), and the regret due to the inaccuracy of clusters in each time slot is at most 2L_X ρ_{X,i} + 2L_K ρ_{K,i}. Therefore, the regret R_i^n can be bounded as

    R_i^n ≤ Σ_{t=2^i}^{2^{i+1}−1} (2ε(i) + 2L_X ρ_{X,i} + 2L_K ρ_{K,i}) = 2(ε(i) + L_X ρ_{X,i} + L_K ρ_{K,i}) T_i    (18)

For the third term R_i^s in (16): when the normal event occurs, according to Lemma 2, a 2ε(i)-suboptimal arm cluster can only be selected in the exploration phase. Hence, the regret R_i^s can be expressed as

    R_i^s ≤ E[ Σ_{∆_{m,n} > 2ε(i)} Σ_{t=2^i}^{2^{i+1}−1} ∆_{m,n} I{x_t ∈ X_m, π_K^t ∈ K_n, π_q^t = 1, N_{i,m,n}} ]    (19)

According to the de-activating rule, under normal events an arm cluster K_n is de-activated in round s whenever

    r̄*_m(s) − r̄_{m,n}(s) ≥ ∆_{m,n} − 2D(s) − 2L_X ρ_{X,i} − 2L_K ρ_{K,i} ≥ D_1(i, s)    (20)

where the first inequality always holds under normal events (cf. the proof of Lemma 2), so de-activation is guaranteed once the round index s is large enough for the second inequality to hold. Hence, the number of rounds in which an arm cluster K_n with ∆_{m,n} > 2ε(i) is explored, T_{i,m,n}, can be bounded by

    T_{i,m,n} ≤ 8 ln(2T_i^{1+γ}) / [∆_{m,n} − (ε(i) + 4L_X ρ_{X,i} + 4L_K ρ_{K,i})]^2    (21)

Therefore, the regret R_i^s can be bounded by

    R_i^s ≤ E[ Σ_{∆_{m,n} > 2ε(i)} ∆_{m,n} T_{i,m,n} ]
         ≤ Σ_{∆_{m,n} > 2ε(i)} ( 8 ln(2T_i^{1+γ}) / (∆_{m,n} − (ε(i) + 4L_X ρ_{X,i} + 4L_K ρ_{K,i}))
             + 8(ε(i) + 4L_X ρ_{X,i} + 4L_K ρ_{K,i}) ln(2T_i^{1+γ}) / [∆_{m,n} − (ε(i) + 4L_X ρ_{X,i} + 4L_K ρ_{K,i})]^2 )
         ≤ 8 M_i N_i · 2ε(i) · ln(2T_i^{1+γ}) / [ε(i) − 4L_X ρ_{X,i} − 4L_K ρ_{K,i}]^2
         ≤ C_1 M_i N_i T_i^{α} ln(2T_i^{1+γ})    (22)

where C_1 = 16L / (L − 4L_X − 4L_K)^2 is a constant.

For the fourth term R_i^q in (16), we first consider the query cost R_i^{q,1} incurred when an abnormal event occurs. In this case, since the maximum query cost per slot is 2c, the query cost can be bounded by

    R_i^{q,1} ≤ 2c N_i δ(i) T_i    (23)

Next, we consider the query cost R_i^{q,2} in the case where only normal events occur. This can be bounded by

    R_i^{q,2} ≤ E[ Σ_{m,n} Σ_{t=2^i}^{2^{i+1}−1} c_t I{x_t ∈ X_m, π_K^t ∈ K_n, π_q^t = 1} ]
      ≤ E[ Σ_{m,n} c ] + E[ Σ_{m,n} Σ_{s=2}^{S_m^i} ( c(4L_X ρ_{X,i} + 4L_K ρ_{K,i} + 4D(s−1))^{β_1} + cη T_i^{−(1+γ)β_2} ) ]
      ≤ c M_i N_i + c M_i N_i 2^{3β_1−1} (L_X ρ_{X,i} + L_K ρ_{K,i})^{β_1} S_m^i
          + c M_i N_i 2^{3β_1−1} Σ_{s=2}^{S_m^i} [ln(2T_i^{1+γ})]^{β_1/2} / [2(s−1)]^{β_1/2}
          + cη M_i N_i T_i^{−(1+γ)β_2 + 1}    (24)

where the last inequality uses the convexity of z^{β_1} (Jensen's inequality).

If 1 ≤ β_1 < 2, the third term on the right-hand side of the last inequality in (24) can be bounded by c M_i N_i 2^{5β_1/2−1} [ln(2T_i^{1+γ})]^{β_1/2} (S_m^i)^{1−β_1/2} / (1 − β_1/2), due to the bound Σ_{t=1}^{T−1} t^{−y} ≤ T^{1−y}/(1−y) on the partial sum of the divergent p-series for 0 < y < 1 [11]. If β_1 ≥ 2, the third term can be bounded by c M_i N_i 2^{5β_1/2−1} [ln(2T_i^{1+γ})]^{β_1/2} (ln S_m^i)^{β_1/2}, due to Σ_{t=1}^{T−1} t^{−y} ≤ ln T for y ≥ 1. We can also bound S_m^i (for all m) using the fact that the stopping rule is satisfied once D_1(i, S_m^i) ≤ D_2(i, S_m^i); hence S_m^i is at most the minimum s such that D_1(i, s) ≤ D_2(i, s). This gives

    S_m^i ≤ 8 T_i^{2α} ln(2T_i^{1+γ}) / (L − 4L_X − 4L_K)^2    (25)

Thus, we can bound R_i^{q,2} by

    R_i^{q,2} ≤ C_2 M_i N_i T_i^{α(2−β_1)} ln(2T_i^{1+γ}),   if 1 ≤ β_1 < 2
    R_i^{q,2} ≤ C_3 M_i N_i [ln(2T_i^{1+γ})]^{β_1},           if β_1 ≥ 2    (26)

where C_2 and C_3 are constants depending only on c, η, β_1, L, L_X and L_K (their exact expressions follow by substituting (25) and the two series bounds above).

According to the definition of covering dimensions [12], the maximum number of arm clusters in epoch i can be bounded by N_i ≤ C_K ρ_{K,i}^{−d_K}, and the maximum number of context clusters by M_i ≤ C_X ρ_{X,i}^{−d_X}, where C_K, C_X are covering constants for the arm space and the context space. Hence, the regret can be bounded by

    R(T) ≤ E[ Σ_{i=0}^{log_2 T} R_i ]
         ≤ Σ_{i=0}^{log_2 T} (δ(i) N_i + 2c δ(i) N_i + 2ε(i) + 2L_X ρ_{X,i} + 2L_K ρ_{K,i}) T_i
             + Σ_{i=0}^{log_2 T} O(1) M_i N_i T_i^{α} ln(2T_i^{1+γ})
         ≤ O(1) T^{(d_X+d_K+1)/(d_X+d_K+2)} ln(T)    (27)

Therefore, the result of Theorem 1 follows.

We further provide a lower bound. Since the proposed algorithm incurs a query cost whenever it requests a ground truth, its regret cannot be lower than that of the conventional contextual MAB setting in which no query cost is incurred [1].

Theorem 2. The regret of the CB-AL algorithm can be lower-bounded by R(T) = Ω(T^{(d_X+d_K+1)/(d_X+d_K+2)}).

Theorem 1 and Theorem 2 together show that our algorithm is order-optimal and achieves the same order as conventional contextual bandit algorithms in cost-free scenarios [1][2][3].
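The parameter choice α = 1/(d_X + d_K + 2) is what balances the dominant terms in (27). The following short calculation is our own summary of that trade-off, using M_i N_i ≤ C_X C_K T_i^{α(d_X + d_K)}.

```latex
% Per-epoch scaling of the two dominant contributions in (27):
%   near-optimality and discretization:  2\varepsilon(i)T_i + 2(L_X+L_K)\rho_i T_i = O(T_i^{\,1-\alpha}),
%   exploration and query cost:          M_i N_i T_i^{\alpha}\ln(2T_i^{1+\gamma})
%                                          = O\big(T_i^{\,\alpha(d_X+d_K+1)}\ln T_i\big).
% Equating the exponents 1-\alpha = \alpha(d_X+d_K+1) gives
%   \alpha = \frac{1}{d_X+d_K+2},
% so each epoch contributes O\big(T_i^{(d_X+d_K+1)/(d_X+d_K+2)}\ln T_i\big); summing the
% geometric sequence T_i = 2^i \le T over the O(\log_2 T) epochs yields the order in Theorem 1.
```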
V. NUMERICAL RESULTS

We conduct experiments using a synthetic dataset with 2-dimensional contexts and 2-dimensional arms. In our first experiment, we compare the performance of our proposed CB-AL algorithm with the Contextual Bandit algorithm and with Contextual Bandit Active Learning that does not use prior information (CB-AL without prior information). The result is shown in Fig. 3. As we can see, the proposed CB-AL algorithm achieves a higher payoff than the conventional contextual bandit algorithm (by 16%) and than CB-AL without prior information (by 13%) by the end of the experiment duration.

In our second experiment, we show the payoffs achieved by our proposed CB-AL algorithm when the query cost varies (by changing the cost parameter c in eq. (1)). The result is shown in Fig. 4. We can see that as c increases from 0.1 to 1, the achieved payoff decreases by 15% for T = 10000 and by 8% for T = 20000.

[Fig. 3. Comparison of performance for different algorithms: average payoff over time for the Contextual Bandit, CB-AL without prior information, and CB-AL.]

[Fig. 4. Relationship between payoff and cost: average payoff versus the query cost parameter c, for T = 10000 and T = 20000.]

VI. CONCLUSIONS

In this paper, we developed a contextual bandit learning algorithm with an active learning capability. The active learning cost is reduced by providing prior information about the reward realizations to the annotator. The algorithm maintains and updates partitions of the context and arm spaces, and operates between exploration and exploitation phases. Through precise control of the partitioning process and of when to request the ground truth of the reward, the algorithm gracefully balances the accuracy of learning and the cost incurred by active learning. We prove that the regret of the proposed algorithm achieves the same order as that of conventional contextual bandit algorithms in cost-free scenarios.

REFERENCES

[1] T. Lu, D. Pál, and M. Pál, "Contextual multi-armed bandits," in Artificial Intelligence and Statistics Conference (AISTATS), 2010, pp. 485–492.
[2] J. Langford and T. Zhang, "The epoch-greedy algorithm for multi-armed bandits with side information," in Advances in Neural Information Processing Systems, 2008, pp. 817–824.
[3] A. Slivkins, "Contextual bandits with similarity information," Journal of Machine Learning Research, vol. 15, no. 1, pp. 2533–2568, 2014.
[4] L. Song, W. Hsu, J. Xu, and M. van der Schaar, "Using contextual learning to improve diagnostic accuracy: Application in breast cancer screening," IEEE Journal of Biomedical and Health Informatics, vol. 20, no. 3, pp. 902–914, 2016.
[5] J. Xu, T. Xing, and M. van der Schaar, "Personalized course sequence recommendations," IEEE Transactions on Signal Processing, vol. 64, no. 20, pp. 5340–5352, 2016.
[6] B. Settles, "Active learning literature survey," Computer Sciences Technical Report, University of Wisconsin–Madison, 2010.
[7] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, "Active learning with statistical models," Journal of Artificial Intelligence Research, vol. 4, no. 1, pp. 129–145, 1996.
[8] M.-F. Balcan and V. Feldman, "Statistical active learning algorithms," in Advances in Neural Information Processing Systems, 2013, pp. 1295–1303.
[9] A. K. McCallum and K. Nigam, "Employing EM and pool-based active learning for text classification," in Proc. International Conference on Machine Learning (ICML), 1998, pp. 359–367.
[10] S. Dasgupta, "Analysis of a greedy active learning strategy," in Advances in Neural Information Processing Systems, vol. 17, 2004, pp. 337–344.
[11] E. Chlebus, "An approximate formula for a partial sum of the divergent p-series," Applied Mathematics Letters, vol. 22, no. 5, pp. 732–737, 2009.
[12] J. Heinonen, Lectures on Analysis on Metric Spaces. Springer Science & Business Media, 2012.
