ebook img

A Universal Probability Assignment for Prediction of Individual Sequences PDF

0.19 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview A Universal Probability Assignment for Prediction of Individual Sequences

1 A Universal Probability Assignment for Prediction of Individual Sequences Yuval Lomnitz, Meir Feder Tel Aviv University, Dept. of EE-Systems Email: [email protected] ,[email protected] Abstract—Is it a good idea to use the frequency of events in a small bias over the empirical distribution seen so far. For 3 the past, as a guide to their frequency in the future (as we all example,Laplace’sestimate forthe probabilitydistribution of 1 do anyway)? In this paper the question is attacked from the of x is 0 perspective of universal prediction of individual sequences. It is t N (x)+1 2 shownthatthereisauniversalsequentialprobabilityassignment, pˆ(x)= t−1 , (1) t suchthatforalargeclasslossfunctions(optimizationgoals),the (t 1)+ n − |X| predictorminimizingtheexpectedlossunderthisprobability,isa a where N (x) denotes the numberof times the state x appears J gooduniversalpredictor.Theproposedprobabilityassignmentis t 7 based on randomly dithering the empirical frequencies of states in xt1. While these estimators get closer with time to the inthepast,anditiseasytoshowthatrandomizationisessential. measured empirical distribution, they do not “trust” it com- 2 This yields a very simple universal prediction scheme which is pletely, and, for example, never assign a probability value 0 similar to Follow-the-Perturbed-Leader (FPL) and works for a ] to states that had not appeared before. Furthermore, in the T large class of loss functions, as well as a partial justification for probabilistic prediction setting the same distributions were using probabilistic assumptions. I . shown to performwell notonly forthe log loss: the predictor s which minimizes the expected loss under these distributions c [ I. INTRODUCTION ˆbt =argmin E l(b,X) operateswell for a wider class of 1 In this paperthe problemof universalsequentialprediction loss funcbtionsX[∼3p,ˆt(·I)II.A.2]. v of an individual unknown sequence is considered [1][2][3], Thisnaturallyle§adsto the followingquestion:is it possible 8 and a prediction approach based on universal probability to forecastan individualsequenceby first generatinga proba- 0 assignmentisproposed.Givenaspaceofstrategies ,aspace bility assignment based on the past, and then minimizing the 4 B ofnaturestates andalossfunctionl(b,x),b ,x ,the expected loss under this assignment (i.e. in a way, acting as .6 purpose is to asXsign the next strategy ˆbt given∈thBe kn∈owXledge if future events truly happen with this probability)? Consider 01 ofthepaststatesx1t−1,suchthattheoverallloss nt=1l(ˆbt,xt) prediction schemes of the following form: 13 wbeosutlfidxebde satsraytmegpytoktincoawllyn ac-lpoosestetroiotrhieafltoerssvioewbPtianignetdhebeynttihree 1) pGaesnteoraftethaepsroeqbuabenilciteyxasts−i1g,nminenatPwta(uy)(wxh)ibcahseddoeosnnthoet : sequence xn, i.e. min n l(b,x ). 1 v 1 b∈B t=1 t depend on the loss function. In the particular case of sequential probability assignment Xi under the log loss functioPn l(b,x) = logb(1x) where B is the 2) Tstoraptergedyicthtabtt munidneimritzheeslothssefeuxnpceticotendl(lbo,sxs)u,ncdheorosPe(tuh)e, r space of probability assignments on the finite alphabet , or t a X i.e.: equivalently in universal sequential compression, it is shown ˆb =argmin E [l(b,X)] (2) [3][4, 13][1, 9] that it is possible to assign probabilities t pˆ(x )§for the§next state in an arbitrary sequence of states b X∼Pt(u)(·) t t xt ,t=1,2,...,n,giventhepaststates,suchthatforany IfthereexistsasingleschemeforgeneratingP(u)()thatdoes pos∈sibXlesequence,theoverallprobabilitypˆ(x)= nt=1pˆt(xt) notdependonthelossfunctionl(b,x),butfortwhic·hˆbt yields would not be too far, in a multiplicative or logarithmic sense, a good (Hannan-consistent [1]) predictor for a certain class Q from the best i.i.d. probability assigned to the sequence a- of loss functions, then we call P(u)() a universal sequential posteriori maxp(·) nt=1p(xt). The result extends to proba- probability assignment with regatrds t·o that class. Notice that bility assigned by Markov machines or finite state machines thistermhasbeenusedinthepastwithrespecttothelog-loss, Q [5]. This problem is related to universal compression because so the definition above can be considereda natural extension. the overall compression length corresponds to log p(1x) . A Itiseasytoshowthat,iftheclassoflossfunctionsincludes remarkable feature of these universal probability as(cid:16)signm(cid:17)ents even simple loss functions such as the 0-1 loss (the number is that, although nothing is assumed about the sequence, to of errors), then no deterministic assignment can be universal, constructauniversalencoderitisenoughtoencodeasif pˆ() and therefore the Laplace or KT assignments are inadequate. t · was the true probability of the next state. However,itisshowninthispaperthattherandomassignment These universal probability assignments, such as the obtained by slightly perturbing the empirical frequencies is Laplace [4, 13.2] or Krichevsky-Trofimov (KT) [6] estima- universalfora largeclass of lossfunctions,includingthe log- § tors, have an intuitively appealing structure which induces loss and any bounded loss. 2 Inadditiontosupplyingasimpleandgeneraluniversalpre- is Hannan consistent. Rewriting the above as ˆb(FL) = t dictionscheme,thisresultalsohasinterpretationscontributing argmin Nt−1(x)l(b,x), it can be interpreted as an im- x∈X t−1 to our understanding of probability. For example, it supplies b plementationof(2)wheretheuniversalprobabilityassignment P justification for treating the statistics of a process in the past equals the empirical frequencies P(u)() = Nt−1(x). In other as a guide to its statistics in the future, without having to t · t−1 words, for this family of loss functions, there is a simple so- assume the processis indeed stationary,or thatit is driven by lutionforP(u)(), namelytheempiricaldistribution.However a “probabilistic” law. In other words, if our natural behavior t · this class of loss functions where FL is universal, is rather is in some way similar to the prediction algorithm described limited. here,thentheclaimsonitsconvergencecanbeusedtojustify For a probability assignment to be “general” enough, one this behavior. would want to cover, at the least, the family of discrete- The next section completes the problem definition and strategy, discrete-state loss functions, presented by Hannan discusses the boundaries of the solution, and relations to [7]. For this family, the loss function can be represented known results. Section III gives the main results, and Sec- by a general matrix specifying the loss for each tion IV discusses the possible implications on understanding |B|× |X| strategy and each state of nature. It is well known [1, 4] probabilistic behavior. The proofs are given in Section V. § and straightforward to see that randomization is required in ordertocoverthisclass:considerthe0-1losscase,i.e.binary II. PROBLEM STATEMENTAND DISCUSSION sequences = = 0,1 with l(b,x)=Ind(b=x), where Building upon the definitions already presented in the the total loXss isBthe n{umb}er of errors. For this lo6ss function, introduction, in this section some complementary definitions nodeterministicpredictoryieldsHannan-consistency,because are presented. We assume throughout this paper that is foreach deterministic predictorthere exists a sequence which X finite (otherwise there is no meaning to measuring empirical fails the predictor completely, by choosing the next outcome frequencies).The set of possible strategies is notrestricted. as the opposite of the predictor’schoice, while the loss of the B The loss function l(b,x) is constant over time. best fixed predictor is at most n/2. Because a deterministic Letusdefine the accumulatedlossof a sequentialpredictor P(u)() inevitably leads to a deterministic predictor (2), this ˆbt(xt1−1) as: imtplie·s a random P(u)() is required, in general. n t · Lˆ = l(ˆb ,x ), (3) Forthebinary0-1lossproblem,Feder,MerhavandGutman n t t [8]useda smallditherwhentheempiricalprobabilityisclose t=1 X to 1,whicheffectivelyavoidsadecisionwhenthefrequencies and the loss of the best fixed strategy as: 2 of0,1arenearlyequal.1Forthisspecificproblem,theoptimal n solution (in the sense of minimax regret) is known exactly L∗ =min l(b,x ). (4) n t and was presented by Cover [9]. While the optimal dither in b t=1 X this problem is different than the straight line used by Feder, The difference Lˆn −L∗n which is defined as the regret, is a Merhav and Gutman, and is not known in general, this is function of the predictor and the sequence. The worst case of no consequence in the current problem, as we are only regret is: considering Hannan consistency. This solution, as well as the =max Lˆ L∗ , (5) small bias from the empirical distribution which is required Rmax xn n− n 1 in the log-lossproblem (1), motivatesthe following choice of (cid:16) (cid:17) and the normalized regret is Rmnax. A forecasting strategy ˆbt Pt(u)(·): add a small dither to Nt−1(x) (the counts of events is said to be Hannan-consistent, if limsup Rmax 0 in the past) and re-normalize. As shown below, this solution almostsurely(theprobabilityisoverthe randno→m∞izatinonin≤the achieves Hannan-consistency for any bounded loss function forecasterifitisrandom).Thismeansthatforlargen,theloss and for the log loss. of the forecaster is essentially at least as small as that of any Theproposedforecasterisreminiscentoftheschemetermed fixed strategy. As mentioned in the introduction, the problem “Follow the Perturbed Leader” (FPL), originally proposed by addressed in this paper is of finding a sequential probability Hannan [7], in which the decision in obtained by adding assignment P(u)() such that the resulting prediction scheme a small dither to the accumulated loss of every reference (2) is Hannant-con·sistent for a large class of loss functions. strategy and then choosing the best one. Indeed, dithering We willfocusmainlyonboundingtheexpected loss(overthe the frequencies is similar, but not equivalent, to dithering the predictor’s randomization), because it also leads to almost- accumulated losses, and our proof technique for the bounded sureboundsbyapplyingthestronglawoflargenumbers.The loss case borrows from Kalai and Vempala’s [10]. Following maximum expected regret is defined as: this similarity we term the scheme proposed here “Follow the Perturbed Frequency” (FPF). Notice, however, that FPL Rmax =mxanxE Lˆn−L∗n , (6) is defined, in general, only when the number of strategies is 1 h i finite,whileFPFisdefined,ingeneral,onlywhenthenumber For some loss functions satisfying smoothness conditions of outcomes (states) is finite, and does not have to assume [1, Thm 3.1][2, Thm 1], the forecasting strategy known as “Follow the Leader” (FL), which chooses at each time the 1It is interesting to note that for the 0-1 loss problem their forecaster best strategy in retrospect ˆb(FL) = argmin t−1l(b,x ), is equivalent to a “Follow the Perturbed Leader” forecaster with a uniform t i=1 i distribution (seebelow)andalsoequivalent totheforecaster proposedhere. b P 3 the numberof strategies is finite. On the otherhand, FPL can Theorem 1. Assuming h =h tα, with α (0,1), the FPF t 1 · ∈ deal with more generalformsof the problem, includingtime- predictor is Hannan-consistentfor any bounded loss function varying loss functions. andforthelog-loss.Thereforeundertheseconditions,P(u)(x) t The problem considered here is a close relative of the defined in (7) is a universal probability assignment for the calibration problem [1, 4.5], i.e. the problem of estimating class. § from an individual sequence, probability forecasts that pass This theorem is based on the two following theorems: certain consistency tests. The problems are related in that, in bothcasesitisshownpossibletogeneratefromempiricaldata Theorem 2. Assume the loss function is bounded l(b,x) | | ≤ collectedfromanindividualsequence,probabilityassignments R. Then: thatappearto operateas well as forecastswhich are based on 1) The expected regret of FPF is upper bounded by knowledgeofthe“true”statisticalmodel.Also,randomization is essential in both cases. However, none of the problems is n 2R h−1+2R h (9) a special case of the other: the probability assignment shown Rmax ≤ t |X| n here is not necessarily calibrated, and a calibrated probability Xt=1 assignment does not necessarily satisfy the requirements of 2) Particularly, for any h = h tα, with α (0,1), the t 1 · ∈ the current problem.2 normalized expected regret 1 tends to zero with nRmax In this paper, in order to simplify matters, only fixed n. strategies are considered. As one of our motivations is to 3) For h = 2t , 1 4R 2|X|. t |X| nRmax ≤ n rationalizethebehavioroflearningprobabilitiesfromthepast, q q it is enough to consider fixed strategies in order to see the Corollary 2.1. The theorem holds under a milder condition, advantageofthisbehavior.Theextensiontodynamicreference thatthelossfunctionisboundedonlyforthesetofoptimizing strategies is unfortunately not immediate as in the setting of strategies, defined as predictionwithexpertadvice[1, 2],wheredynamicstrategies § can be turned into fixed ones by simple enumeration (i.e. = argmin λ(x)l(b,x):λ(x) 0, x:λ(x)>0 replacing the strategy with the index of the strategy), because Bopt ( b∈B ≥ ∃ ) x∈X we explicitly assume a fixed loss function. However in some X (10) cases, the core of the prediction problem lies in competing and where R = supx∈X,b∈Boptl(b,x). Particularly, the the- withfixedstrategies.Forexample,referencestrategiesdefined orem holds for the L2 norm loss, l(b,x) = b x 2 bystates (such asMarkovpredictorsorfinite state machines), for Rd ( < ), and = Rd. Inktha−t cakse can be considered as fixed strategies in each sub-sequence R =Xm⊂axx,x′∈X|Xx| x′∞2 is the Bsquared diameter of the k − k belonging to the same state. set . X Noticethatinthemostgeneralcasewithoutanylimitations III. MAIN RESULTS (such as on magnitude), it is generally impossible to devise Let Nt(x) be number of times a specific x occurred in the a universal scheme for the L2 norm loss that beats the best sequencexuptoandincludingtimet.Theuniversalsequential fixed strategy, i.e. the empirical mean up to a constant, and probability assignment is defined as: it is made possible in the current problem by the assumption that is finite. P(u)(x)=c (N (x)+h u (x)) X t t· t−1 t· t The proof of Theorem 2 is similar in spirit to the proof N (x)+h u (x) (7) t−1 t t of Kalai and Vempala [10] for the FPL forecaster, as the = · t−1+ht· x′∈X ut(x′) perturbation on Nt(x) can be translated to a perturbation on the accumulated loss. where ct = x∈X(Nt−1(x)+ht ·Put(x)) is the normalizer guaranteeing unit sum. ut(x) U[0,1] is a random dither Theorem 3. For the case of the log-loss, where b(x) is a P ∼ which is assumed to be uniformly distributed, i.i.d. over probability distribution over and l(b,x)=log 1 ,3 the different x and t (dependence over t does not affect the X b(x) expected regret of FPF satisfies: (cid:16) (cid:17) expected regret). h is a non-decreasing positive sequence. t 1) Our philosophical considerations (i.e. justifying probabilistic behavior) motivate keeping ht as general as possible rather n h 1 n 1 t than finding a specific optimal sequence h for each problem. |X| − + (11) t Rmax ≤ t t−1 +h The FPF predictor, for any loss function l(b,x) is defined t=1 t=1 ⌊|X|⌋ t X X by: 2) Particularly, for any h = h tα, with α [0,1), the b(tFPF) =argmin E [l(b,X)] (8) normalized expected retgret 11· tends t∈o zero with b∈B X∼Pt(u)(x) n. nRmax 2Consider for example the 0-1 loss problem, and a sequence containing 3) For constant ht the expected regret behaves like anequalnumberofzerosandones.Anyprobability forecaster yieldingonly O(logn) and specifically for the choice h = −1, valuesintherange0.5±ǫisǫ-calibrated,whilethedecisionsbasedonthese log(n). |X| probabilities (whenpluggedinto(2))canbearbitrary(dependingonwhether Rmax ≤|X| theprobabilityissmallerorlargerthan 0.5),andcanyieldarbitrarilybad(or good)aggregate losses. 3Alllog-sinthispaperareinthenatural base. 4 Regarding the last case, notice that this redundancy is definedconcept,andmanyattemptstoexplainorjustifyitsuse similar to the redundancy obtained with Laplace’s estima- have been made. A good introduction to these philosophical tor and approximately twice the redundancy obtained using questions can be found in [11] (for a quick overview see Kritchevsky-Trofimov’s(whichisapproximately |X|−1logn). [13, Chap. ??]). While there is no dispute on the mathemat- 2 However notice that the target of the FPF forecaster was not ical axiomatic theory dealing with probability functions, the to produce optimal redundancy for specific loss functions. meaning of probability, and the justification for using it are TheproofsofthetheoremsstatedaboveappearinSectionV questionable. below. In a nutshell, the main interpretationsto probabilityare the relativefrequencyapproach,a-prioriorlogicalapproachanthe IV. IMPLICATIONS ON THEUNDERSTANDING OF subjectivistic approach. Relative frequency theories interpret PROBABILITY probability as the limiting frequencies in very large groups of events (called “collectives”). A-priori theories interpret A. Initial probabilities probability as logical relation between sentences, and an A basic question in the application and philosophy of extension of formal logic: the attributes “true” and “false” probability theory is: where do initial probabilities originate are represented by probabilities of 1 and 0, and are extended from (see, e.g. [11]) ? The fact is, that in many situations a by adding a range of probabilities in between. Subjectivistic probabilitydistributionisdeducedfromthe relativefrequency theories interpret probability as a measure of the degree of of events in the past. While this deduction may be justified beliefofacertainpersoninacertainproposition,andtherefore basedonsomestationarityassumption,itisoftenusedexactly its value is not unique. in those situations where precise analysis of the source of A main issue in all interpretations is what probability events is not possible, and therefore the assumption that the means with respect to the future. The current results can be frequency of events in the future would be similar to their interpreted underthe frameworkof the subjectivistic theories, frequency in the past is not necessarily justified. In spite of which view probability as a tool for decision making, i.e. this, we often deduce a probability distribution based on past probabilityisjusttherelativeweightthatweputoneachfuture statistics and use this probability for decision making with event when making decisions. Because under subjectivistic regards to future events. It seems that not only humans but theories any probability is valid, there is a problem of justi- also animals use this principle [12]. fying any specific choice of a probability assignment, as well One motivation for the problem posed in Section II, of as the merit of making decisions according to probabilistic searchingforauniversalprobabilityassignment,istheattempt considerations. to justify this behavior based on mathematical, rather than The currentresults can be thoughtof as a partialresolution physical assumptions. The theory of universal prediction of to this question: the suggestion of learning probability from individualsequences,orrepetitivegames,seemsagoodframe- thepastbybiasingorditheringpastfrequencies,isagoodone work for this purpose, because it facilitates deduction from in the sense that it is better than any fixed behavior (and as a the past, without assumptions that the past indicates anything result,ofmakingdecisionsaccordingtoanyfixedprobability). with respect to the future. The existing universal prediction This demonstrates a clear merit in following probabilistic schemesarelesssuitableforthispurposesincetheydetermine considerations, which is not dependent on any assumptions the next strategy in a contrived way, as a function of the past with respect to the real world (the process x ). frequencies and the loss function, whereas in the probability- t There are some issues, however, with this interpretation. based decision making, it is assumed that there exist a single First,theproblemsettingislimited,comparedtoouractualuse “true” probability. of probability. Learning from experience extends far beyond The success of the FPF predictor for a large set of loss theframeworkofrepetitivegamesandconstantlossfunctions, functions, indicates that indeed it is useful to rely on past as we usually deduce probabilities from the past and use frequencies, and draw from them a “probability” distribution, them to solve new problems. Also the fact is assumed even if the future is arbitrary. The dither may be interpreted X discrete is somewhat limiting, although it may be sufficient as the assumption that the future would be similar but not tojustifyprobabilisticintuition,whichisfundamentallybased identical to the past, and prevents using a too “decisive” on distributions on finite sets (such as coins and dice). strategy (such as choosing ’0’ or ’1’ in the 0-1 loss case), But the main weakness of this interpretation is that it based on a small change in the frequencies. It would be relies on randomness for generating the universal probability farfetched to claim that this is the justification for using assignmentP(u) (and as a result, the claims we can make are probabilities: clearly the reason is related to the regularity also probabilistic),and so it may lead to a cyclic argumentof that many natural processes exhibit; however it supplements explaining probability by using probability. The randomness our intuitive understanding by showing that even if these used here is in a restricted form of “controlled randomness” assumptions fail, there is still benefit in learning probabilities which is generated by the forecaster. I.e. if we believe it is from the past. possible to draw random coins, it is enough for this interpre- tation to hold and be meaningful. An alternative assumption B. Meaning of probability is pseudo-randomness, i.e. assume that we can generate the In the previous section we tried to justify a specific choice dither not randomly, but such that “nature” (drawing the of a probability. However, probability itself is not a well next x ) cannot guess it, and it appears effectively random. t 5 Unfortunately,like in many other theories, we are not able to This loss could be thought of as the loss during a sequence escapesomeformof“belief”orconjecturewithrespecttothe which is an extension of the actual sequence with some future. random states. Notice the distinction between Lˆ defined in n Another way to avoid the need for randomness is to avoid (3), which is the loss of the universal predictor, and L(p) t problems such as the 0-1 loss case, in which one is forced which is the accumulated loss whose minimization yields the to bet, problems that are insolvable without randomness. For predictor. The distribution P(u) is proportional to N(p) (x), t−1 example,ifthelossisconvexwithrespecttothestrategy,then and thus the FPF forecaster is equivalent to optimizing the the loss when taking the expected value of a random strategy dithered loss: b is always better than the expected loss when b is random. In this case, the forecastercan make a deterministic decision: replace (8) withˆbt =E ˆb(tFPF) , wherethe expectedvalue is b(FPF) =argmin E [l(b,X)] t withrespecttotherandonmnessoofP(u).Thiscanbethoughtof b∈B X∼Pt(u)(x) asadifferentruleformakingdecisionsbasedonthepast:take =argmin P(u)(x)l(b,x) t as probability the empirical frequencies in the past, however b∈B Xx (14) when making a decision which changes significantly with respect to small variations in the probability,take the average =argmin c N(p) (x)l(b,x) decision overthese small variations.This rule is deterministic b∈B " t· x t−1 # X and aligns with intuition, however the restriction to “smooth” =B L(p) (b) . { t−1 } loss functions may be too limiting. Another question that would naturally arise with respect to thisexplanationishowitalignswiththefactthat,atleastinthe Notice that the constant c does not affect the minimum. theoretical application of probability theory (e.g. estimation t theory, communication theory) we do not use dithers in our 2) Bounding the expected loss: In terms of the expected probabilities. It seems that the idea of dithering the probabil- loss E Lˆ = n E l(ˆb ,x ) , onlythe marginaldistribu- ities is a similar notion to the idea of checking sensitivity of n t=1 t t a given solution to the probabilistic assumptions. In case the tionofhˆbt miattePrs,andthhereforediependencebetween ut(x) at solution to a given problem does not depend crucially on the differenttimes does not affect the expected loss. Thereforein exactprobabilityvalues,addingtheditherisindeedredundant. this section, we assume all ut are equal, ut(x)=u1(x). Ontheotherhand,ifthesolutiondependscruciallyonasmall Westartbyanalyzingaclairvoyantpredictorwhichincludes change in the probabilistic assumptions, it may be reasonable also the state x into the prediction. For this purpose, let us t to doubt its operation in the real world. define analogously to (12)-(13): V. PROOFS A. Proof of Theorem 2 N(cl)(x),N (x)+h u (x), L(cl)(b), l(b,x)N(cl)(x) The proof follows the same line of thought of Kalai- t t t t t t x Vemplala [10]: first, the regret of a clairvoyant forecaster X (15) using xt in addition to xt1−1 is bounded. Then, the difference where for t = 0 we define h0 = 0, and note that N0(x) = 0 in performance between the clairvoyant forecaster and the bydefinition,andthereforeN(cl)(x)=0andL(cl)(b)=0.We 0 0 proposed forecaster is bounded, by using the fact that some consider the loss of the predictor ˆb =B L(cl)(b) : t { t } of the ditherworksin the same directionas the the difference between them. 1) Definitions: Thecumulativelossforplayingtheconstant strategy b up to time t is Lt(b) = ti=1l(b,xt). We denote n l(B L(cl)(b) ,x ) for brevity B L(b) , argminL(b) the best strategy for { t } t { } P t=1 b X cumulative loss function L(b). n (=a) l(B L(cl)(b) ,x)(N (x) N (x)) The optimal fixed (a-posteriori) best fixed strategy is { t } t − t−1 B{Ln(b)} and has loss L∗n = Ln(B{Ln(b)}). As another Xtn=1Xx (cid:2) (cid:3) exampletoclarifythenotation,theFLpredictorcanbewritten = l(B L(cl)(b) ,x)(N(cl)(x) N(cl)(x)) as B L (b) , and the FPL predictor [10] can be written { t } t − t−1 B Lt{−1t(−b)1+p}t(b) where pt(b) is a random perturbation. Xt=1Xxn (cid:2) (cid:3) { } Let us define the dithered count at time t 1 as l(B L(cl)(b) ,x)(h u (x) h u (x)) − − { t } t t − t−1 t−1 Nt(−p)1(x),Nt−1(x)+htut(x), (12) Xt=1Xx (cid:2) (16(cid:3)) and the respective dithered accumulated loss as L(p) (b), l(b,x)N(p) (x). (13) t−1 t−1 wherein(a)weusedN (x) N (x)asanindicatorfunction Xx t − t−1 6 Ind(x =x). The first part can be bounded as: where we defined t n L = l(b,x)N (x)+ l(b,x)h u (x). (21) c t−1 t t l(B L(cl)(b) ,x)(N(cl)(x) N(cl)(x)) { t } t − t−1 Xx xX6=xt t=1 x XXn (cid:2) (cid:3) Noticing that the common part Lc is independent of u(xt), = L(cl)(B L(cl)(b) ) L(cl) (B L(cl)(b) ) we compute the conditional expectation given Lc for each of t { t } − t−1 { t } the predictors: t=1 Xn (cid:2) (cid:3) (≤a) L(tcl)(B{L(tcl)(b)})−L(tc−l)1(B{L(tc−l)1(b)}) E l(B{L(tp−)1(b)},xt) Lc (=b)LXt=(c1l)(cid:2)(B L(cl)(b) ) L(cl)(B L(cl)(b) ) (cid:3) (17) =hE l(B{Lc+htu((cid:12)(cid:12)(cid:12)xt)li(b,xt)},xt) Lc n { n } − 0 { 0 } h1 (cid:12) i (22) =L(ncl)(B{L(ncl)(b)}) = l(B{Lc+htl(b,xt)v},xt)dv(cid:12)(cid:12) L(cl)(B L (b) ) Zv=0 ≤ n { n } 1 =L (B L (b) )+h l(B L (b) ,x)u (x) = g(v)dv, n n n n n { } { } Zv=0 Xx where we defined for brevity g(v) = l(B L + ≤Ln(B{Ln(b)})+R|X|hn htl(b,xt)v ,xt), and { c =L∗ +R h , } n |X| n E l(B L(cl)(b) ,x ) L where we used (a) the fact that B L(cl) (b) is optimized for { t } t c L(tc−l)1 and (b) the sum of the telesc{optic−1serie}s. For the second =hE l(B{Lc+(htu(cid:12)(cid:12)(cid:12)(xti)+1)l(b,xt)},xt) Lc sum in (16), let us use the assumption ut =u1. Then: h1 (cid:12) i n = l(B{Lc+(htv+1)l(b,xt)},xt)dv(cid:12)(cid:12) l(B L(cl)(b) ,x)(h u (x) h u (x)) Zv=0 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)=Xt=(cid:12)1Xnx (cid:2) l({B{tL(tcl)(}b)},x)t(htt−h−t−1t)−u11(tx−)1(cid:12) (cid:3)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ==Zv=11+0hl−t(B1l{(LBc+Lh+t(vh+l(bh,−tx1))lv(b,,xxt))}d,vxt)dv (23) (cid:12)(cid:12)(cid:12)(cid:12)Xtn=1Xx R(cid:2)ht ht−1 (cid:3)(cid:12)(cid:12)(cid:12)(cid:12) (18) Zv=1+hh−t−t11 { c t t } t ≤ | − | = g(v)dv t=1 x XXn Zv=h−t1 (a) = R (h h ) The integrands in (22),(23) are equal. Let us temporarily t t−1 |X| − Xt=1 assume that for t≥1, ht >1, so that the integration regions (b) partially overlap. For most of the integration region, because = R h , n |X| the integrands are the same (no matter what l(,x ) evaluates t · wherewe used (a)the assumptionthatthe sequence ht is non to), and the integration regions overlap, they cancel out, and decreasing,and(b)thedefinitionh0 =0.Combining(18)and we remain with the contribution of the edges where there is (17) into (16) yields: no overlap: n E l(B L(p) (b) ,x ) l(B L(cl)(b) ,x ) L Xt=1l(B{L(tcl)(b)},xt)≤L∗n+2R|X|hn (19) (2h2),(23){ t1−1 } t − 1+h{−t1t } t (cid:12)(cid:12) ci = g(v)dv g(v)dv (cid:12) bfoertTewhceaeesntenrethxbet(FcsPtlFea)pirv=iosyBatnotLpb(rope)udn(idbc)totr.heBTh{peLer(tkfcole)yr(mb)ias}nctahenadtdittfhhfeeereFnnePcwFe = h−tZ1vg=(0v)dv −1Z+vh=−th1−tg1(v)dv (24) t { t−1 } − element added to L(tcl)(b) is l(b,xt), and the dither element Z0 Z1 u(xt)(i.e.belongingtothestatethatactuallyhappenedattime g(v)dv t) contributes an offset in the same direction, which cancels ≤Z[0,h−t1]∪[1,1+h−t1]| | this addition or most values of u(xt). For this purpose let us ≤2·R·h−t 1. write the accumulated losses as: Recall that we assumed h 1. For h 1 the bound (24) is t t ≥ ≤ L(p) (b)= l(b,x)(N (x)+h u (x)) trivially true (because the RHS is at least 2R), and therefore t−1 t−1 t t x it holds for all ht. X =L +h u(x )l(b,x ) Applying the iterated expectations law and accumulating c t t t (24) yields: L(cl)(b)= l(b,x)(N (x)+h u (x)) (20) t t t t n n n x X E l(B L(p) (b) ,x ) l(B L(cl)(b) ,x ) 2R h−1 =L(tp−)1(b)+l(b,xt) " { t−1 } t − { t } t #≤ t t=1 t=1 t=1 =L +(h u(x )+1)l(b,x ) X X (2X5) c t t t 7 which, together with (19) yields: B. Proof of Theorem 3 (Log loss) n The sequence is xt, t=1,...,n. The accumulatedloss for E l(B L(p) (b) ,x ) L∗ probabilityq is n log 1 .Thebestfixedq inhindsight "Xt=1 n { t−1 } t #− nn is qt(x)=Pˆtx(x)Patn=d1yieldqst(Lxt∗n) =n xPˆx(x)lotgPˆx1(x). For =E"Xt=1l(B{L(tp−)1(b)},xt)−Xt=1l(B{L(tcl)(b)},xt)# hthteutu(nxiv))e,rsaanldesittimisaetoasryprtoopsoeseedth:aPtt(guPi)v(exn)P=t(uc)−t(1x·)(,Nthte−1c(hxo)ic+e n of q (), the probability distribution for the next state is just t +E"Xt=1l(B{L(tcl)(b)},xt)#−L∗n qt(·)=· argqminPt(uE)(x)logq(1X) =Pt(u)(x) n (25),(19) ≤ 2R h−t 1+|X|hn!,∆, E[Lˆ ]=E n log 1 Xt=1 (26) n t=1 Pt(u)(xt) X n which proves the first claim of the theorem. =s [Elog(c ) Elog(N (x )+h u (x ))] t t−1 t t t t − 3) Choices of the dither amplitude sequence: There re- t=1 X (30) mains the question of selecting the sequence of dither ampli- tudes h . For a given horizon n, a simple calculation shows, t In general, in order to achieve a small regret for the log that the best choice in terms of minimizing ∆ is a constant loss,itisrequiredthattheoverallcontributionof P(u)(x )for h , whichequals n . Aswill beseen below,a goodchoice t t t |X| all occurrences of a certain state x = x, would approximate t of a varying ht thqat yields an infinite horizon solution (i.e. in Pˆx(x). However the most important property, which is not which the setting of ht does not depend on the horizon n) is satisfied by FL, is not to give a probability too close to 0 ht = |2Xt|. However, because in real life we do not choose for a certain state x on its first appearance in the sequence. an“opqtimal”ht,itisfirstdesiredtoshowthatforawiderange I.e. if Nt−1(xt) = 0, it is required that Elog(Nt−1(xt) + of choices, the resulting predictor’s expected regret tends to h u (x )) = Elog(h u (x )) is finite. Indeed it is easy to t t t t t t zero. This analysis is rather straightforward and reoccurs in verify that this holds. many developments of this kind [1, Ex 4.7][10]. So, let us Following is the detailed calculation and bounding for choose h =h tα, with α (0,1). Then, E[Lˆ ]. The normalized c is bounded as: t 1 n t · ∈ n n n h−1 =h−1 t−α h−1 1+ x−αdx t=1 t 1 t=1 ≤ 1 (cid:18) Zx=1 (cid:19) ct = (Nt−1(xt)+htut(x))≤t−1+|X|ht (31) X X x 1 1 X =h−1 1+ (n1−α 1) h−1 n1−α 1 1 α − ≤ 1 1 α In the below, denote for conciseness N (x )=v (cid:18) − (cid:19) − t−1 t (27) Elog(N (x )+h u (x )) t−1 t t t t Substituting in (26) yields: 1 =Elog(v+h u (x ))= log(v+h y)dy t t t t ∆ 2R n Z0 n = n h−t 1+|X|·hn! =h−1 v+htlog(y)dy t=1 (28) t X 1 Zv ≤2R h−111 αn−α+|X|·h1nα−1 =h−t 1[ylogy−y]vv+ht (cid:18) − (cid:19) =h−1[(v+h )log(v+h ) vlogv h ] HFoarnnaanny’sαco∈nsi(s0t,e1n)cya.nItdisansytrahig1h,ttfhoirswayrideldtoss∆enent−h→a→t∞th0e,bie.est. (=a)ht−t 1 htlogt(v+ht)+vtlo−g 1+ hvt−−tht (32) choice is obtained by α= 12 and h1 =q|X2|, which yields: logx≥≥1+x(cid:2)x h−t 1"htlog(v+ht)+(cid:0)v1+hvth(cid:1)vt −h(cid:3)t# ∆n = √2Rn 2h−11+|X|·h1 =4Rr2|nX| (29) =log(v+ht)− v+htht (cid:0) (cid:1) ht =log(N (x )+h ) . 4) Proof of Corollary 2.1: To prove the corollary is it t−1 t t − N (x )+h t−1 t t sufficient to notice that all strategies for which the loss is computedintheproofofTheorem2,areintheaforementioned In (a) notice that v = 0 is a special case. using 0log0 = 0, set of optimizing strategies. For the L loss it is easy to see it is easy to verify that in this case the the expression before 2 that the set of optimizing strategies is the convex hull of (a) for v =0 equals log(ht) 1, and therefore the inequality X − (the strategy for given λ(x) can be interpreted as a center of holds. mass of with varying weights to the different points). Returning to (30): X 8 with (36) yields: n E[Lˆ ] L∗ E[Lˆ ]= [Elog(c ) Elog(N (x )+h u (x))] n − n n t − t−1 t t t n Xt=1 (log(t 1+ ht) log(t)) n ≤ − |X| − t=1 log(t 1+ h ) X t ≤ − |X| + log(N (x)+1) t=1 t−1 X n ht xX∈Xt:Xxt=xh +Xt=1(cid:20)−log(Nt−1(xt)+ht)+ Nt−1(xt)+ht(cid:21) −log(Nt−1(x)+ht)+ Nt−1(hxt)+ht n n i = log(t−1+|X|ht) = log 1+ |X|ht−1 t=1 t X t=1(cid:18) (cid:18) (cid:19)(cid:19) h X + −log(Nt−1(x)+ht)+ N (xt)+h + log 1+ 1−ht xX∈Xt:Xxt=x(cid:20) t−1 (33) t(cid:21) xX∈Xt:Xxt=xh (cid:18) Nt−1(x)+ht(cid:19) (37) h t + N (x)+h Forthesecondsum,whichwasbrokenintothesubsequenes t−1 t n i in which a specific state x appears, notice that Nt−1(x) |X|ht−1 increases by 1 between consecutive elements of the internal ≤ t t=1 sum. The final value of N (x) in the last element equals X t−1 1 h h the total number of apperances of x, N (x). At this point, it + − t + t n N (x)+h N (x)+h is beneficial to write L∗n in a similar form: xX∈Xt:Xxt=x(cid:20) t−1 t t−1 t(cid:21) n h 1 1 t 1 = |X| − + L∗n =nXx Pˆx(x)logPˆnx(x) Xt=n1 ht 1 xX∈nXt:Xxt=x1Nt−1(x)+ht t = Nn(x)log (34) = |X| − + N (x) t N (x )+h n t−1 t t x t=1 t=1 X X X =nlogn Nn(x)logNn(x) Let us consider the sequence x that maximizes the second − x sum. As clear intuitively, and will be proven below, this X sequence selects all states of x in a round-robin fashion, A consequence of the bound on the size of a type class which minimizes the growth rate of N (x ) and for which exp(nH(P)) [14, Lemma II.2] is (considering the t−1 t t|TypPe| d≤efined by the sequence x, ...,Nn(x),... and taking Nt−1(xt)=⌊t|−X1|⌋. n First,foragiventype(i.e.forgiven Nn(x) x∈X),consider the log of both sides): (cid:16) (cid:17) theorderthatwouldyieldthemaximum{.Itisc}learthatthem- thoccurrencesofdifferentstates(assumingthesestatesindeed n! N (x) n log n n log occuratleastmtimes)shouldoccuratconsecutivet-s.Inother (cid:18) x∈X Nn(x)!(cid:19)≤ x∈X n Nn(x) words, the sequence Nt−1(xt) is non decreasing. Suppose X that the opposite occurs, i.e. N (x ) > N (x ), then Q =nlogn N (x)logN (x) t−2 t−1 t−1 t n n − obviously x =x . Let us flip the order of these states, i.e. x∈X t 6 t−1 X (35) let x′t = xt−1 and x′t−1 = xt, then as a result the counts will also flip, N′ (x′ ) = N (x ) and N′ (x′) = t−2 t−1 t−1 t t−1 t Plugging into (34) yields: Nt−2(xt−1).Thisiseasiesttoseeviaanexample:supposethe sequence is x=(c,a,a,b,c,c), then the countsN (x ) are t−1 t n! 0,0,1,0,1,2. After flipping the states t = 3,4 the sequence L∗n ≥log N (x)! is x′ = (c,a,b,a,c,c) and the counts are 0,0,0,1,1,2. The (cid:18) x∈X n (cid:19) (36) elementspertainingtoothertimesarenotaffectedbythisflip, n Nn(x) Q while the sum of the two elements is now: = log(t) log(m) − t=1 x∈X m=1 1 1 X X X + N′ (x′ )+h N′ (x′)+h Notice that this way of bounding L∗ is slightly non stan- t−2 t−1 t−1 t−1 t t n 1 1 dard: rather than writing L∗ in a similar form to L , it = + (38) n U N (x )+h N (x )+h would generally be simpler to write the bound on E[Lˆ ] t−1 t t−1 t−2 t−1 t n 1 1 usingfactorials,andsimplifyitusingStirling’sapproximation, > + N (x )+h N (x )+h obtaining a form similar to (34), however this approach does t−2 t−1 t−1 t−1 t t not hold for varying ht. where the inequality holds because ht ht−1 and ≥ Let us now assume h is non-decreasing. Combining (33) N (x )>N (x ).Thiscanbeseenbydirectalgebraic t t−2 t−1 t−1 t 9 manipulation,or by using Lemma 1 with the convexfunction as well. A more detailed evaluation yields: f(x)= 1: x n 1 n 1 ∆=( h 1) + Lemma 1. Let f be a convex- function and a0,a1,b0,b1 |X| − t t−1 +h satisfy a a and b b , then∪ Xt=1 Xt=1 ⌊|X|⌋ 1 ≥ 0 1 ≥ 0 n 1 n 1 ( h 1) 1+ dt + dt ≤ |X| − t t 1+h f(a +b )+f(a +b ) f(a +b )+f(a +b ) (39) (cid:18) Z1 (cid:19) Zt=0 |X| − 0 0 1 1 0 1 1 0 ≥ n =( h 1)(1+log(n))+ log −|X| +1 |X| − |X| h I.e. the maximum sum is obtained by joining the smaller and (cid:18) |X| (cid:19) h=|X|−1 bigger elements together. = log(n +1) |X| −|X| log(n). Proof: Let us assume there is strict inequality at least in ≤|X| (44) one of the pairs, otherwise the result holds trivially.Write the hybridsumsasconvexcombinationsofthehomogenoussums: This ends the proof of Theorem 3. a +b =λ(a +b )+(1 λ)(a +b ) 0 1 0 0 1 1 C. Proof of Theorem 1 − (40) a +b =(1 λ)(a +b )+λ(a +b ), 1 0 − 0 0 1 1 Both Theorem 2 and Theorem 3 show the normalized expected regret tends to 0 with n, and it remains to change with from claims on expected regret to claims on the almost-sure a a regret. To prove that the regret tends to 0 almost surely, or λ= a1−a10−+b01−b0 ∈[0,1]. (41) emsotarbelipsrheecdisfealyc,tltihmatsluimpns→up∞(Lˆn−(EL[Lˆ∗n)]≤0L,∗u)sing0,thiteraelmreaaidnys to show that 1(Lˆ E[Lˆ ]n)→∞ 0 anlm−ostnsu≤rely. From (40) and the convexity of f: n n− n n−→→∞ n 1 1 f(a +b ) λf(a +b )+(1 λ)f(a +b ) (Lˆ E[Lˆ ])= l(ˆb ,x ) E[l(ˆb ,x )] (45) 0 1 0 0 1 1 n n t t t t ≤ − (42) n − n − f(a1+b0) (1 λ)f(a0+b0)+λf(a1+b1). Xt=1(cid:16) (cid:17) ≤ − Toshowthatthemeanaboveconvergesto0almostsurely,we Summing the two equations in (42) yields the desired result use Kolmogorov’scriterion for the applicability of the Strong (39). Law of Large numbers [15]. The elements of the sequence γ = l(ˆb ,x ) E[l(ˆb ,x )] have zero mean, and are inde- The conclusion is that the sequence Nt−1(xt) is non de- petndent, tbectau−se eachtˆb tdepends only on the deterministic creasing.Next,isitobviousthattoincreasethesum,allstates t in x should be chosen approximately the same number of history of the sequence, and on ut(x), which are assumed independent. Notice that γ are not identically distributed. times. If at some point, a state x has been chosen for the t m-th time at time t, while another state x′ did not appear Kolmogorov’s criterion requires that ∞t=1 Vart(2γt) < ∞. This holds trivially for bounded loss functions, for which the m 2timesat thispoint,thenbecauseofthe monotonicityof P N− (x ), x′ can never appear again in an optimal sequence. boundness of γt yields a constant bound on its variance. t−1 t | | Clearly, choosing x′ instead of x at time t would decrease Proving that this condition holds for the case of the log loss function is a rather technical calculation which is deferred to N (x ) for time t as well as for all future occurrences of t−1 t Appendix- A. (cid:3) the state x, and therefore will increase the sum. This concludes the proof that the last sum in (37) is maximizedbyN (x )= t−1 .Now,(37)mayberewritten APPENDIX t−1 t ⌊|X|⌋ as: A. Completion of the proof of Theorem 1 for the log loss This appendix completes the proof of Theorem 1 from n n h 1 1 E[Lˆ ] L∗ |X| t− + ,∆. (43) Section V-C, by showing that for the case of the log loss, n − n ≤ t=1 t t=1 ⌊t|−X1|⌋+ht the Kolmogorov criterion holds and therefore the normalized X X regret converges almost surely to the normalized expected The last bound is only a function of h and not of the regret. Our purpose is to upperbound the following variance: t sequence x. If h grows sublinearly, then{th}e dominantfactor in the second sutm will be t−1 and the sum would grow σt2 =Var(γt)=Var(l(ˆbt,xt))=Var log Pt(u)(xt) , like O(log(n)). On the other⌊h|Xan|d⌋, any growth rate of h that h (cid:16) (cid:17)i(46) sreagtirsefitetsenn1diPngnt=to1 h0tt, an−n→→d∞a0suwffiocuiledntycieolnddnitoiormnaislizhettd→exp0te,catnedd Pant(du)s(hxotw) isthdaetfiKnoeldmiong(o7r)ova’ss criterion P∞t=1 σt2t2 < ∞ holds. particularly this holds for ht =O(tα),α∈[0,1). P(u)(x)= Nt−1(x)+ht·ut(x) . (47) If ht is constant, then the first sum grows like O(log(n)) t t−1+ht· x′∈X ut(x′) P 10 We have: where we used again the fact that log2(u) is increasing for log Pt(u)(xt) =log(Nt−1(xt)+ht·ut(xt)) u≥1.Combining(48)with(49)andtheboundsaboveyields: (cid:16) (cid:17) log(t 1+h u (x′)) σt2 =Var log Pt(u)(xt) t t − − · xX′∈X 4+h2log(cid:16)2 t−1 +(cid:17)i1 +4+2log2 t−1 + =log Nt−1(xt) +ut(xt) ≤ (cid:18) ht (cid:19) (cid:18) ht |X|(cid:19) (cid:18) ht (cid:19) (48) 8+4log2 t−1 + ≤ h |X| A (cid:18) t (cid:19) (53) | log t−{z1 + u}(x′) . − ht xX′∈X t ! αUnde(r0,t1h)eaansdsusmopσt2io=nsOo(floTgh2e(to)r)emand1,clheatrl=y Kho1lm·otgαorwovit’hs B cri∈terion holds. t (cid:3) Using | {z } Var(A B) E (A B)2 2E A2 +2E B2 , (49) REFERENCES − ≤ − ≤ [1] N.Cesa-BianchiandG.Lugosi,Prediction,learningandgames. Cam- where the second i(cid:2)nequality s(cid:3)tems fro(cid:2)m (cid:3)(a b)(cid:2)2 =(cid:3)2a2 + bridgeUniversity Press,2006. − 2b2 (a+b)2 2a2+2b2,itisenoughtoboundtheexpected [2] N. Merhav and M. Feder, “Universal schemes for sequential decision squa−redvalueo≤feachlogin(48)separately.Forauniformr.v. from individual data sequences,” IEEE Trans. Information Theory, vol.39,no.4,pp.1280–1292,Jul.1993. U U[0,1] and a constanta 0, the followingboundholds: [3] ——,“Universalprediction,” IEEETrans.InformationTheory,vol.44, ∼ ≥ no.6,pp.2124–2147,Oct.1998. 1 E log2(a+U) = log2(a+u)du [4] T.M.CoverandJ.A.Thomas,ElementsofInformationTheory. John Wiley&sons,1991. Z0 [5] M.Feder, “Gambling usingafinite state machine,” IEEETrans.Infor- (cid:2) (cid:3) a+1 = log2(u)du mationTheory,vol.37,no.5,pp.1459–1465,sep1991. [6] R. E. Krichevsky and V. K. Trofimov, “The performance of universal Za encoding,”IEEETrans.InformationTheory,vol.27,no.2,pp.199–207, 1 a+1 log2(u)du+ log2(u)du Mar.1981. ≤ [7] J. Hannan, “Approximation to bayes risk in repeated play,” Princeton Z0 Zmax(a,1) University Press,vol. Contributions tothe TheoryofGames, III,Ann. 1 a+1 Math.StudyNumber39,pp.97–139,1957. log2(u)du+ log2(u)du [8] M. Feder, N. Merhav, and M. Gutman, “Universal prediction of indi- ≤ Z0 Zmax(a,1) vidualsequences,”IEEETrans.InformationTheory,vol.38,no.4,Jul. ulog2(u) 2ulog(u)+u 1+log2(a+1) 1992. ≤ − 0 [9] T.M.Cover,“Behaviorofsequentialpredictorsofbinarysequences,”in =(cid:2)2+log2(a+1). (cid:3) iRnaPndroocm.4PthroPcerassgeuse,C19o6n5.,Inpfpo.rm26.3T–h2e7o2r.y,StatisticalDecisionFunctions, (50) [10] A. T. Kalai and S. Vempala, “Efficient algorithms for online decision problems.” Journal of Computer and System Sciences, vol. 71, no. 3, We used the fact that log2(u) is increasing for u 1. Notice pp.291–307, Oct.2005. ≥ that the bound is trivial for a 1 because U 1. Applying [11] R.Weatherford, Philosophicalfoundations ofprobabilitytheory. Lon- ≥ ≤ don:Routledge &KeganPaul,1982. the bound to the squared elements in (48): [12] Z.ReznikovaandB.Ryabko,“Antsandbits,”IEEEInformationTheory N (x ) N (x ) Society Newsletter, vol.62,no.5,pp.17–20,2012. E log2 t−1 t +ut(xt) 2+log2 t−1 t +1 [13] Y. Lomnitz, “Universal communication over unknown channels,” (cid:20) (cid:18) ht (cid:19)(cid:21)≤ (cid:18) ht (cid:19) Ph.D. dissertation, Tel Aviv University, Aug. 2012, available online t 1 http://www.eng.tau.ac.il/∼yuvall/publications/YuvalL Phd report.pdf. 2+log2 − +1 . [14] I. Csisza´r, “The method of types,” IEEE Trans. Information Theory, ≤ (cid:18) ht (cid:19) vol.44,no.6,pp.2505–2523, Oct.1998. (51) [15] W. Feller, An Introduction to Probability Theory and Its Applications. NewYork:Wiley,1971. For the second term, the expectation is first applied only to one arbitrary element u (x), while conditioning on the other t elements: t 1 E log2 − + u (x′) t h " t x′∈X !# X t 1 =E E log2 − + u (x′) u (x′),x′ =x t t ( " ht x′∈X !| 6 #) X (50) t 1 E 2+log2 − + u (x′)+1 t ≤   h   t xX′6=x  t 1  2+log2 − + ,  ≤ h |X| (cid:18) t (cid:19) (52)

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.