Moment-Matching Polynomials

Adam Klivans (UT-Austin) and Raghu Meka (DIMACS and IAS)

arXiv:1301.0820v1 [cs.CC] 4 Jan 2013

Abstract

We give a new framework for proving the existence of low-degree polynomial approximators for Boolean functions with respect to broad classes of non-product distributions. Our proofs use techniques related to the classical moment problem and deviate significantly from known Fourier-based methods, which require the underlying distribution to have some product structure.

Our main application is the first polynomial-time algorithm for agnostically learning any function of a constant number of halfspaces with respect to any log-concave distribution (for any constant accuracy parameter). This result was not known even for the case of learning the intersection of two halfspaces without noise. Additionally, we show that in the smoothed-analysis setting, the above results hold with respect to distributions that have sub-exponential tails, a property satisfied by many natural and well-studied distributions in machine learning.

Given that our algorithms can be implemented using Support Vector Machines (SVMs) with a polynomial kernel, these results give a rigorous theoretical explanation as to why many kernel methods work so well in practice.

1 Introduction: Beyond Worst-Case Learning Models

Learning halfspaces is one of the core algorithmic tasks in machine learning and can be solved in the noiseless (PAC) model via efficient algorithms for linear programming. The two simplest generalizations of this problem, namely 1) learning the intersection of two halfspaces and 2) learning a noisy halfspace (i.e., agnostic learning), have attracted the attention of many researchers in theoretical computer science and statistics. Surprisingly, they both remain challenging open problems.

In the context of computational complexity, there are many hardness results for learning halfspace-related concept classes with respect to arbitrary distributions, and the literature is too vast for us to survey here. A brief summary might be that strong NP-hardness results are known for proper learning, where the learner must output a hypothesis that is of the same form (or close to the same form) as the unknown concept class [FGRW, GR, DOSW, KS1], and that there are cryptographic hardness results even for improper learning, where the learner is allowed to output any polynomial-time computable hypothesis [FGKP, KS2]. These hardness results apply to many easy-to-state problems, including the two simple generalizations of learning halfspaces enumerated above.

There is a disconnect, however, between the many discouraging hardness results for learning convex sets and the success in practice of popular machine-learning tools for solving just these types of problems (e.g., Support Vector Machines). A reasonable question might be "Why do kernel methods, algorithms that at their core learn a noisy halfspace, work so well in practice?" The allusion here to Spielman and Teng's work on Smoothed Analysis [ST] is on purpose: supervised learning seems perfectly suited to an average-case analysis in terms of the underlying distribution on examples.

Indeed, the main positive result of this paper is a smoothed analysis of learning functions of halfspaces: we show that, in the smoothed-analysis setting, functions of halfspaces are agnostically learnable with respect to any distribution that obeys a sub-exponential tail bound (so-called sub-exponential densities) for any constant error parameter. These distributions include all log-concave distributions and need not be product or unimodal.
Previous work (which we detail in the next section) required the underlying distribution to be Gaussian or uniform over the hypercube.

We leave open the possibility that functions of halfspaces are agnostically learnable with respect to all distributions in the smoothed-analysis model (i.e., all distributions that have been subject to a small Gaussian perturbation). We certainly are not aware of any, say, cryptographic hardness results for this setting.

1.1 Introduction: Previous Work on Distribution-Specific Learning

Many researchers have studied the complexity of learning convex sets with respect to fixed marginal distributions. Along these lines, Blum and Kannan [BK1] gave the first polynomial-time algorithm for learning intersections of m = O(1) halfspaces with respect to Gaussian distributions on R^n. Their algorithm runs in time n^{O(2^m)} (for any constant accuracy parameter). Vempala [Vem2] improved on this work and gave a randomized algorithm for learning intersections of centered halfspaces with respect to any log-concave distribution on R^n in time roughly (n/ε)^{O(m)} ("centered" means that each bounding hyperplane passes through the mean of the distribution). In a beautiful follow-up paper, Vempala [Vem1] used PCA to give an algorithm for learning the intersection of m halfspaces with respect to any Gaussian distribution in time poly(n) · (m/ε)^{O(m)}. We note that these results hold in the PAC model, and it is not clear whether they succeed in the agnostic setting.

In the agnostic model, we are only aware of results that use the polynomial regression algorithm of Kalai et al. [KKMS]. Klivans et al. [KOS1] (combined with the observations in Kalai et al.) gave an algorithm for learning any function of m halfspaces in time n^{O(m^2/ε^2)} with respect to the uniform distribution on {-1,1}^n. Applying results on Gaussian surface area, Klivans et al. [KOS2] gave an algorithm for agnostically learning intersections of m halfspaces in time n^{polylog(m)/ε^2} with respect to any Gaussian distribution.

A major goal in this area has been to move beyond Gaussians and tackle the case when the underlying distribution is log-concave, as log-concave densities are a broad and widely studied class of distributions. The Gaussian density is log-concave, and, in fact, any uniform distribution over a convex set is log-concave.

Kalai et al. [KKMS] give an algorithm for agnostically learning a single halfspace with respect to any log-concave distribution in time n^{f(ε)} for some function f. The best known bound for f is currently 2^{O(1/ε^2)} (this follows from Section 5 of Lubinsky [Lub]). It is unclear how to extend the Kalai et al. analysis to work for the intersection of two halfspaces. To summarize, it was not known how to learn the intersection of two halfspaces with respect to log-concave distributions even in the noiseless (PAC) model.

1.2 Statement of Results

Here we give the first polynomial-time algorithm for agnostically learning intersections (or even arbitrary functions) of a constant number of halfspaces with respect to any log-concave distribution on R^n (see Figure 1 for the precise parameters):

Theorem 1.1. Functions of m halfspaces are agnostically learnable with respect to any log-concave distribution on R^n in time n^{O_{m,ε}(1)}, where ε is the accuracy parameter.

Admittedly, our dependence on the number of halfspaces m and the error parameter ε is not great, but we stress that no polynomial-time algorithm was known even for the intersection of two halfspaces. See Figure 1 for a summary of previous work.

We remark that Daniel Kane, in a forthcoming paper, has independently obtained Theorem 1.1 using a set of completely different techniques [Kan1].
His dependence on m and 1/ε, though still exponential, is superior to ours.

  Concept class            | Distribution    | Running time                             | Model    | Source
  Intersections            | Gaussian        | poly(n) · (m/ε)^m                        | PAC      | [Vem1]
  Intersections            | Gaussian        | n^{polylog(m)/ε^{O(1)}}                  | Agnostic | [KOS2]
  Intersections (centered) | Log-concave     | (n/ε)^m                                  | PAC      | [Vem2]
  One halfspace            | Log-concave     | n^{f(ε)}                                 | Agnostic | [KKMS]
  Arbitrary                | Log-concave     | n^{exp((log(1/ε))^{Õ(m)}/ε^4)}           | Agnostic | This work (moment-matching)
  Arbitrary                | Sub-exponential | n^{exp((log(log m/σε))^{Õ(m)}/σ^4 ε^4)}  | Agnostic | This work (σ-smoothed)
  Arbitrary                | Sub-gaussian    | n^{(log(log m/σε))^{Õ(m)}/σ^4 ε^4}       | Agnostic | This work (σ-smoothed)

Figure 1: Summary of recent work on learning intersections and arbitrary functions of m halfspaces.

We extend the above result, in the smoothed-analysis setting, to hold with respect to arbitrary distributions with sub-exponential tail bounds. We first define the model of smoothed complexity that we consider.

Definition 1.2. Given a distribution D on R^n and a parameter σ ∈ (0,1), let D(σ) be a perturbed distribution of D obtained by independently picking X ← D, Z ← N(0,Σ) and outputting X + Z, where Σ ⪰ σ · cov(X) (here ⪰ denotes the semi-definite ordering).

That is, D(σ) is obtained by adding Gaussian noise to D, and quantitatively we want the variance of the noise in any direction to be comparable to (at least σ^2 times) the variance of D in the same direction. For instance, for D isotropic, perturbations by N(0,σ)^n would suffice. The latter corresponds more directly to the traditional smoothed-complexity setup, but we use the above definition as it is basis independent and allows for non-spherical Gaussian perturbations.
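As a small illustration of Definition 1.2 (this sketch and its parameter choices are ours, not the paper's), one way to draw from D(σ) given samples of D is to add Gaussian noise whose covariance is the smallest the definition allows, Σ = σ · cov(X):

import numpy as np

def smoothed_sample(X, sigma, rng=None):
    # Illustrative sketch of Definition 1.2: perturb each sample point by
    # independent Gaussian noise Z ~ N(0, Sigma) with Sigma = sigma * cov(X),
    # the smallest covariance the definition allows (any Sigma >= sigma * cov(X) works).
    rng = np.random.default_rng() if rng is None else rng
    Sigma = sigma * np.cov(X, rowvar=False)
    Z = rng.multivariate_normal(np.zeros(X.shape[1]), Sigma, size=X.shape[0])
    return X + Z

# toy usage: perturb an anisotropic Gaussian (hence log-concave) sample on R^3
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0, 0.0], np.diag([1.0, 4.0, 0.25]), size=10000)
X_sigma = smoothed_sample(X, sigma=0.1, rng=rng)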
We define the smoothed complexity of (agnostically) learning a concept class C under a distribution D to be the complexity of (agnostically) learning C under the perturbed distributions D(σ). This model first appears in the work of Blum and Dunagan [BD] (for the special case of spherical Gaussian perturbations), and we believe it to be a natural and practical extension of the traditional models of learning. For instance, the main motivating principle behind smoothed analysis, that real data involves measurement error, is very much applicable here. Besides the work of Blum and Dunagan, there seems to be little known about learning in this model.

We say a distribution is sub-exponential (sub-gaussian) if every marginal (i.e., one-dimensional projection) of the distribution obeys a tail bound of the form e^{-|z|} (respectively, e^{-|z|^2}). It is known that all log-concave distributions are sub-exponential. Sub-exponential and sub-gaussian densities are commonly studied in machine learning and statistics and model various real-world situations (see [BK2] for instance). We show that for these types of distributions, our learning algorithms have polynomial smoothed complexity (for constant σ):

Theorem 1.3. Functions of m halfspaces are agnostically learnable with respect to any sub-exponential distribution on R^n in time n^{O_{m,ε,σ}(1)}, where ε is the accuracy parameter and σ is the perturbation parameter.

We obtain much better parameters (in the constant hidden in O_{m,ε,σ}(1)) for the special case of sub-gaussian densities (see Theorem 4.2).

Blum and Dunagan were the first to study the smoothed complexity of learning halfspaces. They showed that for a single halfspace in the noiseless (in labels) setting, the perceptron algorithm converges quickly with high probability for examples perturbed by Gaussian noise. Their expected running time, however, is infinite (and thus, strictly speaking, does not give bounds on the smoothed complexity of the perceptron algorithm).

To obtain our smoothed-analysis results, we prove that Gaussian perturbations provide enough anticoncentration for our polynomial approximation methods to work. We believe this connection will find additional applications related to the smoothed complexity of learning Boolean functions.

1.3 Overview of Conceptual and Technical Contributions

In their seminal paper, Linial et al. [LMN] introduced the polynomial approximation approach for learning Boolean functions. The core of their approach is to solve the following optimization problem: given a Boolean function f, minimize, over all polynomials p of degree at most d, the quantity E_{x ∈ {-1,1}^n}[(f - p)^2]. The algorithm is given uniformly random samples of the form (x, f(x)). Their "low-degree" algorithm approximately solves this optimization problem in time roughly n^{O(d)}. Later, the "sparse" algorithm of Kushilevitz and Mansour [KM2] solved the same optimization problem, but where the minimization is over all sparse polynomials and the algorithm is allowed query access to the function f. These algorithms were developed in the context of PAC learning.

Kalai et al. [KKMS] subsequently observed that in order to succeed in the agnostic framework of learning (we formally define agnostic learning in Section 2.1, but for now agnostic learning can be thought of as a model of PAC learning with adversarial noise), it suffices to approximately minimize E_{x ∈ {-1,1}^n}[|f - p|]. That is, minimizing with respect to the 1-norm rather than the 2-norm results in highly noise-tolerant learning algorithms. Finding efficient algorithms for directly minimizing the above expectation with respect to the 1-norm ("ℓ_1 minimization"), however, is more challenging than in the ℓ_2 case. The work of Kalai et al. [KKMS] gives the analogue of the "low-degree" algorithm for ℓ_1 minimization (in fact, their algorithm can be carried out using a Support Vector Machine with the polynomial kernel), and the work of Gopalan et al. [GKK] gives the analogue of the "sparse" algorithm for ℓ_1 minimization.

Although we have efficient algorithms that directly carry out ℓ_1 minimization for low-degree polynomials, proving the existence of good low-degree ℓ_1 approximators has required first finding a good low-degree ℓ_2 approximator (i.e., a Fourier polynomial) and then applying the simple fact that E[|p|] ≤ sqrt(E[p^2]). Directly analyzing the error of low-degree ℓ_1 approximators seems quite difficult. In our setting, for example, it is not even clear that the best low-degree ℓ_1 polynomial approximator is unique!

The main conceptual contribution of our methods is to provide the first framework for directly proving the existence of low-degree ℓ_1 approximating polynomials for Boolean functions (in fact, we also obtain sandwiching polynomials). One benefit of our approach is that we do not require the underlying distribution to be product (essentially all of the techniques involving the discrete Fourier polynomial require some sort of product structure). As such, in this work, we are able to reason about approximating Boolean functions with respect to interesting non-product distributions, such as log-concave densities.

In the following descriptions, we assume we are trying to show polynomial approximations for f : R^n → {0,1}, where f = g(h_1(x),...,h_m(x)), g : {0,1}^m → {0,1} is an arbitrary Boolean function, and h_1,...,h_m : R^n → {0,1} are halfspaces.

1.4 A "Moment-Matching" Proof

Our method uses ideas from probability theory and linear programming to give a framework for proving the existence of sandwiching polynomials (it is easy to see that sandwiching polynomials are stronger than ℓ_1 approximators).
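The parenthetical deserves one line (this check is ours, stated with the notation of Lemma 3.3 below): if P_ℓ(x) ≤ f(x) ≤ P_u(x) for every x in the support of D and E[P_u(X)] - E[f(X)] ≤ ε, then

    E_{X←D}[|f(X) - P_u(X)|] = E[P_u(X) - f(X)] ≤ ε,

so the upper sandwiching polynomial is itself an ℓ_1 approximator of the same degree (and likewise for P_ℓ).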
The main technical contribution is to show how to use a set of powerful theorems from the study of the classical moment problem to apply our framework to functions of halfspaces. At a high level, our approach makes crucial use of the following consequence of strong duality for semi-infinite linear programs: let D be a distribution and let D_k be any distribution whose moments of order at most k match those of D. If E_{D_k}[f] is "close" to E_D[f], then f has low-degree sandwiching polynomials with respect to D. The question then becomes how to analyze the bias of a Boolean function when only the low-order moments of a distribution have been specified. We show how to use several deep results from probability to answer this question in Sections 3.2 and 3.3.

We show that the moment-matching approach also has some interesting applications for learning with respect to distributions on the discrete cube {-1,+1}^n.

2 Preliminaries

2.1 Agnostic Learning

We recall the model of agnostically learning a concept class C [Hau], [KSS]. In this scenario there is an unknown distribution D over R^n × {-1,1} with marginal distribution over R^n denoted D_X. Let opt = inf_{f ∈ C} Pr_{(x,y)∼D}[f(x) ≠ y]; i.e., opt is the minimum error of any function from C in predicting the labels y. The learner must output a hypothesis whose error is within ε of opt:

Definition 2.1. Let D be an arbitrary distribution on R^n × {-1,1} whose marginal over R^n is D_X, and let C be a class of Boolean functions f : R^n → {-1,1}. We say that algorithm B is an agnostic learning algorithm for C with respect to D if the following holds: for any D as described above, if B is given access to a set of labeled examples (x,y) drawn from D, then with probability at least 1 - δ, algorithm B outputs a hypothesis h : R^n → {-1,1} such that Pr_{(x,y)∼D}[h(x) ≠ y] ≤ opt + ε.

Note that PAC learning is a special case of agnostic learning (the case when opt = 0).

The "L_1 Polynomial Regression Algorithm" due to Kalai et al. [KKMS] shows that one can agnostically learn any concept class that can be approximated by low-degree polynomials (in Kalai et al. [KKMS] it is shown how to implement this algorithm using a standard SVM with the polynomial kernel):

Theorem 2.2 ([KKMS]). Fix D on X × R and let f ∈ C. Assume there exists a polynomial p of degree d such that E_{x∼D_X}[|f(x) - p(x)|] < ε, where D_X is the marginal distribution on X. Then, with probability 1 - δ, the L_1 Polynomial Regression Algorithm outputs a hypothesis h such that Pr_{(x,y)∼D}[h(x) ≠ y] ≤ opt + ε in time poly(n^d/ε, log(1/δ)).

Throughout, we suppress the poly(log(1/δ)) dependence on δ.
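To make Theorem 2.2 concrete, here is a minimal sketch (ours, not the KKMS implementation; all names are hypothetical) of degree-d ℓ_1 polynomial regression: expand each example into the monomials x^(I) with I ∈ I(d,n), minimize the empirical ℓ_1 error by a linear program, and threshold the fitted polynomial. KKMS note the same step can be carried out with an SVM and the polynomial kernel; thresholding at 0 below is a simplification (they pick the threshold minimizing empirical error).

import itertools
import numpy as np
from scipy.optimize import linprog

def monomial_features(X, d):
    # all monomials x^(I) with I in I(d, n); exhaustive, so only for small n and d
    n = X.shape[1]
    exps = [I for I in itertools.product(range(d + 1), repeat=n) if sum(I) <= d]
    return np.column_stack([np.prod(X ** np.array(I), axis=1) for I in exps])

def l1_poly_regression(X, y, d):
    # minimize sum_i |<a, phi(x_i)> - y_i| over coefficient vectors a,
    # using an LP with slack variables u_i >= |<a, phi(x_i)> - y_i|
    y = np.asarray(y, dtype=float)
    Phi = monomial_features(X, d)
    s, D = Phi.shape
    c = np.concatenate([np.zeros(D), np.ones(s)])
    A_ub = np.block([[Phi, -np.eye(s)], [-Phi, -np.eye(s)]])
    b_ub = np.concatenate([y, -y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (D + s))
    a = res.x[:D]
    hypothesis = lambda Z: np.sign(monomial_features(Z, d) @ a)  # h(x) = sign(p(x))
    return a, hypothesis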
2.2 Probability

For a random variable X ∈ R^m, let ϕ_X denote its characteristic function, defined by ϕ_X(t) = E[exp(-i⟨t,X⟩)] for t ∈ R^m, where i = √-1.

We shall use the following standard distance measures between random variables X, Y ∈ R^m.

• The λ-metric:
    d_λ(X,Y) = min_{T>0} max{ max_{‖t‖≤T} |ϕ_X(t) - ϕ_Y(t)|, 1/T }.

• The Levy distance: for 1 denoting the all-1's vector,
    d_LV(X,Y) = inf{ ε > 0 : ∀ t ∈ R^m, Pr[X < t - ε1] - ε < Pr[Y < t] < Pr[X < t + ε1] + ε }.

• The Kolmogorov-Smirnov or cdf distance:
    d_cdf(X,Y) = sup_{t ∈ R^m} |Pr[X ≥ t] - Pr[Y ≥ t]|.

For I = (i_1,...,i_n) ∈ Z^n and x ∈ R^n, let x^(I) = ∏_{j=1}^n x_j^{i_j}. For k > 0, let I(k,n) = { I = (i_1,...,i_n) ∈ Z^n : ∑_{j=1}^n i_j ≤ k, i_j ≥ 0 }.

We say that a class of functions C is ε-approximated in ℓ_1 by polynomials of degree d under a distribution D if for every f ∈ C there exists a degree-d polynomial p such that E_{x∼D}[|p(x) - f(x)|] ≤ ε.

We use the following properties of log-concave distributions (equivalent formulations can be found in Lovász-Vempala [LV]).

Theorem 2.3 ([CW]). Let a random variable X ∈ R^n be drawn from a log-concave distribution. Then, for every w ∈ R^n and r > 0, E[|⟨w,X⟩|^r] ≤ r^r · E[⟨w,X⟩^2]^{r/2}.

Theorem 2.4 ([CW]). There exists a universal constant C such that the following holds. For any real-valued log-concave random variable X with E[X^2] = 1 and all t ∈ R, ε > 0, Pr[X ∈ [t, t+ε]] < Cε.

We also use the following simple lemmas. The first helps us convert closeness in Levy distance to closeness in cdf distance, while the second helps us go from fooling intersections of halfspaces to fooling arbitrary functions of halfspaces.

Fact 2.5. Let X = (X_1,...,X_m) ∈ R^m be a random variable such that for every r ∈ [m], t ∈ R, and ε > 0, Pr[X_r ∈ [t, t+ε]] < β · ε for a fixed β > 0. Then, for any random variable Y, d_cdf(X,Y) ≤ m · β · d_LV(X,Y).

Lemma 2.6. Let X, Y ∈ R^m be random variables such that for every a_1,...,a_m ∈ {1,-1}, d_cdf((a_1 X_1, a_2 X_2,...,a_m X_m), (a_1 Y_1, a_2 Y_2,...,a_m Y_m)) ≤ ε. Then, for any function g : {1,-1}^m → {1,-1} and thresholds θ_1,...,θ_m,
    |E[g(sign(X_1 - θ_1),...,sign(X_m - θ_m))] - E[g(sign(Y_1 - θ_1),...,sign(Y_m - θ_m))]| ≤ 2^m ε.

Proof. Fix θ_1,...,θ_m, let X′ = (sign(X_1 - θ_1),...,sign(X_m - θ_m)), and define Y′ similarly. Then, from the assumptions of the lemma, for every a ∈ {1,-1}^m,
    |Pr[X′ = a] - Pr[Y′ = a]| < d_cdf((a_1 X_1, a_2 X_2,...,a_m X_m), (a_1 Y_1, a_2 Y_2,...,a_m Y_m)) < ε.
Therefore, d_TV(X′,Y′) < 2^{m-1} ε. The lemma now follows.

3 Moment-Matching Polynomials

We develop a theory of "moment-matching polynomials" for showing the existence of good approximating polynomials. Our main result is the following.

Theorem 3.1. Let D be a log-concave distribution over R^n. Let h_1,...,h_m : R^n → {1,-1} be halfspaces and let g : {1,-1}^m → {1,-1} be an arbitrary function. Define f : R^n → {1,-1} by f(x) = g(h_1(x),...,h_m(x)). Then, there exists a real-valued polynomial P of degree at most k = exp((log((log m)/ε))^{O(m)}/ε^4) such that E_{X←D}[|f(X) - P(X)|] ≤ ε.

Theorem 1.1, with the runtime given in Figure 1, follows from the above result and Theorem 2.2. The theorem is proved in Section 3.3. We start by describing the two basic ingredients: LP duality and the classical moment problem.

3.1 LP Duality

It is now well known in the pseudorandomness literature that, with respect to the uniform distribution over {-1,1}^n, a concept class C has degree-k sandwiching polynomials if and only if C is fooled by k-wise independent distributions [Baz]. The proof of this fact follows from LP duality, where feasible solutions to the primal are k-wise independent distributions and feasible dual solutions are approximating polynomials.

In our setting, we consider continuous distributions over R^n that are not necessarily product. As such, this equivalence is more subtle. In fact, it is not even clear how to define k-wise independence for non-product distributions (such as log-concave densities). Still, given a distribution D we can write a semi-infinite linear program (a program with infinitely many variables but finitely many constraints) whose feasible solutions are distributions that match all of D's moments up to degree k (in the case where D is uniform over {-1,1}^n, matching all moments is equivalent to being k-wise independent).

For I ∈ I(k,n), let σ_I = E_{X←D}[X^(I)]. Let f ∈ C. We write the primal program as follows:

    sup_µ  ∫_{R^n} f(x) µ(x) dx
    s.t.   ∫_{R^n} x^(I) µ(x) dx = σ_I,  ∀ I ∈ I(k,n),        (3.1)
           ∫_{R^n} µ(x) dx = 1.

The supremum is over all probability measures µ on R^n.
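The data of program (3.1) are just the moments σ_I. As a concrete aside (ours, not from the paper), they can be estimated from samples of D using the multi-index notation of Section 2.2:

import itertools
import numpy as np

def moments_up_to_degree(X, k):
    # empirical moments sigma_I = E[X^(I)] for all I in I(k, n),
    # where X is an s x n array of samples from D; returns {I: sigma_I}
    n = X.shape[1]
    sigma = {}
    for I in itertools.product(range(k + 1), repeat=n):
        if sum(I) <= k:
            sigma[I] = float(np.mean(np.prod(X ** np.array(I), axis=1)))
    return sigma

# e.g. all moments of degree at most 2 of a 3-dimensional Gaussian sample
rng = np.random.default_rng(1)
sigma = moments_up_to_degree(rng.standard_normal((5000, 3)), k=2)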
As in the finite-dimensional case, feasible solutions to the dual program correspond to degree-k approximating polynomials. The dual can be written as

    inf_{a ∈ R^{I(k,n)}}  ∑_{I ∈ I(k,n)} a_I σ_I              (3.2)
    s.t.   ∑_I a_I x^(I) ≥ f(x),  ∀ x ∈ R^n.                  (3.3)

The issue here is that, in general, strong duality does not hold for semi-infinite linear programs. In our case, however, where the σ_I's are obtained as moments of a distribution D (as opposed to just arbitrary reals), it turns out that strong duality does hold. To see this, we note that the above primal LP is a special case of the so-called generalized moment problem LP, a classical problem from probability and analysis that asks whether there exists a multivariate distribution with moments specified by the σ_I's. In our case, feasibility is immediate, as the σ_I's are obtained from D.

As for strong duality, it is known that if the σ_I's are in the interior of a particular set (the details are not relevant here), then the optimal value of the primal equals the optimal value of the dual. In the case that the σ_I's do not satisfy this condition, strong duality holds provided we relax the dual program constraints to some subset Ω ⊆ R^n. One concern is that we will now obtain an optimal approximating polynomial with respect to some distribution D′ defined on Ω (as opposed to the original D). But it is also known that in this case all feasible distributions are supported on Ω. As such, approximation with respect to D′ is equivalent to approximation with respect to D. We refer the reader to Bertsimas and Popescu [BP] (Section 2) for more details and references.

We start with an important definition.

Definition 3.2. Given two distributions D, D′ on R^n and k ≥ 0, we say D′ k-moment-matches D if for all I ∈ I(k,n), E_{X←D}[X^(I)] = E_{X←D′}[X^(I)].

We can now prove the main lemma of this section:

Lemma 3.3. Let f : R^n → {0,1} and let D be a distribution over R^n with all moments finite such that the following holds: for every distribution D′ that k-moment-matches D, |E_{X←D}[f(X)] - E_{X←D′}[f(X)]| < ε. Then, there exist polynomials P_ℓ, P_u : R^n → R of degree at most k such that

• For every x ∈ Support(D), P_ℓ(x) ≤ f(x) ≤ P_u(x).
• For X ← D, E[P_u(X)] - E[f(X)] ≤ ε and E[f(X)] - E[P_ℓ(X)] ≤ ε.

Proof. Let opt* be the value of the primal program, Equation (3.1). Then, by hypothesis, opt* < γ + ε, where γ = E_{X←D}[f(X)].

Now, from the above discussion, strong duality (almost) holds for the programs in Equations (3.1) and (3.2), and we conclude that there exists a dual solution a ∈ R^{I(k,n)} with value exactly opt* that satisfies the inequality constraints for all x ∈ Support(D). Define

    P_u(x_1,...,x_n) = ∑_{I ∈ I(k,n)} a_I x^(I).

Then P_u(·) is a polynomial of degree at most k, and P_u(x) ≥ f(x) for every x ∈ Support(D). Further, the assumption in the lemma implies E_{X←D}[P_u(X)] = ∑_{I ∈ I(k,n)} a_I σ_I = opt* < γ + ε. We obtain the existence of the lower sandwiching polynomial P_ℓ similarly.
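For intuition only, here is a rough finite relaxation (ours; the paper uses the dual purely as an existence tool, and the true program (3.2)-(3.3) has one constraint per point of R^n): enforce ∑_I a_I x^(I) ≥ f(x) only on a finite sample standing in for Support(D), minimize ∑_I a_I σ_I, and read off a candidate upper sandwiching polynomial P_u. The box bound on the coefficients is an arbitrary safeguard to keep the relaxed LP bounded.

import itertools
import numpy as np
from scipy.optimize import linprog

def upper_sandwich_coeffs(points, fvals, sigma, k, box=100.0):
    # heuristic finite relaxation of the dual (3.2)-(3.3):
    #   minimize sum_I a_I * sigma_I  s.t.  sum_I a_I x^(I) >= f(x) at each sample x
    # points: s x n array standing in for Support(D); fvals: f(x) in {0,1} at those points
    # sigma: dict {multi-index I: sigma_I} of the moments of D (cf. Section 3.1)
    n = points.shape[1]
    idx = [I for I in itertools.product(range(k + 1), repeat=n) if sum(I) <= k]
    Phi = np.column_stack([np.prod(points ** np.array(I), axis=1) for I in idx])
    c = np.array([sigma[I] for I in idx])
    res = linprog(c, A_ub=-Phi, b_ub=-np.asarray(fvals, dtype=float),
                  bounds=[(-box, box)] * len(idx))  # linprog handles only <=, so flip signs
    return dict(zip(idx, res.x)) if res.success else None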
3.2 The Classical Moment Problem

In the previous section, we reduced the problem of constructing low-degree sandwiching polynomial approximators with respect to D to understanding the optimal value of a semi-infinite linear program. The feasible solutions of the linear program correspond to all distributions that k-moment-match D. As such, for any k-moment-matching distribution D′ we need to bound |E_D[f] - E_{D′}[f]|. In this section, we give some technical results that help us bound this difference provided the moments of D do not grow too fast.

We begin with the following result, which shows that multivariate distributions whose marginals have matching lower-order moments have close characteristic functions (as quantified by the λ-metric), provided the moments are well behaved.

Theorem 3.4 (Theorem 2, Page 171, [KR]). Let X, Y ∈ R^m be two random variables such that for any t ∈ R^m, the real-valued random variables ⟨t,X⟩ and ⟨t,Y⟩ have identical first 2k moments. Then, for a universal constant C,

    d_λ(X,Y) ≤ C β_k^{-1/4} (1 + µ_2(X)^{1/2}),

where µ_j(X) = sup{ E[|⟨t,X⟩|^j] : t ∈ R^m, ‖t‖ ≤ 1 } and β_k = β_k(X) = ∑_{j=1}^k 1/µ_{2j}(X)^{1/2j}.

We now need to convert the above bound on closeness of characteristic functions to more direct measures of closeness such as the Levy or Kolmogorov-Smirnov metrics. Such inequalities play an important role in Fourier-theoretic proofs of limit theorems (e.g., Esseen's inequality; cf. Chapter XVI of [Fel]), and here we use the following multi-dimensional version due to Gabovich [Gab].

Theorem 3.5 ([Gab], Equation (8)). Let X, Y ∈ R^m be two vector-valued random variables. Then, for a universal constant C and all sufficiently large N, T > 0,

    d_LV(X,Y) ≤ C ∫_{1/(NT) ≤ t_1,...,t_m ≤ T} |ϕ_X(t_1,...,t_m) - ϕ_Y(t_1,...,t_m)| / (t_1 t_2 ··· t_m) dt_1 ··· dt_m
                + C (log T)(log(NT)) / T + Pr[X ∉ [-N,N]^m] + Pr[Y ∉ [-N,N]^m].

The above theorem leads to the following concrete relation between d_λ and d_LV.

Lemma 3.6. Let X, Y be two vector-valued random variables with d_λ(X,Y) ≤ δ. Let N(ε) ∈ R be such that Pr[X ∉ [-N(ε),N(ε)]^m], Pr[Y ∉ [-N(ε),N(ε)]^m] ≤ δ. Then,

    d_LV(X,Y) ≤ O((log N(δ) + 2 log(1/δ))^m · δ).

Proof. Without loss of generality suppose that δ < 1/m^2, as otherwise the statement is trivial. Let T* be the value of T that attains the minimum in the definition of d_λ:

    d_λ(X,Y) = max{ max_{‖t‖≤T*} |ϕ_X(t) - ϕ_Y(t)|, 1/T* }.

As d_λ(X,Y) ≤ δ, we have T* ≥ 1/δ. Therefore, for every t ∈ R^m with ‖t‖ ≤ 1/δ, |ϕ_X(t) - ϕ_Y(t)| ≤ δ. Thus, applying Theorem 3.5 with N = N(δ) and T = 1/(δ√m), we get

    d_LV(X,Y) ≤ C ∫_{1/(NT) ≤ t_1,...,t_m ≤ T} |ϕ_X(t) - ϕ_Y(t)| / (t_1 ··· t_m) dt + O(log^2(NT) · δ√m) + O(δ)
              ≤ C ∫_{1/(NT) ≤ t_1,...,t_m ≤ T} δ / (t_1 ··· t_m) dt + O(log^2(NT) · δ√m)
              ≤ C δ (log N + 2 log T)^m + O(log^2(NT) · δ√m)
              = O((log N(δ) + 2 log(1/δ))^m · δ).
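As a small numerical illustration of the pipeline in this section (matching low moments gives close characteristic functions, which in turn give close cdfs), the following sketch (ours, with arbitrary parameter choices) compares a standard Gaussian with a Laplace variable scaled so that its first two moments match:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_normal(20000)                     # X ~ N(0, 1)
y = rng.laplace(scale=1 / np.sqrt(2), size=20000)  # Y: Laplace with E[Y] = 0, E[Y^2] = 1

def ecf(sample, ts):
    # empirical characteristic function phi(t) = E[exp(-i t X)]
    return np.exp(-1j * np.outer(ts, sample)).mean(axis=1)

ts = np.linspace(-5.0, 5.0, 201)
cf_gap = np.abs(ecf(x, ts) - ecf(y, ts)).max()   # proxy for max_{|t| <= T} |phi_X - phi_Y|
ks_gap = stats.ks_2samp(x, y).statistic          # empirical cdf (Kolmogorov-Smirnov) distance
print(f"max CF gap on |t| <= 5: {cf_gap:.3f}; KS distance: {ks_gap:.3f}")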
