Stochastic Three-Composite Convex Minimization

Alp Yurtsever, Bằng Công Vũ, and Volkan Cevher
Laboratory for Information and Inference Systems (LIONS)
École Polytechnique Fédérale de Lausanne, Switzerland
alp.yurtsever@epfl.ch, bang.vu@epfl.ch, volkan.cevher@epfl.ch

Abstract

We propose a stochastic optimization method for the minimization of the sum of three convex functions, one of which has Lipschitz continuous gradient as well as restricted strong convexity. Our approach is most suitable in the setting where it is computationally advantageous to process the smooth term in the decomposition with its stochastic gradient estimate and the other two functions separately with their proximal operators, such as doubly regularized empirical risk minimization problems. We prove the convergence characterization of the proposed algorithm in expectation under the standard assumptions for the stochastic gradient estimate of the smooth term. Our method operates in the primal space and can be considered as a stochastic extension of the three-operator splitting method. Numerical evidence supports the effectiveness of our method in real-world problems.

1 Introduction

We propose a stochastic optimization method for the three-composite minimization problem:

  minimize_{x ∈ R^d}  f(x) + g(x) + h(x),    (1)

where f: R^d → R and g: R^d → R are proper, lower semicontinuous convex functions that admit tractable proximal operators, and h: R^d → R is a smooth function with restricted strong convexity. In the sequel, we assume that we have access to unbiased, stochastic estimates of the gradient of h, which is key to scaling up optimization and to addressing streaming settings where data arrive over time.

Template (1) covers a large number of applications in machine learning, statistics, and signal processing by appropriately choosing the individual terms. Operator splitting methods are powerful in this setting, since they reduce the complex problem (1) into smaller subproblems. These algorithms are easy to implement, and they typically exhibit state-of-the-art performance.

To our knowledge, there is no operator splitting framework that can currently tackle template (1) using a stochastic gradient of h and the proximal operators of f and g separately, which is critical to the scalability of such methods. This paper specifically bridges this gap.

Our basic framework is closely related to the deterministic three-operator splitting method proposed in [11], but we avoid the computation of the gradient ∇h and instead work with its unbiased estimates. We provide rigorous convergence guarantees for our approach and provide guidance in selecting the learning rate under different scenarios.

Roadmap. Section 2 introduces the basic optimization background. Section 3 then presents the main algorithm and provides its convergence characterization. Section 4 places our contributions in light of the existing work. Numerical evidence that illustrates our theory appears in Section 5. We relegate the technical proofs to the supplementary material.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Notation and background

This section recalls a few basic notions from convex analysis and probability theory, and presents the notation used in the rest of the paper. Throughout, Γ_0(R^d) denotes the set of all proper, lower semicontinuous convex functions from R^d to ]−∞, +∞], and ⟨· | ·⟩ is the standard scalar product on R^d with its associated norm ‖·‖.

Subdifferential. The subdifferential of f ∈ Γ_0(R^d) at a point x ∈ R^d is defined as

  ∂f(x) = {u ∈ R^d | f(y) − f(x) ≥ ⟨y − x | u⟩, ∀y ∈ R^d}.

We denote the domain of ∂f as

  dom(∂f) = {x ∈ R^d | ∂f(x) ≠ ∅}.

If ∂f(x) is a singleton, then f is differentiable at x, and ∂f(x) = {∇f(x)}.

Indicator function. Given a nonempty subset C of R^d, the indicator function of C is given by

  ι_C(x) = 0 if x ∈ C,  and  ι_C(x) = +∞ if x ∉ C.    (2)
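For instance, the subdifferential of the indicator function of a nonempty, closed convex set C at a point x ∈ C is the normal cone to C at x,

  ∂ι_C(x) = {u ∈ R^d | ⟨y − x | u⟩ ≤ 0, ∀y ∈ C},

and ∂ι_C(x) = ∅ for x ∉ C. This is the form in which constraint sets enter the optimality condition that defines the solution set X⋆ below.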
Proximal operator. The proximal operator of a function f ∈ Γ_0(R^d) is defined as follows:

  prox_f(x) = argmin_{z ∈ R^d} { f(z) + (1/2)‖z − x‖² }.    (3)

Roughly speaking, the proximal operator is tractable when the computation of (3) is cheap. If f is the indicator function of a nonempty, closed convex subset C, its proximity operator is the projection operator onto C.

Lipschitz continuous gradient. A function f ∈ Γ_0(R^d) has Lipschitz continuous gradient with Lipschitz constant L > 0 (or simply L-Lipschitz), if

  ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, ∀x, y ∈ R^d.

Strong convexity. A function f ∈ Γ_0(R^d) is called strongly convex with some parameter μ > 0 (or simply μ-strongly convex), if

  ⟨p − q | x − y⟩ ≥ μ‖x − y‖², ∀x, y ∈ dom(∂f), ∀p ∈ ∂f(x), ∀q ∈ ∂f(y).

Solution set. We denote optimum points of (1) by x⋆, and the solution set by X⋆:

  x⋆ ∈ X⋆ = {x ∈ R^d | 0 ∈ ∇h(x) + ∂g(x) + ∂f(x)}.

Throughout this paper, we assume that X⋆ is not empty.

Restricted strong convexity. A function f ∈ Γ_0(R^d) has restricted strong convexity with respect to a point x⋆ in a set M ⊂ dom(∂f), with parameter μ > 0, if

  ⟨p − q | x − x⋆⟩ ≥ μ‖x − x⋆‖², ∀x ∈ M, ∀p ∈ ∂f(x), ∀q ∈ ∂f(x⋆).

Let (Ω, F, P) be a probability space. An R^d-valued random variable is a measurable function x: Ω → R^d, where R^d is endowed with the Borel σ-algebra. We denote by σ(x) the σ-field generated by x. The expectation of a random variable x is denoted by E[x]. The conditional expectation of x given a σ-field A ⊂ F is denoted by E[x | A]. Given a random variable y: Ω → R^d, the conditional expectation of x given y is denoted by E[x | y]. See [17] for more details on probability theory. An R^d-valued random process is a sequence (x_n)_{n∈N} of R^d-valued random variables.

3 Stochastic three-composite minimization algorithm and its analysis

We present the stochastic three-composite minimization method (S3CM) in Algorithm 1, for solving the three-composite template (1). Our approach combines the stochastic gradient of h, denoted as r, and the proximal operators of f and g in essentially the same structure as the three-operator splitting method [11, Algorithm 2]. Our technique is a nontrivial combination of the algorithmic framework of [11] with stochastic analysis.

Algorithm 1 Stochastic three-composite minimization algorithm (S3CM)
Input: An initial point x_{f,0}, a sequence of learning rates (γ_n)_{n∈N}, and a sequence of square-integrable R^d-valued stochastic gradient estimates (r_n)_{n∈N}.
Initialization:
  x_{g,0} = prox_{γ_0 g}(x_{f,0})
  u_{g,0} = γ_0^{-1}(x_{f,0} − x_{g,0})
Main loop: for n = 0, 1, 2, ... do
  x_{g,n+1} = prox_{γ_n g}(x_{f,n} + γ_n u_{g,n})
  u_{g,n+1} = γ_n^{-1}(x_{f,n} − x_{g,n+1}) + u_{g,n}
  x_{f,n+1} = prox_{γ_{n+1} f}(x_{g,n+1} − γ_{n+1} u_{g,n+1} − γ_{n+1} r_{n+1})
end for
Output: x_{g,n} as an approximation of an optimal solution x⋆.

Theorem 1 Assume that h is μ_h-strongly convex and has L-Lipschitz continuous gradient. Further assume that g is μ_g-strongly convex, where we allow μ_g = 0. Consider the following update rule for the learning rate:

  γ_{n+1} = ( −γ_n² μ_h η + √( (γ_n² μ_h η)² + (1 + 2γ_n μ_g) γ_n² ) ) / (1 + 2γ_n μ_g),  for some γ_0 > 0 and η ∈ ]0, 1[.

Define F_n = σ(x_{f,k})_{0≤k≤n}, and suppose that the following conditions hold for every n ∈ N:
1. E[r_{n+1} | F_n] = ∇h(x_{g,n+1}) almost surely,
2. There exist c ∈ [0, +∞[ and t ∈ R that satisfy Σ_{k=0}^{n} E[‖r_k − ∇h(x_{g,k})‖²] ≤ c n^t.

Then the iterates of S3CM satisfy

  E[‖x_{g,n} − x⋆‖²] = O(1/n²) + O(1/n^{2−t}).    (4)

Remark 1 The variance condition on the stochastic gradient estimates in the theorems of this section is satisfied when E[‖r_n − ∇h(x_{g,n})‖²] ≤ c for all n ∈ N and for some constant c ∈ [0, +∞[. See [15, 22, 26] for details.
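To make the iteration concrete, the following is a minimal sketch of Algorithm 1 in Python. The helpers prox_f, prox_g, and stoch_grad are user-supplied and illustrative (for example, projections and a sampled-gradient oracle); the learning-rate sequence can follow the recursion of Theorem 1 or the simple γ_n = γ_0/(n+1) schedule used in Section 5.

```python
def s3cm(prox_f, prox_g, stoch_grad, x_f, gammas, n_iter):
    """Sketch of S3CM (Algorithm 1).

    prox_f(y, gamma) and prox_g(y, gamma) evaluate prox_{gamma f}(y) and
    prox_{gamma g}(y); stoch_grad(x) returns an unbiased estimate of grad h(x);
    gammas is a sequence of learning rates with at least n_iter + 1 entries.
    """
    x_g = prox_g(x_f, gammas[0])                  # x_{g,0}
    u_g = (x_f - x_g) / gammas[0]                 # u_{g,0}
    for n in range(n_iter):
        x_g = prox_g(x_f + gammas[n] * u_g, gammas[n])       # x_{g,n+1}
        u_g = (x_f - x_g) / gammas[n] + u_g                   # u_{g,n+1}
        r = stoch_grad(x_g)                                   # r_{n+1}, evaluated at x_{g,n+1}
        x_f = prox_f(x_g - gammas[n + 1] * (u_g + r), gammas[n + 1])  # x_{f,n+1}
    return x_g                                    # x_{g,n} approximates x*
```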
Remark 2 When r_n = ∇h(x_n), S3CM reduces to the deterministic three-operator splitting scheme [11, Algorithm 2] and we recover the convergence rate O(1/n²) as in [11]. When g is zero, S3CM reduces to the standard stochastic proximal point algorithm [2, 13, 26].

Remark 3 The learning rate sequence (γ_n)_{n∈N} in Theorem 1 depends on the strong convexity parameter μ_h, which may not be available a priori. Our next result avoids the explicit reliance on the strong convexity parameter, while providing essentially the same convergence rate.

Theorem 2 Assume that h is μ_h-strongly convex and has L-Lipschitz continuous gradient. Consider a positive decreasing learning rate sequence γ_n = Θ(1/n^α) for some α ∈ ]0, 1], and denote β = lim_{n→∞} 2 μ_h n^α γ_n.

Define F_n = σ(x_{f,k})_{0≤k≤n}, and suppose that the following conditions hold for every n ∈ N:
1. E[r_{n+1} | F_n] = ∇h(x_{g,n+1}) almost surely,
2. E[‖r_n − ∇h(x_{g,n})‖²] is uniformly bounded by some positive constant,
3. E[‖u_{g,n} − x⋆‖²] is uniformly bounded by some positive constant.

Then the iterates of S3CM satisfy

  E[‖x_{g,n} − x⋆‖²] = O(1/n^α)       if 0 < α < 1,
                       O(1/n^β)       if α = 1 and β < 1,
                       O((log n)/n)   if α = 1 and β = 1,
                       O(1/n)         if α = 1 and β > 1.

Proof outline. We consider the proof of the three-operator splitting method as a baseline, and we use stochastic fixed point theory to derive the convergence of the iterates via the stochastic Fejér monotone sequence. See the supplement for the complete proof.

Remark 4 Note that u_{g,n} ∈ ∂g(x_{g,n}). Hence, we can replace condition 3 in Theorem 2 with the bounded subgradient assumption: ‖p‖ ≤ c, ∀p ∈ ∂g(x_{g,n}), for some positive constant c.

Remark 5 (Restricted strong convexity) Let M be a subset of R^d that contains (x_{g,n})_{n∈N} and x⋆. Suppose that h has restricted strong convexity on M with parameter μ_h. Then Theorems 1 and 2 still hold. An example role of the restricted strong convexity assumption on algorithmic convergence can be found in [1, 21].

Remark 6 (Extension to an arbitrary number of non-smooth terms) Using the product space technique [5, Section 6.1], S3CM can be applied to composite problems with an arbitrary number of non-smooth terms:

  minimize_{x ∈ R^d}  Σ_{i=1}^{m} f_i(x) + h(x),

where f_i: R^d → R are proper, lower semicontinuous convex functions, and h: R^d → R is a smooth function with restricted strong convexity. We present this variant in Algorithm 2. Theorems 1 and 2 hold for this variant, replacing x_{g,n} by x_n, and u_{g,n} by u_{i,n} for i = 1, 2, ..., m.

Algorithm 2 Stochastic m(ulti)-composite minimization algorithm (SmCM)
Input: Initial points {x_{f_1,0}, x_{f_2,0}, ..., x_{f_m,0}}, a sequence of learning rates (γ_n)_{n∈N}, and a sequence of square-integrable R^d-valued stochastic gradient estimates (r_n)_{n∈N}.
Initialization:
  x_0 = m^{-1} Σ_{i=1}^{m} x_{f_i,0}
  for i = 1, 2, ..., m do
    u_{i,0} = γ_0^{-1}(x_{f_i,0} − x_0)
  end for
Main loop: for n = 0, 1, 2, ... do
  x_{n+1} = m^{-1} Σ_{i=1}^{m} (x_{f_i,n} + γ_n u_{i,n})
  for i = 1, 2, ..., m do
    u_{i,n+1} = γ_n^{-1}(x_{f_i,n} − x_{n+1}) + u_{i,n}
    x_{f_i,n+1} = prox_{γ_{n+1} m f_i}(x_{n+1} − γ_{n+1} u_{i,n+1} − γ_{n+1} r_{n+1})
  end for
end for
Output: x_n as an approximation of an optimal solution x⋆.

Remark 7 With a proper learning rate, S3CM still converges even if h is not (restricted) strongly convex, under mild assumptions. Suppose that h has L-Lipschitz continuous gradient. Set the learning rate such that ε ≤ γ_n ≡ γ ≤ α(2L^{-1} − ε), for some α and ε in ]0, 1[. Define F_n = σ(x_{f,k})_{0≤k≤n}, and suppose that the following conditions hold for every n ∈ N:
1. E[r_{n+1} | F_n] = ∇h(x_{g,n+1}) almost surely,
2. Σ_{n∈N} E[‖r_{n+1} − ∇h(x_{g,n+1})‖² | F_n] < +∞ almost surely.
Then (x_{g,n})_{n∈N} converges to an X⋆-valued random vector almost surely. See [7] for details.

Remark 8 All the results above hold for any separable Hilbert space, except that the strong convergence in Remark 7 is replaced by weak convergence. Note however that extending Remark 7 to the variable metric setting as in [10, 27] is an open problem.
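For instance, with the schedule γ_n = γ_0/(n+1) adopted later in the experiments, Theorem 2 applies with α = 1 and

  β = lim_{n→∞} 2 μ_h n γ_0/(n+1) = 2 μ_h γ_0.

Hence γ_0 > 1/(2μ_h) gives β > 1 and the O(1/n) rate, γ_0 = 1/(2μ_h) gives the O((log n)/n) rate, and a smaller γ_0 only yields O(1/n^{2μ_h γ_0}); this is consistent with the role of choosing γ_0 large enough in Section 5.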
4 Contributions in the light of prior work

Recent algorithms in operator splitting, such as generalized forward-backward splitting [24], forward-Douglas-Rachford splitting [5], and the three-operator splitting [11], apply to our problem template (1). These key results, however, are in the deterministic setting.

Our basic framework can be viewed as a combination of the three-operator splitting method in [11] with stochastic analysis.

The idea of using unbiased estimates of the gradient dates back to [25]. Recent developments of this idea can be viewed as proximal-based methods for solving the generic composite convex minimization template with a single non-smooth term [2, 9, 12, 13, 15, 16, 19, 23, 26]. This generic form arises naturally in regularized or constrained composite problems [3, 13, 20], where the smooth term typically encodes the data fidelity. These methods require the evaluation of the joint prox of f and g when applied to the three-composite template (1).

Unfortunately, evaluation of the joint prox is arguably more expensive compared to the individual prox operators. To make the comparison stark, consider the simple example where f and g are indicator functions of two convex sets. Even if the projections onto the individual sets are easy to compute, projection onto the intersection of these sets can be challenging.

Related literature also contains algorithms that solve some specific instances of template (1). To point out a few, the random averaging projection method [28] handles multiple constraints simultaneously but cannot deal with regularizers. On the other hand, accelerated stochastic gradient descent with proximal average [29] can handle multiple regularizers simultaneously, but the algorithm imposes a Lipschitz condition on the regularizers, and hence it cannot deal with constraints.

To our knowledge, our method is the first operator splitting framework that can tackle the optimization template (1) using the stochastic gradient estimate of h and the proximal operators of f and g separately, without any restriction on the non-smooth parts except that their subdifferentials are maximally monotone. When h is strongly convex, under mild assumptions, and with a proper learning rate, our algorithm converges with O(1/n) rate, which is optimal for stochastic methods under the strong convexity assumption for this problem class.

5 Numerical experiments

We present numerical evidence to assess the theoretical convergence guarantees of the proposed algorithm. We provide two numerical examples from Markowitz portfolio optimization and support vector machines.

As a baseline, we use the deterministic three-operator splitting method [11]. Even though the random averaging projection method proposed in [28] does not apply to our template (1) in its full generality, it does for the specific applications that we present below. In our numerical tests, however, we observed that this method exhibits essentially the same convergence behavior as ours when used with the same learning rate sequence. For the clarity of the presentation, we omit this method in our results.

5.1 Portfolio optimization

Traditional Markowitz portfolio optimization aims to reduce risk by minimizing the variance for a given expected return. Mathematically, we can formulate this as a convex optimization problem [6]:

  minimize_{x ∈ R^d}  E[|a_i^T x − b|²]  subject to  x ∈ Δ,  a_av^T x ≥ b,

where Δ is the standard simplex for portfolios with no short positions or a simple sum constraint, a_av = E[a_i] is the vector of average returns for each asset, which is assumed to be known (or estimated), and b encodes a minimum desired return.

This problem has a streaming nature where new data points arrive in time. Hence, we typically do not have access to the whole dataset, and the stochastic setting is more favorable.
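Both constraints in this application enter S3CM only through Euclidean projections. As a concrete illustration, the following is a minimal sketch of the projection onto the standard simplex Δ, assuming the usual sort-and-threshold construction; it serves as prox_g in the portfolio sketch given at the end of this subsection.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto the standard simplex
    {x : x >= 0, sum(x) = 1}, via the sort-and-threshold rule."""
    u = np.sort(y)[::-1]                      # entries sorted in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, y.size + 1)
    cond = u + (1.0 - css) / ks > 0           # indices passing the threshold test
    rho = ks[cond][-1]                        # largest such index
    tau = (1.0 - css[cond][-1]) / rho         # shift that makes the result sum to 1
    return np.maximum(y + tau, 0.0)
```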
For implementation, we replace the expectation with the empirical sample average:

  minimize_{x ∈ R^d}  (1/p) Σ_{i=1}^{p} (a_i^T x − b)²  subject to  x ∈ Δ,  a_av^T x ≥ b.    (5)

This problem fits into our optimization template (1) by setting

  h(x) = (1/p) Σ_{i=1}^{p} (a_i^T x − b)²,  g(x) = ι_Δ(x),  and  f(x) = ι_{{x | a_av^T x ≥ b}}(x).

We compute the unbiased estimates of the gradient by r_n = 2(a_{i_n}^T x − b) a_{i_n}, where the index i_n is chosen uniformly at random.

We use 5 different real portfolio datasets: Dow Jones industrial average (DJIA, with 30 stocks for 507 days), New York stock exchange (NYSE, with 36 stocks for 5651 days), Standard & Poor's 500 (SP500, with 25 stocks for 1276 days), and Toronto stock exchange (TSE, with 88 stocks for 1258 days), which are also considered in [4]; and one dataset by Fama and French (FF100, 100 portfolios formed on size and book-to-market, 23,647 days) that is commonly used in the financial literature, e.g., [6, 14]. We impute the missing data in FF100 using a nearest-neighbor method with Euclidean distance.

Figure 1: Comparison of the deterministic three-operator splitting method [11, Algorithm 2] and our stochastic three-composite minimization method (S3CM) for Markowitz portfolio optimization (5). Results are averaged over 100 Monte-Carlo simulations, and the boundaries of the shaded area are the best and worst instances.

For the deterministic algorithm, we set η = 0.1. We evaluate the Lipschitz constant L and the strong convexity parameter μ_h to determine the step size. For the stochastic algorithm, we do not have access to the whole data, so we cannot compute these parameters. Hence, we adopt the learning rate sequence defined in Theorem 2. We simply use γ_n = γ_0/(n+1) with γ_0 = 1 for FF100, and γ_0 = 10³ for the others.¹ We start both algorithms from the zero vector.

¹ Note that a fine-tuned learning rate with a more complex definition can improve the empirical performance, e.g., γ_n = γ_0/(n + ζ) for some positive constants γ_0 and ζ.

We split all the datasets into test (10%) and train (90%) partitions randomly. We set the desired return as the average return over all assets in the training set, b = mean(a_av). Other b values exhibit qualitatively similar behavior.

The results of this experiment are compiled in Figure 1. We compute the objective function over the data points in the test partition, h_test. We compare our algorithm against the deterministic three-operator splitting method [11, Algorithm 2]. Since we seek statistical solutions, we compare the algorithms to achieve low to medium accuracy. [11] provides other variants of the deterministic algorithm, including two ergodic averaging schemes that feature improved theoretical rates of convergence. However, these variants performed worse in practice than the original method, and are omitted.

Solid lines in Figure 1 present the average results over 100 Monte-Carlo simulations, and the boundaries of the shaded area are the best and worst instances. We also assess empirical evidence of the O(1/n) convergence rate guaranteed in Theorem 2 by presenting the squared relative distance to the optimum solution for the FF100 dataset. Here, we approximate the ground truth by solving the problem to high accuracy with the deterministic algorithm for 10⁵ iterations.
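A minimal sketch of this experiment, assuming the s3cm and project_simplex helpers sketched earlier (the function and parameter names are illustrative, not from the original implementation):

```python
import numpy as np

def portfolio_s3cm(A, b, gamma0, n_iter, seed=0):
    """Sketch of S3CM for problem (5).  A is the p-by-d matrix of daily asset
    returns a_i, and b is the minimum desired return."""
    rng = np.random.default_rng(seed)
    p, d = A.shape
    a_av = A.mean(axis=0)

    def stoch_grad(x):                        # r_n = 2 (a_i^T x - b) a_i, i uniform
        i = rng.integers(p)
        return 2.0 * (A[i] @ x - b) * A[i]

    def prox_g(y, gamma):                     # projection onto the simplex Delta
        return project_simplex(y)

    def prox_f(y, gamma):                     # projection onto {x : a_av^T x >= b}
        gap = b - a_av @ y
        return y if gap <= 0 else y + (gap / (a_av @ a_av)) * a_av

    gammas = gamma0 / (np.arange(n_iter + 1) + 1.0)   # gamma_n = gamma_0 / (n + 1)
    return s3cm(prox_f, prox_g, stoch_grad, np.zeros(d), gammas, n_iter)
```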
5.2 Nonlinear support vector machines classification

This section demonstrates S3CM on a support vector machine (SVM) binary classification problem. We are given a training set A = {a_1, a_2, ..., a_d} and the corresponding class labels {b_1, b_2, ..., b_d}, where a_i ∈ R^p and b_i ∈ {−1, 1}. The goal is to build a model that assigns new examples into one class or the other correctly. As is common in practice, we solve the dual soft-margin SVM formulation:

  minimize_{x ∈ R^d}  (1/2) Σ_{i=1}^{d} Σ_{j=1}^{d} K(a_i, a_j) b_i b_j x_i x_j − Σ_{i=1}^{d} x_i  subject to  x ∈ [0, C]^d,  b^T x = 0,

where C ∈ [0, +∞[ is the penalty parameter and K: R^p × R^p → R is a kernel function.

In our example we use the Gaussian kernel given by K_σ(a_i, a_j) = exp(−σ‖a_i − a_j‖²) for some σ > 0. Define the symmetric positive semidefinite matrix M ∈ R^{d×d} with entries M_{ij} = K_σ(a_i, a_j) b_i b_j. Then the problem takes the form

  minimize_{x ∈ R^d}  (1/2) x^T M x − Σ_{i=1}^{d} x_i  subject to  x ∈ [0, C]^d,  b^T x = 0.    (6)

This problem fits into the three-composite optimization template (1) with

  h(x) = (1/2) x^T M x − Σ_{i=1}^{d} x_i,  g(x) = ι_{[0,C]^d}(x),  and  f(x) = ι_{{x | b^T x = 0}}(x).

One can solve this problem using the three-operator splitting method [11, Algorithm 1]. Note that prox_f and prox_g, which are projections onto the corresponding constraint sets, incur O(d) computational cost, whereas the cost of computing the gradient is O(d²).

To compute an unbiased gradient estimate, we choose an index i_n uniformly at random, and we form r_n = d M_{i_n} x_{i_n} − 1. Here M_{i_n} denotes the i_n-th column of matrix M, and 1 represents the vector of ones. We can compute r_n in O(d) computations, hence each iteration of S3CM is an order of magnitude (a factor of d) cheaper than an iteration of the deterministic algorithm.

We use the UCI machine learning dataset "a1a", with d = 1605 data points and p = 123 features [8, 18]. Note that our goal here is to demonstrate the optimization performance of our algorithm for a real-world problem, rather than to compete with the prediction quality of the best engineered solvers. Hence, to keep the experiments simple, we fix the problem parameters C = 1 and σ = 2^{−2}, and we focus on the effects of the algorithmic parameters on the convergence behavior.

Since p < d, M is rank deficient and h is not strongly convex. Nevertheless, we use S3CM with the learning rate γ_n = γ_0/(n+1) for various values of γ_0. We observe an O(1/n) empirical convergence rate on the squared relative error for large enough γ_0, which is guaranteed under the restricted strong convexity assumption. See Figure 2 for the results.
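A minimal sketch of this estimator inside S3CM, reusing the s3cm helper from Section 3 (names are illustrative, and the kernel matrix M is assumed to be precomputed):

```python
import numpy as np

def svm_dual_s3cm(M, b, C, gamma0, n_iter, seed=0):
    """Sketch of S3CM for the dual SVM problem (6).  M is the d-by-d matrix
    with entries K_sigma(a_i, a_j) b_i b_j, and b is the vector of labels."""
    rng = np.random.default_rng(seed)
    d = M.shape[0]
    ones = np.ones(d)

    def stoch_grad(x):                        # r_n = d * M[:, i] * x[i] - 1, O(d) cost
        i = rng.integers(d)
        return d * M[:, i] * x[i] - ones

    def prox_g(y, gamma):                     # projection onto the box [0, C]^d
        return np.clip(y, 0.0, C)

    def prox_f(y, gamma):                     # projection onto the hyperplane {x : b^T x = 0}
        return y - (b @ y) / (b @ b) * b

    gammas = gamma0 / (np.arange(n_iter + 1) + 1.0)   # gamma_n = gamma_0 / (n + 1)
    return s3cm(prox_f, prox_g, stoch_grad, np.zeros(d), gammas, n_iter)
```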
Figure 2: [Left] Convergence of S3CM in the squared relative error with learning rate γ_n = γ_0/(n+1). [Right] Comparison of the deterministic three-operator splitting method [11, Algorithm 1] and S3CM with γ_0 = 1 for the SVM classification problem. Results are averaged over 100 Monte-Carlo simulations. Boundaries of the shaded area are the best and worst instances.

Acknowledgments

This work was supported in part by ERC Future Proof, SNF 200021-146750, SNF CRSII2-147633, and NCCR-Marvel.

References

[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat., 40(5):2452–2482, 2012.
[2] Y. F. Atchadé, G. Fort, and E. Moulines. On stochastic proximal gradient algorithms. arXiv:1402.2365v2, 2014.
[3] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer-Verlag, 2011.
[4] A. Borodin, R. El-Yaniv, and V. Gogan. Can we learn to beat the best stock. In Advances in Neural Information Processing Systems 16, pages 345–352, 2004.
[5] L. M. Briceño-Arias. Forward-Douglas–Rachford splitting and forward-partial inverse method for solving monotone inclusions. Optimization, 64(5):1239–1261, 2015.
[6] J. Brodie, I. Daubechies, C. de Mol, D. Giannone, and I. Loris. Sparse and stable Markowitz portfolios. Proc. Natl. Acad. Sci., 106:12267–12272, 2009.
[7] V. Cevher, B. C. Vũ, and A. Yurtsever. Stochastic forward–Douglas–Rachford splitting for monotone inclusions. EPFL-Report-215759, 2016.
[8] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, 2011.
[9] P. L. Combettes and J.-C. Pesquet. Stochastic approximations and perturbations in forward-backward splitting for monotone operators. arXiv:1507.07095v1, 2015.
[10] P. L. Combettes and B. C. Vũ. Variable metric forward–backward splitting with applications to monotone inclusions in duality. Optimization, 63(9):1289–1318, 2014.
[11] D. Davis and W. Yin. A three-operator splitting scheme and its optimization applications. arXiv:1504.01032v1, 2015.
[12] O. Devolder. Stochastic first order methods in smooth convex optimization. Technical report, Center for Operations Research and Econometrics, 2011.
[13] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res., 10:2899–2934, 2009.
[14] E. F. Fama and K. R. French. Multifactor explanations of asset pricing anomalies. Journal of Finance, 51:55–84, 1996.
[15] C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems 22, pages 781–789, 2009.
[16] G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133(1):365–397, 2012.
[17] M. Ledoux and M. Talagrand. Probability in Banach spaces: Isoperimetry and processes. Springer-Verlag, 1991.
[18] M. Lichman. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, 2013.
[19] Q. Lin, X. Chen, and J. Peña. A smoothing stochastic gradient method for composite optimization. Optimization Methods and Software, 29(6):1281–1301, 2014.
[20] S. Mosci, L. Rosasco, M. Santoro, A. Verri, and S. Villa. Solving structured sparsity regularization with proximal methods. In European Conf. Machine Learning and Principles and Practice of Knowledge Discovery, pages 418–433, 2010.
[21] S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems 22, pages 1348–1356, 2009.
[22] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. on Optimization, 15(1):229–251, 2005.
[23] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems 27, pages 1574–1582, 2014.
[24] H. Raguet, J. Fadili, and G. Peyré. A generalized forward-backward splitting. SIAM Journal on Imaging Sciences, 6(3):1199–1226, 2013.
[25] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 1951.
[26] L. Rosasco, S. Villa, and B. C. Vũ. Convergence of stochastic proximal gradient algorithm. arXiv:1403.5074v3, 2014.
[27] B. C. Vũ. Almost sure convergence of the forward–backward–forward splitting algorithm. Optimization Letters, 10(4):781–803, 2016.
[28] M. Wang, Y. Chen, J. Liu, and Y. Gu. Random multi-constraint projection: Stochastic gradient methods for convex optimization with many constraints. arXiv:1511.03760v1, 2015.
[29] W. Zhong and J. Kwok. Accelerated stochastic gradient method for composite regularization. J. Mach. Learn. Res., 33:1086–1094, 2014.

Appendix: Proof of the main result

In this supplement, we provide the proofs of Theorem 1 and Theorem 2.

Proof of Theorem 1. For every n ∈ N, we have

  u_{f,n} := γ_n^{-1}(x_{g,n} − x_{f,n}) − (u_{g,n} + r_n) ∈ ∂f(x_{f,n}),
  u_{g,n} ∈ ∂g(x_{g,n}),
  γ_n(u_{g,n+1} − u_{g,n}) = x_{f,n} − x_{g,n+1},
  γ_n(u_{f,n} + u_{g,n} + r_n) = x_{g,n} − x_{f,n},
  γ_n(u_{f,n} + u_{g,n+1} + r_n) = x_{g,n} − x_{g,n+1}.

Now, let us define

  χ_n = 2γ_n⟨x_{f,n} − x⋆ | u_{f,n} + r_n⟩ + 2γ_n⟨x_{g,n+1} − x⋆ | u_{g,n+1}⟩
  χ_{1,n} = 2⟨x_{g,n+1} − x⋆ | x_{g,n} − x_{g,n+1}⟩
  χ_{2,n} = 2⟨x_{f,n} − x_{g,n+1} | x_{g,n} − x_{f,n}⟩
  χ_{3,n} = 2γ_n⟨x_{g,n+1} − x_{f,n} | u_{g,n} − u_g⋆⟩ = 2γ_n²⟨u_{g,n} − u_{g,n+1} | u_{g,n} − u_g⋆⟩
  χ_{4,n} = 2γ_n⟨x_{g,n+1} − x_{f,n} | u_g⋆⟩

where u_g⋆ ∈ ∂g(x⋆).
Then, by simple calculations we get

  χ_{1,n} = ‖x_{g,n} − x⋆‖² − ‖x_{g,n+1} − x⋆‖² − ‖x_{g,n} − x_{g,n+1}‖²
  χ_{2,n} = ‖x_{g,n} − x_{g,n+1}‖² − ‖x_{f,n} − x_{g,n+1}‖² − ‖x_{g,n} − x_{f,n}‖²    (7)
  χ_{3,n} = γ_n²‖u_{g,n} − u_{g,n+1}‖² + γ_n²‖u_{g,n} − u_g⋆‖² − γ_n²‖u_{g,n+1} − u_g⋆‖²
          = γ_n²‖u_{g,n} − u_g⋆‖² − γ_n²‖u_{g,n+1} − u_g⋆‖² + ‖x_{f,n} − x_{g,n+1}‖².

Furthermore, for every n ∈ N, we can express χ_n as follows:

  χ_n = 2γ_n⟨x_{f,n} − x_{g,n+1} | u_{f,n} + r_n⟩ + 2γ_n⟨x_{g,n+1} − x⋆ | u_{g,n+1} + u_{f,n} + r_n⟩
      = χ_{1,n} + 2γ_n⟨x_{f,n} − x_{g,n+1} | u_{f,n} + r_n⟩
      = χ_{1,n} + 2γ_n(⟨x_{f,n} − x_{g,n+1} | u_{f,n} + r_n + u_{g,n}⟩ − ⟨x_{f,n} − x_{g,n+1} | u_{g,n}⟩)
      = χ_{1,n} + χ_{2,n} + 2γ_n⟨x_{g,n+1} − x_{f,n} | u_{g,n}⟩
      = χ_{1,n} + χ_{2,n} + 2γ_n⟨x_{g,n+1} − x_{f,n} | u_g⋆⟩ + 2γ_n⟨x_{g,n+1} − x_{f,n} | u_{g,n} − u_g⋆⟩
      = χ_{1,n} + χ_{2,n} + χ_{3,n} + χ_{4,n}.

Now, summing the equalities in (7), we obtain

  χ_n = γ_n²(‖u_{g,n} − u_g⋆‖² − ‖u_{g,n+1} − u_g⋆‖²) + ‖x_{g,n} − x⋆‖² − ‖x_{g,n+1} − x⋆‖² − ‖x_{g,n} − x_{f,n}‖² + χ_{4,n}.

Denote u_f⋆ ∈ ∂f(x⋆). We have ⟨x_{f,n} − x⋆ | u_{f,n} − u_f⋆⟩ ≥ 0, since f is convex. Hence,

  χ_n = 2γ_n(⟨x_{f,n} − x⋆ | u_{f,n} − u_f⋆⟩ + ⟨x_{f,n} − x⋆ | u_f⋆ + r_n⟩ + ⟨x_{g,n+1} − x⋆ | u_{g,n+1}⟩)
      ≥ 2γ_n(⟨x_{f,n} − x⋆ | u_f⋆ + r_n⟩ + ⟨x_{g,n+1} − x⋆ | u_{g,n+1}⟩)
      = 2γ_n(⟨x_{f,n} − x⋆ | u_f⋆ + r_n⟩ + ⟨x_{g,n+1} − x⋆ | u_g⋆⟩ + ⟨x_{g,n+1} − x⋆ | u_{g,n+1} − u_g⋆⟩)
      ≥ 2γ_n(⟨x_{f,n} − x⋆ | u_f⋆ + r_n⟩ + μ_g‖x_{g,n+1} − x⋆‖² + ⟨x_{g,n+1} − x⋆ | u_g⋆⟩),    (8)

where the last inequality follows from the assumption that g is μ_g-strongly convex. Set

  x̃_{f,n} = prox_{γ_n f}(x_{g,n} − γ_n u_{g,n} − γ_n ∇h(x_{g,n})).

Then, using the nonexpansiveness of prox_{γ_n f}, we get

  ‖x̃_{f,n} − x_{f,n}‖ ≤ γ_n‖∇h(x_{g,n}) − r_n‖.

Now, let us define

  χ_{5,n} = ⟨x_{f,n} − x̃_{f,n} | r_n − ∇h(x_{g,n})⟩
  χ_{6,n} = ⟨x̃_{f,n} − x⋆ | r_n − ∇h(x_{g,n})⟩
  χ_{7,n} = χ_{5,n} + χ_{6,n} = ⟨x_{f,n} − x⋆ | r_n − ∇h(x_{g,n})⟩.
