
(Almost) No Label No Cry

Giorgio Patrini^{1,2}, Richard Nock^{1,2}, Paul Rivera^{1,2}, Tiberio Caetano^{1,3,4}
Australian National University^1, NICTA^2, University of New South Wales^3, Ambiata^4
Sydney, NSW, Australia
{name.surname}@anu.edu.au

Abstract

In Learning with Label Proportions (LLP), the objective is to learn a supervised classifier when, instead of labels, only label proportions for bags of observations are known. This setting has broad practical relevance, in particular for privacy preserving data processing. We first show that the mean operator, a statistic which aggregates all labels, is minimally sufficient for the minimization of many proper scoring losses with linear (or kernelized) classifiers without using labels. We provide a fast learning algorithm that estimates the mean operator via a manifold regularizer with guaranteed approximation bounds. Then, we present an iterative learning algorithm that uses this as initialization. We ground this algorithm in Rademacher-style generalization bounds that fit the LLP setting, introducing a generalization of Rademacher complexity and a Label Proportion Complexity measure. This latter algorithm optimizes tractable bounds for the corresponding bag-empirical risk. Experiments are provided on fourteen domains, whose size ranges up to ≈300K observations. They display that our algorithms are scalable and tend to consistently outperform the state of the art in LLP. Moreover, in many cases, our algorithms compete with or are just percents of AUC away from the Oracle that learns knowing all labels. On the largest domains, half a dozen proportions can suffice, i.e. roughly 40K times less than the total number of labels.

1  Introduction

Machine learning has recently experienced a proliferation of problem settings that, to some extent, enrich the classical dichotomy between supervised and unsupervised learning. Cases such as multiple instance labels, noisy labels, partial labels, as well as semi-supervised learning, have been studied, motivated by applications where fully supervised learning is no longer realistic. In the present work, we are interested in learning a binary classifier from information provided at the level of groups of instances, called bags. The type of information we assume available is the label proportions per bag, indicating the fraction of positive binary labels among its instances. Inspired by [1], we refer to this framework as Learning with Label Proportions (LLP). Settings that perform a bag-wise aggregation of labels include Multiple Instance Learning (MIL) [2]. In MIL, the aggregation is logical rather than statistical: each bag is provided with a binary label expressing an OR condition on all the labels contained in the bag. More general settings also exist [3][4][5].

Many practical scenarios fit the LLP abstraction. (a) Only aggregated labels can be obtained due to the physical limits of measurement tools [6][7][8][9]. (b) The problem is semi- or unsupervised but domain experts have knowledge about the unlabelled samples in the form of expectations, as pseudo-measurements [5]. (c) Labels existed once but they are now given in an aggregated fashion for privacy-preserving reasons, as in medical databases [10], fraud detection [11], the house price market, election results, census data, etc. (d) This setting also arises in computer vision [12][13][14].
Related work.  The setting was first introduced by [12], where a principled hierarchical model generates labels consistent with the proportions and is trained through MCMC. Subsequently, [9] and its follower [6] offer a variety of standard learning algorithms designed to generate self-consistent labels. [15] gives a Bayesian interpretation of LLP where the key distribution is estimated through an RBM. Other ideas rely on structural learning of Bayesian networks with missing data [7], and on K-MEANS clustering to solve preliminary label assignment [13][8]. Recent SVM implementations [11][16] outperform most of the other known methods. Theoretical works on LLP belong to two main categories. The first contains uniform convergence results, for the estimators of label proportions [1], or the estimator of the mean operator [17]. The second contains approximation results for the classifier [17]. Our work builds upon their Mean Map algorithm, which relies on the trick that the logistic loss may be split in two: a convex part depending only on the observations, and a linear part involving a sufficient statistic for the label, the mean operator. Being able to estimate the mean operator means being able to fit a classifier without using labels. In [17], this estimation relies on a restrictive homogeneity assumption that the class-conditional estimation of features does not depend on the bags. Experiments display the limits of this assumption [11][16].

Contributions.  In this paper we consider linear classifiers, but our results hold for kernelized formulations following [17]. We first show that the trick about the logistic loss can be generalized, and the mean operator is actually minimally sufficient for a wide set of "symmetric" proper scoring losses with no class-dependent misclassification cost, which encompass the logistic, square and Matsushita losses [18]. We then provide an algorithm, LMM, which estimates the mean operator via a Laplacian-based manifold regularizer without calling to the homogeneity assumption. We show that under a weak distinguishability assumption between bags, our estimation of the mean operator is all the better as the observations norm increases. This, as we show, cannot hold for the Mean Map estimator. Then, we provide a data-dependent approximation bound for our classifier with respect to the optimal classifier, which is shown to be better than previous bounds [17]. We also show that the manifold regularizer's solution is tightly related to the linear separability of the bags. We then provide an iterative algorithm, AMM, that takes as input the solution of LMM and optimizes it further over the set of consistent labelings. We ground the algorithm in a uniform convergence result involving a generalization of Rademacher complexities for the LLP setting. The bound involves a bag-empirical surrogate risk for which we show that AMM optimizes tractable bounds. All our theoretical results hold for any symmetric proper scoring loss. Experiments are provided on fourteen domains, ranging from hundreds to hundreds of thousands of examples, comparing AMM and LMM to their contenders: Mean Map, InvCal [11] and ∝SVM [16]. They display that AMM and LMM outperform their contenders, and sometimes even compete with the fully supervised learner while requiring few proportions only. Tests on the largest domains display the scalability of both algorithms. Such experimental evidence seriously questions the safety of privacy-preserving summarization of data, whenever accurate aggregates and informative individual features are available.

Section 2 presents our algorithms and related theoretical results. Section 3 presents experiments. Section 4 concludes. A Supplementary Material [19] includes proofs and additional experiments.
2  LLP and the mean operator: theoretical results and algorithms

Learning setting  Hereafter, boldfaces like p denote vectors, whose coordinates are denoted p_l for l = 1, 2, .... For any m ∈ N_*, let [m] ≐ {1, 2, ..., m}. Let Σ_m ≐ {σ ∈ {−1, 1}^m} and X ⊆ R^d. Examples are couples (observation, label) ∈ X × Σ_1, sampled i.i.d. according to some unknown but fixed distribution D. Let S ≐ {(x_i, y_i), i ∈ [m]} ∼ D_m denote a size-m sample. In Learning with Label Proportions (LLP), we do not observe directly S but S_{|y}, which denotes S with labels removed; we are given its partition in n > 0 bags, S_{|y} = ∪_j S_j, j ∈ [n], along with their respective label proportions π̂_j ≐ P̂[y = +1 | S_j] and bag proportions p̂_j ≐ m_j/m with m_j = card(S_j). (This generalizes to a cover of S, by copying examples among bags.) The "bag assignment function" that partitions S is unknown but fixed. In real world domains, it would rather be known, e.g. state, gender, age band. A classifier is a function h : X → R, from a set of classifiers H. H_L denotes the set of linear classifiers, noted h_θ(x) ≐ θ^⊤x with θ ∈ X. A (surrogate) loss is a function F : R → R_+. We let F(S, h) ≐ (1/m) Σ_i F(y_i h(x_i)) denote the empirical surrogate risk on S corresponding to loss F. For the sake of clarity, indexes i, j and k respectively refer to examples, bags and features.

The mean operator and its minimal sufficiency  We define the (empirical) mean operator as:

    \mu_S \doteq \frac{1}{m} \sum_i y_i x_i .    (1)

Algorithm 1  Laplacian Mean Map (LMM)
  Input: S_j, π̂_j, j ∈ [n]; γ > 0 (7); w (7); V (8); permissible φ (2); λ > 0
  Step 1: let B̃^± ← argmin_{X ∈ R^{2n×d}} ℓ(L, X) using (7) (Lemma 2)
  Step 2: let µ̃_S ← Σ_j p̂_j (π̂_j b̃_j^+ − (1 − π̂_j) b̃_j^−)
  Step 3: let θ̃_* ← argmin_θ F_φ(S_{|y}, θ, µ̃_S) + λ‖θ‖_2^2 (3)
  Return θ̃_*

Table 1: Correspondence between permissible functions φ and the corresponding loss F_φ.

  loss name       | F_φ(x)              | −φ(x)
  logistic loss   | log(1 + exp(−x))    | −x log x − (1 − x) log(1 − x)
  square loss     | (1 − x)^2           | x(1 − x)
  Matsushita loss | −x + √(1 + x^2)     | √(x(1 − x))

The estimation of the mean operator µ_S appears to be a learning bottleneck in the LLP setting [17]. The fact that the mean operator is sufficient to learn a classifier without the label information motivates the notion of minimal sufficient statistic for features in this context. Let F be a set of loss functions, H be a set of classifiers, I be a subset of features. Some quantity t(S) is said to be a minimal sufficient statistic for I with respect to F and H iff: for any F ∈ F, any h ∈ H and any two samples S and S′, the quantity F(S, h) − F(S′, h) does not depend on I iff t(S) = t(S′). This definition can be motivated from the one in statistics by building losses from log likelihoods. The following Lemma motivates further the mean operator in the LLP setting, as it is the minimal sufficient statistic for a broad set of proper scoring losses that encompass the logistic and square losses [18]. The proper scoring losses we consider, hereafter called "symmetric" (SPSL), are twice differentiable, non-negative and such that the misclassification cost is not label-dependent.

Lemma 1  µ_S is a minimal sufficient statistic for the label variable, with respect to SPSL and H_L.

([19], Subsection 2.1) This property, very useful for LLP, may also be exploited in other weakly supervised tasks [2]. Up to constant scalings that play no role in its minimization, the empirical surrogate risk corresponding to any SPSL, F_φ(S, h), can be written with loss:

    F_\phi(x) \doteq \frac{\phi(0) + \phi^\star(-x)}{\phi(0) - \phi(1/2)} \doteq a_\phi + \frac{\phi^\star(-x)}{b_\phi} ,    (2)

and φ is a permissible function [20, 18], i.e. dom(φ) ⊇ [0, 1], φ is strictly convex, differentiable and symmetric with respect to 1/2. φ^⋆ is the convex conjugate of φ. Table 1 shows examples of F_φ. It follows from Lemma 1 and its proof that any F_φ(S, θ), for θ ≡ h_θ ∈ H_L, can be written as:

    F_\phi(S, \theta) = \frac{1}{2m} \sum_i \sum_{\sigma \in \Sigma_1} F_\phi(\sigma\, \theta^\top x_i) - \frac{\theta^\top \mu_S}{2 b_\phi} \doteq F_\phi(S_{|y}, \theta, \mu_S) ,    (3)

where the inner sum ranges over σ ∈ Σ_1.
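The decomposition in eq. (3) is what makes label-free fitting possible: the double sum uses the observations only, and all the label information enters through µ_S. As a concrete illustration (not the authors' R implementation), the following Python sketch instantiates eq. (3) for the raw logistic loss, for which the split reads F(x) = (1/2)[F(x) + F(−x)] − x/2; the function names, the plain gradient-descent solver and its step size and iteration count are choices made here for illustration only.

```python
import numpy as np

def mean_operator(X, y):
    """Empirical mean operator of eq. (1): mu_S = (1/m) * sum_i y_i * x_i."""
    return (y[:, None] * X).mean(axis=0)

def label_free_logistic_risk(theta, X, mu, lam=0.0):
    """Eq. (3) for the raw logistic loss F(z) = log(1 + exp(-z)):
    (1/2m) * sum_i [F(theta.x_i) + F(-theta.x_i)] - theta.mu / 2 + lam * ||theta||^2.
    The first term depends on the observations only; labels enter via mu."""
    m = X.shape[0]
    z = X @ theta
    both_signs = np.logaddexp(0.0, -z) + np.logaddexp(0.0, z)   # F(z) + F(-z)
    return both_signs.sum() / (2 * m) - theta @ mu / 2 + lam * theta @ theta

def fit_theta(X, mu, lam=1e-3, lr=0.1, iters=2000):
    """Step 3 of LMM: minimize the convex objective above (any convex solver works;
    plain gradient descent is used here for brevity)."""
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        z = X @ theta
        sig = 1.0 / (1.0 + np.exp(-z))
        grad = X.T @ (2.0 * sig - 1.0) / (2 * m) - mu / 2 + 2 * lam * theta
        theta -= lr * grad
    return theta
```

Called with the true mean operator mean_operator(X, y), fit_theta recovers plain (regularized) logistic regression; LMM instead plugs in the estimate µ̃_S produced by Steps 1 and 2, described next.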
The Laplacian Mean Map (LMM) algorithm  The sum in eq. (3) is convex and differentiable in θ. Hence, once we have an accurate estimator of µ_S, we can then easily fit θ to minimize F_φ(S_{|y}, θ, µ_S). This two-step strategy is implemented in LMM in Algorithm 1. µ_S can be retrieved from 2n bag-wise, label-wise unknown averages b_j^σ:

    \mu_S = \frac{1}{2} \sum_{j=1}^{n} \hat{p}_j \sum_{\sigma \in \Sigma_1} (2\hat{\pi}_j + \sigma(1 - \sigma))\, b_j^\sigma ,    (4)

with b_j^σ ≐ E_S[x | σ, j] denoting these 2n unknowns (for j ∈ [n], σ ∈ Σ_1), and let b_j ≐ (1/m_j) Σ_{x_i ∈ S_j} x_i. The 2n b_j^σ's are solutions of a set of n identities that are (in matrix form):

    B - \Pi^\top B^\pm = 0 ,    (5)

where B ≐ [b_1 | b_2 | ... | b_n]^⊤ ∈ R^{n×d}, Π ≐ [DIAG(π̂) | DIAG(1 − π̂)]^⊤ ∈ R^{2n×n} and B^± ∈ R^{2n×d} is the matrix of unknowns:

    B^\pm \doteq [\, b_1^+ \,|\, b_2^+ \,|\, \cdots \,|\, b_n^+ \,|\, b_1^- \,|\, b_2^- \,|\, \cdots \,|\, b_n^- \,]^\top ,    (6)

i.e. B^± stacks (B^+)^⊤, the matrix of positive-class bag averages, on top of (B^−)^⊤, the matrix of negative-class ones. System (5) is underdetermined, unless one makes the homogeneity assumption that yields the Mean Map estimator [17]. Rather than making such a restrictive assumption, we regularize the cost that brings (5) with a manifold regularizer [21], and search for B̃^± = argmin_{X ∈ R^{2n×d}} ℓ(L, X), with:

    \ell(L, X) \doteq \mathrm{tr}\big( (B^\top - X^\top \Pi)\, D_w\, (B - \Pi^\top X) \big) + \gamma\, \mathrm{tr}\big( X^\top L X \big) ,    (7)

and γ > 0. D_w ≐ DIAG(w) is a user-fixed bias matrix with w ∈ R_{+,*}^n (and w ≠ p̂ in general), and:

    L \doteq \varepsilon I + \begin{bmatrix} L_a & 0 \\ 0 & L_a \end{bmatrix} \in R^{2n \times 2n} ,    (8)

where L_a ≐ D − V ∈ R^{n×n} is the Laplacian of the bag similarities. V is a symmetric similarity matrix with nonnegative coordinates, and the diagonal matrix D satisfies d_{jj} = Σ_{j′} v_{jj′}, ∀j ∈ [n]. The size of the Laplacian is O(n²), which is small compared to O(m²) if there are not many bags. One can interpret the Laplacian regularization as smoothing the estimates of the b_j^σ w.r.t. the similarity of the respective bags.

Lemma 2  The solution B̃^± to min_{X ∈ R^{2n×d}} ℓ(L, X) is B̃^± = (ΠD_wΠ^⊤ + γL)^{-1} ΠD_w B.

([19], Subsection 2.2). This Lemma explains the role of the penalty εI in (8): as ΠD_wΠ^⊤ and L have respectively n- and (≥1)-dimensional null spaces, the inversion may not be possible. Even when this does not happen exactly, this may incur numerical instabilities in computing the inverse. For domains where this risk exists, picking a small ε > 0 solves the problem. Let b̃_j^σ denote the row-wise decomposition of B̃^± following (6), from which we compute µ̃_S following (4) when we use these 2n estimates in lieu of the true b_j^σ. We compare µ_j ≐ π̂_j b_j^+ − (1 − π̂_j) b_j^−, ∀j ∈ [n] to our estimates µ̃_j ≐ π̂_j b̃_j^+ − (1 − π̂_j) b̃_j^−, ∀j ∈ [n], granted that µ_S = Σ_j p̂_j µ_j and µ̃_S = Σ_j p̂_j µ̃_j.

Theorem 3  Let M ≐ [µ_1 µ_2 ... µ_n]^⊤ ∈ R^{n×d} and M̃ ≐ [µ̃_1 µ̃_2 ... µ̃_n]^⊤ ∈ R^{n×d}. Suppose that γ satisfies γ√2 ≤ (ε(2n)^{-1} + max_{j≠j′} v_{jj′}) / min_j w_j, and let ς(V, B^±) ≐ (ε(2n)^{-1} + max_{j≠j′} |v_{jj′}|)^2 ‖B^±‖_F. The following holds:

    \|M - \tilde{M}\|_F \le \frac{\sqrt{n}}{\sqrt{2}\,\min_j w_j^2}\; \varsigma(V, B^\pm) .    (9)

([19], Subsection 2.3) The multiplicative factor to ς in (9) is roughly O(n^{5/2}) when there is no large discrepancy in the bias matrix D_w, so the upper bound is driven by ς(., .) when there are not many bags. We have studied its variations when the "distinguishability" between bags increases. This setting is interesting because in this case we may kill two birds in one shot, with the estimation of M and the subsequent learning problem potentially easier, in particular for linear separators. We consider two examples for v_{jj′}, the first being (half) the normalized association [22]:

    v^{nc}_{jj'} \doteq \frac{1}{2}\left( \frac{\mathrm{ASSOC}(S_j, S_j)}{\mathrm{ASSOC}(S_j, S_j \cup S_{j'})} + \frac{\mathrm{ASSOC}(S_{j'}, S_{j'})}{\mathrm{ASSOC}(S_{j'}, S_j \cup S_{j'})} \right) = \mathrm{NASSOC}(S_j, S_{j'}) ,    (10)

    v^{G,s}_{jj'} \doteq \exp(-\|b_j - b_{j'}\|_2 / s) ,\quad s > 0 .    (11)

Here, ASSOC(S_j, S_{j′}) ≐ Σ_{x ∈ S_j, x′ ∈ S_{j′}} ‖x − x′‖_2^2 [22].
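The sketch below (again an illustrative NumPy transcription, not the authors' R code) strings together eqs. (4) to (8) and Lemma 2 for the Gaussian similarity of eq. (11): it builds V and the block Laplacian L, solves Step 1 in closed form, and recombines the 2n estimated averages into µ̃_S. The defaults for w, γ, ε and s are placeholders; the paper fixes s = 1 or learns it by cross-validation.

```python
import numpy as np

def lmm_mean_operator(bags, pi_hat, gamma=1.0, eps=1e-3, s=1.0, w=None):
    """Estimate the mean operator from bags (list of (m_j, d) arrays) and their
    label proportions pi_hat (length n), following LMM Steps 1-2.

    Step 1 solves  B_tilde = (Pi Dw Pi^T + gamma L)^{-1} Pi Dw B   (Lemma 2),
    Step 2 recombines  mu_tilde = sum_j p_j (pi_j b_j^+ - (1 - pi_j) b_j^-)  (eq. (4))."""
    n = len(bags)
    pi_hat = np.asarray(pi_hat, dtype=float)
    m_j = np.array([len(S) for S in bags], dtype=float)
    p_hat = m_j / m_j.sum()
    B = np.stack([S.mean(axis=0) for S in bags])            # n x d bag means b_j
    if w is None:
        w = np.ones(n)                                       # user-fixed bias D_w

    # Gaussian bag similarity of eq. (11) and its Laplacian L_a = D - V
    dist = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2)
    V = np.exp(-dist / s)
    L_a = np.diag(V.sum(axis=1)) - V
    L = eps * np.eye(2 * n) + np.block([[L_a, np.zeros((n, n))],
                                        [np.zeros((n, n)), L_a]])   # eq. (8)

    # Pi = [DIAG(pi_hat) | DIAG(1 - pi_hat)]^T in R^{2n x n}, eq. (5)
    Pi = np.vstack([np.diag(pi_hat), np.diag(1.0 - pi_hat)])
    Dw = np.diag(w)

    # Lemma 2: closed-form minimizer of the regularized cost (7)
    B_tilde = np.linalg.solve(Pi @ Dw @ Pi.T + gamma * L, Pi @ Dw @ B)
    b_plus, b_minus = B_tilde[:n], B_tilde[n:]

    # eq. (4): recombine the 2n estimated label-wise bag averages
    mu_tilde = (p_hat[:, None] * (pi_hat[:, None] * b_plus
                                  - (1.0 - pi_hat)[:, None] * b_minus)).sum(axis=0)
    return mu_tilde
```

The returned µ̃_S is exactly what Step 3 (the fit_theta sketch above) consumes; swapping the Gaussian similarity for the normalized association of eq. (10) would give the v^{nc} variant.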
To put these two similarity measures in the context of Theorem 3, consider the setting where we can make assumption (D1) that there exists a small constant κ > 0 such that ‖b_j − b_{j′}‖_2^2 ≥ κ max_{σ,j} ‖b_j^σ‖_2^2, ∀j, j′ ∈ [n]. This is a weak distinguishability property: if no such κ exists, then the centers of distinct bags may just be confounded. Consider also the additional assumption, (D2), that there exists κ′ > 0 such that max_j d_j^2 ≤ κ′, ∀j ∈ [n], where d_j ≐ max_{x,x′ ∈ S_j} ‖x − x′‖_2 is a bag's diameter. In the following Lemma, the little-o notation is with respect to the "largest" unknown in eq. (4), i.e. max_{σ,j} ‖b_j^σ‖_2.

Lemma 4  There exists ε_* > 0 such that ∀ε ≤ ε_*, the following holds: (i) ς(V^{nc}, B^±) = o(1) under assumptions (D1 + D2); (ii) ς(V^{G,s}, B^±) = o(1) under assumption (D1), ∀s > 0.

([19], Subsection 2.4) Hence, provided a weak (D1) or stronger (D1 + D2) distinguishability assumption holds, the divergence between M and M̃ gets smaller with the increase of the norm of the unknowns b_j^σ. The proof of the Lemma suggests that the convergence may be faster for V^{G,s}. The following Lemma shows that both similarities also partially encode the hardness of solving the classification problem with linear separators, so that the manifold regularizer "limits" the distortion of the b̃^±'s between two bags that tend not to be linearly separable.

Lemma 5  Take v_{jj′} ∈ {v_{jj′}^{G,.}, v_{jj′}^{nc}}. There exist 0 < κ_l < κ_n < 1 such that (i) if v_{jj′} > κ_n then S_j, S_{j′} are not linearly separable, and (ii) if v_{jj′} < κ_l then S_j, S_{j′} are linearly separable.

([19], Subsection 2.5) This Lemma is an advocacy to fit s in a data-dependent way in v_{jj′}^{G,s}. The question may be raised as to whether finite sample approximation results like Theorem 3 can be proven for the Mean Map estimator [17]. [19], Subsection 2.6 answers by the negative.

In the Laplacian Mean Map algorithm (LMM, Algorithm 1), Steps 1 and 2 have now been described. Step 3 is a differentiable convex minimization problem for θ that does not use the labels, so it does not present any technical difficulty. An interesting question is how much our classifier θ̃_* in Step 3 diverges from the one that would be computed with the true expression for µ_S, θ_*. It is not hard to show that Lemma 17 in Altun and Smola [23], and Corollary 9 in Quadrianto et al. [17], hold for LMM so that ‖θ̃_* − θ_*‖_2^2 ≤ (2λ)^{-1} ‖µ̃_S − µ_S‖_2^2. The following Theorem shows a data-dependent approximation bound that can be significantly better, when it holds that θ_*^⊤ x_i, θ̃_*^⊤ x_i ∈ φ′([0, 1]), ∀i (φ′ is the first derivative). We call this setting proper scoring compliance (PSC) [18]. PSC always holds for the logistic and Matsushita losses, for which φ′([0, 1]) = R. For other losses like the square loss, for which φ′([0, 1]) = [−1, 1], shrinking the observations in a ball of sufficiently small radius is sufficient to ensure this.

Theorem 6  Let f_k ∈ R^m denote the vector encoding the kth feature variable in S: f_{ki} = x_{ik} (k ∈ [d]). Let F̃ denote the feature matrix with column-wise normalized feature vectors: f̃_k ≐ (d / Σ_{k′} ‖f_{k′}‖_2^2)^{(d−1)/(2d)} f_k. Under PSC, we have ‖θ̃_* − θ_*‖_2^2 ≤ (2λ + q)^{-1} ‖µ̃_S − µ_S‖_2^2, with:

    q \doteq \frac{\det \tilde{F}^\top \tilde{F}}{m} \times \frac{2 e^{-1}}{b_\phi\, \phi''(\phi'^{-1}(q_0/\lambda))} \;(> 0) ,    (12)

for some q_0 ∈ I ≐ [±(x_* + max{‖µ_S‖_2, ‖µ̃_S‖_2})]. Here, x_* ≐ max_i ‖x_i‖_2 and φ″ ≐ (φ′)′.

([19], Subsection 2.7) To see how large q can be, consider the simple case where all eigenvalues of F̃^⊤F̃, λ_k(F̃^⊤F̃) ∈ [λ° ± δ] for small δ. In this case, q is proportional to the average feature "norm":

    \frac{\det \tilde{F}^\top \tilde{F}}{m} = \frac{\mathrm{tr}(F^\top F)}{md} + o(\delta) = \frac{\sum_i \|x_i\|_2^2}{md} + o(\delta) .
The Alternating Mean Map (AMM) algorithm  Let us denote Σ_π̂ ≐ {σ ∈ Σ_m : Σ_{i : x_i ∈ S_j} σ_i = (2π̂_j − 1) m_j, ∀j ∈ [n]} the set of labelings that are consistent with the observed proportions π̂, and µ_S(σ) ≐ (1/m) Σ_i σ_i x_i the biased mean operator computed from some σ ∈ Σ_π̂. Notice that the true mean operator µ_S = µ_S(σ) for at least one σ ∈ Σ_π̂. The Alternating Mean Map algorithm (AMM, Algorithm 2) starts with the output of LMM and then optimizes it further over the set of consistent labelings.

Algorithm 2  Alternating Mean Map (AMM_OPT)
  Input: LMM parameters + optimization strategy OPT ∈ {min, max} + convergence predicate PR
  Step 1: let θ̃_0 ← LMM(LMM parameters) and t ← 0
  Step 2: repeat
    Step 2.1: let σ_t ← arg OPT_{σ ∈ Σ_π̂} F_φ(S_{|y}, θ̃_t, µ_S(σ))
    Step 2.2: let θ̃_{t+1} ← argmin_θ F_φ(S_{|y}, θ, µ_S(σ_t)) + λ‖θ‖_2^2
    Step 2.3: let t ← t + 1
  until predicate PR is true
  Return θ̃_* ≐ argmin_t F_φ(S_{|y}, θ̃_{t+1}, µ_S(σ_t))

At each iteration, it first picks a consistent labeling in Σ_π̂ that is the best (OPT = min) or the worst (OPT = max) for the current classifier (Step 2.1), and then fits a classifier θ̃ on the given set of labels (Step 2.2). The algorithm then iterates until a convergence predicate is met, which tests whether the difference between two values of F_φ(., ., .) is too small (AMM_min), or whether the number of iterations exceeds a user-specified limit (AMM_max). The classifier returned, θ̃_*, is the best in the sequence. In the case of AMM_min, it is the last of the sequence, as the risk F_φ(S_{|y}, ., .) cannot increase. Again, Step 2.2 is a convex minimization with no technical difficulty. Step 2.1 is combinatorial. It can be solved in time almost linear in m [19] (Subsection 2.8).

Lemma 7  The running time of Step 2.1 in AMM is Õ(m), where the tilde notation hides log-terms.
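One way to see why Step 2.1 is cheap: the double sum in eq. (3) does not depend on σ, so the best (resp. worst) consistent labeling simply maximizes (resp. minimizes) θ^⊤µ_S(σ) under the per-bag constraints, which is done bag by bag by giving the +1 labels to the examples with the largest (resp. smallest) scores θ^⊤x. The Python sketch below is an illustration consistent with Lemma 7's near-linear running time, not the authors' implementation; rounding π̂_j m_j to an integer count is a convenience assumed here.

```python
import numpy as np

def consistent_labeling(theta, bags, pi_hat, opt="min"):
    """Step 2.1 of AMM: choose sigma in Sigma_pi_hat that is best (opt='min')
    or worst (opt='max') for the current classifier theta.
    Within each bag, exactly round(pi_j * m_j) examples get the label +1:
    the highest-scoring ones for opt='min', the lowest-scoring ones for opt='max'."""
    sigma = []
    for S_j, pi_j in zip(bags, pi_hat):
        scores = S_j @ theta
        k = int(round(pi_j * len(S_j)))                      # number of +1 labels in this bag
        order = np.argsort(-scores) if opt == "min" else np.argsort(scores)
        s_j = -np.ones(len(S_j))
        s_j[order[:k]] = 1.0
        sigma.append(s_j)
    return sigma

def biased_mean_operator(bags, sigma):
    """mu_S(sigma) = (1/m) * sum_i sigma_i x_i for a consistent labeling sigma."""
    m = sum(len(S) for S in bags)
    return sum((s[:, None] * S).sum(axis=0) for S, s in zip(bags, sigma)) / m
```

An AMM iteration then alternates consistent_labeling (Step 2.1) with a refit of θ on µ_S(σ_t) (Step 2.2, e.g. via the fit_theta sketch), stopping when the risk stops decreasing (AMM_min) or after a fixed number of iterations (AMM_max).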
Bag-Rademacher generalization bounds for LLP  We relate the "min" and "max" strategies of AMM by uniform convergence bounds involving the true surrogate risk, i.e. integrating the unknown distribution D and the true labels (which we may never know). Previous uniform convergence bounds for LLP focus on coarser grained problems, like the estimation of label proportions [1]. We rely on an LLP generalization of Rademacher complexity [24, 25]. Let F : R → R_+ be a loss function and H a set of classifiers. The bag empirical Rademacher complexity of sample S, R_m^b, is defined as R_m^b ≐ E_{σ ∼ Σ_m} sup_{h ∈ H} E_{σ′ ∼ Σ_π̂} E_S[σ(x) F(σ′(x) h(x))]. The usual empirical Rademacher complexity equals R_m^b for card(Σ_π̂) = 1. The Label Proportion Complexity of H is:

    L_{2m} \doteq E_{D_{2m}} E_{I^{/2}_1, I^{/2}_2} \sup_{h \in H} E_S[\sigma_1(x)\, (\hat{\pi}^s_{|2}(x) - \hat{\pi}^\ell_{|1}(x))\, h(x)] .    (13)

Here, each of I_l^{/2}, l = 1, 2 is a random (uniform) subset of [2m] of cardinal m. Let S(I_l^{/2}) be the size-m subset of S that corresponds to the indexes. Take l = 1, 2 and any x_i ∈ S. If i ∉ I_l^{/2} then π̂_{|l}^s(x_i) = π̂_{|l}^ℓ(x_i) is x_i's bag's label proportion measured on S \ S(I_l^{/2}). Else, π̂_{|2}^s(x_i) is its bag's label proportion measured on S(I_2^{/2}) and π̂_{|1}^ℓ(x_i) is its label (i.e. a bag's label proportion that would contain only x_i). Finally, σ_1(x) ≐ 2·1_{x ∈ S(I_1^{/2})} − 1 ∈ Σ_1. L_{2m} tends to be all the smaller as classifiers in H have small magnitude on bags whose label proportion is close to 1/2.

Theorem 8  Suppose ∃h_* ≥ 0 s.t. |h(x)| ≤ h_*, ∀x, ∀h. Then, for any loss F_φ, any training sample of size m and any 0 < δ ≤ 1, with probability > 1 − δ, the following bound holds over all h ∈ H:

    E_D[F_\phi(y\, h(x))] \le E_{\Sigma_{\hat{\pi}}} E_S[F_\phi(\sigma(x) h(x))] + 2 R_m^b + L_{2m} + 4\left(\frac{2 h_*}{b_\phi} + 1\right) \sqrt{\frac{1}{2m} \log\frac{2}{\delta}} .    (14)

Furthermore, under PSC (Theorem 6), we have for any F_φ:

    R_m^b \le 2 b_\phi\, E_{\Sigma_m} \sup_{h \in H} \{ E_S[\sigma(x)\, (\hat{\pi}(x) - 1/2)\, h(x)] \} .    (15)

([19], Subsection 2.9) Despite their similar shapes, (13) and (15) behave differently: when bags are pure (π̂_j ∈ {0, 1}, ∀j), L_{2m} = 0. When bags are impure (π̂_j = 1/2, ∀j), R_m^b = 0. As bags get impure, the bag-empirical surrogate risk, E_{Σ_π̂} E_S[F_φ(σ(x) h(x))], also tends to increase. AMM_min and AMM_max respectively minimize a lower bound and an upper bound of this risk.

3  Experiments

Algorithms  We compare LMM and AMM (F_φ = logistic loss) to the original MM [17], InvCal [11], conv-∝SVM and alter-∝SVM [16] (linear kernels). To make experiments extensive, we test several initializations for AMM that are not displayed in Algorithm 2 (Step 1): (i) the edge mean map estimator, µ̃_S^{EMM} ≐ (1/m²)(Σ_i y_i)(Σ_i x_i) (AMM_EMM), (ii) the constant estimator µ̃_S^1 ≐ 1 (AMM_1), and finally AMM_{10ran}, which runs 10 random initial models (‖θ_0‖_2 ≤ 1) and selects the one with smallest risk; this is the same procedure as alter-∝SVM. The matrix V (eqs. (10), (11)) used is indicated in subscript: LMM/AMM_G, LMM/AMM_{G,s} and LMM/AMM_{nc} respectively denote v^{G,s} with s = 1, v^{G,s} with s learned by cross validation (CV; validation ranges indicated in [19]) and v^{nc}. For space reasons, results not displayed in the paper can be found in [19], Section 3 (including runtime comparisons and detailed results by domain). We split the algorithms in two groups, one-shot and iterative. The latter, including AMM and (conv/alter)-∝SVM, iteratively optimize a cost over labelings (always consistent with label proportions for AMM, not always for (conv/alter)-∝SVM). The former (LMM, InvCal) do not and are thus much faster. Tests are done on a 4-core 3.2GHz CPU Mac with 32GB of RAM. AMM/LMM/MM are implemented in R. Code for InvCal and ∝SVM is from [16].

[Figure 1 here: four panels (a)-(d); x-axes: divergence, entropy, entropy, #bags/#instances.]
Figure 1: Relative AUC (wrt MM) as the homogeneity assumption is violated (a). Relative AUC (wrt the Oracle) vs entropy on heart for LMM (b), AMM_G^min (c). AUC vs n/m for AMM_G^min and the Oracle (d).

Table 2: Small domains results. #win/#lose for row vs column. Bold faces mean p-val < .001 for Wilcoxon signed-rank tests. Top-left subtable is for one-shot methods, bottom-right iterative ones, bottom-left compares the two. Italic is state of the art. Grey cells highlight the best of all (AMM_G^min).

row vs column  | MM    | LMM_G | LMM_G,s | LMM_nc | InvCal | AMMmin_MM | AMMmin_G | AMMmin_G,s | AMMmin_10ran | AMMmax_MM | AMMmax_G | AMMmax_G,s | AMMmax_10ran | conv-∝SVM
LMM_G          | 36/4  |
LMM_G,s        | 38/3  | 30/6  |
LMM_nc         | 28/12 | 3/37  | 2/37    |
InvCal         | 4/46  | 3/47  | 4/46    | 4/46   |
AMMmin_MM      | 33/16 | 35/14 | 25/25   | 37/18  | 46/4   |
AMMmin_G       | 38/11 | 26/24 | 30/20   | 32/13  | 47/3   | 31/7      |
AMMmin_G,s     | 35/14 | 33/17 | 30/20   | 35/15  | 47/3   | 24/11     | 7/15     |
AMMmin_10ran   | 27/22 | 24/26 | 22/28   | 26/24  | 44/6   | 20/30     | 16/34    | 19/31      |
AMMmax_MM      | 25/25 | 23/27 | 22/28   | 25/25  | 45/5   | 15/35     | 13/37    | 13/37      | 10/40        |
AMMmax_G       | 27/23 | 22/28 | 21/28   | 26/24  | 45/5   | 17/33     | 14/36    | 14/36      | 8/42         | 13/14     |
AMMmax_G,s     | 25/25 | 21/29 | 22/28   | 24/26  | 45/5   | 15/35     | 13/37    | 13/37      | 12/38        | 15/22     | 16/22    |
AMMmax_10ran   | 23/27 | 21/29 | 19/31   | 24/26  | 50/0   | 19/31     | 15/35    | 17/33      | 7/43         | 19/30     | 20/29    | 17/32      |
conv-∝SVM      | 21/29 | 2/48  | 2/48    | 2/48   | 2/48   | 4/46      | 3/47     | 3/47       | 4/46         | 3/47      | 3/47     | 4/46       | 0/50         |
alter-∝SVM     | 0/50  | 0/50  | 0/50    | 0/50   | 20/30  | 0/50      | 0/50     | 0/50       | 3/47         | 3/47      | 2/48     | 1/49       | 0/50         | 27/23

(e.g. AMM_{G,s}^min wins on AMM_G^min 7 times, loses 15, with 28 ties)

Simulated domains, MM and the homogeneity assumption  The testing metric is the AUC. Prior to testing on our domains, we generate 16 domains that gradually move the b_j^σ away from each other (w.r.t. j), thus violating increasingly the homogeneity assumption [17]. The degree of violation is measured as ‖B̄^± − B^±‖_F, where B̄^± is the homogeneity assumption matrix, which replaces all b_j^σ by b^σ for σ ∈ {−1, 1}, see eq. (5). Figure 1 (a) displays the ratios of the AUC of LMM to the AUC of MM. It shows that LMM is all the better with respect to MM as the homogeneity assumption is violated. Furthermore, learning s in LMM improves the results. Experiments on the simulated domain of [16], on which MM obtains zero accuracy, also display that our algorithms perform better (1 iteration only of AMM_max brings 100% AUC).

Small and large domains experiments  We convert 10 small domains [19] (m ≤ 1000) and 4 bigger ones (m > 8000) from UCI [26] into the LLP framework. We cast to one-against-all classification when the problem is multiclass. On large domains, the bag assignment function is inspired by [1]: we craft bags according to a selected feature value, and then we remove that feature from the data. This conforms to the idea that bag assignment is structured and non random in real-world problems. Most of our small domains, however, do not have a lot of features, so instead of clustering on one feature and then discarding it, we run K-MEANS on the whole data to make the bags, for K = n ∈ 2^{[5]}.
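As an illustration of the large-domain protocol just described (craft one bag per value of a chosen feature, record the label proportions, then drop the feature), here is a minimal sketch; the function name, the assumption of a categorical column and of labels in {−1, +1} are conveniences of this example, not details from the paper.

```python
import numpy as np

def craft_bags(X, y, bag_feature):
    """Build LLP bags from one categorical feature column:
    one bag per distinct value, then the feature is removed from the data."""
    values = np.unique(X[:, bag_feature])
    keep = [k for k in range(X.shape[1]) if k != bag_feature]
    bags, pi_hat = [], []
    for v in values:
        idx = X[:, bag_feature] == v
        bags.append(X[idx][:, keep])               # bag observations, feature dropped
        pi_hat.append((y[idx] == 1).mean())        # label proportion of the bag
    return bags, np.array(pi_hat)
```

For the small domains the paper replaces this by K-MEANS clustering on the whole data, so any clustering routine returning an assignment of the examples to K = n groups plays the role of the bag assignment function there.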
Small domains results  We perform 5-fold nested CV comparisons on the 10 domains = 50 AUC values for each algorithm. Table 2 synthesises the results [19], splitting one-shot and iterative algorithms. LMM_{G,s} outperforms all one-shot algorithms. LMM_G and LMM_{G,s} are competitive with many iterative algorithms, but lose against their AMM counterpart, which proves that additional optimization over labels is beneficial. AMM_G and AMM_{G,s} are confirmed as the best variants of AMM, the first being the best in this case. Surprisingly, all mean map algorithms, even one-shots, are clearly superior to ∝SVMs. Further results [19] reveal that ∝SVM performances are dampened by learning classifiers with the "inverted polarity", i.e. flipping the sign of the classifier improves its performances. Figure 1 (b, c) presents the AUC relative to the Oracle (which learns the classifier knowing all labels and minimizing the logistic loss), as a function of the Gini entropy of bag assignment, gini(S) ≐ 4E_j[π̂_j(1 − π̂_j)]. For an entropy close to 1, we were expecting a drop in performances. The unexpected [19] is that on some domains, large entropies (≥ .8) do not prevent AMM^min from competing with the Oracle. No such pattern clearly emerges for ∝SVM and AMM^max [19].

Table 3: AUCs on big domains (name: #instances × #features). I = cap-shape, II = habitat, III = cap-colour, IV = race, V = education, VI = country, VII = poutcome, VIII = job (in parentheses: number of bags); for each feature, the best result over one-shot, and over iterative algorithms, is boldfaced.

algorithm    | mushroom: 8124×108     | adult: 48842×89        | marketing: 45211×41    | census: 299285×381
             | I(6)   II(7)  III(10)  | IV(5)  V(16)  VI(42)   | V(4)   VII(4) VIII(12) | IV(5)  VIII(9) VI(42)
EMM          | 55.61  59.80  76.68    | 43.91  47.50  66.61    | 63.49  54.50  44.31    | 56.05  56.25  57.87
MM           | 51.99  98.79   5.02    | 80.93  76.65  74.01    | 54.64  50.71  49.70    | 75.21  90.37  75.52
LMM_G        | 73.92  98.57  14.70    | 81.79  78.40  78.78    | 54.66  51.00  51.93    | 75.80  71.75  76.31
LMM_G,s      | 94.91  98.24  89.43    | 84.89  78.94  80.12    | 49.27  51.00  65.81    | 84.88  60.71  69.74
AMMmin_EMM   | 85.12  99.45  69.43    | 49.97  77.39  80.67    | 61.39  75.27  43.10    | 87.86  87.71  40.80
AMMmin_MM    | 89.81  99.01  15.74    | 83.73  56.98  70.19    | 52.85  55.73  58.19    | 89.68  84.91  68.36
AMMmin_G     | 89.18  99.45  50.44    | 83.41  82.55  81.96    | 51.61  75.16  57.52    | 87.61  88.28  76.99
AMMmin_G,s   | 89.24  99.57   3.28    | 81.18  78.53  81.96    | 52.03  75.16  53.98    | 89.93  83.54  52.13
AMMmin_1     | 95.90  98.49  97.31    | 81.32  75.80  80.05    | 65.13  64.96  66.62    | 89.09  88.94  56.72
AMMmax_EMM   | 93.04    –    26.67    | 54.46  69.63  56.62    | 51.48  55.63  57.48    | 71.20  77.14  66.71
AMMmax_MM    | 59.45    –    99.70    | 82.57  71.63  81.39    | 48.46  51.34  56.90    | 50.75  66.76  58.67
AMMmax_G     | 95.50  65.32  99.30    | 82.75  72.16  81.39    | 50.58  47.27  34.29    | 48.32  67.54  52.70
AMMmax_G,s   | 95.84  65.32  84.26    | 82.69  70.95  81.39    | 66.88  47.27  34.29    | 80.33  74.45  77.46
AMMmax_1     | 95.01  73.48   1.29    | 75.22  67.52  77.67    | 66.70  61.16  71.94    | 57.97  81.07  53.42
Oracle       | 99.82  99.81  99.8     | 90.55  90.55  90.50    | 79.52  75.55  79.43    | 94.31  94.37  94.45

Big domains results  We adopt a 1/5 hold-out method. Scalability results [19] display that every method using v^{nc}, as well as ∝SVM, is not scalable to big domains; in particular, the estimated time for a single run of alter-∝SVM is > 100 hours on the adult domain. Table 3 presents the results on the big domains, distinguishing the feature used for bag assignment. Big domains confirm the efficiency of LMM + AMM. No approach clearly outperforms the rest, although LMM_{G,s} is often the best one-shot.

Synthesis  Figure 1 (d) gives the AUCs of AMM_G^min over the Oracle for all domains [19], as a function of the "degree of supervision", n/m (= 1 if the problem is fully supervised). Noticeably, on 90% of the runs, AMM_G^min gets an AUC representing at least 70% of the Oracle's. Results on big domains can be remarkable: on the census domain with bag assignment on race, 5 proportions are sufficient for an AUC 5 points below the Oracle's, which learns with 200K labels.
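Both quantities used in this synthesis depend only on the bag statistics. The sketch below evaluates them; note that gini(S) ≐ 4E_j[π̂_j(1 − π̂_j)] leaves the expectation over bags unspecified, and the bag-proportion-weighted average used here is one reading of it (a uniform average over bags is the other natural choice).

```python
import numpy as np

def bag_assignment_stats(bags, pi_hat):
    """Gini entropy of the bag assignment, gini(S) = 4 E_j[pi_j (1 - pi_j)],
    and degree of supervision n/m (= 1 when every bag holds a single example).
    E_j is taken as the bag-proportion-weighted average (an assumption of this sketch)."""
    pi_hat = np.asarray(pi_hat, dtype=float)
    m_j = np.array([len(S) for S in bags], dtype=float)
    p_hat = m_j / m_j.sum()
    gini = 4.0 * float(np.sum(p_hat * pi_hat * (1.0 - pi_hat)))
    degree_of_supervision = len(bags) / m_j.sum()
    return gini, degree_of_supervision
```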
4  Conclusion

In this paper, we have shown that efficient learning in the LLP setting is possible, for general loss functions, via the mean operator and without resorting to the homogeneity assumption. Through its estimation, the sufficiency allows one to resort to standard learning procedures for binary classification, practically implementing a reduction between machine learning problems [27]; hence the mean operator estimation may be a viable shortcut to tackle other weakly supervised settings [2][3][4][5]. Approximation results and generalization bounds are provided. Experiments display results that are superior to the state of the art, with algorithms that scale to big domains at affordable computational costs. Performances sometimes compete with the Oracle's (which learns knowing all labels), even on big domains. Such experimental findings pose severe implications on the reliability of privacy-preserving aggregation techniques with simple group statistics like proportions.

Acknowledgments

NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. G. Patrini acknowledges that part of the research was conducted at the Commonwealth Bank of Australia. We thank A. Menon, D. García-García and N. de Freitas for invaluable feedback, and F. Yu for help with the code.

References

[1] F. X. Yu, S. Kumar, T. Jebara, and S. F. Chang. On learning with label proportions. CoRR, abs/1402.5902, 2014.
[2] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31–71, 1997.
[3] G. S. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning of conditional random fields. In 46th ACL, 2008.
[4] J. Graça, K. Ganchev, and B. Taskar. Expectation maximization and posterior constraints. In NIPS*20, pages 569–576, 2007.
[5] P. Liang, M. I. Jordan, and D. Klein. Learning from measurements in exponential families. In 26th ICML, pages 641–648, 2009.
[6] D. J. Musicant, J. M. Christensen, and J. F. Olson. Supervised learning by training on aggregate outputs. In 7th ICDM, pages 252–261, 2007.
[7] J. Hernández-González, I. Inza, and J. A. Lozano. Learning bayesian network classifiers from label proportions. Pattern Recognition, 46(12):3425–3440, 2013.
[8] M. Stolpe and K. Morik. Learning from label proportions by optimizing cluster model selection. In 15th ECML PKDD, pages 349–364, 2011.
[9] B. C. Chen, L. Chen, R. Ramakrishnan, and D. R. Musicant. Learning from aggregate views. In 22nd ICDE, pages 3–3, 2006.
[10] J. Wojtusiak, K. Irvin, A. Birerdinc, and A. V. Baranova. Using published medical results and non-homogenous data in rule learning. In 10th ICMLA, pages 84–89, 2011.
[11] S. Rüping. SVM classifier estimation from group probabilities. In 27th ICML, pages 911–918, 2010.
[12] H. Kueck and N. de Freitas. Learning about individuals from group statistics. In 21st UAI, pages 332–339, 2005.
[13] S. Chen, B. Liu, M. Qian, and C. Zhang. Kernel k-means based framework for aggregate outputs classification. In 9th ICDMW, pages 356–361, 2009.
[14] K. T. Lai, F. X. Yu, M. S. Chen, and S. F. Chang. Video event detection by inferring temporal instance labels. In 11th CVPR, 2014.
[15] K. Fan, H. Zhang, S. Yan, L. Wang, W. Zhang, and J. Feng. Learning a generative classifier from label proportions. Neurocomputing, 139:47–55, 2014.
[16] F. X. Yu, D. Liu, S. Kumar, T. Jebara, and S. F. Chang. ∝SVM for Learning with Label Proportions. In 30th ICML, pages 504–512, 2013.
[17] N. Quadrianto, A. J. Smola, T. S. Caetano, and Q. V. Le. Estimating labels from label proportions. JMLR, 10:2349–2374, 2009.
[18] R. Nock and F. Nielsen. Bregman divergences and surrogates for learning. IEEE Trans. PAMI, 31:2048–2059, 2009.
[19] G. Patrini, R. Nock, P. Rivera, and T. S. Caetano. (Almost) no label no cry - supplementary material. In NIPS*27, 2014.
[20] M. J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In 28th ACM STOC, pages 459–468, 1996.
[21] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399–2434, 2006.
[22] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22:888–905, 2000.
[23] Y. Altun and A. J. Smola. Unifying divergence minimization and statistical inference via convex duality. In 19th COLT, pages 139–153, 2006.
[24] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, 2002.
[25] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. of Stat., 30:1–50, 2002.
[26] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[27] A. Beygelzimer, V. Dani, T. Hayes, J. Langford, and B. Zadrozny. Error limiting reductions between classification tasks. In 22nd ICML, pages 49–56, 2005.

(Almost) No Label No Cry - Supplementary Material

Giorgio Patrini^{1,2}, Richard Nock^{1,2}, Paul Rivera^{1,2}, Tiberio Caetano^{1,3,4}
Australian National University^1, NICTA^2, University of New South Wales^3, Ambiata^4
Sydney, NSW, Australia
{name.surname}@anu.edu.au

Table of contents

Supplementary material on proofs
  Proof of Lemma 1
  Proof of Lemma 2
  Proof of Theorem 3
  Proof of Lemma 4
  Proof of Lemma 5
  Mean Map estimator's Lemma and Proof
  Proof of Theorem 6
  Proof of Lemma 7
  Proof of Theorem 8
Supplementary material on experiments
  Full Experimental Setup
  Simulated Domain for Violation of Homogeneity Assumption
  Simulated Domain from [1]
  Additional Tests on alter-∝SVM [1]
  Scalability
  Full Results on Small Domains
