PRIVILEGED MULTI-LABEL LEARNING

Shan You¹, Chang Xu², Yunhe Wang¹, Chao Xu¹ & Dacheng Tao³
¹Key Lab. of Machine Perception (MOE), Cooperative Medianet Innovation Center, School of EECS, Peking University, P.R. China
²School of Software, University of Technology Sydney
³School of Information Technologies, University of Sydney

arXiv:1701.07194v1 [stat.ML] 25 Jan 2017

ABSTRACT

This paper presents privileged multi-label learning (PrML) to explore and exploit the relationship between labels in multi-label learning problems. We suggest that each individual label can not only be implicitly connected with other labels via the low-rank constraint over label predictors, but its performance on examples can also receive explicit comments from the other labels, which together act as an Oracle teacher. We generate a privileged label feature for each example and its individual label, and then integrate it into the framework of low-rank based multi-label learning. The proposed algorithm can therefore comprehensively explore and exploit label relationships by inheriting all the merits of privileged information and low-rank constraints. We show that PrML can be efficiently solved by a dual coordinate descent algorithm using an iterative optimization strategy with cheap updates. Experiments on benchmark datasets show that through privileged label features the performance can be significantly improved, and that PrML is superior to several competing methods in most cases.

1 INTRODUCTION

Different from single-label classification, multi-label learning (MLL) allows each example to own multiple and non-exclusive labels. For instance, when posting a photo taken at the Rio Olympics on Instagram, Twitter or Facebook, we may simultaneously include hashtags such as #RioOlympics, #athletes, #medals and #flags. Or a related news article can be simultaneously annotated as "Sports", "Politics" and "Brazil". Multi-label learning aims to accurately allocate a group of labels to unseen examples with the knowledge harvested from the training data, and it has been widely used in many applications, such as document categorization Yang et al. (2009); Li et al. (2015), image/video classification/annotation Yang et al. (2016); Wang et al. (2016); Bappy et al. (2016), gene function classification Cesa-Bianchi et al. (2012) and image retrieval Ranjan et al. (2015).

The most straightforward approach is 1-vs-all or Binary Relevance (BR) Tsoumakas et al. (2010), which decomposes multi-label learning into a set of independent binary classification tasks. However, because it neglects label relationships, only passable performance can be achieved. A number of methods have thus been developed to further improve the performance by taking label relationships into consideration, such as label ranking Fürnkranz et al. (2008), chains of binary classification Read et al. (2011), ensembles of multi-class classification Tsoumakas et al. (2011) and label-specific features Zhang & Wu (2015). Recently, embedding-based methods have emerged as a mainstream solution to the multi-label learning problem. These approaches assume that the label matrix is low-rank, and adopt different manipulations to embed the original label vectors, such as compressed sensing Hsu et al. (2009), principal component analysis Tai & Lin (2012), canonical correlation analysis Zhang & Schneider (2011), landmark selection Balasubramanian & Lebanon (2012) and manifold deduction Bhatia et al. (2015); Hou et al. (2016).

Most low-rank based multi-label learning algorithms exploit label relationships in the hypothesis space. The hypotheses of different labels interact with each other under the low-rank constraint, which is an implicit use of label relationships.
By contrast, multiple labels can help each other in a more explicit way, where the hypothesis of a label is not only evaluated by the label itself, but can also be assessed by the other labels. More specifically in multi-label learning, for the label hypothesis at hand, the other labels can together act as an Oracle teacher to provide comments on its performance, which is then beneficial for updating the learner. Multiple labels of examples can only be accessed in the training stage instead of the testing stage, and hence Oracle teachers only exist in the training stage. This privileged setting has been studied in the LUPI (learning using privileged information) paradigm Vapnik et al. (2009); Vapnik & Vashist (2009); Vapnik & Izmailov (2015), and it has been reported that appropriate privileged information can boost the performance in ranking Sharmanska et al. (2013), metric learning Fouad et al. (2013), classification Pechyony & Vapnik (2010) and visual recognition Motiian et al. (2016).

In this paper, we bridge connections between labels through privileged label information and then formulate an effective privileged multi-label learning (PrML) method. For each label, each example's privileged label feature can be generated from the other labels. It is then able to provide additional guidance on the learning of this label, given the underlying connections between labels. By integrating the privileged information into low-rank based multi-label learning, each label predictor learned from the resulting model not only interacts with the other labels via their predictors, but also receives explicit comments from these labels. An iterative optimization strategy is employed to solve PrML, and we theoretically show that each subproblem can be solved by a dual coordinate descent algorithm with a guarantee of the solution's uniqueness. Experimental results demonstrate the significance of exploiting the privileged label features and the effectiveness of the proposed algorithm.

2 PROBLEM FORMULATION

In this section we elaborate the intrinsic privileged information in multi-label learning and formulate the corresponding privileged multi-label learning (PrML) as well.

We first introduce the multi-label learning (MLL) problem and its frequent notations. Given $n$ training points, we denote the whole dataset as $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$ is the input feature vector and $y_i \in \mathcal{Y} \subseteq \{-1,1\}^L$ is the corresponding label vector with label size $L$. Let $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d\times n}$ be the data matrix and $Y = [y_1, y_2, \ldots, y_n] \in \{-1,1\}^{L\times n}$ be the label matrix. Specifically, $Y_{ij} = 1$ if and only if the $i$-th label is assigned to the example $x_j$, and $Y_{ij} = -1$ otherwise. Given the dataset $\mathcal{D}$, multi-label learning is formulated as learning a mapping function $f : \mathbb{R}^d \to \{-1,1\}^L$ that can accurately predict labels for unseen test points.

2.1 LOW-RANK MULTI-LABEL EMBEDDING

A straightforward manner to parameterize the decision function is using linear classifiers, i.e. $f(x) = Z^T x = [z_1, \ldots, z_L]^T x$ where $Z \in \mathbb{R}^{d\times L}$. Note that the linear form actually incorporates the bias term by augmenting an additional 1 to the feature vector $x$. The Binary Relevance (BR) method Tsoumakas et al. (2010) decomposes multi-label learning into a set of single-label learning problems. The binary classifier for each label can be obtained by the widely-used SVM method:

$$
\begin{aligned}
\min_{z_i=[z_i^*;\,b_i],\ \xi_i}\quad & \frac{1}{2}\|z_i^*\|_2^2 + C\sum_{j=1}^{n}\xi_{ij} \\
\text{s.t.}\quad & Y_{ij}\big(\langle z_i^*, x_j\rangle + b_i\big) \ge 1-\xi_{ij},\\
& \xi_{ij}\ge 0,\ \forall j=1,\ldots,n,
\end{aligned}
\tag{1}
$$

where $\xi_i = [\xi_{i1},\ldots,\xi_{in}]^T$ is the slack variable and $\langle\cdot,\cdot\rangle$ is the inner product between two vectors or matrices. Predictors $\{z_1,\ldots,z_L\}$ of different labels are thus solved independently, without considering relationships between labels, which limits the classification performance of the BR method.
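As a concrete reference point, a minimal Binary Relevance baseline in the spirit of Problem (1) could be sketched as follows. This is an illustrative sketch using scikit-learn, not the paper's code; note that it uses the row-per-example layout, i.e. the transpose of the paper's $X$ and $Y$:

```python
import numpy as np
from sklearn.svm import LinearSVC

def binary_relevance_fit(X, Y, C=1.0):
    """Train one independent linear SVM per label (Problem (1)).

    X: (n, d) feature matrix; Y: (n, L) label matrix in {-1, +1}.
    Returns a list of L fitted classifiers. Label relationships are
    ignored entirely, which is the limitation the paper addresses.
    """
    L = Y.shape[1]
    return [LinearSVC(C=C, loss="hinge", dual=True).fit(X, Y[:, i])
            for i in range(L)]

def binary_relevance_predict(models, X):
    # Stack per-label sign predictions into an (n, L) matrix in {-1, +1}.
    return np.stack([m.predict(X) for m in models], axis=1)
```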
Some labels can be closely connected and often occur together on examples, and thus the label matrix is often supposed to be low-rank, which in turn leads to a low rank of the label predictor matrix $Z = [z_1,\ldots,z_L]$. Taking the rank of $Z$ to be $k$, which is smaller than $d$ and $L$, we are able to employ two smaller matrices to approximate $Z$, i.e. $Z = D^T W$. $D \in \mathbb{R}^{k\times d}$ can be seen as a dictionary of hypotheses in the latent space $\mathbb{R}^k$, while each $w_i$ in $W = [w_1,\ldots,w_L] \in \mathbb{R}^{k\times L}$ is the coefficient vector that generates the predictor of the $i$-th label from the hypothesis dictionary $D$. Each classifier $z_i$ is represented as $z_i = D^T w_i$ ($i = 1,2,\ldots,L$), and Problem (1) can be extended into:

$$
\begin{aligned}
\min_{D,W,\xi}\quad & \frac{1}{2}\Big(\|D\|_F^2 + \sum_{i=1}^{L}\|w_i\|_2^2\Big) + C\sum_{i=1}^{L}\sum_{j=1}^{n}\xi_{ij} \\
\text{s.t.}\quad & Y_{ij}\langle D^T w_i, x_j\rangle \ge 1-\xi_{ij},\\
& \xi_{ij}\ge 0,\ \forall i=1,\ldots,L;\ j=1,\ldots,n,
\end{aligned}
\tag{2}
$$

where $\xi = [\xi_1,\ldots,\xi_L]^T$. Thus in Eq. (2) the classifiers $z_i$ of all labels are drawn from an identical low-dimensional subspace, i.e. the row space of $D$. Then, using block coordinate descent, either $D$ or $W$ can be solved within the empirical risk minimization (ERM) framework by turning it into a hinge loss minimization problem.

2.2 PRIVILEGED INFORMATION IN MULTI-LABEL LEARNING

The slack variable $\xi_{ij}$ in Eq. (2) indicates the prediction error of the $j$-th example on the $i$-th label. In fact, it depicts the error-tolerant ability of a model, and is directly related to the optimal classifier and its classification performance. From a different point of view, slack variables can be regarded as comments of some Oracle teacher on the performance of the predictors on each example. In the multi-label context, for each label, its hypothesis is not only evaluated by itself, but also assessed by the other labels. Thus the other labels can be seen as its Oracle teacher, who will provide comments during this label's learning. Note that these label values are known a priori only during training: when we get down to learning the $i$-th label's predictor, we actually know the values of the other labels for each training point $x_j$. Therefore, we can formulate the other label values as privileged information (or hidden information) of each example. Let

$$
\tilde{y}_{i,j} = y_j, \quad \text{with the } i\text{-th element set to } 0. \tag{3}
$$

We call $\tilde{y}_{i,j}$ the training point $x_j$'s privileged label feature on the $i$-th label. It can be seen that the privileged label space is constructed straightforwardly from the original label space. These privileged label features can thus be regarded as an explicit way to connect all labels. In addition, note that the valid dimension (removing the 0) of $\tilde{y}_{i,j}$ is $L-1$, since we take the other $L-1$ label values as the privileged label features. Moreover, not all the other labels have a positive impact on the learning of a given label Sun et al. (2014), and thus it is appropriate to strategically select some key labels to formulate the privileged label features. We will discuss this in the Experiment section.

Since for each label the other labels serve as the Oracle teacher via the privileged label feature $\tilde{y}_{i,j}$ on each example, the comments on slack variables can be modelled as a linear function Vapnik & Vashist (2009),

$$
\xi_{ij}(\tilde{y}_{i,j}; \tilde{w}_i) = \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle. \tag{4}
$$

The function $\xi_{ij}(\tilde{y}_{i,j}; \tilde{w}_i)$ is thus called the correcting function with respect to the $i$-th label, where $\tilde{w}_i$ is its parameter vector. As shown in Eq. (4), the privileged comments $\tilde{y}_{i,j}$ directly correct the values of slack variables as prior knowledge or additional information. Integrating privileged features as in Eq. (4) into the SVM leads to the popular SVM+ method Vapnik & Vashist (2009), which has been proved to improve the convergence rate and the performance.
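A minimal sketch of how the privileged label features of Eq. (3) and the correcting function of Eq. (4) could be realized (function and variable names are our own, not from the paper):

```python
import numpy as np

def privileged_label_feature(Y, i, j):
    """Eq. (3): y~_{i,j} is example x_j's label vector with the i-th entry zeroed.

    Y: (L, n) label matrix in {-1, +1} (the paper's column-per-example layout).
    """
    y_tilde = Y[:, j].astype(float).copy()
    y_tilde[i] = 0.0  # drop the label being learned; L-1 valid entries remain
    return y_tilde

def correcting_function(w_tilde_i, y_tilde_ij):
    """Eq. (4): the Oracle teacher's comment, a linear model of the slack value."""
    return float(np.dot(w_tilde_i, y_tilde_ij))
```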
Integrating the proposed privileged label features into the low-rank parameter structure as in Eqs. (2) and (4), we formulate a new multi-label learning model, privileged multi-label learning (PrML), by casting it into the SVM+-based LUPI paradigm:

$$
\begin{aligned}
\min_{D,W,\tilde{W}}\quad & \frac{1}{2}\|D\|_F^2 + \frac{1}{2}\sum_{i=1}^{L}\big(\gamma_1\|w_i\|_2^2 + \gamma_2\|\tilde{w}_i\|_2^2\big) + C\sum_{i=1}^{L}\sum_{j=1}^{n}\langle \tilde{w}_i, \tilde{y}_{i,j}\rangle \\
\text{s.t.}\quad & Y_{ij}\langle D^T w_i, x_j\rangle \ge 1 - \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle,\\
& \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle \ge 0,\ \forall i=1,\ldots,L;\ j=1,\ldots,n,
\end{aligned}
\tag{5}
$$

where $\tilde{W} = [\tilde{w}_1,\ldots,\tilde{w}_L]$. Particularly, we absorb the bias term to obtain a compact variant of the original SVM+, because it turns out to have a simpler form in the dual space and can be solved more efficiently. In this way, the training data within multi-label learning is actually in triplet fashion, i.e. $(x_i, y_i, \tilde{Y}_i)$, $i = 1,\ldots,n$, where $\tilde{Y}_i = [\tilde{y}_{1,i},\ldots,\tilde{y}_{L,i}]$ collects the privileged label features of example $i$ across all labels.

Remark. When $W = I$, i.e. the low-dimensional projection is the identity, the proposed PrML degenerates into a simpler BR-style model (we call it privileged Binary Relevance, PrBR), where the whole model decomposes into $L$ independent binary models. However, every single model is still combined with the comments from privileged information, and thus it may still be superior to BR.

3 OPTIMIZATION

In this section, we present how to solve the proposed privileged multi-label learning problem, Eq. (5). The whole model of Eq. (5) is not convex due to the multiplication $D^T w_i$ in the constraints. However, each subproblem with fixed $D$ or $W$ is convex, and thus can be solved by various efficient convex solvers. Note that $\langle D^T w_i, x_j\rangle$ has two equivalent forms, i.e. $\langle w_i, D x_j\rangle$ and $\langle D, w_i x_j^T\rangle$, and thus the correcting function can be coupled with $D$ or with $W$ without damaging the convexity of either subproblem. In this way, Eq. (5) can be solved using an alternating iteration strategy, i.e. iteratively conducting the following two steps: optimizing $W$ and the privileged variable $\tilde{W}$ with fixed $D$, and updating $D$ and the privileged variable $\tilde{W}$ with fixed $W$. Both subproblems are related to SVM+, so their dual problems are quadratic programs (QP). In the following, we elaborate the solving process in real implementations.

3.1 OPTIMIZING $W$, $\tilde{W}$ WITH FIXED $D$

Fixing $D$, Eq. (5) can be decomposed into $L$ independent binary classification problems, each of which concerns the variable pair $(w_i, \tilde{w}_i)$. Parallel techniques or multi-core computation can thus be employed to speed up the training process. Specifically, the optimization problem with respect to $(w_i, \tilde{w}_i)$ is

$$
\begin{aligned}
\min_{w_i,\tilde{w}_i}\quad & \frac{1}{2}\big(\gamma_1\|w_i\|_2^2 + \gamma_2\|\tilde{w}_i\|_2^2\big) + C\sum_{j=1}^{n}\langle \tilde{w}_i, \tilde{y}_{i,j}\rangle \\
\text{s.t.}\quad & Y_{ij}\langle w_i, D x_j\rangle \ge 1 - \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle,\\
& \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle \ge 0,\ \forall j=1,\ldots,n,
\end{aligned}
\tag{6}
$$

and its dual form is (see the supplementary materials)

$$
\max_{\alpha,\beta}\ -\frac{1}{2}(\alpha\circ y_i^*)^T K_D (\alpha\circ y_i^*) + \mathbf{1}^T\alpha - \frac{1}{2\gamma}(\alpha+\beta-C\mathbf{1})^T \tilde{K}_i (\alpha+\beta-C\mathbf{1}) \tag{7}
$$

with the parameter updates $\gamma \leftarrow \gamma_2/\gamma_1$, $C \leftarrow C/\gamma_1$ and the constraints $\alpha \succeq 0$, $\beta \succeq 0$, i.e. $\alpha_j \ge 0$, $\beta_j \ge 0$, $\forall j \in [1:n]$. Moreover, $y_i^* = [Y_{i1}, Y_{i2}, \ldots, Y_{in}]^T$ is the label-wise vector for the $i$-th label, and $\circ$ is the Hadamard (element-wise) product of two vectors or matrices. $K_D \in \mathbb{R}^{n\times n}$ is the inner product (kernel) matrix of the $D$-based features, with $K_D(j,q) = \langle D x_j, D x_q\rangle$. $\tilde{K}_i$ is the inner product (kernel) matrix of the privileged label features with respect to the $i$-th label, with $\tilde{K}_i(j,q) = \langle \tilde{y}_{i,j}, \tilde{y}_{i,q}\rangle$. $\mathbf{1}$ is the vector with all ones.
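As an illustration, the two Gram matrices appearing in Eq. (7) could be precomputed as follows (a sketch with our own naming; as noted below, the coordinate descent solver actually needs only their diagonals):

```python
import numpy as np

def kernel_matrices(D, X, Y_tilde_i):
    """Gram matrices used by the dual (7).

    D: (k, d) hypothesis dictionary; X: (d, n) data matrix (paper layout);
    Y_tilde_i: (L, n) matrix whose j-th column is y~_{i,j} for the i-th label.
    """
    DX = D @ X                            # (k, n): D-based features D x_j
    K_D = DX.T @ DX                       # K_D(j, q) = <D x_j, D x_q>
    K_tilde_i = Y_tilde_i.T @ Y_tilde_i   # K~_i(j, q) = <y~_{i,j}, y~_{i,q}>
    return K_D, K_tilde_i
```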
Pechyony et al. (2010) proposed an SMO-style algorithm (gSMO) for the SVM+ problem. However, because of the bias term, the Lagrange multipliers are tangled together in the dual problem, which leads to a more complicated constraint set

$$\{(\alpha,\beta)\ |\ \alpha^T y_i^* = 0,\ \mathbf{1}^T(\alpha+\beta-C\mathbf{1}) = 0,\ \alpha\succeq 0,\ \beta\succeq 0\}$$

than $\{(\alpha,\beta)\ |\ \alpha\succeq 0,\ \beta\succeq 0\}$ in our PrML. Hence, by absorbing the bias term, Eq. (6) produces a more compact dual problem with only a non-negativity constraint. A coordinate descent (CD)¹ algorithm can be applied to solve the dual problem, and a closed-form solution can be obtained in each iteration step Li et al. (2016). After solving Eq. (7), according to the Karush-Kuhn-Tucker (KKT) conditions, the optimal solution of the primal problem (6) can be expressed by the Lagrange multipliers:

$$
w_i = \sum_{j=1}^{n} \alpha_j Y_{ij} D x_j, \tag{8}
$$

$$
\tilde{w}_i = \frac{1}{\gamma}\sum_{j=1}^{n} (\alpha_j + \beta_j - C)\,\tilde{y}_{i,j}. \tag{9}
$$

¹ We optimize an equivalent "min" problem instead of the original "max" one.

3.2 OPTIMIZING $D$, $\tilde{W}$ WITH FIXED $W$

Given the fixed coefficient matrix $W$, we update and learn the linear transformation $D$ with the help of the comments provided by privileged information. Problem (5) for $(D, \tilde{W})$ thus reduces to

$$
\begin{aligned}
\min_{D,\tilde{W}}\quad & \frac{1}{2}\|D\|_F^2 + \frac{\gamma_2}{2}\sum_{i=1}^{L}\|\tilde{w}_i\|_2^2 + C\sum_{i=1}^{L}\sum_{j=1}^{n}\langle \tilde{w}_i, \tilde{y}_{i,j}\rangle \\
\text{s.t.}\quad & Y_{ij}\langle D, w_i x_j^T\rangle \ge 1 - \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle,\\
& \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle \ge 0,\ \forall i=1,\ldots,L;\ j=1,\ldots,n.
\end{aligned}
\tag{10}
$$

Eq. (10) has $Ln$ constraints, each of which can be indexed with a two-dimensional subscript $[i,j]$. The Lagrange multipliers of Eq. (10) are thus two-dimensional as well. To make the dual problem of Eq. (10) consistent with Eq. (7), we define a bijection $\varphi : [1:L]\times[1:n] \to [1:Ln]$ as the row-based vectorization index mapping, i.e. $\varphi([i,j]) = (i-1)n + j$. In a nutshell, we arrange the constraints (and also the multipliers) according to the order of row-based vectorization. In this way, the corresponding dual problem of Eq. (10) is formulated as (see details in the supplementary materials)

$$
\max_{\alpha\succeq 0,\ \beta\succeq 0}\ -\frac{1}{2}(\alpha\circ y^*)^T K_W (\alpha\circ y^*) + \mathbf{1}^T\alpha - \frac{1}{2\gamma_2}(\alpha+\beta-C\mathbf{1})^T \tilde{K} (\alpha+\beta-C\mathbf{1}) \tag{11}
$$

where $y^* = [y_1^*; y_2^*; \ldots; y_L^*]$ is the row-based vectorization of $Y$ and $\tilde{K} = \mathrm{diag}(\tilde{K}_1, \tilde{K}_2, \ldots, \tilde{K}_L)$ is a block-diagonal matrix, which corresponds to the kernel matrix of the privileged label features. $K_W$ is the kernel matrix of the input features with elements $K_W(s,t) = \langle G_{\varphi^{-1}(s)}, G_{\varphi^{-1}(t)}\rangle$, where $G_{ij} = w_i x_j^T$. Based on the KKT conditions, $(D, \tilde{W})$ can be constructed from $(\alpha, \beta)$:

$$
D = \sum_{s=1}^{Ln} \alpha_s y_s^*\, G_{\varphi^{-1}(s)}, \tag{12}
$$

$$
\tilde{w}_i = \frac{1}{\gamma_2}\sum_{j=1}^{n} \big(\alpha_{\varphi([i,j])} + \beta_{\varphi([i,j])} - C\big)\,\tilde{y}_{i,j}. \tag{13}
$$

In this way, Eq. (11) has an optimization form identical to Eq. (7). Thus we can also turn to the fast CD method Li et al. (2016). However, due to the subscript index mapping, directly using the method proposed in Li et al. (2016) is very expensive. Considering that the privileged kernel matrix $\tilde{K}$ is block sparse, we can further speed up the calculation. Details of the modified version of the dual CD algorithm for solving Eq. (11) are presented in Algorithm 1. Also note that one primary merit of this algorithm is that it is free of calculating the whole kernel matrix; instead, we only need to calculate its diagonal elements, as in line 2 of Algorithm 1.
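To make the update concrete, here is a minimal sketch of the compact-dual coordinate descent for one label's subproblem, Eqs. (6)-(9). This is our own illustrative implementation, not the paper's code; it maintains the primal pair alongside the duals, and `gamma` and `C` are the rescaled parameters $\gamma \leftarrow \gamma_2/\gamma_1$ and $C \leftarrow C/\gamma_1$ from Eq. (7):

```python
import numpy as np

def solve_label_subproblem(DX, Yi, Yt, gamma, C, n_epochs=50, seed=0):
    """Dual CD sketch for one label's SVM+ subproblem.

    DX: (k, n), columns are D x_j; Yi: (n,) labels in {-1, +1} for label i;
    Yt: (L, n), columns are the privileged features y~_{i,j}.
    """
    k, n = DX.shape
    rng = np.random.default_rng(seed)
    alpha, beta = np.zeros(n), np.zeros(n)
    w = np.zeros(k)                       # Eq. (8) with alpha = 0
    wt = -(C / gamma) * Yt.sum(axis=1)    # Eq. (9) with alpha = beta = 0
    # Second derivatives of the (min-form) dual per coordinate
    Qa = np.einsum('kj,kj->j', DX, DX) + np.einsum('lj,lj->j', Yt, Yt) / gamma
    Qb = np.einsum('lj,lj->j', Yt, Yt) / gamma
    for _ in range(n_epochs):
        for s in rng.permutation(2 * n):
            j = s % n
            if s < n:   # coordinate alpha_j
                grad = Yi[j] * w @ DX[:, j] - 1.0 + wt @ Yt[:, j]
                delta = max(-alpha[j], -grad / Qa[j])
                alpha[j] += delta
                w += delta * Yi[j] * DX[:, j]
            else:       # coordinate beta_j
                grad = wt @ Yt[:, j]
                delta = max(-beta[j], -grad / Qb[j])
                beta[j] += delta
            wt += (delta / gamma) * Yt[:, j]  # both duals feed w~ via Eq. (9)
    return w, wt, alpha, beta
```

Since Eq. (11) has the identical form, the same per-coordinate update applies there as well, with the index bijection $\varphi$ routing each coordinate to its $(i,j)$ pair.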
3.3 FRAMEWORK OF PRML

Our proposed privileged multi-label learning is summarized in Algorithm 2. As indicated in Algorithm 2, both $D$ and $W$ are updated with the help of comments from privileged information. Note that the primal variables and dual variables are connected through the KKT conditions, and thus in real applications lines 5-6 and 8-9 in Algorithm 2 can be implemented iteratively. Since each subproblem is actually a linear SVM+ optimization solved by the CD method, its convergence is consistent with that of the dual CD algorithm for linear SVM Hsieh et al. (2008). Due to the cheap updates, Hsieh et al. (2008); Li et al. (2016) empirically showed that it can be much faster than gSMO-style methods and many other convex solvers when $d$ (the number of features) is large. Moreover, the independence of labels in Problem (6) enables the use of parallel techniques and multi-core computation to accommodate a large $L$ (number of labels). As for a large $n$ (number of examples) (and also a large $L$ for Problem (10)), we can use the mini-batch CD method Takac et al. (2015), where each time a batch of examples is selected and the CD updates are applied to them in parallel, i.e. lines 5-17 can be implemented in parallel. Also, recently Chiang et al. (2016) designed a framework for parallel CD and achieved significant speedups even when $d$ and $n$ are both very large. Thus, our model can scale with $d$, $L$ and $n$. In addition, the solution of each subproblem is also unique, as Theorem 1 states.

Theorem 1. The solution to problem (6) or (10) is unique for any $\gamma_1 > 0$, $\gamma_2 > 0$, $C > 0$.

Algorithm 1 A dual coordinate descent algorithm for solving Eq. (11)
Input: Training data: feature matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d\times n}$, label matrix $Y = [y_1, y_2, \ldots, y_n] \in \{-1,1\}^{L\times n}$. Privileged label features: $\tilde{y}_{i,j}$ for the $i$-th label and $j$-th example. Low-dimensional embedding projection: $W = [w_1,\ldots,w_L] \in \mathbb{R}^{k\times L}$. Learning parameters: $\gamma, C > 0$.
1: Define the bijection $\varphi : [1:L]\times[1:n] \to [1:Ln]$ as the row-based vectorization index mapping, i.e. $\varphi([i,j]) = (i-1)n + j$. Denote its inverse as $\varphi^{-1}$.
2: Construct $Q \in \mathbb{R}^{2Ln}$ for each $s \in [1:2Ln]$: for $1 \le s \le Ln$, $[i,j] \leftarrow \varphi^{-1}(s)$, $Q_s = \|w_i\|_2^2\|x_j\|_2^2 + (1/\gamma)\|\tilde{y}_{i,j}\|_2^2$; for $Ln+1 \le s \le 2Ln$, $[i,j] \leftarrow \varphi^{-1}(s-Ln)$, $Q_s = (1/\gamma)\|\tilde{y}_{i,j}\|_2^2$.
3: Initialization: $\eta = [\alpha; \beta] \leftarrow 0$, $D \leftarrow 0$ and $\tilde{w}_i \leftarrow -\frac{C}{\gamma}\sum_{j=1}^{n}\tilde{y}_{i,j}$ for each $i \in [1:L]$.
4: while not converged do
5:   randomly pick an index $s$
6:   if $1 \le s \le Ln$ then
7:     $[i,j] \leftarrow \varphi^{-1}(s)$
8:     $\nabla_s \leftarrow Y_{ij}\, w_i^T D x_j - 1 + \tilde{w}_i^T \tilde{y}_{i,j}$
9:     $\delta \leftarrow \max\{-\eta_s, -\nabla_s/Q_s\}$
10:    $D \leftarrow D + \delta\, Y_{ij}\, w_i x_j^T$
11:  else if $Ln+1 \le s \le 2Ln$ then
12:    $[i,j] \leftarrow \varphi^{-1}(s-Ln)$
13:    $\nabla_s \leftarrow \tilde{w}_i^T \tilde{y}_{i,j}$
14:    $\delta \leftarrow \max\{-\eta_s, -\nabla_s/Q_s\}$
15:  end if
16:  $\tilde{w}_i \leftarrow \tilde{w}_i + (\delta/\gamma)\,\tilde{y}_{i,j}$
17:  $\eta_s \leftarrow \eta_s + \delta$
18: end while
Output: Dictionary $D$ and correcting functions $\{\tilde{w}_i\}_{i=1}^{L}$.

Algorithm 2 Privileged Multi-label Learning (PrML)
Input: Training data: feature matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d\times n}$, label matrix $Y = [y_1, y_2, \ldots, y_n] \in \{-1,1\}^{L\times n}$. Learning parameters: $\gamma_1, \gamma_2, C \ge 0$.
1: Construct privileged label features for each label and each training point, e.g. as in Eq. (3).
2: Initialize $D$.
3: while not converged do
4:   for each $i \in [1:L]$ do
5:     $[\alpha, \beta] \leftarrow$ solving Eq. (7)
6:     update $w_i, \tilde{w}_i$ according to Eq. (8) and Eq. (9)
7:   end for
8:   $[\alpha, \beta] \leftarrow$ solving Eq. (11)
9:   update $D, \tilde{W}$ according to Eq. (12) and Eq. (13)
10: end while
Output: A linear multi-label classifier $Z = D^T W$, together with a correcting function $\tilde{W}$ w.r.t. the $L$ labels.

Proof sketch. Both Eq. (6) and Eq. (10) can be cast into an identical SVM+ optimization with an objective function of the form $F = \frac{1}{2}\|w\|_2^2 + \frac{\gamma}{2}\|\tilde{w}\|_2^2 + C\sum_{j=1}^{n}\langle \tilde{w}, \tilde{x}_j\rangle$ and a closed convex feasible solution set. Denote $u := (w; \tilde{w})$ and assume two optimal solutions $u_1, u_2$; then $F(u_1) = F(u_2)$. Let $u_t = (1-t)u_1 + t u_2$, $t\in[0,1]$; then $F(u_t) \le (1-t)F(u_1) + tF(u_2) = F(u_1)$, and thus $F(u_t) = F(u_1) = F(u_2)$ for all $t\in[0,1]$, which implies that $g(t) = F(u_t) - F(u_1) \equiv 0$, $\forall t\in[0,1]$. Moreover, we have

$$
g(t) = \frac{t^2}{2}\|w_2-w_1\|_2^2 + t\langle w_2-w_1, w_1\rangle + \frac{\gamma t^2}{2}\|\tilde{w}_2-\tilde{w}_1\|_2^2 + \gamma t\langle \tilde{w}_2-\tilde{w}_1, \tilde{w}_1\rangle + tC\sum_{j=1}^{n}\langle \tilde{w}_2-\tilde{w}_1, \tilde{x}_j\rangle,
$$

and hence $g''(t) = \|w_2-w_1\|_2^2 + \gamma\|\tilde{w}_2-\tilde{w}_1\|_2^2 = 0$. For $\gamma > 0$, we then have $w_1 = w_2$ and $\tilde{w}_1 = \tilde{w}_2$.

The proof of Theorem 1 mainly rests on the strict convexity of the objective function in either Eq. (6) or Eq. (10); concrete details are referred to the supplementary materials. In this way, the correcting function $\tilde{W}$ serves as a bridge channeling $D$ and $W$, and the convergence of $\tilde{W}$ implies the convergence of $D$ and $W$. Thus we can take $\tilde{W}$ as the barometer of the whole algorithm's convergence.
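At the top level, the alternating procedure of Algorithm 2 could be organized as below. This is a structural sketch only: `solve_label_subproblem` is the illustrative per-label CD solver sketched after Section 3.2, `solve_dictionary_subproblem` is a hypothetical stand-in for Algorithm 1, and the stopping test uses the convergence of $\tilde{W}$ as the barometer, as suggested above:

```python
import numpy as np

def prml_fit(X, Y, Yt, gamma1, gamma2, C, k, n_outer=20, tol=1e-4):
    """Alternating optimization of Eq. (5), mirroring Algorithm 2 (sketch).

    X: (d, n); Y: (L, n) in {-1, +1}; Yt[i]: (L, n) privileged features y~_{i,.}.
    """
    d, n = X.shape
    L = Y.shape[0]
    D = np.random.default_rng(0).standard_normal((k, d)) * 0.01  # init of D
    W = np.zeros((k, L))
    Wt = np.zeros((L, L))  # column i holds the correcting vector w~_i
    for _ in range(n_outer):
        Wt_old = Wt.copy()
        DX = D @ X
        for i in range(L):  # lines 4-7: per-label (w_i, w~_i) with D fixed
            w, wt, _, _ = solve_label_subproblem(
                DX, Y[i], Yt[i], gamma2 / gamma1, C / gamma1)
            W[:, i], Wt[:, i] = w, wt
        # lines 8-9: Algorithm 1 (stand-in) updates D and W~ with W fixed
        D, Wt = solve_dictionary_subproblem(X, Y, Yt, W, gamma2, C)
        if np.linalg.norm(Wt - Wt_old) < tol:  # W~ as convergence barometer
            break
    return D.T @ W, Wt  # linear classifier Z = D^T W and correcting functions
```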
Table 1: Data statistics. $n$ is the total number of examples; $d$ and $L$ are the numbers of features and labels, respectively; $\bar{L}$ and Den($L$) are the average number of positive labels per instance and the label density, respectively. 'Type' means feature type.

Dataset     n      d     L     L̄      Den(L)  Type
enron       1702   1001  53    3.378  0.064   nominal
yeast       2417   103   14    4.237  0.303   numeric
corel5k     5000   499   374   3.522  0.009   nominal
bibtex      7395   1836  159   2.402  0.015   nominal
eurlex      19348  5000  3993  5.310  0.001   nominal
mediamill   43907  120   101   4.376  0.043   numeric

Figure 1: Some of the image annotation results of the proposed PrML on the benchmark corel5k dataset. Tags: red, ground-truth; blue, correct predictions; black, wrong predictions.

4 EXPERIMENTAL RESULTS

In this section, we conduct various experiments on benchmark datasets to validate the effectiveness of using the intrinsic privileged information for multi-label learning. In addition, we also investigate the performance and superiority of the proposed PrML model compared to recent competing multi-label methods.

4.1 EXPERIMENT CONFIGURATION

Datasets. We select six benchmark multi-label datasets: enron, yeast, corel5k, bibtex, eurlex and mediamill. Specifically, we consider the cases where $d$ (eurlex), $L$ (eurlex) and $n$ (corel5k, bibtex, eurlex & mediamill) are large, respectively. Also note that enron, corel5k, bibtex and eurlex have sparse features. See Table 1 for the details of these datasets.

Comparison approaches.
1). BR (binary relevance) Tsoumakas et al. (2010). An SVM is trained with respect to each label.
2). ECC (ensembles of classifier chains) Read et al. (2011). It turns MLL into a series of binary classification problems.
3). RAKEL (random k-labelsets) Tsoumakas et al. (2011). It transforms MLL into an ensemble of multi-class classification problems.
4). LEML (low-rank empirical risk minimization for multi-label learning) Yu et al. (2014). It is a low-rank embedding approach cast into the ERM framework.
5). ML2 (multi-label manifold learning) Hou et al. (2016). It is a recent multi-label learning method based on the manifold assumption in the label space.

Evaluation metrics. We use six prevalent metrics to evaluate the performance of all methods: Hamming loss, One-error, Coverage, Ranking loss, Average precision (Aver precision) and Macro-averaging AUC (MacAUC). Note that all evaluation metrics have the value range [0,1]. In addition, for the first four metrics, smaller values indicate better classification performance, and we use ↓ to index this positive logic. On the contrary, for the last two metrics larger values represent better performance, indexed by ↑.
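For reference, two of these metrics could be computed as in the following sketch (our own illustrative implementation of the standard definitions, not the paper's evaluation code; the remaining metrics follow the same pattern):

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of misclassified example-label pairs; inputs are (n, L) in {-1, +1}."""
    return float(np.mean(Y_true != Y_pred))

def ranking_loss(Y_true, scores):
    """Average fraction of (positive, negative) label pairs ordered wrongly.

    Y_true: (n, L) in {-1, +1}; scores: (n, L) real-valued predictions.
    Examples with no positive or no negative labels are skipped.
    """
    losses = []
    for y, s in zip(Y_true, scores):
        pos, neg = s[y > 0], s[y < 0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        # pairs where a negative label scores at least as high as a positive one
        wrong = np.sum(pos[:, None] <= neg[None, :])
        losses.append(wrong / (len(pos) * len(neg)))
    return float(np.mean(losses))
```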
4.2 ALGORITHM ANALYSIS

Performance visualization. We first analyze our proposed PrML in a global sense: we select the benchmark image dataset corel5k and visualize the image annotation results to directly examine how PrML functions. For the learning process, we randomly selected 50% of the examples without repetition as the training set and the remaining ones as the testing set. In our experiment, the parameter $\gamma_1$ is set to 1; $\gamma_2$ and $C$ lie in the ranges $0.1\sim2.0$ and $10^{-3}\sim20$ respectively, and are determined using cross-validation on a part of the training points. The embedding dimension $k$ is set to $k = \lceil 0.9L\rceil$, where $\lceil r\rceil$ is the smallest integer greater than $r$. Some of the annotation results are presented in Figure 1, where the left tags are the ground-truth and the right ones are the top five predicted tags.

As shown in Figure 1, we can safely conclude that the proposed PrML performs well on the image annotation tasks. It correctly predicts the semantic labels in most cases. Note that although in some cases the predicted labels are not in the ground-truth, they are essentially related in a semantic sense. For example, the "swimmers" in image (b) is much more natural when it comes with the "pool". Moreover, we can also see that PrML makes supplementary predictions to describe images, enriching the corresponding content. For example, the "grass" in image (c) and the "buildings" in image (d) are objects missing from the ground-truth labels.

Validation of privileged label features. We then validate the effectiveness of the proposed privileged information for multi-label learning. As discussed previously, the privileged label features serve as guidance or comments from an Oracle teacher to connect the learning of all the labels. For the sake of fairness, we simply implement the validation experiments with LEML (without privileged label features) and PrML (with privileged label features). Note that our proposed privileged label features are composed of the values of labels; however, not all labels have prominent connections in multi-label learning Sun et al. (2014). Thus we selectively construct the privileged label features with respect to each label.

Particularly, we just use the K-nearest-neighbor rule to form the pool per label. For each label, only labels in its label pool, instead of the whole label set, are reckoned to provide mutual guidance during its learning. In our implementation, we simply utilize the Hamming distance to accomplish the K-nearest-neighbor search on the dataset corel5k (a sketch of this selection is given at the end of this subsection). The experimental setting is the same as before and both algorithms share the same embedding dimension $k$. Moreover, we carry out independent tests ten times and the average results are shown in Figure 2.

Figure 2: Classification results of PrML (blue solid line) and LEML (red dashed line) on corel5k (50% for training & 50% for testing) w.r.t. different dimensions of privileged label features. Panels (a)-(f): Hamming loss↓, One-error↓, Coverage↓, Ranking loss↓, Average precision↑ and Macro-averaging AUC↑.

As shown in Figure 2, we have the following two observations. (a) PrML is clearly superior to LEML when we select enough labels as privileged label features, e.g. more than 350 labels on the corel5k dataset. Since their only difference lies in the usage of the privileged information, we can conclude that the guidance from the privileged information, i.e. the proposed privileged label features, can significantly improve the performance of multi-label learning. (b) With more labels involved in the privileged label features, the performance of PrML keeps improving at a steady rate, and when the dimension of the privileged label features is large enough, the performance tends to stabilize on the whole.

The number of labels is directly related to the complexity of the correcting function, which is defined as a linear function. Thus too few labels might induce a low function complexity, so that the correcting function cannot determine the optimal slack variables. In this way, the fault-tolerant capacity would be crippled and the performance can be even worse than LEML. For example, when the dimension of privileged labels is less than 250 on corel5k, the Hamming loss, One-error, Coverage and Ranking loss of PrML are much larger than those of LEML. In contrast, too many labels might introduce unnecessary guidance, and the extra labels then make no contribution to the further improvement of classification performance. For instance, the performance with 365 labels involved in the privileged label features is on par with that using all the other (373) labels in Hamming loss, One-error, Ranking loss and Average precision. Moreover, in real applications it is still a safe choice to involve all other labels in the privileged information.
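A minimal sketch of the per-label pool selection described above, using Hamming distance between label rows (our own reading of the procedure; the value of `K` and the tie-breaking are not specified in the paper):

```python
import numpy as np

def label_pools(Y, K):
    """For each label, pick the K most similar other labels as its pool.

    Y: (L, n) label matrix in {-1, +1}; similarity is measured by the Hamming
    distance between label rows (their assignment patterns over examples).
    Returns a list of index arrays; only pooled labels contribute to y~_{i,j}.
    """
    L = Y.shape[0]
    # dist[i, q]: fraction of examples on which labels i and q disagree
    dist = np.array([[np.mean(Y[i] != Y[q]) for q in range(L)]
                     for i in range(L)])
    np.fill_diagonal(dist, np.inf)  # a label never pools itself
    return [np.argsort(dist[i])[:K] for i in range(L)]

def pooled_privileged_feature(Y, pools, i, j):
    # y~_{i,j} restricted to label i's pool (zero elsewhere, as in Eq. (3))
    y = np.zeros(Y.shape[0])
    y[pools[i]] = Y[pools[i], j]
    return y
```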
4.3 PERFORMANCE COMPARISON

We now formally analyze the performance of the proposed privileged multi-label learning (PrML) in comparison with popular state-of-the-art methods. For each dataset, we randomly selected 50% of the examples without repetition as the training set and the rest for testing. For the results' credibility, the dataset division process is implemented ten times independently and we record the corresponding results of each trial. The parameters $\gamma_1$, $\gamma_2$ and $C$ are determined in the same manner as before. As for the low embedding dimension $k$, following the wisdom of Yu et al. (2014), we choose $k$ from $\{\lceil 0.8L\rceil, \lceil 0.85L\rceil, \lceil 0.9L\rceil, \lceil 0.95L\rceil\}$, determined by cross-validation using a part of the training points. Particularly, we also include PrBR (privileged information + BR) to further investigate the proposed privileged information. The detailed results are reported in Table 2.

From Table 2, we can see that the proposed PrML is comparable to the state-of-the-art ML2 method, and significantly surpasses the other competing multi-label methods. Concretely, across all evaluation metrics and datasets, PrML ranks first in 52.8% of the cases and in the top two in all cases; even in second place, PrML's performance is close to the top one. Comparing BR & PrBR, and LEML & PrML, we can safely infer that the privileged information plays an important role in enhancing the classification performance of multi-label predictors. Besides, in all 36 cases, PrML wins 34 cases against PrBR and ties twice (in Ranking loss on enron and Hamming loss on corel5k, respectively), which implies that the low-rank structure in PrML has a positive impact in further improving multi-label performance. Therefore, we can see that PrML has inherited the merits of both the low-rank parameter structure and the privileged label information. In addition, PrML and LEML tend to perform better on datasets with more labels (>100). This might be because the low-rank assumption is more sensible when the number of labels is considerably large.

Table 2: Average predictive performance (mean ± std. deviation) of ten independent trials for various multi-label learning methods. In each trial, 50% of the examples are randomly selected without repetition as the training set and the rest as the testing set. The top performance among all methods is marked in boldface.
Dataset    Method  Hamming loss↓  One-error↓   Coverage↓    Ranking loss↓  Aver precision↑  MacAUC↑

enron
           BR      0.060±0.001    0.498±0.012  0.595±0.010  0.308±0.007    0.449±0.011      0.579±0.007
           ECC     0.056±0.001    0.293±0.008  0.349±0.014  0.133±0.004    0.651±0.006      0.646±0.008
           RAKEL   0.058±0.001    0.412±0.016  0.523±0.008  0.241±0.005    0.539±0.006      0.596±0.007
           LEML    0.049±0.002    0.320±0.004  0.276±0.005  0.117±0.006    0.661±0.004      0.625±0.007
           ML2     0.051±0.001    0.258±0.090  0.256±0.017  0.090±0.012    0.681±0.053      0.714±0.021
           PrBR    0.053±0.001    0.342±0.010  0.238±0.006  0.088±0.003    0.618±0.004      0.638±0.005
           PrML    0.050±0.001    0.288±0.005  0.221±0.005  0.088±0.006    0.685±0.005      0.674±0.004

yeast
           BR      0.201±0.003    0.256±0.008  0.641±0.005  0.315±0.005    0.672±0.005      0.565±0.003
           ECC     0.207±0.003    0.244±0.009  0.464±0.005  0.186±0.003    0.752±0.006      0.646±0.003
           RAKEL   0.202±0.003    0.251±0.008  0.558±0.006  0.245±0.004    0.720±0.005      0.614±0.003
           LEML    0.201±0.004    0.224±0.003  0.480±0.005  0.174±0.004    0.751±0.006      0.642±0.004
           ML2     0.196±0.003    0.228±0.009  0.454±0.004  0.168±0.003    0.765±0.005      0.702±0.007
           PrBR    0.227±0.004    0.237±0.006  0.487±0.005  0.204±0.003    0.719±0.005      0.623±0.004
           PrML    0.201±0.003    0.214±0.005  0.459±0.004  0.165±0.003    0.771±0.003      0.685±0.003

corel5k
           BR      0.012±0.001    0.849±0.008  0.898±0.003  0.655±0.004    0.101±0.003      0.518±0.001
           ECC     0.015±0.001    0.699±0.006  0.562±0.007  0.292±0.003    0.264±0.003      0.568±0.003
           RAKEL   0.012±0.001    0.819±0.010  0.886±0.004  0.627±0.004    0.122±0.004      0.521±0.001
           LEML    0.010±0.001    0.683±0.006  0.273±0.008  0.125±0.003    0.268±0.005      0.622±0.006
           ML2     0.010±0.001    0.647±0.007  0.372±0.006  0.163±0.003    0.297±0.002      0.667±0.007
           PrBR    0.010±0.001    0.740±0.007  0.367±0.005  0.165±0.004    0.227±0.004      0.560±0.005
           PrML    0.010±0.001    0.675±0.003  0.266±0.007  0.118±0.003    0.282±0.005      0.651±0.004

bibtex
           BR      0.015±0.001    0.559±0.004  0.461±0.006  0.303±0.004    0.363±0.004      0.624±0.002
           ECC     0.017±0.001    0.404±0.003  0.327±0.008  0.192±0.003    0.515±0.004      0.763±0.003
           RAKEL   0.015±0.001    0.506±0.005  0.443±0.006  0.286±0.003    0.399±0.004      0.641±0.002
           LEML    0.013±0.001    0.394±0.004  0.144±0.002  0.082±0.003    0.534±0.002      0.757±0.003
           ML2     0.013±0.001    0.365±0.004  0.128±0.003  0.067±0.002    0.596±0.004      0.911±0.002
           PrBR    0.014±0.001    0.426±0.004  0.178±0.010  0.096±0.005    0.529±0.009      0.702±0.003
           PrML    0.012±0.001    0.367±0.003  0.131±0.007  0.066±0.003    0.571±0.004      0.819±0.005

eurlex
           BR      0.018±0.004    0.537±0.002  0.322±0.008  0.186±0.009    0.388±0.005      0.689±0.007
           ECC     0.011±0.003    0.492±0.003  0.298±0.004  0.155±0.006    0.458±0.004      0.787±0.009
           RAKEL   0.009±0.004    0.496±0.007  0.277±0.009  0.161±0.001    0.417±0.010      0.822±0.005
           LEML    0.003±0.002    0.447±0.005  0.233±0.003  0.103±0.010    0.488±0.006      0.821±0.014
           ML2     0.001±0.001    0.320±0.001  0.171±0.003  0.045±0.007    0.497±0.003      0.885±0.003
           PrBR    0.007±0.008    0.484±0.003  0.229±0.009  0.108±0.009    0.455±0.003      0.793±0.008
           PrML    0.001±0.002    0.299±0.003  0.192±0.008  0.057±0.002    0.526±0.009      0.892±0.004

mediamill
           BR      0.031±0.001    0.200±0.003  0.575±0.003  0.230±0.001    0.502±0.002      0.510±0.001
           ECC     0.035±0.001    0.150±0.005  0.467±0.009  0.179±0.008    0.597±0.014      0.524±0.001
           RAKEL   0.031±0.001    0.181±0.002  0.560±0.002  0.222±0.001    0.521±0.001      0.513±0.001
           LEML    0.030±0.001    0.126±0.003  0.184±0.007  0.084±0.004    0.720±0.007      0.699±0.010
           ML2     0.035±0.002    0.231±0.004  0.278±0.003  0.121±0.003    0.647±0.002      0.847±0.003
           PrBR    0.031±0.001    0.147±0.005  0.255±0.003  0.092±0.002    0.648±0.003      0.641±0.004
           PrML    0.029±0.002    0.130±0.002  0.172±0.004  0.055±0.006    0.726±0.002      0.727±0.008

5 CONCLUSION

In this paper, we investigate the intrinsic privileged information to connect labels in multi-label learning. Tactfully, we regard the label values as the privileged label features.
This strategy indicates that for each label's learning, the other labels of each example may serve as Oracle comments on the learning of this label. Without requiring additional data, we propose to actively construct privileged label features directly from the label space. We then integrate the privileged information with the low-rank hypotheses $Z = D^T W$ in multi-label learning, and formulate privileged multi-label learning (PrML) as a result. During the optimization, both the dictionary $D$ and the coefficient matrix $W$ can receive the comments from the privileged information, and experimental results show that with this privileged information the classification performance can be significantly improved. Thus we can also take the privileged label features as a way to boost the classification performance of low-rank based models. As for future work, our proposed PrML can be easily extended into a kernel version to handle nonlinearity in the parameter space. Besides, using the SVM-style $L_2$ hinge loss might further improve the training efficiency Xu et al. (2016). Theoretical guarantees will also be investigated.
