Ties That Bind - Characterizing Classes by Attributes and Social Ties

Aria Rezaei, Stony Brook University, Stony Brook, NY
Bryan Perozzi∗, Google Research, New York, NY
Leman Akoglu, H. John Heinz III College, Carnegie Mellon University

∗ Work performed while at Stony Brook University.

arXiv:1701.09039v1 [cs.SI] 31 Jan 2017

ABSTRACT

Given a set of attributed subgraphs known to be from different classes, how can we discover their differences? There are many cases where collections of subgraphs may be contrasted against each other. For example, they may be assigned ground truth labels (spam/not-spam), or it may be desired to directly compare the biological networks of different species or compound networks of different chemicals.

In this work we introduce the problem of characterizing the differences between attributed subgraphs that belong to different classes. We define this characterization problem as one of partitioning the attributes into as many groups as the number of classes, while maximizing the total attributed quality score of all the given subgraphs.

We show that our attribute-to-class assignment problem is NP-hard and an optimal (1−1/e)-approximation algorithm exists. We also propose two different faster heuristics that are linear-time in the number of attributes and subgraphs. Unlike previous work where only attributes were taken into account for characterization, here we exploit both attributes and social ties (i.e. graph structure).

Through extensive experiments, we compare our proposed algorithms and show findings that agree with human intuition on datasets from Amazon co-purchases, Congressional bill sponsorships and DBLP co-authorships. We also show that our approach of characterizing subgraphs is better suited for sense-making than discriminating classification approaches.

CCS CONCEPTS

• Information systems → Data mining; Social tagging; • Human-centered computing → Social network analysis;

ACM Reference format:
Aria Rezaei, Bryan Perozzi, and Leman Akoglu. Ties That Bind - Characterizing Classes by Attributes and Social Ties. In Proceedings of Draft, Perth, Australia, April 2017 (WWW'17), 9 pages.

1 INTRODUCTION

Besides connectivity, many graphs contain a state (or content) vector for each node. This type of graph is known as an attributed graph, and is a natural abstraction for many applications. For example, in a social network the profile information of individuals (e.g. age, occupation, etc.) constitutes the attribute vector for each node. Biological data can also be represented as attributed graphs; protein-protein interaction (PPI) networks can have gene encodings of proteins as attributes, or gene interaction networks may contain gene ontology properties as attributes [14,22].

We consider the following question: Given a collection of attributed subgraphs from different classes, how can we discover the attributes that characterize their differences? This is a general question, which finds applications in various settings depending on how 'subgraphs' and 'classes' are defined and interpreted. In social networks, subgraphs could be the local communities around each individual. That is because one's acquaintances carry a lot of information about them due to the factors of homophily (the phenomenon that "birds of a feather flock together") and influence (the phenomenon that our attitudes and behaviors are shaped by our peers) [7]. One can also consider the subgraphs extracted by a community detection algorithm, the social circles as defined by individuals, or any collection of small graphlets that come from an application (e.g. PPI networks of a collection of fly species). On the other hand, a 'class' corresponds to a broad categorization of subjects. In social network analysis, one may try to understand the differences between individuals living in different countries (e.g. U.S. vs. China), or having different demographics (e.g. elderly vs. teenagers). In biology, one may want to analyze PPI networks of healthy versus sick individuals or of mice versus humans.¹

¹ Here we focus on two classes for simplicity; however, our methods easily generalize to subgraphs from any number of classes.

In this work, we propose to characterize the different classes through the attributes that their subgraphs focus on. Intuitively, we assume the nodes in each subgraph share a subset of attributes in common (e.g. a circle of friends who go to the same school and play baseball). That is, members of a subgraph "click" together through a shared characterizing attribute subspace, called the focus attributes [28]. It is expected that out of a large number of attributes, only a few of them would be relevant for each subgraph.

Our main insight for comparing subgraphs is then that the subgraphs from different classes would exhibit different focus attributes. In other words, the attributes that characterize the subgraphs of one class are different from the attributes around which the subgraphs from another class center. A stereotypical example of this insight is teenagers focusing on attributes such as 'selfies' and 'partying' whereas the elderly are characterized by 'knitting' and 'tea partying'. Note that even though classes might share common attributes, we aim to identify those that are exclusive and not the overlapping ones, as those best help characterize the differences. Figure 1 presents a complete sketch of our problem.

Figure 1: Problem sketch on toy data. Given (b) node-attributed subgraphs (or (a) nodes around which we extract subgraphs) from different classes (A and B), we find (c) the characterizing subspace (i.e., the focus attributes and respective weights) for each subgraph, and (d) split and rank the attributes for characterizing and comparing the classes. Panels: (a) attributed graph, (b) class A subgraphs and class B subgraphs, (c) characterizing subspaces, (d) assignment & ranking.

A vast body of methods for community detection has been proposed for both simple [4,8,16,26,36] as well as attributed graphs
[2,12,13,15,23,25,28,33,37]. Those are primarily concerned with extracting disjoint or overlapping groups of nodes, while optimizing some graph clustering objective. Our problem is considerably different. Unlike them, our goal is to understand the differences between distinct classes of subgraphs (or communities) through the attributes that characterize them, not to extract better ones.

Similar studies have been done in characterizing and comparing the social media use of different classes of subjects. For example, features from a user's social media interactions have been shown to predict demographic information such as gender [5], age [29,31], occupation [30], location [9,18], and income [10]. More nuanced traits have also been predicted about individuals, such as personality [32], or mental illness [6]. However, these studies tend to focus solely on text attributes and do not consider broader levels of social interactions in a network.

A recent work along the same lines as ours is by DellaPosta et al. [7], which studied network effects for explaining how political ideology becomes linked to lifestyles, such as "latte liberals" and "bird-hunting conservatives". Their simulated models reveal strong indications for influences operating between individuals in political "echo chambers" rather than within individuals, demonstrating evidence toward "one is the company they keep", i.e., that social ties matter.

In this work, we analyze the differences between individuals from different classes. Unlike previous work which has focused primarily on the individual's attributes (mostly text), we use local communities around individual nodes in addition to attributes to characterize them. Specifically, our contributions include the following:

• We introduce the general characterization problem for a given collection of attributed subgraphs from different classes, which leverages both the structure of social ties as well as the attributes. Our formulation entails partitioning the attributes into as many groups as the number of classes, while maximizing the total attributed quality score of the input subgraphs (§3).
• We show that our attribute-to-class assignment problem is NP-hard and an optimal (1−1/e)-approximation algorithm exists (§4.1). We also propose two different faster heuristics that are linear-time in the number of attributes and subgraphs (§4.2).
• Through extensive experiments, we compare the performance of the algorithms, present findings that agree with human intuition on real-world scenarios from 3 datasets, and demonstrate that our characterization approach is better suited to sense-making than discriminative classification approaches (§5).

2 PROBLEM DEFINITION

In this section, we introduce the notation used throughout the text and present the formal problem definition. An attributed graph $G = (V, E, A)$ is a graph with node set $V$, undirected edges $E \subseteq V \times V$, and a set of attributes $A = \{a_1, \ldots, a_d\}$ associated with every node, where $a_i \in \mathbb{R}^d$ denotes the $d$-dimensional attribute vector for node $i$. In this work we consider real-valued and binary attributes. A categorical attribute can be transformed to binary through one-hot encoding.

Given a collection of attributed subgraphs from $c$ classes, our aim is to split the attributes in $A$ into $c$ disjoint groups such that the total quality score $Q$ of all the subgraphs based on function $q(\cdot)$ and their assigned attributes is maximized. Here we use the normality measure [27] for $q(\cdot)$, which can be replaced with any other measure of interest that can utilize both graph structure and attributes in general.

For simplicity, our problem statement is given for two classes as follows; it can be generalized to multiple classes.

Definition 2.1 (Characterization Problem).
Given
• $p$ attributed subgraphs $g_1^+, g_2^+, \ldots, g_p^+$ from class 1,
• $n$ attributed subgraphs $g_1^-, g_2^-, \ldots, g_n^-$ from class 2,
from graph $G$, and attribute vector $a \in \mathbb{R}^d$ for each node;
Find
• a partitioning of attributes to classes as $A^+$ and $A^-$, where $A^+ \cup A^- = A$ and $A^+ \cap A^- = \emptyset$,
• focus attributes $A_i^+ \subseteq A^+$ (and respective weights $w_i^+$) for each subgraph $g_i^+$, $\forall i$, and
• focus attributes $A_j^- \subseteq A^-$ (and respective weights $w_j^-$) for each subgraph $g_j^-$, $\forall j$;
such that
• total quality $Q$ of all subgraphs is maximized, where $Q = \sum_{i=1}^{p} q(g_i^+ \mid A^+) + \sum_{j=1}^{n} q(g_j^- \mid A^-)$;
Rank attributes within $A^+$ and $A^-$.

The above problem contains three subproblems, in particular, (P1) how to measure the quality of an attributed subgraph, (P2) how to find the focus attributes (and their weights) of a given subgraph, and (P3) how to assign and rank the attributes for different classes so as to maximize total quality. In practice, classes focus on a small set of attributes. Further, our ranking of the attributes ensures those irrelevant to both classes and those common between them are ranked lower and only a few of the most differentiating attributes stand out. Figure 1 shows an example for our problem for 5 subgraphs from 2 classes, where 6 attributes are split into two and ranked for characterizing and comparing the classes.

In the next section, we address the subproblems in the given order above, in §3.1 through §3.3 respectively, to build up a solution for our main problem statement.

3 FORMULATION

3.1 Quantifying Quality

To infer the characterizing subspace for a given subgraph, we use a measure of subgraph quality. The idea is to find the attribute subspace and respective weights that maximize the quality of each subgraph. In this work, we use the normality measure [27], which not only utilizes both subgraph structure and attributes, but also quantifies both internal and external connectivity of the subgraph.

For a given subgraph $g$, its normality $N(g)$ is given as in Eq. (1), where $W$ is the adjacency matrix, $k_i$ is node $i$'s degree, $\mathrm{sim}(\cdot)$ is the similarity function of attribute vectors weighted by $w_g$, $e$ is the number of edges, and $B(g)$ denotes the nodes at the boundary of the subgraph (for isolated subgraphs, $B(g)$ is empty). The two terms in (1) respectively quantify $g$ internally and externally: many, surprising, and highly similar connections inside $g$ increase internal quality, whereas if such edges are at the boundary, they decrease external quality. For technical details of normality, see [27].

$$
\begin{aligned}
N(g) = I + X &= \sum_{i \in g,\, j \in g} \Big( W_{ij} - \frac{k_i k_j}{2e} \Big)\, \mathrm{sim}(a_i, a_j \mid w_g) \\
&\quad - \sum_{\substack{i \in g,\, b \in B(g) \\ (i,b) \in E}} \Big( 1 - \min\Big(1, \frac{k_i k_b}{2e}\Big) \Big)\, \mathrm{sim}(a_i, a_b \mid w_g) \\
&= w_g^T \cdot (a_I + a_X)
\end{aligned}
\tag{1}
$$

One can handle highly heterogeneous attributes simply by choosing the right $\mathrm{sim}(\cdot)$ function. Also note that $a_I$ and $a_X$ are vectors that can be directly computed from data. Attributes with large non-zero weights in $w_g$ are called the focus attributes of subgraph $g$.

3.2 Discovering Characterizing Subspaces

For a subgraph we can use Eq. (1) to compute its normality provided $w_g$, the weights for the (focus) attributes. However, the focus is often latent and hard to guess without prior knowledge, especially in high dimensions where nodes are associated with a long list of attributes. Even if the focus is known a priori, it is hard to manually assign weights.

Instead, we infer the attribute weight vector for a given subgraph, so as to maximize its normality score. In other words, we leverage normality as an objective function to infer the best $w_g$ for a given subgraph $g$. This objective is written as

$$
\max_{w_g}\; w_g^T \cdot (a_I + a_X) \quad \text{s.t.}\quad \|w_g\|_p = 1,\;\; w_g(a) \ge 0,\;\; \forall a = 1 \ldots d
\tag{2}
$$

Note that $w_g$ is normalized to its $p$-norm to restrain the solution space. We also introduce a non-negativity constraint on the weights to facilitate their interpretation. In the following we let $\hat{x} = (a_I + a_X)$, where $\hat{x}(a) \in [-1, 1]$.

If one uses $\|w_g\|_{p=1}$, or the $L_1$ norm, the solution picks the single attribute with the largest $\hat{x}$ entry as the focus. That is, $w_g(a) = 1$ where $\max(\hat{x}) = \hat{x}(a)$ and 0 otherwise. This can be interpreted as the most important attribute that characterizes the subgraph. Note that $\hat{x}$ may contain only negative entries, in which case the largest (least negative) entry is selected, and the subgraph is deemed low quality.

If there are multiple attributes that can increase normality, we can also select all the attributes with positive entries in $\hat{x}$ as the subgraph focus. The weights of these attributes, however, should be proportional to the magnitude of their $\hat{x}$ values. This is exactly what $\|w_g\|_{p=2}$, or the $L_2$ norm, yields. It is shown (see [27]) that under $p = 2$,

$$
w_g(a) = \frac{\hat{x}(a)}{\sqrt{\sum_{\hat{x}(i) > 0} \hat{x}(i)^2}},
\tag{3}
$$

where $\hat{x}(a) > 0$ and 0 otherwise, such that $w_g$ is unit-normalized. The normality score of subgraph $g$ then becomes

$$
N(g) = w_g^T \cdot \hat{x} = \sum_{\hat{x}(a) > 0} \frac{\hat{x}(a)}{\sqrt{\sum_{\hat{x}(i) > 0} \hat{x}(i)^2}}\, \hat{x}(a) = \sqrt{\sum_{\hat{x}(i) > 0} \hat{x}(i)^2} = \|\hat{x}_+\|_2,
$$

i.e., the 2-norm of $\hat{x}$ induced on the attributes with positive entries.

3.3 Identifying Class Differences

3.3.1 Splitting attributes between classes. In this last part we return to our main problem statement, where we seek to split the attribute space between different classes so as to be able to identify their differences. We aim to obtain such an assignment of attributes with the goal of maximizing the total quality (i.e., normality) of all the subgraphs from both classes. This ensures that the subgraphs are still characterized well, even under the constraint that the attributes are not shared across classes.

Let $S^+ = \{g_1^+, \ldots, g_p^+\}$ and $S^- = \{g_1^-, \ldots, g_n^-\}$ denote the sets of all subgraphs in class 1 and class 2, respectively, where each subgraph is now associated with a $d$-dimensional non-negative vector $x$. This is the same as the $\hat{x}$ vector introduced in §3.2, except that all the negative entries are set to zero. Recall that the entries of $\hat{x}$ depict the contribution of each attribute to the quality of the subgraph. Therefore, we can drop the negative entries (recall that the optimization in (2) selects only the positive entries, if any).²

² There may be subgraphs for which $\hat{x}$ contains only negative entries. We exclude such subgraphs from the study of discovering class differences, as they are deemed low quality.

The goal is then to find two disjoint attribute groups $A^+$ and $A^-$, $A^+ \cup A^- = A$ and $A^+ \cap A^- = \emptyset$, such that the total quality of all subgraphs is maximized (see problem statement in §2). Given a set of selected attributes $S$, the quality of a subgraph $g$ can be written as

$$
N(g \mid S) = \sqrt{\sum_{a \in S} x(a)^2} = \|x[S]\|_2
\tag{4}
$$

i.e., the 2-norm of $x$ induced on the attribute subspace. Therefore, the overall problem can be (re)formulated as

$$
\max_{A^+ \subseteq A,\; A^- \subseteq A}\;\; \frac{1}{p} \sum_{i \in S^+} \|x_i[A^+]\|_2 + \frac{1}{n} \sum_{j \in S^-} \|x_j[A^-]\|_2 \quad \text{such that}\quad A^+ \cap A^- = \emptyset
\tag{5}
$$

Note that we normalize the terms by the number of subgraphs in each class to handle class imbalance. We also emphasize that our objective in (5) is different from a classification problem in two key ways. First, we work with $x$ vectors that embed information on subgraph connectivity as well as focus attributes rather than the original attribute vectors $a$. Second, our objective embraces characterization and aims to find a partitioning of attributes that maximizes total quality, which is different from finding a decision boundary that minimizes classification loss as in discriminative approaches (see §5).

3.3.2 Ranking attributes. A solution to (5) (next section) provides a partitioning of the attributes into two groups. We can analyze the specific attributes assigned to classes to characterize their differences. Since this is an exploratory task, analyzing a large number of attributes would be infeasible. For easier interpretation, we need a ranking of the attributes.

One could think of using $\sum_{i \in S^{(c)}} N(g_i \mid a \in A^{(c)})$ for scoring each attribute $a$. This however does not reflect the differentiating power but only the importance of $a$ for class $c$. We want both important and differentiating attributes to rank higher as they truly characterize the difference between subgraphs of the two classes.

Specifically, some attributes may exhibit positive $x$ entries for a particular class, but with very small values, indicating only slight relevance. We may also have some attributes that exhibit large positive $x$ entries, but for both classes. While relevant, such attributes are non-differentiating and would be uninformative for our task.

To get rid of only slightly relevant or non-differentiating attributes and obtain a sparse solution, we define a relative contribution score $rc(\cdot)$ for each attribute $a$ as

$$
rc(a) = \frac{1}{p} \sum_{i \in S^+} x_i(a) - \frac{1}{n} \sum_{j \in S^-} x_j(a)
\tag{6}
$$

which is the difference between $a$'s contribution alone to the average quality of subgraphs in class 1 and class 2. We then rank the attributes within each class by their $rc$ values.

4 ALGORITHMS

4.1 Optimal Approximation

It is easy to show that our quality function $N(g \mid S) = \|x[S]\|_2$ in Eq. (4) is a monotone submodular set function with respect to $S$ for non-negative $x$. That is, the quality of a subgraph increases monotonically with increasing set size $|S|$. In addition, the increase follows the diminishing returns property known in economics, i.e., adding a new attribute $a$ to a set $S$ increases the function no more than adding the same attribute to its smaller subset $S'$: $N(g \mid a \cup S) - N(g \mid S) \le N(g \mid a \cup S') - N(g \mid S')$, $S' \subseteq S \subseteq A$.

Under this setting, we find that our problem in (5) can be stated as an instance of the Submodular Welfare Problem (SWP), which is defined as follows.

Definition 4.1 (Submodular Welfare Problem). Given $d$ items and $m$ players having monotone submodular utility functions $w_i : 2^{[d]} \to \mathbb{R}_+$, find a partitioning of the $d$ items into $m$ disjoint sets $I_1, I_2, \ldots, I_m$ in order to maximize $\sum_{i=1}^{m} w_i(I_i)$.

In our formulation items map to the attributes for $d = |A|$, whereas players correspond to the classes, in the simplest case for $m = 2$. In addition, the utility function is written for each class $c \in \{+, -\}$ as

$$
w_c(I_c) = N(S^{(c)} \mid A^{(c)}) = \frac{1}{n^{(c)}} \sum_{k \in S^{(c)}} \|x_k[A^{(c)}]\|_2
\tag{7}
$$

which is the average normality score of the subgraphs $S$ belonging to class $c$. As $\|\cdot\|_2$ is a monotone and submodular function, so is $N(S^{(c)})$ since the sum of submodular functions is also submodular [21]. Note that although we focus on two classes in this work, the SWP is defined more generally for $m$ players, i.e., classes. As such, it is easy to generalize our problem to more classes following the same solutions introduced for the SWP.

The SWP was first studied by Lehmann et al. [19], who proposed a simple online greedy algorithm that gives a 1/2-approximation for this problem. Later, Vondrák proposed an improved (1−1/e)-approximation solution [35]. Khot et al. showed that the SWP cannot be approximated to a factor better than 1−1/e, unless P=NP [17]. Mirrokni et al. further proved that a better than (1−1/e)-approximation would require exponentially many value queries, regardless of P=NP [24]. As such, Vondrák's solution is the optimal approximation algorithm for the SWP, which we use to solve our problem in (5). The solution uses a multilinear extension to relax the subset optimization into a numerical optimization problem such that advanced optimization techniques, in particular a continuous greedy algorithm, can be applied. The continuous solution is then rounded to obtain a near-optimal set with the aforementioned guarantee [35].

4.2 Faster Heuristics

4.2.1 Pre-normalized weights. For the formulation shown in (5), we unit-normalize the attribute weights as in Eq. (3), only based on the selected subset $S$: $w_g(a) = x(a) / \sqrt{\sum_{a \in S} x(a)^2}$. This normalization yields the quality function $N(g \mid S) = \sqrt{\sum_{a \in S} x(a)^2}$, and requires that $S$ is given/known. A way to simplify this function is to fix the attribute weights at $w_g(a) = x(a) / \sqrt{\sum_{a \in A} x(a)^2}$, i.e., to normalize them based on all the (known) positive attributes in $A$ rather than the (unknown) subset. This way the attribute weights can be pre-computed and do not depend on the to-be-selected attribute subsets. The simplified version of the maximization in (5) is then written as

$$
\max_{A^+ \subseteq A,\; A^- \subseteq A}\;\; \frac{1}{p} \sum_{i \in S^+} \sum_{a \in A^+} \frac{x_i(a)^2}{D_i} + \frac{1}{n} \sum_{j \in S^-} \sum_{a \in A^-} \frac{x_j(a)^2}{D_j} \quad \text{such that}\quad A^+ \cap A^- = \emptyset
\tag{8}
$$
where the denominator $D_i = \sqrt{\sum_{a \in A} x_i(a)^2} = \|x_i\|_2$, which can now be treated as constant as it does not depend on $A^+$ (same for $D_j$).

The simplified function $N(g \mid S) = \frac{\|x[S]\|_2^2}{\|x\|_2}$ is now a monotone modular function with respect to $S$. The contribution of a particular new attribute to the quality of a subgraph no longer depends on the other attributes that are already in the selected set. That is, $N(g \mid a \cup S) - N(g \mid S) = N(g \mid a \cup S') - N(g \mid S') = \frac{x(a)^2}{\|x\|_2}$, $\forall S, S' \subseteq A$.

As a result, a simple linear-time algorithm can be employed to solve the objective in (8). The algorithm iterates over the attributes (order does not matter), and assigns each attribute $a$ to the class $c$ for which the average subgraph quality is improved more than others, that is, $\arg\max_c \frac{1}{n^{(c)}} \sum_{k \in S^{(c)}} \frac{x_k(a)^2}{\|x_k\|_2}$, breaking ties arbitrarily.

While the objective values of the solutions to (5) and (8) are likely to differ due to the difference in computing the attribute weights, the weight normalization does not change the order of the attributes by importance within a given set. Therefore, we conjecture that the two solutions will perform similarly, which we investigate through experiments in §5.

4.2.2 Top-k attributes per class. For exploratory tasks, such as understanding the class differences via characterizing attribute subspaces, it would be most interesting to look at the top $k$ most important attributes from each class. We also expect each subgraph to exhibit only a handful of focus attributes (experiments on real-world graphs confirm this intuition). Therefore, limiting the analysis to a top few attributes would be sufficient.

For a given (small) $k$, finding the top $k$ attributes $A_{*k}$ that maximize $N(g \mid A_{*k})$ for a single subgraph $g$ is easy: that would be the $k$ attributes with the largest values in $g$'s $x$ (see Eq. (4)). However, we have a multi-criterion objective, with a goal to find the top $k$ attributes that maximize the total normality for all subgraphs from a class at once rather than a single one, that is $N(S^{(c)} \mid A^{(c)}_{*k})$ (see Eq. (7)).

The multi-criterion optimization problem is NP-hard [21]. On the other hand, we know that $\|\cdot\|_2$ is a monotone submodular function, and so is the class quality function $N(\cdot) = \sum \|\cdot\|_2$. As such, we find the top $k$ attributes for each class separately, using the lazy greedy hill-climbing algorithm introduced in [21]. Since these (separate) solutions may end up having common attributes, we resolve the solutions by assigning each common attribute only to the class for which its average contribution is higher (the individual terms in Eq. (6)). The search repeats until each class gets assigned $k$ unique attributes. Finally, $k$ is not a critical parameter to set, but rather can be chosen interactively.

5 EVALUATION

We evaluate our approach to the characterization problem and proposed algorithms on both synthetic and real-world datasets. Our goal is to answer the following questions:

• How do the proposed algorithms perform and compare to each other? What is their scalability and runtime?
• Are the findings on real-world data meaningful?
• How does characterization compare to classification?

5.1 Analysis on Synthetic Datasets

Through synthetic data experiments, our goal is to compare the algorithmic and computational performance of the proposed algorithms, respectively in terms of objective value achieved and running time. Specifically, we compare:

• SWA (Submodular Welfare Algorithm, §4.1)
• Simplified (with pre-normalized weights, §4.2.1)
• Top-k (§4.2.2)

We generate the $x$ vectors for $p = n = 100$ subgraphs each from $c = 2$ classes, while varying the number of attributes $d$. The $x(a)$ values of each feature $a$ for subgraphs from class $c$ are drawn from a Normal distribution with a distinct $\mu_a^c$ and $\sigma_a^c$. The $\mu_a^c$'s themselves are drawn from a zero-mean unit-variance Normal (note that those attributes with negative mean tend to be less relevant for the class). The $\sigma_a^c$'s are randomly drawn from a $[0, 1]$ uniform distribution.

Algorithmic performance. In the first experiment, we test the optimality of the algorithms by comparing their achieved objective value to that of brute-force where we try all possible partitionings to identify the optimal solution. We experiment with $d = \{3, 4, \ldots, 20\}$ as brute-force is not computationally feasible for more than 20 attributes, and $k = \{3, 5\}$ for Top-k.

In Figure 2 (left) we report the ratio between each test algorithm's objective value and the optimal as found by brute-force (ratio 1 means they are equal) with varying attribute size. The results are averaged over 10 random realizations of the synthetic datasets as described above. We notice that SWA achieves near-optimal performance throughout, which Simplified catches up with as the number of attributes $d$ increases. Top-k loses optimality as $k$ becomes smaller compared to $d$, where the decline is faster for smaller $k$.

Figure 2 (center) is similar, where we compare performances under larger attribute sizes $d = \{50, 100, \ldots, 1000\}$. As $d$ is larger, we also use larger $k = \{5, 25, 50\}$. As brute-force cannot be computed in reasonable time, we report ratios w.r.t. the maximum objective value achieved among the tested algorithms. We find that Simplified achieves near-identical performance to SWA. Again, the ratios of Top-k methods drop as the attribute space grows. Interestingly, Top-50 (out of 1000) attributes from each of both classes yield 64.78% of the maximum objective value.

Figure 2: Ratio of objective value achieved by each test algorithm (left) to optimal value found by brute-force ($d = 3, 4, \ldots, 20$) and (center) to maximum achieved value ($d = 50, 100, \ldots, 1000$). (right) Performance of Simplified degrades as it assigns nearly all attributes to the class with higher expected $x$ values, ignoring the diminishing returns property of the objective function, whereas SWA remains near-optimal under all settings. All results are averaged over 10 random datasets. Axes: ratio to optimal (or maximum) objective value vs. number of attributes $d$ (left, center) and vs. $P$ (right).

While it may appear that Simplified performs as well as SWA, we can show that under certain conditions where the diminishing returns property of submodular problems plays a major role, it becomes inferior to SWA. To show such a setting, we design an experiment where the $x(a)$ values of an attribute $a$ are drawn uniformly from $[P, 1]$ for class 1, and from $[0, 1-P]$ for class 2 as we decrease $P$ from 0.95 to 0.05. Note that as the ranges (and hence the variance) of the values increase, the expected value of every attribute remains higher for class 1. The results are shown in Figure 2 (right). For large $P$, the values for class 1 are significantly larger and both algorithms assign all attributes to class 1. As the ranges start overlapping and the expected values get closer, Simplified continues to assign all attributes to class 1 (with higher expected value) even though the marginal increase to the objective value decreases significantly as we go on due to diminishing returns. As the variance gets even larger, Simplified again performs similarly to SWA as it starts assigning some attributes to class 2 due to the random variation. Arguably, it is unlikely to encounter this setting in real-world datasets, where there exist many similarly-distributed attributes for sufficiently different classes of subgraphs.

Computational performance. Finally, we compare the proposed algorithms in terms of their running time and scalability, as the number of attributes grows. Figure 3 shows runtime in seconds for $d = \{50, 100, \ldots, 1000\}$. We note that all the algorithms scale near-linearly. SWA has the largest slope with increasing $d$, while finishing under 8 seconds for $d = 1000$ and $p = n = 100$ subgraphs from two classes. The scalability of Top-k depends on $k$, and degrades as $k$ increases. The Simplified heuristic lies at the bottom and is reliably one of our fastest methods.

Figure 3: Running time (seconds) with increasing number of attributes. All algorithms scale near-linearly with $d$. Curves: SWA, Simplified, Top-5, Top-25, Top-50.

Overall, SWA and Simplified work best on all datasets. Simplified can be parallelized easily, as each attribute is processed independently. For massive datasets, one can also fall back to Top-k, which is capable of identifying the few key attributes for characterization.

5.2 Analysis on Real-world Datasets

For real-world data analysis we consider attributed graphs where nodes are assigned class labels. We study the class differences of nodes by the "company that they keep". That is, we characterize each node with a local community surrounding them, using the local community extraction algorithm of Andersen et al. [3]. One can also use ego-networks, where a node is grouped with all its immediate neighbors.

We report the top-10 attributes by relative contribution in (6) per class side by side for comparison. To be precise, we randomly sample 90% of our subgraphs 100 times and present the average relative contribution (bars) and standard deviation (error bars) so as to ensure that our results are not an artifact of the subgraphs at hand.

We experiment with 3 real-world attributed networks: (i) bill co-sponsorships of Congressmen [11], (ii) co-purchase network of Amazon videos [20], and (iii) DBLP co-authorship network. We describe the individual datasets and present our findings next.

Congress. We consider 8 co-sponsorship networks from the 103rd Congress to the 110th. The nodes are congressmen. An edge depicts co-sponsorship of a bill by two congressmen, and the edge weight is the number of times two nodes sponsored a bill together. Each bill is assigned a phrase that describes its subject, with a total of 32 such phrases. We mirror these bill subjects to their sponsors to create node attributes. The networks are highly dense, so we remove low-weighted edges such that the size of the giant connected component maintains more than 95% of its original size.

Figure 4 presents the top-ranking attributes among two classes, Democrats and Republicans (averaged over 8 congresses in the dataset). As expected, Democrats have a liberal agenda centered upon social and environmental programs, while Republicans mainly focus on regulating government, immigration and financial issues.

Figure 4: Top 10 attributes in ranked order for Democrats and Republicans in Congress. Characterizing attributes reveal the contrast between liberal and conservative ideas. Democrats: health, families, education, commerce, housing, employment, emergencies, foreign trade, environment, criminal law. Republicans: government, taxation, law, employment, public works, natural resources, congress, finance, commerce, immigration.

Since the Congress dataset is temporal, we can also explore how the focus of the two parties changes on a particular subject over time. A clear example of this is bills on "National Security and Armed Forces". Figure 5 shows the average contribution of this attribute for both parties (individual terms in Eq. (6)) over the years. Starting with the US conflict with Iraq, this attribute seizes Republicans' attention, and it continues to grow after the 9/11 attacks and the beginning of the war in Afghanistan. It reaches its peak during the start of the war in Iraq and then starts to lose attention towards the last years when the US troops are withdrawn from the Middle East. This abnormal change in interest in national security and armed forces is especially interesting since before and after the years of international crisis, this attribute has close to zero attention, and even achieves negative values during the last years as a Democrat attribute, which indicates it is not characteristic of either of the parties.

Figure 5: Change of focus on attribute "National Security and Armed Forces" among Democrats and Republicans. We observe increased interest by Republicans in time of war. Panel title: Party focus on armed forces; x-axis: 1993 to 2007; annotated events: Bombing of Iraq, War in Afghanistan, War in Iraq.

Amazon. This network contains 4011 nodes, 9487 edges, and 903 attributes. Nodes are videos from amazon.com, and edges depict co-purchase relations between two videos, indicating that they are frequently bought together. Attributes range from describing the video genre such as "Comedy" and "Drama", to the age-range of the audience intended for the videos such as "7-9 Years", popular franchises like "SpongeBob Series", the form of the videos like "Animation", and the device it comes in such as "VHS" or "DVD".

We have experimented with two scenarios on Amazon to showcase the strength of our method in characterizing different classes. We set semantically different attributes as classes and use the rest for characterization. The specific queries are (1) Animation vs. Classics and (2) Videos for Under 13 years old vs. Videos for Over 13 years old. The Under 13 class consists of the videos exhibiting the attributes "Birth-2 Years", "3-6 Years", "7-9 Years" and "10-12 Years". The rest of the videos belong to the Over 13 class.

Figure 6a and 6b respectively show the top-10 attributes per class as ranked by our method on these two scenarios. We find that "Kids & Family" and age groups "3-12 Years" are key characterizing attributes for Animation. "Warner Videos" and "Cartoon Network" are also among the top attributes. Perhaps surprisingly, "Christian" videos and "Bible" stories follow the above. On the other hand, we note genre-related attributes that truly define Classics, such as "Performing Arts", "Comedy", and "Musicals".

For the second scenario, we observe "Kids & Family" and "Animation" attributes to mostly characterize the Under 13 videos. In contrast, the characterizing attributes for Over 13 are those that cannot really be consumed by children, including "Comedy", "Fitness", and "Documentary" videos.

Figure 6: Characterization vs. Classification: Logistic Regression (LR) prefers infrequent attributes that discriminate well. Our proposed method discovers subspaces that characterize the data in a more natural way. Panels: (a), (b) Ranking by relative contribution (Proposed); (c), (d) Ranking by coefficients (LR).

DBLP. This network contains 134K nodes, 1.478M edges and 2K attributes. Nodes are computer scientists and links are co-authorship relations. Attributes are computer science conferences and journals, where a node exhibits an attribute if s/he has at least one publication in the venue.

The classes are ICC, a conference on communications, vs. ICASSP, a conference on speech and signal processing. We randomly sample 100 nodes from all nodes of each class and find subgraphs around them, to maintain a manageable set of subgraphs. Figure 7 shows the top-10 attributes for the two classes. We find that attributes for ICC revolve around networking, communications and mobile technologies, including "INFOCOM", "GLOBECOM" and "PMIRC", while attributes for ICASSP are conferences on speech, video and image processing and linguistics, including "INTERSPEECH", "ICIP" and "EUSIPCO".

Figure 7: Top 10 attributes in ranked order for ICC and ICASSP conferences in DBLP.

We see this behavior again in Figure 6d, where LR ranks rare attributes highly (such as "Charlie Brown" and "Mary-Kate & Ashley") above more frequent attributes which are quite discriminative ("Kids & Family" and "Animation"). In contrast to LR, our proposed method ranks attributes by their contribution across subgraphs, finding a subspace of attributes which better characterizes the input subgraphs.

Intuitively, when an attribute is present in many of the nodes that belong to a class, it is considered to be a characterizing attribute of that class. On the other hand, when observing an attribute at a node indicates a high probability of the node belonging to a particular class, then the attribute is a discriminative one for that class. To quantify these, we use class support and confidence, metrics commonly used in association rule mining [1].

Let $\#(c, a)$ denote the number of nodes in class $c$ with attribute $a$, $\#(a)$ the total number of nodes with attribute $a$, and $\#(c)$ the total number of nodes from class $c$; then:

• Confidence (C): probability of belonging to class $c$ when attribute $a$ is observed in some node: $\mathrm{Cfd}(c, a) = \Pr(c \mid a) = \frac{\#(c, a)}{\#(a)}$
• Class Confidence (CC): probability of belonging only to class $c$, when attribute $a$ is observed in some node: $\mathrm{CC}(c^+, a) = \Pr(c^+ \mid a) - \Pr(c^- \mid a)$
• Support (S): percentage of nodes in class $c$ that exhibit attribute $a$: $\mathrm{Sup}(c, a) = \frac{\#(c, a)}{\#(c)}$
#(c) • ClassSupport(CS):differenceofsupportforabetween classes:CS(c+,a)=Sup(c+,a)−Sup(c−,a) 5.3 Characterizationvs.Classification Asweseekdistinctsubspaces,weonlyusetherelativemetrics, Hereweexaminethedifferencesbetweencharacterizationanddis- i.e.CCandCS.Ideallyhavinganattributehighonbothmetricsis criminativeclassification.Similartoourmethod,asparsesolution best,howeverthiscasehappensrarely.Anattributewithhighclass fromaregularizedclassifierwillcontainasubspaceoftheattributes, supportcanbeconsideredasagoodrepresentativeofaclasswhile whichcanberankedusinganattributeimportancescorefromthe anattributewithhighclassconfidencecanbeusedforclassification model.Suchregularizedsparsemethodsarepopularapproaches purposes.Tomeasuretheaveragecharacterizationoftheattributes forexploratorydataanalysisandformthefoundationsofmany assignedtoagivenclass,wehave: intFeroprrtehtiasbcloemmpoadriesloinng,wmeeuthseodLso.gisticRegression(LR)withLASSO CS(c,A(c))= (cid:205)a∈A(c)waCS(c,a), (9) regularization[34]tolearnasparsesolution. WefirsttrainaLR (cid:205)a∈A(c)wa modelforbinaryclassificationbetweenclassesc+ andc− using whereCSistheweightedaverageofCSoverallattributesassigned rawnodeattributesasinput. Afterclassification,weusetheLR toclassc. 
Weightwa hereisthemetricthatweuseforranking coefficientstopartitiontheattributesbetweentheclasses,assign- attributesincorrespondingmethods,whichistherelativecontribu- ingthosewithpositivecoefficientstoc+andnegativeonestoc−, tionforourproposedapproachandtheabsolutecoefficientvalues andrankbytheirmagnitude.Toovercomeclassimbalanceinthe forLR.Likewise,tomeasurethetotaldiscriminationofasetof dataset,weoversamplefromthesmallclasstomaketheclasssizes attributes,wehave: etiqmuFeaisglutaornede6litmhcoeinmnapdtaeorttehhseethecffelaetscostpifio-rcfaasntuikocihnn.gsWaamtteprrilebinpugete.astothbitsaipnreodcewdiuthreo1u0r CC(c,A(c))= (cid:205)a∈(cid:205)A(ac)∈wA(ac)CwCa(c,a), (10) methodtothosefoundthroughLR.WeseethatLRprefersinfre- whereagain,CC istheweightedaverageofCCforallattributes quent,buthighlydiscriminatingattributes.Forexample,consider assignedtoclassc. thedifferenceinattributesfoundbetweenourmethodandLRfor Figure8presentsbothmeasuresforthetworankingmethods theAnimationvs.Classicsclassificationtask(Fig.6aand6c).Here, forthereal-worldscenarios.OurmethodoutperformsLRw.r.t.the LRcompletelyfailstoassignhighweighttotwoattributes(“3-6 characterizationaspect(CS).Thisistobeexpected–asourmethod Years”and“10-12Years”)thatarebothveryprevalentinthedataset. searchesforattributespresentinafocusedsubspaceacrossmany Insteaditsthirdstrongestattributeis“Dr.Seuss”,whichisperfectly subgraphs of a class and ranks them accordingly. Surprisingly, discriminative(allitemswiththisattributeareAnimation),butis innearlyallcases,thesubspaceswefindalsohaveacomparable presentinonly4%ofthenodes.Thisisaclearexampleofsacrificing discriminativepower(CC)toLR.TheUnder13casewherewehave characterizationfordiscrimination. lowCCiswhenmostdiscriminatingattributes(asrankedhighby TiesThatBind-CharacterizingClassesbyAttributesandSocialTies WWW’17,April2017,Perth,Australia 1 [6] M.D.Choudhury,M.Gamon,S.Counts,andE.Horvitz.Predictingdepression port Proposed viasocialmedia.InICWSM,2013. 
up LR [7] D.DellaPosta,Y.Shi,andM.Macy. Whydoliberalsdrinklattes? American S0.5 JournalofSociology,120(5):1473–1511,2015. ss [8] I.Dhillon,S.Mallela,andD.Modha. Information-theoreticco-clustering. In Cla KDD,2003. 0 [9] J.Eisenstein,B.O’Connor,N.A.Smith,andE.P.Xing.Alatentvariablemodel AnimationClassics Under 13 Over 13 ICASSP ICC forgeographiclexicalvariation.InEMNLP,pages1277–1287,2010. ce 1 [10] L.Flekova,L.Ungar,andD.Preoctiuc-Pietro.Exploringstylisticvariationwith en ageandincomeontwitter.InACL,2016. d [11] J.H.Fowler. Legislativecosponsorshipnetworksintheushouseandsenate. onfi0.5 SocialNetworks,28(4):454–465,2006. C [12] J.Gao,F.Liang,W.Fan,C.Wang,Y.Sun,andJ.Han.Oncommunityoutliersand ss theirefficientdetectionininformationnetworks.InKDD,pages813–822,2010. Cla 0 [13] S.Gu¨nnemann,I.Fa¨rber,B.Boden,andT.Seidl.Subspaceclusteringmeetsdense AnimationClassics Under 13 Over 13 ICASSP ICC subgraphmining:Asynthesisoftwoparadigms.InICDM,2010. [14] S.Han,B.-Z.Yang,H.R.Kranzler,X.Liu,H.Zhao,L.A.Farrer,E.Boerwinkle, J.B.Potash,andJ.Gelernter.IntegratingGWASsandhumanproteininteraction Figure8:CSandCCforourmethodandLR.Ourmethodal- networksidentifiesagenesubnetworkunderlyingalcoholdependence. The wayshassuperiorAverageClassSupportwhencomparedto AmericanJournalofHumanGenetics,93(6):1027–1034,2013. LR,andalsohascompetitiveAverageClassConfidence. (LR [15] P.Iglesias,E.Mu¨ller,F.Laforet,F.Keller,andK.Bo¨hm.Statisticalselectionof isexplicitlyoptimizingfordiscrimination,soitisexpected congruentsubspacesforoutlierdetectiononattributedgraphs.InICDM,2013. [16] G.KarypisandV.Kumar. Multilevelalgorithmsformulti-constraintgraph todobetterinthisregard) partitioning.InProc.ofSupercomputing,pages1–13,1998. [17] S.Khot,R.J.Lipton,E.Markakis,andA.Mehta. Inapproximabilityresults forcombinatorialauctionswithsubmodularutilityfunctions. Algorithmica, 52(1):3–18,2008. LR)areinfrequentandhencehavelowsupport.Infact,8/10oftop [18] V.Kulkarni,B.Perozzi,andS.Skiena. 
Freshmanorfresher?quantifyingthe attributesinFigure6dhaveCSvalue<0.1. geographicvariationoflanguageinonlinesocialmedia.InTenthInternational AAAIConferenceonWebandSocialMedia,2016. [19] B.Lehmann,D.J.Lehmann,andN.Nisan.Combinatorialauctionswithdecreas- 6 CONCLUSION ingmarginalutilities.InEC,pages18–28,2001. Studieshaveshownevidenceforcharacteristicdifferencesbetween [20] J.Leskovec,L.A.Adamic,andB.A.Huberman.Thedynamicsofviralmarketing. ACMTransactionsontheWeb(TWEB),1(1):5,2007. individualsofdifferentgenders,agegroups,politicalorientations, [21] J.Leskovec,A.Krause,C.Guestrin,C.Faloutsos,J.M.VanBriesen,andN.S. personalities,etc.Inthiswork,wegeneralizedandmathematically Glance.Cost-effectiveoutbreakdetectioninnetworks.InKDD,pages420–429, formalizedthecharacterizationproblemofattributedsubgraphs 2007. [22] A.Lewis,N.Jones,M.Porter,andC.Deane. Thefunctionofcommunitiesin fromdifferentclasses.Oursolutionisthroughalensintothenode proteininteractionnetworksatmultiplescales.BMCSystemsBiology,4(1):100, attributesaswellasthesocialtiesintheirlocalnetworks. We 2010. [23] B.Long,Z.Zhang,X.Wu,andP.S.Yu. Spectralclusteringformulti-type showedthatourproblem,ofpartitioningattributesbetweenclasses relationaldata.InICML,2006. soastomaximizethetotalqualityofinputsubgraphs,isNP-hard, [24] V.S.Mirrokni,M.Schapira,andJ.Vondra´k.Tightinformation-theoreticlower andthattheproposedalgorithmsfindnear-optimalsolutionsand boundsforwelfaremaximizationincombinatorialauctions.InEC,pages70–77, 2008. scalewellwiththenumberofattributes.Extensiveexperimentson [25] F.Moser,R.Colak,A.Rafiey,andM.Ester.Miningcohesivepatternsfromgraphs syntheticandreal-worlddatasetsdemonstratedtheperformance withfeaturevectors.InSDM,2009. ofthealgorithms,thesuitabilityofourapproachforqualitative [26] A.Y.Ng,M.I.Jordan,andY.Weiss. Onspectralclustering:Analysisandan algorithm.InNIPS,2001. exploratory analysis, and its advantage over discriminating ap- [27] B.PerozziandL.Akoglu.Scalableanomalyrankingofattributedneighborhoods. 
proaches. InSIAMSDM,2016. [28] B.Perozzi,L.Akoglu,P.IglesiasSa´nchez,andE.Mu¨ller.FocusedClusteringand OutlierDetectioninLargeAttributedGraphs.InKDD,pages1346–1355,2014. ACKNOWLEDGMENTS [29] B.PerozziandS.Skiena.Exactagepredictioninsocialnetworks.InWWW’15 This research is sponsored by NSF CAREER 1452425 and IIS Companion,pages91–92,2015. [30] D.Preo¸tiuc-Pietro,V.Lampos,andN.Aletras. Ananalysisoftheuseroccu- 1408287,DARPATransparentComputingProgramunderContract pationalclassthroughtwittercontent. TheAssociationforComputational No.FA8650-15-C-7561,AROYoungInvestigatorProgramunder Linguistics,2015. [31] D.Rao,D.Yarowsky,A.Shreevats,andM.Gupta.Classifyinglatentuserattributes ContractNo.W911NF-14-1-0029,andafacultygiftfromFacebook. intwitter.In2ndInternationalWorkshoponSearchandMiningUser-generated Anyconclusionsexpressedinthismaterialareoftheauthorsand Contents,pages37–44.ACM,2010. [32] H.A.Schwartz,J.C.Eichstaedt,M.L.Kern,L.Dziurzynski,S.M.Ramones, donotnecessarilyreflecttheviews,expressedorimplied,ofthe M.Agrawal,A.Shah,M.Kosinski,D.Stillwell,M.E.Seligman,andL.H.Un- fundingparties. gar. Personality,gender,andageinthelanguageofsocialmedia:TheOpen- Vocabularyapproach.PLoSONE,2013. [33] J.TangandH.Liu.Unsupervisedfeatureselectionforlinkedsocialmediadata. REFERENCES InKDD,pages904–912,2012. [1] R.Agrawal,T.Imielin´ski,andA.Swami.Miningassociationrulesbetweensets [34] R.Tibshirani.Regressionshrinkageandselectionviathelasso.Journalofthe ofitemsinlargedatabases.InSIGMOD,volume22,pages207–216.ACM,1993. RoyalStatisticalSociety.SeriesB,pages267–288,1996. [2] L.Akoglu,H.Tong,B.Meeder,andC.Faloutsos.PICS:Parameter-freeidentifi- [35] J.Vondra´k.Optimalapproximationforthesubmodularwelfareprobleminthe cationofcohesivesubgroupsinlargeattributedgraphs.InSIAMSDM,pages valueoraclemodel.InSTOC,pages67–74,2008. 439–450,2012. [36] S.WhiteandP.Smyth.Aspectralclusteringapproachtofindingcommunities [3] R.Andersen,F.R.K.Chung,andK.J.Lang. Localgraphpartitioningusing ingraph.InSDM,2005. 
pagerankvectors.InFOCS,pages475–486,2006. [37] Y.Zhou,H.Cheng,andJ.X.Yu.Graphclusteringbasedonstructural/attribute [4] A.Banerjee,S.Basu,andS.Merugu.Multi-wayclusteringonrelationgraphs.In similarities.PVLDB,2(1):718–729,2009. SIAMSDM,2007. [5] J.D.Burger,J.Henderson,G.Kim,andG.Zarrella.Discriminatinggenderon twitter.InEMNLP,pages1301–1309,2011.
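
Supplementary sketch. The class support and class confidence metrics of Section 5.3, together with the weighted averages of Eqs. (9) and (10), can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes exactly two classes, the input format (a list of `(label, attribute set)` pairs), all function names, the toy data, and the toy weights are hypothetical and chosen only for illustration.

```python
from collections import Counter

def count_nodes(nodes):
    """Tally #(c), #(a) and #(c, a) from (label, attribute_set) pairs.

    The input format is an assumption made for this sketch.
    """
    per_class, per_attr, per_class_attr = Counter(), Counter(), Counter()
    for label, attrs in nodes:
        per_class[label] += 1
        for a in attrs:
            per_attr[a] += 1
            per_class_attr[(label, a)] += 1
    return per_class, per_attr, per_class_attr

def class_support(counts, pos, neg, a):
    """CS(c+, a) = Sup(c+, a) - Sup(c-, a), with Sup(c, a) = #(c, a) / #(c)."""
    per_class, _, per_class_attr = counts
    return (per_class_attr[(pos, a)] / per_class[pos]
            - per_class_attr[(neg, a)] / per_class[neg])

def class_confidence(counts, pos, neg, a):
    """CC(c+, a) = Pr(c+ | a) - Pr(c- | a), with Pr(c | a) = #(c, a) / #(a).

    Assumes attribute a occurs at least once and only two classes exist.
    """
    _, per_attr, per_class_attr = counts
    return (per_class_attr[(pos, a)] - per_class_attr[(neg, a)]) / per_attr[a]

def weighted_average(metric, counts, pos, neg, weights):
    """Eqs. (9)/(10): weights maps each attribute assigned to `pos` to w_a."""
    total = sum(weights.values())
    return sum(w * metric(counts, pos, neg, a) for a, w in weights.items()) / total

# Hypothetical toy data: "kids" is frequent in one class, "rare" occurs once.
toy_nodes = [
    ("animation", {"kids", "rare"}),
    ("animation", {"kids"}),
    ("animation", {"kids", "comedy"}),
    ("animation", {"kids"}),
    ("classics", {"comedy"}),
    ("classics", {"comedy", "kids"}),
    ("classics", {"musicals"}),
    ("classics", {"comedy"}),
]
counts = count_nodes(toy_nodes)

cs_kids = class_support(counts, "animation", "classics", "kids")
cc_kids = class_confidence(counts, "animation", "classics", "kids")
cs_rare = class_support(counts, "animation", "classics", "rare")
cc_rare = class_confidence(counts, "animation", "classics", "rare")

# Hypothetical ranking weights w_a for attributes assigned to "animation".
weights = {"kids": 3.0, "rare": 1.0}
avg_cs = weighted_average(class_support, counts, "animation", "classics", weights)
avg_cc = weighted_average(class_confidence, counts, "animation", "classics", weights)
```

In this toy data the attribute "rare" has perfect class confidence but low class support, while "kids" has high class support and somewhat lower class confidence. That mirrors the trade-off described in Section 5.3 for attributes such as “Dr. Seuss”: a method that weights the rare attribute more heavily would lower the average class support.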
