
Mapping and Matching Algorithms: Data Mining by Adaptive Graphs



Paolo D'Alberto, Veronica Milenkly

January 6, 2015

arXiv:1501.00491v1 [cs.OH] 2 Jan 2015

Abstract

Assume we have two bijective functions $U(x)$ and $M(x)$ with $M(x) \neq U(x)$ for all $x$ and $U, M : \mathbb{N} \to \mathbb{N}$. Every day and in different locations, we see the different results of $U$ and $M$ without seeing $x$. We are not assured about the time stamp nor the order within the day, but at least the location is fully defined. We want to find the matching between $U(x)$ and $M(x)$ (i.e., we will not know $x$). We formulate this problem as adaptive graph mining: we develop the theory, the solution, and the implementation. This work stems from a practical problem, thus our definitions. The solution is simple and clear, and the implementation is parallel and efficient. In our experience, the problem and the solution are novel and we want to share our finding.

1 Introduction

Let us start by introducing the problem through its practical case. We are traveling with our smart phone. We take a taxi and go to the airport. We surf using our data plan. We arrive at the airport, we connect to the local WiFi, and we surf. Before boarding we turn off our phone. We land and the previous process restarts. During our surfing, our phone is identified by a unique number that is a function of the device and the application (i.e., the UUID). While we are using the WiFi, our device also has a MAC address and an IP. If we have the distinct sets of MACs and UUIDs, can we find the match: which UUID is associated with which MAC?

If we identify our phone as $x$, we have two deterministic functions: the function $U(x,t,\ell)$, with location $\ell$ and time $t$, that identifies our unique device, and the function $M(x,t,\ell)$, with location and time, that identifies the MAC. We have only a sample in time of $U(x,t,\ell)$ and a sample by location of $M$. In practice, we may not obtain $U(x)$ and $M(x)$ at the same time but within a reasonable interval of time, say one day: for example, at a specific airport and date (day) we may have either one but not both, with no specific time information besides the day. Also, given $x$ we may have $S = U(x,t_i,\ell_j)$; that is, $U(x)$ is not unique and it may be the composition of a set of exclusive functions $U_i(x)$, but when possible we enforce a deterministic and unique result.

The problem boils down to answering the following question: if we are observing the output of $U$ and $M$, can we guess $x$, which is associated with $U(x)$ and $M(x)$?

We define an airport as $L_i$ with $i \in [0, N-1]$. There are $N$ airports and we enumerate them. We describe the first day we observe events simply as $t_0$. Thus $t_1$ is the second day: this implies that day $t_i$ precedes $t_{i+1}$.

Let us start by considering the first day $t_0$. For every $L_i$ there is a set of associated MAC addresses; we identify this set as $M^{L_i}_{t_0}$. Also, we determine the users within a one mile radius from $L_i$; we identify this set as $S^{L_i}_{t_0}$:

$$ S^{L_i}_{t_0} = \{ u : \mathrm{dist}(u, L_i) < 1 \text{ at time } t_0 \} \tag{1.1} $$

The user set $S^{L_i}_{t_j}$ is not complete because we have only a sample of the available impressions: we sample in time the values of $U(x,t,\ell)$, we cannot keep an ordered time sequence finer than a day granularity, and by construction we may cover only a small area of the airport $L_i$.

In practice, we associate the departing addresses with the departing users in the mapping $\langle S^{L_i}_{t_0} : M^{L_i}_{t_0} \rangle$. This is the mapping we would like to refine as much as possible, until we have a one-to-one matching. That is, we can infer the hidden $x$ that determines the unique mapping between $U(x)$ and $M(x)$. These same users are landing at different $L_j$ with $i \neq j$, and thus different address sets $M^{L_j}_{t_0}$ and $M^{L_j}_{t_1}$ may be given.

Every mapping $\langle S^{L_j}_{t_i} : M^{L_j}_{t_i} \rangle$ describes a graph, a fully connected bipartite graph. We need to combine all mappings as above in order to achieve our goal. This is an adaptive graph algorithm: we build a graph step by step, day by day. We check whether mappings in different graphs have an intersection, and if so we can split the graph by cutting edges.

The final goal is to grind these mappings into matches where one user is associated with one MAC, or at least into the finest refinement possible. In the following, we formulate our solution using the same notation, and we present our algorithm, a few simplifications, and our results.
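As a minimal sketch of how a single mapping $\langle S^{L_i}_{t_0} : M^{L_i}_{t_0} \rangle$ could be assembled, assume the observations are available as two tables with day-level granularity. The data frames `users` and `macs`, their column names (`uuid`, `mac`, `airport`, `day`), and the helper `mapping` are hypothetical and only illustrate the definition; they are not taken from the paper.

```r
# Hypothetical observations: one row per sighting, day-level granularity only.
users <- data.frame(
  uuid    = c("u1", "u2", "u3", "u1"),
  airport = c("L0", "L0", "L1", "L1"),
  day     = c(0, 0, 0, 1),
  stringsAsFactors = FALSE
)
macs <- data.frame(
  mac     = c("m1", "m2", "m3", "m1"),
  airport = c("L0", "L0", "L1", "L1"),
  day     = c(0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Build the mapping <S : M> for one airport and one day:
# S is the set of users seen there, M the set of MAC addresses seen there.
mapping <- function(users, macs, airport, day) {
  list(
    S = unique(users$uuid[users$airport == airport & users$day == day]),
    M = unique(macs$mac[macs$airport == airport & macs$day == day])
  )
}

m00 <- mapping(users, macs, "L0", 0)

# A mapping is a fully connected bipartite graph: every user in S is a
# candidate for every address in M.
candidate_edges <- length(m00$S) * length(m00$M)
```

In this toy data the mapping for airport L0 on day 0 holds two users and two addresses, so four candidate edges remain to be cut before a one-to-one matching is reached.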
2 The algorithm

Consider the mappings for the first day: $\langle S^{L_i}_{t_0} : M^{L_i}_{t_0} \rangle$. These can be considered as the departing addresses for the departing users. The users departing from $L_i$ can be the users landing at $L_j$ with $j \neq i$. If there is an intersection between addresses we can refine the mapping:

$$ \langle S^{L_i}_{t_0} : M^{L_i}_{t_0} \rangle = \Big\langle S^{L_i}_{t_0} \setminus \Big( \bigcup_{j \neq i,\ n=0,1} S^{L_j}_{t_n} \Big) : M^{L_i}_{t_0} \setminus \Big( \bigcup_{j \neq i,\ n=0,1} M^{L_j}_{t_n} \Big) \Big\rangle + \sum_{j \neq i} \Big\langle S^{L_i}_{t_0} \cap \Big( \bigcup_{n=0,1} S^{L_j}_{t_n} \Big) : M^{L_i}_{t_0} \cap \Big( \bigcup_{n=0,1} M^{L_j}_{t_n} \Big) \Big\rangle \tag{2.2} $$

Here the $+$ operation is the disjoint concatenation of mappings, binding fewer elements and refining them towards matches. Our interpretation of Equation 2.2 follows: if there is any landing information for the mapping between departing users and departing addresses, then we can refine the mapping into two major components.

$$ \langle \dot{S}^{L_i}_{t_0} : \dot{M}^{L_i}_{t_0} \rangle \equiv \Big\langle S^{L_i}_{t_0} \setminus \Big( \bigcup_{j \neq i,\ n=0,1} S^{L_j}_{t_n} \Big) : M^{L_i}_{t_0} \setminus \Big( \bigcup_{j \neq i,\ n=0,1} M^{L_j}_{t_n} \Big) \Big\rangle \tag{2.3} $$

Equation 2.3, the first term of Equation 2.2, refers to the departing users without landing information.

$$ \sum_{j} \langle S^{L_i \to L_j}_{t_0} : M^{L_i \to L_j}_{t_0} \rangle \equiv \sum_{j \neq i} \Big\langle S^{L_i}_{t_0} \cap \Big( \bigcup_{n=0}^{1} S^{L_j}_{t_n} \Big) : M^{L_i}_{t_0} \cap \Big( \bigcup_{n=0}^{1} M^{L_j}_{t_n} \Big) \Big\rangle \tag{2.4} $$

If there is an intersection between departing and landing users, we refine the mapping with the intersection of the departing and landing addresses. The locations are disjoint, and very likely a user will be at only two locations in two days; thus $S^{L_i}_{t_0} \cap \big( \bigcup_{n=0,1} S^{L_j}_{t_n} \big)$ will be non-empty for only one $j$, and then the mappings are disjoint. Because we are considering two consecutive days, we may have to combine two or more mappings one step further. Let us introduce the product of mappings.

By definition, if we have a mapping $\langle A : M \rangle$, any user in $A$ can be mapped to any address in $M$. As a graph, this represents a fully connected bipartite graph. If we have another mapping $\langle B : N \rangle$ and there is an intersection between addresses, then we know that the same users should be in both $A$ and $B$. After all, the address is unique to the device. It makes sense to take the intersection of users as well. In practice, we assume users will likely, or consistently, have the same user identification number. This refines the mapping by reducing the size of the three resulting mappings [1]:

$$ \langle A : M \rangle * \langle B : N \rangle = \langle A \setminus B : M \setminus N \rangle + \langle A \cap B : M \cap N \rangle + \langle B \setminus A : N \setminus M \rangle \tag{2.5} $$

[1] This definition does not fully represent reality. For example, if $M(x_0) = m_0$ is unique and $U(x_0)$ is not unique, say $u_0$ in $\langle A : M \rangle$ and $u_1$ in $\langle B : N \rangle$, then in Equation 2.5 we have $\langle u_0 : \emptyset \rangle + \langle \emptyset : m_0 \rangle + \langle u_1 : \emptyset \rangle = \emptyset$ instead of $\langle u_0, u_1 : m_0 \rangle$. The definition of the $*$ operation seeks a match that is deterministic and unique in time.

We use the $*$ operator to represent this operation. In combination with the $+$ operator, our algorithm will be based on an algebra. If there is no intersection, there is no refinement and $\langle A : M \rangle * \langle B : N \rangle = \langle A : M \rangle + \langle B : N \rangle$.

If there are only two mappings, the product is intuitive and the final result is a disjoint mapping. Let us consider two mappings composed of disjoint simpler mappings, and their product

$$ P = \Big( \sum_{j=0}^{K-1} w_j \Big) * \Big( \sum_{i=0}^{L-1} v_i \Big) \tag{2.6} $$

where $w_j = \langle A_j : M_j \rangle$ and $v_i = \langle B_i : N_i \rangle$. We will abuse the set notation a little here:

$$ P = \sum_{j=0}^{K-1} \Big( \sum_{i=0}^{L-1} w_j \setminus v_i \Big) + \sum_{j=0}^{K-1} \sum_{i=0}^{L-1} w_j \cap v_i + \sum_{i=0}^{L-1} \Big( \sum_{j=0}^{K-1} v_i \setminus w_j \Big) \tag{2.7} $$

We notice that the sum $\sum_{i=0}^{L-1} w_j \setminus v_i \equiv \sum_{i=0}^{L-1} \langle A_j \setminus B_i : M_j \setminus N_i \rangle$ is not disjoint, because every term has in common the mapping

$$ w_j \setminus \Big( \sum_{i=0}^{L-1} v_i \Big) \equiv \Big\langle A_j \setminus \bigcup_{i=0}^{L-1} B_i : M_j \setminus \bigcup_{i=0}^{L-1} N_i \Big\rangle $$

Also, $w_j \setminus v_0$ and $w_j \setminus v_1$ have in common the one above and $\sum_{i=2}^{L-1} w_j \cap v_i$, which are already included in the second term of Equation 2.7. Thus the first term in Equation 2.7 becomes basically $\sum_{j=0}^{K-1} w_j \setminus \big( \sum_{i=0}^{L-1} v_i \big)$, and by symmetry the third term simplifies in the same way:

$$ P = \sum_{j=0}^{K-1} w_j \setminus \Big( \sum_{i=0}^{L-1} v_i \Big) + \sum_{j=0}^{K-1} \sum_{i=0}^{L-1} w_j \cap v_i + \sum_{i=0}^{L-1} v_i \setminus \Big( \sum_{j=0}^{K-1} w_j \Big) \tag{2.8} $$

Equation 2.8 represents a disjoint mapping.

Let us return to the second term of Equation 2.2, that is Equation 2.4, and especially to how to combine the terms that have an intersection: we can imagine that the index $j$ induces an order on the components, the destinations $L_0, L_1, \ldots, L_{N-1}$. At the end of the first day $t_0$ we can summarize our knowledge as

$$ D_{t_0} = \sum_{i=0}^{N-1} \langle \dot{S}^{L_i}_{t_0} : \dot{M}^{L_i}_{t_0} \rangle + \sum_{j=0}^{N-1} \prod_{i=0}^{N-1} \langle S^{L_i \to L_j}_{t_0} : M^{L_i \to L_j}_{t_0} \rangle \tag{2.9} $$

For the refinement in Equation 2.9, we have a disjoint set of mappings. Now we must combine them. The first term of Equation 2.2 presents disjoint mappings and thus can simply be added in Equation 2.9. The second term is a little trickier. We see it as the intersection of mappings that have common users, which narrows the mapping size. The second term should reduce to a perfect matching, and when it does we can remove the users and put them aside.

Now let us consider the second day $t_1$. Let us compute $D_{t_1}$ independently from the previous step.
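Before joining two days, the product of Equation 2.5 can be made concrete on two small mappings. The following is only a sketch: the function name `product2` and the toy sets are hypothetical, and dropping terms with an empty side is an assumption borrowed from the non-empty check in the R implementation of Section 4 rather than something Equation 2.5 states.

```r
# Product of two mappings following Equation 2.5.
# A mapping is a list with a user set S and an address set M.
product2 <- function(a, b) {
  terms <- list(
    list(S = setdiff(a$S, b$S),   M = setdiff(a$M, b$M)),    # <A \ B : M \ N>
    list(S = intersect(a$S, b$S), M = intersect(a$M, b$M)),  # <A n B : M n N>
    list(S = setdiff(b$S, a$S),   M = setdiff(b$M, a$M))     # <B \ A : N \ M>
  )
  # Assumption: keep only terms where both the user side and the
  # address side are non-empty.
  Filter(function(term) length(term$S) > 0 && length(term$M) > 0, terms)
}

a <- list(S = c("u1", "u2"), M = c("m1", "m2"))  # e.g. departing from one airport
b <- list(S = c("u2", "u3"), M = c("m2", "m3"))  # e.g. seen at another airport
p <- product2(a, b)
```

With these inputs the two 2-by-2 mappings split into three one-to-one terms (u1 with m1, u2 with m2, u3 with m3), which is the kind of refinement toward matches that the algorithm aims for. When the operands share nothing, the intersection term is dropped and the two operands come back unchanged, matching $\langle A : M \rangle * \langle B : N \rangle = \langle A : M \rangle + \langle B : N \rangle$.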
Then we join t1 the two steps by checking users intersections and refining the mappings: for each mapping in D we can make a t1 product/intersectionofeachmappinginD andthus: t0 D =D ∗D t1 t1 t0 Seethesymmetricpropertyoftheproduct. Beforeany productorupdate, thetermsarealistofdisjointmappings. The product is meant to combine mappings that have com- monaddressessothattorefinethemappingsintomatches. We should keep an order during the concatenation, for example: (2.10) Dtk+1 = Dtk+1∗Dtk =(cid:16)Ni(cid:88)=−01(cid:104)S˙tLki :M˙tLki(cid:105)+Nj(cid:88)=−01Ni(cid:89)=−01(cid:104)StLki→Lj :MtLki→Lj(cid:105)(cid:17)∗ (cid:16)Ni(cid:88)=−01(cid:104)S˙tLki+1 :M˙tLki+1(cid:105)+Nj(cid:88)=−01Ni(cid:89)=−01(cid:104)StLki+→1Lj :MtLki+→1Lj(cid:105)(cid:17) 3 AStudyinParallelism Thedailymappings(cid:104)SLi :MLi(cid:105)requiresthedatafromtwo ti t0 consecutivedays: t andt . Thefirstparallelcomputation i i+1 is based on the split of the interval of time into smaller and consecutive intervals: two week interval each, say. We compute each two-week interval in parallel. This is an embarrassingparallelism. The total interval of time is composed of six months of data, we actually split the computation into up to 15 independentcomputations.EachD iscomposedbyasetof tj matchesandmappings. WetakethelistofD andcompute tj consecutive-pairproductsasabinarytree. Figure1: Decompositionofthecomputation Obviously, the last computation in the binary tree is a single product and it seems that there is no parallelism to exploit. Take the example in Figure 1. The final product D =D ∗D willrequireatleastasmuchasthesumof 56 28 56 thepreviouscomputations: O(D ∗D )+O(D ∗D ), 14 28 42 56 whichdoesnotseemparallelfriendly. Inpractice,aswegoupinthetree,welooseexplicitpar- allelism but we can exploit the same amount of parallelism in the product. Thus, we can keep the same level of paral- lelism throughout the computation and thus efficient use of anyarchitecture. 
The product becomes more complex as we go up. In fact, the product has to explore a Cartesian product of the operand mappings (graphs). We explore whether there are edges across the operands, and this is why we use the term adaptive for the graph we explore and build.

4 The Implementation

The product is the core of the whole computation. The software came after the formal solution was found. The first implementation was verbatim from Equation 2.8. This was a good starting point. There are actually a few drawbacks. First, the intersections are sparse and not balanced; that is, there may be an intersection between the $S_i$ but not between the $M_i$, or vice versa. This means that the computation $\sum_j \sum_i w_j \cap v_i$ will spend quite some work finding empty intersections, and this information is not used for the other terms. Abusing the notation a little, we can rewrite the computation in such a way that we avoid the graph union computations by computing the intersection first and reusing it:

$$ T = \emptyset, \qquad P = \sum_{\ell \mathrel{+}= \frac{K}{P}} \Big( \Big( T \mathrel{\cup}= W = \sum_{j=0}^{\frac{K}{P}} \sum_{i=0}^{L-1} w_{j+\ell} * v_i \Big) + \sum_{j=0}^{\frac{K}{P}} w_{j+\ell} \setminus W \Big) + \sum_{i=0}^{L-1} v_i \setminus T \tag{4.11} $$

Here $K/P$ denotes the size of a block of $w$ terms when the $K$ terms are split across $P$ workers, as in the argument P of the implementation below.

    # R implementation
    library(parallel)  # provides mclapply

    product <- function(D0, D1, P = 2) {
      if (length(D0) == 0 && length(D1) == 0) { R = list() }
      else if ((is.null(D0) || length(D0) == 0) && length(D1) > 0) { R = D1 }
      else if (length(D0) > 0 && (is.null(D1) || length(D1) == 0)) { R = D0 }
      else {
        # group2() is not defined in the paper; it is assumed to split the
        # index vector into chunks of about length(D0)/P elements
        L = group2(1:length(D0), length(D0)/P)
        # process one chunk K of the D0 indices
        ii <- function(K) {
          i = 0; R = list(); D = list('S' = c(), 'M' = c())
          for (k in K) {
            l = i
            Q = list('S' = c(), 'M' = c())
            for (j in 1:length(D1)) {
              S = intersect(D0[[k]]$S, D1[[j]]$S)
              M = intersect(D0[[k]]$M, D1[[j]]$M)
              # record an intersection only if both sides are non-empty
              if (length(M) > 0 && length(S) > 0) {
                i = i + 1; R[[i]] = list('S' = S, 'M' = M)
                Q$M = union(Q$M, M)
                Q$S = union(Q$S, S)
              }
            }
            if (i > l) {
              # D0[[k]] intersected something: keep what is left of it
              S = setdiff(D0[[k]]$S, Q$S)
              M = setdiff(D0[[k]]$M, Q$M)
              if (length(M) > 0 && length(S) > 0) {
                i = i + 1; R[[i]] = list('S' = S, 'M' = M)
              }
              D$M = union(D$M, Q$M)
              D$S = union(D$S, Q$S)
            } else {
              i = i + 1; R[[i]] = D0[[k]]
            }
          }
          list("R" = R, "D" = D, "disjoint" = (i == 0))
        }
        RT = mclapply(L, ii, mc.preschedule = TRUE, mc.cores = P)
        # merge the per-chunk results
        R = list(); D = list('S' = c(), 'M' = c()); disjoint = TRUE; i = 0
        for (rt in RT) {
          if (length(rt) > 0) {
            for (r in rt$R) {
              i = i + 1; R[[i]] = r
            }
            disjoint = disjoint && rt$disjoint
            D$M = union(D$M, rt$D$M)
            D$S = union(D$S, rt$D$S)
          }
        }
        if (disjoint) {
          for (k in 1:length(D1)) {
            i = i + 1; R[[i]] = D1[[k]]
          }
        } else {
          # remove from D1 everything already covered by an intersection
          for (k in 1:length(D1)) {
            S = setdiff(D1[[k]]$S, D$S)
            M = setdiff(D1[[k]]$M, D$M)
            if (length(M) > 0 && length(S) > 0) {
              i = i + 1; R[[i]] = list('S' = S, 'M' = M)
            }
          }
        }
        R
      }
    }

Above, we present the implementation in R of the product. The operand D0 is the set of mappings at time $t_i$ ($t_0$) and the operand D1 is the set of mappings at time $t_{i+1}$ ($t_1$).

The implementation has an important difference from Equation 2.8 and from the intent of Equation 4.11: the intersection has to be non-empty for both S and M to be recorded and used (in the following set difference computation). If we could apply Equation 2.8, the final product would be the composition of disjoint terms. The implementation of Equation 4.11 does not assure that the final product has disjoint terms, and it may actually allow identical terms to appear, owing to the symmetric nature of the graph. Our implementation allows a minimal and consistent computation: equal terms are removed, and non-disjoint terms involving perfect matches are simplified.

As we can see, the product explores all pairs to find intersections, which has a quadratic effect on the computational complexity. Our implementation choice is based on reducing the complexity, even if only by a constant.

5 The Case Study

Due to the proprietary nature of the data, we cannot share the data set itself nor some of its details. However, we share the code verbatim because of its simplicity (and will share the code upon request).

We observed about six airports for about six months. We observed about five hundred thousand unique MACs that appear more than once (if there is only one appearance, there is very little signal, and a matching will be possible only if we match all other MACs, which is unlikely).

We observed nine million unique users, collected within a radius of one mile from the requested center of interest of each airport, who appeared more than two times during the entire period. On average, we have 2.5 million unique users and ten thousand unique MACs per day.

We build the mappings using two different granularities. See Figure 1. We use a granularity of two and four weeks to start the computation. This is to cope with the randomness of the user observation: we can only obtain a sample of the users' UUIDs, and their appearance or absence affects the matches and their products. Also, the asymmetric nature of the product implementation, exemplified by Equation 4.11, will make the resulting graphs different. Otherwise, the graphs should be completely deterministic, consistently built, and eventually identical.

The computation time also may differ because of the different sparsity of the mappings and their combinations. We present the results separately and we conclude this section with a few considerations.

Table 1: Two/four-week mapping graph (above) and user/mac distribution (below)

    weeks  matches  mappings  users covered  MACs covered
    2      30667    130304    23448155       689828
    4      33912    126650    24153536       686038

Table 2: Ratio user/mac distribution

    weeks  Min.   1st Qu.  Median  Mean    3rd Qu.  Max.
    2      0.007  3.000    8.000   80.990  30.000   28690.000
    4      0.012  3.000    8.000   87.940  32.000   27440.000

5.1 The two and four week graphs

The process follows the one presented in Figure 1: we start by building the day-by-day graphs up to two/four weeks. Then, we build the full graph.

Using the same number of resources, 16 cores for each computation, the four-week graph is a little faster (i.e., 2 hours faster for a 4-day end-to-end computation) and provides more matches and fewer mappings, but more redundancy. Notice that the two-week graph exploits more parallelism, and it would be faster if more resources could be used at the beginning of the process.

If we take a graph and compute the ratio of the number of users over the number of MACs in each mapping, we can summarize the graphs using their distributions. We summarize the distributions in Table 2. In practice, the two-week graph has fewer matches, but its mappings tend to be more refined than the mappings in the four-week graph (Table 1).

We have not tried to combine the results (four and two weeks); it is possible to take the graphs, combine the matches, and then compute the product of the mappings. This is left as future investigation.

5.2 Considerations

The problem formulation and its notation were used to write a first implementation: the first prototype was applied to a small graph after a few weeks. The choice to write the solution in R was for the ease of connecting to the different databases where the data were available. The simple semantics of the language fit the original formulation well.

As we increased the size of the graph, we decided to keep the original solution, exploit the R parallelism, and beef up the hardware (from an 8-core 32 GB machine to a 32-core 128 GB machine). However, the quadratic complexity (i.e., $O(N^2)$ in the nodes of the graph) forced us to tune the code and to relax the computation. We had to exploit parallelism in a way that is not natural to R.

We would suggest choosing a different environment or language to exploit parallelism at loop level.

6 Conclusion

To the best of our knowledge, the problem is novel because the refinement of the mappings requires the intersection of two different sets. There is no truth given a priori and thus there is no learning. This is an example of graph mining. To the best of our knowledge, our solution is novel as well.

We provide a formal definition of the problem and its solution in order to start a conversation. The formal statement has actually been driving our presentation of the problem and of the solution. The desire for a well-defined formalism helped us free our ideas from ambiguity and from dangerous and misplaced intuitions.

As a result, we have a solution that balances parallelism in an elegant fashion as it unfolds during the computation. This parallelism is not common and we wanted to share its application. This, in itself, could be attractive to others.
