Table Of ContentMapping and Matching Algorithms: Data Mining by Adaptive Graphs
PaoloD’Alberto∗ VeronicaMilenkly†
January6,2015
Abstract question: If we are observing the output of U and M, can
weguessx,whichisassociatedtoU(x)andM(x)?
5 Assume we have two bijective functions U(x) and M(x) with
1 M(x) (cid:54)= U(x)forallxandM,N : N → N. Everydayandin We define an airport as Li with i ∈ [0,N −1]. There
0 differentlocations,weseethedifferentresultsofU andM without areN airportsandweenumeratethem. Wedescribethefirst
2 day we observe events simply as t . Thus t is the second
seeing x. We are not assured about the time stamp nor the order 0 1
n withinthedaybutatleastthelocationisfullydefined. Wewantto day: thiswillimplythatdaytiprecedesti+1.
a findthematchingbetweenU(x)andM(x)(i.e.,wewillnotknow Let us start considering the first day t0. For every Li
J thereisasetofassociatedMACaddress,weidentifythisset
x). We formulate this problem as an adaptive graph mining: we
2 as MLi. Also, we determine the users in one mile radius
] dsteevmelsofprothmeathperoarcyt,icthaelpsorolubtlieomn,tahnudstohueridmepfilneimtioennsta.tTiohne.sTohluistiwonoriks fromtL0i: WeidentifythissetasStL0i.
H simple,clear,andtheimplementationparallelandefficient. Inour
(1.1) SLi ={u:dist(u,L )<1attimet }
O experience,theproblemandthesolutionarenovelandwewantto t0 i 0
cs. shareourfinding. a samThpeleuosefrthseetaSvtLajiilaisblneoitmcopmrepssleiotensb:ecwauesseawmeplheavineotinmlye
[ 1 Introduction
the values of U(x,t,(cid:96)), we cannot keep an ordered time
1 Let start by introducing the problem by its practical case. sequence beside a day granularity, and by construction we
v We are traveling with our smart phone. We take a taxi and maycoveronlyasmallareaoftheairportL .
i
1 go to the airport. We surf using our data plan. We arrive In practice, we associate (cid:104)SLi : MLi(cid:105), the departing
9 at the airport and we connected to the local WiFi, we surf. t0 t0
addresses to the departing users. This is the mapping we
4
Before boarding we turn off our phone. We land and the
0 wouldliketorefineasmuchaspossible,untilwecanhavea
previousprocessrestarts. Duringoursurfing,ourphonewill
0 one-to-onematching. Thatis,wecaninferthehiddenxthat
1. beidentifiedbyauniquenumberasafunctionofthedevice determines the unique mapping between U(x) and M(x).
and application (i.e., UUID). While we are using the WiFi,
0 ThesesameusersarelandingtodifferentLj withi (cid:54)= j and
5 ourdevicewillhavealsoaMACaddressandIP.Ifwehave thusdifferentaddressesMLj andMLj maybegiven.
1 thedistinctsetofMACsandUUIDs,canwefindthematch: t0 t1
v: whatUUIDisassociatedwiththeMAC? Everymapping(cid:104)StLji : MtLji(cid:105)describesagraph, afully
connectedbipartitegraph.Weneedtocombineallmappings
i Ifweidentifyourphoneasx,wehavetwodeterministic
X as above in order to achieve our goal. This is an adaptive
functions: functionU(x,t,(cid:96))withlocation(cid:96)andtimetthat
graph algorithm: we build a graph step by step, day by
r identifies our unique device, and function M(x,t,(cid:96)) with
a day. We check whether mappings in different graphs have
location and time that identifies the MAC. We have only a
intersectionandwecansplitthegraphbycuttingedges.
sampleintimeofU(x,t,(cid:96))andasamplebylocationofM.
The final goal is to grind these mappings into matches
In practice, We may not gain U(x) and M(x) at the same
where one user is associate to one MAC or at least to the
time but in a reasonable interval of time, say one day: for
finest refinement possible. In the following, we formulate
example, at a specific airport and date (day) we may have
our solution using the same notations, we present our algo-
either one but not both with no specific time information
rithm,afewsimplifications,andourresults.
besidetheday. Also,givenxwemayhaveS =U(x,t ,(cid:96) ),
i j
thatisU(x)isnotuniqueanditmaythecompositionofaset
2 Thealgorithm
ofexclusivefunctionsU (x)butwhenpossibleweenforcea
i
deterministicanduniqueresult. Consider the mappings for the first day: (cid:104)StL0i : MtL0i(cid:105).
The problem boils down as to answer the following These can be considered as the departing addresses for the
departingusers. TheusersdepartingfromL canbetheuser
i
∗pdalberto@ninthdecimal.com landing at Lj with j (cid:54)= i. If there is intersection between
†vmilenkiy@ninthdecimal.com addresseswecouldrefinethemapping:
two mappings composed by disjoint simpler mappings and
theirproducts
(2.2) (cid:104)StL0i :MtL0i(cid:105)=(cid:104)j(cid:88)S(cid:54)=tLi0(cid:104)iS(cid:114)tL0i(cid:16)n∩i=(cid:91)(cid:54)=(cid:0)0nj,=1(cid:91)S0,tL1njS(cid:17)tLn:jM(cid:1):tL0MitL(cid:114)0i(cid:16)∩ni=(cid:0)(cid:91)(cid:54)=n0j,=(cid:91)10M,1tLMnjtL(cid:17)n(cid:105)j+(cid:1)(cid:105) (2.6) P =(cid:16)K(cid:88)−1wj(cid:17)∗(cid:16)L(cid:88)−1vi(cid:17)
j=0 i=0
Here the + operation is the disjoint concatenation of map-
pings binding fewer elements and refining them towards wherew = (cid:104)A : M (cid:105)andv = (cid:104)B : N (cid:105). Wewillabuse
j j j i i i
matches. OurinterpretationofEquation2.2follows: ifthere
thesetnotationalittlehere:
is any landing information the mapping between departing
users and departing address, then we can refined the map-
K−1 L−1
pingintotwomajorcomponents. P = (cid:88) (cid:16)(cid:88)w (cid:114)v (cid:17)+
j i
(2.3) (cid:104)S˙tL0i :M˙tL0i(cid:105)≡(cid:104)StL0i(cid:114)(cid:16)ni=(cid:91)(cid:54)=0j,1StLnj(cid:17):MtL0i(cid:114)(cid:16)ni=(cid:91)(cid:54)=0j,1MtLnj(cid:17)(cid:105) K(cid:88)j=−01L(cid:88)−i1=0
(2.7) w ∩v +
j i
Equation2.3and2.2.(partone)refertothedepartingusers
j=0 i=0
withoutlandinginformation.
(2.4(cid:88))j(cid:104)StL0i→Lj :MtL0i→Lj(cid:105)≡j(cid:88)(cid:54)=i(cid:104)StL0i∩(cid:16)n(cid:91)=10StLnj(cid:17):MtL0i∩(cid:16)n(cid:91)=10MtLnj(cid:17)(cid:105) L(cid:88)−1(cid:16)K(cid:88)−1vi(cid:114)wj(cid:17)
i=0 j=0
If there is an intersection between departing and landing
users, we refine the mapping with the intersection of the Wenoticethatthesum(cid:80)Li=−01wj (cid:114)vi ≡ (cid:80)Li=−01(cid:104)Aj (cid:114)Bi :
departingandlandingaddresses. Thelocationsaredisjoint, Mj (cid:114)Ni(cid:105)isnotdisjointbecauseeverytermhasincommon
very likely a user will be only at two locations in two days themapping:
and thus SLi ∩(cid:0)∪ SLj(cid:1) will be true only for one j,
t0 n=0,1 tn L−1
thenthemappingsaredisjoint. Becauseweareconsidering (cid:88)
w (cid:114)( v )≡(cid:104)A (cid:114)∪L−1B :M (cid:114)∪L−1M (cid:105)
twoconsecutivedayswemayhavetocombinetwoormore j i j i=0 i j i=0 i
i=0
mappings one step further. Let us introduce the product of
mappings. alsow (cid:114)v andw (cid:114)v haveincommontheoneaboveand
j 0 j 1
Bydefinition,ifwehavea(cid:104)A:M(cid:105)anyuserinAcanbe (cid:80)L−1w ∩v ,whicharealreadyincludedinthesecondterm
i=2 j i
mapped to any address in M. As a graph, this represents a inEquation2.7. ThusthefirstterminEquation2.7becomes
bipartitefully-connectedgraph. Ifwehaveanothermapping basically(cid:80)K−1w (cid:114)((cid:80)L−1v ).
j=0 j i=0 i
(cid:104)B :N(cid:105)andthereisanintersectionbetweenaddresses,then
we know that the same users should be in both A and B. K−1 L−1
(cid:88) (cid:88)
Afteralltheaddressisuniquetothedevice. Itmakessense P = w (cid:114)( v )+
j i
to take the intersection of users as well. In practice, we j=0 i=0
assume users will have likely or consistently the same user
K−1L−1
(cid:88) (cid:88)
identificationnumber. Thiswillrefinethemappingreducing (2.8) w ∩v +
j i
thesizeofthethreeresultingmappings: 1
j=0 i=0
(cid:104)A:M(cid:105)∗(cid:104)B :N(cid:105)=(cid:104)A(cid:114)B :M (cid:114)N(cid:105)+ L(cid:88)−1 K(cid:88)−1
v (cid:114)( w )
i i
(2.5) (cid:104)A∩B :M ∩N(cid:105)+
i=0 j=0
(cid:104)B(cid:114)A:N (cid:114)M(cid:105)
Equation2.8representsadisjointmapping.
We use the ∗ operator to represent this operation. In Let us return to Equation 2.2.(part two) and 2.4 espe-
combinationwith+operator,ouralgorithmwillbebasedan cially how to combine the terms that have intersection: we
algebra.Ifthereisnointersection,thereisnorefinementand can imaginethat the index j infers anorder for thecompo-
(cid:104)A:M(cid:105)∗(cid:104)B :N(cid:105)=(cid:104)A:M(cid:105)+(cid:104)B :N(cid:105). nents: thedestinationL ,L ,...,andL . Attheendof
0 1 N−1
If there are only two mappings the product is intuitive thefirstdayt wecansummarizeourknowledgeas
0
and the final result is a disjoint mapping. Let us consider
N−1
D = (cid:88)(cid:104)S˙Li :M˙ Li(cid:105)+
1This definition does not fully represent reality. For example, a t0 t0 t0
M(x0)=m0isuniqueandU(x0)isnotuniquesayu0in(cid:104)A:M(cid:105)and (2.9) i=0
u∅1(cid:105)i=n∅(cid:104)Bin:sNtea(cid:105)d,tohfe(cid:104)nuin0,Euq1ua:timon02(cid:105)..5Twheehdaevfieni(cid:104)tuio0n:o∅f*(cid:105)+op(cid:104)e∅rat:iomn0is(cid:105)+see(cid:104)kui1ng: N(cid:88)−1N(cid:89)−1(cid:104)StL0i→Lj :MtL0i→Lj(cid:105)
foradeterministicanduniqueintimematch.
j=0 i=0
FortherefinementinEquation2.9,wehaveadisjointset
ofmappings. Nowwemustcombinethem. Thefirsttermin
Equation2.2.(partone)presentsdisjointmappingsandthus
canbejustaddedinEquation2.9. Thesecondtermisalittle
trickier. Weseeitastheintersectionofmappingsthathave
common users and thus narrowing the mapping size. The
secondshouldreducetoaperfectmatchingandwhenitdoes
wecanremovetheusersandputthemaside.
Nowletusconsidertheseconddayt . Letuscompute
1
D independently from the previous step. Then we join
t1
the two steps by checking users intersections and refining
the mappings: for each mapping in D we can make a
t1
product/intersectionofeachmappinginD andthus:
t0
D =D ∗D
t1 t1 t0
Seethesymmetricpropertyoftheproduct. Beforeany
productorupdate, thetermsarealistofdisjointmappings.
The product is meant to combine mappings that have com-
monaddressessothattorefinethemappingsintomatches.
We should keep an order during the concatenation, for
example:
(2.10)
Dtk+1 =
Dtk+1∗Dtk =(cid:16)Ni(cid:88)=−01(cid:104)S˙tLki :M˙tLki(cid:105)+Nj(cid:88)=−01Ni(cid:89)=−01(cid:104)StLki→Lj :MtLki→Lj(cid:105)(cid:17)∗
(cid:16)Ni(cid:88)=−01(cid:104)S˙tLki+1 :M˙tLki+1(cid:105)+Nj(cid:88)=−01Ni(cid:89)=−01(cid:104)StLki+→1Lj :MtLki+→1Lj(cid:105)(cid:17)
3 AStudyinParallelism
Thedailymappings(cid:104)SLi :MLi(cid:105)requiresthedatafromtwo
ti t0
consecutivedays: t andt . Thefirstparallelcomputation
i i+1
is based on the split of the interval of time into smaller
and consecutive intervals: two week interval each, say. We
compute each two-week interval in parallel. This is an
embarrassingparallelism.
The total interval of time is composed of six months
of data, we actually split the computation into up to 15
independentcomputations.EachD iscomposedbyasetof
tj
matchesandmappings. WetakethelistofD andcompute
tj
consecutive-pairproductsasabinarytree. Figure1: Decompositionofthecomputation
Obviously, the last computation in the binary tree is a
single product and it seems that there is no parallelism to
exploit. Take the example in Figure 1. The final product
D =D ∗D willrequireatleastasmuchasthesumof
56 28 56
thepreviouscomputations: O(D ∗D )+O(D ∗D ),
14 28 42 56
whichdoesnotseemparallelfriendly.
Inpractice,aswegoupinthetree,welooseexplicitpar-
allelism but we can exploit the same amount of parallelism
in the product. Thus, we can keep the same level of paral-
lelism throughout the computation and thus efficient use of
anyarchitecture.
The product becomes more complex as we go up. In
fact, the product has to explore a Cartesian product of the
operand mappings (graphs). We explore if there are edges computingtheintersectionfirstandreuseit:
acrosstheoperandsandthisiswhyweusethetermadaptive (4.11)
forthegraphweexploreandbuild. T =∅
K K
4 TheImplementation P = (cid:88) (cid:16)(cid:0)T∪=W =(cid:88)P L(cid:88)−1wj+(cid:96)∗vi(cid:1)+(cid:88)P wj+(cid:96)(cid:114)W(cid:17)+
#Rimplementation (cid:96)=+K j=0i=0 j=0
product<-function(D0,D1,P=2){ P
if(length(D0)==0&&length(D1)==0) { R=list() } L−1
eellsseeiiff(((liesn.gntuhl(lD(0D)0)>0||&&le(nigst.hn(uDl0l)(D=1=)0)||&&lelnegntght(hD(1D)1=)=>00))){{ RR== DD10 }} (cid:88)vi(cid:114)T
else{
L=group2(1:length(D0),length(D0)/P) i=0
ii<-function(K){ Above,wepresenttheimplementationinRoftheproduct.
if=o0r;(Rk=inliKs)t({);D=list(’S’=c(),’M’=c()) The operand D0 is the mappings at time ti (t0) and the
lQ==ilist(’S’=c(),’M’=c()) operandD1isthemappingsattimeti+1(t1).
for(jin1:length(D1)){
S=intersect(D0[[k]]$S,D1[[j]]$S) The implementation has an important difference from
M=intersect(D0[[k]]$M,D1[[j]]$M)
if(length(M)>0&&length(S)>0){ Equation 2.8 and the intent in Equation 4.11, the intersec-
iQ$=Mi=+un1;ionR([Q[$Mi,]]M)=list(’S’=S,’M’=M) tion has to be non empty for both S and M to be recorded
Q$S=union(Q$S,S)
} andused(inthefollowingsetdifferencecomputation).Ifwe
}
if(i>l){ couldapplyequation2.8,thefinalproductwillbethecom-
SM==sseettddiiffff((DD00[[[[kk]]]]$$SM,,QQ$$SM)) position of disjoint terms. The implementation of Equation
if(length(M)>0&&length(S)>0){
i=i+1; R[[i]]=list(’S’=S,’M’=M) 4.11doesnotassurethatthefinalproducthasdisjointterms
}
D$M=union(D$M,Q$M) and actually it may allow identical terms to appear, thanks
D$S=union(D$S,Q$S)
}else{ to the symmetric nature or the graph. Our implementation
i=i+1; R[[i]]=D0[[k]]
} allowsaminimumandconsistentcomputation: equalterms
}
list("R"=R,"D"=D,"disjoint"=(i==0)) areremovedandno-disjointtermsinvolvingperfectmatches
}
aresimplified.
RT=mclapply(L,ii,mc.preschedule=TRUE,mc.cores=P)
Aswecansee,theproductexploresallpairstofindin-
R=list(); D=list(’S’=c(),’M’=c()); disjoint=TRUE; i=0
for(rtinRT){ tersections,asquareeffectonthecomputationalcomplexity.
if(length(rt)>0){
for(rinrt$R){ Our implementation choice is based on reducing the com-
i=i+1; R[[i]]=r
} plexityeventhoughbyaconstant.
disjoint=disjoint&&rt$disjoint
D$M=union(D$M,rt$D$M)
D$S=union(D$S,rt$D$S) 5 TheCaseStudy
}
}
if(disjoint){ Duetotheproprietarynatureofthedata,wecannotsharethe
for(kin1:length(D1)){
i=i+1; R[[i]]=D1[[k]] setitselfandafewofitsdetails. However,wesharethecode
}
}else{ verbatim because of its simplicity (and will share the code
for(kin1:length(D1)){
S=setdiff(D1[[k]]$S,D$S) uponrequest).
M=setdiff(D1[[k]]$M,D$M)
if(length(M)>0&&length(S)>0){ We observed about six airports for about six months.
i=i+1; R[[i]]=list(’S’=S,’M’=M)
} WeobservedaboutfivehundredthousanduniqueMACsthat
}
} appearmorethanonce(ifthereisonlyoneappearance,there
R
} isverylittlesignalandamatchingwillbepossibleonlyifwe
}
matchallotherMACs,whichisunlikely).
We observed nine million unique users collected in a
radius of one mile from the airports requested center of
interest and they appeared more than two times during the
entireperiod. Onaverage,wehave2.5millionuniqueusers
andtenthousanduniqueMACsperday.
Webuildthemappingsusingtwodifferentgranularities.
Theproductisthecoreofthewholecomputation. The
SeeFigure1. Weuseagranularityoftwoandfourweeksto
softwarecameaftertheformalsolutionwasfound. Thefirst
implementation was verbatim from Equation 2.8. This was start the computation. This is to cope with the randomness
a good starting point. There are actually a few drawbacks: of the user observation: We can only obtain a sample of
First, the intersections are sparse and not balanced; that is, the users UUID and their appearance or their lack affect
theremaybeintersectionbetweenS butnotinbetweenM
i (cid:80) (cid:80) i thematchesandtheirproducts. Alsotheasymmetricnature
orviceversa. Thismeansthatthecomputation w v
j i j i oftheproductimplementationexemplifiedofEquation4.11
willspendquitesomeworkfindingemptyintersectionsand
will make the resulting graphs different. Otherwise, the
this information is not used for the other terms. Abusing a
little the notation, we can rewrite the computation in such graphsshouldbecompletelydeterministic,consistentlybuilt
a way that we avoid the graphs union computations by
Table1: Two/four-weekmappinggraph(above)anduser/macdistribution(below)
weeks matches mappings userscovered macscoverage
2 30667 130304 23448155 689828
4 33912 126650 24153536 686038
Table2: Ratiouser/macdistribution
weeks Min. 1stQu. Median Mean 3rdQu. Max.
2 0.007 3.000 8.000 80.990 30.000 28690.000
4 0.012 3.000 8.000 87.940 32.000 27440.000
andeventuallyidentical. cores128GB).However,thesquarecomplexity(i.e.,O(N2)
The computation time also may differ because of the nodes of the graph) forced us to tune the code and to relax
different sparsity of the mappings and their combinations. thecomputation. Wehadtoexploitparallelisminawaythat
We present the results separately and we conclude this itisnotnaturaltoR.
sectionwithafewconsiderations. We would suggest to chose a different environment or
languagetoexploitparallelismatlooplevel.
5.1 Thetwoandfourweekgraphs Theprocessfollows
theonepresentedinFigure1, westartbuildingday-by-day 6 Conclusion
graphuptotwo/fourweeks. Then,webuildthefullgraph. Tothebestofourknowledge,theproblemisnovelbecause
Usingthesamenumberofresources,16coresforeach the refinement of the mappings requires the intersection of
computation,thefourweekgraphisalittlefaster(i.e.,2hrs two different sets. There is no truth given a priori and thus
faster for a 4 days computation from end-to-end), provides thereisnolearning. Thisisanexampleofgraphmining. To
morematches,fewermappingsbutmoreredundancy.Notice thebestofourknowledge,oursolutionisnovelaswell.
thatthetwo-weekgraphexploitsmoreparallelismanditwill We provide a formal definition of the problem and its
befasterifmoreresourcecouldbeusedatthebeginningof solution in order to start a conversation. The formal state-
theprocess. ment actually have been driving our problem presentation
If we take a graph and we compute the ratio of the andsolution. Thedesireofawelldefinedformalismhelped
users number over the MACs number in each mapping, we us freeing ideas by means of no ambiguity and dangerous
can summarize the graphs using their distributions. We andmisplacedintuitions.
summarize their distribution in Table 2. In practice, the As result, we have a solution that balances parallelism
two-weekgraphhasfewermatchesbutthemappingstendto in an elegant fashion as it unfolds during the computation.
be more refined than the mappings in the four-week graph, This parallelism is not common and we wanted to share its
Table1. application. This,initself,couldbeattractivetoothers.
We have not tried to combine the results (four and two
weeks);itispossibletotakethegraphscombinethematches
and then compute the product of the mappings. This is left
asfutureinvestigation.
5.2 Considerations The problem formulation and its no-
tations were used to write a first implementation: the first
prototype was applied to a small graph after a few weeks.
ThechoicetowritethesolutioninRwasfortheeaseincon-
nectingtodifferentdatabaseswherethedatawereavailable.
The simplesemantic ofthe languagefit the originalformu-
lationwell.
As we increased the size of the graph, we decided to
keep the original solution, exploit the R parallelism, and to
beef up the hardware (from 8-cores 32 GB machine to 32-