Vertex Nomination via Content and Context

January 20, 2012

Glen A. Coppersmith and Carey E. Priebe
Human Language Technology Center of Excellence
Johns Hopkins University

arXiv:1201.4118v1 [stat.AP] 19 Jan 2012

Abstract

If I know of a few persons of interest, how can a combination of human language technology and graph theory help me find other people similarly interesting? If I know of a few people committing a crime, how can I determine their co-conspirators? Given a set of actors deemed interesting, we seek other actors who are similarly interesting. We use a collection of communications encoded as an attributed graph, where vertices represent actors and edges connect pairs of actors that communicate. Attached to each edge is the set of documents wherein that pair of actors communicate, providing content in context: the communication topic in the context of who communicates with whom. In these documents, our identified interesting actors communicate amongst each other and with other actors whose interestingness is unknown. Our objective is to nominate the most likely interesting vertex from all vertices with unknown interestingness. As an illustrative example, the Enron email corpus consists of communications between actors, some of whom are allegedly committing fraud. Some of their fraudulent activity is captured in emails, along with many innocuous emails (both between the fraudsters and between the other employees of Enron); we are given the identities of a few fraudster vertices and asked to nominate other vertices in the graph as likely representing other actors committing fraud. Foundational theory and initial experimental results indicate that approaching this task with a joint model of content and context improves the performance (as measured by standard information retrieval measures) over either content or context alone.
1 Introduction

Given a set of documents containing communications among a collection of actors and an identified subset of actors deemed interesting, we wish to select actors from outside the identified set who exhibit similar behavior to the identified interesting actors. For a concrete example, within the Enron email collection (see, e.g., Priebe et al. (2005)) is a set of executives and traders allegedly committing fraud. If we know the identities of a subset of the fraudsters, can we nominate other people from the company as likely fraudsters? We assume that what indicates that actors are interesting (fraudulent) is manifest both in the topics about which they communicate (the content of their messages) and with whom in the company they communicate (the context of their messages).

We conceptualize this as an attributed graph, where each vertex is an actor and pairs of actors that communicate are connected by edges. The edges are attributed by the content of the messages exchanged (in our case, represented as a distribution over topics). We design and evaluate a family of test statistics that score each actor (vertex) based on the content and context of their email communications (edges). We nominate vertices from outside the identified set as likely to be interesting. This task has noted similarities to the Netflix challenge (e.g., Bell et al. (2008)), recommender systems (Resnick and Varian (1997) and contents of the special issue), and detecting communities of interest (e.g., Cortes et al. (2002)).

Information useful for the vertex nomination task might be encoded in both content and context. It is reasonable to assume that test statistics based on either content alone or context alone would have some efficacy for vertex nomination, but statistics which take advantage of both content and context might provide superior inferential capability (e.g., Priebe et al. (2010b)).
Selecting test statistics useful for this task (or selecting the uniformly most powerful test statistic against some specified composite alternative) is both interesting and decidedly nontrivial. (See Priebe et al. (2010a) for a summary of inferential complexity in a related task in perhaps the simplest possible model, without content.) We set up a deceptively simple generative model (described in Section 2 and depicted in Figure 1) to study this task and present results from simulations and experiments on real data (the Enron email corpus). The possible space of test statistics is practically limitless, even for this simple setting. For tractability, we limit ourselves to a simple family of linear fusion statistics (Section 3.1) and demonstrate how their performance for this task depends on many underlying factors (manifest as parameters in the generative model and latent qualities of the data). The optimal performance is found in a fusion of content and context rather than either alone, in both simulated and observed data (Sections 4 and 5, respectively).

This paper proceeds as follows: Section 2 spells out our assumptions and describes the joint model of content and context, Section 3 describes the experimental and evaluation methods used, Section 4 describes simulation experiments where the content and context are generated according to our model, Section 5 demonstrates that (A) our assumptions are reasonable (and real data corresponding to the assumptions does naturally occur), (B) when our assumptions are met, vertex nomination works, and (C) when our assumptions are met, the fusion of content and context is superior to either alone, and Section 6 makes concluding remarks and discusses future directions. The appendix details our data set, the Enron email corpus.

2 Model

We base our model upon two assumptions, detailed below. Specifically, when the physical world exhibits a group of interest that meets these assumptions, our model is reasonable (as demonstrated in Section 5).
We observe communications among our identified interesting set, among our candidate set, and between actors in the identified set and actors in the candidate set.

• Assumption 1: Pairs of vertices in the group of interest (identified and not identified) communicate among themselves with a different frequency than other pairs.

• Assumption 2: The group of interest communicates about topics in different proportions than the population of actors as a whole.

The context information available for vertex nomination is derived from Assumption 1, while the content information is derived from Assumption 2.

Let $G = (V, E, \phi_V, \phi_E)$ be the simplest of attributed graphs ($G$ is undirected, with no self-loops, no multi-edges, and no hyper-edges). Let $V$ be the set of vertices (actors) and $E$ be the set of edges (communication between pairs of actors). Specifically, $E \subset V^{(2)}$, where $V^{(2)}$ denotes the set of unordered pairs of vertices. Attribution functions $\phi_V : V \to \Phi_V$ and $\phi_E : V^{(2)} \to \Phi_E$ place (categorical) attributes on the vertices and edges, respectively, where $\Phi_V = \{1, \ldots, K_V\}$, $\Phi_E = \{0, 1, \ldots, K_E\}$, $K_V$ is the number of vertex attributes (interesting and not interesting, for our purposes) and $K_E$ is the number of edge attributes (topics, for our purposes); $\phi_E = 0$ represents a non-observed edge, so for all $e \notin E$, $\phi_E(e) = 0$, and for all $e \in E$, $\phi_E(e) \in \{1, \ldots, K_E\}$. For this investigation, $K_V = K_E = 2$ and $\Phi_V = \Phi_E = \{\mathrm{red}, \mathrm{green}\}$. We use red and 1 interchangeably, as appropriate for the context; likewise for green and 2.

For our investigation, we use a simple edge- and vertex-attributed independent edge model. We use a stochastic block-model random graph (sometimes referred to as a "kidney-egg" or $\kappa$ graph), where there is a "chatter" group present: a subset of the actors which communicate amongst themselves in excess of what is expected from the activity present in the rest of the graph, and with a topic distribution different from that governing the rest of the graph.
As depicted in Figure 1, $\kappa(n, p, m, s)$ is a random graph model (Bollobás (2001)) on $n$ vertices ($|V| = n$); $|\{v : \phi_V(v) = 1\}| = m$, so $m$ vertices have the attribute of interest (red) and communicate differently than the collection $\{v : \phi_V(v) = 2\}$ of $n - m$ not-of-interest (green) vertices. The edge attribute for a pair of vertices $u, v$ with $\phi_V(u) = \phi_V(v) = 1$ is governed by the probability vector $s = [s_0, s_1, s_2]'$, where $s_1$ is the probability that the edge is red ($\phi_E(uv) = 1$), $s_2$ is the probability that the edge is green ($\phi_E(uv) = 2$), and $s_0$ is the probability of no edge; edge attributes for all other pairs of vertices are governed by $p = [p_0, p_1, p_2]'$. Like $s$, $p_1$ is the probability of a red edge, $p_2$ is the probability of a green edge and $p_0$ is the probability of no edge.

The observed graph includes occlusion of most of the vertex attributes: $G' = (V, E, \phi_V, \phi'_V, \phi_E)$ is a $\kappa(n, p, m, s; m')$ graph where $\phi'_V : V \to \Phi_V \cup \{0\}$ and $\phi'_V(v) = 0$ denotes that the attribute for vertex $v$ is occluded. For our particular setting, all observed attributes are red and we observe no green attributes (Figure 1). We let $\mathcal{M} = \{v : \phi_V(v) = 1\}$ be the set of vertices with true red attributes. Our identified set – the set of vertices with observed (true) red attributes – is given by $\mathcal{M}' = \{v : \phi'_V(v) = 1\}$, and $|\mathcal{M}'| = m'$. (We assume that the identified set $\mathcal{M}' \subset \mathcal{M}$ is selected at random.) The candidate set is $V \setminus \mathcal{M}'$. We assume that there is no error in the vertex attributes, just occlusion; in addition, we assume that we observe the attributes on all edges, and there is no error in the edge attributes.

We assume that $n \gg m > m' > 0$. That is, there is at least one vertex known to be of interest ($m' \geq 1$), which allows the set of context measures we employ to measure functions of the graph-proximity to a member of $\mathcal{M}'$.
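To make the model concrete, the following is a minimal sampler for a $\kappa(n, p, m, s; m')$ graph. This is our own illustrative sketch rather than code from the paper; the function name `sample_kappa` and the matrix encoding of $\phi_E$ are our choices, and `p` and `s` are assumed to be ordered [P(no edge), P(red edge), P(green edge)] as in the text.

```python
import numpy as np

def sample_kappa(n, p, m, s, m_prime, rng=None):
    """Sample an edge-attributed kappa(n, p, m, s; m') graph.

    Vertex attributes: 1 = red (interesting), 2 = green.
    Edge attributes:   0 = no edge, 1 = red topic, 2 = green topic.
    p and s are length-3 probability vectors [P(no edge), P(red), P(green)].
    """
    rng = np.random.default_rng(rng)
    phi_V = np.full(n, 2)          # all green ...
    phi_V[:m] = 1                  # ... except the first m, which are red
    # Occlusion: reveal only m' red vertices, chosen at random.
    phi_V_obs = np.zeros(n, dtype=int)
    phi_V_obs[rng.choice(m, size=m_prime, replace=False)] = 1

    phi_E = np.zeros((n, n), dtype=int)
    for u in range(n):
        for v in range(u + 1, n):
            # Red-red pairs use s; any pair with a green endpoint uses p.
            probs = s if (phi_V[u] == 1 and phi_V[v] == 1) else p
            phi_E[u, v] = phi_E[v, u] = rng.choice(3, p=probs)
    return phi_V, phi_V_obs, phi_E
```

The returned `phi_V_obs` plays the role of $\phi'_V$: it is 1 on the identified set and 0 (occluded) elsewhere.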
We also assume that the candidate set $V \setminus \mathcal{M}'$ contains at least one true red ($m > m'$) and at least one true green ($n > m$) vertex. (The question of whether or not there exist any red vertices in the candidate set is an interesting one; we do not directly address it here, but the methods described here do inform how one might approach that question.)

To rephrase the inference task in our freshly minted notation: We are given a graph $G'$ on vertices $V$, $m$ of which have attribute red ($\mathcal{M} \subset V$) and $n - m$ of which have attribute green ($V \setminus \mathcal{M}$). All vertex attributes are occluded save $m'$ drawn from the set $\mathcal{M}$ ($\mathcal{M}' \subset \mathcal{M}$); thus all observed vertex attributes are red. We wish to rank order all vertices with occluded attributes – the candidate set $V \setminus \mathcal{M}'$ – according to their similarity to the identified set $\mathcal{M}'$. Performance is judged by how high in the ranked list of candidate vertices $V \setminus \mathcal{M}'$ the vertices $\mathcal{M} \setminus \mathcal{M}'$ with occluded (but truly red) attributes fall.

Figure 1: Our model, a $\kappa(n, p, m, s; m')$ graph with $n = |V|$ vertices, $m = |\mathcal{M}|$ of which have attribute red and $n - m = |V \setminus \mathcal{M}|$ of which have attribute green. We observe attributes for only $m' = |\mathcal{M}'|$ identified set vertices (filled circles), with the remaining $n - m' = |V \setminus \mathcal{M}'|$ candidate set vertices (open circles) having occluded attributes. Edges with attribute green are of the topic not of interest (2) and red edges are of the topic of interest (1). Pairs of red vertices (regardless of occlusion) are connected according to the probability distribution over topics $s$, while pairs of vertices where at least one is labeled green are connected according to $p$. The vertex nomination task is to select one vertex from the candidate set $V \setminus \mathcal{M}'$ (the vertices with occluded attributes, shown here as open circles) that is in $\mathcal{M} \setminus \mathcal{M}'$ (truly red).

3 Methods

3.1 Statistics

We employ test statistics, based on the content and context of each vertex and its communications, to rank-order the candidate set $V \setminus \mathcal{M}'$ for nomination. Consider a vertex $v$ in the candidate set $V \setminus \mathcal{M}'$. If $p_2 = s_2$ and $p_1 < s_1$, then $v \in \mathcal{M} \setminus \mathcal{M}'$ will have a stochastically larger value for both the number of known red vertices adjacent to $v$ and the number of red edges incident to $v$. This gives rise to the observation that the posterior probability of class membership $\rho(v) = P[\phi_V(v) = 1 \mid G', \phi'_V(v) = 0]$ is monotonically increasing in both the context-only statistic

$$T^0(v) = \sum_{u \in \{w : wv \in E\}} I\{\phi'_V(u) = 1\} \qquad (1)$$

and the content-only statistic

$$T^1(v) = \sum_{uv \in E} I\{\phi_E(uv) = 1\}. \qquad (2)$$

This in turn motivates the class of linear fusion statistics

$$T^\gamma(v) = (1 - \gamma)\, T^0(v) + \gamma\, T^1(v). \qquad (3)$$

Larger scores are more indicative of membership in $\mathcal{M}$. The parameter $\gamma$ determines the relative weight of content and context information. We rank each vertex $v$ in the candidate set $V \setminus \mathcal{M}'$ for nomination as a likely member of $\mathcal{M}$ according to $T^\gamma(v)$. Let $\gamma^\star$ denote the fusion parameter which yields the highest performance.

For our independent edge model $\kappa(n, p, m, s; m')$, the joint distribution of $T^0(v), T^1(v)$ is available: for $v \in V \setminus \mathcal{M}$, we have

$$T^0(v; G) \sim \mathrm{Bin}(m', p_1 + p_2),$$
$$T^1(v; G) \sim \mathrm{Bin}(n - 1, p_1),$$
$$T^1 \mid T^0 = c \sim \mathrm{Bin}\!\left(c, \tfrac{p_1}{p_1 + p_2}\right) +_{\mathrm{ind}} \mathrm{Bin}(n - 1 - m', p_1),$$

while for $v \in \mathcal{M} \setminus \mathcal{M}'$, we have

$$T^0(v; G) \sim \mathrm{Bin}(m', s_1 + s_2),$$
$$T^1(v; G) \sim \mathrm{Bin}(m - 1, s_1) +_{\mathrm{ind}} \mathrm{Bin}(n - m, p_1),$$
$$T^1 \mid T^0 = c \sim \mathrm{Bin}\!\left(c, \tfrac{s_1}{s_1 + s_2}\right) +_{\mathrm{ind}} \mathrm{Bin}(m - 1 - m', s_1) +_{\mathrm{ind}} \mathrm{Bin}(n - m, p_1).$$

For each candidate $v \in V \setminus \mathcal{M}'$, we calculate $T^\gamma$, where $\gamma \in [0, 1]$.
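As a concrete sketch (ours, not the authors' code), $T^0$, $T^1$, and $T^\gamma$ can be computed in a few lines, assuming $\phi_E$ is stored as an $n \times n$ symmetric integer matrix (0 = no edge, 1 = red, 2 = green) and $\phi'_V$ as a vector with 1 marking identified vertices:

```python
import numpy as np

def fusion_statistic(phi_V_obs, phi_E, gamma):
    """T^gamma(v) = (1 - gamma) * T^0(v) + gamma * T^1(v) for every vertex v.

    T^0(v): number of identified (observed-red) vertices adjacent to v (context).
    T^1(v): number of red-attributed edges incident to v (content).
    """
    adjacent = (phi_E > 0).astype(int)            # any edge, regardless of topic
    T0 = adjacent @ (phi_V_obs == 1).astype(int)  # context-only statistic
    T1 = (phi_E == 1).sum(axis=1)                 # content-only statistic
    return (1 - gamma) * T0 + gamma * T1
```

Ranking the candidate set by this score (largest first) yields the nomination order; $\gamma = 0$ recovers the context-only statistic and $\gamma = 1$ the content-only one.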
For some plots we select a few illustrative values of $\gamma$, rather than plotting the entire range: $\gamma = 0$ for context-only (represented on plots by "X"), $\gamma = 1$ for content-only (represented on plots by "N"), $\gamma = 0.5$ for (one particular instantiation of) fusion of content and context (represented on plots by "+"), and $\gamma = \gamma^\star$ for the linear fusion of content and context with the optimal performance (represented on plots by "*").

For a given $\gamma$, we rank vertices for nomination according to $T^\gamma$ and consider the ordered candidates $v^\gamma_{(1)}, v^\gamma_{(2)}, \ldots, v^\gamma_{(n-m')}$. E.g., considering vertices in the candidate set $V \setminus \mathcal{M}'$, we have $v^\gamma_{(1)} = \arg\max_v T^\gamma(v)$, $v^\gamma_{(2)}$ is the vertex associated with the second largest value of $T^\gamma(v)$, etc. We evaluate the efficacy of each $T^\gamma$ according to three evaluation criteria, described below.

3.2 Evaluation Criteria

• Probability Correct

If we nominate one vertex based on the values of the linear fusion statistic, then performance can be measured based on whether this nominee is in fact truly red – the success at rank 1 (S@1):

$$S@1(\gamma) = I\{v^\gamma_{(1)} \in \mathcal{M} \setminus \mathcal{M}'\}. \qquad (4)$$

For a random experiment, we consider $E[S@1(\gamma)]$.

• Mean Reciprocal Rank

Reciprocal rank (RR) is a measure of how far down a ranked list one must go to find the first truly red vertex:

$$RR(\gamma) = \left(\min\{i : v^\gamma_{(i)} \in \mathcal{M} \setminus \mathcal{M}'\}\right)^{-1}. \qquad (5)$$

For a random experiment, we consider the mean reciprocal rank

$$MRR(\gamma) = E[RR(\gamma)]. \qquad (6)$$

• Mean Average Precision

Average precision (AP) examines the placement within a ranked list of all truly red vertices – the average of the precision at the rank of each truly red vertex. We define precision at rank $r$ as

$$\mathrm{Pre}(r, \gamma) = \frac{\sum_{i=1}^{r} I\{v^\gamma_{(i)} \in \mathcal{M} \setminus \mathcal{M}'\}}{r} \qquad (7)$$

and average precision as

$$AP(\gamma) = \frac{\sum_{i=1}^{|V \setminus \mathcal{M}'|} I\{v^\gamma_{(i)} \in \mathcal{M} \setminus \mathcal{M}'\}\, \mathrm{Pre}(i, \gamma)}{|\mathcal{M} \setminus \mathcal{M}'|}. \qquad (8)$$

For a random experiment, we define the mean average precision

$$MAP(\gamma) = E[AP(\gamma)]. \qquad (9)$$
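The three criteria are straightforward to compute for a single ranking. Here is a minimal sketch (our code; the function and variable names are ours), which breaks score ties by position order:

```python
import numpy as np

def evaluate_ranking(scores, truly_red, candidate):
    """Rank candidate vertices by score and compute S@1, RR, and AP.

    scores    : T^gamma value per vertex
    truly_red : boolean mask, phi_V(v) == 1 (including occluded reds)
    candidate : boolean mask, vertices with occluded attributes (V minus M')
    """
    idx = np.flatnonzero(candidate)
    order = idx[np.argsort(-scores[idx], kind="stable")]   # best score first
    hits = truly_red[order]                                # positions of M \ M'
    s_at_1 = float(hits[0])
    first = int(np.argmax(hits)) + 1                       # rank of first red
    rr = 1.0 / first if hits.any() else 0.0
    ranks = np.flatnonzero(hits) + 1                       # ranks of all reds
    precisions = np.arange(1, len(ranks) + 1) / ranks      # Pre at each red rank
    ap = precisions.mean() if hits.any() else 0.0
    return s_at_1, rr, ap
```

Averaging `s_at_1`, `rr`, and `ap` over Monte Carlo replicates yields the estimates of $E[S@1(\gamma)]$, $MRR(\gamma)$, and $MAP(\gamma)$.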
Section 4 demonstrates that these measures are all highly correlated, as is further explored by Buckley and Voorhees (2005). For our experiments, the relative ranking of vertex nomination methods is consistent across evaluation measures.

4 Simulation Experiments

We evaluate the performance of content and context fusion via simulation in the $\kappa(n, p, m, s; m')$ model. We consider $p, s$ in the standard 2-simplex $S_2 = \{x \in \mathbb{R}^3 : x_i \geq 0, \sum_i x_i = 1\}$, constrained so that $s_2 = p_2$ (so the probability of green content being present is the same throughout the graph) and $s_1 > p_1$ (so the probability of red content being present is greater for edges that connect pairs in $\mathcal{M}$ than for all other pairs). (Note that this implies $s_0 < p_0$, so also the overall connectivity probability is greater for edges that connect pairs in $\mathcal{M}$ than for all other pairs.) We use $n = 184$ (the number of actors in our Enron email corpus) and consider various values of $m, m'$ such that $n \gg m > m' > 0$. We assess performance using the three evaluation criteria, $E[S@1(\gamma)]$, $MRR(\gamma)$, and $MAP(\gamma)$, introduced in Section 3.2, for $\gamma \in [0, 1]$.

Figure 2 presents performance using $MAP(\gamma)$ for $\kappa(n = 184, p = [0.6, 0.2, 0.2]', m, s = [0.4, 0.4, 0.2]'; m' = m/4)$ as we vary $m$. For small $m$, all our fusion statistics perform equally poorly. (The far leftmost point in Figure 2 represents $m = 4$ and $m' = 1$, where almost no information is available.) As $m$ (and hence $m' = m/4$) increases, fusion of content and context provides superior performance: $\gamma = 0.5$ and $\gamma = \gamma^\star$ are superior to either $\gamma = 1$ or $\gamma = 0$ alone.

Figure 3 generalizes the results presented in Figure 2, presenting performance as we vary $m'$ (the proportion of $m$ with observed attributes) for all three of our evaluation criteria. Again, for small $m$, all perform equally poorly (approximately chance).
As we vary the ratio of $m$ to $m'$, we see that $T^{0.5}$ again is superior to either $T^1$ or $T^0$ alone in some cases ($m' = \frac{m}{4}, \frac{m}{2}$), but $T^0$ is superior to $T^1$ and $T^{0.5}$ in other cases ($m' = \frac{3m}{4}$). (Chance, indicated by the dashed green line, is not the same throughout Figure 3, since we fix $n$ but vary $m'$: the number of correct answers left in the candidate set $\mathcal{M} \setminus \mathcal{M}'$, from left to right, is $\frac{3m}{4}, \frac{m}{2}, \frac{m}{4}$. The performance of $T^1$ changes across plots only because of these variations in chance performance.)

Figure 4 generalizes the results presented in Figure 2, by showing performance (measured in average precision) as a function of $\gamma$, with $\gamma$ free to vary over $[0, 1]$. Let $k$ be the integer such that $v^\gamma_{(k)}$ is the $y$th highest ranked true but unknown red vertex; then

$$AP_y(\gamma) = \frac{\sum_{i=1}^{k} I\{v^\gamma_{(i)} \in \mathcal{M} \setminus \mathcal{M}'\}\, \mathrm{Pre}(i, \gamma)}{y}. \qquad (10)$$

Figure 2: Context, content, arbitrary linear fusion, and optimal linear fusion ($\gamma = 0, 1, 0.5, \gamma^\star$, respectively) results for $MAP(\gamma)$ in the $\kappa(n = 184, p = [0.6, 0.2, 0.2]', m, s = [0.4, 0.4, 0.2]'; m' = m/4)$ model, as we vary $m$. We plot $m$ on the x-axis and $MAP(\gamma)$ on the y-axis. Content ($\gamma = 1$) is represented by points labeled "N", context ($\gamma = 0$) by points labeled "X", arbitrary linear fusion ($\gamma = 0.5$) by points labeled "+", and optimal linear fusion ($\gamma = \gamma^\star$) by points labeled "*". Results are obtained via 1000 Monte Carlo replicates. The green dashed line denotes chance performance.
Figure 3: The performance of $\gamma = \{0, 1, 0.5, \gamma^\star\}$ according to $E[S@1(\gamma)]$, $MRR(\gamma)$, and $MAP(\gamma)$ (top, middle, and bottom, respectively). Columns, from left to right, represent $m' = \frac{m}{4}, \frac{m}{2}, \frac{3m}{4}$. The x-axis represents increasing values of $m$ and the y-axis represents the evaluation criterion. Results are obtained via 1000 random graphs generated according to the $\kappa(n = 184, p = [0.6, 0.2, 0.2]', m, s = [0.4, 0.4, 0.2]'; m')$ model. As in the previous figure, lines with "X" markers denote context alone, those with "N" markers denote content alone, those with "+" markers denote $\gamma = 0.5$, and those with "*" markers denote $\gamma = \gamma^\star$. The dashed line with no markers denotes chance performance.
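The Monte Carlo procedure behind Figures 2 and 3 can be sketched end to end. The following self-contained illustration is ours, not the authors' code; it estimates $MAP(\gamma)$ in the $\kappa$ model by sampling graphs, scoring candidates with $T^\gamma$, and averaging average precision over replicates. It assumes $m > m'$ (so the candidate set always contains a red vertex), and a pure-Python edge loop, so it should be run with far fewer vertices and replicates than the paper's $n = 184$ and 1000.

```python
import numpy as np

def mc_map(n, p, m, s, m_prime, gamma, reps=200, seed=0):
    """Monte Carlo estimate of MAP(gamma) in the kappa(n, p, m, s; m') model.

    Assumes m > m_prime so at least one truly red vertex is in the candidate set.
    """
    rng = np.random.default_rng(seed)
    p, s = np.asarray(p), np.asarray(s)
    aps = []
    for _ in range(reps):
        red = np.arange(n) < m                  # first m vertices are truly red
        # Draw edge attributes: red-red pairs from s, all other pairs from p.
        phi_E = np.zeros((n, n), dtype=int)
        for u in range(n):
            for v in range(u + 1, n):
                probs = s if (red[u] and red[v]) else p
                phi_E[u, v] = phi_E[v, u] = rng.choice(3, p=probs)
        # Reveal m' red vertices at random; the rest are occluded.
        identified = np.zeros(n, dtype=bool)
        identified[rng.choice(m, m_prime, replace=False)] = True
        T0 = (phi_E > 0).astype(int) @ identified.astype(int)   # context
        T1 = (phi_E == 1).sum(axis=1)                           # content
        Tg = (1 - gamma) * T0 + gamma * T1
        # Rank the candidate set and compute average precision.
        cand = np.flatnonzero(~identified)
        order = cand[np.argsort(-Tg[cand], kind="stable")]
        ranks = np.flatnonzero(red[order]) + 1
        aps.append((np.arange(1, len(ranks) + 1) / ranks).mean())
    return float(np.mean(aps))
```

Sweeping `gamma` over a grid and comparing `mc_map` values reproduces, in miniature, the comparison of context-only, content-only, and fused statistics.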
