Table Of ContentWho With Whom And How? - Extracting Large Social
Networks Using Search Engines
Stefan Siersdorfer, Philipp Kemkes, Hanno Ackermann, Sergej Zerr
L3SResearchCenter,Hannover,Germany
{siersdorfer,kemkes,zerr}@L3S.de
ackermann@tnt.uni-hannover.de
7
ABSTRACT
1
0 Social network analysis is leveraged in a variety of applica-
2 tions such as identifying influential entities, detecting com-
n munities with special interests, and determining theflow of
a information and innovations. However, existing approaches
J for extracting social networks from unstructured Web con-
8 tentdonotscalewellandareonlyfeasibleforsmallgraphs.
2 In this paper, we introduce novel methodologies for query-
based search engine mining, enabling efficient extraction of
] social networks from large amounts of Web data. To this
I
end, we use patterns in phrase queries for retrieving entity
S
connections, and employa bootstrappingapproach for iter-
.
s atively expandingthe pattern set. Our experimental evalu-
c
ationindifferentdomainsdemonstratesthatouralgorithms
[
provide high quality results and allow for scalable and effi-
1 cient construction of social graphs.
v
Categories andSubject Descriptors
5
8 H.3.3[InformationSystems]: Informationstorageandre-
2 trieval; E.1 [Data Structures]: Graphsand networks
8 Figure 1: Subgraph of a network extracted using our ap-
0 Keywords proachalongwithexamplesforconnectingphrasepatterns.
.
1 pattern based queries,social network extraction
0 items and new contacts to their users. Furthermore, know-
7 1. INTRODUCTION ingthetopologyofsocialgraphscanshedlightontheprop-
1 agationofideasandtrustinsocialnetworks[36, 1, 12],and
Networkingandcommunicationarenaturalhumanactiv-
: can enhance various IR applications such as personalized
v ities that are massively supported through modern infor-
queryexpansion and media recommendation [4, 13].
i mation technologies, most prominently the internet. Large
X In order to enable the analysis of social networks they
networks of people are present in many parts of the Web,
havetobeextractedfromunderlyingdatainthefirstplace.
r for instance, in form of contacts and friends in Social Web
a Some sources already provide explicit and easy to extract
platforms, co-occurring named entities in web pages, or co-
information about user relations. This includesonline plat-
authors of articles in scientific portals. Social networks ex-
forms such as Facebook, Twitter, or LinkedIn that main-
tractedfromtheseinformationsourcesareexploitedforvar-
tain user databases and offer software interfaces for ac-
ious purposes: Node centrality measures in networks can
cessing contacts, friends, or followers. Other sources - for
help to identify important and influential persons or com-
instance, email corpora revealing communication links be-
munitiesinareassuchasentertainment,science,orpolitics.
tween persons, or scientific portals comprising information
Informationonsocialconnectionsincustomernetworkscan
onco-authorship-containmoreimplicit,yeteasytoextract
be leveraged as part of recommender systems that suggest
network information. However, in many cases information
aboutsocialconnectionsishiddenwithinunstructureddata
Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalor
such as Web pages, archives,and multimediarepositories.
classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed
Search engines constitute our main access points to the
forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcita-
tiononthefirstpage. Copyrightsforcomponentsofthisworkownedbyothersthan Web,andthereexistsanumberofworksthatemploysearch
ACMmustbehonored.Abstractingwithcreditispermitted.Tocopyotherwise,orre- queries for detectingco-occurrences of entities and leverage
publish,topostonserversortoredistributetolists,requirespriorspecificpermission theseconnectionsforextractingsocialnetworks[20,25,30].
and/orafee.RequestpermissionsfromPermissions@acm.org. However, the suggested algorithmic approaches do not
CIKM’15,October19–23,2015,Melbourne,Australia.
scale well and are only suitable for rather small datasets
(cid:13)c 2015ACM.ISBN978-1-4503-3794-6/15/10...$15.00.
comprising just a few hundredor a few thousandnodes.
DOI:http://dx.doi.org/10.1145/2806416.2806582.
In this paper, we exploit pattern-based search engine textual patterns [2, 22]. In [32] a bootstrapping algorithm
queries for efficient acquisition of large scale networks from is employed for iteratively discovering new patternsand se-
unstructured Web data. These queries consist of entities manticconnections. In[33]and[5]categoricalfactsandcon-
(e.g.“BarackObama”)andconnectingpatterns(e.g.“meets cepthierarchiesareextractedfromwebandnewscorporaas
with”, “and his wife”) leveraged for mining links between entity-pattern-entity triples. Cimiano et al. [7, 8] leverage
persons. Using such search patterns we can simultaneously pattern-based search engine queries for building ontologies.
explore a large number of connections with a single query. In [16] comparable entities such as consumer items are ex-
Startingwithaninitialsetofentities,weiterativelyexpand tracted using an expandable set of contrasting expressions
both entity pair and pattern sets. We identify connections (e.g.“vs.”and“insteadof”);Jiangetal.[17]usesimilarrela-
whicharelikelytobeimportantandusethisinformationto tionsforconstructingentitygraphsandproductrecommen-
formulate subsequent queries, thereby greatly reducing the dations in web search. The KnowItAll system [11] employs
number of search engine requests. For illustration, a small asupervisedlearningapproachforvalidatingfactsextracted
excerpt of aconstructed graph is shown in Figure 1. usingsearchenginequeriesbasedonmanuallyprovidedpat-
tern templates. Open information extraction systems such
The main contributions of this work can be summarized
as TextRunner [35] employ shallow parsing, and, as pat-
as follows:
terns,employnormalizedexpressionsbetweennounphrases
in an unsupervised manner. The output of these systems
• Weintroduceintelligentprioritizationapproachesforef- are databases that contain fact tuples such as “(Berkley,
ficientlyexpandingsocial networksusingpattern based located in, Bay Area)”or“(Oppenheimer, professor of, the-
search engine queries. oretical physics)”. In contrast to works focusing on general
fact mining, we providea novelmethodology for extracting
• Weemploya bootstrappingapproach forcovering mul-
social networks. Our contributions include the systematic
tiple aspects of social relations by iteratively extending
construction of connected social graphs, their efficient and
search pattern and entity sets for discovering domain
intelligent expansion, as well as the analysis of their struc-
specific social connections.
tural properties.
• Weconductanextensiveexperimentalevaluationdemon- The first approach employing search engines for network
stratingthehighefficiencyandaccuracyofourmethods mining is probably the work by Kautz et al. [20]. The au-
clearlyoutperformingthebaseline. Furthermore,wean- thors exploit co-occurrences of names on result pages to
alyzevariousaspectsoftheextractedsocialgraphs,such identifyconnectionsbetweenpersons;theresultingnetworks
as their structural properties, connecting patterns, and are relatively small and just centered around a single per-
strength of social relations. son. The POLYPHONET system [25, 26] determines co-
occurrencecountsinGooglequeriestoidentifypairwisecon-
The remainder of this paper is organized as follows: Sec- nections. In addition, different types of relationships and
tion 2 discusses related work on social network construc- correlations with topic-related keywords are taken into ac-
tionfromsemi-structureddatasources,searchenginebased count to disambiguate entities. In [6] search engines are
network mining, and extraction of semantic relationships. used to retrieve connected email addresses. The Flink sys-
In Section 3 we describe our methods for network extrac- tem[27]focusesontheminingofauthornetworksandcom-
tionincludingsearchpatternsandtheirexpansionbasedon bines web search with information from emails and pub-
bootstrappingaswellasdifferentsearchprioritizationcrite- lications. In [29] co-occurrences of persons in search result
ria. The evaluation presented in Section 4 studies the cost snippetsareleveraged. Thereareafewattemptstoincrease
efficiency and accuracy of our methods, and provides fur- efficiencyindiscoveringentityrelationsontheweb [15,30];
therinsightsaboutthestructureoftheextracted networks. they focus on snippet clustering and entity disambiguation
Finally, in Section 5 we conclude and describe directions of to reduce the number of search requests. However, all of
our futurework. the current methods are based on pair-wise entity queries,
making them infeasible for mining larger graphs.
2. RELATED WORK To the best of our knowledge, our methods are the first
to go beyond pair-wise querying for social connections and
useintelligentquerypatternconstructionandprioritization
There is a plethora of work on social network extraction
approaches for graph expansion that allow for tunable and
fromtextandvisualdata. In[9],forinstance,socialconnec-
larger scale network construction.
tions between fictional characters are inferred from dialogs
inbooks,and,similarly,in[19]asocialnetworkisextracted
from the narrative of an Ottoman scholar and world trav-
3. SOCIALGRAPHMINING
eler. In [10] social connections are constructed from a his-
torical multimedia repository by leveraging co-occurrences In this section we describe our methodology for mining
offacesinimages. Birdetal.[3]extractsocialnetworksus- social graphs using search engines. To this end, we first in-
ing information aboutsenders andrecipientsobtained from troduce an efficient basic algorithm using web search with
headers in email corpora. Social networks extracted from a fixed pattern set for identifying connections between per-
other (semi-)structured sources include academic networks sons. Wethenextendthismethodbyaniterativeapproach
or co-citation graphs in publication databases [28, 34]. for automatically collecting new patterns. Finally, in order
Information extraction tackles the problem of deriving to further improve the cost efficiency of network construc-
structured information from unstructured data. There is tion, we introduce methods for prioritizing nodes for graph
a body of work on semantic relationship mining based on expansion.
3.1 Mining GraphswithaStaticSetofQuery
Algorithm1:GraphConstructionwithPatternSetEx-
Patterns
pansion
Given an initial seed set of entities I and a set of pat- Input: I: initial set of entities
terns P, we use connectivity search queries of the form IP: initial set of patterns
<entity> <p>(where<p>isaconnectingpatternase.g. in τ: threshold for edge weights
“Barack Obama meets”) in order to identify links to other σ: threshold for pattern weights
entities. We iteratively issue these queries over an initial maxIter: maximum numberof iterations
seed set of entities (e.g. a list of a politicians). In this Output: social graph G=(V,E)
way we obtain links between entities in the seed set as well 1 begin
as new entities. This is repeated for all patterns in our 2 E ←∅
pattern set (in our experiments we used a small initial set 3 V ←I
consistingofthesingleterm“and”). Forsnippetsofthetop- 4 Cand←I // graph extension candidates
k search results in Bing we check for exact strings of the 5 P ←IP // pattern set
form <entity> <p> <entity> where both entities match a 6 iter←0
person name in a database (we were using entities from 7 repeat
DBpedia and IMDb in our experiments).1 The process of 8 NewCand←∅
querying entities is repeated for the newly discovered enti- 9 NewPatterns←∅
ties in a“breadth-first”manner. The resulting (undirected) 10 for e∈Cand do
edges between two entities are weighted by the overall oc- 11 edges←search(e,P,τ) // web search for
currence count in the snippets across all patterns, omitting edges with node e, using query patterns
edges whose weight is below a threshold τ. P and weight threshold τ
Algorithm 2 shows thedetails of thealgorithm. The net- 12 for ({e,ei},w)∈edgesdo
work and a priority queue of nodes are initialized based on 13 if {e,ei}6∈E then
the entity seed set (lines 2–4). Here the priority queue is 14 V ←V ∪{ei}
a simple“first-in-first-out”queue; in Section 3.3 we will in- 15 NewCand←NewCand∪{ei}
troduce more enhanced mechanisms for prioritization. The 16 E ←E∪{({e,ei},w)}
nodes in the queue serve as input for pattern-based queries 17 else
(line8),andqueryresultsareusedfor expandingthesocial 18 E ←E\{({e,ei},x)}
network and updatingthe queue(lines 9–17). The network 19 E ←E∪{({e,ei},x+w)}
expansion continues until the query budget is spent or un-
til there are no unvisited nodes left (line 19). The running 20 Cand←Cand∪NewCand
timeofthealgorithmislinearinthesumofnodesandedges 21 NewPatterns←pSearch(E,σ) // web search
visited and found. for patterns, using edges from E, and
weight threshold σ
3.2 Iterative PatternMining
22 P ←P ∪NewPatterns;iter←iter+1
We extend the Graph Mining method described above 23 until iter=maxIter;
through an approach for iteratively discovering new pat- 24 return (V,E)
terns. To this end, we start with an initial set of seed pat-
terns and extend it in each round of the algorithm, taking
into account theweights of edges extracted in that round.
We then iterate over the top-h resulting edges with
the highest weight and use them to issue queries of The details of the algorithm are shown in Algorithm 1.
the form “<entity>” “<entity>”. Snippets of the Seed patterns and entities are first used to initialize the
search results are then checked for strings of the form bootstrapping approach (lines 2–5). Then, for a fixed num-
<entity> <c> <entity>, where <c> is an arbitrary string ber of iterations, the social network is expanded based on
betweentwodifferententities;<c>becomesacandidatefor the current query patterns (lines 11–19), and, conversely,
a new pattern. Among the candidates we want to identify the pattern set is expanded using the current edges in the
patterns with high coverage in the result snippets. To this network(lines21–22). Therunningtimeofthealgorithmis
end, let n be number of occurrences of candidate pattern linear in the sum of nodes and edges visited, multiplied by
<c> (support), m be the number of distinct entity pairs thenumberof patternsfound.
(diversity) and d be the number of distinct domains (spam
resistance)comprisingthepattern. Weinclude<c>intothe 3.3 Prioritization of Nodes for Network Ex-
pattern set if n·m·d2 is above a threshold σ. Extracting pansion
top-k patterns can be managed effectively using a priority
The algorithms described so far expand the social graph
queuewithlog(n)runningtimeforoperationslikeinsertion
in a “breath-first-search” manner. The advantage of this
and updating. This pattern mining method is employed af-
approach is its good coverage: Seedset and subsequentially
tereachcompleteiterationofthealgorithmdescribedinthe
found nodes are systematically visited and expanded until
previous paragraph.
our budget of connectivity queries is exhausted. However,
1 in order to further optimize budget usage we might want
Wedecidedtokeeptheentityrecognitionsub-componentsimple
to put higher priority on more“essential”nodes during the
and lightweight. More enhanced NLP and disambiguation tech-
niquesareoutofthescopeofthiswork,butcouldhelptofurther networkexpansionprocess. Inthefollowingwefocusontwo
improveouralreadyverygoodresults. criteria for prioritizing nodes: popularity and novelty.
4. EVALUATION
Algorithm 2: Graph Construction with Static Pattern
Set In this section we evaluate the network extraction meth-
Input: I: initial set of entities ods described in Section 3, starting with entities from two
P: set of patterns domains: politics and movies. The objective of our evalua-
τ: threshold for edge weights tion was to study (1) the cost efficiency of different strate-
maxReq: maximum numberof requests gies in terms of network size obtained with a given budget
pQueue: A container which sorts thecandidates of search engine requests, (2) the accuracy of the extracted
according to theemployed algorithm networks, and (3) the structure of the social graphs as well
Output: social graph G=(V,E) as properties of discovered social relations. In our exper-
1 begin iments, more than 300,000 persons and 1.4 million social
2 E ←∅ // Edges connections were retrieved using two seed sets containing
3 V ←I // Vertices just 342 politicians and 100 actors. The resulting networks
4 pQueue.addAll(I)// graph extension candidates are coined WikiNet and IMDbNet.
5 requestCnt←0
4.1 Setup
6 repeat
7 e←pQueue.pop()
8 (edges,requests)←search(e,P,τ)// web DomainsandSeedSets. Forinitializingouralgorithmswe
search for edges with node e, using query used two entity seed sets from political and movie domains
patterns P and weight threshold τ as follows:
9 for ({e,ei},w)∈edgesdo
10 if {e,ei}6∈E then • WikiNetseed: Weextracted342politiciannamesfrom
11 V ←V ∪{ei} the Wikipedia list of the current heads of state and
12 E ←E∪{({e,ei},w)} government from all over theworld2.
13 if ei 6∈pQueue then
14 pQueue.push(ei) • IMDbNetseed: Weconsideredthe100currentandfor-
mer leading actors as listed in IMDB3.
15 else
16 E ←E\{({e,ei},x)}
In order to simplify named entity recognition in search en-
17 E ←E∪{({e,ei},x+w)}
ginesnippets,werestrictedthesetofpossiblepersonsinour
18 requestCnt←requestCnt+requests WikiNettothoselistedinDBpediaVersion2014 (containing
19 until requestCnt≥maxReq or pQueue.empty(); about1milliondistinctpersonnames)andintheIMDbNet
20 return (V,E) tonamesfromtheIMDbdirectory(about2milliondistinct
names).
Implementation Details. We employed the Bing Search
EngineAPIforissuingournetworkexpansionqueries. Pre-
liminary tests showed that in most cases the overall result
list returned per query did not exceed k = 200 entries, al-
Popularity in this context intuitively corresponds to the though the estimated number of results often went beyond
importance of nodes within the network. We employ the thousands. Therefore, we issued 4 requests per query as
numberofinlinksρinthecurrentversionofthesocialgraph Bing provides 50 results per request (if available). In order
during the extraction process as popularity measure. Al- to avoid recurring API requests we simultaneously cached
though very effective in our experiments, one could easily results from previous queries. The search API has some
imagine using alternative node centrality measures such as technicallimitations, includingthefact that a few (special)
PageRank [31],hubscores inHITS[21], oroneoftheirvar- characters,like“&”,“,”,“+”cannotbeusedassearchquery.
ious variants (e.g. [23, 18]). Furthermore, most of the snippets returned for disjunctive
queries do not contain query terms, and, thus, cannot be
Novelty, on the other hand, corresponds to how recently
used for relation extraction.
anodewasaddedinthecourseofournetworkconstruction
process. Intuitively, exploring more recent nodes can help
TestedStrategies. Weevaluated thefollowing methods:
to reach out more quickly to new communities within the
network. Wecombinepopularityandnoveltyintoapriority
• The breadth-first graph mining approach with fixed
measure Φ for a node v using exponential temporal decay
as follows: Φ(v) = ρ·e−α·t, where our time measure t cor- pattern set described in Section 3.1 (BF).
responds to the numberof expansion node steps conducted
• TheprioritizationapproachfromSection3.3,withdif-
since the node was added to the queue, and α is a param-
ferent values for the tunable decay parameter α (0,
eter for balancing the influence of popularity and novelty.
0.005, and 0.01) (Prio).
Exponentialdecayisoftenusedfortimedependentranking
ofitemsinthecontextofIR(seee.g. TempPageRank[37]).
• The iterative pattern mining approach described in
This approach can be directly integrated in the framework Section 3.2 (PatternIter).
providedinAlgorithm2usingapriorityqueuebasedonthe
Φ score, and could, in principle, also be integrated into the 2http://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_governme
dynamicpattern mining process (Algorithm 1). 3http://www.imdb.com/list/ls050274118/
Algorithm Nodes Edges • Ournewmethodsclearlyoutperformthebaselinewhich
baseline t=0.1 6,956 13,571 does not scale well for expanding entity sets and pro-
duces much smaller networks for the given budget.
baseline t=0.4 2,925 3,838
Even for a small co-occurrence threshold t = 0.1 the
BF 113,988 368,806
WikiNet networksobtainedbythebaselinemethodarealready
decay α=0 98,479 456,223 betweenoneandtwoordersofmagnitudesmaller. For
decay α=0.005 110,260 346,663 t = 0.4 (suggested in POLYPHONET for extracting
decay α=0.01 116,081 323,372 relations of acceptable quality) the reduction in net-
work growth becomes even more apparent.
baseline t=0.1 7,159 34,903
baseline t=0.4 2,192 6,666
Note, that in addition to the number of search engine
BF 109,453 376,429 requests taken into account for this evaluation, the base-
IMDbNet
decay α=0 107,085 425,830 line also introduces a substantial overhead by requiring the
decay α=0.005 112,736 264,679 downloadandprocessingofalargesetofWebpages(about
decay α=0.01 112,782 242,184 85,000 pages in ourexperiments). Incontrast, ourmethods
avoid these extra costs by working directly on search result
Table 1: Number of nodes and edges obtained for different snippets.
strategies. Figure2providesadditionaldetailsaboutthenetworksize
We set the edge weight threshold τ to 2 for all algorithms. development with respect to the amount of search engine
For PatternIter we set the cut-off value σ for pattern se- requestsspent. Theeffectofastrongeremphasisonnovelty
lection to 5, and used value h = 100 for top connections (corresponding to higher α) is reflected in a slight increase
employed for discovering new patterns. The initial query in the number of new nodes, starting from around 50k –
pattern sets used in our experiments consisted only of the 100k requests. On theotherhand, prioritization of popular
generalpattern“and”;forextractingconnectionsfrom snip- nodeswithhighin-degree(lowerα)leadstoahighincrease
pets we employed an additional small set of manually se- in the number of edges (starting from around 10k requests
lected patterns4. for both datasets).
Baseline. All of the current works on social graph con- 4.3 Network Properties
struction using search engines are based on pairwise entity In the following we describe structural properties of the
queries. Asbaselinewetestedthemethoddescribedin[25] extracted networks as well as the development of the net-
(POLYPHONET)whichtriestoreducethenumberofpair- works duringtheexecution of ouralgorithms.
wisequeriesbyidentifyingsubsetsofpromisingquerypairs.
Morespecifically,startingwithaseedsetofentities,foreach Node Degree and Edge Weight Distributions. Table 2
entityalistofcandidateentities(containedintheentityset)
provides summary statistics for the distribution of nodes
is extracted from Web pages obtained by querying for the
andedgesintheextractednetworks. Forallexperimentswe
entity. Then queries for the pairs of seed entities and the
observearight-skeweddistributionofnodesandedgeswith
candidates for associated entities are issued in order to ob-
median node degree of 1 or 2 and median edge weight of 3.
tainco-occurrencevaluesbasedonsearchresultcounts,and
(Theedgeweightsinalltheexperimentsfollowapowerlaw
edges with a value above a threshold t are included in the
distributionasexemplarilydepictedinFigure5fortheBF-
graph. In order to allow for network expansion beyond the
constructed WikiNet.) The decay parameter α for thePrio
seedset,weincludedfurtherentitiesfoundintheWebpages
strategyhasaclearimpactonthedistributions: Decreasing
for identifyingadditional querycandidates.
the value of α results in an increase of the means for both
4.2 CostEfficiency node degree and edge weights due to the higher prioritiza-
tionofnodeswithalreadyhighdegree,andalsoresultsina
Table 1 shows the number of nodes and edges in the ex-
wider“spread”ofthevaluesasreflectedbyahigherstandard
pandednetworkscomputedbyour methodsusing thesame
deviation.
seed sets and exhaustinga fixedbudget of 200k queries per
Figure3showsthedegreedistributionsofthenodesinthe
run. The main observations are thefollowing:
constructednetworks. Thedistributionforthebreadth-first
• Starting with the small seed sets consisting of just a
(BF) strategy is close to a power law distribution. For the
fewhundredentitiesdescribedinthesetup,ourmeth-
Priostrategiesthedescribedeffectforthedecayparameterα
ods are able to extract networks containing around
interfereswiththepowerlaw,resultingintwolocalextrema:
100,000entities,andbetween200,000and450,000edges
a minimum for nodes with lower in-degree followed by a
for both domains. Our quality oriented evaluation in
maximum for nodeswith higherin-degree.
Section 4 will demonstrate thehigh accuracy of edges
generated byour methods. GraphStructures. Figure 4 visualizes the social networks
• The tunable decay parameter α in the prioritization obtained fordifferent methodsand seed sets. Fortheprior-
approachPrioisapplicablefortradingnoveltyagainst itization strategy Prio the decay parameter α has a strong
popularity. This is reflected in both domains by the influence on the network structure: For a stronger focus
larger number of social relations extracted for lower on popularity (α = 0) the larger number of edges results
values of α and the larger number of nodes found for in higher graph density; putting more emphasis on novelty
higher valuesof α. (α = 0.01) leads to a wider spread of the graph, providing
a bird’s eye view of the discovered space. A more in-depth
4
Thefollowingpatternswereused:“meets”,“ ”,“&”,“,”,“speaks
explorationofcommunitystructuresissubjecttoourfuture
with”,“und”,“et”,“y”,“-”
work.
Node Degree Edge Weight
Algorithm Mean SD Median Mean SD Median
BF 6.47 12.53 2 8.53 23.62 3
decay α=0 9.27 16.24 2 8.84 23.93 3
WikiNet
decay α=0.005 6.29 9.46 2 7.65 18.43 3
decay α=0.01 5.57 8.39 2 7.65 17.95 3
BF 6.88 14.03 1 8.69 24.11 3
decay α=0 7.95 15.43 1 8.97 25.04 3
IMDbNet
decay α=0.005 4.70 7.60 2 7.43 19.14 3
decay α=0.01 4.29 6.74 2 7.35 18.67 3
Table 2: Summarystatistics for node degrees and edge weights in theextracted networks.
500k 120k
BF BF
α=0 α=0
400k α=0.005 α=0.005
α=0.01 90k α=0.01
es 300k es
dg od 60k
#E 200k #N
30k
100k
0 0
0 50k 100k 150k 200k 0 50k 100k 150k 200k
#Requests #Requests
(a) WikiNet edges (b) WikiNet nodes
500k 120k
BF BF
α=0 α=0
400k α=0.005 α=0.005
α=0.01 90k α=0.01
es 300k es
dg od 60k
#E 200k #N
30k
100k
0 0
0 50k 100k 150k 200k 0 50k 100k 150k 200k
#Requests #Requests
(c) IMDbNetedges (d) IMDbNetnodes
Figure 2: Nodesand edges versusrequests spent for WikiNet and IMDbNet.
Example Connections. Finally, Table 3 lists the top-15 reference connections. One of the connected pairs was ob-
connections in the IMDbNet found by the BF method. tained usingourmethod andtheotherconsisted ofpersons
Among those connections are famous co-actors (“Terence uniformly sampled from the entity set (the order of these
Hill”and“BudSpencer”),romanticcouples(“RyanGosling” two options was randomized). Users were asked the follow-
and“Eva Mendes”), or a combination of both (“Brad Pitt” ing question: “Which of the given pairs has the stronger
and“AngelinaJolie”-thestrongestconnectionobtained). In relation?”,andwereencouragedtouseanysourceincluding
addition, we observe that our methods are also capable of searchenginestoanswerthisquestion(asanaidweprovided
discovering connections from other domains (“Sergey Brin” links to a Google search page). The assessors consisted of
and“Larry Page”). 15 undergraduate students of Computer Science and other
disciplines, and 5 graduatestudentsof Computer Science.
4.4 QualityofExtracted Networks Table 4 shows that all of our methods achieved accuracy
valuesbetween95%and99%,demonstratingthecorrectex-
Weconductedalargeuserstudytoevaluatethequalityof
traction of almost all tested edges. Overall, for 4694 out of
theconstructednetworkswithoverall5,600evaluatededges.
4800 edge pairs the users preferred the connections discov-
The goal of our evaluation was two-fold: First to check the
eredbyouralgorithms. Wemeasuredinter-rateragreement
correctness of the extracted social connections; second to
with 5users for a subset of 200 edgesfrom each of thecon-
evaluate strength of theconnections in more detail.
structednetworksandobtainedanaveragepairwisepercent
Pairwiseassessment. Determiningthecorrectnessofacon- agreement of 93.7% and 96.5% for WikiNet and IMDbNet,
nectionbetweentwopersonsisadifficulttaskforhumanas- respectively,alsoreflected byhigh Fleiss’ Kappa[14]values
sessors. Therefore,insteadoflettingassessorsdirectlyassign of 0.83 for WikiNet and 0.97 for IMDbNet.
arelevancevalue,wefirstaskedthemtochoosebetweentwo
105 105 105
104 104 104
y y y
nc 103 nc 103 nc 103
e e e
u u u
eq 102 eq 102 eq 102
Fr Fr Fr
10 10 10
1 1 1
1 10 100 1 10 100 1 10 100
Node degree Node degree Node degree
(a) WikiNet BF (b) WikiNet α=0.01 (c) WikiNet α=0.0
105 105 105
104 104 104
y y y
nc 103 nc 103 nc 103
e e e
u u u
eq 102 eq 102 eq 102
Fr Fr Fr
10 10 10
1 1 1
1 10 100 1 10 100 1 10 100
Node degree Node degree Node degree
(d) IMDbNetBF (e) IMDbNet α=0.01 (f) IMDbNetα=0.0
Figure 3: Nodedegree distribution for different algorithms and theWikiNet/IMDbNetseed sets.
(a) WikiNet BF (b)WikiNet α=0.0 (c) WikiNet α=0.01
(d) IMDbNetBF (e) IMDbNetα=0.0 (f) IMDbNetα=0.01
Figure 4: Resulting WikiNet and IMDbNetgraphs for different strategies and decay parameters.
106 Pattern Edges Pairs Domains
105 and 4,230 94 91
y 104 , 1,656 93 87
c & 828 90 95
n
ue 103 806 91 89
q
Fre 102 und 1,124 79 71
; 54 33 25
10 : 36 27 24
- 36 23 20
1
1 10 100 1000 or 33 19 14
Edge weight / 28 22 11
Figure 5: Edge weight distribution for the BF algorithm with 20 15 15
on 23 15 14
applied on WikiNet seed set
y 16 14 13
and hiswife 32 11 10
PERSON 1 PERSON 2 et 16 11 9
Brad Pitt – Angelina Jolie | 18 13 7
Trey Parker – Matt Stone mit 14 12 9
e 14 10 9
SpencerPratt – HeidiMontag
vs. 13 10 9
StephenMoyer – AnnaPaquin
and wife 25 7 7
Bud Abbott – Lou Costello
+ 9 8 7
Robert Pattinson – Kristen Stewart left and 13 7 6
Bruce Robison – Kelly Willis and actor 12 7 7
Ryan Gosling – EvaMendes hat 8 7 8
Dev Patel – Freida Pinto Pictures Photo of 31 28 1
Terence Hill – Bud Spencer
Table 5: Top-25 patternsafter 1st iteration of patternmin-
Sergey Brin – Larry Page
ing algorithm PatternIter applied toIMDbNet
Nick Cannon – Mariah Carey
Tod Williams – Billie Tsien other in person or not”, 4 -“agree - Persons most probably
Josh Dallas – Ginnifer Goodwin knoweachotherinperson”,5-“stronglyagree-Personsare
Eric Dane – RebeccaGayheart strongly related or know each other well (in person)”. We
assessed 600 edges (100 edges per dataset for each of our
Table 3: Top-15 social relations found for IMDbNet using methods) sampled from the generated social networks and
theBF extraction method. mixed in random order with 200 uniformly sampled pairs
of unconnected nodes. Pairs of nodes generated using our
Algorithm Votes Accuracy approaches obtained an average rating of 4.55 (stdv 0.87),
incontrasttoanaverageof1.81(stdv0.99)forunconnected
BF 800 0.969 ± 0.008
nodes. We also found that just 10.2% of the social connec-
WikiNet decay a=0.0 800 0.990 ± 0.007
tionsfound in ournetworksobtained arating of3 orlower.
decay a=0.01 800 0.983 ± 0.009
Theseresultsfurtherconfirmthehighaccuracyofourgraph
BF 800 0.958 ± 0.014 construction methods.
IMDbNet decay a=0.0 800 0.983 ± 0.009
4.5 Pattern ExpansionandAnalysis
decay a=0.01 800 0.986 ± 0.008
Overall 4800 0.978 ± 0.004 In this subsection we show results for the cost efficiency
and accuracy of the iterative pattern mining algorithm de-
Table 4: User evaluation results: accuracy of different scribed in Section 3.2, and analyze the phrase patterns ob-
methods tained for social connections.
Cost Efficiency and Accuracy. We conducted two iter-
Wealsotestedhowmanyoftheedgesinourgraphscorre-
ations of the pattern mining algorithm leading to 100,319
spondedtoco-workersinmoviesaslisted inIMDb. Alarge
requestsforIMDbNetand44,089forWikiNetandobtained
percentage of edges found by our algorithms (e.g. 33.8%
for method BF, and 32.7% for method Prio with α = 0) over 2,100 and over 5,000 distinct patterns (occurring with
frequencyofatleast10)forWikiNetandIMDbNet,respec-
wereco-actinginthesamemovies. However,ouralgorithms
tively. The extracted patterns provide additional informa-
are capable to additionally reveal many relationships be-
tion about type and topical focus of the extracted social
tween actors not appearing in the same movie as captured
connections. OveralltheresultingWikiNetcontained20,622
by IMDb (for instance,“Tom Cruise”and“Katie Holmes”).
nodes and 40,342 edges, and the resulting IMDbNet 35,649
Assessmentofindividualedges. In addition to the pair- nodes and 129,326 edges.
wisechecksdescribedabovewealsodirectlyassessedstrength Ahumanassessmentofedgesextractedbythealgorithm,
and correctness of individual connections using a 5 point which was conducted in the same way as described in Sec-
Likert scale. Levels 1 to 5 corresponded to the following tion 4.4, resulted in accuracy values of more than 0.96 and
answers to the question“Are these people connected?”: 1 0.97 for the graph expansions of the WikiNet and IMDb-
-“strongly disagree - Persons are not connected”, 2 -“dis- Net seed sets, respectively - with 95% confidence intervals
agree-Personsdonotlikelyknoweachotherinperson”,3- of0.01. Thisdemonstratestheviabilityofourapproachfor
“not known - it is not clear whether the persons know each constructinghigh-quality networks.
Pattern terms in WikiNet (politicianseed) Pattern terms in IMDbNet (actor seed)
meet minist presid coach prime star direct photo produc actor
forward former senat rep leader director pictur wife written actress
center beat back guard defens galleri husband latest daughter pic
king gener sai manag mayor plai date welcom film cast
right defend sen defeat chief girlfriend join alongsid marri boyfriend
midfield lineback score receiv captain featur video screenwrit attend movi
left governor secretari foreign tackl imag danc perform show sister
replac khan player against face camilla frontman screenplai dure skarsgard
assist deputi premier state while mother born parker writer comedi
head u. professor republican talk fan also stellan best kiss
Table 6: Top-50 stemmed terms extracted from patterns according to their Mutual Information values with seed sets of
politicians in Wikipediaand actors in IMDb.
Extracted Phrase Patterns. Table 5 shows the 25 best peripheral or assisting positions (“guard”, “assist”, “secre-
patterns extracted by searching for the 100 most highly tari”), as well as concepts related to competition (“defend”,
weightededgesafterthefirstiterationofactorseededgraph “score”, “beat”). In contrast, terms from patterns in the
construction,andcomputingpatternratingsasdescribedin movie domain mostly refer to actors (“star”, “actor”, “ac-
Section 3.2. Apart from obvious - yet useful - relationship tress”), film shooting (“cast”,“screenwrit”,“plai”), and fam-
patternslike“with”,“or”, and“on”, many of the mined pat- ily relations (“husband”, “sister”, “boyfriend”). This result
terns consist of special characters such as“,”,“&”, and“ ” indicatesthattermsfoundinpatternsareoftenstronglycor-
which often serve to connect persons linked to each other. relatedwithsocialcommunities,andillustratesthepotential
AlthoughEnglish patternsareprevalent,ourmethodfound merit of using machine learning techniquesfor domain spe-
alsousefulpatternsinotherlanguagesallowingformoregen- cific selection of pattern candidates. This opens promising
eral, multilingualretrievalofsocialconnections,andwiden- directions for focused, topic-oriented network expansion.
ingthescopeoftheminednetworks. Onesuchpatternwas,
forinstance,theconjunctionterm“and”inSpanish,French,
and German (“y”, “et”, and “und”). Just a small number 5. CONCLUSIONSAND FUTUREWORK
of patternsdonot describe relations between persons. This
We have introduced efficient methods for extracting so-
holdsforinstancethephrase“Pictures Photos of”whichoc-
cial networks from unstructuredWeb datausing connectiv-
curred in a large spam site containing many person names.
ity search queries. Our graph expansion algorithms lever-
However,thispatternobtainedacomparablylowratingbe-
age pattern based query mining enhanced by a bootstrap-
causeitappearedinjustasingledomain. Finally,weobserve
pingapproachforfindingnewqueryphrasepatternsandby
a few more specific patterns such as “and US president”,
methods for intelligently prioritizing nodes. Our evaluation
“and actors”, “and his wife”that can help to reveal inter-
shows the applicability of our methods for extracting large
esting information about the type of detected relationships
scalesocialnetworks,anddemonstratesthehighaccuracyof
and contexts.
ourapproach. Wealsofoundthatpopularityandnoveltyof
nodesareimportantcriteriaforcontrollingthenetworkcon-
Term Analysis of Patterns. Patterns can provide useful struction process: Popularity of nodes can be leveraged for
cluesondomainsandtopicalfociofextracted(sub)networks. efficiently extracting networks with more connections and
Althoughgeneralpatternslike“and”and“with”occurred,for higher density, while exploiting novelty is useful for faster
instance, very frequently both in thecontext of movies and exploration of more entities. An in-depth analysis of the
politics we also found various domain specific differences: characteristics of the extracted networks sheds additional
Patterns such as “meets”, “in conversation with”, and“is light on information obtained though connectivity queries
joined by”were more dominant in the politician seed based and on properties of ouralgorithms.
networks, while patterns like“and director”,“and actress”, Inourfutureworkweaimtoanalyzepatterntorevealsen-
“is starring alongside” were more prominent in the movie timent and other information about the social connections
domain. found: Are the linked individuals friends or foes? Do they
Inordertoobtainthemostdiscriminativetermsfromthe have a personal or private connection? Are the relation-
patterns, we computed the Mutual Information (MI) mea- ships bidirectional or rather unidirectional? Furthermore,
sure[24]frominformationtheory,whichmeasureshowmuch weplantoimprovetheefficiencyofouralgorithmsbypriori-
the joint distribution of features (terms from patterns in tizingsearchpatternsbasedonthecontextofanentityusing
our case) and categories deviates from a hypothetical dis- observed structures in graphs as training data for machine
tribution in which features and categories (“politics” and learning approaches. Finally, weaim tostudystrategies for
“movies”) are independent of each other. Table 6 shows extracting social graphs in a way that allows for simultane-
the top-50 representative stemmed words automatically ex- ously detecting communities in an efficient way. We think
tractedfrom ourquerypatternsforthepolitician andactor that this work has direct applications to social graph ex-
seedednetworksdescribedinSection4.1. Manyoftheterms ploration and people search and provides a foundation for
inthepatternsfallingintothepoliticscategorydescribeper- interesting analyses and findingsabout social networks and
sonal roles like leadership (“minist”,“presid”,“senat”) and communities.
6. ACKNOWLEDGMENTS [16] A.Jain and P. Pantel. Identifyingcomparable entities
This work is partly funded by the European Research on theweb. CIKM ’09, pages 1661–1664. ACM, 2009.
Council under ALEXANDRIA (ERC 339233) and by the [17] Z.Jiang, L. Ji, J. Zhang, J. Yan,P. Guo, and N.Liu.
EuropeanCommissionFP7underQualiMaster(grantagree- Learning open-domaincomparable entitygraphs from
ment No. 619525). usersearch queries. CIKM, 2013.
[18] X.Jin, S.Spangler,R.Ma, andJ.Han.Topicinitiator
detection on theworld wide web. WWW ’10, pages
7. REFERENCES
481–490. ACM, 2010.
[19] C. Karbeyaz, E. Can, F. Can, and M. Kalpakli. A
[1] A.Anagnostopoulos, R.Kumar, and M. Mahdian.
content-basedsocial network study of evliya celebi’s
Influenceand correlation in social networks. KDD ’08,
seyahatname-bitlissection. InComputer and
pages 7–15. ACM, 2008.
Information Sciences II.Springer, 2012.
[2] N.Bach and S. Badaskar. A Review of Relation
[20] H.A.Kautz, B. Selman, and M. A.Shah.The hidden
Extraction. 2007.
web. AI Magazine, 18(2):27–36, 1997.
[3] C. Bird, A. Gourley,P. Devanbu,M. Gertz, and
[21] J. M. Kleinberg. Hubs,authorities, and communities.
A.Swaminathan. Mining email social networks.In
ACM Comput. Surv., 31(4es), dec1999.
Proceedings of the 2006 International Workshop on
[22] N.Konstantinova.Review of relation extraction
Mining Software Repositories, MSR ’06, pages
methods: What is new out there? Springer, 2014.
137–143. ACM, 2006.
[23] A.N.Langville and C. D. Meyer. Google’s PageRank
[4] M. R.Bouadjenek, H. Hacid,and M. Bouzeghoub.
and Beyond: The Science of Search Engine Rankings.
Sopra: A new social personalized rankingfunction for
Princeton University Press, 2006.
improvingweb search. SIGIR’13, pages 861–864.
[24] C. Manning, P. Raghavan,H.Schutze,and
ACM, 2013.
E. Corporation. Introduction to information retrieval.
[5] M. J. Cafarella, J. Madhavan,and A. Halevy.
Cambridge UniversityPress, 2008.
Web-scaleextraction of structured data. SIGMOD
[25] Y.Matsuo, J. Mori, M. Hamasaki, K. Ishida,
Rec., 37(4):55–61, mar 2009.
T. Nishimura, H.Takeda, K.Hasida, and M. Ishizuka.
[6] X.Canaleta, P. Ros, A.Vallejo, D. Vernet,and
Polyphonet: An advanced social network extraction
A.Zaballos. Asystemtoextractsocial networksbased
system from theweb. WWW,2006.
on the processing of information obtained from
[26] Y.Matsuo, H.Tomobe, and T. Nishimura.Robust
internet.In Proceedings of the 11th International
estimation of google countsfor social network
Conference of the Catalan Association for Artificial
extraction.In AAAI,2007.
Intelligence, pages 283–292. IOS Press, 2008.
[27] P.Mika. Flink: Semanticweb technology for the
[7] P. Cimiano, S.Handschuh,and S.Staab. Towards the
extraction and analysis of social networks. Web
self-annotating web. WWW ’04, pages 462–471. ACM,
Semant., 3(2-3):211–223, oct 2005.
2004.
[28] A.Mohaisen, A.Yun,and Y. Kim. Measuring the
[8] P. Cimiano, G. Ladwig, and S.Staab. Gimme’ the
mixingtime of social graphs. InSIGCOMM,2010.
context: Context-drivenautomatic semantic
annotation with c-pankow.WWW ’05, pages 332–341. [29] M. K.M. Nasution and S.A. Noah.Superficial
ACM, 2005. method for extracting social network for academics
usingweb snippets.In RSKT, 2010.
[9] D.K. Elson, N.Dames, and K. R.McKeown.
Extractingsocial networksfrom literary fiction.In [30] R.Nuray-Turan,Z. Chen,D. V.Kalashnikov, and
ACL’10. S.Mehrotra. Exploitingweb queryingfor web people
search in weps2. InWeb People Search Evaluation
[10] L. W.et al. Building thesocial graph of the history of
Workshop (WePS), 18th WWW Conference, 2009.
european integration - A pipelinefor
humanist-machineinteraction in the digital [31] L. Page, S. Brin, R.Motwani, and T. Winograd. The
humanities. InHISTOINFORMATICS,2013. pagerank citation ranking: Bringing orderto theweb.
Technical report, 1999.
[11] O.Etzioni, M. Cafarella, D.Downey,S. Kok,A.-M.
Popescu, T. Shaked,S.Soderland, D.S. Weld,and [32] P.Pantel and M. Pennacchiotti. Espresso: Leveraging
A.Yates. Web-scaleinformation extraction in generic patternsfor automatically harvestingsemantic
knowitall: (preliminary results). InWWW’09. relations. InACL 2006, Sydney, Australia, 2006.
[12] A.Goyal, F.Bonchi, and L. V.Lakshmanan. Learning [33] M. Pasca. Acquisition of categorized named entities
influenceprobabilities in social networks.WSDM’10, for web search. In CIKM’04.
pages 241–250. ACM, 2010. [34] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang,and Z.Su.
[13] I.Guy,N. Zwerdling, I. Ronen,D. Carmel, and Arnetminer: Extraction and miningof academic social
E. Uziel. Social media recommendation based on networks.KDD ’08, pages 990–998. ACM, 2008.
people and tags. SIGIR’10, pages 194–201. ACM, [35] A.Yates, M. Cafarella, M. Banko,O. Etzioni,
2010. M. Broadhead, and S.Soderland. Textrunner: Open
[14] K.Gwet. Handbook of Inter-rater Reliability: The information extraction on theweb. In ACL, 2007.
DefinitiveGuide to Measuringthe Extent of Agreement [36] S.Yeand S. Wu.Measuring message propagation and
Among Raters. AdvancedAnalytics,LLC, 2010. social influenceon twitter.com. In LNCS.2010.
[15] J. He, Y.Liu, Q.Tu, C. Yao, and N.Di. Efficient [37] P.S. Yu,X. Li, and B. Liu. On thetemporal
entityrelation discovery on web. JCIS, 2007. dimension of search. WWW,2004.