Exploiting Wikipedia
Semantics for Computing
Word Associations
by
Shahida Jabeen
A thesis
submitted to the Victoria University of Wellington
in fulfilment of the
requirements for the degree of
Doctor of Philosophy
in Computer Science.
Victoria University of Wellington
2014
Abstract
Semantic association computation is the process of automatically quantifying the strength of a semantic connection between two textual units based on various lexical and semantic relations such as hyponymy (car and vehicle) and functional associations (bank and manager). Humans can infer implicit relationships between two textual units based on their knowledge about the world and their ability to reason about that knowledge. Automatically imitating this behavior is limited by restricted knowledge and poor ability to infer hidden relations.
Various factors affect the performance of automated approaches to computing semantic association strength. One critical factor is the selection of a suitable knowledge source for extracting knowledge about the implicit semantic relations. In the past few years, semantic association computation approaches have started to exploit web-originated resources as substitutes for conventional lexical semantic resources such as thesauri, machine-readable dictionaries and lexical databases. These conventional knowledge sources suffer from limitations such as coverage issues, high construction and maintenance costs and limited availability. One solution to these issues is to use the wisdom of crowds in the form of collaboratively constructed knowledge sources. An excellent example of such a knowledge source is Wikipedia, which stores detailed information not only about the concepts themselves but also about various aspects of the relations among concepts.
The overall goal of this thesis is to demonstrate that using Wikipedia for computing word association strength yields better estimates of humans’ associations than approaches based on other structured and unstructured knowledge sources. There are two key challenges in achieving this goal: first, to exploit various semantic association models based on different aspects of Wikipedia in developing new measures of semantic association; and second, to evaluate these measures against human performance in a range of tasks. The focus of the thesis is on exploring two aspects of Wikipedia: as a formal knowledge source, and as an informal text corpus.
The first contribution of the thesis is that it effectively exploited the knowledge source aspect of Wikipedia by developing new measures of semantic association based on the Wikipedia hyperlink structure, the informative-content of articles and combinations of both elements. It was found that Wikipedia can be effectively used for computing noun-noun similarity. It was also found that a model based on hybrid combinations of Wikipedia structure and informative-content based features performs better than those based on individual features. Finally, the structure-based measures outperformed the informative-content based measures on both semantic similarity and semantic relatedness computation tasks.
The second contribution of the thesis is that it effectively exploited the corpus aspect of Wikipedia by developing a new measure of semantic association based on asymmetric word associations. The thesis introduced the concept of an asymmetric-association based measure using the idea of directional context, inspired by the free word association task. The underlying assumption was that the association strength can change with the changing context. It was found that the asymmetric-association based measure performed better than the symmetric measures on semantic association computation, relatedness-based word choice and causality detection tasks. However, asymmetric-association based measures have no advantage for synonymy-based word choice tasks. It was also found that Wikipedia is not a good knowledge source for capturing verb relations due to its focus on encyclopedic concepts, especially nouns.
It is hoped that future research will build on the experiments and discussions presented in this thesis to explore new avenues using Wikipedia for finding deeper and semantically more meaningful associations in a wide range of application areas based on humans’ estimates of word associations.
Acknowledgment
“The difference between a successful person and others is not a lack of strength, nor a lack of knowledge, but rather a lack of will”. Three years ago, this will motivated me to leave my family, my job and my country to pursue my career in research. Three precious years of my life in New Zealand have been full of experiences that taught me the value of hard work in life. This thesis is an outcome of the support, help and sacrifice of many priceless people who have had a great influence on my life and studies over the last three years. I greatly admire my supervisors, Dr. Sharon Gao and Dr. Peter Andreae, for believing in me. I am indebted to them for their outstanding support and mentoring. Our weekly meetings have undoubtedly been key in keeping me focused and giving me the courage to achieve my goals. I would also like to express my gratitude to Prof. Mengjie Zhang, for being very supportive and encouraging. His success celebrations and Christmas BBQs have always been wonderfully exciting. I wish to thank the ECRG group for creating an insightful and friendly research environment. I am also grateful to Peter D. Turney (National Research Council of Canada) for providing me with datasets for the word choice task. I also want to thank Thomas K. Landauer (LSA and NLP Lab, University of Colorado) for providing the TOEFL dataset.
Last but not least, I want to thank my family for their continuous support, help and prayers, and my beloved husband for his unconditional love. It would not have been possible for me to complete my journey without his support and encouragement.
List of Tables
2.1 Correlation of the association strength of three terms of semantic associations.
2.2 Statistics of benchmark datasets used for evaluation of semantic association measures.
2.3 Examples of approaches using knowledge sources and text corpora in various applications.
2.4 Some well-known text corpora used by corpus-based approaches.
3.1 Performance comparison of the semantic measures on all-terms (All) versions of M&C, R&G and WS-353 datasets.
3.2 Performance comparison of the semantic measures on Non-Missing (NM) versions of M&C, R&G and WS-353 datasets.
3.3 Performance comparison of semantic measures with WLM measure on all-terms (All) versions of M&C, R&G and WS-353 datasets.
3.4 Performance comparison of the semantic measures on M&C, R&G and WS-353 datasets using manual and automatic disambiguations.
3.5 Performance comparison of the semantic measures on all-terms versions of similarity (WS-353-(S)) and relatedness (WS-353-(R)) based subsets of the WS-353 dataset.
3.6 Performance comparison of the semantic measures on Non-Missing (NM) versions of similarity (WS-353-(S)) and relatedness (WS-353-(R)) based subsets of the WS-353 dataset.
3.7 Performance of the semantic measures on three biomedical datasets: MiniMayo, MeSH and MayoMeSH datasets.
4.1 Performance comparison of the DCRM measure with the previously best measure WikiSim on domain independent datasets. Bold values indicate the best correlation-based performance of any measure on a specific dataset.
5.1 Performance comparison of four classifiers using the hybrid model based on the 3-feature combination (f_123).
5.2 A comparison of averaged feature ranking on all datasets. The highest rank corresponds to the lowest correlation-based performance of a feature and vice versa.
5.3 Performance comparison of both hybrid models with previously best performing approaches on three benchmark datasets: M&C, R&G and WS-353.
6.1 Performance comparison of symmetric and asymmetric association based measures on two subsets of WS-353: WS353-Sim and WS353-Rel using proximity window with w = 10.
6.2 Performance comparison of APRM_avg measure with existing state-of-the-art measures on the MTURK-771 dataset.
7.1 The performance of APRM on various word choice datasets. The best performance on each dataset is shown in bold.
7.2 Performance comparison of APRM with state-of-the-art approaches on the RDWP dataset. The best performance is shown in bold.
7.3 The performance of APRM on the synonymy-based word choice task using two datasets: ESL and TOEFL. The best performance is shown in bold.