
Exploiting Wikipedia Semantics for Computing Word Associations PDF

263 Pages·2014·2.57 MB·English


Exploiting Wikipedia Semantics for Computing Word Associations

by Shahida Jabeen

A thesis submitted to the Victoria University of Wellington in fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science.

Victoria University of Wellington
2014

Abstract

Semantic association computation is the process of automatically quantifying the strength of a semantic connection between two textual units based on various lexical and semantic relations such as hyponymy (car and vehicle) and functional associations (bank and manager). Humans can infer implicit relationships between two textual units based on their knowledge about the world and their ability to reason about that knowledge. Automatically imitating this behavior is limited by restricted knowledge and poor ability to infer hidden relations.

Various factors affect the performance of automated approaches to computing semantic association strength. One critical factor is the selection of a suitable knowledge source for extracting knowledge about the implicit semantic relations. In the past few years, semantic association computation approaches have started to exploit web-originated resources as substitutes for conventional lexical semantic resources such as thesauri, machine readable dictionaries and lexical databases. These conventional knowledge sources suffer from limitations such as coverage issues, high construction and maintenance costs and limited availability. To overcome these issues one solution is to use the wisdom of crowds in the form of collaboratively constructed knowledge sources. An excellent example of such knowledge sources is Wikipedia, which stores detailed information not only about the concepts themselves but also about various aspects of the relations among concepts.

The overall goal of this thesis is to demonstrate that using Wikipedia for computing word association strength yields better estimates of humans’ associations than the approaches based on other structured and unstructured knowledge sources.
There are two key challenges to achieve this goal: first, to exploit various semantic association models based on different aspects of Wikipedia in developing new measures of semantic associations; and second, to evaluate these measures compared to human performance in a range of tasks. The focus of the thesis is on exploring two aspects of Wikipedia: as a formal knowledge source, and as an informal text corpus.

The first contribution of the work included in the thesis is that it effectively exploited the knowledge source aspect of Wikipedia by developing new measures of semantic associations based on Wikipedia hyperlink structure, informative-content of articles and combinations of both elements. It was found that Wikipedia can be effectively used for computing noun-noun similarity. It was also found that a model based on hybrid combinations of Wikipedia structure and informative-content based features performs better than those based on individual features. It was also found that the structure based measures outperformed the informative-content based measures on both semantic similarity and semantic relatedness computation tasks.

The second contribution of the research work in the thesis is that it effectively exploited the corpus aspect of Wikipedia by developing a new measure of semantic association based on asymmetric word associations. The thesis introduced the concept of asymmetric associations based measure using the idea of directional context inspired by the free word association task. The underlying assumption was that the association strength can change with the changing context. It was found that the asymmetric-association based measure performed better than the symmetric measures on semantic association computation, relatedness based word choice and causality detection tasks. However, asymmetric-associations based measures have no advantage for synonymy-based word choice tasks.
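To make the idea of a directional (asymmetric) association concrete, here is a minimal sketch. It is not the thesis's measure: the estimator below is an assumption chosen for illustration, namely the fraction of occurrences of a cue word that have the target word within a proximity window. The function name `directional_association` and the toy token list are hypothetical.

```python
from collections import Counter


def directional_association(tokens, window=10):
    """Build assoc(cue, target) from a single list of tokens.

    Assumed estimator (illustrative only): the share of occurrences of
    `cue` that have `target` within `window` tokens on either side.
    Because the denominator is the count of the cue, swapping the two
    words generally gives a different value.
    """
    cue_counts = Counter(tokens)
    pair_counts = Counter()
    for i, cue in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Count each neighbouring word at most once per cue occurrence.
        neighbours = set(tokens[lo:i] + tokens[i + 1:hi])
        for target in neighbours:
            pair_counts[(cue, target)] += 1

    def assoc(cue, target):
        if cue_counts[cue] == 0:
            return 0.0  # unseen cue: no evidence either way
        return pair_counts[(cue, target)] / cue_counts[cue]

    return assoc


# Toy corpus, invented for this example.
tokens = ["bank", "manager", "bank", "loan", "bank", "river"]
assoc = directional_association(tokens, window=1)
print(assoc("manager", "bank"))  # every "manager" sits next to "bank"
print(assoc("bank", "manager"))  # only some "bank" tokens sit next to "manager"
```

In this toy corpus the strength from "manager" to "bank" is higher than the strength from "bank" to "manager", which is the kind of direction-dependence a symmetric measure cannot express.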
It was also found that Wikipedia is not a good knowledge source for capturing verb-relations due to its focus on encyclopedic concepts, especially nouns.

It is hoped that future research will build on the experiments and discussions presented in this thesis to explore new avenues using Wikipedia for finding deeper and semantically more meaningful associations in a wide range of application areas based on humans’ estimates of word associations.

Acknowledgment

“The difference between a successful person and others is not a lack of strength, nor a lack of knowledge, but rather a lack of will”. Three years ago, this will motivated me to leave my family, my job and my country to pursue my career in research. Three precious years of my life in New Zealand have been full of experiences that taught me the value of hard work in life. This thesis is an outcome of the support, help and sacrifice of many priceless people who have had a great influence on my life and studies over the last three years. I greatly admire my supervisors, Dr. Sharon Gao and Dr. Peter Andreae, for believing in me. I am indebted to them for their outstanding support and mentoring. Our weekly meetings have undoubtedly been key in keeping me focused and giving me the courage to achieve my goals. I would also like to express my gratitude to Prof. Mengjie Zhang, for being very supportive and encouraging. His success celebrations and Christmas BBQs have always been wonderfully exciting. I wish to thank the ECRG group for creating an insightful and friendly research environment. I am also grateful to Peter D. Turney (National Research Council of Canada) for providing me datasets for the word choice task. I also want to thank Thomas K. Landauer (LSA and NLP Lab, University of Colorado) for providing the TOEFL dataset.

Last but not least, I want to thank my family for their continuous support, help and prayers and my beloved husband for his unconditional love. It would not have been possible for me to complete my journey without his support and encouragement.
List of Tables

2.1 Correlation of the association strength of three terms of semantic associations.
2.2 Statistics of Benchmark datasets used for evaluation of semantic association measures.
2.3 Examples of approaches using knowledge-sources and text corpora in various applications.
2.4 Some well-known text corpora used by corpus-based approaches.
3.1 Performance comparison of the semantic measures on all-terms (All) versions of M&C, R&G and WS-353 datasets.
3.2 Performance comparison of the semantic measures on Non-Missing (NM) versions of M&C, R&G and WS-353 datasets.
3.3 Performance comparison of semantic measures with WLM measure on all-terms (All) versions of M&C, R&G and WS-353 datasets.
3.4 Performance comparison of the semantic measures on M&C, R&G and WS-353 datasets using manual and automatic disambiguations.
3.5 Performance comparison of the semantic measures on all-terms versions of similarity (WS-353-(S)) and relatedness (WS-353-(R)) based subsets of the WS-353 dataset.
3.6 Performance comparison of the semantic measures on Non-Missing (NM) versions of similarity (WS-353-(S)) and relatedness (WS-353-(R)) based subsets of the WS-353 dataset.
3.7 Performance of the semantic measures on three biomedical datasets: MiniMayo, MeSH and MayoMeSH datasets.
4.1 Performance comparison of the DCRM measure with the previously best measure WikiSim on domain independent datasets. Bold values indicate the best correlation-based performance of any measure on a specific dataset.
5.1 Performance comparison of four classifiers using the hybrid model based on the 3-feature combination (f_123).
5.2 A comparison of averaged feature ranking on all datasets. The highest rank corresponds to the lowest correlation-based performance of a feature and vice versa.
5.3 Performance comparison of both hybrid models with previously best performing approaches on three benchmark datasets: M&C, R&G and WS-353.
6.1 Performance comparison of symmetric and asymmetric association based measures on two subsets of WS-353: WS353-Sim and WS353-Rel using proximity window with w = 10.
6.2 Performance comparison of APRM_avg measure with existing state-of-art measures on the MTURK-771 dataset.
7.1 The Performance of APRM on various word choice datasets. The best performance on each dataset is shown in bold.
7.2 Performance comparison of APRM with state-of-art approaches on the RDWP dataset. The best performance is shown in bold.
7.3 The performance of APRM on the synonymy-based word choice task using two datasets: ESL and TOEFL. The best performance is shown in bold.
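Several of the tables listed above report correlation-based performance, that is, how well a measure's scores agree with human ratings of word pairs. The sketch below shows one way such a score could be computed. It assumes Spearman rank correlation, which is a common choice for this kind of benchmark; this preview does not say which coefficient the thesis uses. The function names and the numbers in the usage lines are invented for illustration and are not taken from any dataset.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up ratings for four word pairs (illustrative values only).
human_ratings = [3.9, 3.1, 0.4, 1.2]
measure_scores = [0.8, 0.9, 0.1, 0.3]
print(spearman(human_ratings, measure_scores))
```

Working on ranks rather than raw values means the measure is only asked to order the word pairs the way humans do, not to reproduce the rating scale itself.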

Description:
this goal: first, to exploit various semantic association models based on rect particularly in the case of antonyms, where two words are quite dis-.