Natural Language Processing for Online Applications
Natural Language Processing
Editor
Prof. Ruslan Mitkov
School of Humanities, Languages and Social Sciences
University of Wolverhampton
Stafford St.
Wolverhampton WV1 1SB, United Kingdom
Email: R.Mitkov@wlv.ac.uk
Advisory Board
Christian Boitet (University of Grenoble)
John Carroll (University of Sussex, Brighton)
Eugene Charniak (Brown University, Providence)
Eduard Hovy (Information Sciences Institute, USC)
Richard Kittredge (University of Montreal)
Geoffrey Leech (Lancaster University)
Carlos Martin-Vide (Rovira i Virgili Un., Tarragona)
Andrei Mikheev (University of Edinburgh)
John Nerbonne (University of Groningen)
Nicolas Nicolov (IBM, T.J. Watson Research Center)
Kemal Oflazer (Sabanci University)
Allan Ramsey (UMIST, Manchester)
Monique Rolbert (Université de Marseille)
Richard Sproat (AT&T Labs Research, Florham Park)
Keh-Yih Su (Behaviour Design Corp.)
Isabelle Trancoso (INESC, Lisbon)
Benjamin Tsou (City University of Hong Kong)
Jun-ichi Tsujii (University of Tokyo)
Evelyne Tzoukermann (Bell Laboratories, Murray Hill)
Yorick Wilks (University of Sheffield)
Volume 5
Natural Language Processing for Online Applications: Text Retrieval,
Extraction and Categorization
by Peter Jackson and Isabelle Moulinier
Natural Language Processing
for Online Applications
Text Retrieval,
Extraction and Categorization
Peter Jackson
Isabelle Moulinier
Thomson Legal & Regulatory
John Benjamins Publishing Company
Amsterdam / Philadelphia
The paper used in this publication meets the minimum requirements of American
National Standard for Information Sciences – Permanence of Paper for Printed
Library Materials, ANSI Z39.48-1984.
Library of Congress Cataloging-in-Publication Data
Jackson, Peter, 1948-
Natural language processing for online applications : text retrieval, extraction, and
categorization / Peter Jackson, Isabelle Moulinier.
p. cm. (Natural Language Processing, ISSN 1567–8202; v. 5)
Includes bibliographical references and index.
I. Jackson, Peter. II. Moulinier, Isabelle. III. Title. IV. Series.
QA76.9.N38 I33 2002
006.3’5--dc21 2002066539
ISBN 90 272 4988 1 (Eur.) / 1 58811 249 7 (US) (Hb; alk. paper)
ISBN 90 272 4989 X (Eur.) / 1 58811 250 0 (US) (Pb; alk. paper)
© 2002 – John Benjamins B.V.
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any
other means, without written permission from the publisher.
John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA
Table of contents
Preface
Chapter 1
Natural language processing
1.1 What is NLP?
1.2 NLP and linguistics
1.2.1 Syntax and semantics
1.2.2 Pragmatics and context
1.2.3 Two views of NLP
1.2.4 Tasks and supertasks
1.3 Linguistic tools
1.3.1 Sentence delimiters and tokenizers
1.3.2 Stemmers and taggers
1.3.3 Noun phrase and name recognizers
1.3.4 Parsers and grammars
1.4 Plan of the book
Chapter 2
Document retrieval
2.1 Information retrieval
2.2 Indexing technology
2.3 Query processing
2.3.1 Boolean search
2.3.2 Ranked retrieval
2.3.3 Probabilistic retrieval
2.3.4 Language modeling
2.4 Evaluating search engines
2.4.1 Evaluation studies
2.4.2 Evaluation metrics
2.4.3 Relevance judgments
2.4.4 Total system evaluation
2.5 Attempts to enhance search performance
2.5.1 Query expansion and thesauri
2.5.2 Query expansion from relevance information*
2.6 The future of Web searching
2.6.1 Indexing the Web
2.6.2 Searching the Web
2.6.3 Ranking and reranking documents
2.6.4 The state of online search
2.7 Summary of information retrieval
Chapter 3
Information extraction
3.1 The Message Understanding Conferences
3.2 Regular expressions
3.3 Finite automata in FASTUS
3.3.1 Finite State Machines and regular languages
3.3.2 Finite State Machines as parsers
3.4 Pushdown automata and context-free grammars
3.4.1 Analyzing case reports
3.4.2 Context free grammars
3.4.3 Parsing with a pushdown automaton
3.4.4 Coping with incompleteness and ambiguity
3.5 Limitations of current technology and future research
3.5.1 Explicit versus implicit statements
3.5.2 Machine learning for information extraction
3.5.3 Statistical language models for information extraction
3.6 Summary of information extraction
Chapter 4
Text categorization
4.1 Overview of categorization tasks and methods
4.2 Handcrafted rule based methods
4.3 Inductive learning for text classification
4.3.1 Naïve Bayes classifiers
4.3.2 Linear classifiers*
4.3.3 Decision trees and decision lists
4.4 Nearest Neighbor algorithms
4.5 Combining classifiers
4.5.1 Data fusion
4.5.2 Boosting
4.5.3 Using multiple classifiers
4.6 Evaluation of text categorization systems
4.6.1 Evaluation studies
4.6.2 Evaluation metrics
4.6.3 Relevance judgments
4.6.4 System evaluation
Chapter 5
Towards text mining
5.1 What is text mining?
5.2 Reference and coreference
5.2.1 Named entity recognition
5.2.2 The coreference task
5.3 Automatic summarization
5.3.1 Summarization tasks
5.3.2 Constructing summaries from document fragments
5.3.3 Multi-document summarization (MDS)
5.4 Testing of automatic summarization programs
5.4.1 Evaluation problems in summarization research
5.4.2 Building a corpus for training and testing
5.5 Prospects for text mining and NLP
Index
Preface
There is no single text on the market that covers the emerging technologies of
document retrieval, information extraction, and text categorization in a coherent
fashion. This book seeks to satisfy a genuine need on the part of technology
practitioners in the Internet space, who are faced with having to make difficult
decisions as to what research has been done, and what the best practices are.
It is not intended as a vendor guide (such things are quickly out of date), or
as a recipe for building applications (such recipes are very context-dependent).
But it does identify the key technologies, the issues involved, and the strengths
and weaknesses of the various approaches. There is also a strong emphasis on
evaluation in every chapter, both in terms of methodology (how to evaluate)
and what controlled experimentation and industrial experience have to tell us.
I was prompted to write this book after spending seven years running an
R&D group in an Internet publishing and solutions business. During that time,
we were able to put into production a number of systems that either generated
revenue or enabled cost savings for the company, leveraging technologies from
information retrieval, information extraction, and text categorization. This is
not a chronicle of these exploits, but a primer for those who are already interested
in natural language processing for online applications. Nevertheless, my
treatment of the philosophy and practice of language processing is colored by
the context in which I function, namely the arena of commercial exploitation.
Thus, although there is a focus on technical detail and research results, I also
address some of the issues that arise in applying such systems to data collections
of realistic size and complexity.
The book is not intended exclusively as an academic text, although I suspect
that it will be of interest to students who wish to use these technologies
in an industrial setting. It is also aimed at software engineers, project managers,
and technology executives who want or need to understand the technology
at some level. I hope that such people find it useful, and that it provokes
ideas, discussion, and action in the field of applied research and development.
Each chapter begins with lighter material and then progresses to heavier stuff,
with some of the later sections and sidebars being marked with an asterisk as