
Natural language processing for online applications: text retrieval, extraction and categorization PDF

237 Pages·2002·1.283 MB·Natural Language Planning
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Natural language processing for online applications: text retrieval, extraction and categorization

Natural Language Processing for Online Applications

Natural Language Processing

Editor
Prof. Ruslan Mitkov
School of Humanities, Languages and Social Sciences
University of Wolverhampton
Stafford St.
Wolverhampton WV1 1SB, United Kingdom
Email: R.Mitkov@wlv.ac.uk

Advisory Board
Christian Boitet (University of Grenoble)
John Carroll (University of Sussex, Brighton)
Eugene Charniak (Brown University, Providence)
Eduard Hovy (Information Sciences Institute, USC)
Richard Kittredge (University of Montreal)
Geoffrey Leech (Lancaster University)
Carlos Martin-Vide (Rovira i Virgili Un., Tarragona)
Andrei Mikheev (University of Edinburgh)
John Nerbonne (University of Groningen)
Nicolas Nicolov (IBM, T. J. Watson Research Center)
Kemal Oflazer (Sabanci University)
Allan Ramsey (UMIST, Manchester)
Monique Rolbert (Université de Marseille)
Richard Sproat (AT&T Labs Research, Florham Park)
Keh-Yih Su (Behaviour Design Corp.)
Isabelle Trancoso (INESC, Lisbon)
Benjamin Tsou (City University of Hong Kong)
Jun-ichi Tsujii (University of Tokyo)
Evelyne Tzoukermann (Bell Laboratories, Murray Hill)
Yorick Wilks (University of Sheffield)

Volume 5
Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization
by Peter Jackson and Isabelle Moulinier

Natural Language Processing for Online Applications
Text Retrieval, Extraction and Categorization
Peter Jackson
Isabelle Moulinier
Thomson Legal & Regulatory
John Benjamins Publishing Company
Amsterdam / Philadelphia

The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.

Library of Congress Cataloging-in-Publication Data
Jackson, Peter, 1948-
Natural language processing for online applications : text retrieval, extraction, and categorization / Peter Jackson, Isabelle Moulinier.
p. cm. (Natural Language Processing, ISSN 1567–8202; v. 5)
Includes bibliographical references and index.
I. Jackson, Peter. II. Moulinier, Isabelle. III. Title. IV. Series.
QA76.9.N38 I33 2002
006.3’5--dc21 2002066539
ISBN 9027249881 (Eur.) / 1588112497 (US) (Hb; alk. paper)
ISBN 902724989X (Eur.) / 1588112500 (US) (Pb; alk. paper)
© 2002 – John Benjamins B.V.
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher.
John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA

Table of contents

Preface

Chapter 1  Natural language processing
  What is NLP?
  NLP and linguistics
    Syntax and semantics
    Pragmatics and context
    Two views of NLP
    Tasks and supertasks
  Linguistic tools
    Sentence delimiters and tokenizers
    Stemmers and taggers
    Noun phrase and name recognizers
    Parsers and grammars
  Plan of the book

Chapter 2  Document retrieval
  Information retrieval
  Indexing technology
  Query processing
    Boolean search
    Ranked retrieval
    Probabilistic retrieval
    Language modeling
  Evaluating search engines
    Evaluation studies
    Evaluation metrics
    Relevance judgments
    Total system evaluation
  Attempts to enhance search performance
    Query expansion and thesauri
    Query expansion from relevance information*
  The future of Web searching
    Indexing the Web
    Searching the Web
    Ranking and reranking documents
    The state of online search
  Summary of information retrieval

Chapter 3  Information extraction
  The Message Understanding Conferences
  Regular expressions
  Finite automata in FASTUS
    Finite State Machines and regular languages
    Finite State Machines as parsers
  Pushdown automata and context-free grammars
    Analyzing case reports
    Context free grammars
    Parsing with a pushdown automaton
    Coping with incompleteness and ambiguity
  Limitations of current technology and future research
    Explicit versus implicit statements
    Machine learning for information extraction
    Statistical language models for information extraction
  Summary of information extraction

Chapter 4  Text categorization
  Overview of categorization tasks and methods
  Handcrafted rule based methods
  Inductive learning for text classification
    Naïve Bayes classifiers
    Linear classifiers*
    Decision trees and decision lists
  Nearest Neighbor algorithms
  Combining classifiers
    Data fusion
    Boosting
    Using multiple classifiers
  Evaluation of text categorization systems
    Evaluation studies
    Evaluation metrics
    Relevance judgments
    System evaluation

Chapter 5  Towards text mining
  What is text mining?
  Reference and coreference
    Named entity recognition
    The coreference task
  Automatic summarization
    Summarization tasks
    Constructing summaries from document fragments
    Multi-document summarization (MDS)
  Testing of automatic summarization programs
    Evaluation problems in summarization research
    Building a corpus for training and testing
  Prospects for text mining and NLP

Index

Preface

There is no single text on the market that covers the emerging technologies of document retrieval, information extraction, and text categorization in a coherent fashion. This book seeks to satisfy a genuine need on the part of technology practitioners in the Internet space, who are faced with having to make difficult decisions as to what research has been done, and what the best practices are. It is not intended as a vendor guide (such things are quickly out of date), or as a recipe for building applications (such recipes are very context-dependent).
But it does identify the key technologies, the issues involved, and the strengths and weaknesses of the various approaches. There is also a strong emphasis on evaluation in every chapter, both in terms of methodology (how to evaluate) and what controlled experimentation and industrial experience have to tell us.

I was prompted to write this book after spending seven years running an R&D group in an Internet publishing and solutions business. During that time, we were able to put into production a number of systems that either generated revenue or enabled cost savings for the company, leveraging technologies from information retrieval, information extraction, and text categorization. This is not a chronicle of these exploits, but a primer for those who are already interested in natural language processing for online applications. Nevertheless, my treatment of the philosophy and practice of language processing is colored by the context in which I function, namely the arena of commercial exploitation. Thus, although there is a focus on technical detail and research results, I also address some of the issues that arise in applying such systems to data collections of realistic size and complexity.

The book is not intended exclusively as an academic text, although I suspect that it will be of interest to students who wish to use these technologies in an industrial setting. It is also aimed at software engineers, project managers, and technology executives who want or need to understand the technology at some level. I hope that such people find it useful, and that it provokes ideas, discussion, and action in the field of applied research and development. Each chapter begins with lighter material and then progresses to heavier stuff, with some of the later sections and sidebars being marked with an asterisk as


