
Fundamentals of Predictive Text Mining

232 Pages·2010·2.903 MB·English


Texts in Computer Science
Series Editors: David Gries, Fred B. Schneider
For further volumes: http://www.springer.com/series/3191

Sholom M. Weiss · Nitin Indurkhya · Tong Zhang
Fundamentals of Predictive Text Mining

Sholom M. Weiss
T.J. Watson Research Center, IBM Corporation
1101 Kitchawan Road, Yorktown Heights, NY 10598, USA
[email protected]

Nitin Indurkhya
School of Computer Science & Engg., University of New South Wales
Sydney, NSW 2052, Australia
[email protected]

Tong Zhang
Dept. Statistics, Hill Center, Rutgers University
Piscataway, NJ 08854-8019, USA
[email protected]

Series Editors:
David Gries and Fred B. Schneider
Department of Computer Science, Upson Hall
Cornell University, Ithaca, NY 14853-7501, USA

ISSN 1868-0941, e-ISSN 1868-095X
ISBN 978-1-84996-225-4, e-ISBN 978-1-84996-226-1
DOI 10.1007/978-1-84996-226-1
Springer London Dordrecht Heidelberg New York

British Library Cataloguing in Publication Data: A catalogue record for this book is available from the British Library.
Library of Congress Control Number: 2010928348

© Springer-Verlag London Limited 2010

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.
Cover design: VTEX, Vilnius
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Five years ago, we authored "Text Mining: Predictive Methods for Analyzing Unstructured Information." That book was geared mostly to professional practitioners, but was adaptable to course work with some effort by the instructor. Many topics were evolving, and this was one of the earliest efforts to collect material for predictive text mining. Since then, the book has seen extensive use in education, by ourselves and other instructors, with positive responses from students. With more data sourced from the Internet, the field has seen very rapid growth with many new techniques that would interest practitioners. Given the amount of supplementary new material we had begun using, a new edition was clearly needed. A year ago, our publisher asked us to update the book and to add material that would extend its use as a textbook. We have revised many sections, adding new material to reflect the increased use of the web. Exercises and summaries are also provided.

The prediction problem, looking for predictive patterns in data, has been widely studied. Strong methods are available to the practitioner. These methods process structured numerical information, where uniform measurements are taken over a sample of data. Text is often described as unstructured information. So, it would seem, text and numerical data are different, requiring different methods. Or are they? In our view, a prediction problem can be solved by the same methods, whether the data are structured numerical measurements or unstructured text. Text and documents can be transformed into measured values, such as the presence or absence of words, and the same methods that have proven successful for predictive data mining can be applied to text. Yet, there are key differences. Evaluation techniques must be adapted to the chronological order of publication and to alternative measures of error. Because the data are documents, more specialized analytical methods may be preferred for text.
Moreover, the methods must be modified to accommodate very high dimensions: tens of thousands of words and documents. Still, the central themes are similar.

Our view of text mining allows us to unify the concepts of different fields. No longer is "natural language processing" the sole domain of linguists and their allied computer specialists. No longer is search engine technology distinct from other forms of machine learning. Ours is an open view. We welcome you to try your hand at learning from data, whether numerical or text. Large text collections, often readily available on the Internet, contain valuable information that can be mined with today's tools instead of waiting for tomorrow's linguistic techniques. While others search for the essence of language understanding, we can immediately look for recurring word patterns in large collections of digital documents.

Our main theme is a strictly empirical view of text mining and an application of well-known analytical methods. We provide examples and software. Our presentation has a pragmatic bent with numerous references into the research literature for you to follow when so inclined. We want to be practical, yet inclusive of the wide community that might be interested in applications of text mining. We concentrate on predictive learning methods but also look at information retrieval and search engines, as well as clustering methods. We illustrate by examples, case studies, and the accompanying downloadable software.

While some analytical methods may be highly developed, predictive text mining is an emerging area of application. We have tried to summarize our experiences and provide the tools and techniques for your own experiments.

Audience

Our book is aimed at IT professionals and managers as well as advanced undergraduate computer science students and beginning graduate students. Some background in data mining is beneficial but is not essential. A few sections discuss advanced concepts that require mathematical maturity for a proper understanding.
In such sections, intuitive explanations are also provided that may suffice for the less advanced reader. Most parts of the book can be read and understood by anyone with a sufficient analytic bent. If you are looking to do research in the area, the material in this book will provide direction in expanding your horizons. If you want to be a practitioner of text mining, you can read about our recommended methods and our descriptions of case studies. The software requires familiarity with running command-line programs and editing configuration files.

For Instructors

The material in this book has been successfully used for education in a variety of ways ranging from short intensive one-week courses to twelve-week full semester courses. In short courses, the mathematical material can be skipped. The exercises have undergone class-testing over several years. Each chapter has the following accompanying material:

• a chapter summary
• exercises.

In addition, numerous examples and figures are interlaced throughout the book, and these are available at data-miner.com as freely downloadable slides.

Supplementary Web Software

Data-Miner Pty. Ltd. has provided a free software license for those who have purchased the book. The software, which implements many of the methods discussed in the book, can be downloaded from the data-miner.com Web site. Linux scripts for many examples are also available for download. See Appendix A for details.

Acknowledgements

Fred Damerau, our colleague and mentor, was a co-author of our original book. He is no longer with us, and his contributions to our project, especially his expertise in linguistics, were immeasurable. Some of the case studies in Chap. 7 are based on our prior publications. In those projects, we acknowledge the participation of Chidanand Apté, Radu Florian, Abraham Ittycheriah, Vijay Iyengar, Hongyan Jing, David Johnson, Frank Oles, Naval Verma, and Brian White. Arindam Banerjee made many helpful comments on a draft of our book.
The exercises in the book evolved from our text-mining course conducted regularly at statistics.com. We thank our editor, Wayne Wheeler, and our previous editors Ann Kostant and Wayne Yuhasz, for their support.

New York, USA: Sholom Weiss and Tong Zhang
Sydney, Australia: Nitin Indurkhya

Contents

1 Overview of Text Mining
  1.1 What's Special About Text Mining?
    1.1.1 Structured or Unstructured Data?
    1.1.2 Is Text Different from Numbers?
  1.2 What Types of Problems Can Be Solved?
  1.3 Document Classification
  1.4 Information Retrieval
  1.5 Clustering and Organizing Documents
  1.6 Information Extraction
  1.7 Prediction and Evaluation
  1.8 The Next Chapters
  1.9 Summary
  1.10 Historical and Bibliographical Remarks
  1.11 Questions and Exercises

2 From Textual Information to Numerical Vectors
  2.1 Collecting Documents
  2.2 Document Standardization
  2.3 Tokenization
  2.4 Lemmatization
    2.4.1 Inflectional Stemming
    2.4.2 Stemming to a Root
  2.5 Vector Generation for Prediction
    2.5.1 Multiword Features
    2.5.2 Labels for the Right Answers
    2.5.3 Feature Selection by Attribute Ranking
  2.6 Sentence Boundary Determination
  2.7 Part-of-Speech Tagging
  2.8 Word Sense Disambiguation
  2.9 Phrase Recognition
  2.10 Named Entity Recognition
  2.11 Parsing
  2.12 Feature Generation
  2.13 Summary
  2.14 Historical and Bibliographical Remarks
  2.15 Questions and Exercises

3 Using Text for Prediction
  3.1 Recognizing that Documents Fit a Pattern
  3.2 How Many Documents Are Enough?
  3.3 Document Classification
  3.4 Learning to Predict from Text
    3.4.1 Similarity and Nearest-Neighbor Methods
    3.4.2 Document Similarity
    3.4.3 Decision Rules
    3.4.4 Decision Trees
    3.4.5 Scoring by Probabilities
    3.4.6 Linear Scoring Methods
  3.5 Evaluation of Performance
    3.5.1 Estimating Current and Future Performance
    3.5.2 Getting the Most from a Learning Method
  3.6 Applications
  3.7 Summary
  3.8 Historical and Bibliographical Remarks
  3.9 Questions and Exercises

4 Information Retrieval and Text Mining
  4.1 Is Information Retrieval a Form of Text Mining?
  4.2 Key Word Search
  4.3 Nearest-Neighbor Methods
  4.4 Measuring Similarity
    4.4.1 Shared Word Count
    4.4.2 Word Count and Bonus
    4.4.3 Cosine Similarity
  4.5 Web-based Document Search
    4.5.1 Link Analysis
  4.6 Document Matching
  4.7 Inverted Lists
  4.8 Evaluation of Performance
  4.9 Summary
  4.10 Historical and Bibliographical Remarks
  4.11 Questions and Exercises

5 Finding Structure in a Document Collection
  5.1 Clustering Documents by Similarity
  5.2 Similarity of Composite Documents
    5.2.1 k-Means Clustering
    5.2.2 Hierarchical Clustering
    5.2.3 The EM Algorithm
  5.3 What Do a Cluster's Labels Mean?
  5.4 Applications
  5.5 Evaluation of Performance
  5.6 Summary
  5.7 Historical and Bibliographical Remarks
  5.8 Questions and Exercises

6 Looking for Information in Documents
  6.1 Goals of Information Extraction
  6.2 Finding Patterns and Entities from Text
    6.2.1 Entity Extraction as Sequential Tagging
    6.2.2 Tag Prediction as Classification
    6.2.3 The Maximum Entropy Method
    6.2.4 Linguistic Features and Encoding
    6.2.5 Local Sequence Prediction Models
    6.2.6 Global Sequence Prediction Models
  6.3 Coreference and Relationship Extraction
    6.3.1 Coreference Resolution
    6.3.2 Relationship Extraction
  6.4 Template Filling and Database Construction
  6.5 Applications
    6.5.1 Information Retrieval
    6.5.2 Commercial Extraction Systems
    6.5.3 Criminal Justice
    6.5.4 Intelligence
  6.6 Summary
  6.7 Historical and Bibliographical Remarks
  6.8 Questions and Exercises

7 Data Sources for Prediction: Databases, Hybrid Data and the Web
  7.1 Ideal Models of Data
    7.1.1 Ideal Data for Prediction
    7.1.2 Ideal Data for Text and Unstructured Data
    7.1.3 Hybrid and Mixed Data
  7.2 Practical Data Sourcing
  7.3 Prototypical Examples
    7.3.1 Web-based Spreadsheet Data
    7.3.2 Web-based XML Data
    7.3.3 Opinion Data and Sentiment Analysis
  7.4 Hybrid Example: Independent Sources of Numerical and Text Data
  7.5 Mixed Data in Standard Table Format
  7.6 Summary
  7.7 Historical and Bibliographical Remarks
  7.8 Questions and Exercises
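The preface's central claim, that documents become amenable to standard predictive methods once transformed into measured values such as the presence or absence of words, is the theme developed in Chapter 2. A minimal sketch of that transformation, in plain Python with illustrative names of our own choosing (this is not the book's accompanying software):

```python
def tokenize(text):
    """Split text into lowercase word tokens (whitespace only, for illustration)."""
    return text.lower().split()

def build_vocabulary(documents):
    """Collect every distinct word across the collection, in a fixed (sorted) order."""
    return sorted({word for doc in documents for word in tokenize(doc)})

def presence_vector(doc, vocab):
    """Map a document to a 0/1 vector: 1 if the vocabulary word occurs in it."""
    words = set(tokenize(doc))
    return [1 if w in words else 0 for w in vocab]

docs = ["the cat sat", "the dog barked"]
vocab = build_vocabulary(docs)          # ['barked', 'cat', 'dog', 'sat', 'the']
vectors = [presence_vector(d, vocab) for d in docs]
```

Each document is now a uniform numerical measurement over the same attributes, so any predictive method that accepts a feature matrix can be applied; real systems add the refinements the contents list above (stemming, multiword features, feature selection) and sparse storage for the tens of thousands of dimensions the preface mentions.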
