
Text Mining: Predictive Methods for Analyzing Unstructured Information PDF

243 Pages·2005·1.249 MB·English
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Text Mining: Predictive Methods for Analyzing Unstructured Information

Text Mining
Predictive Methods for Analyzing Unstructured Information

Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau

Springer

Sholom M. Weiss
IBM Research, T.J. Watson Labs
Yorktown Heights, NY 10598, USA
[email protected]

Nitin Indurkhya
School of Computer Science and Engineering, University of New South Wales
Sydney, NSW 2052, Australia
[email protected]

Tong Zhang
IBM Research, T.J. Watson Labs
Yorktown Heights, NY 10598, USA
[email protected]

Fred J. Damerau
IBM Research, T.J. Watson Labs
Yorktown Heights, NY 10598, USA
[email protected]

ISBN 0-387-95433-3
Printed on acid-free paper.

© 2005 Springer Science+Business Media, Inc. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, Inc., 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America. (MP)
9 8 7 6 5 4 3 2 1 SPIN 10864579
springeronline.com

Preface

Data mining is a mature technology. The prediction problem, looking for predictive patterns in data, has been widely studied. Strong methods are available to the practitioner. These methods process structured numerical information, where uniform measurements are taken over a sample of data. Text is often described as unstructured information. So, it would seem, text and numerical data are different, requiring different methods. Or are they?
In our view, a prediction problem can be solved by the same methods, whether the data are structured numerical measurements or unstructured text. Text and documents can be transformed into measured values, such as the presence or absence of words, and the same methods that have proven successful for predictive data mining can be applied to text. Yet, there are key differences. Evaluation techniques must be adapted to the chronological order of publication and to alternative measures of error. Because the data are documents, more specialized analytical methods may be preferred for text. Moreover, the methods must be modified to accommodate very high dimensions: tens of thousands of words and documents. Still, the central themes are similar.

Our view of text mining allows us to unify the concepts of different fields. No longer is “natural language processing” the sole domain of linguists and their allied computer specialists. No longer is search engine technology distinct from other forms of machine learning. Ours is an open view. We welcome you to try your hand at learning from data, whether numerical or text. You need not have a Ph.D. in linguistics to work in this area.

Not everyone will agree with our perspective. The natural language specialist may argue that ours is a shallow view of text that will solve some problems, but the bigger problems, such as answering questions posed by a user, can only be solved with a deeper understanding of language.

There is room for both viewpoints to coexist. Large text collections contain valuable information that can be mined with today’s tools instead of waiting for tomorrow’s techniques. While others search for the essence of language understanding, we can immediately look for recurring word patterns in large collections of digital documents.

Some parts of the book may seem simple to the advanced student or professional. Other parts may appear mathematical.
They all fit our common theme of a strictly empirical view of text mining and an application of well-known analytical methods. We provide examples and software. Our presentation has a pragmatic bent with numerous references in the research literature for you to follow when so inclined. We want to be practical, yet inclusive of the wide community that might be interested in applications of text mining. We concentrate on predictive learning methods but also look at information retrieval and search engines, as well as clustering methods. We illustrate by examples, case studies, and the accompanying downloadable software.

While some analytical methods may be highly developed, predictive text mining is an emerging area of application. We have tried to summarize our experiences and provide the tools and techniques for your own experiments.

Audience

Our book is aimed at IT professionals and managers as well as advanced undergraduate computer science students and beginning graduate students. Some background in data mining is beneficial but is not essential. If you are looking to do research in the area, the material in this book will provide direction in expanding your horizons. If you want to be a practitioner of text mining, you can read about our recommended methods and our descriptions of case studies.

Supplementary Web Software

Data-Miner Pty. Ltd. has provided a free software license for those who have purchased the book. The software, which implements many of the methods discussed in the book, can be downloaded from the data-miner.com Web site.

Acknowledgements

Some of the case studies in Chapter 7 are based on our prior publications. In those projects, we acknowledge the participation of Chidanand Apté, Radu Florian, Abraham Ittycheriah, Vijay Iyengar, Hongyan Jing, David Johnson, Frank Oles, Naval Verma, and Brian White. Arindam Banerjee made many helpful comments on a draft of our book. We thank our editors, Wayne Wheeler, Ann Kostant, and Wayne Yuhasz, for their support.
Our experiences in writing this book were quite enjoyable. We worked mostly on our own time, some of us located in different time zones, sometimes distant from home and communicating over the Internet. The four of us, three computer scientists and one linguist, are all colleagues and collaborators. Yet, we have worked in different areas, with substantial overlap in our approaches to text mining.

Sholom Weiss, Tong Zhang, and Fred Damerau - New York
Nitin Indurkhya - Australia and Brasil
Northern Summer and Southern Winter, 2004

Contents

Preface

1 Overview of Text Mining
  1.1 What’s Special about Text Mining?
    1.1.1 Structured or Unstructured Data?
    1.1.2 Is Text Different from Numbers?
  1.2 What Types of Problems Can Be Solved?
  1.3 Document Classification
  1.4 Information Retrieval
  1.5 Clustering and Organizing Documents
  1.6 Information Extraction
  1.7 Prediction and Evaluation
  1.8 The Next Chapters
  1.9 Historical and Bibliographical Remarks

2 From Textual Information to Numerical Vectors
  2.1 Collecting Documents
  2.2 Document Standardization
  2.3 Tokenization
  2.4 Lemmatization
    2.4.1 Inflectional Stemming
    2.4.2 Stemming to a Root
  2.5 Vector Generation for Prediction
    2.5.1 Multiword Features
    2.5.2 Labels for the Right Answers
    2.5.3 Feature Selection by Attribute Ranking
  2.6 Sentence Boundary Determination
  2.7 Part-Of-Speech Tagging
  2.8 Word Sense Disambiguation
  2.9 Phrase Recognition
  2.10 Named Entity Recognition
  2.11 Parsing
  2.12 Feature Generation
  2.13 Historical and Bibliographical Remarks

3 Using Text for Prediction
  3.1 Recognizing that Documents Fit a Pattern
  3.2 How Many Documents Are Enough?
  3.3 Document Classification
  3.4 Learning to Predict from Text
    3.4.1 Similarity and Nearest-Neighbor Methods
    3.4.2 Document Similarity
    3.4.3 Decision Rules
      3.4.3.1 How to Find the Best Decision Rules
    3.4.4 Scoring by Probabilities
    3.4.5 Linear Scoring Methods
      3.4.5.1 How to Find the Best Scoring Model
  3.5 Evaluation of Performance
    3.5.1 Estimating Current and Future Performance
    3.5.2 Getting the Most from a Learning Method
  3.6 Applications
  3.7 Historical and Bibliographical Remarks

4 Information Retrieval and Text Mining
  4.1 Is Information Retrieval a Form of Text Mining?
  4.2 Key Word Search
  4.3 Nearest-Neighbor Methods
  4.4 Measuring Similarity
    4.4.1 Shared Word Count
    4.4.2 Word Count and Bonus
    4.4.3 Cosine Similarity
  4.5 Web-Based Document Search
    4.5.1 Link Analysis
  4.6 Document Matching
  4.7 Inverted Lists
  4.8 Evaluation of Performance
  4.9 Historical and Bibliographical Remarks

5 Finding Structure in a Document Collection
  5.1 Clustering Documents by Similarity
  5.2 Similarity of Composite Documents
    5.2.1 k-Means Clustering
      5.2.1.1 Centroid Classifier
    5.2.2 Hierarchical Clustering
    5.2.3 The EM Algorithm
  5.3 What Do a Cluster’s Labels Mean?
  5.4 Applications
  5.5 Evaluation of Performance
  5.6 Historical and Bibliographical Remarks

6 Looking for Information in Documents
  6.1 Goals of Information Extraction
  6.2 Finding Patterns and Entities from Text
    6.2.1 Entity Extraction as Sequential Tagging
    6.2.2 Tag Prediction as Classification
    6.2.3 The Maximum Entropy Method
    6.2.4 Linguistic Features and Encoding
    6.2.5 Sequential Probability Model
  6.3 Coreference and Relationship Extraction
    6.3.1 Coreference Resolution
    6.3.2 Relationship Extraction
  6.4 Template Filling and Database Construction
  6.5 Applications
    6.5.1 Information Retrieval
    6.5.2 Commercial Extraction Systems
    6.5.3 Criminal Justice
    6.5.4 Intelligence
  6.6 Historical and Bibliographical Remarks

7 Case Studies
  7.1 Market Intelligence from the Web
  7.2 Lightweight Document Matching for Digital Libraries
  7.3 Generating Model Cases for Help Desk Applications
  7.4 Assigning Topics to News Articles
  7.5 E-mail Filtering
  7.6 Search Engines
  7.7 Extracting Named Entities from Documents
  7.8 Customized Newspapers
  7.9 Historical and Bibliographical Remarks

8 Emerging Directions
  8.1 Summarization
  8.2 Active Learning
  8.3 Learning with Unlabeled Data
  8.4 Different Ways of Collecting Samples
    8.4.1 Multiple Samples and Voting Methods
    8.4.2 Online Learning
    8.4.3 Cost-Sensitive Learning
    8.4.4 Unbalanced Samples and Rare Events
  8.5 Question Answering
  8.6 Historical and Bibliographical Remarks

Appendix: Software Notes
  A.1 Summary of Software
  A.2 Requirements
  A.3 Download Instructions

References
Author Index
Subject Index
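
The Preface states that text and documents can be transformed into measured values, such as the presence or absence of words, so that standard predictive methods apply. The following is a minimal sketch of that idea only, not the book's software: the function names (`tokenize`, `build_vocabulary`, `to_binary_vector`), the simple letters-only tokenizer, and the two sample documents are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens.

    Assumption: a letters-only regular expression stands in for a real tokenizer.
    """
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Return the sorted list of distinct words seen across all documents."""
    return sorted({word for doc in documents for word in tokenize(doc)})


def to_binary_vector(text, vocabulary):
    """Encode a document as 1/0 flags: 1 if the vocabulary word occurs in it."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical sample documents, invented for this sketch.
documents = [
    "Stock prices rose sharply",
    "Prices fell after the report",
]

vocabulary = build_vocabulary(documents)
vectors = [to_binary_vector(doc, vocabulary) for doc in documents]

for doc, vector in zip(documents, vectors):
    print(vector, doc)
```

Each document becomes one row of a numerical table with one column per vocabulary word, which is the kind of structured input that conventional prediction methods expect. With a realistic collection the vocabulary grows to tens of thousands of words, so a sparse representation would likely replace these dense lists.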
