Text Mining with MATLAB®

Rafael E. Banchs
Barcelona

ISBN 978-1-4614-4150-2
ISBN 978-1-4614-4151-9 (eBook)
DOI 10.1007/978-1-4614-4151-9
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2012940228
© Springer Science+Business Media New York 2013

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

This book is the result of a multidisciplinary journey during the last few years of my career. As an Electrical Engineer who did his dissertation on Electromagnetic Field Theory, I found myself surprised to be writing a book on Text Mining! It was in 2004 that I first got involved in natural language processing research, at the TALP research centre of Universitat Politècnica de Catalunya, in Barcelona. There, I participated in the European Project TC-STAR, which focused on the problem of speech-to-speech translation. Later, at Barcelona Media Innovation Centre, I got the opportunity to further explore other natural language applications and problems such as information retrieval and sentiment analysis. Currently, at the Institute for Infocomm Research, in Singapore, I am focused on the problems of knowledge representation and pragmatics.

In my experience as an Electrical Engineer, the MATLAB® programming platform has always been an excellent tool for conducting experimental research and proofs of concept, as well as for implementing prototypes and applications. However, in the natural language processing community, with the exception of some machine learning practitioners who have entered the community via text mining applications, there is not a well-established culture of using the MATLAB® platform. As technical computing software that specializes in operating with matrices and vectors, MATLAB® offers an excellent framework for text mining and natural language processing research and development.

This book has been written with two objectives in mind. It aims at opening the doors of natural language research and applications to MATLAB® users from other disciplines, as well as introducing new practitioners in the field to the many possibilities offered by the MATLAB® programming platform.
The book has been conceived as an introductory book, which should be easy to follow and digest. All examples and figures presented in the book can be reproduced by following the same procedures described for each case.

Finally, I would like to thank all the people who have encouraged and helped me to make this project come to life. Special thanks to the MathWorks Book Program and the Springer Editorial team for all the support provided. Thanks also to my colleagues who have helped review the different chapters of the book. And special thanks to my wife, my children and my parents for their support and their patience!

Singapore, January 2012
Rafael E. Banchs

MATLAB® is a registered trademark of The MathWorks, Inc.

All program examples and figures presented in this book have been generated with MATLAB® version 7.12.0.635 (R2011a).

All illustrated examples of MATLAB® dialog boxes, figures and graphic interfaces have been reprinted with permission from The MathWorks, Inc.

For MATLAB® and Simulink® product information, please contact:

The MathWorks, Inc.
3 Apple Hill Drive
Natick, MA 01760-2098, USA
Tel: 508-647-7000
Fax: 508-647-7001
E-mail: [email protected]
Web: mathworks.com

Contents

1 Introduction
  1.1 About Text Mining and MATLAB®
  1.2 About this Book
  1.3 A (Very) Brief Introduction to MATLAB®
  1.4 Further Reading
  References

Part I Fundamentals

2 Handling Textual Data
  2.1 Characters and Character Arrays
  2.2 Handling Text with Cell Arrays
  2.3 Handling Text with Structures
  2.4 Some Useful Functions
  2.5 Further Reading
  2.6 Proposed Exercises
  References

3 Regular Expressions
  3.1 Basic Operators for Matching Characters
  3.2 Matching Sequences of Characters
  3.3 Conditional Matching
  3.4 Working with Tokens
  3.5 Further Reading
  3.6 Proposed Exercises
  References

4 Basic Operations with Strings
  4.1 Searching and Comparing
  4.2 Replacement and Insertion
  4.3 Segmentation and Concatenation
  4.4 Set Operations
  4.5 Further Reading
  4.6 Proposed Exercises
  References

5 Reading and Writing Files
  5.1 Basic File Formats
  5.2 Other Useful Formats
  5.3 Handling Files and Directories
  5.4 Further Reading
  5.5 Proposed Exercises
  References

Part II Mathematical Models

6 Basic Corpus Statistics
  6.1 Fundamental Properties
  6.2 Word Co-Occurrences
  6.3 Accounting for Order
  6.4 Further Reading
  6.5 Proposed Exercises
  6.6 Short Projects
  References

7 Statistical Models
  7.1 Basic n-Gram Models
  7.2 Discounting
  7.3 Model Interpolation
  7.4 Statistical Bag-of-Words
  7.5 Further Reading
  7.6 Proposed Exercises
  7.7 Short Projects
  References

8 Geometrical Models
  8.1 The Term-Document Matrix
  8.2 The Vector Space Model
  8.3 Association Scores and Distances
  8.4 Further Reading
  8.5 Proposed Exercises
  8.6 Short Projects
  References

9 Dimensionality Reduction
  9.1 Vocabulary Pruning and Merging
  9.2 The Linear Transformation Approach
  9.3 Non-linear Projection Methods
  9.4 Further Reading
  9.5 Proposed Exercises
  9.6 Short Projects
  References

Part III Methods and Applications

10 Document Categorization
  10.1 Data Collection Preparation
  10.2 Unsupervised Clustering
  10.3 Supervised Classification in Vector Space
  10.4 Supervised Classification in Probability Space
  10.5 Further Reading
  10.6 Proposed Exercises
  10.7 Short Projects
  References

11 Document Search
  11.1 Binary Search
  11.2 Vector-Based Search
  11.3 Cross-Language Search
  11.4 Further Reading
  11.5 Proposed Exercises
  11.6 Short Projects
  References

12 Content Analysis
  12.1 Dimensions of Analysis
  12.2 Polarity Estimation
  12.3 Property Extraction
  12.4 Further Reading
  12.5 Proposed Exercises
  12.6 Short Projects
  References

Index

Part I
Fundamentals

Fundamental brainwork, is what makes the difference in all art.
Dante Gabriel Rossetti

Chapter 2
Handling Textual Data

R. E. Banchs, Text Mining with MATLAB®, DOI: 10.1007/978-1-4614-4151-9_2, © Springer Science+Business Media New York 2013

This chapter introduces the main variable classes that are used in the MATLAB® programming environment for representing and handling text. First, Sect. 2.1 describes the character array, which is the basic variable type for representing text. Then, Sects. 2.2 and 2.3 describe cell arrays and structures, respectively, which are the most commonly used variable classes for handling and operating on text. Finally, Sect. 2.4 gives a brief overview of the MATLAB® built-in functions for operating on text, as well as other useful functions worth knowing.

2.1 Characters and Character Arrays

The basic variable type for representing text in the MATLAB® programming environment is the character. A character is a variable used to represent symbols in some predefined encoding system. Depending on the encoding scheme being used, a character can be represented with one, two or more bytes. By default, when writing and reading text files from your system, the MATLAB® environment uses the default encoding of the operating system. However, in the case of MATLAB® data files, the Unicode encoding system is used. This guarantees the portability of data files across systems. The specific encoding of a given text can be changed at any moment by using the MATLAB® functions native2unicode and unicode2native. More details on encoding schemes will be given in Chap. 5, where we will focus our attention on the problem of writing and reading text files.

A text string is represented as an array (or matrix) of characters. To define a variable as a character array, its value must be provided within apostrophes. Try the following example in the command line:
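
Before that example, a brief aside on the encoding remarks above. The sketch below is not taken from the book. It assumes the two-argument forms of unicode2native and native2unicode with the encoding names 'UTF-8' and 'ISO-8859-1', and all variable names are illustrative. It shows that the same character array can occupy a different number of bytes depending on the encoding scheme.

```matlab
% Build a short character array. char(233) is the code point of the
% letter "e with acute accent", so this source stays plain ASCII.
word = ['caf', char(233)];

% Encode the same four characters under two different schemes.
utf8Bytes  = unicode2native(word, 'UTF-8');       % uint8 vector
latinBytes = unicode2native(word, 'ISO-8859-1');  % uint8 vector

% Report how many characters and how many bytes each encoding uses.
fprintf('characters:       %d\n', numel(word));
fprintf('UTF-8 bytes:      %d\n', numel(utf8Bytes));
fprintf('ISO-8859-1 bytes: %d\n', numel(latinBytes));

% Decoding with the matching scheme recovers the original characters.
recovered = native2unicode(utf8Bytes, 'UTF-8');
fprintf('round trip ok:    %d\n', double(isequal(recovered, word)));
```

Here the accented letter takes two bytes in UTF-8 but a single byte in ISO-8859-1, so the four-character array is encoded as five and four bytes, respectively. Decoding bytes with a scheme other than the one used to encode them would generally not recover the original text.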