Natural Language Processing: Python and NLTK
Learn to build expert NLP and machine learning projects using NLTK and other Python libraries
A course in three modules

BIRMINGHAM - MUMBAI

Copyright © 2016 Packt Publishing
Published: November 2016
Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-78728-510-1
www.packtpub.com

Preface

NLTK is one of the most popular and widely used libraries in the natural language processing (NLP) community. The beauty of NLTK lies in its simplicity: most complex NLP tasks can be implemented in a few lines of code. In this learning path you will:

• Start off by learning how to tokenize text into component words.
• Explore and make use of the WordNet language dictionary.
• Learn how and when to stem or lemmatize words.
• Discover various ways to replace words and perform spelling correction.
• Create your own custom text corpora and corpus readers, including a MongoDB-backed corpus.
• Use part-of-speech taggers to annotate words with their parts of speech.
• Create and transform chunked phrase trees using partial parsing.
• Dig into feature extraction for text classification and sentiment analysis.
• Learn how to do parallel and distributed text processing, and how to store word distributions in Redis.

This learning path teaches you all that and more, in a hands-on, learn-by-doing manner. Become an expert in using NLTK for natural language processing with this useful companion.

What this learning path covers

Module 1, NLTK Essentials, covers all the preprocessing steps required in any text mining/NLP task. In this module, we discuss tokenization, stemming, stop word removal, and other text cleansing processes in detail, and show how easy they are to implement in NLTK.

Module 2, Python 3 Text Processing with NLTK 3 Cookbook, explains how to use corpus readers and create custom corpora. It also covers how to use some of the corpora that come with NLTK.
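The preprocessing tasks listed in the preface (tokenizing text, stemming words) can be sketched in a few lines of NLTK. The snippet below is an illustrative example, not code from the course itself; it uses only RegexpTokenizer and PorterStemmer, which ship with NLTK and need no separate corpus downloads:

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Split the text on runs of word characters -- no pre-trained model needed.
tokenizer = RegexpTokenizer(r"\w+")
stemmer = PorterStemmer()

text = "The cats are running quickly"
tokens = tokenizer.tokenize(text)
print(tokens)  # ['The', 'cats', 'are', 'running', 'quickly']

# Reduce each token to its Porter stem.
print([stemmer.stem(t) for t in tokens])
```

Module 1 covers these steps, along with lemmatization and stop word removal, in detail.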
It covers the chunking process, also known as partial parsing, which can identify phrases and named entities in a sentence. It also explains how to train your own custom chunker and create specific named entity recognizers.

Module 3, Mastering Natural Language Processing with Python, covers how to calculate word frequencies and perform various language modeling techniques. It also discusses the concept and application of shallow semantic analysis (that is, NER) and WSD using WordNet, and will help you understand and apply the concepts of information retrieval and text summarization.

What you need for this learning path

Module 1: We need the following software for this module:

Chapter | Software required (with version) | Free/Proprietary | Download links | Hardware specifications | OS required
1-5 | Python/Anaconda, NLTK | Free | https://www.python.org/, http://continuum.io/downloads, http://www.nltk.org/ | Common Unix Printing System | any
6 | scikit-learn and gensim | Free | http://scikit-learn.org/stable/, https://radimrehurek.com/gensim/ | Common Unix Printing System | any
7 | Scrapy | Free | http://scrapy.org/ | Common Unix Printing System | any
8 | NumPy, SciPy, pandas, and matplotlib | Free | http://www.numpy.org/, http://www.scipy.org/, http://pandas.pydata.org/, http://matplotlib.org/ | Common Unix Printing System | any
9 | Twitter Python APIs and Facebook APIs | Free | https://dev.twitter.com/overview/api/twitter-libraries, https://developers.facebook.com | Common Unix Printing System | any

Module 2: You will need Python 3 and the listed Python packages. For this learning path, the author used Python 3.3.5. To install the packages, you can use pip (https://pypi.python.org/pypi/pip/).
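As a quick sketch of the pip workflow (the file name requirements.txt is an assumption about how you save the package list, not something the course prescribes):

```shell
# Install a single package with a minimum version constraint.
pip install "NLTK>=3.0a4"

# Or save the package list that follows into a file, e.g. requirements.txt,
# and install everything in one command:
pip install -r requirements.txt
```

Quoting the version specifier keeps the shell from interpreting the >= characters as redirection.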
The following is the list of packages, in requirements format, with the version numbers used while writing this learning path:

• NLTK>=3.0a4
• pyenchant>=1.6.5
• lockfile>=0.9.1
• numpy>=1.8.0
• scipy>=0.13.0
• scikit-learn>=0.14.1
• execnet>=1.1
• pymongo>=2.6.3
• redis>=2.8.0
• lxml>=3.2.3
• beautifulsoup4>=4.3.2
• python-dateutil>=2.0
• charade>=1.0.3

You will also need NLTK-Trainer, which is available at https://github.com/japerk/nltk-trainer. Beyond Python, a couple of recipes use MongoDB and Redis, both NoSQL databases. They can be downloaded from http://www.mongodb.org/ and http://redis.io/, respectively.

Module 3: Python 2.7 or 3.2+ is used for all the chapters. NLTK 3.0 must be installed on either a 32-bit or a 64-bit machine. The required operating system is Windows, Mac, or Unix.

Who this learning path is for

If you are an NLP or machine learning enthusiast and an intermediate Python programmer who wants to quickly master NLTK for natural language processing, then this learning path will do you a lot of good. Students of linguistics and semantic/sentiment analysis professionals will find it invaluable.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this course, what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the course's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a course, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt course, we have a number of things to help you get the most from your purchase.

Downloading the example code

You can download the example code files for this course from your account at http://www.packtpub.com.
If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer over the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the course in the Search box.
5. Select the course for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this course.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account. Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

• WinRAR / 7-Zip for Windows
• Zipeg / iZip / UnRarX for Mac
• 7-Zip / PeaZip for Linux

The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/Natural-Language-Processing-Python-and-NLTK. We also have other code bundles from our rich catalog of books, videos, and courses available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report it to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata.
Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Course Module 1: NLTK Essentials

Chapter 1: Introduction to Natural Language Processing
  Why learn NLP?
  Let's start playing with Python!
  Diving into NLTK
  Your turn
  Summary
Chapter 2: Text Wrangling and Cleansing
  What is text wrangling?
  Text cleansing
  Sentence splitter
  Tokenization
  Stemming
  Lemmatization
  Stop word removal
  Rare word removal
  Spell correction
  Your turn
  Summary
Chapter 3: Part of Speech Tagging
  What is part of speech tagging?
  Named Entity Recognition (NER)
  Your turn
  Summary
Chapter 4: Parsing Structure in Text
  Shallow versus deep parsing
  The two approaches in parsing
  Why we need parsing
  Different types of parsers
  Dependency parsing
  Chunking
  Information extraction
  Summary
Chapter 5: NLP Applications
  Building your first NLP application
  Other NLP applications
  Summary
Chapter 6: Text Classification
  Machine learning
  Text classification
  Sampling
  The Random forest algorithm
  Text clustering
  Topic modeling in text
  References
  Summary
Chapter 7: Web Crawling
  Web crawlers
  Writing your first crawler
  Data flow in Scrapy
  The Sitemap spider
  The item pipeline
  External references
  Summary
Chapter 8: Using NLTK with Other Python Libraries
  NumPy
  SciPy
  pandas
  matplotlib
  External references
  Summary
Chapter 9: Social Media Mining in Python
  Data collection