Python 3 Text Processing with NLTK 3 Cookbook: Over 80 practical recipes on natural language processing techniques using Python's NLTK 3.0 PDF

304 Pages·2014·1.88 MB·English

Preview Python 3 Text Processing with NLTK 3 Cookbook: Over 80 practical recipes on natural language processing techniques using Python's NLTK 3.0

www.it-ebooks.info

Python 3 Text Processing with NLTK 3 Cookbook

Over 80 practical recipes on natural language processing techniques using Python's NLTK 3.0

Jacob Perkins

BIRMINGHAM - MUMBAI

Python 3 Text Processing with NLTK 3 Cookbook

Copyright © 2014 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: November 2010
Second edition: August 2014
Production reference: 1200814

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78216-785-3

www.packtpub.com

Cover image by Faiz Fattohi ([email protected])

Credits

Author: Jacob Perkins
Reviewers: Patrick Chan, Mohit Goenka, Lihang Li, Maurice HT Ling, Jing (Dave) Tian
Commissioning Editor: Kevin Colaco
Acquisition Editor: Kevin Colaco
Content Development Editor: Amey Varangaonkar
Technical Editor: Humera Shaikh
Copy Editors: Deepa Nambiar, Laxmi Subramanian
Project Coordinator: Leena Purkait
Proofreaders: Simran Bhogal, Paul Hindle
Indexers: Hemangini Bari, Mariammal Chettiyar, Tejal Soni, Priya Subramani
Graphics: Ronak Dhruv, Disha Haria, Yuvraj Mannari, Abhinash Sahu
Production Coordinators: Pooja Chiplunkar, Conidon Miranda, Nilesh R. Mohite
Cover Work: Pooja Chiplunkar

About the Author

Jacob Perkins is the cofounder and CTO of Weotta, a local search company. Weotta uses NLP and machine learning to create powerful and easy-to-use natural language search for what to do and where to go.

He is the author of Python Text Processing with NLTK 2.0 Cookbook, Packt Publishing, and has contributed a chapter to the Bad Data Handbook, O'Reilly Media. He writes about NLTK, Python, and other technology topics at http://streamhacker.com.

To demonstrate the capabilities of NLTK and natural language processing, he developed http://text-processing.com, which provides simple demos and NLP APIs for commercial use. He has contributed to various open source projects, including NLTK, and created NLTK-Trainer to simplify the process of training NLTK models. For more information, visit https://github.com/japerk/nltk-trainer.

I would like to thank my friends and family for their part in making this book possible. And thanks to the editors and reviewers at Packt Publishing for their helpful feedback and suggestions. Finally, this book wouldn't be possible without the fantastic NLTK project and team: http://www.nltk.org/.
About the Reviewers

Patrick Chan is an avid Python programmer and uses Python extensively for data processing.

I would like to thank my beautiful wife, Thanh Tuyen, for her endless patience and understanding in putting up with my various late night hacking sessions.

Mohit Goenka is a software developer in the Yahoo Mail team. Earlier, he graduated from the University of Southern California (USC) with a Master's degree in Computer Science. His thesis focused on Game Theory and Human Behavior concepts as applied in real-world security games. He also received an award for academic excellence from the Office of International Services at the University of Southern California. He has worked in various areas of computing, including artificial intelligence, machine learning, path planning, multiagent systems, neural networks, computer vision, computer networks, and operating systems. During his tenure as a student, he won multiple competitions cracking codes and presented his work on Detection of Untouched UFOs to a wide audience. Not only is he a software developer by profession, but coding is also his hobby. He spends most of his free time learning about new technology and developing his skills.

Adding a feather to his cap are his poetic skills. Some of his works are part of the University of Southern California Libraries archive under the cover of The Lewis Carroll collection. In addition to this, he has made significant contributions by volunteering his time to serve the community.

Lihang Li received his BE degree in Mechanical Engineering from Huazhong University of Science and Technology (HUST), China, in 2012, and is now pursuing his MS degree in Computer Vision at the National Laboratory of Pattern Recognition (NLPR) of the Institute of Automation, Chinese Academy of Sciences (IACAS). As a graduate student, he is focusing on Computer Vision and especially on vision-based SLAM algorithms.
In his free time, he likes to take part in open source activities and is now the President of the Open Source Club, Chinese Academy of Sciences. He also builds multicopters as a hobby and is part of a team called OpenDrone from BLUG (Beijing Linux User Group). His interests include Linux, open source, cloud computing, virtualization, computer vision, operating systems, machine learning, data mining, and a variety of programming languages. You can find him by visiting his personal website http://hustcalm.me.

Many thanks to my girlfriend Jingjing Shao, who is always with me. Also, I must thank the entire team at Packt Publishing, and I would like to thank Kartik, who is a very good Project Coordinator. I would also like to thank the other reviewers; though we haven't met, I'm really happy working with you.

Maurice HT Ling completed his PhD in Bioinformatics and BSc (Hons) in Molecular and Cell Biology at The University of Melbourne. He is currently a Research Fellow at Nanyang Technological University, Singapore, and an Honorary Fellow at The University of Melbourne, Australia. He co-edits The Python Papers and co-founded the Python User Group (Singapore), where he has been serving as an executive committee member since 2010. His research interests lie in life (biological life, and artificial life and artificial intelligence) and in using computer science and statistics as tools to understand life and its numerous aspects. His personal website is http://maurice.vodien.com.

Jing (Dave) Tian is now a graduate research fellow and a PhD student in the Computer and Information Science and Engineering (CISE) department at the University of Florida. His research direction involves system security, embedded system security, trusted computing, and static analysis for security and virtualization. He is interested in Linux kernel hacking and compilers.
He also spent a year on AI and machine learning directions and taught classes on Intro to Problem Solving using Python and Operating System in the Computer Science department at the University of Oregon. Before that, he worked as a software developer in the Linux Control Platform (LCP) group in Alcatel-Lucent (formerly Lucent Technologies) R&D for around 4 years. He holds BS and ME degrees in EE, earned in China. His website is http://davejingtian.org.

I would like to thank the author of the book, who has done a good job for both Python and NLTK. I would also like to thank the editors of the book, who made this book perfect and offered me the opportunity to review such a nice book.

www.PacktPub.com

Support files, eBooks, discount offers, and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?
- Fully searchable across every book published by Packt
- Copy and paste, print and bookmark content
- On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Table of Contents

Preface

Chapter 1: Tokenizing Text and WordNet Basics
    Introduction
    Tokenizing text into sentences
    Tokenizing sentences into words
    Tokenizing sentences using regular expressions
    Training a sentence tokenizer
    Filtering stopwords in a tokenized sentence
    Looking up Synsets for a word in WordNet
    Looking up lemmas and synonyms in WordNet
    Calculating WordNet Synset similarity
    Discovering word collocations

Chapter 2: Replacing and Correcting Words
    Introduction
    Stemming words
    Lemmatizing words with WordNet
    Replacing words matching regular expressions
    Removing repeating characters
    Spelling correction with Enchant
    Replacing synonyms
    Replacing negations with antonyms

Chapter 3: Creating Custom Corpora
    Introduction
    Setting up a custom corpus
    Creating a wordlist corpus
    Creating a part-of-speech tagged word corpus
    Creating a chunked phrase corpus
    Creating a categorized text corpus
