Python 2.6 Text Processing
Beginner's Guide

The easiest way to learn how to manipulate text with Python

Jeff McNeil

BIRMINGHAM - MUMBAI

Copyright © 2010 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2010
Production Reference: 1081210

Published by Packt Publishing Ltd.
32 Lincoln Road
Olton
Birmingham, B27 6PA, UK.

ISBN 978-1-849512-12-1
www.packtpub.com

Cover Image by John Quick ([email protected])

Credits

Author: Jeff McNeil
Reviewer: Maurice HT Ling
Acquisition Editor: Steven Wilding
Development Editor: Reshma Sundaresan
Technical Editor: Gauri Iyer
Indexer: Tejal Daruwale
Editorial Team Leader: Mithun Sehgal
Project Team Leader: Priya Mukherji
Project Coordinator: Shubhanjan Chatterjee
Proofreader: Jonathan Todd
Graphics: Nilesh R. Mohite
Production Coordinator: Kruthika Bangera
Cover Work: Kruthika Bangera

About the Author

Jeff McNeil has been working in the Internet Services industry for over 10 years. He cut his teeth during the late 90's Internet boom and has been developing software for Unix and Unix-flavored systems ever since.
Jeff has been a full-time Python developer for the better half of that time and has professional experience with a collection of other languages, including C, Java, and Perl. He takes an interest in systems administration and server automation problems. Jeff recently joined Google and has had the pleasure of working with some very talented individuals.

Above all, I'd like to thank Julie, Savannah, Phoebe, Maya, and Trixie for allowing me to lock myself in the office every night for months. Thanks also to the Web.com gang and those in the Python community willing to share their authoring experiences. Finally, thanks to Steven Wilding, Reshma Sundaresan, Shubhanjan Chatterjee, and the rest of the Packt Publishing team for all of the hard work and guidance.

About the Reviewer

Maurice HT Ling completed his Ph.D. in Bioinformatics and B.Sc. (Hons) in Molecular and Cell Biology at the University of Melbourne, where he worked on microarray analysis and text mining for protein-protein interactions. He is currently an honorary fellow at the University of Melbourne, Australia. Maurice holds several Chief Editorships, including The Python Papers, Computational and Mathematical Biology, and Methods and Cases in Computational, Mathematical, and Statistical Biology. In Singapore, he co-founded the Python User Group (Singapore) and is the co-chair of PyCon Asia-Pacific 2010. In his free time, Maurice likes to train in the gym, read, and enjoy a good cup of coffee. He is also a senior fellow of the International Fitness Association, USA.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read, and search across Packt's entire library of books.

Why Subscribe?

  Fully searchable across every book published by Packt
  Copy and paste, print, and bookmark content
  On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Table of Contents

Preface

Chapter 1: Getting Started
  Categorizing types of text data
  Providing information through markup
  Meaning through structured formats
  Understanding freeform content
  Ensuring you have Python installed
  Providing support for Python 3
  Implementing a simple cipher
  Time for action – implementing a ROT13 encoder
  Processing structured markup with a filter
  Time for action – processing as a filter
  Time for action – skipping over markup tags
  State machines
  Supporting third-party modules
  Packaging in a nutshell
  Time for action – installing SetupTools
  Running a virtual environment
  Configuring virtualenv
  Time for action – configuring a virtual environment
  Where to get help?
  Summary

Chapter 2: Working with the IO System
  Parsing web server logs
  Time for action – generating transfer statistics
  Using objects interchangeably
  Time for action – introducing a new log format
  Accessing files directly
  Time for action – accessing files directly
  Context managers
  Handling other file types
  Time for action – handling compressed files
  Implementing file-like objects
  File object methods
  Enabling universal newlines
  Accessing multiple files
  Time for action – spell-checking HTML content
  Simplifying multiple file access
  Inplace filtering
  Accessing remote files
  Time for action – spell-checking live HTML pages
  Error handling
  Time for action – handling urllib2 errors
  Handling string IO instances
  Understanding IO in Python 3
  Summary

Chapter 3: Python String Services
  Understanding the basics of string object
  Defining strings
  Time for action – employee management
  Building non-literal strings
  String formatting
  Time for action – customizing log processor output
  Percent (modulo) formatting
  Mapping key
  Conversion flags
  Minimum width
  Precision
  Width
  Conversion type
  Using the format method approach
  Time for action – adding status code data
  Making use of conversion specifiers
  Creating templates
  Time for action – displaying warnings on malformed lines
  Template syntax
  Rendering a template
  Calling string object methods
  Time for action – simple manipulation with string methods
  Aligning text
  Detecting character classes
  Casing
  Searching strings
  Dealing with lists of strings
  Treating strings as sequences
  Summary

Chapter 4: Text Processing Using the Standard Library
  Reading CSV data
  Time for action – processing Excel formats
  Time for action – CSV and formulas
  Reading non-Excel data
  Time for action – processing custom CSV formats
  Writing CSV data
  Time for action – creating a spreadsheet of UNIX users
  Modifying application configuration files
  Time for action – adding basic configuration read support
  Using value interpolation
  Time for action – relying on configuration value interpolation
  Handling default options
  Time for action – configuration defaults
  Writing configuration data
  Time for action – generating a configuration file
  Reconfiguring our source
  A note on Python 3
  Time for action – creating an egg-based package
  Understanding the setup.py file
  Working with JSON
  Time for action – writing JSON data
  Encoding data
  Decoding data
  Summary

Chapter 5: Regular Expressions
  Simple string matching
  Time for action – testing an HTTP URL
  Understanding the match function
  Learning basic syntax
  Detecting repetition
  Specifying character sets and classes
  Applying anchors to restrict matches
  Wrapping it up