
Text processing in Python PDF

479 Pages·2.537 MB·English
by David Mertz

Preview Text processing in Python

Text Processing in Python
By David Mertz

Publisher: Addison Wesley
Pub Date: June 06, 2003
ISBN: 0-321-11254-7
Pages: 544

Text Processing in Python is an example-driven, hands-on tutorial that carefully teaches programmers how to accomplish numerous text processing tasks using the Python language. Filled with concrete examples, this book provides efficient and effective solutions to specific text processing problems and practical strategies for dealing with all types of text processing challenges.

Text Processing in Python begins with an introduction to text processing and contains a quick Python tutorial to get you up to speed. It then delves into essential text processing subject areas, including string operations, regular expressions, parsers and state machines, and Internet tools and techniques. Appendixes cover such important topics as data compression and Unicode. A comprehensive index and plentiful cross-referencing offer easy access to available information. In addition, exercises throughout the book provide readers with further opportunity to hone their skills either on their own or in the classroom. A companion Web site (http://gnosis.cx/TPiP) contains source code and examples from the book.

Here is some of what you will find in this book:

    When do I use formal parsers to process structured and semi-structured data? Page 257
    How do I work with full text indexing? Page 199
    What patterns in text can be expressed using regular expressions? Page 204
    How do I find a URL or an email address in text? Page 228
    How do I process a report with a concrete state machine? Page 274
    How do I parse, create, and manipulate internet formats? Page 345
    How do I handle lossless and lossy compression? Page 454
    How do I find codepoints in Unicode? Page 465

Table of Contents

Copyright
Preface
    Section 0.1. What Is Text Processing?
    Section 0.2. The Philosophy of Text Processing
    Section 0.3. What You'll Need to Use This Book
    Section 0.4. Conventions Used in This Book
    Section 0.5. A Word on Source Code Examples
    Section 0.6. External Resources
Acknowledgments
Chapter 1. Python Basics
    Section 1.1. Techniques and Patterns
    Section 1.2. Standard Modules
    Section 1.3. Other Modules in the Standard Library
Chapter 2. Basic String Operations
    Section 2.1. Some Common Tasks
    Section 2.2. Standard Modules
    Section 2.3. Solving Problems
Chapter 3. Regular Expressions
    Section 3.1. A Regular Expression Tutorial
    Section 3.2. Some Common Tasks
    Section 3.3. Standard Modules
Chapter 4. Parsers and State Machines
    Section 4.1. An Introduction to Parsers
    Section 4.2. An Introduction to State Machines
    Section 4.3. Parser Libraries for Python
Chapter 5. Internet Tools and Techniques
    Section 5.1. Working with Email and Newsgroups
    Section 5.2. World Wide Web Applications
    Section 5.3. Synopses of Other Internet Modules
    Section 5.4. Understanding XML
Appendix A. A Selective and Impressionistic Short Review of Python
    Section A.1. What Kind of Language Is Python?
    Section A.2. Namespaces and Bindings
    Section A.3. Datatypes
    Section A.4. Flow Control
    Section A.5. Functional Programming
Appendix B. A Data Compression Primer
    Section B.1. Introduction
    Section B.2. Lossless and Lossy Compression
    Section B.3. A Data Set Example
    Section B.4. Whitespace Compression
    Section B.5. Run-Length Encoding
    Section B.6. Huffman Encoding
    Section B.7. Lempel-Ziv Compression
    Section B.8. Solving the Right Problem
    Section B.9. A Custom Text Compressor
    Section B.10. References
Appendix C. Understanding Unicode
    Section C.1. Some Background on Characters
    Section C.2. What Is Unicode?
    Section C.3. Encodings
    Section C.4. Declarations
    Section C.5. Finding Codepoints
    Section C.6. Resources
Appendix D. A State Machine for Adding Markup to Text
Appendix E. Glossary

Copyright

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designators appear in this book, and Addison-Wesley was aware of the trademark claim, the designations have been printed in initial capital letters or all capital letters.

The author and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

The publisher offers discounts on this book when ordered in quantity for bulk purchases and special sales. For more information, please contact:

    U.S. Corporate and Government Sales
    (800) 382-3419
    [email protected]

For sales outside of the U.S., please contact:

    International Sales
    (317) 581-3793
    [email protected]

Visit Addison-Wesley on the Web: www.awprofessional.com

Library of Congress Cataloging-in-Publication Data

    Mertz, David.
    Text processing in Python / David Mertz.
    p. cm.
    Includes bibliographical references and index.
    ISBN 0-321-11254-7 (alk. paper)
    1. Text processing (Computer science) 2. Python (Computer program language) I. Title.
    QA76.9.T48M47 2003
    005.13'-dc21
    2003043686

Copyright © 2003 by Pearson Education, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior consent of the publisher. Printed in the United States of America. Published simultaneously in Canada.
For information on obtaining permission for use of material from this work, please submit a written request to:

    Pearson Education, Inc.
    Rights and Contracts Department
    75 Arlington Street, Suite 300
    Boston, MA 02116
    Fax: (617) 848-7047

1 2 3 4 5 6 7 8 9 10-CRS-0706050403

First printing, June 2003

Preface

    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one—and preferably only one—obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is better than never.
    Although never is often better than right now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea—let's do more of those!

    —Tim Peters, "The Zen of Python"

0.1 What Is Text Processing?

At the broadest level text processing is simply taking textual information and doing something with it. This doing might be restructuring or reformatting it, extracting smaller bits of information from it, algorithmically modifying the content of the information, or performing calculations that depend on the textual information. The lines between "text" and the even more general term "data" are extremely fuzzy; at an approximation, "text" is just data that lives in forms that people can themselves read—at least in principle, and maybe with a bit of effort.
Most typically computer "text" is composed of sequences of bits that have a "natural" representation as letters, numerals, and symbols; most often such text is delimited (if delimited at all) by symbols and formatting that can be easily pronounced as "next datum."

The lines are fuzzy, but the data that seems least like text—and that, therefore, this particular book is least concerned with—is the data that makes up "multimedia" (pictures, sounds, video, animation, etc.) and data that makes up UI "events" (draw a window, move the mouse, open an application, etc.). Like I said, the lines are fuzzy, and some representations of the most nontextual data are themselves pretty textual. But in general, the subject of this book is all the stuff on the near side of that fuzzy line.

Text processing is arguably what most programmers spend most of their time doing. The information that lives in business software systems mostly comes down to collections of words about the application domain—maybe with a few special symbols mixed in. Internet communications protocols consist mostly of a few special words used as headers, a little bit of constrained formatting, and message bodies consisting of additional wordish texts. Configuration files, log files, CSV and fixed-length data files, error files, documentation, and source code itself are all just sequences of words with bits of constraint and formatting applied. Programmers and developers spend so much time with text processing that it is easy to forget that that is what we are doing.

The most common text processing application is probably your favorite text editor. Beyond simple entry of new characters, text editors perform such text processing tasks as search/replace and copy/paste, which—given guided interaction with the user—accomplish sophisticated manipulation of textual sources.
Many text editors go farther than these simple capabilities and include their own complete programming systems (usually called "macro processing"); in those cases where editors include "Turing-complete" macro languages, text editors suffice, in principle, to accomplish anything that the examples in this book can.

After text editors, a variety of text processing tools are widely used by developers. Tools like "File Find" under Windows, or "grep" on Unix (and other platforms), perform the basic chore of locating text patterns. "Little languages" like sed and awk perform basic text manipulation (or even nonbasic). A large number of utilities—especially in Unix-like environments—perform small custom text processing tasks: wc, sort, tr, md5sum, uniq, split, strings, and many others.

At the top of the text processing food chain are general-purpose programming languages, such as Python. I wrote this book on Python in large part because Python is such a clear, expressive, and general-purpose language. But for all Python's virtues, text editors and "little" utilities will always have an important place for developers "getting the job done." As simple as Python is, it is still more complicated than you need to achieve many basic tasks. But once you get past the very simple, Python is a perfect language for making the difficult things possible (and it is also good at making the easy things simple).

0.2 The Philosophy of Text Processing

Hang around any Python discussion groups for a little while, and you will certainly be dazzled by the contributions of the Python developer, Tim Peters (and by a number of other Pythonistas). His "Zen of Python" captures much of the reason that I choose Python as the language in which to solve most programming tasks that are presented to me.
But to understand what is most special about text processing as a programming task, it is worth turning to Perl creator Larry Wall's cardinal virtues of programming: laziness, impatience, hubris.

What sets text processing most clearly apart from other tasks computer programmers accomplish is the frequency with which we perform text processing on an ad hoc or "one-shot" basis. One rarely bothers to create a one-shot GUI interface for a program. You even less frequently perform a one-shot normalization of a relational database. But every programmer with a little experience has had numerous occasions where she has received a trickle of textual information (or maybe a deluge of it) from another department, from a client, from a developer working on a different project, or from data dumped out of a DBMS; the problem in such cases is always to "process" the text so that it is usable for your own project, program, database, or work unit. Text processing to the rescue. This is where the virtue of impatience first appears—we just want the stuff processed, right now!

But text processing tasks that were obviously one-shot tasks that we knew we would never need again have a habit of coming back like restless ghosts. It turns out that that client needs to update the one-time data they sent last month. Or the boss decides that she would really like a feature of that text summarized in a slightly different way. The virtue of laziness is our friend here—with our foresight not to actually delete those one-shot scripts, we have them available for easy reuse and/or modification when the need arises.

Enough is not enough, however. That script you reluctantly used a second time turns out to be quite similar to a more general task you will need to perform frequently, perhaps even automatically.
You imagine that with only a slight amount of extra work you can generalize and expand the script, maybe add a little error checking and some runtime options while you are at it; and do it all in time and under budget (or even as a side project, off the budget). Obviously, this is the voice of that greatest of programmers' virtues: hubris.

The goal of this book is to make its readers a little lazier, a smidgeon more impatient, and a whole bunch more hubristic. Python just happens to be the language best suited to the study of virtue.

0.3 What You'll Need to Use This Book

This book is ideally suited for programmers who are a little bit familiar with Python, and whose daily tasks involve a fair amount of text processing chores. Programmers who have some background in other programming languages—especially with other "scripting" languages—should be able to pick up enough Python to get going by reading Appendix A.

While Python is a rather simple language at heart, this book is not intended as a tutorial on Python for nonprogrammers. Instead, this book is about two other things: getting the job done, pragmatically and efficiently; and understanding why what works works and what doesn't work doesn't work, theoretically and conceptually. As such, we hope this book can be useful both to working programmers and to students of programming at a level just past the introductory.

Many sections of this book are accompanied by problems and exercises, and these in turn often pose questions for users. In most cases, the answers to the listed questions are somewhat open-ended—there are no simple right answers. I believe that working through the provided questions will help both self-directed and instructor-guided learners; the questions can typically be answered at several levels and often have an underlying subtlety. Instructors who wish to use this text are encouraged to contact the author for assistance in structuring a curriculum involving it.
All readers are encouraged to consult the book's Web site to see possible answers provided by both the author and other readers; additional related questions will be added to the Web site over time, along with other resources.

The Python language itself is conservative. Almost every Python script written ten years ago for Python 1.0 will run fine in Python 2.3+. However, as versions improve, a certain number of new features have been added. The most significant changes have matched the version number changes—Python 2.0 introduced list comprehensions, augmented assignments, Unicode support, and a standard XML package. Many scripts written in the most natural and efficient manner using Python 2.0+ will not run without changes in earlier versions of Python.

The general target of this book will be users of Python 2.1+, but some 2.2+ specific features will be utilized in examples. Maybe half the examples in this book will run fine on Python 1.5.1+ (and slightly fewer on older versions), but examples will not necessarily indicate their requirement for Python 2.0+ (where it exists). On the other hand, new features introduced with Python 2.1 and above will only be utilized where they make a task significantly easier, or where the feature itself is being illustrated. In any case, examples requiring versions past Python 2.0 will usually indicate this explicitly.

In the case of modules and packages—whether in the standard library or third-party—we will explicitly indicate what Python version is required and, where relevant, which version added the module or package to the standard library. In some cases, it will be possible to use later standard library modules with earlier Python versions. In important cases, this possibility will be noted.
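
To make the Python 2.0 additions named above a little more concrete, here is a minimal sketch of three of them: list comprehensions, augmented assignment, and Unicode strings. It is not taken from the book. The sample data and every variable name (lines, word_counts, greek, and so on) are made up for illustration, and the code is written so that it should also run unchanged under a current Python 3 interpreter.

```python
# Illustrative only: the sample data and names below are assumptions,
# not examples from the book.
lines = ["alpha beta", "gamma", "delta epsilon zeta"]

# List comprehension (added in Python 2.0): number of words on each line.
word_counts = [len(line.split()) for line in lines]

# The same result written the pre-2.0 way, with an explicit loop.
word_counts_loop = []
for line in lines:
    word_counts_loop.append(len(line.split()))

# Augmented assignment (added in Python 2.0): "total += n" rather than
# "total = total + n".
total = 0
for n in word_counts:
    total += n

# Unicode support (added in Python 2.0): three Greek letters as a Unicode
# string. The u prefix is required in Python 2 and accepted again from
# Python 3.3 onward.
greek = u"\u03b1\u03b2\u03b3"

print(word_counts, total, len(greek))
```

The explicit loop is the kind of code that would still run on a 1.5-era interpreter, while the comprehension and the += form are the "most natural" 2.0+ spellings that would need changes to run there.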
