Text Processing with Ruby: Extract Value from the Data That Surrounds You PDF

263 Pages·2015·5.238 MB·English

Preview Text Processing with Ruby: Extract Value from the Data That Surrounds You

Early praise for Text Processing with Ruby

It is rare that a programming language can be unequivocally stated to be the right tool for a job. But when it comes to scanning, extracting, and transforming text, Ruby is that tool, and Rob Miller is the right guide to instruct you in the most effective and efficient application of it.
➤ Avdi Grimm, Author, Confident Ruby; Head Chef, RubyTapas.com

This is a fun, readable, and very useful book. I’d recommend it to anyone who needs to deal with text, which is probably everyone.
➤ Paul Battley, Developer, maintainer of text gem

While Ruby has become established as a Web development language, thanks to Rails, it’s an excellent language for working with text as well. Text Processing with Ruby covers the nuts and bolts of what I believe is a natural domain for Ruby, all the way from bringing text into the environment via files, the Web, and other means through to parsing what it says and sending it back out again.
➤ Peter Cooper, Editor of Ruby Weekly, Cooper Press

I’d recommend this book to anyone who wants to get started with text processing. Ruby has powerful tools and libraries for the whole ETL workflow, and this book describes everything you need to get started and succeed in learning.
➤ Hajba Gábor László, Developer

A lot of people get into Ruby via Rails. This book is really well suited to anyone who knows Rails, but wants to know more Ruby.
➤ Drew Neil, Director, Studio Nelstrom, and author of Practical Vim

Text Processing with Ruby
Extract Value from the Data That Surrounds You
Rob Miller
The Pragmatic Bookshelf
Dallas, Texas • Raleigh, North Carolina

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and The Pragmatic Programmers, LLC was aware of a trademark claim, the designations have been printed in initial capital letters or in all capitals. The Pragmatic Starter Kit, The Pragmatic Programmer, Pragmatic Programming, Pragmatic Bookshelf, PragProg and the linking g device are trademarks of The Pragmatic Programmers, LLC.

Every precaution was taken in the preparation of this book. However, the publisher assumes no responsibility for errors or omissions, or for damages that may result from the use of information (including program listings) contained herein.

Our Pragmatic courses, workshops, and other products can help you and your team create better software and have more fun. For more information, as well as the latest Pragmatic titles, please visit us at https://pragprog.com.

The team that produced this book includes:
Jacquelyn Carter (editor)
Potomac Indexing, LLC (index)
Cathleen Small; Liz Welch (copyedit)
Dave Thomas (layout)
Janet Furlow (producer)
Ellie Callahan (support)

For international rights, please contact [email protected].

Copyright © 2015 The Pragmatic Programmers, LLC. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior consent of the publisher.

Printed in the United States of America.
ISBN-13: 978-1-68050-070-7
Encoded using the finest acid-free high-entropy binary digits.
Book version: P1.0, September 2015

Contents

Acknowledgments
Introduction

Part I - Extract: Acquiring Text

1. Reading from Files
  Opening a File
  Reading from a File
  Treating Files as Streams
  Reading Fixed-Width Files
  Wrapping Up

2. Processing Standard Input
  Redirecting Input from Other Processes
  Example: Extracting URLs
  Concurrency and Buffering
  Wrapping Up

3. Shell One-Liners
  Arguments to the Ruby Interpreter
  Prepending and Appending Code
  Example: Parsing Log Files
  Wrapping Up

4. Flexible Filters with ARGF
  Reading from ARGF as a Stream
  Modifying Files
  Manipulating ARGV
  Wrapping Up

5. Delimited Data
  Parsing a TSV
  Delimited Data and the Command Line
  The CSV Format
  Wrapping Up

6. Scraping HTML
  The Right Tool for the Job: Nokogiri
  Searching the Document
  Working with Elements
  Exploring a Page
  Example: Reading a League Table
  Wrapping Up

7. Encodings
  A Brief Introduction to Character Encodings
  Ruby’s Support for Character Encodings
  Detecting Encodings
  Wrapping Up

Part II - Transform: Modifying and Manipulating Text

8. Regular Expressions Basics
  A Gentle Introduction
  Pattern Syntax
  Regular Expressions in Ruby
  Wrapping Up

9. Extraction and Substitution with Regular Expressions
  Matching Against Patterns
  Global Match Variables
  Extracting Multiple Matches
  Transforming Text
  Wrapping Up

10. Writing Parsers
  Simple Parsers with StringScanner
  Example: Parsing a Config File
  Rule-Based Parsers
  Example: Parsing RTF Files
  Wrapping Up

11. Natural Language Processing
  What Is Natural Language Processing?
  Example: Extracting Keywords from Articles
  Example: Fuzzy Searching
  Wrapping Up

Part III - Load: Writing Text

12. Standard Output and Standard Error
  Simple Output
  Formatting Output with printf
  Redirecting Standard Output
  Wrapping Up

13. Writing to Other Processes and to Files
  Writing to Other Processes
  Writing to Files
  Temporary Files
  Wrapping Up

14. Serialization and Structure: JSON, XML, CSV
  JSON
  XML
  CSV
  Wrapping Up

15. Templating Output with ERB
  Writing Templates
  Example: Generating a Purchase Ledger
  Evaluating Templates
  Passing Data to Templates
  Controlling Presentation with Decorators
  Wrapping Up

Part IV - Appendices

A1. A Shell Primer
  Running Commands
  Controlling Output
  Exit Statuses and Flow Control

A2. Useful Shell Commands

Index
