D C ATA LEANING LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY By purchasing or using this book and its companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, including the companion files, but does not give you the right of ownership to any of the textual content in the book / files or ownership to any of the infor- mation or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, sim- ulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work. MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, or production of the companion files, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not war- rant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to insure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or pro- grams. The Work is sold “as is” without warranty (except for defective materi- als used in manufacturing the book or due to faulty workmanship). The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and/or companion files, and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product. The companion files are available for downloading by writing to the publisher at [email protected]. D C ATA LEANING Pocket Primer Oswald Campesato MERCURY LEARNING AND INFORMATION Dulles, Virginia Boston, Massachusetts New Delhi Copyright © 2018 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved. This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher. Publisher: David Pallai MERCURY LEARNING AND INFORMATION 22841 Quicksilver Drive Dulles, VA 20166 [email protected] www.merclearning.com 800-232-0223 O. Campesato. Data Cleaning Pocket Primer. ISBN: 978-1-68392-217-9 The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others. Library of Congress Control Number: 2017964301 181920321 Printed on acid-free paper in the United States of America Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223(toll free). All of our titles are available in digital format at authorcloudware.com and other digital vendors. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product. I’d like to dedicate this book to my parents – may this bring joy and happiness into their lives. C ONTENTS Preface: Data Cleaning Pocket Primer xiii What Is the Goal? xiii Is This Book is for Me and What Will I Learn? xiii How Were the Code Samples Created? xiv What You Need to Know for This Book xiv Which bash Commands are Excluded? xiv How Do I Set Up a Command Shell? xv What Are the “Next Steps” after Finishing This Book? xv About the Technical Reviewer xvii Chapter 1: Introduction 1 What Is Unix? 2 Available Shell Types 2 What Is bash? 3 Getting Help for bash Commands 4 Navigating Around Directories 4 The history Command 5 Listing Filenames with the ls Command 6 Displaying Contents of Files 7 The cat Command 8 The head and tail Commands 9 The Pipe Symbol 10 The fold Command 11 File Ownership: Owner, Group, and World 12 Hidden Files 12 Handling Problematic Filenames 13 Working with Environment Variables 14 The env Command 14 Useful Environment Variables 14 viii • CONTENTS Setting the PATH Environment Variable 15 Specifying Aliases and Environment Variables 15 Finding Executable Files 16 What Are Shell Scripts? 17 A Simple Shell Script 18 Using a Semicolon to Separate Commands 19 The printf Command and the echo Command 20 The echo Command and Whitespaces 20 Command Substitution (“back tick”) 21 Setting Environment Variables via Shell Scripts 22 Sourcing or “Dotting” a Shell Script 23 Working with Arrays 24 Working with Nested Loops 27 The paste Command 29 Inserting Blank Lines with the paste Command 29 The cut Command 30 Working with Metacharacters 31 Working with Character Classes 32 The “pipe” Symbol and Multiple Commands 33 A Simple Use Case 34 Another Simple Use Case 35 Summary 37 Chapter 2: Useful Commands 39 The join Command 40 The fold Command 41 The split Command 41 The sort Command 42 The uniq Command 44 How to Compare Files 45 The od Command 45 The tr Command 46 A Simple Use Case 50 The Command 51 find The tee Command 52 File Compression Commands 52 The tar command 52 The cpio Command 53 The gzip and gunzip Commands 54 The bunzip2 Command 54 The zip Command 55 Commands for zip Files and bz Files 55 CONTENTS • ix Internal Field Separator (IFS) 56 Data from a Range of Columns in a Dataset 56 Working with Uneven Rows in Datasets 58 Working with Functions in Shell Scripts 59 Recursion and Shell Scripts 61 Iterative Solutions for Factorial Values 62 Summary 64 Chapter 3: Filtering Data with grep 65 What Is the grep Command? 66 Metacharacters and the grep Command 67 Escaping Metacharacters with the grep Command 68 Useful Options for the grep Command 69 Character Classes and the grep Command 73 Working with the –c Option in grep 74 Matching a Range of Lines 75 Using Back References in the grep Command 77 Finding Empty Lines in Datasets 79 Using Keys to Search Datasets 80 The Backslash Character and the grep Command 81 Multiple Matches in the grep Command 81 The grep Command and the xargs Command 81 Searching zip Files for a String 83 Checking for a Unique Key Value 84 Redirecting Error Messages 85 The egrep Command and the fgrep Command 85 Displaying “Pure” Words in a Dataset with egrep 86 The fgrep Command 88 A Simple Use Case 88 Summary 90 Chapter 4: Transforming Data with sed 91 What Is the sed Command? 91 The sed Execution Cycle 92 Matching String Patterns Using sed 92 Substituting String Patterns Using sed 93 Replacing Vowels from a String or a File 95 Deleting Multiple Digits and Letters from a String 96 Search and Replace with sed 96 Datasets with Multiple Delimiters 99 Useful Switches in sed 99