Understanding Compression
Data Compression for Modern Developers
Colt McAnlis and Aleks Haecky

Beijing Boston Farnham Sebastopol Tokyo

Understanding Compression
by Colt McAnlis and Aleks Haecky

Copyright © 2016 Colton McAnlis and Aleks Haecky. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Tim McGovern
Production Editor: Melanie Yarbrough
Copyeditor: Octal Publishing, Inc.
Proofreader: Jasmine Kwityn
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Melanie Yarbrough

July 2016: First Edition

Revision History for the First Edition
2016-07-11: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491961537 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Understanding Compression, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-96153-7
[LSI]

From CLM

To JAM and MLM: I swear to Zuul, that if you don’t eat your broccoli right now, I’m going to write a book. And in the dedication of that book, I’m going to call you out as being afraid of a piece of foliage that humans have been eating for thousands of generations. Then, 20 years from now, when you have kids of your own, I’m going to pull that book out, and show you what I wrote, and laugh in your face, because you’ll know how crazy you’re making me right now. #parenting

To KMKM: How about another decade, just for good measure?

From AH

To AHS and GHS: I hoped you’d learn to cook. Instead, you proved that humankind can survive on fresh apples and stale supermarket sushi.

Table of Contents

Foreword  xi

Preface  xv
    Chapter Synopsis  18

1. Let’s Not Be Boring  1
    The Five Buckets of Compression Algorithms  1
    Claude Shannon Is Infuriating!  2
    The Only Thing You Need to Know about Data Compression  3
    A World Built on Data Compression  4

2. Do Not Skip This Chapter  9
    Understanding Binary  9
    Base 10 System  9
    Binary Number System  10
    Information Theory  13
    An Excursion into Binary Search  14
    Entropy: The Minimum Bits Needed to Represent a Number  16
    Standard Number Lengths  17

3. Breaking Entropy  19
    Understanding Entropy  19
    What This Entropy Stuff Is Good For  21
    Understanding Probability  22
    Breaking Entropy  23
    Example 1: Delta Coding  24
    Example 2: Symbol Grouping  25
    Example 3: Permutations  26
    Information Theory Versus Data Compression  31

4. Variable-Length Codes  33
    Morse Code  33
    Probability, Entropy, and Codeword Size  36
    Variable-Length Codes  38
    Using VLCs  38
    Creating VLCs  42
    A Handful of Example VLCs  44
    Finding the Right Code for Your Data Set  51

5. Statistical Encoding  53
    Statistically Compressing to Entropy  53
    Huffman Coding  55
    Building a Huffman Tree  55
    Generating Codewords  57
    Encoding and Decoding  58
    Practical Implementations  58
    Arithmetic Coding  60
    Finding the Right Number  61
    Encoding  62
    Picking the Right Output Value  64
    Decoding  64
    Practical Implementations  69
    Asymmetric Numeral Systems  69
    Encoding and Decoding Using a Transform Table  70
    Creating the Reference Table  71
    Using ANS for Compression  74
    Decoding Example  75
    So Where Does the Compression Come From?  76
    Practical Compression: Which Statistical Algorithm Do I Choose?  77

6. Adaptive Statistical Encoding  79
    Locality Matters for Entropy  79
    Adaptive VLC Encoding  81
    Dynamically Building a VLC Table  81
    Literals  84
    Resets  87
    Knowing When to Reset  88
    Using This in Practice  89
    Adaptive Arithmetic Coding  89
    Adaptive Huffman Coding  90
    The Modern Choice  91

7. Dictionary Transforms  93
    A Basic Dictionary Transform  94
    Finding the Right “Words”  95
    The Lempel-Ziv Algorithm  98
    How LZ Works  99
    Encoding  104
    Decoding  105
    Compressing LZ output  106
    LZ Variants  107
    Collect Them All!  110

8. Contextual Data Transforms  111
    Run-Length Encoding  112
    Dealing with Short Runs  112
    Compressing  114
    Delta Coding  115
    XOR Delta Coding  118
    Frame of Reference Delta Coding  119
    Patched Frame of Reference Delta Coding  120
    Compressing Delta-Encoded Data  123
    Does It Work on Text?  123
    Move-to-Front Coding  123
    Avoiding Rogue Symbols  125
    Compressing MTF  126
    Burrows–Wheeler Transform  126
    Ordering Is Important!  128
    How BWT Works  128
    Inverse BWT  130
    Practical Implementations  132
    Compressing BWT  132

9. Data Modeling  135
    The Chains of Markov  136
    Markov and Compression  139
    Practical Implementations  145
    Prediction by Partial Matching  145
    The Search Trie  147
    Compressing a Symbol  149
    Choosing a Sensible N Value  150
    Dealing with Unknown Symbols  150
    Context Mixing  150
    Types of Models  151
    Types of Mixing  153
    The Next Big Thing?  154

10. Switching Gears  155
    Media-Specific Compression  155
    General-Purpose Compression  156
    Compression in Practice  157

11. Evaluating Compression  159
    Compression Usage Scenarios  159
    Compressed Offline, Decompressed On-Client  159
    Compressed On-Client, Decompressed In-Cloud  160
    Compressed In-Cloud, Decompressed On-Client  160
    Compressed On-Client, Decompressed On-Client  161
    Compression Need  161
    Compression Ratio  162
    Compression Performance  163
    Decompression Performance  164
    Ability to Decode-Stream  164
    Comparing Compressors  165

12. Compressing Image Data Types  167
    Understanding Quality Versus File Size  167
    What Reduces Image Quality?  169
    Measuring Image Quality  171
    Making This Work  173
    Image Dimensions Are Important  173
    Choosing the Correct Image Format  175
    PNG  175
    JPG  176
    GIF  177
    WebP  177
    And Now for Choosing...  177
    GPU Texture Formats  179
    Vector Formats  180
    Eyes on the Prize  182

13. Serialized Data  183
    Understanding Common Use Cases  184
    Dynamically Server-Built Data  184