
Julia for Data Analysis PDF

474 Pages·2023·13.4 MB·English

Preview Julia for Data Analysis

Julia for Data Analysis
Bogumił Kamiński
Foreword by Viral Shah
MANNING

Julia packages discussed in the book, mapped to their domains of application

Data ingestion and sharing
• Arrow.jl: Reading and writing data in Apache Arrow format
• CodecBzip2.jl: Handling data compressed in Bzip2 format
• CSV.jl: Reading and writing CSV files
• Genie.jl: Web framework; can be used to create a web service
• HTTP.jl: HTTP client and server functionality
• JSON3.jl: Working with JSON data
• SQLite.jl: Working with SQLite databases
• ZipFile.jl: Handling data stored in ZIP files

Data manipulation
• CategoricalArrays.jl: Working with categorical data
• DataFrames.jl and DataFramesMeta.jl: Performing operations on data frames
• Impute.jl: Missing data imputation functionalities
• Missings.jl: Utilities for handling missing values

Data analysis
• Distributions.jl: Statistical distribution definitions and utilities
• FreqTables.jl: Creating frequency tables
• GLM.jl: Estimating generalized linear models
• Graphs.jl: Working with graphs
• Loess.jl: Estimating LOESS models
• Plots.jl: Data visualization
• ROCAnalysis.jl: Evaluating probabilistic binary classifiers
• Statistics.jl and StatsBase.jl: Basic statistical functionalities

Utilities
• BenchmarkTools.jl: Measuring code performance
• Conda.jl and PyCall.jl: Integrating with Python (scikit-learn is used in the book)
• InlineStrings.jl: Efficient storage of short strings
• PooledArrays.jl: Compression of arrays with few unique elements
• RCall.jl: Integrating with R
• ThreadsX.jl: Multithreading utilities

Julia for Data Analysis
BOGUMIŁ KAMIŃSKI
FOREWORD BY VIRAL SHAH
MANNING
SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: [email protected]

©2023 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The author and publisher have made every effort to ensure that the information in this book was correct at press time. The author and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Marina Michaels
Technical development editors: German Gonzalez-Morris
Review editor: Adriana Sabo
Production editor: Deirdre S. Hiam
Copy editor: Sharon Wilkey
Proofreader: Melody Dolab
Technical proofreader: Mike Haller
Typesetter and cover designer: Marija Tudor

ISBN 9781633439368
Printed in the United States of America

brief contents

1 ■ Introduction

PART 1 ESSENTIAL JULIA SKILLS
2 ■ Getting started with Julia
3 ■ Julia’s support for scaling projects
4 ■ Working with collections in Julia
5 ■ Advanced topics on handling collections
6 ■ Working with strings
7 ■ Handling time-series data and missing values

PART 2 TOOLBOX FOR DATA ANALYSIS
8 ■ First steps with data frames
9 ■ Getting data from a data frame
10 ■ Creating data frame objects
11 ■ Converting and grouping data frames
12 ■ Mutating and transforming data frames
13 ■ Advanced transformations of data frames
14 ■ Creating web services for sharing data analysis results

contents

foreword
preface
acknowledgments
about this book
about the author
about the cover illustration

1 Introduction
1.1 What is Julia and why is it useful?
1.2 Key features of Julia from a data scientist’s perspective
    • Julia is fast because it is a compiled language
    • Julia provides full support for interactive workflows
    • Julia programs are highly reusable and easy to compose together
    • Julia has a built-in state-of-the-art package manager
    • It is easy to integrate existing code with Julia
1.3 Usage scenarios of tools presented in the book
1.4 Julia’s drawbacks
1.5 What data analysis skills will you learn?
1.6 How can Julia be used for data analysis?

PART 1 ESSENTIAL JULIA SKILLS

2 Getting started with Julia
2.1 Representing values
2.2 Defining variables
2.3 Using the most important control-flow constructs
    • Computations depending on a Boolean condition
    • Loops
    • Compound expressions
    • A first approach to calculating the winsorized mean
2.4 Defining functions
    • Defining functions using the function keyword
    • Positional and keyword arguments of functions
    • Rules for passing arguments to functions
    • Short syntax for defining simple functions
    • Anonymous functions
    • Do blocks
    • Function-naming convention in Julia
    • A simplified definition of a function computing the winsorized mean
2.5 Understanding variable scoping rules

3 Julia’s support for scaling projects
3.1 Understanding Julia’s type system
    • A single function in Julia may have multiple methods
    • Types in Julia are arranged in a hierarchy
    • Finding all supertypes of a type
    • Finding all subtypes of a type
    • Union of types
    • Deciding what type restrictions to put in method signature
3.2 Using multiple dispatch in Julia
    • Rules for defining methods of a function
    • Method ambiguity problem
    • Improved implementation of winsorized mean
3.3 Working with packages and modules
    • What is a module in Julia?
    • How can packages be used in Julia?
    • Using StatsBase.jl to compute the winsorized mean
3.4 Using macros

4 Working with collections in Julia
4.1 Working with arrays
    • Getting the data into a matrix
    • Computing basic statistics of the data stored in a matrix
    • Indexing into arrays
    • Performance considerations of copying vs. making a view
    • Calculating correlations between variables
    • Fitting a linear regression
    • Plotting the Anscombe’s quartet data
4.2 Mapping key-value pairs with dictionaries
4.3 Structuring your data by using named tuples
    • Defining named tuples and accessing their contents
    • Analyzing Anscombe’s quartet data stored in a named tuple
    • Understanding composite types and mutability of values in Julia

5 Advanced topics on handling collections
5.1 Vectorizing your code using broadcasting
    • Understanding syntax and meaning of broadcasting in Julia
    • Expanding length-1 dimensions in broadcasting
    • Protecting collections from being broadcasted over
    • Analyzing Anscombe’s quartet data using broadcasting
5.2 Defining methods with parametric types
    • Most collection types in Julia are parametric
    • Rules for subtyping of parametric types
    • Using subtyping rules to define the covariance function
5.3 Integrating with Python
    • Preparing data for dimensionality reduction using t-SNE
    • Calling Python from Julia
    • Visualizing the results of the t-SNE algorithm

6 Working with strings
6.1 Getting and inspecting the data
    • Downloading files from the web
    • Using common techniques of string construction
    • Reading the contents of a file
6.2 Splitting strings
6.3 Using regular expressions to work with strings
    • Working with regular expressions
    • Writing a parser of a single line of movies.dat file
6.4 Extracting a subset from a string with indexing
    • UTF-8 encoding of strings in Julia
    • Character vs. byte indexing of strings
    • ASCII strings
    • The Char type
6.5 Analyzing genre frequency in movies.dat
    • Finding common movie genres
    • Understanding genre popularity evolution over the years
6.6 Introducing symbols
    • Creating symbols
    • Using symbols


