1. A Gentle Introduction to Spark
    1. What is Apache Spark?
    2. Spark’s Basic Architecture
        1. Spark Applications
    3. Using Spark from Scala, Java, SQL, Python, or R
        1. Key Concepts
    4. Starting Spark
    5. SparkSession
    6. DataFrames
        1. Partitions
    7. Transformations
        1. Lazy Evaluation
    8. Actions
    9. Spark UI
    10. A Basic Transformation Data Flow
    11. DataFrames and SQL
2. Structured API Overview
    1. Spark’s Structured APIs
    2. DataFrames and Datasets
    3. Schemas
    4. Overview of Structured Spark Types
        1. Columns
        2. Rows
        3. Spark Value Types
        4. Encoders
    5. Overview of Spark Execution
        1. Logical Planning
        2. Physical Planning
        3. Execution
3. Basic Structured Operations
    1. Chapter Overview
    2. Schemas
    3. Columns and Expressions
        1. Columns
        2. Expressions
    4. Records and Rows
        1. Creating Rows
    5. DataFrame Transformations
        1. Creating DataFrames
        2. Select & SelectExpr
        3. Converting to Spark Types (Literals)
        4. Adding Columns
        5. Renaming Columns
        6. Reserved Characters and Keywords in Column Names
        7. Removing Columns
        8. Changing a Column’s Type (cast)
        9. Filtering Rows
        10. Getting Unique Rows
        11. Random Samples
        12. Random Splits
        13. Concatenating and Appending Rows to a DataFrame
        14. Sorting Rows
        15. Limit
        16. Repartition and Coalesce
        17. Collecting Rows to the Driver
4. Working with Different Types of Data
    1. Chapter Overview
        1. Where to Look for APIs
    2. Working with Booleans
    3. Working with Numbers
    4. Working with Strings
        1. Regular Expressions
    5. Working with Dates and Timestamps
    6. Working with Nulls in Data
        1. Drop
        2. Fill
        3. Replace
    7. Working with Complex Types
        1. Structs
        2. Arrays
        3. split
        4. Array Contains
        5. Explode
        6. Maps
    8. Working with JSON
    9. User-Defined Functions
5. Aggregations
    1. What are aggregations?
    2. Aggregation Functions
        1. count
        2. Count Distinct
        3. Approximate Count Distinct
        4. First and Last
        5. Min and Max
        6. Sum
        7. sumDistinct
        8. Average
        9. Variance and Standard Deviation
        10. Skewness and Kurtosis
        11. Covariance and Correlation
        12. Aggregating to Complex Types
    3. Grouping
        1. Grouping with Expressions
        2. Grouping with Maps
    4. Window Functions
        1. Rollups
        2. Cube
        3. Pivot
    5. User-Defined Aggregation Functions
6. Joins
    1. What is a join?
        1. Join Expressions
        2. Join Types
    2. Inner Joins
    3. Outer Joins
    4. Left Outer Joins
    5. Right Outer Joins
    6. Left Semi Joins
    7. Left Anti Joins
    8. Cross (Cartesian) Joins
    9. Challenges with Joins
        1. Joins on Complex Types
        2. Handling Duplicate Column Names
    10. How Spark Performs Joins
        1. Node-to-Node Communication Strategies
7. Data Sources
    1. The Data Source APIs
        1. Basics of Reading Data
        2. Basics of Writing Data
        3. Options
    2. CSV Files
        1. CSV Options
        2. Reading CSV Files
        3. Writing CSV Files
    3. JSON Files
        1. JSON Options
        2. Reading JSON Files
        3. Writing JSON Files
    4. Parquet Files
        1. Reading Parquet Files
        2. Writing Parquet Files
    5. ORC Files
        1. Reading ORC Files
        2. Writing ORC Files
    6. SQL Databases
        1. Reading from SQL Databases
        2. Query Pushdown
        3. Writing to SQL Databases
    7. Text Files
        1. Reading Text Files
        2. Writing Out Text Files
    8. Advanced IO Concepts
        1. Reading Data in Parallel
        2. Writing Data in Parallel
        3. Writing Complex Types
8. Spark SQL
    1. Spark SQL Concepts
        1. What is SQL?
        2. Big Data and SQL: Hive
        3. Big Data and SQL: Spark SQL
    2. How to Run Spark SQL Queries
        1. Spark SQL Thrift JDBC/ODBC Server
        2. Spark SQL CLI
        3. Spark’s Programmatic SQL Interface
    3. Tables
        1. Creating Tables
        2. Inserting Into Tables
        3. Describing Table Metadata
        4. Refreshing Table Metadata
        5. Dropping Tables
    4. Views
        1. Creating Views
        2. Dropping Views
    5. Databases
        1. Creating Databases
        2. Setting the Database
        3. Dropping Databases
    6. Select Statements
        1. Case When Then Statements
    7. Advanced Topics
        1. Complex Types
        2. Functions
        3. Spark Managed Tables
        4. Subqueries
        5. Correlated Predicate Subqueries
    8. Conclusion
9. Datasets
    1. What are Datasets?
        1. Encoders
    2. Creating Datasets
        1. Case Classes
    3. Actions
    4. Transformations
        1. Filtering
        2. Mapping
    5. Joins
    6. Grouping and Aggregations
        1. When to use Datasets
10. Low Level API Overview
    1. The Low Level APIs
        1. When to use the low level APIs?
    2. The SparkConf
    3. The SparkContext
    4. Resilient Distributed Datasets
    5. Broadcast Variables
    6. Accumulators
11. Basic RDD Operations
    1. RDD Overview
        1. Python vs Scala/Java
    2. Creating RDDs
        1. From a Collection
        2. From Data Sources
    3. Manipulating RDDs
    4. Transformations
        1. Distinct
        2. Filter
        3. Map
        4. Sorting
        5. Random Splits
    5. Actions
        1. Reduce
        2. Count
        3. First
        4. Max and Min
        5. Take
    6. Saving Files
        1. saveAsTextFile
        2. SequenceFiles
        3. Hadoop Files
    7. Caching
    8. Interoperating between DataFrames, Datasets, and RDDs
    9. When to use RDDs?
        1. Performance Considerations: Scala vs Python
        2. RDD of Case Class vs Dataset
12. Advanced RDD Operations
    1. Advanced “Single RDD” Operations
        1. Pipe RDDs to System Commands
        2. mapPartitions
        3. foreachPartition
        4. glom
    2. Key-Value Basics (Key-Value RDDs)
        1. keyBy
        2. Mapping over Values
        3. Extracting Keys and Values
        4. Lookup
    3. Aggregations
        1. countByKey
        2. Understanding Aggregation Implementations
        3. aggregate
        4. aggregateByKey
        5. combineByKey
        6. foldByKey
        7. sampleByKey
    4. CoGroups
    5. Joins
        1. Inner Join
        2. zips
    6. Controlling Partitions
        1. coalesce
    7. repartitionAndSortWithinPartitions
        1. Custom Partitioning
    8. Serialization
13. Distributed Variables
    1. Chapter Overview
    2. Broadcast Variables
    3. Accumulators
        1. Basic Example
        2. Custom Accumulators
14. Advanced Analytics and Machine Learning
    1. The Advanced Analytics Workflow
    2. Different Advanced Analytics Tasks
        1. Supervised Learning
        2. Recommendation
        3. Unsupervised Learning
        4. Graph Analysis
    3. Spark’s Packages for Advanced Analytics
        1. What is MLlib?
    4. High Level MLlib Concepts
    5. MLlib in Action
        1. Transformers
        2. Estimators
        3. Pipelining our Workflow
        4. Evaluators
        5. Persisting and Applying Models
    6. Deployment Patterns
15. Preprocessing and Feature Engineering
    1. Formatting your models according to your use case
    2. Properties of Transformers
    3. Different Transformer Types
    4. High Level Transformers
        1. RFormula
        2. SQLTransformers
        3. VectorAssembler
    5. Text Data Transformers
        1. Tokenizing Text
        2. Removing Common Words
        3. Creating Word Combinations
        4. Converting Words into Numbers
    6. Working with Continuous Features
        1. Bucketing
        2. Scaling and Normalization
        3. StandardScaler
    7. Working with Categorical Features
        1. StringIndexer
        2. Converting Indexed Values Back to Text
        3. Indexing in Vectors
        4. One Hot Encoding
    8. Feature Generation
Description: