
Mastering Apache Spark PDF

541 pages · 2016 · 10.01 MB · English

Preview Mastering Apache Spark

Mastering Apache Spark

Table of Contents

0. Introduction
1. Overview of Spark
2. Anatomy of Spark Application
   2.1. SparkConf - Configuration for Spark Applications
   2.2. SparkContext - the door to Spark
   2.3. RDD - Resilient Distributed Dataset
        2.3.1. Operators - Transformations and Actions
               2.3.1.1. mapPartitions
        2.3.2. Partitions
        2.3.3. Caching and Persistence
        2.3.4. Shuffling
        2.3.5. Checkpointing
        2.3.6. Dependencies
        2.3.7. Types of RDDs
               2.3.7.1. ParallelCollectionRDD
               2.3.7.2. MapPartitionsRDD
               2.3.7.3. CoGroupedRDD
               2.3.7.4. HadoopRDD
               2.3.7.5. ShuffledRDD
               2.3.7.6. BlockRDD
3. Spark Tools
   3.1. Spark Shell
   3.2. WebUI - UI for Spark Monitoring
        3.2.1. Executors Tab
   3.3. spark-submit
   3.4. spark-class
4. Spark Architecture
   4.1. Driver
   4.2. Master
   4.3. Workers
   4.4. Executors
5. Spark Runtime Environment
   5.1. DAGScheduler
        5.1.1. Jobs
        5.1.2. Stages
   5.2. Task Scheduler
        5.2.1. Tasks
        5.2.2. TaskSets
        5.2.3. TaskSetManager
        5.2.4. TaskSchedulerImpl - Default TaskScheduler
   5.3. Scheduler Backend
        5.3.1. CoarseGrainedSchedulerBackend
   5.4. Executor Backend
        5.4.1. CoarseGrainedExecutorBackend
   5.5. Shuffle Manager
   5.6. Block Manager
   5.7. HTTP File Server
   5.8. Broadcast Manager
   5.9. Dynamic Allocation
   5.10. Data Locality
   5.11. Cache Manager
   5.12. Spark, Akka and Netty
   5.13. OutputCommitCoordinator
   5.14. RPC Environment (RpcEnv)
         5.14.1. Netty-based RpcEnv
   5.15. ContextCleaner
   5.16. MapOutputTracker
   5.17. ExecutorAllocationManager
6. Deployment Environments
   6.1. Spark local
   6.2. Spark on cluster
        6.2.1. Spark Standalone
               6.2.1.1. Master
               6.2.1.2. web UI
               6.2.1.3. Management Scripts for Standalone Master
               6.2.1.4. Management Scripts for Standalone Workers
               6.2.1.5. Checking Status
               6.2.1.6. Example 2-workers-on-1-node Standalone Cluster (one executor per worker)
        6.2.2. Spark on Mesos
        6.2.3. Spark on YARN
7. Execution Model
8. Advanced Concepts of Spark
   8.1. Broadcast variables
   8.2. Accumulators
9. Security
   9.1. Spark Security
   9.2. Securing Web UI
10. Data Sources in Spark
    10.1. Using Input and Output (I/O)
          10.1.1. Spark and Parquet
          10.1.2. Serialization
    10.2. Using Apache Cassandra
    10.3. Using Apache Kafka
11. Spark Application Frameworks
    11.1. Spark Streaming
          11.1.1. StreamingContext
          11.1.2. Stream Operators
                  11.1.2.1. Windowed Operators
                  11.1.2.2. SaveAs Operators
                  11.1.2.3. Stateful Operators
          11.1.3. web UI and Streaming Statistics Page
          11.1.4. Streaming Listeners
          11.1.5. Checkpointing
          11.1.6. JobScheduler
          11.1.7. JobGenerator
          11.1.8. DStreamGraph
          11.1.9. Discretized Streams (DStreams)
                  11.1.9.1. Input DStreams
                  11.1.9.2. ReceiverInputDStreams
                  11.1.9.3. ConstantInputDStreams
                  11.1.9.4. ForEachDStreams
                  11.1.9.5. WindowedDStreams
                  11.1.9.6. MapWithStateDStreams
                  11.1.9.7. StateDStreams
                  11.1.9.8. TransformedDStream
          11.1.10. Receivers
                   11.1.10.1. ReceiverTracker
                   11.1.10.2. ReceiverSupervisors
                   11.1.10.3. ReceivedBlockHandlers
          11.1.11. Ingesting Data from Kafka
                   11.1.11.1. KafkaRDD
          11.1.12. RecurringTimer
          11.1.13. Streaming DataFrames
          11.1.14. Backpressure
          11.1.15. Dynamic Allocation (Elastic Scaling)
          11.1.16. Settings
    11.2. Spark SQL
          11.2.1. SQLContext
          11.2.2. Dataset
          11.2.3. DataFrame
          11.2.4. DataFrameReaders
          11.2.5. ContinuousQueryManager
          11.2.6. Aggregation (GroupedData)
          11.2.7. Windows in DataFrames
          11.2.8. Catalyst optimizer
          11.2.9. Into the depths
          11.2.10. Datasets vs RDDs
          11.2.11. Settings
    11.3. Spark MLlib - Machine Learning in Spark
          11.3.1. ML Pipelines
    11.4. Distributed graph computations with GraphX
12. Monitoring, Tuning and Debugging
    12.1. Logging
    12.2. Performance Tuning
    12.3. Spark Metrics System
    12.4. Scheduler Listeners
    12.5. Debugging Spark using sbt
13. Varia
    13.1. Building Spark
    13.2. Spark and Hadoop
    13.3. Spark and software in-memory file systems
    13.4. Spark and The Others
    13.5. Distributed Deep Learning on Spark
    13.6. Spark Packages
14. Spark Tips and Tricks
    14.1. Access private members in Scala in Spark shell
    14.2. SparkException: Task not serializable
15. Exercises
    15.1. One-liners using PairRDDFunctions
    15.2. Learning Jobs and Partitions Using take Action
    15.3. Spark Standalone - Using ZooKeeper for High-Availability of Master
    15.4. Spark's Hello World using Spark shell and Scala
    15.5. WordCount using Spark shell
    15.6. Your first complete Spark application (using Scala and sbt)
    15.7. Spark (notable) use cases
    15.8. Using Spark SQL to update data in Hive using ORC files
    15.9. Developing Custom SparkListener to monitor DAGScheduler in Scala
    15.10. Developing RPC Environment
    15.11. Developing Custom RDD
16. Further Learning
    16.1. Courses
    16.2. Books
17. Commercial Products using Apache Spark
    17.1. IBM Analytics for Apache Spark
    17.2. Google Cloud Dataproc
18. Spark Advanced Workshop
    18.1. Requirements
    18.2. Day 1
    18.3. Day 2
19. Spark Talks Ideas (STI)
    19.1. 10 Lesser-Known Tidbits about Spark Standalone
    19.2. Learning Spark internals using groupBy (to cause shuffle)
Glossary

Mastering Apache Spark

Welcome to Mastering Apache Spark!

I'm Jacek Laskowski, an independent consultant who offers development and training services for Apache Spark (and Scala, sbt, Akka Actors/Stream/HTTP, with a bit of Apache Kafka, Apache Mesos, RxScala and Docker). I run the Warsaw Scala Enthusiasts and Warsaw Spark meetups. Contact me at [email protected] to discuss Spark opportunities, e.g. courses, workshops, or other mentoring or development services.

This collection of notes (what some may rashly call a "book") serves as my ultimate place to collect all the nuts and bolts of using Apache Spark. The notes aim to help me design and develop better products with Spark, and they also serve as viable proof of my understanding of Apache Spark, in which I do eventually want to reach the highest level of mastery. It may become a book one day, but it certainly serves as study material for trainings, workshops, videos and courses about Apache Spark. Follow me on Twitter as @jaceklaskowski to hear about it early; you will also learn about upcoming Apache Spark events.

Expect text and code snippets from Spark's mailing lists, the official documentation of Apache Spark, StackOverflow, blog posts, books from O'Reilly, press releases, YouTube/Vimeo videos, Quora, the source code of Apache Spark, etc. Attribution follows.

Overview of Spark

When you hear "Apache Spark", it can mean one of two things: the Spark engine, aka Spark Core, or the Spark project, an "umbrella" term for Spark Core and the accompanying Spark Application Frameworks, i.e. Spark SQL, Spark Streaming, Spark MLlib and Spark GraphX, which sit on top of Spark Core and its main data abstraction, the RDD (Resilient Distributed Dataset).

[Figure 1. The Spark Platform]

It is pretty much like Hadoop in that the name can mean different things to different people, and Spark has been, and still is, heavily influenced by Hadoop.

Why Spark

Let's list a few of the many reasons for Spark. We do that first, and then comes the overview that lends a more technical helping hand.
Diverse Workloads

As said by Matei Zaharia, the author of Apache Spark, in the Introduction to AmpLab Spark Internals video (quoted with a few changes):

One of the Spark project goals was to deliver a platform that supports a very wide array of diverse workloads - not only MapReduce batch jobs (which were already available in Hadoop at that time), but also iterative computations like graph algorithms or Machine Learning, and also different scales of workloads, from sub-second interactive jobs to jobs that run for many hours.

Spark also supports near real-time streaming workloads via the Spark Streaming application framework.

ETL workloads and analytics workloads are different, yet Spark attempts to offer a unified platform for a wide variety of workloads. Graph and Machine Learning algorithms are iterative by nature, and fewer disk writes and network transfers mean better performance. There is also support for interactive workloads using the Spark shell.

You should watch the video What is Apache Spark? by Mike Olson, Chief Strategy Officer and Co-Founder at Cloudera, who provides an exceptional overview of Apache Spark, its rise in popularity in the open source community, and how Spark is primed to replace MapReduce as the general processing engine in Hadoop.

Leverages the Best in Distributed Batch Data Processing

When you think about distributed batch data processing, Hadoop naturally comes to mind as a viable solution. Spark draws many ideas from Hadoop MapReduce. They work together well - Spark on YARN and HDFS - while improving on the performance and simplicity of the distributed computing engine. For many, Spark is Hadoop++, i.e. MapReduce done in a better way. It should not come as a surprise: without Hadoop MapReduce (its advances and deficiencies), Spark would not have been born at all.

RDD - Distributed Parallel Scala Collections

As a Scala developer, you may find Spark's RDD API very similar (if not identical) to Scala's Collections API. It is also exposed in Java, Python and R (as well as SQL, i.e. SparkSQL, in a sense).
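To make that parallel concrete, here is a minimal sketch, assuming it runs inside spark-shell (which pre-creates a SparkContext available as sc); the sample data and variable names are illustrative only, not taken from the book.

```scala
// Run inside spark-shell, which provides a SparkContext as `sc`.
// The sample data below is made up for illustration.

// A plain Scala collection with a familiar Collections-API pipeline:
val words = Seq("spark", "hadoop", "scala", "sbt")
val local = words.map(_.toUpperCase).filter(_.startsWith("S"))
// local: Seq[String] = List(SPARK, SCALA, SBT)

// The same pipeline over an RDD: identical operator names, but the data
// is partitioned across executors and nothing runs until an action.
val wordsRDD = sc.parallelize(words)
val distributed = wordsRDD.map(_.toUpperCase).filter(_.startsWith("S"))
distributed.collect()  // Array(SPARK, SCALA, SBT)
```

The key behavioral difference behind the identical-looking API is laziness: map and filter on an RDD merely record the transformation, and only an action such as collect or count triggers a distributed job.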


