Table of Contents

Introduction 1.1
Overview of Apache Spark 1.2

Spark MLlib
Spark MLlib — Machine Learning in Spark 2.1
ML Pipelines (spark.ml) 2.2
Pipeline 2.2.1
PipelineStage 2.2.2
Transformers 2.2.3
Transformer 2.2.3.1
Tokenizer 2.2.3.2
Estimators 2.2.4
Estimator 2.2.4.1
StringIndexer 2.2.4.1.1
KMeans 2.2.4.1.2
TrainValidationSplit 2.2.4.1.3
Predictor 2.2.4.2
RandomForestRegressor 2.2.4.2.1
Regressor 2.2.4.3
LinearRegression 2.2.4.3.1
Classifier 2.2.4.4
RandomForestClassifier 2.2.4.4.1
DecisionTreeClassifier 2.2.4.4.2
Models 2.2.5
Model 2.2.5.1
Evaluator — ML Pipeline Component for Model Scoring 2.2.6
BinaryClassificationEvaluator — Evaluator of Binary Classification Models 2.2.6.1
ClusteringEvaluator — Evaluator of Clustering Models 2.2.6.2
MulticlassClassificationEvaluator — Evaluator of Multiclass Classification Models 2.2.6.3
RegressionEvaluator — Evaluator of Regression Models 2.2.6.4
CrossValidator — Model Tuning / Finding The Best Model 2.2.7
CrossValidatorModel 2.2.7.1
ParamGridBuilder 2.2.7.2
CrossValidator with Pipeline Example 2.2.7.3
Params and ParamMaps 2.2.8
ValidatorParams 2.2.8.1
HasParallelism 2.2.8.2
ML Persistence — Saving and Loading Models and Pipelines 2.3
MLWritable 2.3.1
MLReader 2.3.2
Example — Text Classification 2.4
Example — Linear Regression 2.5
Logistic Regression 2.6
LogisticRegression 2.6.1
Latent Dirichlet Allocation (LDA) 2.7
Vector 2.8
LabeledPoint 2.9
Streaming MLlib 2.10
GeneralizedLinearRegression 2.11
Alternating Least Squares (ALS) Matrix Factorization 2.12
ALS — Estimator for ALSModel 2.12.1
ALSModel — Model for Predictions 2.12.2
ALSModelReader 2.12.3
Instrumentation 2.13
MLUtils 2.14

Spark SQL
Spark SQL — Batch and Streaming Queries Over Structured Data on Massive Scale 3.1

Structured Streaming
Spark Structured Streaming — Streaming Datasets 4.1

Spark Core / Tools
Spark Shell — spark-shell shell script 5.1
Web UI — Spark Application’s Web Console 5.2
Jobs Tab 5.2.1
Stages Tab — Stages for All Jobs 5.2.2
Stages for All Jobs 5.2.2.1
Stage Details 5.2.2.2
Pool Details 5.2.2.3
Storage Tab 5.2.3
BlockStatusListener Spark Listener 5.2.3.1
Environment Tab 5.2.4
EnvironmentListener Spark Listener 5.2.4.1
Executors Tab 5.2.5
ExecutorsListener Spark Listener 5.2.5.1
JobProgressListener Spark Listener 5.2.6
StorageStatusListener Spark Listener 5.2.7
StorageListener — Spark Listener for Tracking Persistence Status of RDD Blocks 5.2.8
RDDOperationGraphListener Spark Listener 5.2.9
SparkUI 5.2.10
Spark Submit — spark-submit shell script 5.3
SparkSubmitArguments 5.3.1
SparkSubmitOptionParser — spark-submit’s Command-Line Parser 5.3.2
SparkSubmitCommandBuilder Command Builder 5.3.3
spark-class shell script 5.4
AbstractCommandBuilder 5.4.1
SparkLauncher — Launching Spark Applications Programmatically 5.5

Spark Core / Architecture
Spark Architecture 6.1
Driver 6.2
Executor 6.3
TaskRunner 6.3.1
ExecutorSource 6.3.2
Master 6.4
Workers 6.5

Spark Core / RDD
Anatomy of Spark Application 7.1
SparkConf — Programmable Configuration for Spark Applications 7.2
Spark Properties and spark-defaults.conf Properties File 7.2.1
Deploy Mode 7.2.2
SparkContext 7.3
HeartbeatReceiver RPC Endpoint 7.3.1
Inside Creating SparkContext 7.3.2
ConsoleProgressBar 7.3.3
SparkStatusTracker 7.3.4
Local Properties — Creating Logical Job Groups 7.3.5
RDD — Resilient Distributed Dataset 7.4
RDD Lineage — Logical Execution Plan 7.4.1
TaskLocation 7.4.2
ParallelCollectionRDD 7.4.3
MapPartitionsRDD 7.4.4
OrderedRDDFunctions 7.4.5
CoGroupedRDD 7.4.6
SubtractedRDD 7.4.7
HadoopRDD 7.4.8
NewHadoopRDD 7.4.9
ShuffledRDD 7.4.10
BlockRDD 7.4.11
Operators 7.5
Transformations 7.5.1
PairRDDFunctions 7.5.1.1
Actions 7.5.2
Caching and Persistence 7.6
StorageLevel 7.6.1
Partitions and Partitioning 7.7
Partition 7.7.1
Partitioner 7.7.2
HashPartitioner 7.7.2.1
Shuffling 7.8
Checkpointing 7.9
CheckpointRDD 7.9.1
RDD Dependencies 7.10
NarrowDependency — Narrow Dependencies 7.10.1
ShuffleDependency — Shuffle Dependencies 7.10.2
Map/Reduce-side Aggregator 7.11
AppStatusStore 7.12
AppStatusPlugin 7.13

Spark Core / Optimizations
Broadcast variables 8.1
Accumulators 8.2
AccumulatorContext 8.2.1

Spark Core / Services
SerializerManager 9.1
MemoryManager — Memory Management 9.2
UnifiedMemoryManager 9.2.1
SparkEnv — Spark Runtime Environment 9.3
DAGScheduler — Stage-Oriented Scheduler 9.4
Jobs 9.4.1
Stage — Physical Unit Of Execution 9.4.2
ShuffleMapStage — Intermediate Stage in Execution DAG 9.4.2.1
ResultStage — Final Stage in Job 9.4.2.2
StageInfo 9.4.2.3
DAGScheduler Event Bus 9.4.3
JobListener 9.4.4
JobWaiter 9.4.4.1
TaskScheduler — Spark Scheduler 9.5
Tasks 9.5.1
ShuffleMapTask — Task for ShuffleMapStage 9.5.1.1
ResultTask 9.5.1.2
TaskDescription 9.5.2
FetchFailedException 9.5.3
MapStatus — Shuffle Map Output Status 9.5.4
TaskSet — Set of Tasks for Stage 9.5.5
TaskSetManager 9.5.6
Schedulable 9.5.6.1
Schedulable Pool 9.5.6.2
Schedulable Builders 9.5.6.3
FIFOSchedulableBuilder 9.5.6.3.1
FairSchedulableBuilder 9.5.6.3.2
Scheduling Mode — spark.scheduler.mode Spark Property 9.5.6.4
TaskInfo 9.5.6.5
TaskSchedulerImpl — Default TaskScheduler 9.5.7
Speculative Execution of Tasks 9.5.7.1
TaskResultGetter 9.5.7.2
TaskContext 9.5.8
TaskContextImpl 9.5.8.1
TaskResults — DirectTaskResult and IndirectTaskResult 9.5.9
TaskMemoryManager 9.5.10
MemoryConsumer 9.5.10.1
TaskMetrics 9.5.11
ShuffleWriteMetrics 9.5.11.1
TaskSetBlacklist — Blacklisting Executors and Nodes For TaskSet 9.5.12
SchedulerBackend — Pluggable Scheduler Backends 9.6
CoarseGrainedSchedulerBackend 9.6.1
DriverEndpoint — CoarseGrainedSchedulerBackend RPC Endpoint 9.6.1.1
ExecutorBackend — Pluggable Executor Backends 9.7
CoarseGrainedExecutorBackend 9.7.1
MesosExecutorBackend 9.7.2
BlockManager — Key-Value Store for Blocks 9.8
MemoryStore 9.8.1
DiskStore 9.8.2
BlockDataManager 9.8.3
ShuffleClient 9.8.4
BlockTransferService — Pluggable Block Transfers 9.8.5
NettyBlockTransferService — Netty-Based BlockTransferService 9.8.5.1
NettyBlockRpcServer 9.8.5.2
BlockManagerMaster — BlockManager for Driver 9.8.6
BlockManagerMasterEndpoint — BlockManagerMaster RPC Endpoint 9.8.6.1
DiskBlockManager 9.8.7
BlockInfoManager 9.8.8
BlockInfo 9.8.8.1
BlockManagerSlaveEndpoint 9.8.9
DiskBlockObjectWriter 9.8.10
BlockManagerSource — Metrics Source for BlockManager 9.8.11
StorageStatus 9.8.12
MapOutputTracker — Shuffle Map Output Registry 9.9
MapOutputTrackerMaster — MapOutputTracker For Driver 9.9.1
MapOutputTrackerMasterEndpoint 9.9.1.1
MapOutputTrackerWorker — MapOutputTracker for Executors 9.9.2
ShuffleManager — Pluggable Shuffle Systems 9.10
SortShuffleManager — The Default Shuffle System 9.10.1
ExternalShuffleService 9.10.2
OneForOneStreamManager 9.10.3
ShuffleBlockResolver 9.10.4
IndexShuffleBlockResolver 9.10.4.1
ShuffleWriter 9.10.5
BypassMergeSortShuffleWriter 9.10.5.1
SortShuffleWriter 9.10.5.2
UnsafeShuffleWriter — ShuffleWriter for SerializedShuffleHandle 9.10.5.3
BaseShuffleHandle — Fallback Shuffle Handle 9.10.6
BypassMergeSortShuffleHandle — Marker Interface for Bypass Merge Sort Shuffle Handles 9.10.7
SerializedShuffleHandle — Marker Interface for Serialized Shuffle Handles 9.10.8
ShuffleReader 9.10.9
BlockStoreShuffleReader 9.10.9.1
ShuffleBlockFetcherIterator 9.10.10
ShuffleExternalSorter — Cache-Efficient Sorter 9.10.11
ExternalSorter 9.10.12
Serialization 9.11
Serializer — Task SerDe 9.11.1
SerializerInstance 9.11.2
SerializationStream 9.11.3
DeserializationStream 9.11.4
ExternalClusterManager — Pluggable Cluster Managers 9.12
BroadcastManager 9.13
BroadcastFactory — Pluggable Broadcast Variable Factories 9.13.1
TorrentBroadcastFactory 9.13.1.1
TorrentBroadcast 9.13.1.2
CompressionCodec 9.13.2
ContextCleaner — Spark Application Garbage Collector 9.14
CleanerListener 9.14.1
Dynamic Allocation (of Executors) 9.15
ExecutorAllocationManager — Allocation Manager for Spark Core 9.15.1
ExecutorAllocationClient 9.15.2
ExecutorAllocationListener 9.15.3
ExecutorAllocationManagerSource 9.15.4
HTTP File Server 9.16
Data Locality 9.17
Cache Manager 9.18
OutputCommitCoordinator 9.19
RpcEnv — RPC Environment 9.20
RpcEndpoint 9.20.1
RpcEndpointRef 9.20.2
RpcEnvFactory 9.20.3
Netty-based RpcEnv 9.20.4
TransportConf — Transport Configuration 9.21

Spark Core / Security
Securing Web UI 10.1

Spark Deployment Environments
Deployment Environments — Run Modes 11.1
Spark local (pseudo-cluster) 11.2
LocalSchedulerBackend 11.2.1
LocalEndpoint 11.2.2
Spark on cluster 11.3

Spark on YARN
Spark on YARN 12.1
YarnShuffleService — ExternalShuffleService on YARN 12.2
ExecutorRunnable 12.3
Client 12.4
YarnRMClient 12.5
ApplicationMaster 12.6
AMEndpoint — ApplicationMaster RPC Endpoint 12.6.1
YarnClusterManager — ExternalClusterManager for YARN 12.7
TaskSchedulers for YARN 12.8
YarnScheduler 12.8.1
YarnClusterScheduler 12.8.2
SchedulerBackends for YARN 12.9
YarnSchedulerBackend 12.9.1
YarnClientSchedulerBackend 12.9.2
YarnClusterSchedulerBackend 12.9.3
YarnSchedulerEndpoint RPC Endpoint 12.9.4
YarnAllocator 12.10
Introduction to Hadoop YARN 12.11
Setting up YARN Cluster 12.12
Kerberos 12.13
ConfigurableCredentialManager 12.13.1
ClientDistributedCacheManager 12.14
YarnSparkHadoopUtil 12.15
Settings 12.16

Spark Standalone
Spark Standalone 13.1
Standalone Master 13.2
Standalone Worker 13.3
web UI 13.4
Submission Gateways 13.5
Management Scripts for Standalone Master 13.6
Management Scripts for Standalone Workers 13.7
Checking Status 13.8
Example 2-workers-on-1-node Standalone Cluster (one executor per worker) 13.9
StandaloneSchedulerBackend 13.10

Spark on Mesos
Spark on Mesos 14.1
MesosCoarseGrainedSchedulerBackend 14.2
About Mesos 14.3

Execution Model
Execution Model 15.1

Monitoring, Tuning and Debugging
Description: