Parallel Programming With Spark
Matei Zaharia, UC Berkeley
www.spark-project.org

What is Spark?
Fast and expressive cluster computing system compatible with Apache Hadoop
Improves efficiency through:
» General execution graphs
» In-memory storage
Up to 10× faster on disk, 100× in memory
Improves usability through:
» Rich APIs in Java, Scala, Python
» Interactive shell
2-5× less code

Project History
Spark started in 2009 and was open sourced in 2010
In use at Intel, Yahoo!, Adobe, Quantifind, Conviva, Ooyala, Bizo, and others
Entered the Apache Incubator in June

Open Source Community
1000+ meetup members
70+ contributors
20 companies contributing

This Talk
Introduction to Spark
Tour of Spark operations
Job execution
Standalone apps

Key Idea
Write programs in terms of transformations on distributed datasets
Concept: resilient distributed datasets (RDDs)
» Collections of objects spread across a cluster
» Built through parallel transformations (map, filter, etc.)
» Automatically rebuilt on failure
» Controllable persistence (e.g. caching in RAM)

Operations
Transformations (e.g. map, filter, groupBy)
» Lazy operations that build RDDs from other RDDs
Actions (e.g. count, collect, save)
» Return a result or write it to storage
(A small sketch contrasting the two appears at the end of this section.)

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns (a runnable version of this example also appears at the end of this section):

lines = spark.textFile("hdfs://...")                     # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()            # action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver ships tasks to workers; each worker reads one block of the file (Block 1-3), keeps its partition of messages in memory (Cache 1-3), and returns results to the driver.]

Result: full-text search of Wikipedia in 0.5 sec (vs 20 s for on-disk data); scaled to 1 TB of data in 5 sec (vs 180 s for on-disk data)

Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data
Ex:
msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])

[Diagram: HDFS file → filter (func = _.contains(...)) → filtered RDD → map (func = _.split(...)) → mapped RDD]

Behavior with Less RAM
[Chart: execution time (s) vs. percentage of the working set in cache, from "cache disabled" through 25%, 50%, and 75% to "fully cached"; execution time falls steadily as more of the working set fits in memory.]
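
Sketch: Log Mining as a Standalone Program
A minimal, runnable sketch of the log-mining example above, assuming a local PySpark installation. The application name "LogMining" and the input path "/path/to/logs" are placeholders added here for illustration (the slide reads from an elided HDFS path); the RDD logic itself follows the slide.

from pyspark import SparkContext

sc = SparkContext("local[*]", "LogMining")

# Base RDD: one element per line of the log file (placeholder path)
lines = sc.textFile("/path/to/logs")

# Transformed RDDs: built lazily, nothing is computed yet
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])

# Keep the messages in memory after the first computation
messages.cache()

# Actions: these trigger the computation; the second query reuses the cache
foo_count = messages.filter(lambda s: "foo" in s).count()
bar_count = messages.filter(lambda s: "bar" in s).count()
print(foo_count, bar_count)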
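
Sketch: Transformations vs. Actions
A small self-contained sketch of the Operations slide, again assuming a local PySpark installation; the app name "OpsDemo" and output path "/tmp/evens" are placeholders. Transformations return new RDDs lazily; actions trigger execution and return or store results.

from pyspark import SparkContext

sc = SparkContext("local[*]", "OpsDemo")

nums = sc.parallelize([1, 2, 3, 4, 5])        # distribute a local list as an RDD

squares = nums.map(lambda x: x * x)           # transformation: lazy, returns an RDD
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still nothing computed

print(evens.collect())              # action: runs map + filter, returns [4, 16]
print(squares.count())              # action: returns 5
evens.saveAsTextFile("/tmp/evens")  # action: writes the results to storage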