Conquering Big Data with Apache Spark
Ion Stoica
November 1st, 2015
UC Berkeley

The Berkeley AMPLab
Algorithms, Machines, People
January 2011 – 2017
• 8 faculty
• > 50 students
• 3 software engineer team
Organized for collaboration
• AMPCamp (since 2012): 400+ campers (100s companies)
• 3 day retreats (twice a year)

The Berkeley AMPLab
Governmental and industrial funding
Goal: Next generation of open source data analytics stack for industry & academia: Berkeley Data Analytics Stack (BDAS)

Generic Big Data Stack
• Processing Layer
• Resource Management Layer
• Storage Layer

Hadoop Stack
• Processing Layer: Hive, Pig, Giraph, Impala, Storm, …, HadoopMR
• Resource Management Layer: Yarn
• Storage Layer: HDFS

BDAS Stack
• Processing Layer: SampleClean, BlinkDB, SparkR, GraphX, MLBase, Spark Streaming, Velox, SparkSQL, MLlib, Spark Core
• Resource Management Layer: Mesos, Hadoop Yarn
• Storage Layer: Succinct, Tachyon, HDFS, S3, Ceph, …
Legend: BDAS Stack, 3rd party

Today’s Talk
[BDAS Stack diagram, repeated]

Overview
1. Introduction
2. RDDs
3. Generality of RDDs (e.g. streaming)
4. DataFrames
5. Project Tungsten

A Short History
• Started at UC Berkeley in 2009
• Open Source: 2010
• Apache Project: 2013
• Today: most popular big data project