http://training.databricks.com/workshop/datasci.pdf
Data Science Training
Spark
“I want to die on Mars but not on impact” (Elon Musk, interview with Chris Anderson)
ADVANCED: DATA SCIENCE WITH APACHE SPARK
Data Science applications with Apache Spark combine the scalability of Spark with
distributed machine learning algorithms.
This material expands on the “Intro to Apache Spark” workshop. Lessons focus on
industry use cases for machine learning at scale, coding examples based on public
data sets, and leveraging cloud-based notebooks within a team context. Includes
limited free accounts on Databricks Cloud.
Prerequisites:
• Intro to Apache Spark workshop or equivalent (e.g., Spark Developer Certificate)
• Experience coding in Scala, Python, SQL
• Have some familiarity with Data Science topics (e.g., business use cases)
Topics covered include:
• Data transformation techniques based on both Spark SQL and functional
programming in Scala and Python.
• Predictive analytics based on MLlib, clustering with KMeans, building classifiers
with a variety of algorithms and text analytics, all with emphasis on an
iterative cycle of feature engineering, modeling, evaluation.
• Visualization techniques (matplotlib, ggplot2, D3, etc.) to surface insights.
• Understand how the primitives like Matrix Factorization are implemented in a
distributed parallel framework from the designers of MLlib.
• Several hands-on exercises using datasets such as Movielens, Titanic, State Of
the Union speeches, and RecSys Challenge 2015.
Agenda
• Detailed agenda in Google doc
• https://docs.google.com/document/d/1T9AkXUmL6gDYTpAEEgqsy9hfJtqGlavjnGzMqzUwDfE/edit
Goals
• Patterns: Data wrangling (Transform, Model & Reason) with Spark
  o Use RDDs, Transformations and Actions in the context of a Data Science Problem, an Algorithm & a Dataset
• Spend time working through MLlib
• Balance between internals & hands-on
  o Internals from Reza, the MLlib lead
• ~65% of time on Databricks Cloud & Notebooks
  o Take the time to get familiar with the Interface & the Data Science Cloud
  o Make mistakes, experiment,…
• Good Time for this course, this version
  o Will miss many of the gory details as the framework evolves
• Summarized materials for a 3 day course
  o Even if we don’t finish the exercises today, that is fine
  o Complete the work at home. There are also homework notebooks
  o Ask us questions @ksankar, @pacoid, @reza_zadeh, @mhfalaki, @andykonwinski, @xmeng, @michaelarmbrust, @tathadas
Tutorial Outline:
Morning
o Welcome + Getting Started (Krishna)
o Databricks Cloud mechanics (Andy)
o Ex 0: Pre-Flight Check (Krishna)
o DataScience DevOps - Introduction to Spark (Krishna)
o Ex 1: MLlib: Statistics, Linear Regression (Krishna)
o MLlib Deep Dive - Lecture (Reza)
  o Design Philosophy, APIs
o Ex 2: In which we explore Disasters, Trees, Classification & the Kaggle Competition (Krishna)
  o Random Forest, Bagging, Data De-correlation
o Deepdive - Leverage parallelism of RDDs, sparse vectors, etc (Reza)
Afternoon
o Ex 3: Clustering - In which we explore Segmenting Frequent InterGallactic Hoppers (Krishna)
o Ex 4: Recommendation (Krishna)
o Theory: Matrix Factorization, SVD,… (Reza)
o On-line k-means, spark streaming (Reza)
o Ex 5: Mood of the Union - Text Analytics (Krishna)
  o In which we analyze the Mood of the nation from inferences on SOTU by the POTUS (State of the Union Addresses by The President Of the US)
o Ex 99: RecSys 2015 Challenge (Krishna)
o Ask Us Anything - Panel
Introducing:
Andy Konwinski, @andykonwinski
Hossein Falaki, @mhfalaki
Michael Armbrust, @michaelarmbrust
Reza Zadeh, @Reza_Zadeh
Tathagata Das, @tathadas
Paco Nathan, @pacoid
Xiangrui Meng, @xmeng
Krishna Sankar, @ksankar
About Me
o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, Pydata, Strata et al
o Reviewer “Machine Learning with Spark”
o Picked up co-authorship Second Edition of “Fast Data Processing with Spark”
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech,..)
• Written Books (Web 2.0, Wireless, Java,…)
• Standards (Web Service, Cloud), Some work in AI
• Guest Lecturer at Naval PG School,…
• Planning Masters Computational Finance or Statistics
• Volunteer as Robotics Judge at First Lego league World Competitions
o @ksankar, doubleclix.wordpress.com, ksankar42@gmail.com
The Nuthead band!
Pre-requisites
① Register & Download data from Kaggle.
We cannot distribute Kaggle data.
Moreover you need an account to submit entries
a) Setup an account in Kaggle (www.kaggle.com)
b) We will be using the data from the competition “Titanic: Machine Learning from Disaster”
c) Download data from http://www.kaggle.com/c/titanic-gettingStarted
② Register for RecSys 2015 Competition
a) http://2015.recsyschallenge.com/
9:00
Welcome + Getting Started
Getting Started: Step 1
Everyone will receive a username/password for one of the Databricks Cloud
shards. Use your laptop and browser to log in there.
We find that cloud-based notebooks are a simple way to get started using
Apache Spark – as the motto “Making Big Data Simple” states.
Please create and run a variety of notebooks on your account throughout the
tutorial. These accounts will remain open long enough for you to export your
work.
See the product page or FAQ for more details, or contact Databricks to register
for a trial account.