What is Spark? - UC Berkeley AMP Camp PDF

49 Pages·2013·1.32 MB·English

Parallel Programming With Spark
Matei Zaharia, UC Berkeley
www.spark-project.org

What is Spark?
Fast and expressive cluster computing system compatible with Apache Hadoop.
Improves efficiency through:
» General execution graphs
» In-memory storage
(up to 10× faster on disk, 100× in memory)
Improves usability through:
» Rich APIs in Java, Scala, Python
» Interactive shell
(2-5× less code)

Project History
Spark started in 2009, open sourced in 2010.
In use at Intel, Yahoo!, Adobe, Quantifind, Conviva, Ooyala, Bizo and others.
Entered the Apache Incubator in June.

Open Source Community
1000+ meetup members
70+ contributors
20 companies contributing

This Talk
Introduction to Spark
Tour of Spark operations
Job execution
Standalone apps

Key Idea
Write programs in terms of transformations on distributed datasets.
Concept: resilient distributed datasets (RDDs)
» Collections of objects spread across a cluster
» Built through parallel transformations (map, filter, etc.)
» Automatically rebuilt on failure
» Controllable persistence (e.g. caching in RAM)

Operations
Transformations (e.g. map, filter, groupBy)
» Lazy operations to build RDDs from other RDDs
Actions (e.g. count, collect, save)
» Return a result or write it to storage

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")                    # base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()
    messages.filter(lambda s: "foo" in s).count()           # action
    messages.filter(lambda s: "bar" in s).count()

[Diagram: the driver ships tasks to workers; each worker reads one block of the file, keeps its partition of messages in cache, and returns results to the driver.]
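The split between lazy transformations and eager actions can be illustrated with a minimal sketch in plain Python (this is an illustration of the concept, not Spark's implementation): `filter` and `map` only record a plan, and the `count` action triggers a single pass over the data.

```python
# Minimal sketch of the log-mining pipeline in plain Python (not Spark).
# Transformations (filter, map) are lazy and only record a plan; actions
# (count, collect) force one pass over the source records.

class LazyDataset:
    def __init__(self, data, ops=()):
        self.data = data          # source records
        self.ops = ops            # recorded transformations (lazy)

    def filter(self, f):          # transformation: no work done yet
        return LazyDataset(self.data, self.ops + (("filter", f),))

    def map(self, f):             # transformation: no work done yet
        return LazyDataset(self.data, self.ops + (("map", f),))

    def _evaluate(self):
        records = iter(self.data)
        for kind, f in self.ops:
            records = filter(f, records) if kind == "filter" else map(f, records)
        return records

    def count(self):              # action: forces evaluation
        return sum(1 for _ in self._evaluate())

    def collect(self):            # action: materializes the results
        return list(self._evaluate())

log = [
    "ERROR\t10:32\tfoo failed",
    "INFO\t10:33\tall good",
    "ERROR\t10:34\tbar crashed",
]
lines = LazyDataset(log)
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
print(messages.filter(lambda s: "foo" in s).count())  # 1
print(messages.filter(lambda s: "bar" in s).count())  # 1
```

Nothing is computed when `errors` or `messages` is defined; only the final `count()` calls touch the data, which is what lets Spark plan whole execution graphs before running them.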
Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5 sec (vs 180 sec for on-disk data)

Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data. Ex:

    msgs = textFile.filter(lambda s: s.startswith("ERROR"))
                   .map(lambda s: s.split("\t")[2])

[Diagram: HDFS File -> filter (func = _.contains(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD]

Behavior with Less RAM
[Chart: execution time (s) as the fraction of the working set held in cache grows from "cache disabled" through 25%, 50%, 75% to "fully cached"; execution time falls steadily as more of the working set fits in memory.]
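The fault-recovery idea can be sketched in a few lines of plain Python (a conceptual illustration, not Spark's actual recovery code): each derived dataset records its parent and the function that produced it, so a lost partition can be recomputed from the source rather than restored from a replica.

```python
# Sketch of lineage-based recovery (plain Python, not Spark's implementation).
# Each derived RDD remembers its parent and the transform that built it, so
# a lost partition can be rebuilt by replaying the lineage from the source.

class TrackedRDD:
    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions  # list of partitions (lists of records)
        self.parent = parent          # lineage: where this RDD came from
        self.transform = transform    # lineage: how each partition was derived

    def map_partitions(self, f):
        return TrackedRDD([f(p) for p in self.partitions],
                          parent=self, transform=f)

    def recompute(self, i):
        # Walk the lineage back to the source, then re-apply transforms.
        if self.parent is None:
            return self.partitions[i]
        return self.transform(self.parent.recompute(i))

source = TrackedRDD([["ERROR\tx\tfoo"], ["ERROR\ty\tbar"]])
mapped = source.map_partitions(lambda p: [s.split("\t")[2] for s in p])

mapped.partitions[1] = None                  # simulate losing a partition
mapped.partitions[1] = mapped.recompute(1)   # rebuild it from lineage
print(mapped.partitions[1])                  # ['bar']
```

Because the lineage graph is small (just functions and parent pointers), tracking it is far cheaper than replicating the data itself.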

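The effect of `messages.cache()` in the log-mining example can also be sketched in plain Python (again an illustration of the idea, not Spark's code): the first action computes and memoizes the records, and later actions are served from memory instead of re-reading and re-parsing the input.

```python
# Sketch of cache() semantics (plain Python, not Spark). The first action
# computes the records and keeps them in memory; later actions reuse them.

class Dataset:
    def __init__(self, compute):
        self.compute = compute      # recomputes records from the source
        self.use_cache = False
        self._memo = None           # in-memory copy after first action
        self.computations = 0       # instrumentation: actual recomputations

    def cache(self):
        self.use_cache = True
        return self

    def records(self):
        if self.use_cache and self._memo is not None:
            return self._memo       # served from memory
        self.computations += 1
        out = list(self.compute())
        if self.use_cache:
            self._memo = out
        return out

    def count_matching(self, pred):  # an "action"
        return sum(1 for r in self.records() if pred(r))

log = ["ERROR\t1\tfoo down", "INFO\t2\tok", "ERROR\t3\tbar down"]
messages = Dataset(lambda: (s.split("\t")[2] for s in log
                            if s.startswith("ERROR"))).cache()
messages.count_matching(lambda m: "foo" in m)   # computes once, caches
messages.count_matching(lambda m: "bar" in m)   # served from memory
print(messages.computations)                    # 1
```

Without the `cache()` call, each interactive query would re-scan and re-parse the whole log, which is exactly the gap the deck's "Behavior with Less RAM" chart quantifies.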
