Machine Learning with Apache Spark
PTC workshop, 2018-02-13
Mathijs Kattenberg, Jeroen Schot

About us

Mathijs Kattenberg
● Technical consultant at SURFsara since 2013
● Working with Big Data technologies (Hadoop, Spark, Kafka)
Before:
● Scientific programmer at VU Amsterdam
● MSc Artificial Intelligence at VU Amsterdam

Jeroen Schot
● Technical consultant at SURFsara since 2012
● Working with Big Data technologies (Hadoop, Spark, Kafka)
Before:
● MSc Physics at Utrecht University

Program for today

09:00 - 09:15  Welcome & introduction
09:15 - 10:30  Apache Spark core and structured APIs
10:30 - 10:45  Coffee break
10:45 - 12:00  Hands-on Jupyter notebooks
12:00 - 13:00  Lunch
13:00 - 14:30  Apache Spark MLlib
14:30 - 14:45  Coffee break
14:45 - 16:15  Hands-on Jupyter notebooks
16:15 - 16:30  Coffee break
16:30 - 17:00  Practical advice, summary

Apache Spark core and structured APIs

● Differences with traditional HPC approaches
● Distributed data processing
● Resilient Distributed Datasets (RDDs)
● DataFrames (DFs)
(See the first PySpark sketch at the end of this section for the RDD and DataFrame APIs side by side.)

"Traditional" (scientific) software applications

The application is developed as a stand-alone binary:
● It assumes a specific environment (e.g. Linux OS, CLI)
● It operates on input files and parameters
● It produces output files
● The researcher specifies input files and parameters via the CLI

Scaling "traditional" applications

Whoever runs the application now also needs to:
● Distribute and split the data
● Handle the faults and errors inherent in running at scale
● Submit and track the applications

An example

From a tweet, we are interested in finding:
● Names of persons
● Names of organisations
● Locations and places

"I will be watching the election results from Trump Tower in Manhattan with my family and friends. Very exciting!"

A straightforward implementation

● Store the tweets on disk
● A small Python program uses NLTK and the Stanford NER to tag each tweet (sketched below)
● Write the output back to disk

But… http://bit.ly/1rxKY0n

Scaling bottlenecks

● Store tweets on disk: the disk will eventually fill, and many readers contend for it
● Small Python program: it can process one tweet every few milliseconds to seconds, so we need to run separate processes
● Write output back to disk: the disk will eventually fill, and many writers contend for it
● Run separate processes: they all need their share of the input
(See the Spark sketch at the end of this section for how these bottlenecks can be addressed.)
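To make the RDD and DataFrame bullets concrete, here is a minimal PySpark sketch contrasting the two APIs. The example data, column names, and application name are illustrative assumptions, not part of the workshop material.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

# RDD: a low-level, distributed collection of plain Python objects.
rdd = sc.parallelize([("alice", 3), ("bob", 7), ("carol", 5)])
high = rdd.filter(lambda kv: kv[1] > 4).map(lambda kv: kv[0])
print(high.collect())  # ['bob', 'carol']

# DataFrame: the same data with a schema, so Spark can optimize the query.
df = spark.createDataFrame([("alice", 3), ("bob", 7), ("carol", 5)],
                           ["name", "score"])
df.filter(df.score > 4).select("name").show()

spark.stop()
```

The two halves compute the same result; the DataFrame version additionally gives Spark a schema to plan and optimize against, which is the motivation for the structured APIs.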
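The "straightforward implementation" can be sketched as a small single-machine Python script. The input/output file names and the Stanford NER model and jar paths below are assumptions for illustration; NLTK's StanfordNERTagger is one way to drive the Stanford NER from Python, as the slides suggest.

```python
# Single-machine approach: read tweets from disk, tag each one with the
# Stanford NER via NLTK, write the entities back to disk.
# Requires the NLTK 'punkt' tokenizer data and the Stanford NER download.
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

tagger = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",  # PERSON/ORGANIZATION/LOCATION model
    "stanford-ner.jar",
)

with open("tweets.txt") as fin, open("entities.txt", "w") as fout:
    for tweet in fin:
        tagged = tagger.tag(word_tokenize(tweet))
        # Keep only tokens that were labelled as an entity ("O" = no entity).
        entities = [(token, label) for token, label in tagged if label != "O"]
        fout.write(repr(entities) + "\n")
```

Note that NLTK's StanfordNERTagger shells out to a Java process for each tag() call, which is one reason a single tweet can take anywhere from milliseconds up to seconds to process.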
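One way the bottlenecks listed above could be addressed with Spark, anticipating the API covered in this session: a shared store such as HDFS replaces the single disk for both readers and writers, and Spark splits the input and submits and tracks the tagging processes. The HDFS paths and the tagger setup are assumptions; NLTK and the Stanford NER files would have to be available on every worker.

```python
from pyspark.sql import SparkSession
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

spark = SparkSession.builder.appName("tweet-ner").getOrCreate()
sc = spark.sparkContext

def tag_partition(tweets):
    # One tagger per partition, and one batched call for all its tweets.
    tagger = StanfordNERTagger("english.all.3class.distsim.crf.ser.gz",
                               "stanford-ner.jar")
    batch = [word_tokenize(t) for t in tweets]
    # tag_sents makes a single Java invocation for the whole batch instead
    # of one per tweet, amortizing the process start-up cost.
    for tagged in tagger.tag_sents(batch):
        yield [(token, label) for token, label in tagged if label != "O"]

tweets = sc.textFile("hdfs:///data/tweets/*.txt")   # split input, many readers
entities = tweets.mapPartitions(tag_partition)
entities.saveAsTextFile("hdfs:///data/entities")    # split output, many writers

spark.stop()
```

Using mapPartitions rather than map is what makes the batching possible: the function sees the whole partition at once, so the expensive external tagger is invoked once per partition rather than once per tweet.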