Machine Learning with Apache Spark
PTC workshop, 2018-02-13
Mathijs Kattenberg, Jeroen Schot

About us

Mathijs Kattenberg
● Technical consultant at SURFsara since 2013
● Working with Big Data technologies (Hadoop, Spark, Kafka)
Before:
● Scientific programmer at VU Amsterdam
● MSc Artificial Intelligence at VU Amsterdam

Jeroen Schot
● Technical consultant at SURFsara since 2012
● Working with Big Data technologies (Hadoop, Spark, Kafka)
Before:
● MSc Physics at Utrecht University

Program for today

09:00 - 09:15  Welcome & introduction
09:15 - 10:30  Apache Spark core and structured APIs
10:30 - 10:45  Coffee break
10:45 - 12:00  Hands-on Jupyter notebooks
12:00 - 13:00  Lunch
13:00 - 14:30  Apache Spark MLlib
14:30 - 14:45  Coffee break
14:45 - 16:15  Hands-on Jupyter notebooks
16:15 - 16:30  Coffee break
16:30 - 17:00  Practical advice, summary

Apache Spark core and structured APIs

● Differences with traditional HPC approaches
● Distributed data processing
● Resilient Distributed Datasets (RDDs)
● DataFrames (DFs)
(See the first PySpark sketch at the end of this section for the RDD and DataFrame APIs side by side.)

"Traditional" (scientific) software applications

The application is developed as a stand-alone binary:
● It assumes a specific environment (e.g. Linux OS, CLI)
● It operates on input files and parameters
● It produces output files
● The researcher specifies input files and parameters via the CLI

Scaling "traditional" applications

Whoever runs the application now also needs to:
● Distribute and split the data
● Handle the faults and errors inherent in running at scale
● Submit and track the applications

An example

From a tweet, we are interested in finding:
● Names of persons
● Names of organisations
● Locations and places

"I will be watching the election results from Trump Tower in Manhattan with my family and friends. Very exciting!"

A straightforward implementation

● Store the tweets on disk
● A small Python program uses NLTK and the Stanford NER to tag each tweet (sketched below)
● Write the output back to disk

But… http://bit.ly/1rxKY0n

Scaling bottlenecks

● Store tweets on disk: the disk will eventually fill, and many readers contend for it
● Small Python program: it can process one tweet every few milliseconds to seconds, so we need to run separate processes
● Write output back to disk: the disk will eventually fill, and many writers contend for it
● Run separate processes: they all need their share of the input
(See the Spark sketch at the end of this section for how these bottlenecks can be addressed.)
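To make the RDD and DataFrame bullets concrete, here is a minimal PySpark sketch contrasting the two APIs. The example data, column names, and application name are illustrative assumptions, not part of the workshop material.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

# RDD: a low-level, distributed collection of plain Python objects.
rdd = sc.parallelize([("alice", 3), ("bob", 7), ("carol", 5)])
high = rdd.filter(lambda kv: kv[1] > 4).map(lambda kv: kv[0])
print(high.collect())  # ['bob', 'carol']

# DataFrame: the same data with a schema, so Spark can optimize the query.
df = spark.createDataFrame([("alice", 3), ("bob", 7), ("carol", 5)],
                           ["name", "score"])
df.filter(df.score > 4).select("name").show()

spark.stop()
```

The two halves compute the same result; the DataFrame version additionally gives Spark a schema to plan and optimize against, which is the motivation for the structured APIs.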
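The "straightforward implementation" can be sketched as a small single-machine Python script. The input/output file names and the Stanford NER model and jar paths below are assumptions for illustration; NLTK's StanfordNERTagger is one way to drive the Stanford NER from Python, as the slides suggest.

```python
# Single-machine approach: read tweets from disk, tag each one with the
# Stanford NER via NLTK, write the entities back to disk.
# Requires the NLTK 'punkt' tokenizer data and the Stanford NER download.
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

tagger = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",  # PERSON/ORGANIZATION/LOCATION model
    "stanford-ner.jar",
)

with open("tweets.txt") as fin, open("entities.txt", "w") as fout:
    for tweet in fin:
        tagged = tagger.tag(word_tokenize(tweet))
        # Keep only tokens that were labelled as an entity ("O" = no entity).
        entities = [(token, label) for token, label in tagged if label != "O"]
        fout.write(repr(entities) + "\n")
```

Note that NLTK's StanfordNERTagger shells out to a Java process for each tag() call, which is one reason a single tweet can take anywhere from milliseconds up to seconds to process.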
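One way the bottlenecks listed above could be addressed with Spark, anticipating the API covered in this session: a shared store such as HDFS replaces the single disk for both readers and writers, and Spark splits the input and submits and tracks the tagging processes. The HDFS paths and the tagger setup are assumptions; NLTK and the Stanford NER files would have to be available on every worker.

```python
from pyspark.sql import SparkSession
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

spark = SparkSession.builder.appName("tweet-ner").getOrCreate()
sc = spark.sparkContext

def tag_partition(tweets):
    # One tagger per partition, and one batched call for all its tweets.
    tagger = StanfordNERTagger("english.all.3class.distsim.crf.ser.gz",
                               "stanford-ner.jar")
    batch = [word_tokenize(t) for t in tweets]
    # tag_sents makes a single Java invocation for the whole batch instead
    # of one per tweet, amortizing the process start-up cost.
    for tagged in tagger.tag_sents(batch):
        yield [(token, label) for token, label in tagged if label != "O"]

tweets = sc.textFile("hdfs:///data/tweets/*.txt")   # split input, many readers
entities = tweets.mapPartitions(tag_partition)
entities.saveAsTextFile("hdfs:///data/entities")    # split output, many writers

spark.stop()
```

Using mapPartitions rather than map is what makes the batching possible: the function sees the whole partition at once, so the expensive external tagger is invoked once per partition rather than once per tweet.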