ebook img

Algo Prog Objet Python - I3S PDF

29 Pages·2016·0.33 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Algo Prog Objet Python - I3S

PPaarraalllleelliissmm MMaasstteerr 11 IInntteerrnnaattiioonnaall Andrea G. B. Tettamanzi Université de Nice Sophia Antipolis Département Informatique [email protected] Andrea G. B. Tettamanzi, 2016 1 Lecture 7 – Part a Distributed Computing and Data Base Systems for the Big Data Andrea G. B. Tettamanzi, 2016 2 Plan • NoSQL Databases • Hadoop and MapReduce • Apache Spark • Document-based DBs: MongoDB • Graph-based DBs • RDF, triple stores, and SPARQL (→ MOOC) • The CAP Theorem Andrea G. B. Tettamanzi, 2016 3 • Apache Hadoop is a software library • Framework for the distributed processing of large data sets • Designed to – scale up from single servers to 1000s of machines – detect and handle failures at the application layer • Four modules (components) – Hadoop Common: file-system and OS-level abstractions – Hadoop distributed file system (HDFS) – Hadoop YARN: job scheduling and cluster resource mgmt – MapReduce: parallel processing of large data sets • Related projects: Cassandra, HBase, Spark, and several others Andrea G. B. Tettamanzi, 2016 4 Hadoop Architecture Location awareness: Rack (= switch) name JRE 1.6 or higher SSH Andrea G. B. Tettamanzi, 2016 5 HDFS • Distributed, scalable, and portable file-system written in Java • A Hadoop cluster has nominally a single namenode plus a cluster of datanodes, although redundancy options are available • Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. • TCP/IP sockets used for communication • Clients use RPC to communicate with one another • Large files stored across multiple machines • Reliability achieved through replication (cf. Part b of this lecture) Andrea G. B. Tettamanzi, 2016 6 MapReduce (1) • A programming model for processing parallelizable problems on huge datasets using a large number of computers (nodes) • Inspired by map and reduce functions in functional programming • Originally a proprietary Google technology • MapReduce libraries have been written in many programming languages • A popular open-source implementation is part of Apache Hadoop • Key contributions of the MapReduce framework are – Scalability (optimized distributed shuffle operation) – Fault-tolerance Andrea G. B. Tettamanzi, 2016 7 MapReduce (2) Three-Step Parallel and Distributed Computation: • "Map" operation: Each worker node applies the “map(key, value)” function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of redundant input data is processed. • "Shuffle" operation: Worker nodes redistribute data based on the output keys (produced by the “map()” function), such that all data belonging to one key is located on the same worker node. • "Reduce" operation: Worker nodes apply the “reduce(key, list of values)” to each group of output data, per key, in parallel. Andrea G. B. Tettamanzi, 2016 8 MapReduce: Example function map(String name, String document): // name: document name // document: document contents for each word w in document: emit (w, 1) function reduce(String word, Iterator partialCounts): // word: a word // partialCounts: a list of aggregated partial counts sum = 0 for each pc in partialCounts: sum += pc emit (word, sum) Andrea G. B. Tettamanzi, 2016 9 Apache Spark • The Spark project was started by Matel Zaharia at UC Berkeley to provide programming tools for big data volumes that are easy to use and versatile as those for single machines • Main motivation: overcome limitations in the MapReduce model • It has language-integrated APIs in Python, Java, R, and Scala • Supports streaming, batch, and interactive computations • Open source (donated to the Apache Software Foundation) • Implicit data parallelism and fault-tolerance • Spark requires: – cluster manager (e.g., native Spark or Apache YARN), – distributed storage system (e.g., HDFS, Cassandra, …) Andrea G. B. Tettamanzi, 2016 10

Description:
A dialect of Lisp created by Rich Hickey. • Runs on the JVM, the Common Language Runtime, and the. JavaScript interpreter. • Purely functional; immutable core
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.