Table of Contents

Bachelor's thesis
Tim Horgas
Performance-Analyse von Apache Spark und Apache Hadoop
(English title: Performance-Analysis of Apache Spark and Apache Hadoop)

Faculty of Engineering and Computer Science (Fakultät Technik und Informatik)
Department of Computer Science (Studiendepartment Informatik)

Bachelor's thesis submitted as part of the bachelor's examination in the degree programme Bachelor of Science Wirtschaftsinformatik at the Department of Computer Science, Faculty of Engineering and Computer Science, Hochschule für Angewandte Wissenschaften Hamburg.

Supervising examiner: Prof. Dr. Zukunft
Second examiner: Prof. Dr. Steffens
Submitted on: 24 September 2015

Keywords
Apache Spark, Apache Hadoop, Big Data, Benchmarking, Performance-Analysis

Abstract
This bachelor's thesis analyses, in the context of Big Data, the performance of Apache Spark in comparison with Apache Hadoop. The two frameworks are first compared by means of a benchmark and then evaluated with a value-benefit analysis (Nutzwertanalyse).

Contents

1 Introduction
  1.1 Big Data as the Basis for New Technologies
  1.2 Technologies for Implementing Big Data Analyses
  1.3 Apache Hadoop and Apache Spark
  1.4 Objectives
2 MapReduce
  2.1 Usage and Structure
    2.1.1 Map Function
    2.1.2 Reduce Function
  2.2 Evaluation of the MapReduce Approach
    2.2.1 Advantages
    2.2.2 Disadvantages
3 HDFS
  3.1 Use Cases and Goals of HDFS
  3.2 Master-Slave Replication
  3.3 Structure of HDFS
    3.3.1 Hardware
    3.3.2 Software
  3.4 Evaluation of Hadoop HDFS
    3.4.1 Advantages
    3.4.2 Disadvantages
4 Apache Hadoop
  4.1 Use Cases and Goals of Apache Hadoop
  4.2 Structure of Hadoop
    4.2.1 Architecture of YARN
    4.2.2 Hadoop MapReduce
    4.2.3 Apache Hive
  4.3 Evaluation of Apache Hadoop
    4.3.1 Advantages
    4.3.2 Disadvantages
5 Apache Spark
  5.1 Use Cases and Goals of Apache Spark
  5.2 Structure of Apache Spark
    5.2.1 Programming Model
    5.2.2 Architecture of a Spark Cluster
    5.2.3 Spark SQL
  5.3 Evaluation of Apache Spark
    5.3.1 Advantages
    5.3.2 Disadvantages
6 Performance Test
  6.1 Benchmark Setup
    6.1.1 Cluster
    6.1.2 Data
    6.1.3 Hypotheses
    6.1.4 Operations
    6.1.5 Execution Model of the Benchmark
  6.2 Benchmark
    6.2.1 Hypothesis 1
    6.2.2 Hypothesis 2
    6.2.3 Hypothesis 3
    6.2.4 Hypothesis 4
    6.2.5 Hypothesis 5
  6.3 Correctness of the Benchmark
  6.4 Results of the Performance Test
7 Evaluation
  7.1 Value-Benefit Analysis of the Performance Test
    7.1.1 Naming the Decision Problem and Selecting the Decision Alternatives
    7.1.2 Collecting the Decision Criteria
    7.1.3 Weighting the Decision Criteria
    7.1.4 Scoring the Decision Criteria
    7.1.5 Calculating the Utility Value (Nutzwert)
  7.2 Analysis of the Practical Advantages and Disadvantages of Apache Hadoop and Apache Spark
8 Summary and Outlook
  8.1 Summary
  8.2 Remarks and Suggestions for Further Research
  8.3 Opportunities for Apache Spark in the Big Data Field
Appendix A Cluster Monitoring
  A.1 Hypothesis 1
  A.2 Hypothesis 2
Appendix B Sample Data
Bibliography

List of Tables

5.1 Interface of an RDD in Spark
6.1 Results for Hypothesis 1
6.2 Results for Hypothesis 2
6.3 Results for Hypothesis 3 (Hadoop)
6.4 Results for Hypothesis 3 (Spark)
6.5 Results for Hypothesis 4 (Apache Hadoop)
6.6 Results for Hypothesis 4 (Apache Spark)
6.7 Results for Hypothesis 5 (Apache Hadoop)
6.8 Results for Hypothesis 5 (Apache Spark)
7.1 Criteria Groups and Their Evaluation Criteria
7.2 Calculation of the Criteria Weights Using Criteria Groups
7.3 Scale for Calculating the Scores
7.4 Calculation of the Value-Benefit Analysis
7.5 Value-Benefit Analysis: Overall Result

List of Figures

3.1 Example Hadoop cluster composed of several racks
3.2 Components of a NameNode
3.3 Components of a DataNode
3.4 Example Hadoop cluster with a write operation on the file file1
4.1 Architecture of YARN
4.2 Slave node in a YARN cluster
4.3 Important components of the ResourceManager
4.4 Execution of a MapReduce application with YARN
5.1 Dependencies in Spark
5.2 Master/slave architecture of a Spark cluster
6.1 Results for Hypothesis 1: chart
6.2 Main memory utilization of Apache Hadoop and Apache Spark during the WordCount operation
6.3 Main memory utilization of Apache Hadoop and Apache Spark during the URLCount operation
6.4 Average number of running processes of Apache Hadoop and Apache Spark during the URLCount operation
6.5 Results for Hypothesis 2: chart
6.6 Main memory utilization of Apache Hadoop and Apache Spark during the Join operation
6.7 Results for Hypothesis 3, Apache Hadoop: chart
6.8 Results for Hypothesis 3, Apache Spark: chart
6.9 Comparison of the network utilization of Apache Spark during the WordCount operation
6.10 Results for Hypothesis 4, Apache Hadoop: chart
6.11 Results for Hypothesis 4, Apache Spark: chart
6.12 Comparison of the network utilization of Apache Spark during the WordCount operation
6.13 Comparison of the execution times of Apache Hadoop and Apache Spark when scaling the number of nodes, for the operations WordCount, URLCount and Join
6.14 Comparison of the actual scaling with the assumed linear scaling for the WordCount operation
A.1 Average number of running processes of Apache Hadoop and Apache Spark during the SumYear operation
A.2 Main memory utilization of Apache Hadoop and Apache Spark during the SumYear operation
A.3 Average CPU utilization of Apache Hadoop and Apache Spark during the SumYear operation
A.4 Average network utilization of Apache Hadoop and Apache Spark during the SumYear operation
A.5 Average number of running processes of Apache Hadoop and Apache Spark during the Join operation
A.6 Average CPU utilization of Apache Hadoop and Apache Spark during the Join operation
A.7 Average network utilization of Apache Hadoop and Apache Spark during the Join operation
A.8 Average number of running processes of Apache Hadoop and Apache Spark during the SumYear operation
A.9 Main memory utilization of Apache Hadoop and Apache Spark during the SumYear operation
A.10 Average CPU utilization of Apache Hadoop and Apache Spark during the SumYear operation
A.11 Average network utilization of Apache Hadoop and Apache Spark during the SumYear operation
B.1 Example of the N-grams used in the performance test

List of Code Listings

5.1 Spark shell
5.2 Lineage of the weblog RDD
5.3 Lineage of the urlCounts RDD

Performance-Analyse von Apache Spark und Apache Hadoop PDF

92 pages · 2015 · 2.69 MB · German (with an English abstract)

by Tim Horgas
