Table of Contents

Studies in Big Data 43
Mamta Mittal · Valentina E. Balas · Lalit Mohan Goyal · Raghvendra Kumar, Editors
Big Data Processing Using Spark in Cloud
Studies in Big Data, Volume 43
Series editor: Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland, e-mail: [email protected]
The series “Studies in Big Data” (SBD) publishes new developments and advances in the various areas of Big Data, quickly and with a high quality. The intent is to cover the theory, research, development, and applications of Big Data, as embedded in the fields of engineering, computer science, physics, economics and life sciences. The books of the series refer to the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources coming from sensors or other physical instruments as well as simulations, crowd sourcing, social networks or other internet transactions, such as emails or video click streams and others. The series contains monographs, lecture notes and edited volumes in Big Data spanning the areas of computational intelligence including neural networks, evolutionary computation, soft computing, fuzzy systems, as well as artificial intelligence, data mining, modern statistics and operations research, as well as self-organizing systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output.
More information about this series at http://www.springer.com/series/11970
Editors
Mamta Mittal, Department of Computer Science and Engineering, GB Pant Government Engineering College, New Delhi, India
Lalit Mohan Goyal, Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Valentina E. Balas, Department of Automation and Applied Informatics, Aurel Vlaicu University of Arad, Arad, Romania
Raghvendra Kumar, Department of Computer Science and Engineering, Laxmi Narayan College of Technology, Jabalpur, Madhya Pradesh, India
ISSN 2197-6503
ISSN 2197-6511 (electronic)
Studies in Big Data
ISBN 978-981-13-0549-8
ISBN 978-981-13-0550-4 (eBook)
https://doi.org/10.1007/978-981-13-0550-4
Library of Congress Control Number: 2018940888
© Springer Nature Singapore Pte Ltd. 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Printed on acid-free paper
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd., part of Springer Nature.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
The edited book “Big Data Processing using Spark in Cloud” takes a deep dive into Spark, starting with the basics of Scala and the core Spark framework, and then explores Spark data frames, machine learning using MLlib, graph analytics using GraphX, and real-time processing with Apache Kafka, AWS Kinesis, and Azure Event Hub. We will also explore Spark using PySpark and R, apply the knowledge that we have learnt so far about Spark, and work on real datasets: first some exploratory analytics, then predictive modeling on the Boston Housing dataset, then a news content-based recommender system using NLP and MLlib, a collaborative filtering-based movie recommender system, and PageRank using GraphX. This book also discusses how to tune Spark parameters for production scenarios and how to write robust applications in Apache Spark using Scala in a cloud computing environment. The book is organized into 11 chapters.
Chapter “A Survey on Big Data—Its Challenges and Solution from Vendors” carries out a detailed survey of Big Data and its difficulties, alongside the advancements required to deal with huge data. It also describes the conventional methodologies previously used to manage information, their limitations, and how these are addressed by the newer approach, Hadoop. It additionally describes the working of Hadoop along with its pros and cons and security on huge data.
Chapter “Big Data Streaming with Spark” introduces many concepts associated with Spark Streaming, including a discussion of supported operations. Finally, two other important platforms and their integration with Spark, namely Apache Kafka and Amazon Kinesis, are explored.
Chapter “Big Data Analysis in Cloud and Machine Learning” discusses data, which is considered to be the lifeblood of any business organization, as it is the data that streams into actionable insights of businesses. The data available with organizations is so large in volume that it is popularly referred to as Big Data. It is the hottest buzzword spanning the business and technology worlds. Economies the world over are using Big Data and Big Data analytics as a new frontier for business so as to plan smarter business moves, improve productivity and performance, and plan strategy more effectively. To make Big Data analytics effective, storage technologies and analytical tools play a critical role. However, it is evident that Big Data places rigorous demands on networks, storage, and servers, which has motivated organizations and enterprises to move to the cloud in order to harvest the maximum benefits of the available Big Data. Furthermore, we are also aware that traditional analytics tools are not well suited to capturing the full value of Big Data. Hence, machine learning seems to be an ideal solution for exploiting the opportunities hidden in Big Data. In this chapter, we shall discuss Big Data and Big Data analytics with a special focus on cloud computing and machine learning.
Chapter “Cloud Computing Based Knowledge Mapping Between Existing and Possible Academic Innovations—An Indian Techno-Educational Context” discusses various applications in cloud computing that allow healthy and wider efficient computing services in terms of providing centralized services of storage, applications, operating systems, processing, and bandwidth. Cloud computing is a type of architecture which helps in the promotion of scalable computing. Cloud computing is also a kind of resource-sharing platform and is thus needed in almost all spectrums and areas regardless of type.
Today, cloud computing has a wider market, and it is growing rapidly. The manpower in this field is mainly outsourced from the IT and computing services, but there is an urgent need to offer cloud computing as full-fledged bachelors and masters programs. In India also, cloud computing is rarely seen as an education program, but the situation is now changing. There is high potential to offer cloud computing in Indian educational segment. This paper is conceptual in nature and deals with the basics of cloud computing, its need, features, types existing, and possible programs in the Indian context, and also proposed several programs which ultimately may be helpful for building solid Digital India.
The objective of the Chapter “Data Processing Framework Using Apache and Spark Technologies in Big Data” is to provide an overall view of Hadoop’s MapReduce technology used for batch processing in cluster computing. Then, Spark was introduced to help Hadoop work faster, but it can also work as a stand-alone system with its own processing engine that uses Hadoop’s distributed file storage or cloud storage of data. Spark provides various APIs according to the type of data and processing required. Apart from that, it also provides tools for query processing, graph processing, and machine learning algorithms. Spark SQL is a very important framework of Spark for query processing and maintains storage of large datasets on cloud. It also allows taking input data from different data sources and performing operations on it. It provides various inbuilt functions to directly create and maintain data frames.
Chapter “Implementing Big Data Analytics Through Network Analysis Software Applications in Strategizing Higher Learning Institutions” discusses the common utility among these social media applications, so that they are able to create natural network data. These online social media networks (OSMNs) represent the links or relationships between content generators as they look, react, comment, or link to one another’s content.
There are many forms of computer-mediated social interaction which includes SMS messages, emails, discussion groups, blogs, wikis, videos, and photograph-sharing systems, chat rooms, and “social network services.” All these applications generate social media datasets of social friendships. Thus OSMNs have academic and pragmatic value and can be leveraged to illustrate the crucial contributors and the content. Our study considered all the above points into account and explored the various Network Analysis Software Applications to study the practical aspects of Big Data analytics that can be used to better strategies in higher learning institutions.
Chapter “Machine Learning on Big Data: A Developmental Approach on Societal Applications” concentrates on the most recent progress over researches with respect to machine learning for Big Data analytic and different techniques in the context of modern computing environments for various societal applications. Specifically, our aim is to investigate the opportunities and challenges of ML on Big Data and how it affects the society. The chapter covers a discussion on ML in Big Data in specific societal areas.
Chapter “Personalized Diabetes Analysis Using Correlation Based Incremental Clustering Algorithm” describes the details about incremental clustering approach, correlation-based incremental clustering algorithm (CBICA) to create clusters by applying CBICA to the data of diabetic patients and observing any relationship which indicates the reason behind the increase of the diabetic level over a specific period of time including frequent visits to healthcare facility. These obtained results from CBICA are compared with the results obtained from other incremental clustering approaches, closeness factor-based algorithm (CFBA), which is a probability-based incremental clustering algorithm.
“Cluster-first approach” is the distinctive concept implemented in both CFBA and CBICA algorithms. Both these algorithms are “parameter-free,” meaning the end user only needs to give the input dataset to these algorithms, and clustering is automatically performed using no additional dependencies from the user, including distance measures, assumption of centroids, and number of clusters to form. This research introduces a new definition of outliers, ranking of clusters, and ranking of principal components.
Scalability: Such a personalization approach can be further extended to cater to the needs of gestational, juvenile, and type 1 and type 2 diabetic prevention in society. Such research can be further made distributed in nature so as to consider diabetic patients’ data from all across the world and for wider analysis. Such analysis may vary or can be clustered based on seasonality, food intake, personal exercise regime, heredity, and other related factors.
Without such an integrated tool, the diabetologist in a hurry, while prescribing new details, may consider only the latest reports, without empirical details of an individual. Such situations are very common in these stressful and time-constrained lives, which may affect the accurate predictive analysis required for the patient.
Chapter “Processing Using Spark—A Potent of BD Technology” sustains the major potent of processing behind Spark-connected contents like resilient distributed datasets (RDDs), scalable machine learning libraries (MLlib), Spark incremental streaming pipeline process, parallel graph computation interface through GraphX, SQL data frames, Spark SQL (data processing paradigm supports columnar storage), and recommendation systems with MLlib. All libraries operate on RDDs as the data abstraction is very easy to compose with any applications.
RDDs are fault-tolerant computing engines (RDDs are the major abstraction and provide explicit support for data sharing (user’s computations) and can capture a wide range of processing workloads and fault-tolerant collections of objects partitioned across a cluster which can be manipulated in parallel). These are exposed through functional programming APIs (or BD supported languages) like Scala and Python. This chapter also throws a viewpoint on core scalability of Spark to build high-level data processing libraries for the next generation of computer applications, wherein a complex sequence of processing steps is involved. To understand and simplify the entire BD tasks, focusing on the processing hindsight, insights, foresight by using Spark’s core engine, its members of ecosystem components are explained with a neat interpretable way, which is mandatory for data science compilers at this moment. One of the tools in Spark, cloud storage, is explored in this initiative to replace the bottlenecks toward the development of an efficient and comprehend analytics applications.
Chapter “Recent Developments in Big Data Analysis Tools and Apache Spark” illustrates different tools used for the analysis of Big Data in general and Apache Spark (AS) in particular. The data structure used in AS is Spark RDD, and it also uses Hadoop. This chapter also entails merits, demerits, and different components of AS tool.
Chapter “SCSI: Real-Time Data Analysis with Cassandra and Spark” focused on understanding the performance evaluations, and Smart Cassandra Spark Integration (SCSI) streaming framework is compared with the file system-based data stores such as Hadoop streaming framework. SCSI framework is found scalable, efficient, and accurate while computing big streams of IoT data.
There have been several influences from our family and friends who have sacrificed a lot of their time and attention to ensure that we are kept motivated to complete this crucial project.
The editors are thankful to all the members of Springer (India) Private Limited, especially Aninda Bose and Jennifer Sweety Johnson for the given opportunity to edit this book.
Mamta Mittal, New Delhi, India
Valentina E. Balas, Arad, Romania
Lalit Mohan Goyal, New Delhi, India
Raghvendra Kumar, Jabalpur, India
Contents
A Survey on Big Data—Its Challenges and Solution from Vendors
  Kamalinder Kaur and Vishal Bharti
Big Data Streaming with Spark
  Ankita Bansal, Roopal Jain and Kanika Modi
Big Data Analysis in Cloud and Machine Learning
  Neha Sharma and Madhavi Shamkuwar
Cloud Computing Based Knowledge Mapping Between Existing and Possible Academic Innovations—An Indian Techno-Educational Context
  P. K. Paul, Vijender Kumar Solanki and P. S. Aithal
Data Processing Framework Using Apache and Spark Technologies in Big Data
  Archana Singh, Mamta Mittal and Namita Kapoor
Implementing Big Data Analytics Through Network Analysis Software Applications in Strategizing Higher Learning Institutions
  Meenu Chopra and Cosmena Mahapatra
Machine Learning on Big Data: A Developmental Approach on Societal Applications
  Le Hoang Son, Hrudaya Kumar Tripathy, Acharya Biswa Ranjan, Raghvendra Kumar and Jyotir Moy Chatterjee
Personalized Diabetes Analysis Using Correlation-Based Incremental Clustering Algorithm
  Preeti Mulay and Kaustubh Shinde

Description:

The book describes the emergence of big data technologies and the role of Spark in the entire big data stack. It compares Spark and Hadoop and identifies the shortcomings of Hadoop that have been overcome by Spark. The book mainly focuses on the in-depth architecture of Spark and our understanding o

Big Data Processing Using Spark in Cloud PDF

274 Pages·2019·8.49 MB·English

by Mamta Mittal
