Applied Machine Learning for Smart Data Analysis

Computational Intelligence in Engineering Problem Solving
Series Editor: Nilanjan Dey, Techno India College of Technology, India

Proposals for the series should be sent directly to the series editor above, or submitted to:
Chapman & Hall/CRC
Taylor and Francis Group
52 Vanderbilt Avenue, New York, NY 10017

Applied Machine Learning for Smart Data Analysis
Nilanjan Dey, Sanjeev Wagh, Parikshit N. Mahalle and Mohd. Shafi Pathan

For more information about this series, please visit http://www.crcpress.com

Applied Machine Learning for Smart Data Analysis
Edited by Nilanjan Dey, Sanjeev Wagh, Parikshit N. Mahalle, and Mohd. Shafi Pathan

MATLAB® and Simulink® are trademarks of The MathWorks, Inc. and are used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® and Simulink® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® and Simulink® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2019 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-1-138-33979-8 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S.
Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Names: Dey, Nilanjan, 1984- author. | Wagh, Sanjeev, author. | Mahalle, Parikshit N., author. | Pathan, Mohd. Shafi, author.
Title: Applied machine learning for smart data analysis / Nilanjan Dey, Sanjeev Wagh, Parikshit N. Mahalle and Mohd. Shafi Pathan.
Description: First edition. | New York, NY: CRC Press/Taylor & Francis Group, 2019. | Series: Computational Intelligence in Engineering Problem Solving | Includes bibliographical references and index.
Identifiers: LCCN 2018033217 | ISBN 9781138339798 (hardback: acid-free paper) | ISBN 9780429440953 (ebook)
Subjects: LCSH: Data mining–Industrial applications. | Electronic data processing. | Decision support systems. | Machine learning–Industrial applications. | Internet of things. | Systems engineering–Data processing.
Classification: LCC QA76.9.D343 D49 2019 | DDC 006.3/12–dc23
LC record available at https://lccn.loc.gov/2018033217

Visit the Taylor & Francis Website at http://www.taylorandfrancis.com
and the CRC Press Website at http://www.crcpress.com

Contents

Preface
Editors
List of Contributors

Section I Machine Learning

1. Hindi and Urdu to English Named Entity Statistical Machine Transliteration Using Source Language Word Origin Context
   M. L. Dhore and P. H. Rathod
2. Anti-Depression Psychotherapist Chatbot for Exam and Study-Related Stress
   Mohd. Shafi Pathan, Rushikesh Jain, Rohan Aswani, Kshitij Kulkarni, and Sanchit Gupta
3. Plagiasil: A Plagiarism Detector Based on MAS Scalable Framework for Research Effort Evaluation by Unsupervised Machine Learning–Hybrid Plagiarism Model
   Sangram Gawali, Devendra Singh Thakore, Shashank D. Joshi, and Vidyasagar Sachin Shinde

Section II Machine Learning in Data Mining

4. Digital Image Processing Using Wavelets: Basic Principles and Application
   Luminița Moraru, Simona Moldovanu, Salam Khan, and Anjan Biswas
5. Probability Predictor Using Data-Mining Techniques
   P. N. Railkar, Pinakin Parkhe, Naman Verma, Sameer Joshi, Ketaki Pathak, and Shaikh Naser Hussain
6. Big Data Summarization Using Modified Fuzzy Clustering Algorithm, Semantic Feature, and Data Compression Approach
   Shilpa G. Kolte and Jagdish W. Bakal
7. Topic-Specific Natural Language Chatbot as General Advisor for College
   Varun Patil, Yogeshwar Chaudhari, Harsh Rohila, Pranav Bhosale, and P. S. Desai

Section III Machine Learning in IoT

8. Implementation of Machine Learning in the Education Sector: Analyzing the Causes behind Average Student Grades
   Prayag Tiwari, Jia Qian, and Qiuchi Li
9. Priority-Based Message-Forwarding Scheme in VANET with Intelligent Navigation
   Sachin P. Godse, Parikshit N. Mahalle, and Mohd. Shafi Pathan

Section IV Machine Learning in Security

10. A Comparative Analysis and Discussion of Email Spam Classification Methods Using Machine Learning Techniques
    Aakash Atul Alurkar, Sourabh Bharat Ranade, Shreeya Vijay Joshi, Siddhesh Sanjay Ranade, Gitanjali R. Shinde, Piyush A. Sonewar, and Parikshit N. Mahalle
11. Malware Prevention and Detection System for Smartphone: A Machine Learning Approach
    Sachin M. Kolekar

Index

Preface

Applied Machine Learning for Smart Data Analysis discusses varied emerging and developing domains of computer technology. This book is divided into four sections covering machine learning, data mining, Internet of things, and information security.

Machine learning is a technique in which a system first understands information from the available context and then makes decisions. Transliteration from Indian languages to English is a recurring need, not only for converting out-of-vocabulary words during machine translation but also for providing multilingual support for printing electricity bills, telephone bills, municipal corporation tax bills, and so on, in English as well as Hindi. The origin of each source-language named entity is taken as either Indo-Aryan-Hindi (IAH) or Indo-Aryan-Urdu (IAU). Two separate language models are built for IAH and IAU. GIZA++ is used for word alignment.

Anxiety is a common problem among our generation. Anxiety is usually treated by visiting a therapist and by altering the hormone adrenaline in the body with medication. There are also some studies based on talking to a chatbot. Cognitive Behavioral Therapy (CBT) mainly focuses on the user’s ability to accept behavior, clarify problems, and understand the reasoning behind setting goals. Anxiety can be reduced by detecting the emotion and clarifying the problems.

Natural conversation between humans and machines is aimed at providing general bot systems for members of particular organizations.
It uses natural language processing along with pattern-matching techniques to provide an appropriate response to the end user for a requested query. Experimental analysis suggests that topic-specific dialogue coupled with conversational knowledge yields maximum dialogue sessions when compared with general conversational dialogue.

We also discuss the need for a plagiarism detector system, i.e., Plagiasil. Some of our research highlights a technical scenario in which the knowledge base or source is predicted to be a local dataset, Internet resources, online or offline books, or research published by various publications and industries. The architecture focuses on designing an algorithm that adapts to a dynamic environment of datasets. Thus, information extraction, prediction of key aspects from the extracted information, and compression or transformation of information for storage and faster comparison are addressed to explore the research methodology.

Data summarization is an important data analysis technique. Summarization is broadly classified into two types based on the methodology: semantic and syntactic. Clustering algorithms such as K-means are used for semantic summarization. Exploratory results using the Iris dataset show that the proposed modified K-means algorithm performs better than the K-means and K-medoids algorithms.

Predicting the performance of a student is a great concern to higher education management. Our proposed system attempts to apply educational data mining techniques to interpret students’ performance. By applying the C4.5 decision tree algorithm, the results can be thoroughly analyzed and easily perused.

Text summarization is one of the most popular and useful applications for information compression. Summarization systems make it possible to search for the important keywords of a text, so that the consumer spends less time reading the whole document.
We study various existing techniques along with the need for novel multi-document summarization schemes.

In the Internet of Things, smart cities need smart museums, as cultural interactivity is an important aspect of any nation. A VANET (Vehicular Ad Hoc Network) provides an Intelligent Transportation System (ITS). There has been an increasing trend in traffic accidents and congestion in the past years, so advanced technological solutions have been proposed in an attempt to reduce such mishaps and improve traffic discipline. In the proposed system, we implement an Android system that alerts the driver with messages. There is a need to prioritize message processing at the node level. We present a novel scheme for assigning priority to messages and forwarding messages according to the assigned priority. The proposed scheme reduces the time required to reach the destination.

Security is a very important concern in information computing systems. It can be addressed using various machine learning techniques. Spam emails present a huge problem. It is of utmost importance that the spam classifier used by any email client is as efficient as possible, because it could help not just with clearing out clutter and freeing up storage space but also with hiding malicious emails from layperson users. As email spam classification needs can vary for different types of usage, a comparative analysis helps us observe the performance of various algorithms on various parameters such as efficiency, performance, and scalability.

Mobile malware is a malicious software entity that is used to disrupt mobile operations and their respective functionalities. We propose machine-learning-based classification of Android applications to decide whether these applications are malware or normal applications. A malware detection technique based on permission analysis and package malware setup is also presented.

MATLAB® is a registered trademark of The MathWorks, Inc.
For product information, please contact:
The MathWorks, Inc.
3 Apple Hill Drive
Natick, MA 01760-2098 USA
Tel: 508-647-7000
Fax: 508-647-7001
E-mail: [email protected]
Web: www.mathworks.com