Unsupervised and Semi-Supervised Learning
Series Editor: M. Emre Celebi

Deepak P, Anna Jurek-Loughrey (Editors)
Linking and Mining Heterogeneous and Multi-view Data

Springer’s Unsupervised and Semi-Supervised Learning book series covers the latest theoretical and practical developments in unsupervised and semi-supervised learning. Titles, including monographs, contributed works, professional books, and textbooks, tackle various issues surrounding the proliferation of massive amounts of unlabeled data in many application domains and how unsupervised learning algorithms can automatically discover interesting and useful patterns in such data. The books discuss how these algorithms have found numerous applications including pattern recognition, market basket analysis, web mining, social network analysis, information retrieval, recommender systems, market research, intrusion detection, and fraud detection. Books also discuss semi-supervised algorithms, which can make use of both labeled and unlabeled data and can be useful in application domains where unlabeled data is abundant, yet it is possible to obtain a small amount of labeled data.
More information about this series at http://www.springer.com/series/15892

Editors
Deepak P, Queen’s University Belfast, Northern Ireland, UK
Anna Jurek-Loughrey, Queen’s University Belfast, Northern Ireland, UK

ISSN 2522-848X
ISSN 2522-8498 (electronic)
Unsupervised and Semi-Supervised Learning
ISBN 978-3-030-01871-9
ISBN 978-3-030-01872-6 (eBook)
https://doi.org/10.1007/978-3-030-01872-6
Library of Congress Control Number: 2018962409

© Springer Nature Switzerland AG 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

With data being recorded at never-before-seen scales within an increasingly complex world, computing systems routinely need to deal with a large variety of data types from a plurality of data sources. More often than not, the same entity manifests differently across different data sources. To arrive at a well-rounded view across all data sources to fuel data-driven applications, such different entity representations may need to be linked, and the linked information mined while being mindful of the differences between the views. The goal of this volume is to offer a snapshot of the advances in this emerging area of data science and big data systems. The intended audience includes researchers and practitioners who work on applications and methodologies related to linking and mining heterogeneous and multi-view data.

The book comprises a number of chapters on methods and algorithms, including both novel methods and surveys, and some that are more application-oriented and targeted towards specific domains. To keep the narrative interesting and to avoid monotony for the reader, chapters on methods and applications have been interleaved.

The first four chapters provide reviews of some of the foundational streams of research in the area. The first chapter is an overview of state-of-the-art methods for multi-view data completion, a task of much importance since each entity may not be represented fully across all the different views. The second chapter surveys the field of multi-view data clustering, the most explored area of unsupervised learning within the realm of multi-view data.
We then turn our attention to linking, with the third chapter taking stock of unsupervised and semi-supervised methods for linking data records from across data sources. Record linkage is a computationally expensive task, and blocking methods often come in handy to rein in the search space. Unsupervised and semi-supervised blocking techniques form the focus of the fourth chapter.

The fifth chapter is the first of the application articles in the volume, and it surveys a rich field of application of multi-view data analytics, that of transportation systems that encompass a wide variety of sensors capturing different kinds of transportation data. Social media discussions often manifest with intricate structure along with the textual content, and identifying discourse acts within such data forms the topic of the sixth chapter. The seventh chapter looks at imbalanced datasets with an eye on improving per-class classification accuracy. Its authors propose novel methods that model the co-operation between different data views in order to improve classification accuracy. The eighth chapter is focused on an enterprise information retrieval scenario, where a fast and simple model for linking entities from domain-specific knowledge graphs to IR-style queries is proposed.

Chapter 9 takes us back to multi-view clustering and focuses on surveying methods that build upon non-negative matrix factorization and manifold learning for the task. The tenth chapter considers how heterogeneous data sources are leveraged in data science methods for fake news detection, an emerging area of high interest from a societal perspective. Chapter 11 is an expanded version of a conference paper that appeared in AISTATS 2018 and addresses the task of multi-view metric learning. Social media analytics, in addition to leveraging conventional types of data, is characterized by abundant usage of structure. Community detection is among the fundamental tasks within social media analytics; the last chapter in this volume considers evaluation of community detection methods for social media.

We hope that this volume, focused on both linking heterogeneous data sources and methods for analytics over linked data sources, will demonstrate the significant progress that has occurred in this field in recent years. We also hope that the developments reported in this volume will motivate further research in this exciting field.

Belfast, UK    Deepak P
Belfast, UK    Anna Jurek-Loughrey

Contents

1 Multi-View Data Completion
   Sahely Bhadra
2 Multi-View Clustering
   Deepak P and Anna Jurek-Loughrey
3 Semi-supervised and Unsupervised Approaches to Record Pairs Classification in Multi-Source Data Linkage
   Anna Jurek-Loughrey and Deepak P
4 A Review of Unsupervised and Semi-supervised Blocking Methods for Record Linkage
   Kevin O’Hare, Anna Jurek-Loughrey, and Cassio de Campos
5 Traffic Sensing and Assessing in Digital Transportation Systems
   Hana Rabbouch, Foued Saâdaoui, and Rafaa Mraihi
6 How Did the Discussion Go: Discourse Act Classification in Social Media Conversations
   Subhabrata Dutta, Tanmoy Chakraborty, and Dipankar Das
7 Learning from Imbalanced Datasets with Cross-View Cooperation-Based Ensemble Methods
   Cécile Capponi and Sokol Koço
8 Entity Linking in Enterprise Search: Combining Textual and Structural Information
   Sumit Bhatia
9 Clustering Multi-View Data Using Non-negative Matrix Factorization and Manifold Learning for Effective Understanding: A Survey Paper
   Khanh Luong and Richi Nayak
10 Leveraging Heterogeneous Data for Fake News Detection
   K. Anoop, Manjary P. Gangan, Deepak P, and V. L. Lajish
11 General Framework for Multi-View Metric Learning
   Riikka Huusari, Hachem Kadri, and Cécile Capponi
12 On the Evaluation of Community Detection Algorithms on Heterogeneous Social Media Data
   Antonela Tommasel and Daniela Godoy
Index

Chapter 1
Multi-View Data Completion

Sahely Bhadra

Abstract Multi-view learning has been explored in various applications such as bioinformatics, natural language processing and multimedia analysis. Multi-view learning methods commonly assume that full feature matrices or kernel matrices are available for all views. However, in partial data analytics, it is common that information from some sources is unavailable or missing for some data-points. Such lack of information can be categorized into two types. (1) Incomplete view: information of a data-point is partially missing in some views. (2) Missing view: information of a data-point is entirely missing in some views, but information for that data-point is fully available in other views (no partially missing data-point in a view).

Although multi-view learning in the presence of missing data has drawn a great amount of attention in the recent past, and there are many research papers on multi-view data completion, there is no comprehensive introduction to and review of current approaches to multi-view data completion. We address this gap in this chapter by describing the multi-view data completion methods.

In this chapter, we mainly discuss existing methods for dealing with the missing view problem. We describe a simple taxonomy of the current approaches. For each category, representative as well as newly proposed models are presented. We also attempt to identify promising avenues and point out some specific challenges, which can hopefully promote further research in this rapidly developing field.
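
The two kinds of missingness named in the abstract can be made concrete with a small sketch. The representation below is an assumption made purely for illustration and is not taken from the chapter: each view is a dictionary from data-point id to a feature list, `None` marks an unobserved value, and the view names `gene_expression` and `clinical`, the helper `missingness` and the toy values are all hypothetical.

```python
from typing import Dict, List, Optional

# Assumed representation (not from the chapter): a view maps each data-point
# id to its feature vector. None inside a vector marks an unobserved feature;
# None in place of the vector means the data-point is absent from the view.
Row = Optional[List[Optional[float]]]
View = Dict[str, Row]


def missingness(view: View, point: str) -> str:
    """Label one data-point in one view as complete, incomplete or missing."""
    row = view.get(point)
    if row is None or all(value is None for value in row):
        return "missing view"      # nothing observed for this data-point
    if any(value is None for value in row):
        return "incomplete view"   # some, but not all, features observed
    return "complete"


# Hypothetical toy data: two views over three data-points.
gene_expression: View = {
    "p1": [0.2, 1.5, 0.7],
    "p2": [0.9, None, 0.4],  # one feature unobserved
    "p3": [1.1, 0.3, 0.8],
}
clinical: View = {
    "p1": [54.0, 1.0],
    "p2": [61.0, 0.0],
    "p3": None,              # no record at all in this view
}

views = {"gene_expression": gene_expression, "clinical": clinical}
report = {
    name: {point: missingness(view, point) for point in ("p1", "p2", "p3")}
    for name, view in views.items()
}
```

In this toy data, `p2` is an incomplete-view case in `gene_expression`, while `p3` is a missing-view case: it is entirely absent from `clinical` but fully observed in `gene_expression`, which is the setting the chapter mainly addresses.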

S. Bhadra, Indian Institute of Technology, Palakkad, India
e-mail: [email protected]; https://sites.google.com/iitpkd.ac.in/sahelybhadra/home
Deepak P, A. Jurek-Loughrey (eds.), Linking and Mining Heterogeneous and Multi-view Data, Unsupervised and Semi-Supervised Learning, https://doi.org/10.1007/978-3-030-01872-6_1

1.1 Introduction

Multi-View Data. Multi-view learning is an emerging field that deals with data collected from multiple sources or “views” in order to utilize the complementary information in them. In the case of multi-view learning, the scientific data are collected from diverse