Algorithms for Big Data: DFG Priority Program 1736

296 pages · 2023 · 21.598 MB · English
Hannah Bast · Claudius Korzen · Ulrich Meyer · Manuel Penschuck (Eds.)

Algorithms for Big Data
DFG Priority Program 1736

Lecture Notes in Computer Science 13201

Founding Editors
Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis, Cornell University, Ithaca, NY, USA

Editorial Board Members
Elisa Bertino, Purdue University, West Lafayette, IN, USA
Wen Gao, Peking University, Beijing, China
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Moti Yung, Columbia University, New York, NY, USA

More information about this series at https://link.springer.com/bookseries/558

Editors
Hannah Bast, University of Freiburg, Freiburg im Breisgau, Germany
Claudius Korzen, University of Freiburg, Freiburg, Germany
Ulrich Meyer, Goethe University Frankfurt, Frankfurt, Germany
Manuel Penschuck, Goethe University Frankfurt, Frankfurt, Germany

ISSN 0302-9743 / ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-031-21533-9 / ISBN 978-3-031-21534-6 (eBook)
https://doi.org/10.1007/978-3-031-21534-6

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication.

Open Access: This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

Computer systems pervade all parts of human activity: transportation systems, energy supply, medicine, the whole financial sector, and modern science have become unthinkable without hardware and software support. As these systems continuously acquire, process, exchange, and store data, we live in a big-data world where information is accumulated at an exponential rate. The urgent problem has shifted from collecting enough data to dealing with its impetuous growth and abundance. In particular, data volumes often grow faster than the transistor budget of computers as predicted by Moore's law (i.e., doubling every 18 months).
On top of this, we can no longer rely on transistor budgets to automatically translate into application performance, since the speed improvement of single processing cores has basically stalled and the requirements of algorithms that use the full memory hierarchy get more and more complicated. As a result, algorithms have to be massively parallel and use memory access patterns with high locality. Furthermore, an x-times machine performance improvement only translates into x-times larger manageable data volumes if we have algorithms that scale nearly linearly with the input size. All these are challenges that need new algorithmic ideas. Last but not least, to have maximum impact, one should not only strive for theoretical results, but intend to follow the whole algorithm-engineering development cycle consisting of theoretical work followed by experimental evaluation.

The "curse" of big data in combination with increasingly complicated hardware has reached all kinds of application areas: genomics research, information retrieval (web search engines, ...), traffic planning, geographical information systems, and communication networks. Unfortunately, most of these communities do not interact in a structured way even though they are often dealing with similar aspects of big-data problems. Frequently, they face poor scale-up behaviour from algorithms that have been designed based on models of computation that are no longer realistic for big data.

About the SPP 1736

This volume surveys the progress in selected aspects of this important and growing field. It emerged from a research program established by the German Research Foundation (DFG) as priority program SPP 1736 on Algorithms for Big Data (https://www.big-data-spp.de) in 2013, where researchers from theoretical computer science worked together with application experts in order to tackle some of the problems discussed above.
The research program was prepared collaboratively by Susanne Albers, Hannah Bast, Kurt Mehlhorn, Ulrich Meyer (coordinator), Eugene Myers, Peter Sanders, Christian Scheideler, and Martin Skutella. The first meetings took place in Frankfurt/Main in 2012. Subsequently, a grant proposal was worked out and submitted to the DFG on October 15, and the program was granted in the spring meeting of the DFG Senate in 2013. The duration of the program was six years, divided into two periods of three years each.

A nationwide call for individual projects attracted over 40 proposals, out of which an international reviewer panel selected 15 funded research projects plus a coordination project (totalling about 20 full PhD student positions) by the end of 2013. Additionally, a few more projects with their own funding were associated in order to benefit from collaboration and joint events (workshops, PhD meetings, summer schools, etc.) organised by the SPP. The members of the priority program produced about 300 publications with more than 8200 citations by May 2022.

About This Book

The chapters of this volume summarize results of projects realized within the program and survey related work. More than half of them centrally deal with various aspects of algorithms for large and complex networks:

– In "Algorithms for Large-Scale Network Analysis and the NetworKit Toolkit" (Chapter 1), Eugenio Angriman, Alexander van der Grinten, Michael Hamann, Henning Meyerhenke, and Manuel Penschuck survey SPP contributions to a scalable software library for the analysis of huge networks. While their focus is on recent algorithmic contributions in the areas of centrality computations, community detection, and sparsification, they also cover aspects such as current software engineering principles of the project and ways to visualize network data within a NetworKit-based workflow.
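To make the notion of a centrality computation concrete, here is a minimal pure-Python sketch of closeness centrality (one breadth-first search per node). It is purely illustrative and not NetworKit code; the library provides optimized, parallel, cache-efficient implementations of this and many other centrality measures, which is exactly what makes it scale to huge networks where this O(n·(n+m)) approach would not.

```python
from collections import deque

def closeness(adj):
    """Closeness centrality of every node: (n-1) divided by the sum of
    shortest-path distances to all other nodes. `adj` maps each node to
    a list of its neighbours (undirected, connected graph assumed)."""
    n = len(adj)
    scores = {}
    for s in adj:
        # BFS from s to get hop distances to all reachable nodes.
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        scores[s] = (n - 1) / total if total else 0.0
    return scores

# A 4-node path graph 0 - 1 - 2 - 3: the inner nodes are more central.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(closeness(path))  # inner nodes score 0.75, endpoints 0.5
```

The quadratic-in-n cost of repeated BFS is precisely why approximate and sampling-based centrality algorithms, as surveyed in Chapter 1, matter for big-data graphs.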
– In "Generating Synthetic Graph Data from Random Network Models" (Chapter 2), Ulrich Meyer and Manuel Penschuck report on novel randomized graph instance generation algorithms which have been developed in SPP collaborations. The described implementations heavily exploit parallelism and/or cache-efficiency, and most of them have been integrated into NetworKit, too. Furthermore, several generators have been used to supplement experimental campaigns of SPP works described in subsequent chapters, including the following.

– In the two chapters "Increasing the Sampling Efficiency for the Link Assessment Problem" (Chapter 3) and "A Custom Hardware Architecture for the Link Assessment Problem" (Chapter 4), André Chinazzo, Christian De Schryver, Katharina Zweig, and Norbert Wehn provide an in-depth treatment of a specific network motif search problem, both from an algorithmic and a hardware-focused point of view. A link assessment (LA) algorithm can be used to clean up large network data sets with noisy data. Using instances of a particular type of random graphs discussed in the network generation chapter as a null model, the LA algorithm evaluates the structural similarities between the nodes, and thus differentiates meaningful relationships between nodes from noisy ones. After a detailed discussion of the algorithmic foundations (Chapter 3), the authors present the design of a dedicated hardware accelerator (Chapter 4) for solving the LA problem, which, compared to an Intel cluster, uses 38× less memory and is 1030× more energy efficient.

– In "Graph-Based Methods for Rational Drug Design" (Chapter 5), Andre Droschinsky, Lina Humbeck, Oliver Koch, Nils M. Kriege, Petra Mutzel, and Till Schäfer discuss computational methods for the goal-directed development of new drugs. The connection to graphs is based on the frequently valid assumption that chemical molecules with similar structure also show similar effects in a drug.
Hence, molecules are modelled as graphs with attributes, and large-scale graph algorithms for similarity search and clustering come into play. The authors provide an overview of recent results with a focus on the search for maximum common subgraphs and their extension to domain-specific requirements.

– In "Recent Advances in Practical Data Reduction" (Chapter 6), Faisal N. Abu-Khzam, Sebastian Lamm, Matthias Mnich, Alexander Noe, Christian Schulz, and Darren Strash discuss recent algorithm engineering work in the area of fixed-parameter tractable NP-hard problems. They survey recent trends in data reduction engineering results for selected problems in NP (like Independent Sets, Vertex Cover, Treewidth, Steiner Trees, etc.) and P (Minimum Cut and Matching). Furthermore, the authors describe techniques that may be useful for future implementations and list a number of open problems and research questions.

– In "Skeleton-Based Clustering by Quasi-Threshold Editing" (Chapter 7), Ulrik Brandes, Michael Hamann, Luise Häuser, and Dorothea Wagner report on SPP work on community detection in real-world graphs. They extend an earlier approach by Nastos and Gao, who proposed to view community detection as a graph modification problem: the input is to be transformed into a quasi-threshold graph with a minimum number of edge additions and deletions, and the resulting connected components determine the clustering. As minimizing the number of edit steps is NP-hard, existing solutions rely on heuristics. The authors of the chapter introduce a new linear-time heuristic for the inclusion-minimal variant of this edit problem and present improvements for the resulting clustering, both in terms of running time and quality.

– In "The Space Complexity of Undirected Graph Exploration" (Chapter 8), Yann Disser and Max Klimm consider a setting where an agent with small memory has to visit all vertices of a huge graph at least once.
The n vertices are indistinguishable for the agent, but at least the edges have a locally unique color, which can be exploited for the traversal. The authors revisit results for this setting showing that Θ(log n) bits of memory are necessary and sufficient for an agent to explore any graph with n vertices. Subsequently, they provide SPP results for collaborative exploration using several agents, each having sublogarithmic memory.

The topics of the chapters in the second part of this volume range from challenges in scalable cryptography, data streams, and energy-efficient scheduling to generic optimization and text (pre)processing including applications:

– In "Scalable Cryptography" (Chapter 9), Dennis Hofheinz and Eike Kiltz shed light on the quest for cryptographic methods that keep on working for significantly increased data set sizes. The security guarantees of currently used RSA encryption technology, for example, degrade linearly in the number of users and ciphertexts. This limits their applicability to smaller data sets or requires significantly larger key lengths, which in turn slows down and complicates the whole process (in particular if the key lengths are to grow dynamically). The authors discuss a number of settings in which it is possible to provide alternative scalable cryptographic building blocks. In particular, they survey SPP work on the construction of scalable public-key encryption schemes (a central cryptographic building block that helps secure communication), but also briefly mention other settings such as "reconfigurable cryptography".

– In "Distributed Data Streams" (Chapter 10), Jannik Castenow, Björn Feldkord, Jonas Hanselle, Till Knollmann, Manuel Malatyali, and Friedhelm Meyer auf der Heide consider a big-data scenario where a server is wirelessly connected to a huge number of sensor nodes that continuously measure data. At each time step the server needs to calculate a function defined over the current measurements of the sensors.
Due to the sensors' restricted compute and battery power, the communication between server and sensors has to be optimized, for example by minimizing the total number of messages using clever randomized protocols. The authors review SPP results for three concrete functions: Top-k Value Monitoring, Top-k Position Monitoring, and (Approximate) Count Distinct Monitoring.

– In "Energy-Efficient Scheduling" (Chapter 11), Susanne Albers reports on algorithmic techniques for energy reduction in processing environments where machine parameters can be changed at runtime. In the first part she addresses dynamic speed scaling: given a typically superlinear increase of energy consumption with rising processor speed, the goal is to cleverly use the whole speed range so as to minimize energy consumption while still providing the desired service. The author in particular reports on SPP results for multi-processor platforms with heterogeneous CPUs. She also examines power-down mechanisms (i.e., idle devices can be transitioned into low-power standby and sleep states) in multi-processor environments, where the active and idle periods of the components have to be carefully coordinated in order to maintain a guaranteed performance level.

– In "The GENO Software Stack" (Chapter 12), Joachim Giesen, Lars Kuehne, and Sören Laue present a domain-specific language for large-scale mathematical optimization called GENO (for generic optimization). The GENO software generates a solver from a specification of an optimization problem, i.e., objective function and constraints are specified in a formal language. The problem specification is then translated into a general normal form, which in turn is passed on to a general-purpose solver with optimized support for various hardware platforms, including GPUs, by carefully integrated BLAS (Basic Linear Algebra Subroutines) calls.
The authors show that, by putting all the components together, the generated solvers are competitive with problem-specific hand-written solvers and orders of magnitude faster than competing approaches that offer comparable ease of use.

– In "Algorithms for Big Data Problems in de Novo Genome Assembly" (Chapter 13), Anand Srivastav, Axel Wedemeyer, Christian Schielke, and Jan Schiemann address some algorithmic problems related to genome assembly. Concretely speaking, they first present an algorithm which significantly reduces the input data size without practically impacting the assembly quality. They then turn to the important subproblem of efficiently counting k-mers, for which they provide an external-memory solution. Further reconstruction steps boil down to the longest path problem and the Eulerian tour problem. In order to tackle those, they present a linear-time (per edge) streaming algorithm for heuristically constructing long paths in undirected graphs, and a streaming algorithm for the Euler tour problem with optimal one-pass complexity.

– In "Scalable Text Index Construction" (Chapter 14), Timo Bingmann, Patrick Dinklage, Johannes Fischer, Florian Kurpicz, Enno Ohlebusch, and Peter Sanders discuss the current state of the art in large-scale computation of text indices. When treating distributed, external, and shared-memory approaches for different text indices and their applications, the authors point out common techniques that are used in different models of computation or in the computation of different text indices. While most of the discussed work solely focuses on the construction of the text indices, they also show approaches to actually answer queries on text indices in distributed memory. In addition, they discuss real-world applications in bioinformatics and text compression, and future challenges.

We would like to thank all authors who submitted their work, the referees for their helpful comments, as well as the DFG for accepting and sponsoring the priority program SPP 1736 on Algorithms for Big Data.
We hope that this volume will prove useful for further research in big data algorithms.

May 2022

Hannah Bast
Claudius Korzen
Ulrich Meyer
Manuel Penschuck
