Research Directions for Principles of Data Management (Abridged)

Serge Abiteboul, Marcelo Arenas, Pablo Barceló, Meghyn Bienvenu, Diego Calvanese, Claire David, Richard Hull, Eyke Hüllermeier, Benny Kimelfeld, Leonid Libkin, Wim Martens, Tova Milo, Filip Murlak, Frank Neven, Magdalena Ortiz, Thomas Schwentick, Julia Stoyanovich, Jianwen Su, Dan Suciu, Victor Vianu, Ke Yi

In April 2016, a community of researchers working in the area of Principles of Data Management (PDM) joined in a workshop at the Dagstuhl Castle in Germany. The workshop was organized jointly by the Executive Committee of the ACM Symposium on Principles of Database Systems (PODS) and the Council of the International Conference on Database Theory (ICDT). The mission of the workshop was to identify and explore some of the most important research directions that have high relevance to society and to Computer Science today, and where the PDM community has the potential to make significant contributions. This article presents a summary of the report created by the workshop [4]. That report describes the family of research directions that the workshop focused on from three perspectives: potential practical relevance, results already obtained, and research questions that appear surmountable in the short and medium term. The report organizes the identified research challenges for PDM around seven core themes, namely Managing Data at Scale, Multi-model Data, Uncertain Information, Knowledge-enriched Data, Data Management and Machine Learning, Process and Data, and Ethics and Data Management. Since new challenges in PDM arise all the time, we note that this list of themes is not intended to be exclusive.

The Dagstuhl report is intended for a diverse audience, ranging from funding agencies, to universities and industrial research labs, to researchers and scientists who are exploring the many issues that arise in modern data management. The report is also intended for policy makers, sociologists, and philosophers, because it re-iterates the importance of considering ethics in many aspects of data creation, access, and usage, and suggests how research can help to find new ways for maximizing the benefits of massive data while nevertheless safeguarding the privacy and integrity of citizens and societies.
The field of PDM is broad. It has ranged from the development of formal frameworks for understanding and managing data and knowledge (including data models, query languages, ontologies, and transaction models) to data structures and algorithms (including query optimizations, data exchange mechanisms, and privacy-preserving manipulations). Data management is at the heart of most IT applications today, and will be a driving force in personal life, social life, industry, and research for the foreseeable future. We anticipate on-going expansion of PDM research as the technology and applications involving data management continue to grow and evolve.

PDM played a foundational role in the relational database model, with the robust connection between algebraic and calculus-based query languages, the connection between integrity constraints and database design, key insights for the field of query optimization, and the fundamentals of consistent concurrent transactions. This early work included rich cross-fertilization between PDM and other disciplines in mathematics and computer science, including logic, complexity theory, and knowledge representation. Since the 1990s we have seen an overwhelming increase in both the production of data and the ability to store and access such data. This has led to a phenomenal metamorphosis in the ways that we manage and use data. During this time, we have gone (1) from stand-alone disk-based databases to data that is spread across and linked by the Web, (2) from rigidly structured towards loosely structured data, and (3) from relational data to many different data models (hierarchical, graph-structured, data points, NoSQL, text data, image data, etc.). Research on PDM has developed during that time, too, following, accompanying and influencing this process. It has intensified research on extensions of the relational model (data exchange, incomplete data, probabilistic data, ...), on other data models (hierarchical, semi-structured, graph, text, ...), and on a variety of further data management areas, including knowledge representation and the semantic web, data privacy and security, and data-aware (business) processes. Along the way, the PDM community expanded its cross-fertilization with related areas, to include automata theory, web services, parallel computation, document processing, data structures, scientific workflow, business process management, data-centered dynamic systems, data mining, machine learning, information extraction, etc.

Looking forward, three broad themes in data management stand out where principled, mathematical thinking can bring new approaches and much-needed clarity. The first relates to the overall lifecycle of so-called "Big Data Analytics", that is, the application of statistical and machine learning techniques to make sense out of, and derive value from, massive volumes of data. As documented in numerous sources, so-called "data wrangling" can form 50% to 80% of the labor costs in an analytics investigation. As discussed in the Dagstuhl report, the PDM research areas of Managing Data at Scale, Knowledge-enriched Data, Multi-model Data, Uncertain Information, and Data Management and Machine Learning are all relevant to supporting Big Data Analytics. The second broad theme of data management where principled thinking can help stems from new forms of data creation and processing, especially as it arises in applications such as web-based commerce, social media applications, and data-aware workflow and business process management. The PDM research areas of Multi-model Data, Knowledge-enriched Data, Uncertain Information, and Process and Data are all relevant to this theme. These are providing approaches that make it easier to understand and process the myriad kinds of data and updates involved, and to enable higher degrees of confidence in transactional software that is used to process the data. The third broad theme, which is just beginning to emerge, is the development of new principles and approaches in support of ethical data management. Emerging research suggests that the use of mathematical principles in research on Ethics and Data Management can lead to new approaches to ensure data privacy for individuals, a broader perspective on notions of "fair" data dissemination and analysis, and compliance with government and societal regulations at the corporate level.

The findings of the Dagstuhl report differ from, and complement, the findings of the 2016 Beckman Report [1] in two main aspects. Both reports stress the importance of "Big Data" as the single largest driving force in data management usage and research in the current era. The current report focuses primarily on research challenges where a mathematically based perspective has had and will continue to have substantial impact. This includes for example new algorithms for large-scale parallelized query processing and Machine Learning, and models and languages for heterogeneous and uncertain information. The current report also considers additional areas where research into the principles of data management can make growing contributions in the coming years, including for example approaches for combining data structured according to different models, process taken together with data, and ethics in data management.

The remainder of this article includes overviews of the seven PDM research areas mentioned above, and a concluding section with comments about the road ahead for PDM research. The interested reader is referred to the full Dagstuhl Report [4] for more detail and references.

1. MANAGING DATA AT SCALE

Volume is still the most prominent feature of Big Data. The PDM community, as well as the general theoretical computer science community, has made significant contributions to efficient data processing at scale. Still, however, we face important practical challenges such as the following:
New Paradigms for Multi-Way Join Processing.
A celebrated result by Atserias, Grohe, and Marx [17] has sparked a flurry of research efforts in re-examining how multi-way joins should be computed. In all current relational database systems, a multi-way join is processed in a pairwise framework using a binary tree (plan), which is chosen by the query optimizer. However, recent theoretical studies have discovered that for many queries and data instances, even the best binary plan is suboptimal by a large polynomial factor. Several worst-case optimal algorithms have been designed in different computation models [70, 51, 20, 6], all of which have abandoned the binary tree paradigm in favor of a more holistic approach. In particular, leapfrog join [87] has been implemented inside a full-fledged database system. We believe that these newly developed algorithms have the potential to change how multi-way join processing is currently done in database systems. Of course, this can only be achieved with significant efforts for designing and implementing new query optimizers and cost estimation under the new paradigm.
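As a concrete illustration of this holistic, attribute-at-a-time style (a minimal Python sketch with toy relations and a variable order fixed by hand, not the actual leapfrog triejoin of [87]), the following fragment evaluates the triangle query Q(a, b, c) :- R(a, b), S(b, c), T(a, c) by intersecting candidate values for one variable at a time instead of materializing a pairwise intermediate result:

```python
# Attribute-at-a-time evaluation of the triangle query
#   Q(a, b, c) :- R(a, b), S(b, c), T(a, c)
# in the spirit of worst-case optimal join algorithms. Toy data; the
# variable order (a, b, c) is an illustrative choice.

from collections import defaultdict

def index(rel):
    """Index a binary relation as {first attribute: set of second attributes}."""
    idx = defaultdict(set)
    for x, y in rel:
        idx[x].add(y)
    return idx

def triangle_join(R, S, T):
    R_idx, S_idx, T_idx = index(R), index(S), index(T)
    out = []
    for a in set(R_idx) & set(T_idx):        # candidates for a: appears in R and T
        for b in R_idx[a] & set(S_idx):      # candidates for b: R(a, b) and S(b, _)
            for c in S_idx[b] & T_idx[a]:    # candidates for c: S(b, c) and T(a, c)
                out.append((a, b, c))
    return out

if __name__ == "__main__":
    R = [(1, 2), (1, 3), (2, 3)]
    S = [(2, 3), (3, 1), (3, 4)]
    T = [(1, 3), (1, 4), (2, 1)]
    print(sorted(triangle_join(R, S, T)))    # [(1, 2, 3), (1, 3, 4), (2, 3, 1)]
```

On skewed inputs, a pairwise plan such as (R join S) join T can materialize an intermediate result far larger than the final output; that gap is exactly the suboptimality that the bound of [17] makes precise.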
Approximate Query Processing.
The area of online aggregation [49] studies new algorithms that can return approximate results (with statistical guarantees) for analytical queries at early stages of the processing, so that the user can terminate the computation as soon as the accuracy is acceptable. Recent studies have shown encouraging results [48, 61], but there is still much room for improvement: (1) The existing algorithms have only used simple random sampling or random walks to sample from query results. More sophisticated techniques based on Markov Chain Monte Carlo might be more effective. (2) The streaming algorithms community has developed many techniques to summarize large data sets into compact data structures while preserving properties of the data. These summarization techniques can also be useful in approximate query processing. (3) Integrating these techniques into data processing engines is still a significant challenge.
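The basic online-aggregation loop can be sketched in a few lines: grow a uniform sample of the input, report a running estimate together with a rough confidence interval, and stop once a user-specified accuracy is reached. This is a minimal sketch under simplifying assumptions (a single-table SUM, sampling with replacement, a CLT-style 95% interval); the table, batch size, and stopping threshold are illustrative.

```python
# Minimal online-aggregation sketch: estimate SUM(value) over a large table
# from a growing uniform random sample, reporting a rough 95% confidence
# interval after each batch. All parameters are illustrative.

import random
import statistics

def online_sum(table, batch=1_000, target_rel_error=0.01, seed=0):
    rng = random.Random(seed)
    n = len(table)
    sample = []
    while len(sample) < n:
        sample.extend(rng.choice(table) for _ in range(batch))
        mean = statistics.fmean(sample)
        stdev = statistics.pstdev(sample)
        estimate = n * mean                               # scale sample mean to the full table
        half_width = 1.96 * n * stdev / len(sample) ** 0.5
        yield estimate, half_width
        if estimate != 0 and half_width / abs(estimate) <= target_rel_error:
            return                                        # accuracy acceptable: stop early

if __name__ == "__main__":
    rng = random.Random(42)
    table = [rng.uniform(0, 100) for _ in range(1_000_000)]
    for estimate, half_width in online_sum(table):
        print(f"SUM ~ {estimate:,.0f} +/- {half_width:,.0f}")
```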
These practical challenges raise the following theoretical challenges:

The Relationship Among Various Big Data Computation Models.
The theoretical computer science community has developed many beautiful models of computation aimed at handling data sets that are too large for the traditional random access machine (RAM) model, the most prominent ones including the parallel RAM (PRAM), the external memory (EM) model, the streaming model, and the BSP model and its recent refinements to model modern distributed architectures. Several studies seem to suggest that there are deep connections between seemingly unrelated Big Data computation models for streaming computation, parallel processing, and external memory, especially for the class of problems of interest to the PDM community (e.g., relational algebra) [80, 45, 58]. Investigating this relationship would reveal the inherent nature of these problems with respect to scalable computation, and would also allow us to leverage the rich set of ideas and tools that the theory community has developed over the decades.

The Communication Complexity of Parallel Query Processing.
New large-scale data analytics systems use massive parallelism to support complex queries on datasets. These systems use clusters of servers and proceed in multiple communication rounds. In these systems, the communication cost is usually the bottleneck, and therefore has become the main measure of complexity for algorithms designed for these models. Recent studies (e.g., [20]) have established tight bounds on the communication cost for computing join queries, but many questions remain open: (1) The existing bounds are tight only for one-round algorithms. However, new large-scale systems like Spark have greatly improved the efficiency of multi-round iterative computation, so the one-round limit seems unnecessary. The communication complexity of multi-round computation remains largely open. (2) The existing work has only focused on a small set of queries (full conjunctive queries), while many other types of queries remain unaddressed. Broadly, there is great interest in large-scale machine learning using these systems, so it is important to study the communication complexity of classical machine learning tasks under these models. This is developed in more detail in Section 5, which summarizes research opportunities at the crossroads of data management and machine learning.
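To make the communication-cost measure concrete, the following toy simulates a single-round, hash-partitioned equi-join in the spirit of the massively parallel models discussed above: each tuple is routed to the server obtained by hashing its join key, every server joins its fragment locally, and the cost charged is the maximum number of tuples received by any one server. The data, number of servers, and accounting are illustrative assumptions, not the formal model of [20].

```python
# Toy one-round simulation of a hash-partitioned parallel equi-join
# R(a, b) JOIN S(b, c) on b over p servers. The "communication cost" charged
# is the maximum load (tuples received) at any single server.

from collections import defaultdict

def hash_join_one_round(R, S, p=4):
    servers_R = defaultdict(list)
    servers_S = defaultdict(list)
    for a, b in R:                          # route each R-tuple by its join key
        servers_R[hash(b) % p].append((a, b))
    for b, c in S:                          # route each S-tuple by its join key
        servers_S[hash(b) % p].append((b, c))

    max_load = max(len(servers_R[i]) + len(servers_S[i]) for i in range(p))

    out = []
    for i in range(p):                      # every server joins its own fragment
        local = defaultdict(list)
        for a, b in servers_R[i]:
            local[b].append(a)
        for b, c in servers_S[i]:
            out.extend((a, b, c) for a in local[b])
    return out, max_load

if __name__ == "__main__":
    R = [(i, i % 10) for i in range(1000)]          # heavily skewed join keys
    S = [(i % 10, i) for i in range(1000)]
    result, load = hash_join_one_round(R, S, p=4)
    print(len(result), "output tuples; maximum per-server load =", load)
```

Skewed keys, as in the toy data, concentrate the load on a few servers; characterizing the best achievable maximum load for a given query is exactly what the one-round bounds address.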
We think that the following techniques will be useful in handling these challenges: statistics, sampling and approximation theory, communication complexity, information theory, and convex optimization.

2. MULTI-MODEL DATA: AN OPEN ECOSYSTEM OF DATA MODELS

Over the past 20 years, the landscape of available data has dramatically changed. While the huge amount of available data is perceived as a clear asset, exploiting this data meets the challenges of the "4 V's" mentioned in the Introduction.

One particular aspect of the variety of data is the existence and coexistence of different models for semi-structured and unstructured data, in addition to the widely used relational data model. Examples include tree-structured data (XML, JSON), graph data (RDF, property graphs, networks), tabular data (CSV), temporal and spatial data, text, and multimedia. We can expect that in the near future, new data models will arise in order to cover particular needs. Importantly, data models include not only a data structuring paradigm, but also approaches for queries, updates, integrity constraints, views, integration, and transformation, among others.

Following the success of the relational data model, originating from the close interaction between theory and practice, the PDM community has been working for many years towards understanding each one of the aforementioned models formally. Classical DB topics (schema and query languages, query evaluation and optimization, incremental processing of evolving data, dealing with inconsistency and incompleteness, data integration and exchange, etc.) have been revisited. This line of work has been successful from both the theoretical and practical points of view. As these questions are not yet fully answered for the existing data models and will be asked again whenever new models arise, it will continue to offer practically relevant theoretical challenges. But what we view as a new grand challenge is the coexistence and interconnection of all these models, complicated further by the need to be prepared to embrace new models at any time.

The coexistence of different data models resembles the fundamental problem of data heterogeneity within the relational model, which arises when semantically related data is organized under different schemas. This problem has been tackled by data integration and data exchange, but since these classical solutions were proposed, the nature of available data has changed dramatically, making the questions open again. This is particularly evident in the Web scenario, where not only does the data come in huge amounts, in different formats, distributed, and constantly changing, but it also comes with very little information about its structure and almost no control of the sources. Thus, while the existence and coexistence of various data models is not new, the recent changes in the nature of available data raise a strong need for a new principled approach for dealing with different data models: an approach flexible enough to allow keeping the data in its original format (and be open for new formats), while still providing a convenient unique interface to handle data from different sources. It faces the following four specific practical challenges.

Modelling Data.
How does one turn raw data into a database? Could we create methodologies allowing engineers to design a new data model?

Understanding Data.
How does one make sense of the data? Could we help the user and systems to understand the data without first discovering its structure in full?

Accessing Data.
How does one extract information? How can we help users formulate queries in a more uniform way?

Processing Data.
How does one evaluate queries efficiently?

These practical challenges raise concrete theoretical problems, some of which go beyond the traditional scope of PDM. Within PDM, the key theoretical challenges are the following.

Schema Languages.
Design flexible and robust multi-model schema languages. Multi-model schema languages should offer a uniform treatment of different models, the ability to describe mappings between models (implementing different views on the same data, in the spirit of data integration), and the flexibility to seamlessly incorporate new models as they emerge.

Schema Extraction.
Provide efficient algorithms to extract schemas from the data, or at least discover partial structural information (cf. [23, 26]). The long-standing challenge of entity resolution is exacerbated in the context of finding correspondences between datasets structured according to different models [85].

Visualization of Data and Metadata.
Develop user-friendly paradigms for presenting the metadata information and statistical properties of the data in a way that helps in formulating queries. This requires understanding and defining what the relevant information in a given context is, and representing it in a way that allows efficient updates as the context changes (cf. [30, 15]).

Query Languages.
Go beyond bespoke query languages for the specific data models [13] and design a query language suitable for multi-model data, either incorporating the specialized query languages as sub-languages or offering a uniform approach to querying, possibly at the cost of reduced expressive power or higher complexity.

Evaluation and Optimization.
Provide efficient algorithms for computing meaningful answers to a query, based on structural information about data, both inter-model and intra-model; this can be tackled either directly [56, 46] or via static optimization [21, 33].

All these problems require strong tools from PDM and theoretical computer science in general (complexity, logic, automata, etc.). But solving them will also involve knowledge and techniques from neighboring communities. For example, the second, third and fifth challenges naturally involve data mining and machine learning aspects (see Section 5). The first, second, and third raise knowledge representation issues (see Section 4). The first and fourth will require expertise in programming languages. The fifth is at the interface between PDM and algorithms, but also between PDM and systems. The third raises human-computer interaction issues.
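As a toy illustration of what a "convenient unique interface" over several data models might feel like, the sketch below exposes a small relational table and a JSON document as streams of flat records, so that one generic filter-and-project query runs over both. The adapter functions, data, and query are illustrative assumptions, not a proposal for an actual multi-model language.

```python
# Toy "single interface over several data models": a relational table and a
# JSON document are both exposed as streams of flat records, so one generic
# query runs over either source. All names and data are illustrative.

import json

relational_people = [("alice", 34, "Paris"), ("bob", 29, "Oslo")]

json_people = json.loads("""
  {"people": [
     {"name": "carol", "age": 41, "address": {"city": "Rome"}},
     {"name": "dave",  "age": 25, "address": {"city": "Paris"}}]}
""")

def records_from_relation(rows):
    for name, age, city in rows:
        yield {"name": name, "age": age, "city": city}

def records_from_json(doc):
    for person in doc["people"]:
        yield {"name": person["name"], "age": person["age"],
               "city": person["address"]["city"]}

def query(sources, where, select):
    for source in sources:
        for record in source:
            if where(record):
                yield tuple(record[attr] for attr in select)

if __name__ == "__main__":
    sources = [records_from_relation(relational_people),
               records_from_json(json_people)]
    # "Who lives in Paris?", asked once over both models.
    print(list(query(sources,
                     where=lambda r: r["city"] == "Paris",
                     select=("name", "age"))))   # [('alice', 34), ('dave', 25)]
```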
3. UNCERTAIN INFORMATION

Incomplete, uncertain, and inconsistent information is ubiquitous in data management applications. This was recognized already in the 1970s [32], and since then the significance of the issues related to incompleteness and uncertainty has been steadily growing: it is a fact of life that the data we need to handle on an everyday basis is rarely complete. However, while the data management field developed techniques specifically for handling incomplete data, their current state leaves much to be desired, both theoretically and practically. Even after 40+ years of relational technology, when evaluating SQL queries over incomplete databases one gets results that make people say "you can never trust the answers you get from [an incomplete] database" [34]. In fact we know that SQL can produce every type of error imaginable when nulls are present [63].

On the theory side, we appear to have a good understanding of what is needed in order to produce correct results: computing certain answers to queries. These are answers that are true in all complete databases that are compatible with the given incomplete database. This idea, which dates back to the late 1970s as well, has become the standard way of providing query answers in all applications, from classical databases with incomplete information [53] to new applications such as data integration and exchange [59, 14], consistent query answering [22], ontology-based data access [29], and others. The reason these ideas have found limited application in mainstream database systems is their complexity. Typically, answering queries over incomplete databases with certainty can be done efficiently for conjunctive queries or some closely related classes, but beyond these the complexity quickly becomes intractable, and sometimes even undecidable; see [62]. Since this cannot be tolerated by real-life systems, they resort to ad hoc solutions, which go for efficiency and sacrifice correctness; thus bizarre and unexpected behavior occurs.
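The definition of certain answers can be illustrated with a brute-force toy: represent unknown values as nulls, enumerate every completion of the database over a small finite domain, run the query on each completion, and keep only the answers returned by all of them. The relation, query, and domain below are illustrative assumptions; real systems clearly cannot enumerate completions, which is where the complexity problem arises.

```python
# Brute-force illustration of certain answers over an incomplete database.
# Nulls are represented by None; every completion over a small domain is
# generated, and an answer is certain only if every completion produces it.

from itertools import product

def completions(rows, domain):
    """Yield all complete versions of `rows`, filling each None from `domain`."""
    null_slots = [(i, j) for i, row in enumerate(rows)
                  for j, value in enumerate(row) if value is None]
    for choice in product(domain, repeat=len(null_slots)):
        filled = [list(row) for row in rows]
        for (i, j), value in zip(null_slots, choice):
            filled[i][j] = value
        yield [tuple(row) for row in filled]

def certain_answers(rows, query, domain):
    answer_sets = [query(db) for db in completions(rows, domain)]
    return set.intersection(*answer_sets)

if __name__ == "__main__":
    # Employee(name, dept); the department of "bob" is unknown.
    employee = [("alice", "hr"), ("bob", None)]
    # Query: names of employees in department 'hr'.
    q = lambda db: {name for name, dept in db if dept == "hr"}
    print(certain_answers(employee, q, domain={"hr", "it"}))   # {'alice'}
```

Here "alice" is a certain answer, while "bob" is only a possible one.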
While even basic problems related to incompleteness in relational databases remain unsolved, we now constantly deal with more varied types of incomplete and inconsistent data. A prominent example is that of probabilistic databases [81], where the confidence in a query answer is the total weight of the worlds that support the answer. Just like certain answers, computing exact answer probabilities is usually intractable, and yet it has been the focus of theoretical research.

The key challenge in addressing the problem of handling incomplete and uncertain data is to provide theoretical solutions that are usable in practice. Instead of proving more impossibility results, the field should urgently address what can actually be done efficiently.

Making theoretical results applicable in practice is the biggest practical challenge for incomplete and uncertain data. To move away from the focus on intractability and to produce results of practical relevance, the PDM community needs to address several challenges.

RDBMS Technology in the Presence of Incomplete Data.
It must be capable of finding query answers one can trust, and do so efficiently. But how do we find good quality query answers with correctness guarantees when we have theoretical intractability? For this we need new approximation schemes, quite different from those that have traditionally been used in the database field. Such schemes should provide guarantees that answers can be trusted, and should also be implementable using existing RDBMS technology.

Models of Uncertainty.
What is provided by current practical solutions is rather limited. Looking at relational databases, we know that they try to model everything with primitive null values, but this is clearly insufficient. We need to understand the types of uncertainty that need to be modeled and introduce appropriate representation mechanisms.

Benchmarks for Uncertain Data.
What should we use as benchmarks when working with incomplete/uncertain data? Quite amazingly, this has not been addressed; in fact standard benchmarks tend to just ignore incomplete data, making it hard to test the efficiency of solutions in practice.

Handling Inconsistent Data.
How do we make handling inconsistency (in particular, consistent query answering) work in practice? How do we use it in data cleaning? Again, there are many strong theoretical results here, but they concentrate primarily on tractability boundaries and various complexity dichotomies for subclasses of conjunctive queries, rather than the practicality of query answering techniques. There are promising works on enriching theoretical repairs with user preferences [78], or ontologies [44], along the lines of approaches described in Section 4, but much more foundational work needs to be done before they can get to the level of practical tools.

Handling Probabilistic Data.
The common models of probabilistic databases are arguably simpler and more restricted than the models studied by the Statistics and Machine Learning communities. Yet common complex models can be simulated by probabilistic databases if one can support expressive query languages [55]; hence, model complexity can be exchanged for query complexity. Therefore, it is of great importance to develop techniques for approximate query answering, on expressive query languages, over large volumes of data, with practical execution costs.
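The possible-worlds semantics behind these models, and the reason exact computation blows up, can be spelled out directly at toy scale for a tuple-independent database: enumerate every subset of the tuples, weight it by the product of the marginal probabilities, and add up the weights of the worlds in which the query holds. The relation and query below are illustrative assumptions.

```python
# Possible-worlds semantics for a tuple-independent probabilistic database:
# the confidence of a Boolean query is the total probability of the worlds
# that satisfy it (exponential in the number of tuples; a toy only).

from itertools import combinations

def query_probability(tuples_with_prob, boolean_query):
    tuples = list(tuples_with_prob)
    total = 0.0
    for k in range(len(tuples) + 1):
        for world in combinations(tuples, k):
            present = {t for t, _ in world}
            weight = 1.0
            for t, p in tuples:
                weight *= p if t in present else (1.0 - p)
            if boolean_query(present):
                total += weight
    return total

if __name__ == "__main__":
    # Flight(src, dst) tuples with marginal probabilities.
    flights = [(("PAR", "NYC"), 0.9), (("NYC", "SFO"), 0.5), (("PAR", "SFO"), 0.2)]
    # Boolean query: can one fly PAR -> SFO directly or with one stop in NYC?
    def reachable(world):
        direct = ("PAR", "SFO") in world
        via_nyc = ("PAR", "NYC") in world and ("NYC", "SFO") in world
        return direct or via_nyc
    print(round(query_probability(flights, reachable), 4))   # 0.56
```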
The theoretical challenges can be split into three groups.

Modeling.
We need to provide a solid theoretical basis for the practical modeling challenge above; this means understanding different types of uncertainty and their representations. As with any type of information stored in databases, there are lots of questions for the PDM community to work on, related to data structures, indexing techniques, and so on.

Reasoning.
There is much work on this subject; see Section 4 concerning the need to develop next-generation reasoning tools for data management tasks. When it comes to using such tools with incomplete and uncertain data, the key challenges are: How do we do inference with incomplete data? How do we integrate different types of uncertainty? How do we learn queries on uncertain data? What do query answers actually tell us if we run queries on data that is uncertain? That is, how can results be generalized from a concrete incomplete data set?

Algorithms.
To overcome high complexity, we often need to resort to approximate algorithms, but the approximation techniques are different from the standard ones used in databases, as they do not just speed up evaluation but rather ensure correctness. The need for such approximations leads to a host of theoretical challenges. How do we devise such algorithms? How do we express correctness in relational data and beyond? How do we measure the quality of query answers? How do we take user preferences into account?

4. KNOWLEDGE-ENRICHED DATA MANAGEMENT

Over the past two decades we have witnessed a gradual shift from a world where most data used by companies and organizations was regularly structured, neatly organized in relational databases, and treated as complete, to a world where data is heterogeneous and distributed, and can no longer be treated as complete. Moreover, not only do we have massive amounts of data; we also have very large amounts of rich knowledge about the application domain of the data, in the form of taxonomies or full-fledged ontologies, and rules about how the data should be interpreted, among other things. Techniques and tools for managing such complex information have been studied extensively in Knowledge Representation, a subarea of Artificial Intelligence. In particular, logic-based formalisms, such as description logics and different rule-based languages, have been proposed and associated reasoning mechanisms have been developed. However, work in this area did not put a strong emphasis on the traditional challenges of data management, namely huge volumes of data, and the need to specify and perform complex operations on the data efficiently, including both queries and updates.

Both practical and theoretical challenges arise when rich domain-specific knowledge is combined with large amounts of data and the traditional data management requirements, and the techniques and approaches coming from the PDM community will provide important tools to address them. We discuss first the practical challenges.

Providing End Users with Flexible and Integrated Access to Data.
A key requirement in dealing with complex, distributed, and heterogeneous data is to give end users the ability to directly manage such data. This is a challenge since end users might have deep expertise about a specific domain of interest, but in general are not data management experts. Ontology-based data management has been proposed recently as a general paradigm to address this challenge.
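In its simplest caricature, ontology-based data management enriches the stored facts with the consequences of domain rules and then answers queries over the enriched facts. The sketch below saturates a toy database with a small subclass taxonomy; the facts and rules are illustrative assumptions, and real OBDA systems rely on query rewriting and description-logic reasoning rather than naive saturation.

```python
# Toy ontology-based data access: saturate the data with simple
# "premise implies conclusion" class rules (a subclass taxonomy), then
# answer queries over the enriched facts. Facts and rules are illustrative.

def saturate(facts, rules):
    """Apply rules of the form (premise_class, conclusion_class) to facts
    (class, individual) until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for cls, individual in list(facts):
                if cls == premise and (conclusion, individual) not in facts:
                    facts.add((conclusion, individual))
                    changed = True
    return facts

if __name__ == "__main__":
    data = {("Professor", "ada"), ("Student", "bob")}
    ontology = [("Professor", "FacultyMember"),   # every professor is a faculty member
                ("FacultyMember", "Employee"),    # every faculty member is an employee
                ("Student", "UniversityMember"),
                ("Employee", "UniversityMember")]
    enriched = saturate(data, ontology)
    # Query: who is a UniversityMember? Answered over the enriched facts.
    print(sorted(ind for cls, ind in enriched if cls == "UniversityMember"))
    # ['ada', 'bob']
```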
Ensuring Interoperability at the Level of Systems Exchanging Data.
Enriching data with knowledge is not only relevant for providing end-user access, but also enables direct inter-operation between systems, based on the exchange of data and knowledge at the system level. A specific area where this is starting to play an important role is e-commerce, where standard ontologies are already available [50].

Personalized and Context-Aware Data Access and Management.
Information is increasingly individualized, and only fragments of the available data and knowledge might be relevant in specific situations or for specific users. Heterogeneity needs to be dealt with, both with respect to the modeling formalism and with respect to the modeling structures chosen to capture a specific real-world phenomenon.

Bringing Knowledge to Data Analytics and Data Extraction.
Increasing amounts of data are being collected to perform complex analysis and predictions. Currently, such operations are mostly based on data in "raw" form, but there is a huge potential for increasing their effectiveness by enriching and complementing such data with domain knowledge, and leveraging this knowledge during the data analytics and extraction process.

Making the Management User Friendly.
Systems combining large amounts of data with complex knowledge are themselves very complex, and thus difficult to design and maintain. Appropriate tools that support all phases of the life-cycle of such systems need to be designed and developed, based on novel user interfaces for the various components.

To provide adequate solutions to the above practical challenges, several key theoretical challenges need to be addressed, requiring a blend of formal techniques and tools traditionally studied in data management with those typically adopted in knowledge representation in AI.

Development of Reasoning-Tuned DB Systems.
Such systems will require new or improved database engines optimized for reasoning over large amounts of data and knowledge, able to compute both crisp and approximate answers, and to perform distributed reasoning and query evaluation.

Choosing/Designing the Right Languages.
The languages and formalisms adopted in the various components of knowledge-enriched data management systems have to support different types of knowledge and data, e.g., mixing open and closed world assumptions, and allowing for representing temporal, spatial, and other modalities of information [27, 18, 25, 16, 69].

New Measures of Complexity.
To appropriately assess the performance of such systems and be able to distinguish easy cases that seem to work well in practice from difficult ones, alternative complexity measures are required that go beyond the traditional worst-case complexity. These might include suitable forms of average-case or parameterized complexity, complexity taking into account data distribution (on the Web), and forms of smoothed analysis.

Next-Generation Reasoning Services.
The kinds of reasoning services that become necessary in the context of knowledge-enriched data management applications go well beyond traditional reasoning studied in knowledge representation, which typically consists of consistency checking, classification, and retrieval of class instances. The forms of reasoning that are required include processing of complex forms of queries in the presence of knowledge, explanation (which can be considered as a generalization of provenance), abductive reasoning, hypothetical reasoning, inconsistency-tolerant reasoning, and defeasible reasoning to deal with exceptions.

Incorporating Temporal and Dynamic Aspects.
A key challenge is represented by the fact that data and knowledge are not static, and change over time, e.g., due to updates on the data while taking into account knowledge, forms of streaming data, and more generally data manipulated by processes. Dealing with dynamicity and providing forms of inference (e.g., formal verification) in the presence of both data and knowledge is extremely challenging and will require the development of novel techniques and tools [28, 16].

In summary, incorporating domain-specific knowledge into data management is both a great opportunity and a major challenge. It opens up huge possibilities for making data-centric systems more intelligent, flexible, and reliable, but entails computational and technical challenges that need to be overcome. We believe that much can be achieved in the coming years. Indeed, the increasing interaction of the PDM and the Knowledge Representation communities has been very fruitful, particularly by attempting to understand the similarities and differences between the formalisms and techniques used in both areas, and obtaining new results building on mutual insights. Further bridging this gap through the close collaboration of both areas appears as the most promising way of fulfilling the promises of Knowledge-enriched Data Management.
5. DATA MANAGEMENT AND MACHINE LEARNING

We believe that Data Management (DM) and Machine Learning (ML) can mutually benefit from each other. Nowadays, systems that emerge from the ML community are strong in their capabilities of statistical reasoning, and systems that emerge from the DM community are strong in their support for semantics, maintenance and scale. This complementarity in assets is accompanied by a difference in the core mechanisms: the PDM community has largely adopted methodologies driven by logic, while the ML community has centered around probability and statistics. Yet, modern applications require systems that are strong in both aspects, providing a thorough and sophisticated management of data while incorporating its inherent statistical nature.

We envision a plethora of research opportunities in the intersection of PDM and ML. We outline several directions, which we classify into two categories: DM for ML and ML for DM. The required methodologies and formal foundations span a variety of related fields such as logic, formal languages, computational complexity, statistical analysis, and distributed computing.

Category DM for ML

Key challenges in this area are as follows.

Feature Generation and Engineering.
Feature engineering refers to the challenge of designing and extracting signals to provide to the general-purpose ML algorithm at hand, in order to properly perform the desired operation. This is a critical and time-consuming task [57], and a central theme of modern ML methodologies. Unlike usual ML algorithms that view features as numerical values, the database has access to, and understanding of, the queries that transform raw data into these features. Thus, PDM can contribute to feature engineering in various ways, especially on a semantic level, and provide solutions to problems such as the following: How to develop effective languages for query-based feature creation? How to use such languages for designing a set of complementary features optimally suited for the ML task at hand? Is a given language suitable for a certain ML task? Important criteria for the goodness of a feature language include the risks of under/overfitting the training data, as well as the computational complexity of evaluation. The PDM community has already studied problems of a similar nature [47].

The promise of deep (neural network) learning brings substantial hope for reducing the effort in manual feature engineering. Is there a general way of solving ML tasks by applying deep learning directly to the database (as has already been done, for example, with semantic hashing [74])? Can database queries (of different languages) complement neural networks by means of expressiveness and/or efficiency?
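One concrete reading of query-based feature creation is that every feature is a small query over the raw records, and the per-entity vectors those queries produce are what a general-purpose learner consumes. The schema and the four feature queries below are illustrative assumptions only; a real feature language would be declarative rather than ad hoc Python.

```python
# Query-based feature creation: each "feature" is a small aggregate query over
# raw order records; the per-customer feature vectors are what an ML algorithm
# would then consume. Schema, data, and queries are illustrative.

from datetime import date

orders = [  # (customer, order_date, amount)
    ("c1", date(2016, 1, 5), 120.0),
    ("c1", date(2016, 3, 2),  40.0),
    ("c2", date(2016, 2, 14), 15.0),
    ("c2", date(2016, 2, 20), 22.5),
    ("c2", date(2016, 3, 1),   9.0),
]

# Each feature is a query mapping a customer to a number.
features = {
    "order_count":  lambda c: sum(1 for cu, _, _ in orders if cu == c),
    "total_spend":  lambda c: sum(a for cu, _, a in orders if cu == c),
    "max_order":    lambda c: max((a for cu, _, a in orders if cu == c), default=0.0),
    "active_march": lambda c: float(any(cu == c and d.month == 3
                                        for cu, d, _ in orders)),
}

def feature_vector(customer):
    return {name: query(customer) for name, query in features.items()}

if __name__ == "__main__":
    for customer in ("c1", "c2"):
        print(customer, feature_vector(customer))
    # These vectors would be handed to any general-purpose learning algorithm.
```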
Large-Scale Machine Learning.
Machine learning is nowadays applied to massive data sets of considerable size, including potentially unbounded streams of data. Under such conditions, effective data management and the use of appropriate data structures that offer the learning algorithm fast access to the data are major prerequisites for realizing model induction and inference in an efficient manner [72]. Research along this direction has amplified in recent years and includes, for example, the use of hashing [88], Bloom filters [31], and tree-based data structures [38] in learning algorithms. Related to this is work on distributed machine learning, where data storage and computation are accomplished in a network of distributed units [7], and the support of machine learning by data stream management systems [67].

Complexity Analysis.
The PDM community has established a strong machinery for fine-grained analysis of querying complexity; see, e.g., [10]. Complexity analysis of such granularity is highly desirable for the ML community, especially for analyzing learning algorithms that involve various parameters like I/O dimension and number of training examples [54]. Results along this direction, connecting DM querying complexity and ML training complexity, have recently been shown [75].

Category ML for DM

Data management systems support a core set of querying operators (e.g., relational algebra, grouping and aggregate functions, recursion) that are considered the common requirement of applications. We believe that this core set should be revisited, and specifically that it should be extended with common ML operators.

Incorporating ML features is a natural evolution for PDM. Database systems with such features have already been developed [77, 12]. Query languages have traditionally been designed with emphasis on being declarative: a query states how the answer should logically relate to the database, not how it is to be computed algorithmically. Incorporating ML introduces a higher level of declarativity, where one states how the end result should behave (via examples), but not necessarily which query is deployed for the task. In that spirit, we propose the following directions for PDM research.

Unified Models.
An important role of the PDM community is in establishing common formalisms and semantics for the database community. It is therefore an important opportunity to establish the "relational algebra" of data management systems with built-in ML/statistics operators.

Lossy Optimization.
From the early days, the focus of the PDM community has been on lossless optimization, i.e., optimization that does not modify the final result [76, 89]. As mentioned in Section 1, in some scenarios it makes sense to apply lossy optimization that guarantees only an approximation of the answer. Incorporating ML into the query model gives further opportunities for lossy optimization, as training paradigms are typically associated with built-in quality (or "risk") functions. Hence, we may consider reducing the execution cost if it entails a bounded impact on the quality of the end result [9].

Confidence Estimation.
Once statistical and ML components are incorporated in a data management system, it becomes crucial to properly estimate the confidence in query answers [77]. It is then an important direction to establish probabilistic models that capture the combined process and allow estimating the probabilities of end results. For example, by applying the notion of the VC-dimension, an important theoretical concept in generalization theory, to database queries, Riondato et al. [73] provide accurate bounds for their selectivity estimates that hold with high probability. This direction can leverage the past decade of research on probabilistic databases [82], which can be combined with theoretical frameworks of machine learning, such as PAC learning [86].
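The flavor of such guarantees can be conveyed by a much simpler stand-in: estimate the selectivity of a predicate from a uniform sample and attach a distribution-free Hoeffding bound to the estimate. The VC-dimension-based results of [73] are considerably sharper and hold simultaneously for entire query classes; the table, predicate, and parameters below are illustrative assumptions.

```python
# Simple stand-in for selectivity estimation with a probabilistic guarantee:
# estimate a predicate's selectivity from a uniform sample and attach a
# distribution-free Hoeffding bound. (The VC-dimension bounds of [73] are
# sharper; this only illustrates "estimate plus guarantee".)

import math
import random

def estimate_selectivity(table, predicate, sample_size=2_000, delta=0.05, seed=1):
    rng = random.Random(seed)
    sample = [rng.choice(table) for _ in range(sample_size)]
    p_hat = sum(1 for row in sample if predicate(row)) / sample_size
    # With probability at least 1 - delta, the true selectivity is within eps.
    eps = math.sqrt(math.log(2 / delta) / (2 * sample_size))
    return p_hat, eps

if __name__ == "__main__":
    rng = random.Random(7)
    table = [(i, rng.uniform(0, 1000)) for i in range(1_000_000)]
    sel, eps = estimate_selectivity(table, lambda row: row[1] < 250.0)
    print(f"estimated selectivity ~ {sel:.3f} +/- {eps:.3f} (with 95% confidence)")
```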
6. PROCESS AND DATA

Many forms of data evolve over time, and most processes access and modify data sets. Industry works with massive volumes of evolving data, primarily in the form of transactional systems and Business Process Management (BPM) systems. Over the past half century, computer science research has studied foundational issues of process and of data mainly as separated phenomena, but research into basic questions about systems that combine process and data has been growing over the past decade. Two key areas where data and process have been studied together are scientific workflows and data-aware BPM [52].

In the 1990's and 00's, foundational research in scientific workflow helped to establish the basic frameworks for supporting these workflows, to enable the systematic recording and use of provenance information, and to support systems for exploration that involve multiple runs of a workflow with varying configurations [36]. Foundational work on data-aware BPM began in the mid-00's [24, 41], enabled in part by IBM's "Business Artifacts" model for business process [71], that combines data and process in a holistic manner. Deutch and Milo [39] provide a survey and comparison of several of the most important early models and results on process and data. One variant of the business artifact model has provided the conceptual basis for the recent OMG Case Management Model and Notation (CMMN) standard [65]. Rich work on verification for data-aware processes has emerged [28, 40], and the artifact-based perspective is enabling an approach to managing the interaction of business processes and legacy data systems [83].

Foundational work in the area of process and data has the potential for continued and expanded impact in the following six practical challenge areas.

Automating Manual Processes.
While many business processes have been automated using techniques from the BPM field, there are many other processes that are still manual, often because high levels of variation make it cost prohibitive to automate using current techniques.

Evolution and Migration of Business Processes.
Managing change of business processes remains largely manual, highly expensive, time consuming, and risk-prone.

Business Process Compliance and Correctness.
Compliance with government regulations and corporate policies is a rapidly growing challenge, e.g., as governments attempt to enforce policies around financial stability and data privacy. Ensuring compliance is largely manual today, and involves understanding how regulations can impact or define portions of business processes, and then verifying that process executions will comply.

Business Process Interaction and Interoperation.
Managing business processes that flow across enterprise boundaries has become increasingly important with globalization of business and the splintering of business activities across numerous companies. The recent industrial interest in shared ledger technologies, e.g., Blockchain, highlights the importance of this area.

Business Process Discovery and Understanding.
The field of Business Intelligence, which provides techniques for mining and analyzing information about business operations, is essential to business success, but is today based on a broad variety of largely ad hoc and manual techniques [37].

Workflow and Business Process Usability.
Enabling people to understand and work effectively to manage large numbers of process definitions and process instances remains elusive, especially when considering the interactions between process, data (both newly created and legacy), resources, the workforce, and business partners.

The above practical BPM challenges raise key research challenges that need to be addressed using approaches that include mathematical and algorithmic frameworks and tools.

Verification and Static Analysis.
Because of the infinite state space inherent in data-aware processes [28, 40], verification currently relies on faithful abstractions reducing the problem to classical finite-state model checking. Further work is needed to develop more powerful abstractions, address new application areas, enable incremental verification techniques, and enable modular styles of verification that support "plug and play" approaches.

Tools for Design and Synthesis.
Although compilers and relational database design have both benefited from solid mathematical foundations (context-free grammars and dependency theory, respectively), there is still no robust framework that supports principled design of business processes in the larger context of data, resources, and workforce.

Models and Semantics for Views, Interaction, and Interoperation.
A robust theory of views for data-aware business processes has the potential to enable substantial advances in the simplification of process comparison, process composition, process interoperation, process out-sourcing, and process privacy (e.g., see [3]).

Analytics for Business Processes.
The new, more holistic perspective of data-aware processes can help to provide a new foundation for the field of business intelligence, including new approaches for instrumenting processes to simplify data discovery [64], and new styles of modularity and hierarchy in both the processes and the analytics on them.

Research in process and data will require on-going extensions of the traditional approaches, on both the database and process-centric sides, and also extensions along the lines just mentioned. A new foundational model for modern BPM may emerge, which builds on the artifact and shared-ledger approaches but facilitates a multi-perspective understanding, analogous to the way relational algebra and calculus provide two perspectives on data querying.

One cautionary note is that research in the area of process and data today is hampered by a lack of large sets of examples, e.g., sets of process schemas that include explicit specifications concerning data, and process histories that include how data sets were used and affected. More broadly, increased collaboration between PDM researchers, applied BPM researchers, and businesses would enable more rapid progress towards resolving the concrete problems in BPM faced by industry today.
7. HUMAN-RELATED DATA & ETHICS

More and more "human-related" data is massively generated, in particular on the Web and in phone apps. Massive data analysis, using data parallelism and machine learning techniques, is applied to this data to generate more data. We, individually and collectively, are losing control over this data. We do not know the answers to questions as important as: Is my medical data really available so that I get proper treatment? Is it properly protected? Can a private company like Google or Facebook influence the outcome of national elections? Should I trust the statistics I find on the Web about the crime rate in my neighborhood?

Although we keep eagerly consuming and enjoying more new Web services and phone apps, we have growing concerns about criminal behavior on the Web, including racist, terrorist, and pedophile sites; identity theft; cyber-bullying; and cyber crime. We are also feeling growing resentment against intrusive government practices such as massive e-surveillance even in democratic countries, and against aggressive company behaviors such as invasive marketing, unexpected personalization, and cryptic or discriminatory business decisions.

Societal impact of big data technologies is receiving significant attention in the popular press [11], and is under active investigation by policy makers [68] and legal scholars [19]. It is broadly recognized that this technology has the potential to improve people's lives, accelerate scientific discovery and innovation, and bring about positive societal change. It is also clear that the same technology
