PhD-FSTC-2016-43 The Faculty of Science, Technology and Communication Dissertation Defence held on 08/11/2016 in Luxembourg to obtain the degree of Docteur de l’Université du Luxembourg en Informatique by Thomas Hartmann Born on 13th November 1981 in Kaufbeuren (Germany) Enabling Model-Driven Live Analytics For Cyber-Physical Systems: The Case of Smart Grids Dissertation defence committee Prof. Dr. Nicolas Navet, chairman Professor, University of Luxembourg, Luxembourg, Luxembourg Dr. François Fouquet, vice-chairman Research Associate, University of Luxembourg, Luxembourg, Luxembourg Prof. Dr. Yves Le Traon, supervisor Professor, University of Luxembourg, Luxembourg, Luxembourg Prof. Dr. Jordi Cabot, member Professor, Universitat Oberta de Catalunya, Castelldefels (Barcelona), Spain Prof. Dr. François Taïani, member Professor, Université de Rennes 1, Rennes Cedex, France Dr. Jacques Klein, expert Senior Research Scientist, University of Luxembourg, Luxembourg, Luxembourg Abstract Advances in software, embedded computing, sensors, and networking technologies will lead to a new generation of smart cyber-physical systems that will far exceed the capa- bilities of today’s embedded systems. They will be entrusted with increasingly complex tasks like controlling electric grids or autonomously driving cars. These systems have the potential to lay the foundations for tomorrow’s critical infrastructures, to form the basis of emerging and future smart services, and to improve the quality of our everyday lives in many areas. In order to solve their tasks, they have to continuously monitor and collect data from physical processes, analyse this data, and make decisions based on it. Making smart decisions requires a deep understanding of the environment, in- ternal state, and the impacts of actions. Such deep understanding relies on e cient data models to organise the sensed data and on advanced analytics. Considering that cyber-physical systems are controlling physical processes, decisions need to be taken very fast. This makes it necessary to analyse data in live, as opposed to conventional batch analytics. However, the complex nature combined with the massive amount of data generated by such systems impose fundamental challenges. While data in the context of cyber-physical systems has some similar characteristics as big data, it holds a particular complexity. This complexity results from the complicated physical phe- nomena described by this data, which makes it dicult to extract a model able to explain such data and its various multi-layered relationships. Existing solutions fail to provide sustainable mechanisms to analyse such data in live. This dissertation presents a novel approach, named model-driven live analytics. The main contribution of this thesis is a multi-dimensional graph data model that brings raw data, domain knowledge, and machine learning together in a single model, which can drive live analytic processes. This model is continuously updated with the sensed data and can be leveraged by live analytic processes to support decision-making of cyber-physical systems. The presented approach has been developed in collaboration with an industrial partner and, in form of a prototype, applied to the domain of smart grids. The addressed challenges are derived from this collaboration as a response to shortcomings in the current state of the art. More specifically, this dissertation provides solutions for the following challenges: First, data handled by cyber-physical systems is usually dynamic—data in motion as opposed to traditional data at rest—and changes frequently and at di↵erent paces. Analysing such data is challenging since data models usually can only represent a snapshot of a system at one specific point in time. A common approach consists in a discretisation, which regularly samples and stores such snapshots at specific times- tamps to keep track of the history. Continuously changing data is then represented as a finite sequence of such snapshots. Such data representations would be very ine cient to analyse, since it would require to mine the snapshots, extract a relevant dataset, and finally analyse it. For this problem, this thesis presents a temporal graph data model and storage system, which consider time as a first-class property. A time-relative navigation concept enables to analyse frequently changing data very eciently. i Secondly, making sustainable decisions requires to anticipate what impacts certain actions would have. Considering complex cyber-physical systems, it can come to sit- uations where hundreds or thousands of such hypothetical actions must be explored before a solid decision can be made. Every action leads to an independent alternative from where a set of other actions can be applied and so forth. Finding the sequence of actions that leads to the desired alternative, requires to e ciently create, represent, and analyse many di↵erent alternatives. Given that every alternative has its own his- tory, this creates a very high combinatorial complexity of alternatives and histories, which is hard to analyse. To tackle this problem, this dissertation introduces a multi- dimensional graph data model (as an extension of the temporal graph data model) that enables to e ciently represent, store, and analyse many di↵erent alternatives in live. Thirdly, complex cyber-physical systems are often distributed, but to fulfil their tasks these systems typically need to share context information between computational en- tities. This requires analytic algorithms to reason over distributed data, which is a complex task since it relies on the aggregation and processing of various distributed and constantly changing data. To address this challenge, this dissertation proposes an approach to transparently distribute the presented multi-dimensional graph data model in a peer-to-peer manner and defines a stream processing concept to eciently handle frequent changes. Fourthly, to meet future needs, cyber-physical systems need to become increasingly intelligent. To make smart decisions, these systems have to continuously refine be- havioural models that are known at design time, with what can only be learned from live data. Machine learning algorithms can help to solve this unknown behaviour by extracting commonalities over massive datasets. Nevertheless, searching a coarse- grained common behaviour model can be very inaccurate for cyber-physical systems, which are composed of completely di↵erent entities with very di↵erent behaviour. For these systems, fine-grained learning can be significantly more accurate. However, mod- elling, structuring, and synchronising many fine-grained learning units is challenging. To tackle this, this thesis presents an approach to define reusable, chainable, and in- dependently computable fine-grained learning units, which can be modelled together with and on the same level as domain data. This allows to weave machine learning directly into the presented multi-dimensional graph data model. In summary, this thesis provides an e cient multi-dimensional graph data model to enable live analytics of complex, frequently changing, and distributed data of cyber- physical systems. This model can significantly improve data analytics for such systems and empower cyber-physical systems to make smart decisions in live. The presented so- lutions combine and extend methods from model-driven engineering, Acknowledgments This work has been funded by the National Research Fund Luxembourg (grant 6816126) and Creos Luxembourg S.A. under the SnT-Creos partnership program. The PhD experience goes beyond research, experimentations, and paper writing. It is indeed a challenging life experience that started in August 2013 and which outcome owes much to the support and help of many people. First of all, I want to express my sincere thanks to my supervisor Prof. Dr. Yves Le Traon for giving me the opportunity to pursue my PhD studies within his group and under his supervision. He always encouraged me, had a permanent confidence in me, and supported me throughout these years. I have learned a lot from his rigorous scientific guidance as a researcher and from his positive, motivating, and open-minded attitude as a team leader. I am equally grateful to my co-supervisor Dr. Jacques Klein for his advice, optimism, countless discussions, guidance, and for always emphasising the bright side of things. My special thanks goes to my daily advisor Dr. Fran¸cois Fouquet for his patience, advice, and flawless guidance throughout the sometimes daunting world of academia. He taught me how to do research, conduct rigorous experiments, write scientific papers, and always pushed me a step further. I am very happy about the friendship we have built up during the years. I am grateful to the members of my dissertation committee, Prof. Dr. Jordi Cabot, Prof. Dr. Fran¸cois Ta¨ıani, and Prof. Dr. Nicolas Navet, for their time to review my work and for providing interesting and valuable feedback. My sincere thanks also goes to Yves Reckinger and Robert Graglia from Creos for the many fruitful discussions and the time they found to collaborate with us. It was really useful and rewarding to me to be able to apply my research on a concrete industrial case. I would also like to express my warm thanks to all current and former members of the SerVal team for the plenty good co↵ee breaks, discussions, and the great times we had. I wish all of them the very best. In particular, I want to thank Dr. Assaad Moawad, Dr. Gr´egory Nain, and Matthieu Jimenez for their continuous feedback and proofreading. I also want to express my thanks to all my co-authors and to the people I have been in touch with during my PhD journey and that are not explicitly mentioned here. Finally, and more personally, I would like to express my heartfelt thanks to my family and my friends for their continuous and unconditional support during these last years. The role they all played in this journey is more important than they can possibly know. iii Contents List of abbreviations and acronyms xi List of figures xv List of tables xvii List of algorithms and listings xviii 1 Introduction 1 1.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 The smart grid case study . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.1 The smart grid vision . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.2 Smart grids in the context of this thesis . . . . . . . . . . . . . . 6 1.3 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.4.2 Challenges addressed in this thesis . . . . . . . . . . . . . . . . 11 1.5 Approach: model-driven live analytics . . . . . . . . . . . . . . . . . . . 14 1.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.7 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 I Background and state of the art 19 2 Background 21 2.1 Data analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.1.1 Taxonomy of data analytics . . . . . . . . . . . . . . . . . . . . 22 2.1.2 Batch and (near) real-time analytics . . . . . . . . . . . . . . . 23 2.1.3 Complex event processing . . . . . . . . . . . . . . . . . . . . . 24 2.1.4 Extract-transform-load and extract-load-transform . . . . . . . 24 2.1.5 OLAP and OLTP . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2 Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2.1 Model-driven engineering . . . . . . . . . . . . . . . . . . . . . . 26 2.2.2 MOF: The Meta Object Facility . . . . . . . . . . . . . . . . . . 27 2.2.3 Contents 2.4 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3 State of the art 41 3.1 Analysing data of cyber-physical systems . . . . . . . . . . . . . . . . . 42 3.2 Data analytics platforms . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2.1 Online analytical processing (OLAP) . . . . . . . . . . . . . . . 42 3.2.2 The Hadoop stack . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2.3 The Spark stack . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.3 Stream processing frameworks . . . . . . . . . . . . . . . . . . . . . . . 48 3.4 Graph processing frameworks . . . . . . . . . . . . . . . . . . . . . . . 53 3.5 Graph databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.6 Analysing data in motion . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.6.1 Temporal databases . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.6.2 Temporal RDF and OWL . . . . . . . . . . . . . . . . . . . . . 64 3.6.3 Model versioning . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.6.4 Time series databases . . . . . . . . . . . . . . . . . . . . . . . . 65 3.6.5 Temporal graph processing . . . . . . . . . . . . . . . . . . . . . 66 3.7 Exploring hypothetical actions . . . . . . . . . . . . . . . . . . . . . . . 70 3.8 Reasoning over distributed data in motion . . . . . . . . . . . . . . . . 72 3.9 Combining domain knowledge and machine learning . . . . . . . . . . . 74 3.10 Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 II Analysing data in motion and what-if analysis 79 4 A continuous temporal data model to eciently analyse data in mo- tion 81 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.2 Time as a first-class property . . . . . . . . . . . . . . . . . . . . . . . 84 4.3 Continuous validity of model elements . . . . . . . . . . . . . . . . . . 85 4.4 Navigating in time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.4.1 Selecting model element versions . . . . . . . . . . . . . . . . . 87 4.4.2 Time-relative navigation . . . . . . . . . . . . . . . . . . . . . . 87 4.5 Storing temporal data . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.6 Implementation details and API . . . . . . . . . . . . . . . . . . . . . . 90 4.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.7.1 KPI-1: Model updates . . . . . . . . . . . . . . . . . . . . . . . 92 4.7.2 KPI-2: Navigating the context model in time . . . . . . . . . . 94 4.7.3 KPI-3: Storing temporal data . . . . . . . . . . . . . . . . . . . 95 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5 A multi-dimensional graph data model to support what-if analysis 99 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.2 Motivating example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.3 Many-world graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.3.1 Key concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.3.2 Many-world graph semantics . . . . . . . . . . . . . . . . . . . . 105 5.3.3 Base graph (BG) . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.3.4 Temporal graph (TG) . . . . . . . . . . . . . . . . . . . . . . . 106 vi Contents 5.3.5 Many-world graph (MWG) . . . . . . . . . . . . . . . . . . . . . 108 5.4 MWG implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.4.1 Mapping graph nodes to state chunks . . . . . . . . . . . . . . . 110 5.4.2 Indexing and resolving state chunks . . . . . . . . . . . . . . . . 112 5.4.2.1 Index time tree (ITT) . . . . . . . . . . . . . . . . . . 112 5.4.2.2 World index maps (WIM) . . . . . . . . . . . . . . . . 113 5.4.2.3 Chunk resolution algorithm . . . . . . . . . . . . . . . 114 5.4.3 Scaling the processing of graphs . . . . . . . . . . . . . . . . . . 115 5.4.4 Querying and traversing graphs . . . . . . . . . . . . . . . . . . 115 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 5.5.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . 117 5.5.2 Base graph benchmarks . . . . . . . . . . . . . . . . . . . . . . 117 5.5.3 Temporal graph benchmarks . . . . . . . . . . . . . . . . . . . . 118 5.5.4 MWG benchmarks of a node . . . . . . . . . . . . . . . . . . . . 119 5.5.5 MWG benchmarks of a graph . . . . . . . . . . . . . . . . . . . 120 5.5.6 Deep what-if simulations . . . . . . . . . . . . . . . . . . . . . . 121 5.5.7 Smart grid case study . . . . . . . . . . . . . . . . . . . . . . . 122 5.5.8 Discussion and perspectives . . . . . . . . . . . . . . . . . . . . 123 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 III Reasoning over distributed data and combining domain knowledge with machine learning 125 6 A peer-to-peer distribution and stream processing model 127 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.2 Reactive distributed Contents 7.2.5.1 Weaving learned attributes into domain classes . . . . 154 7.2.5.2 Defining a learning scope for coarse-grained learning in domain models . . . . . . . . . . . . . . . . . . . . . . 154 7.2.5.3 Modelling relations between learning units and domain classes . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 7.2.5.4 Decomposing complex learning tasks into several micro learning units . . . . . . . . . . . . . . . . . . . . . . . 155 7.2.6 Framework implementation details . . . . . . . . . . . . . . . . 156 7.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 7.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 157 7.3.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 7.3.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 7.4 Discussion: meta learning and meta modelling . . . . . . . . . . . . . . 161 7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 IV Industrial application and conclusion 163 8 Industrial application: electric overload prediction and warning 165 8.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 8.1.1 The Creos partnership . . . . . . . . . . . . . . . . . . . . . . . 166 8.1.2 The REASON project . . . . . . . . . . . . . . . . . . . . . . . 166 8.2 Smart grid meta model . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 8.3 Electric overload prediction and warning . . . . . . . . . . . . . . . . . 168 8.4 Electric load approximation . . . . . . . . . . . . . . . . . . . . . . . . 170 8.4.1 General considerations . . . . . . . . . . . . . . . . . . . . . . . 170 8.4.2 Topology scenarios . . . . . . . . . . . . . . . . . . . . . . . . . 173 8.4.2.1 Single cable . . . . . . . . . . . . . . . . . . . . . . . . 173 8.4.2.2 Cabinet connecting several cables . . . . . . . . . . . . 173 8.4.2.3 Parallel cables . . . . . . . . . . . . . . . . . . . . . . 174 8.4.3 Considering active and reactive energy . . . . . . . . . . . . . . 175 8.4.4 Deriving the electric load . . . . . . . . . . . . . . . . . . . . . . 175 8.4.5 Integration into the smart grid meta model . . . . . . . . . . . . 177 8.5 Predicting consumption behaviour . . . . . . . . . . . . . . . . . . . . . 177 8.5.1 General considerations . . . . . . . . . . . . . . . . . . . . . . . 177 8.5.2 Live machine learning . . . . . . . . . . . . . . . . . . . . . . . 178 8.5.3 Gaussian mixture models . . . . . . . . . . . . . . . . . . . . . . 178 8.5.4 Profiling power consumption . . . . . . . . . . . . . . . . . . . . 179 8.5.5 Integration into the smart grid meta model . . . . . . . . . . . . 180 8.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 8.6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 181 8.6.2 Performance of electric load approximation . . . . . . . . . . . . 182 8.6.3 Accuracy of electric load approximation . . . . . . . . . . . . . 182 8.6.4 Eciency of electric consumption prediction . . . . . . . . . . . 183 8.6.5 Accuracy of electric consumption prediction . . . . . . . . . . . 184 8.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 9 Conclusion 187 9.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 viii