Department of Mathematics and Computer Science
Architecture of Information Systems Research Group

Realizing a Process Cube Allowing for the Comparison of Event Data

Master Thesis

Tatiana Mamaliga

Supervisors:
prof. dr. ir. W.M.P. van der Aalst
MSc J.C.A.M. Buijs
dr. G.H.L. Fletcher

Final version
Eindhoven, August 2013

Contents

1 Introduction
  1.1 Context
  1.2 Challenges - Then & Now
  1.3 Assignment Description
  1.4 Approach
  1.5 Thesis Structure
2 Preliminaries
  2.1 Business Intelligence
  2.2 Process Mining
    2.2.1 Concepts and Definitions
    2.2.2 ProM Framework
  2.3 OLAP
    2.3.1 Concepts and Definitions
    2.3.2 The Many Flavors of OLAP
3 Process Cube
  3.1 Process Cube Concept
  3.2 Process Cube by Example
    3.2.1 From XES Data to Process Cube Structure
    3.2.2 Applying OLAP Operations to the Process Cube
    3.2.3 Materialization of Process Cells
  3.3 Requirements
  3.4 Comparison to Other Hypercube Structures
4 OLAP Open Source Choice
  4.1 Existing OLAP Open Source Tools
  4.2 Advantages & Disadvantages
  4.3 Palo - Motivation of Choice
5 Implementation
  5.1 Architectural Model
  5.2 Event Storage
  5.3 Load/Unload of the Database
  5.4 Basic Operations on the Database Subsets
    5.4.1 Dice & Slice
    5.4.2 Pivoting
    5.4.3 Drill-down & Roll-up
  5.5 Integration with ProM
  5.6 Result Visualization
6 Case Study and Benchmarking
  6.1 Evaluation of Functionality
    6.1.1 Synthetic Benchmark
    6.1.2 Real-life Log Data Example
  6.2 Performance Analysis
  6.3 Discussion
7 Conclusions & Future Work
  7.1 Summary of Contributions
  7.2 Limitations
    7.2.1 Conceptual Level
    7.2.2 Implementation Level
  7.3 Further Research

Abstract

Continuous efforts to improve processes require a deep understanding of their inner workings. In this context, the process mining discipline aims at discovering process behavior from historical records, i.e., event logs. Process mining results can be used for analysis of process dynamics. However, mining on realistic event logs is difficult due to complex interdependencies within a process. Therefore, to gain more in-depth knowledge about a certain process, it can be split into subprocesses, which can then be separately analysed and compared. Typical tools for process mining, e.g., ProM, are designed to handle a single event log at a time, which does not particularly facilitate the comparison of multiple processes. To tackle this issue, Van der Aalst proposed in [4] to organize the event log in a cubic data structure, called process cube, with a selection of the event attributes forming the dimensions of the cube.

Although multidimensional data structures are already employed in various business intelligence tools, the data used has a static character. This is in stark contrast to process mining, since event data characterizes a dynamic process that evolves in time. The aim of this thesis is to develop a framework that supports the construction of the process cube and permits multidimensional filtering on it, in order to separate subcubes for further processing. We start with the OLAP foundation and reformulate its corresponding operations for event logs. Moreover, the semantics of a traditional OLAP aggregate are changed. Numerical aggregates are substituted by sublog data.
With these adjustments, a tool is developed and integrated as a plugin in ProM to support the aforementioned operations on the event logs. The user can unload sublogs from the process cube, give them as parameters to other plug-ins in ProM and visualize different results simultaneously.

During the development of the tool, we had to deal with a shortcoming of multidimensional database technologies when storing event logs, i.e., the sparsity of the resulting process cube. Sparsity in multidimensional data structures occurs when a large number of cells in a cube are empty, i.e., there are missing data values at the intersection of dimensions. Taking a single attribute of an event log as a dimension in the process cube results in a very sparse multidimensional data structure. As a result, the computational time required to unload a sublog for processing increases dramatically. This shortcoming was addressed by designing a hybrid database structure that combines a high-speed in-memory multidimensional database with a sparsity-immune relational database. Within this solution, only a subset of event attributes actually contributes to the construction of the process cube, whereas the rest are stored in the relational database and used further only for event log reconstruction. The hybrid database solution proved to provide the flexibility needed for real-life logs, while keeping response times acceptable for efficient user interaction. The applicability of the tool was demonstrated using two event log examples, a synthetic event log and a real-life event log from the CoSeLog project. The thesis concludes with a detailed loading and unloading performance analysis of the developed hybrid structure, for different database configurations.

Keywords: event log, relational database, in-memory database, OLAP, process mining, visualization, performance analysis

Chapter 1
Introduction

The greatest challenge to any thinker is stating the problem in a way that will allow a solution.
Bertrand Russell, British author, mathematician, & philosopher (1872 - 1970)

This thesis completes my graduation project for the Computer Science and Engineering master at Eindhoven University of Technology (TU/e). The project was conducted in the Architecture of Information Systems (AIS) group. The AIS group has a distinct research reputation and is specialized in process modeling and analysis, process mining and Process-Aware Information Systems (PAIS).

The process mining field, detailed further in this chapter, provides valuable analysis techniques and tools, but also faces a series of challenges. The main issues are large data streams and rapid changes over time. This project creates a proof-of-concept prototype, which considers the so-called process cube concept as a starting point for possible solutions to the above-mentioned challenges. The outcome is further used for visual comparison of event data.

This chapter describes the assignment within its scientific context. Section 1.1 provides the research background. Section 1.2 enumerates the most important advances in process mining and identifies the current issues in the field. Section 1.3 specifies the problem and the project objectives. Section 1.4 continues with a short summary of the proposed solution. Finally, Section 1.5 provides an overview of the remaining chapters of the thesis.

1.1 Context

Technology has become an integral part of any organization. For example, current systems and installations are heavily controlled and monitored remotely by integrated internet technologies [23]. Moreover, employing automated solutions in any line-of-business has become a trend. As a result, Enterprise Systems software, offering a seamless integration of all the information flowing through a company [22], is used in any modern organization.

Enterprise Information Systems (EIS) keep businesses running, improve service times and thus attract more clients. Still, like in every complex system, there are multiple points where things can go wrong.
System errors, fraud, security issues and inefficient distribution of tasks are just a few to mention. To cope with these issues, EIS had to extend its function-oriented enterprise applications with Business Intelligence (BI) techniques. That is, BI applications have been installed to support management in measuring a company’s performance and deriving appropriate decisions [39]. Among the most important functions of BI are online analytical processing (OLAP), data mining, business performance management and predictive analytics.

Being aware of the existing problems in an organization and applying standardized solutions to solve them is usually not enough. Consider a doctor who always prescribes painkillers independent of the patient’s complaints. Of course, this kind of pill will temporarily relieve the pain, but it will not treat the real disease. A good doctor should run tests, identify the root causes of the health problem and only then give an adequate treatment. This is what the process mining field tries to accomplish. It goes beyond analyzing merely individual data records and instead focuses on the underlying process which glues event data together. A deep understanding of the inside of a process can point to notorious deviations, persistent bottlenecks and unnecessary rework.

All in all, technology has a major impact on organizations and has proved to be an enabler for business process improvement. Therefore, by means of business intelligence, and process mining in particular, new opportunities are constantly exploited to keep pace with challenges such as change.

1.2 Challenges - Then & Now

In the context of today’s rapidly changing environment, organizations are looking for new solutions to keep their businesses running efficiently. Slogans such as “Driving the Change” (Renault), “Changes for the Better” (Mitsubishi Semiconductor), “Empowering Change” (Credit Suisse First Boston) and “New Thinking. New Possibilities” (Hyundai) are used more and more often. Furthermore, different areas of business research are trying to keep up with the change, and process mining is not an exception.

In 2011, the Process Mining Manifesto [7] was released to describe the state of the art in process mining on the one hand, and its current challenges on the other hand. A year later, the project proposal “Mining Process Cubes from Event Data (PROCUBE)” in [4] suggested the so-called process cube as a solution direction for some of these challenges. In the context of currently employed process mining solutions and using the Process Mining Manifesto as a reference, the PROCUBE project proposal presents several challenges that process mining is currently facing:

From “small” event data to “big” event data.
Due to increased storage capacity and advanced technologies, the vast amount of available event data has become difficult to control and analyse. Most of the traditional process mining techniques operate with event logs whose size does not exceed several thousand cases and a couple of hundred thousand events (for example, in BPI Challenge [2] files). However, nowadays corporations work on a different scale of event logs. Giants like Royal Dutch Shell, Walmart and IBM would rather consider millions of events (a day or even a second), and this number will continue to grow. Ways to ensure that event data growth will not affect the importance of process mining techniques are constantly sought.

From homogeneous to heterogeneous processes.
With the increasing complexity of an event log, chances are that the variability in its corresponding process increases as well. For example, events in an event log can present different levels of abstraction. However, many mining techniques assume that all events in an event log are logged at the same level of abstraction. In that sense, the diverse event log characteristics have to be properly considered.

From one to many processes.
Many companies have their agencies spread across the globe. Let’s take SAP AG as an example.
Its research and development units alone are located on four continents, and it has regional offices all around the world. That is, SAP units are executing basically the same set of processes. Still, this does not exclude possible variations. For instance, there might be various influences due to the characteristics of a certain SAP distribution region (Germany, India, Brazil, Israel, Canada, China, and others). Traditional process mining is oriented toward stand-alone business processes. However, it is of great importance to be able to compare business processes of different organizations (or units of one organization). For example, efficient and less efficient paths in different processes can be identified. Inefficient paths can be substituted and efficient paths can be applied to the rest of the processes to improve performance.

From steady-state to transient behavior.
Change has a major impact not only on the size of event logs and on the necessity of dealing with many processes together, but also on the state of a business process. For example, companies should be able to quickly adjust to different business requirements. As a result, their corresponding processes undergo different modifications. Current process mining techniques assume business processes to be in a steady state [5]. However, it is important to understand the changing nature of a process and to react appropriately. The notion of concept drift was introduced in process mining [33] to capture these second-order dynamics. Its target is to discover and analyze the dynamics of a process by detecting and adapting to change patterns in the ongoing work.

From offline to online.
As previously mentioned, systems produce an overwhelming amount of information. The idea of storing it as historical event data for later analysis, as it is currently done, may not seem as appealing any more. Instead, the emphasis should be more on the present and the future of an event.
That is, an event should be analysed on the fly, and predictions on the contingency of its occurrence should be made based on existing historical data. As such, online analysis of event data is yet another process mining challenge.

Each of the issues discussed above is extremely challenging. Analysing large-scale event logs is difficult with the current process mining techniques. Solutions that mitigate some of the issues arising with large-scale event logs are proposed in [14], e.g., event log simplification and dealing with less-structured processes. A framework for time-based operational support is described in [8]. In [16], an approach is offered to compare collections of process models corresponding to different Dutch municipalities. Nevertheless, there is still the need for more elaborate solutions and a unified way of approaching them.

1.3 Assignment Description

Stand-alone process analysis is the common way of analysing processes in today’s process mining approaches. However, inspecting a process as a single entity impedes observing differences and similarities with other processes. Let’s take a simple example from the airline industry. There is a constant discussion about which of the low-cost airlines, Ryanair or Wizzair, offers better services. There are both advantages and disadvantages of traveling with either of these two. Generally, Ryanair is considered more punctual than Wizzair¹. To determine why Ryanair is more on-time with flights than Wizzair, we compare their processes. We notice that while at Wizzair the luggage is checked only once, Ryanair is very strict with the luggage procedure and checks it twice before embarking. As a result, passengers and crew are not busy with “fitting” luggage that does not fit, and the aisle of the aircraft is kept free for new passengers coming on board. By minimizing the turnaround time, the airline punctuality improves.
The procedure of checking the luggage may not be the only factor that improves the punctuality of the Ryanair airline, but it is clear from the comparison of the two airline processes that it contributes to reducing the flight delays. In conclusion, the comparison of the two processes helped in answering a specific question and identifying parts of these processes that can be further improved.

¹ http://www.flightontime.info/scheduled/scheduled.html

When it comes to comparison of large processes, it is difficult to inspect processes entirely at a glance. Splitting and merging different parts of a process can offer more insightful details. Let’s consider the following scenario. In the car manufacturing process, there is a final polishing inspection step. Several resources check whether there is a scratch on a car that needs to be polished. During the last two weeks, it was noticed that one polishing crew worked slower than the others. To identify the cause of this issue, the car manufacturing process is analysed. First, the process is split by department type and the polishing department is selected. Then, only the process corresponding to the resources of this specific crew is isolated. The following aspects are inspected: the car type, the engine type, the color type. When filtering by car type and engine type, it seems that there are no patterns indicating a potential delay. However, when inspecting the subprocesses corresponding to different car colors, a pattern emerges. The average working time of polishing a red car is much higher than that of polishing cars of a different color. Since red cars take, in general, more time to be polished than other cars, this indicates that there is a problem in the painting department. The red-colored cars are not painted properly and therefore need constant polishing. While at the beginning it seemed like the crew was responsible for the delays, in fact, the crew members were just polishing more red-colored cars.
Since red-colored cars required more polishing due to a painting issue, the crew worked slower compared to the other crews. Without filtering the initial process, it would have been difficult to identify such detailed problems.

Taking into consideration the discussion above, the goal of this master project can be defined as follows:

GOAL: Create a proof-of-concept tool to allow comparison of multiple processes.

In other words, the aim is to support integrated analysis on multiple processes, while examining different views of a process. Together with the main goal, there are some other targets: filtering processes while preserving the initial dataset, merging different parts of a process, visualizing process mining results simultaneously and placing them next to each other to facilitate comparison. In the following, we present the approach we propose to reach the enumerated objectives.

1.4 Approach

Figure 1.1: The process cube. Concept proposed in the PROCUBE project.

To accomplish the goal, we base our approach on the process cube concept, introduced in [4] and shown in Figure 1.1. A process cube is a structure composed of process cells. Each process cell (or collection of cells) can be used to generate an event log and derive process mining results [4]. Note that traditional process mining algorithms are always applied to a specific event log without systematically considering the multidimensional nature of event data.

In this project, the process cube is materialized as an online analytical processing (OLAP) hypercube structure. Besides the built-in multidimensional structure, one can benefit from the functionality of the OLAP operations and hopefully from the good performance of OLAP implementations. Transactional databases are designed to store and clean data, but are not tailored towards analysis. OLAP, on the other hand, is herein chosen to harbor complex event data for further process analysis, in view of its analysis-optimized databases and its specialized “drilling” operations.
Organizing event data in OLAP multidimensional structures makes it easy to get event data and to choose a side from which to look at it. There are also many ways to divide event data, e.g., one can always drill down and up in the multidimensional structure and inspect event data at different granularity levels. Finally, the retrieved event data can be used to obtain different process-related characteristics, e.g., process models, that can be further analysed and compared.

There are, however, some challenges with respect to this approach, mainly due to the fact that OLAP does not handle event data, but enterprise data:
• Only the aggregation of large collections of numerical data is supported by the OLAP tools.
• Process-related aspects are entirely missing in the OLAP framework.
• Overlapping of cell (event) classes is not possible in OLAP cubes.

Figure 1.2: Master Project Scope.

Nevertheless, adjustments can be made to OLAP tools to accommodate process cube requirements. The approach considers several steps, shown also in Figure 1.2. First, event logs are introduced among OLAP data sources. Hence, it becomes possible to load XES event logs in the OLAP database. Second, the process cube is created to support the materialization of an event log. Moreover, the process cube is designed to allow the visualization of cells with overlapping event data. Finally, different process mining results can be produced for any section of the cube and further exported as images.

The materialization of the process cube as an OLAP cube allows us to define our objective even more precisely: the goal is to create a proof-of-concept tool that exploits OLAP features to accommodate process mining solutions such that the comparison of multiple processes is possible.

1.5 Thesis Structure

To describe the approach, the master thesis is structured as follows:

Present a literature study on employed concepts and technologies (Chapter 2)
Concepts from process mining and business intelligence fields will be introduced.
Then, a discussion on the implemented OLAP and database technologies will follow.

Elaborate on process cube functionality (Chapter 3)
The process cube notion will be clearly defined together with its structure. The requirements needed to attain the envisioned process cube functionality will be listed.

Explain Palo software choice (Chapter 4)
Based on the requirements from Chapter 3, a collection of technological solutions that could support the process cube structure is generated. After analyzing the pros and the cons of each solution, the choice to use the Palo OLAP server is described and motivated.

Recall the most relevant implementation steps (Chapter 5)
After presenting the architecture of the project, the implementation steps are described. The main functionality consists of: loading/unloading an XES file into/from the in-memory database, enabling the adjusted OLAP operations on event logs and visualizing process mining results.

Report on the testing process and on the system test results (Chapter 6)
The functionality of the software is tested and its performance is evaluated for different event logs and process cubes.

Conclude with general remarks on the project (Chapter 7)
The thesis concludes with a series of comments and observations on both the implemented solution and further research possibilities.
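
To make the process cube idea of Section 1.4 more tangible, the following is a minimal Python sketch and not the Palo/ProM implementation described in this thesis. It assumes a hypothetical toy event list whose attribute names (case, activity, crew, color, minutes) are purely illustrative, loosely echoing the car polishing scenario of Section 1.3. Each cell holds a sublog instead of a numerical aggregate; the helper names (build_cube, slice_cube, dice_cube, unload, sparsity) are assumptions made for this sketch and do not correspond to any tool's API.

```python
from collections import defaultdict

# Hypothetical toy event log; attribute names and values are illustrative only.
EVENTS = [
    {"case": "c1", "activity": "polish", "crew": "A", "color": "red", "minutes": 30},
    {"case": "c2", "activity": "polish", "crew": "A", "color": "blue", "minutes": 10},
    {"case": "c3", "activity": "polish", "crew": "A", "color": "red", "minutes": 34},
    {"case": "c4", "activity": "polish", "crew": "B", "color": "blue", "minutes": 12},
    {"case": "c5", "activity": "polish", "crew": "B", "color": "black", "minutes": 11},
]


def build_cube(events, dimensions):
    """Group events into cells keyed by their values on the chosen dimensions.

    Each cell stores a sublog (list of events), not a numerical aggregate.
    """
    cube = defaultdict(list)
    for event in events:
        cube[tuple(event[d] for d in dimensions)].append(event)
    return dict(cube)


def slice_cube(cube, dimensions, dimension, value):
    """Fix one dimension to a single value and drop it from the coordinates."""
    idx = dimensions.index(dimension)
    remaining = tuple(d for d in dimensions if d != dimension)
    result = {}
    for key, sublog in cube.items():
        if key[idx] == value:
            result.setdefault(key[:idx] + key[idx + 1:], []).extend(sublog)
    return remaining, result


def dice_cube(cube, dimensions, allowed):
    """Keep only cells whose coordinates lie in the allowed value sets."""
    result = {}
    for key, sublog in cube.items():
        coords = dict(zip(dimensions, key))
        if all(coords[d] in values for d, values in allowed.items()):
            result[key] = sublog
    return result


def unload(cube):
    """Merge all cells of a (sub)cube into one sublog for further analysis."""
    return [event for sublog in cube.values() for event in sublog]


def sparsity(cube, events, dimensions):
    """Fraction of empty cells among all possible coordinate combinations."""
    total = 1
    for d in dimensions:
        total *= len({event[d] for event in events})
    return 1 - len(cube) / total


def mean_minutes(sublog):
    """Average working time over the events of a sublog."""
    return sum(event["minutes"] for event in sublog) / len(sublog)


if __name__ == "__main__":
    dims = ("crew", "color")
    cube = build_cube(EVENTS, dims)
    # Isolate crew A, then compare the sublogs per color.
    remaining, crew_a = slice_cube(cube, dims, "crew", "A")
    for key, sublog in crew_a.items():
        print(dict(zip(remaining, key)), mean_minutes(sublog))
    print("sparsity:", sparsity(cube, EVENTS, dims))
```

Keeping whole sublogs in the cells is what allows a selected part of the cube to be handed to a process mining algorithm afterwards; the price is that many coordinate combinations stay empty, which is the sparsity issue mentioned in the abstract.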