Andrea Burattin
Process Mining Techniques in Business Environments
Theoretical Aspects, Algorithms, Techniques and Open Challenges in Process Mining

Lecture Notes in Business Information Processing (LNBIP) 207

Series Editors
Wil van der Aalst, Eindhoven Technical University, Eindhoven, The Netherlands
John Mylopoulos, University of Trento, Povo, Italy
Michael Rosemann, Queensland University of Technology, Brisbane, QLD, Australia
Michael J. Shaw, University of Illinois, Urbana-Champaign, IL, USA
Clemens Szyperski, Microsoft Research, Redmond, WA, USA

More information about this series at http://www.springer.com/series/7911

Andrea Burattin
University of Innsbruck
Innsbruck, Austria

ISSN 1865-1348
ISSN 1865-1356 (electronic)
ISBN 978-3-319-17481-5
ISBN 978-3-319-17482-2 (eBook)
DOI 10.1007/978-3-319-17482-2
Library of Congress Control Number: 2015938082
Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper
Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)

Preface

This book encompasses a revised version of the Ph.D. dissertation written by the author at the Mathematics Department of the University of Padua (Italy) and at the Computer Science Department of the University of Bologna (Italy). In 2014, the dissertation won the “Best Process Mining Dissertation Award”, assigned by the IEEE Task Force on Process Mining to the most outstanding Ph.D. thesis, discussed between 2012 and 2013, focused on the area of business process intelligence.

The increasing availability of storage and computing capability, combined with the advent of new “smart” devices, represents the fundamental basis of the so-called “Internet of Things” (IoT). Business companies are focusing their attention on IoT as well, since it could be exploited in a valuable manner. One of the results of such IoT diffusion, and more generally a common trend of these years, is that data collection is increasing monumentally.

It is important to remember that the value of data is intimately connected to the knowledge that can be synthesized from them. Moreover, in order to strengthen their business, the focus of companies should be on the consolidation and improvement of their business processes, rather than on their data. This is the scenario where process mining sits: in between data mining and business process modeling.

After a brief presentation of the state of the art of process mining techniques, this book proposes different scenarios for the deployment of process mining projects.
In particular, a characterization of companies, in terms of their “process awareness” (and the process awareness of their information systems), is detailed.

The work continues by identifying and reporting the possible circumstances where problems, both “practical” and “conceptual”, can emerge. We identified three areas as possible sources of problems: (i) data preparation (e.g., syntactic translation of data, missing data); (ii) the actual mining phase (e.g., mining algorithm exploiting all data available); and (iii) results interpretation. Several problems are not limited to a single phase, but are orthogonal to all the mentioned sources: for example, the configuration of parameters by non-expert users or the computational complexity of some techniques. In this book we will analyze at least one solution for each of the presented problems. The descriptions of these solutions are kept general, in order to easily allow their tailoring to specific application domains.

The solutions proposed in this book belong to two different computational paradigms: the first considers the classical “batch process mining” (also known as “off-line”); the second introduces “on-line process mining”.

Concerning batch process mining, we first investigate the data preparation problem, and we analyze and present a solution for the problem of hidden data (i.e., when a required field is not explicitly indicated). In our example we consider the “case-id”. In particular, our approach tries to identify this missing information by looking at metadata recorded for each event.
After that, we will concentrate on the second step (the mining phase) and, in particular, on the problem of exploiting all the available information. As an example, we propose the generalization of a well-known control-flow discovery algorithm (i.e., Heuristics Miner) in order to exploit non-instantaneous events. The usage of interval-based recording leads to an important improvement of the algorithm performance. As another example of data exploitation, we present an automatic approach for the extension of a control-flow model with social information (i.e., roles), in order to simplify the analysis of these two perspectives (the control-flow and resources) combined.

Later on, we will focus our attention on another important and, for non-expert users, impacting problem: the parameters configuration. As an example, we consider the configuration of a control-flow discovery algorithm. Our approach consists of two steps: first, we introduce a method to automatically discretize the space of parameter values. Then, we present two approaches to select the “best” parameters configuration. The first, completely autonomous, uses the Minimum Description Length principle to balance the model complexity and the data explanation; the second requires human interaction to navigate a hierarchy of models and find the most suitable result.

The data interpretation and results evaluation phase is not problem free either. Also in this case, we will analyze the problems and propose two new metrics: a model-to-model and a model-to-log (the latter considers models expressed in declarative language).

The final part of this book deals with the adaptation of process mining to on-line settings. We will consider, as an example, the problem of on-line control-flow discovery. Specifically, we are going to propose a formal definition of the problem and then present two baseline approaches. These two basic approaches are used only for validation purposes.
Two actual mining algorithms are proposed: the first is the adaptation, to the control-flow discovery problem, of a well-known frequency counting algorithm (i.e., Lossy Counting); the second constitutes a framework of models which can be used for different kinds of streams (for example, stationary streams or streams with concept drifts).

Innsbruck, Austria
February 2015
Andrea Burattin

Acknowledgments

I would like to thank, in primis, my Ph.D. supervisor: Alessandro Sperduti. His continuous, expert, and passionate guidance incredibly simplified my job. It is a privilege to work with such a generous person and qualified professor and researcher.

I want to express my authentic gratitude to Roberto Pinelli, from Siav. He has always been willing to help me, by all means, and many parts of this book are due to the opportunities he gave me.

Also, I’m very thankful to Paolo Baldan, Diogo Ferreira, Tullio Vardanega, and Barbara Weber, who spent their time reading my Ph.D. thesis and sharing their useful comments.

As mentioned, this book comes as an elaborated version of my Ph.D. thesis. I’m particularly thankful to the organizers of the Best Process Mining Dissertation Award: Dirk Fahland, Antonella Guzzo, and Marcello La Rosa. Their detailed comments and elaborate suggestions substantially helped me in shaping this work.

Special thanks go to Wil van der Aalst: working with him and his team has been an incredibly formative experience. His remarkable professionalism and competence are sources of inspiration for my work.

I would like to thank all my colleagues and friends, who shared with me the Ph.D. journey, at the University of Padua and Bologna, in Siav, and at the AIS group, in Eindhoven.

Finally, I thank my wife Serena, my parents Stefania and Antonio, and my whole family for never having held back in giving me help, trust, and serenity, and the possibility of reaching my goals.

Innsbruck, Austria
February 2015
Andrea Burattin

Contents

1 Introduction
  1.1 Business Process Modeling
  1.2 Process Mining
  1.3 Book Outline
  1.4 Website

Part I: State of the Art: BPM, Data Mining and Process Mining

2 Introduction to Business Processes, BPM, and BPM Systems
  2.1 Introduction to Business Processes
    2.1.1 Petri Nets
    2.1.2 BPMN
    2.1.3 YAWL
    2.1.4 Declare
    2.1.5 Other Formalisms
  2.2 Business Process Management Systems
3 Data Generated by Information Systems (and How to Get It)
  3.1 Information Extraction from Unstructured Sources
  3.2 Evaluation with the F1 Measure
4 Data Mining for Information System Data
  4.1 Classification with Nearest Neighbor
  4.2 Neural Networks Applied to Estimation
  4.3 Association Rules Extraction
  4.4 Clustering
    4.4.1 Clustering with Self-organizing Maps
    4.4.2 Clustering with Hierarchical Clustering
  4.5 Profiling Using Decision Trees
5 Process Mining
  5.1 Process Mining as Control-Flow Discovery
  5.2 Other Perspectives of Process Mining
    5.2.1 Organizational Perspective
    5.2.2 Conformance Checking
    5.2.3 Data Perspective
  5.3 Performance Evaluation of Process Mining Algorithm
6 Quality Criteria in Process Mining
  6.1 Model-to-Log Metrics
  6.2 Model-to-Model Metrics
7 Event Streams
  7.1 Data Streams
    7.1.1 Data-Based Mining
    7.1.2 Task-Based Mining
  7.2 Common Stream Mining Approaches
  7.3 Stream Mining and Process Mining

Part II: Obstacles to Process Mining in Practice

8 Obstacles to Applying Process Mining in Practice
  8.1 Typical Deploy Scenarios
  8.2 Problems with Data Preparation
  8.3 Problems During the Mining Phase
  8.4 Problems with the Interpretation of the Mining Results and Extension of Processes
  8.5 Incremental and Online Process Mining
9 Long-term View Scenario
  9.1 A Target Scenario
  9.2 Discussion

Part III: Process Mining as an Emerging Technology

10 Data Preparation
  10.1 Process Mining in New Scenarios
  10.2 Working Framework for Event Logs
  10.3 Identification of Process Instances
    10.3.1 Exploiting A-priori Knowledge
    10.3.2 Selection of the Identifier
    10.3.3 Results Organization and Filtering
    10.3.4 Deriving a Log to Mine
  10.4 Experimental Results
  10.5 Similar Problems and Solutions
  10.6 Summary
11 Heuristics Miner for Time Interval
  11.1 Heuristics Miner
  11.2 Activities as Time Interval