Data-Centric Systems and Applications

Wilfried Grossmann, Stefanie Rinderle-Ma
Fundamentals of Business Intelligence

Series Editors: M.J. Carey, S. Ceri
Editorial Board: A. Ailamaki, S. Babu, P. Bernstein, J.C. Freytag, A. Halevy, J. Han, D. Kossmann, I. Manolescu, G. Weikum, K.-Y. Whang, J.X. Yu
More information about this series at http://www.springer.com/series/5258

Wilfried Grossmann, University of Vienna, Vienna, Austria
Stefanie Rinderle-Ma, University of Vienna, Vienna, Austria

ISSN 2197-9723
ISSN 2197-974X (electronic)
ISBN 978-3-662-46530-1
ISBN 978-3-662-46531-8 (eBook)
DOI 10.1007/978-3-662-46531-8
Library of Congress Control Number: 2015938180
Springer Heidelberg New York Dordrecht London
© Springer-Verlag Berlin Heidelberg 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper

Springer-Verlag GmbH Berlin Heidelberg is part of Springer Science+Business Media (www.springer.com)

Foreword

Intelligent businesses need Business Intelligence (BI). They need it for recognizing, analyzing, modeling, structuring, and optimizing business processes. They need it, moreover, for making sense of massive amounts of unstructured data in order to support and improve highly sensitive, if not highly critical, business decisions. The term “intelligent businesses” does not merely refer to commercial companies but also to (hopefully) intelligent governments, intelligently managed educational institutions, efficient hospitals, and so on. Every complex business activity can profit from BI.

BI has become a mainstream technology and is, according to most information technology analysts, looking forward to a more brilliant and prosperous future. Almost all medium-sized and large enterprises and organizations are either already using BI software or plan to make use of it in the next few years. There is thus a rapidly growing need for BI specialists. The need for experts in machine learning and data analytics is well known. Because these disciplines are central to the Big Data hype, and because Google, Facebook, and other companies seem to offer an infinite number of jobs in these areas, students are demanding more courses in machine learning and data analytics. Many Computer Science Departments have consequently strengthened their curricula with respect to these areas.
However, machine learning, including data analytics, is only one part of BI technology. Before a “machine” can learn from data, one actually needs to collect the data and present them in a unified form, a process that is often referred to as data provisioning. This, in turn, requires extracting the data from the relevant business processes and possibly also from Web sources such as social networks, cleaning, transforming, and integrating them, and loading them into a data warehouse or other type of database. To let humans interact efficiently with the various stages of these activities, methods and tools for data visualization are necessary. BI goes, moreover, well beyond plain data and aims to identify, model, and optimize the business processes of an enterprise. All these BI activities have been thoroughly investigated, and each has given rise to a number of monographs and textbooks. What was sorely missing, however, was a book that ties it all together and that gives a unified view of the various facets of Business Intelligence.

The present book by Wilfried Grossmann and Stefanie Rinderle-Ma brilliantly fills this gap. This book is a thoughtful introduction to the major relevant aspects of BI. The book is, however, not merely an entry point to the field.
It develops the various subdisciplines of BI with the appropriate depth and covers the major methods and techniques in sufficient detail so as to enable the reader to apply them in a real-world business context. The book focuses, in particular, on the four major areas related to BI: (1) data modeling and data provisioning including data extraction, integration, and warehousing; (2) data and process visualization; (3) machine learning, data and text mining, and data analytics; and (4) process analysis, mining, and management. The book does not only cover the standard aspects of BI but also topics of more recent relevance such as social network analytics and topics of more specialized interest such as text mining. The authors have done an excellent job in selecting and combining all topics relevant to a modern approach to Business Intelligence and in presenting the corresponding concepts and methods within a unified framework. To the best of my knowledge, this is the first book that presents BI at this level of breadth, depth, and coherence.

The authors, Wilfried Grossmann and Stefanie Rinderle-Ma, joined forces to form an ideal team for writing such a useful and comprehensive book about BI. They are both professors at the University of Vienna but have in addition gained substantial experience with corporate and institutional BI projects: Stefanie Rinderle-Ma more in the process management area and Wilfried Grossmann more in the field of data analytics. To the profit of the reader, they put their knowledge and experience together to develop a common language and a unified approach to BI. They are, moreover, experts in presenting material to students and have at the same time the real-life background necessary for selecting the truly relevant material. They were able to come up with appropriate and meaningful examples to illustrate the main concepts and methods. In fact, the four running examples in this book are grounded in both authors’ rich project experience.
This book is suitable for graduate courses in a Computer Science or Information Systems curriculum. At the same time, it will be most valuable to data or software engineers who aim at learning about BI, in order to gain the ability to successfully deploy BI techniques in an enterprise or other business environment. I congratulate the authors on this well-written, timely, and very useful book, and I hope the reader enjoys it and profits from it as much as possible.

Oxford, UK
March 2015
Georg Gottlob

Preface

The main task of business intelligence (BI) is providing decision support for business activities based on empirical information. The term business is understood in a rather broad sense covering activities in different domain applications, for example, an enterprise, a university, or a hospital. In the context of the business under consideration, decision support can be at different levels ranging from the operational support for a specific business activity up to strategic support at the top level of an organization. Consequently, the term BI summarizes a huge set of models and analytical methods such as reporting, data warehousing, data mining, process mining, predictive analytics, organizational mining, or text mining.

In this book, we present fundamental ideas for a unified approach towards BI activities with an emphasis on analytical methods developed in the areas of process analysis and business analytics.

The general framework is developed in Chap. 1, which also gives an overview of the structure of the book. One underlying idea is that all kinds of business activities are understood as a process in time and the analysis of this process can emphasize different perspectives of the process. Three perspectives are distinguished: (1) the production perspective, which relates to the supplier of the business; (2) the customer perspective, which relates to users/consumers of the offered business; and (3) the organizational perspective, which considers issues such as operations in the production perspective or social networks in the customer perspective.
Core elements of BI are data about the business, which refer either to the description of the process or to instances of the process. These data may take different views on the process defined by the following structural characteristics: (1) an event view, which records detailed documentation of certain events; (2) a state view, which monitors the development of certain attributes of process instances over time; and (3) a cross-sectional view, which gives summary information of characteristic attributes for process instances recorded within a certain period of time.

The issues for which decision support is needed are often related to so-called key performance indicators (KPIs) and to the understanding of how they depend on certain influential factors, i.e., specificities of the business. For analytical purposes, it is necessary to reformulate a KPI as a number of analytical goals. These goals correspond to well-known methods of analysis and can be summarized under the headings business description goals, business prediction goals, and business understanding goals. Typical business description goals are reporting, segmentation (unsupervised learning), and the identification of interesting behavior. Business prediction goals encompass estimation and classification and are known as supervised learning in the context of machine learning. Business understanding goals support stakeholders in understanding their business processes and may consist of process identification and process analysis.

Based on this framework, we develop a method format for BI activities oriented towards ideas of the L* format for process mining and CRISP for business analytics. The main tasks of the format are the business and data understanding task, the data task, the modeling task, the analysis task, and the evaluation and reporting task. These tasks define the structure of the following chapters.

Chapter 2 deals with questions of modeling.
A broad range of models occurs in BI corresponding to the different business perspectives, a number of possible views on the processes, and manifold analysis goals. Starting from possible ways of understanding the term model, the most frequently used model structures in BI are identified, such as logic-algebraic structures, graph structures, and probabilistic/statistical structures. Each structure is described in terms of its basic properties and notation as well as algorithmic techniques for solving questions within these structures. Background knowledge about these structures is assumed at the level of introductory courses in programs for applied computer science. Additionally, basic considerations about data generation, data quality, and handling temporal aspects are presented.

Chapter 3 elaborates on the data provisioning process, ranging from data collection and extraction to a solid description of concepts and methods for transforming data into analytical data formats necessary for using the data as input for the models in the analysis. The analytical data formats also cover temporal data as used in process analysis.

In Chap. 4, we present basic methods for data description and data visualization that are used in the business and data understanding task as well as in the evaluation and reporting task. Methods for process-oriented data and cross-sectional data are considered. Based on these fundamental techniques, we sketch aspects of interactive and dynamic visualization and reporting.

Chapters 5–8 explain different analytical techniques used for the main analysis goals of supervised learning (prediction and classification), unsupervised learning (clustering), as well as process identification and process analysis. Each chapter is organized in such a way that we first present an overview of the used terminology and general methodological considerations. Thereafter, frequently used analytical techniques are discussed.

Chapter 5 is devoted to analysis techniques for cross-sectional data, basically traditional data mining techniques.
For prediction, different regression techniques are presented. For classification, we consider techniques based on statistical principles, techniques based on trees, and support vector machines. For unsupervised learning, we consider hierarchical clustering, partitioning methods, and model-based clustering.

Chapter 6 focuses on analysis techniques for data with temporal structure. We start with probability-oriented models, in particular Markov chains and regression-based techniques (event history analysis). The remainder of the chapter considers analysis techniques useful for detecting interesting behavior in processes such as association analysis, sequence mining, and episode mining.

Chapter 7 treats methods for process identification, process performance management, process mining, and process compliance. In Chap. 8, we elaborate various analysis techniques for problems that look at a business process from different perspectives. The basics of social network analysis, organizational mining, decision point analysis, and text mining are presented. The analysis of these problems combines techniques from the previous chapters.

To explain a method, we use demonstration examples on the one hand and more realistic examples based on use cases on the other hand. The latter include the areas of medical applications, higher education, and customer relationship management. These use cases are introduced in Chap. 1. For software solutions, we focus on open source software, mainly R for cross-sectional analysis and ProM for process analysis. Detailed code for the solutions together with instructions on how to install the software can be found on the accompanying website:

www.businessintelligence-fundamentals.com

The presentation tries to avoid too much mathematical formalism. For the derivation of properties of various algorithms, we refer to the corresponding literature. Throughout the text, you will find different types of boxes.
Light grey boxes are used for the presentation of the use cases, dark grey boxes for templates that outline the main activities in the different tasks, and white boxes for overview summaries of important facts and basic structures of procedures.

The material presented in the book was used by the authors in a 4-h course on Business Intelligence running for two semesters. In case of shorter courses, one could start with Chaps. 1 and 2, followed by selected topics of Chaps. 3, 5, and 7.

Vienna, Austria, Wilfried Grossmann
Vienna, Austria, Stefanie Rinderle-Ma