From Data Mining to Big Data & Data Science: a Computational Perspective

Miquel Sànchez-Marrè
Knowledge Engineering & Machine Learning Group (KEMLG)
Dept. de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya-BarcelonaTech
[email protected]
http://www.lsi.upc.edu/~miquel
https://kemlg.upc.edu
19 July 2013

Content

  - Introduction
  - Knowledge Discovery in Databases
  - Data Pre-processing
  - Data Mining Techniques: Statistical & ML methods
  - Data Post-Processing
  - Big Data / Data Science
  - Social Mining
  - Scaling from Data Mining to Big Mining
  - Computational Techniques
  - Examples of Extrapolation of classical DM Techniques
  - A look at Mahout
  - Big Data Tools & Resources
  - Big Data Trends

Big Data & Data Science, Summer Course, 16-19 July 2013, Universidade de Santiago de Compostela

INTRODUCTION

Complex Real-world Systems/Domains

  - They exist in the daily life of human beings and are normally very complex to understand, analyse, manage or solve.
  - They involve several very complex and difficult decision-making tasks, which are usually handled by human experts.
  - Some of them, in addition, could have catastrophic consequences for human beings, for the environment or for the economy of an organization.

Examples

  - Environmental Systems/Domains
  - Medical Systems/Domains
  - Industrial Process Management Systems/Domains
  - Business Administration & Management Systems/Domains
      - Marketing
      - Decisions on products and prices
      - Decisions on human resources
      - Decisions on strategies and company policy
  - Internet

Need for Decision Making Support Tools

  - Complexity of the decision making process
  - Accurate evaluation of multiple alternatives
  - Need for forecasting capabilities
  - Uncertainty
  - Data analysis and exploitation
  - Need for including experience and expertise (knowledge)

  Computational tools: Decision Support Systems (DSS)
  Intelligent computational tools: Intelligent Decision Support Systems (IDSS)

Management Information Systems

  [Diagram: Management Information Systems. Box labels: Transaction Processing Systems; Decision Support Systems (DSS); Intelligent Decision Support Systems (IDSS); Model-driven DSS; Automatic model-driven DSS; Data-driven DSS; Top-down approach / confirmative models; Bottom-up approach / explorative models; Statistical / numerical methods; Artificial intelligence / machine learning methods.]

Intelligent Decision Support Systems (IDSS)

  [Figure: IDSS architecture, from Sànchez-Marrè et al., 2000. Labels recoverable from the figure:
   - User, with economical costs and laws & regulations; user interface
   - Decision support: explanation; alternatives evaluation; planning / prediction / supervision; decision / strategy
   - Diagnosis: reasoning / models' integration; AI models; statistical models; numerical models
   - Data mining: knowledge acquisition / learning; feedback; background / subjective knowledge
   - Data interpretation: GIS (spatial data); data base (temporal data); on-line data; off-line data; spatial / geographical data; subjective information / analyses
   - On-line / off-line actuators (actuation); sensors; system / process]

Why is Data Mining important?

  From observations to decisions [Adapted from A.D. Witakker, 1993]

  [Figure: from observations to decisions. Vertical axis: value and relevance to decisions (low to high). Horizontal axis: quantity of information (high to low). Labels in the figure: Observations, Data, Understanding, KNOWLEDGE, Predictions, Consequences, INTERPRETATION, Recommendations, DECISION.]

KNOWLEDGE DISCOVERY / DATA MINING

Intelligent Data Analysis

KDD Concepts

  KDD
  - Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad et al., 1996].
  - KDD is a multi-step process.

  Data Mining
  - Data Mining is a step in the KDD process consisting of particular data mining algorithms that, under some acceptable computational efficiency limitations, produce a particular set of patterns over a set of examples, cases or data [Fayyad et al., 1996].
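
The idea that KDD is a multi-step process, with data mining as only one of its steps, can be illustrated with a minimal sketch. This is not taken from the slides: the function names (`preprocess`, `mine_frequent_pairs`, `postprocess`, `kdd`), the toy shopping-basket data and the choice of frequent item pairs as the "patterns" are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def preprocess(records):
    """Pre-processing: normalise item names and drop empty items/records."""
    cleaned = []
    for record in records:
        items = {item.strip().lower() for item in record if item and item.strip()}
        if items:
            cleaned.append(items)
    return cleaned


def mine_frequent_pairs(transactions, min_support):
    """Data mining step (illustrative choice): keep item pairs that occur
    together in at least `min_support` transactions."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


def postprocess(patterns):
    """Post-processing: rank patterns by support, ties broken alphabetically."""
    return sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))


def kdd(records, min_support=2):
    """Chain the three steps: pre-processing, data mining, post-processing."""
    return postprocess(mine_frequent_pairs(preprocess(records), min_support))


if __name__ == "__main__":
    # Hypothetical raw data with inconsistent casing, blanks and an empty record.
    raw = [
        ["Bread", "milk ", "eggs"],
        ["bread", "Milk"],
        ["milk", "eggs", ""],
        [],
        ["bread", "eggs", "milk"],
    ]
    for pair, support in kdd(raw):
        print(pair, support)
```

Keeping each step as a separate function mirrors the definition above: the mining algorithm could be swapped for another one without touching pre-processing or post-processing.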