Data Cleaning PDF

285 Pages·2019·13.924 MB·English

Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and incorrect business decisions. Poor data across businesses and the U.S. government are reported to cost trillions of dollars a year. Multiple surveys show that dirty data is the most common barrier faced by data scientists. Not surprisingly, developing effective and efficient data cleaning solutions is challenging and is rife with deep theoretical and engineering problems.

This book is about data cleaning, which is used to refer to all kinds of tasks and activities to detect and repair errors in the data. Rather than focus on a particular data cleaning task, we give an overview of the end-to-end data cleaning process, describing various error detection and repair methods, and attempt to anchor these proposals with multiple taxonomies and views. Specifically, we cover four of the most common and important data cleaning tasks, namely, outlier detection, data transformation, error repair (including imputing missing values), and data deduplication. Furthermore, due to the increasing popularity and applicability of machine learning techniques, we include a chapter that specifically explores how machine learning techniques are used for data cleaning, and how data cleaning is used to improve machine learning models.

This book is intended to serve as a useful reference for researchers and practitioners who are interested in the area of data quality and data cleaning. It can also be used as a textbook for a graduate course. Although we aim at covering state-of-the-art algorithms and techniques, we recognize that data cleaning is still an active field of research and therefore provide future directions of research whenever appropriate.

ABOUT ACM BOOKS

ACM Books is a series of high-quality books published by ACM for the computer science community.
ACM Books publications are widely distributed in print and digital formats by major booksellers and are available to libraries and library consortia. Individual ACM members may access ACM Books publications via separate annual subscription.

BOOKS.ACM.ORG • WWW.MORGANCLAYPOOLPUBLISHERS.COM

ACM Books

Editor in Chief
M. Tamer Özsu, University of Waterloo

ACM Books is a series of high-quality books for the computer science community, published by ACM and many in collaboration with Morgan & Claypool Publishers. ACM Books publications are widely distributed in both print and digital formats through booksellers and to libraries (and library consortia) and individual ACM members via the ACM Digital Library platform.

Data Cleaning
Ihab F. Ilyas, University of Waterloo
Xu Chu, Georgia Institute of Technology
2019

Conversational UX Design: A Practitioner’s Guide to the Natural Conversation Framework
Robert J. Moore, IBM Research–Almaden
Raphael Arar, IBM Research–Almaden
2019

Heterogeneous Computing: Hardware and Software Perspectives
Mohamed Zahran, New York University
2019

Hardness of Approximation Between P and NP
Aviad Rubinstein, Stanford University
2019

Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker
Editor: Michael L. Brodie, Massachusetts Institute of Technology
2018

The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition
Editors: Sharon Oviatt, Monash University
Björn Schuller, University of Augsburg and Imperial College London
Philip R. Cohen, Monash University
Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI)
Gerasimos Potamianos, University of Thessaly
Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence (DFKI)
2018

Declarative Logic Programming: Theory, Systems, and Applications
Editors: Michael Kifer, Stony Brook University
Yanhong Annie Liu, Stony Brook University
2018

The Sparse Fourier Transform: Theory and Practice
Haitham Hassanieh, University of Illinois at Urbana-Champaign
2018

The Continuing Arms Race: Code-Reuse Attacks and Defenses
Editors: Per Larsen, Immunant, Inc.
Ahmad-Reza Sadeghi, Technische Universität Darmstadt
2018

Frontiers of Multimedia Research
Editor: Shih-Fu Chang, Columbia University
2018

Shared-Memory Parallelism Can Be Simple, Fast, and Scalable
Julian Shun, University of California, Berkeley
2017

Computational Prediction of Protein Complexes from Protein Interaction Networks
Sriganesh Srihari, The University of Queensland Institute for Molecular Bioscience
Chern Han Yong, Duke-National University of Singapore Medical School
Limsoon Wong, National University of Singapore
2017

The Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling, and Common Modality Combinations
Editors: Sharon Oviatt, Incaa Designs
Björn Schuller, University of Passau and Imperial College London
Philip R. Cohen, Voicebox Technologies
Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI)
Gerasimos Potamianos, University of Thessaly
Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence (DFKI)
2017

Communities of Computing: Computer Science and Society in the ACM
Thomas J. Misa, Editor, University of Minnesota
2017

Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining
ChengXiang Zhai, University of Illinois at Urbana–Champaign
Sean Massung, University of Illinois at Urbana–Champaign
2016

An Architecture for Fast and General Data Processing on Large Clusters
Matei Zaharia, Stanford University
2016

Reactive Internet Programming: State Chart XML in Action
Franck Barbier, University of Pau, France
2016

Verified Functional Programming in Agda
Aaron Stump, The University of Iowa
2016

The VR Book: Human-Centered Design for Virtual Reality
Jason Jerald, NextGen Interactions
2016

Ada’s Legacy: Cultures of Computing from the Victorian to the Digital Age
Robin Hammerman, Stevens Institute of Technology
Andrew L. Russell, Stevens Institute of Technology
2016

Edmund Berkeley and the Social Responsibility of Computer Professionals
Bernadette Longo, New Jersey Institute of Technology
2015

Candidate Multilinear Maps
Sanjam Garg, University of California, Berkeley
2015

Smarter Than Their Machines: Oral Histories of Pioneers in Interactive Computing
John Cullinane, Northeastern University; Mossavar-Rahmani Center for Business and Government, John F. Kennedy School of Government, Harvard University
2015

A Framework for Scientific Discovery through Video Games
Seth Cooper, University of Washington
2014

Trust Extension as a Mechanism for Secure Code Execution on Commodity Computers
Bryan Jeffrey Parno, Microsoft Research
2014

Embracing Interference in Wireless Systems
Shyamnath Gollakota, University of Washington
2014

Data Cleaning
Ihab F. Ilyas, University of Waterloo
Xu Chu, Georgia Institute of Technology
ACM Books #28

Copyright © 2019 by Association for Computing Machinery

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews) without the prior permission of the publisher.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which the Association for Computing Machinery is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

http://books.acm.org

ISBN: 978-1-4503-7152-0 hardcover
ISBN: 978-1-4503-7153-7 paperback
ISBN: 978-1-4503-7154-4 ePub
ISBN: 978-1-4503-7155-1 eBook

Series ISSN: 2374-6769 print, 2374-6777 electronic

DOIs:
10.1145/3310205 Book
10.1145/3310205.3310206 Preface
10.1145/3310205.3310207 Chapter 1
10.1145/3310205.3310208 Chapter 2
10.1145/3310205.3310209 Chapter 3
10.1145/3310205.3310210 Chapter 4
10.1145/3310205.3310211 Chapter 5
10.1145/3310205.3310212 Chapter 6
10.1145/3310205.3310213 Chapter 7
10.1145/3310205.3310214 Chapter 8
10.1145/3310205.3310215 References/Index/Bios

A publication in the ACM Books series, #28
Editor in Chief: M. Tamer Özsu, University of Waterloo

This book was typeset in Arnhem Pro 10/14 and Flama using ZzTEX.
Cover photo: Jason Dorfman MIT/CSAIL

First Edition
10 9 8 7 6 5 4 3 2 1
