Shazia Sadiq Editor Handbook of Data Quality Research and Practice Handbook of Data Quality Shazia Sadiq Editor Handbook of Data Quality Research and Practice 123 Editor ShaziaSadiq UniversityofQueensland Brisbane Australia ISBN978-3-642-36256-9 ISBN978-3-642-36257-6(eBook) DOI10.1007/978-3-642-36257-6 SpringerHeidelbergNewYorkDordrechtLondon LibraryofCongressControlNumber:2013935616 ACMComputingClassification(1998):H.2,H.3,E.5,K.6 (cid:2)c Springer-VerlagBerlinHeidelberg2013 Thisworkissubjecttocopyright.AllrightsarereservedbythePublisher,whetherthewholeorpartof thematerialisconcerned,specificallytherightsoftranslation,reprinting,reuseofillustrations,recitation, broadcasting,reproductiononmicrofilmsorinanyotherphysicalway,andtransmissionorinformation storageandretrieval,electronicadaptation,computersoftware,orbysimilarordissimilarmethodology nowknownorhereafterdeveloped.Exemptedfromthislegalreservationarebriefexcerptsinconnection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’slocation,initscurrentversion,andpermissionforusemustalwaysbeobtainedfromSpringer. PermissionsforusemaybeobtainedthroughRightsLinkattheCopyrightClearanceCenter.Violations areliabletoprosecutionundertherespectiveCopyrightLaw. Theuseofgeneraldescriptivenames,registerednames,trademarks,servicemarks,etc.inthispublication doesnotimply,evenintheabsenceofaspecificstatement,thatsuchnamesareexemptfromtherelevant protectivelawsandregulationsandthereforefreeforgeneraluse. While the advice and information in this book are believed to be true and accurate at the date of publication,neithertheauthorsnortheeditorsnorthepublishercanacceptanylegalresponsibilityfor anyerrorsoromissionsthatmaybemade.Thepublishermakesnowarranty,expressorimplied,with respecttothematerialcontainedherein. Printedonacid-freepaper SpringerispartofSpringerScience+BusinessMedia(www.springer.com) Advisory Panel CarloBatini Universita`degliStudidiMilano-Bicocca,Milano,Italy YangLee NorthEasternUniversity,Boston,MA,USA ChenLi UniversityofCaliforniaIrvine,Irvine,CA,USA TamerOzsu UniversityofWaterloo,Waterloo,ON,Canada FelixNaumann HassoPlattnerInstitute,Potsdam,Germany BarbaraPernici PolitecnicodiMilano,Milano,Italy ThomasRedman NavesinkConsultingGroup,Rumson,NJ,USA DiveshSrivastava AT&TLabs-Research,FlorhamPark,NJ,USA JohnTalburt UniversityofArkansasatLittleRock,LittleRock,AR,USA XiaofangZhou TheUniversityofQueensland,Brisbane,Australia v To mylatefather,AzizBeg historian,authoranddiplomat Preface The impact of data quality on the information chain has been widely recognized sincetheonsetoflarge-scaledataprocessing.Furthermore,recentyearshaveseen aremarkablechangeinthenatureandusageofdataitselfduetothesheervolume ofdata,highaccessibilityleadingtounprecedenteddistributionandsharingofdata, andlackofmatchbetweentheintentionofdatacreationanditssubsequentusage,to nameafew.Theimportanceoftheunderstandingandmanagementofdataquality forindividuals,groups,organizations,andgovernmenthasthusincreasedmultifold. The data (and information) quality domain is supported by several decades of high quality research contributions and commercial innovations. Research and practiceindataandinformationqualityischaracterizedbymethodologicalaswell astopicaldiversities.Thecross-disciplinarynatureofdataqualityproblemsaswell as a strong focus on solutions based on the fitness for use principle has further diversified the related body of knowledge. Although research pluralism is highly warranted, there is evidence that substantial developments in the past have been isolationist.Asdataqualityincreasesinimportanceandcomplexity,thereisaneed tomotivateexploitationofsynergiesacrossdiverseresearchcommunities. Theabovefactorswarrantamultiprongedapproachtothestudyofdataquality management spanning: organizational aspects, i.e. strategies to establish people, processes, policies, and standards required to manage data quality objectives; architectural aspects, i.e. the technology landscape required to deploy developed processes, standards, and policies; and computational aspects which relate to effectiveandefficienttoolsandtechniquesfordataquality. Despite a significant body of knowledge on data quality, the community is lacking a resource that provides a consolidated coverage of data quality over the threedifferentaspects.Thisgapmotivatedmetoassembleapointofreferencethat reflectsthefullscopeofdataqualityresearchandpractice. In the first chapter of this handbook, I provide a detailed analysis of the data quality body of knowledge and present the rationale and approach for this handbook,particularlyhighlightingtheneedforcross-fertilizationwithinandacross research and practitioner communities. This handbook is then accordingly struc- tured into three parts representing contributions on organizational, architectural, ix