ebook img

Data Cleaning. A Practical Perspective PDF

72 Pages·2013·0.683 MB·english
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Data Cleaning. A Practical Perspective

Data Cleaning APracticalPerspective Venkatesh Ganti AlationInc. Anish Das Sarma GoogleInc. SYNTHESISLECTURESONDATAMANAGEMENT#36 M &C Morgan &cLaypool publishers Copyright©2013byMorgan&Claypool DataCleaning:APracticalPerspective VenkateshGantiandAnishDasSarma www.morganclaypool.com ISBN:9781608456772 paperback ISBN:9781608456789 ebook DOI10.2200/S00523ED1V01Y201307DTM036 APublicationintheMorgan&ClaypoolPublishersseries SYNTHESISLECTURESONDATAMANAGEMENT Lecture#36 SeriesEditor:M.TamerÖzsu,UniversityofWaterloo SeriesISSN SynthesisLecturesonDataManagement Print2153-5418 Electronic2153-5426 ABSTRACT Data warehouses consolidate various activities of a business and often form the backbone for generatingreportsthatsupportimportantbusinessdecisions.Errorsindatatendtocreepinfor a variety of reasons. Some of these reasons include errors during input data collection and er- rors while merging data collected independently across different databases. ese errors in data warehousesoftenresultinerroneousupstreamreports,andcouldimpactbusinessdecisionsneg- atively. erefore, one of the critical challenges while maintaining large data warehouses is that of ensuring the quality of data in the data warehouse remains high. e process of maintaining highdataqualityiscommonlyreferredtoasdatacleaning. Inthisbook,wefirstdiscussthegoalsofdatacleaning.Often,thegoalsofdatacleaningare notwelldefinedandcouldmeandifferentsolutionsindifferentscenarios.Towardclarifyingthese goals,weabstractoutacommonsetofdatacleaningtasksthatoftenneedtobeaddressed.is abstractionallowsustodevelopsolutionsforthesecommondatacleaningtasks.Wethendiscuss a few popular approaches for developing such solutions. In particular, we focus on an operator- centricapproachfordevelopingadatacleaningplatform.eoperator-centricapproachinvolves thedevelopmentofcustomizableoperatorsthatcouldbeusedasbuildingblocksfordeveloping commonsolutions.isissimilartotheapproachofrelationalalgebraforqueryprocessing.e basicsetofoperatorscanbeputtogethertobuildcomplexqueries.Finally,wediscussthedevel- opmentofcustomscriptswhichleveragethebasicdatacleaningoperatorsalongwithrelational operatorstoimplementeffectivesolutionsfordatacleaningtasks. KEYWORDS datacleaning,deduplication,recordmatching,datacleaningscripts,schemamatch- ing, ETL, clustering, record matching, deduplication, data standardization, ETL dataflows,setsimilarityjoin,segmentation,parsing,stringsimilarityfunctions,edit distance, edit similarity, jaccard similarity, cosine similarity, soundex, constrained deduplication,blocking Contents Preface ...........................................................xiii 1 Introduction ....................................................... 1 1.1 EnterpriseDataWarehouse..........................................1 1.2 ComparisonShoppingDatabase ......................................2 1.3 DataCleaningTasks ...............................................2 1.4 RecordMatching ..................................................3 1.5 SchemaMatching .................................................4 1.6 Deduplication.....................................................4 1.7 DataStandardization ...............................................5 1.8 DataProfiling.....................................................6 1.9 FocusofthisBook .................................................6 2 TechnologicalApproaches ............................................ 7 2.1 Domain-SpecificVerticals ...........................................7 2.2 GenericPlatforms .................................................8 2.3 Operator-basedApproach ...........................................8 2.4 GenericDataCleaningOperators.....................................8 2.4.1 SimilarityJoin ...............................................9 2.4.2 Clustering ..................................................9 2.4.3 Parsing....................................................10 2.5 Bibliography .....................................................11 3 SimilarityFunctions ................................................ 13 3.1 EditDistance ....................................................13 3.2 JaccardSimilarity .................................................14 3.3 CosineSimilarity .................................................15 3.4 Soundex ........................................................15 3.5 CombinationsandLearningSimilarityFunctions .......................16 3.6 Bibliography .....................................................16 4 Operator:SimilarityJoin ............................................ 17 4.1 SetSimilarityJoin(SSJoin) .........................................17 4.2 Instantiations ....................................................20 4.2.1 EditDistance ..............................................21 4.2.2 JaccardContainmentandSimilarity.............................22 4.3 ImplementingtheSSJoinOperator...................................23 4.3.1 BasicSSJoinImplementation ..................................24 4.3.2 FilteredSSJoinImplementation ................................25 4.4 Bibliography .....................................................28 5 Operator:Clustering ............................................... 29 5.1 Definitions ......................................................29 5.2 Techniques ......................................................32 5.2.1 HashPartition..............................................32 5.2.2 Graph-basedClustering ......................................33 5.3 Bilbiography .....................................................34 6 Operator:Parsing .................................................. 35 6.1 RegularExpressions ...............................................36 6.2 HiddenMarkovModels............................................36 6.2.1 TrainingHMMs ............................................37 6.2.2 UseofHMMsforParsing ....................................41 6.3 Bibliography .....................................................42 7 Task:RecordMatching ............................................. 43 7.1 SchemaMatching ................................................44 7.2 RecordMatching .................................................45 7.2.1 BipartiteGraphConstruction..................................46 7.2.2 WeightedEdges ............................................46 7.2.3 GraphMatching ............................................48 7.3 Bibliography .....................................................48 8 Task:Deduplication ................................................ 49 8.1 GraphPartitioningApproach .......................................50 8.1.1 GraphConstruction .........................................51 8.1.2 GraphPartitioning ..........................................51 8.2 Merging ........................................................51 8.3 UsingConstraintsforDeduplication..................................52 8.3.1 CandidateSetsofPartitions ...................................53 8.3.2 MaximizingConstraintSatisfaction.............................54 8.4 Blocking ........................................................54 8.5 Bibliography .....................................................55 9 DataCleaningScripts .............................................. 57 9.1 RecordMatchingScripts ...........................................57 9.2 DeduplicationScripts..............................................58 9.3 SupportforScriptDevelopment .....................................59 9.3.1 UserInterfaceforDevelopingScripts............................60 9.3.2 ConfigurableDataCleaningScripts.............................61 9.4 Bibliography .....................................................62 10 Conclusion ....................................................... 63 Bibliography ...................................................... 65 Preface Datacleaningistheprocessofstartingwithrawdatafromoneormoresourcesandmaintaining reliablequalityforyourapplications.Weweremotivatedtowritethisbooksincewefoundagap intechnicalmaterialthatclearlyexplainedthegoalsandcapabilitiesofadatacleaningsolution; ingeneral,datacleaningisusuallythoughtofasasolutionforanindividualproblem.Oneofthe prominentissueswehadwasthattherewasnoguideofferingpracticaladviceonoptionsavailable forbuildingorchoosingadatacleaningsolution.Inthisbook,wefillthisgap. Ourapproachtowardthisbookwastoconceptualizedatacleaningsolutionsasbeingcom- posedoftasksandoperators.Eachsolutionisacompositionofmultiplehigh-leveltasks,andeach taskmayhaveoneormoreoperator-basedsolutions.Inthisbookweelaborateonthemostcom- mon tasks, and their implementations leveraging critical operators. Our book can be seen as a practitioner’s guide to understand the space of options for evaluating or building a good data cleaning solution. We provide an overview of the capabilities required in such a system, which are the set of tasks described in this book. People building complete solutions may use the set of tasks described here, and choose from the space of operators. erefore, this book is ideally suitedforpractitionersofdatacleaningandstudentsinterestedinthetopic.Althoughourbook lists the useful tools, techniques, and pointers, some of them require custom implementations with no open-source components available. erefore, if students or engineers are looking for goodabstractionsforpluginstobuild,wehopethatourbookprovidessomeoptions. Forbeginnersinterestedindatacleaning,wesuggestreadingthematerialsequentiallyfrom thefirstchapter.Advancedreadersmaydirectlyjumptoanyrelevantchapterforreference;each chapterisselfcontainedandprovidesfurtherpointerstoexistingresearch. Weenjoyedwritingthisbookandgainednewinsightsintheprocessthatwe’vesharedin this material. We sincerely wish we had more time, in which case we would have been able to addmoredepthonseveraldirectlyrelatedtopics.Forexample,theuser-interfaceaspectsofdata cleaninghavenotreceiveddueattentioninthisbook. VenkateshGantiandAnishDasSarma September2013 C H A P T E R 1 Introduction Databasesareubiquitousinenterprisesystems,andformthebackboneforsystemskeepingtrack of business transactions and operational data. ey also have become the defacto standard for supportingdataanalysistasksgeneratingreportsindicatingthehealthofthebusinessoperations. ese reports are often critical to track performance as well as to make informed decisions on severalissuesconfrontingabusiness.ereportingfunctionalityhasbecomesoimportantonits ownthatbusinessesoftencreateconsolidateddatarepositories.eserepositoriescanbeobserved inseveralscenariossuchasdatawarehousingforanalysis,aswellasforsupportingsophisticated applicationssuchascomparisonshopping. 1.1 ENTERPRISEDATAWAREHOUSE Datawarehousesarelargedatarepositoriesrecordinginteractionsbetweenvariousentitiesthatan enterprisedealswith:customers,products,geographies,etc.Byconsolidatingmostoftherelevant datadescribingtheinteractionsintoonerepository,datawarehousesfacilitatecannedandadhoc dataanalysisoversuchinteractions. eresultsofsuchanalysisqueriesoftenformthebackboneofseveralcriticalreports,which helpevaluateandmonitorperformanceofvariousbusinessprojects.esereportsmayoftenbe usefulforprioritizingamongvariousbusinessinitiatives.erefore,accuracyofdatainthesedata warehouses is critical. Errors in these databases can result in significant downstream reporting errors.Sometimes,sucherrorscanresultinbaddecisionsbeingtakenbytheexecutives. Errors in data tend to creep in from a variety of sources, say when new sales records are inserted.Forinstance,enterprisesroutinelyobtainresellers’salesinteractionswithcustomersfrom resellers.Dataentryatthepointofsalesisoftenperformedinarushandcausesmanyerrorsin data.Sometimes,theseerrorsareintroducedbecausethesalesagentdoesnottrytofindoutthe correct data, and enters a default or a typical value. So, the data about the customer sent by the resellermaynotmatchwiththecurrentrecordinthedatawarehouse. Alternatively,alargenumberoferrorsareoftenintroducedintothedatawarehousewhen datafromanewsourcedatabaseismergedwithit.Suchdataconsolidationisrequiredwhensales transactionsfromanewdatafeed(say,anOLTPdatabase)areinsertedintothedatawarehouse. Ifsomeofthenewrecordsinboththesourceandtargetdescribethesameentities,thenitisoften possible that the data merger results in several data quality issues because interactions with the sameentityarenowdistributedacrossmultiplerecords. 2 1. INTRODUCTION 1.2 COMPARISONSHOPPINGDATABASE Many popular comparison shopping search engines (e.g., Bing Shopping, Google Products, ShopZilla)arebackedbycomprehensiveproductcatalogandofferdatabasesconsisting,respec- tively, of products and offers from multiple merchants to sell them at specific prices. e cat- alog and offer databases enable a comparison shopping engine to display products relevant to a user’s search query and for each product the offers from various merchants. ese databases are populated and maintained by assimilating feeds from both catalog providers (such as CNet, PriceGrabber)aswellasfrommerchants(e.g.,NewEgg.com,TigerDirect.com).esefeedsare consolidated into a master catalog along with any other information per product received from merchantsorfromothersources.Whenausersearchesforaproductoracategoryofproducts, these comparison shopping sites display a set of top-ranking items for the specific user query. Whenauserisinterestedinaspecificproduct,theuseristhenshownthelistofmerchantsalong withoffersforeachofthem. ese product catalog and merchant feeds are obtained from independently developed databases. erefore, identifiers and descriptions of the same product and those in the corre- sponding offers will very likely be different across each of the input feeds. Reconciling these differencesiscrucialforenablingacompellingusefulcomparisonshoppingexperiencetoauser. Otherwise, information about the same product would be split across multiple records in the mastercatalog.Whicheverrecordisshowntotheuser,theuserisonlyshownapartoftheinfor- mationinthemastercatalogabouttheproduct.erefore,oneofthemaingoalsistomaintaina correctlyconsolidatedmastercatalogwhereeachproductsoldatseveralmerchantshasonlyone representation. Similar data quality issues arise in the context of MasterDataManagement(MDM). e goal of an MDM system is to maintain a unified view of non-transactional data entities (e.g., customers, products) of an enterprise. Like in the data warehousing and comparison shopping scenarios,thesemasterdatabasesoftengrowthroughincrementalorbatchinsertionofnewen- tities. us, the same issues and challenges of maintaining a high data quality also arise in the contextofmasterdatamanagement. 1.3 DATACLEANINGTASKS Datacleaningisanoverloadedterm,andisoftenusedlooselytorefertoavarietyoftasksaimed at improving the quality of data. Often, these tasks may have to be accomplished by stitching together multiple operations. We now discuss some common data cleaning tasks to better un- derstandtheunderlyingoperations.Wenotethatthislistincludescommonlyencounteredtasks, andisnotcomprehensive. 1.4. RECORDMATCHING 3 1.4 RECORDMATCHING Informally,thegoalofrecordmatchingistomatcheachrecordfromasetofrecordswithrecords inanothertable.Often,thistaskneedstobeaccomplishedwhenanewsetofentitiesisimported tothetargetrelationtomakesurethattheinsertiondoesnotintroduceduplicateentitiesinthe targetrelation. Enterprise DataWarehousingScenario: Consider a scenario when a new batch of customer recordsisbeingimportedintoasalesdatabase.Inthisscenario,itisimportanttoverifywhether or not the same customer is represented in both the existing as well as the incoming sets and only retain one record in the final result. Due to representational differences and errors, records inbothbatchescouldbedifferentandmaynotmatchexactlyontheirkeyattributes(e.g.,name and address or the CustomerId). e goal of a record matching task is to identify record pairs, oneineachoftwoinputrelations,whichcorrespondtothesamereal-worldentity.Challengesto be addressed in this task include (i) identification of criteria under which two records represent thesamereal-worldentity,and(ii)efficientcomputationstrategiestodeterminesuchpairsover largeinputrelations. Table1.1: Twosetsofcustomerrecords ID Name Street City Phone r1 SweetlegalInvestmentsInc 202North Redmond 425-444-5555 r2 ABCGroceriesCorp AmphitheatrePkwy MountainView 4081112222 r3 Cabletelevisionservices OneOxfordDr Cambridge 617-123-4567 s1 SweetlegalInvesmentsIncorporated 202N Redmond s2 ABCGroceriesCorpn. AmphitheetreParkway MountainView s3 CableServices OneOxfordDr Cambridge 6171234567 Comparison Shopping Scenario: Recall the comparison shopping scenario, where the target comparison shopping site maintains a master catalog of products. Suppose a merchant sends a new feed of products, as shown in Table 1.2. Each of these products has to be matched with a targetinthemaster,orifthereisnosuchmatchingproduct,additasanewproducttothemaster catalog. Ideally,themerchantcouldalsosendauniqueidentifierthatmatchesaglobalidentifierin the master catalog. In the case of books,ISBN is an identifier that everyoneagrees to and uses. However, in other categories of products, there is no such global identifier that can be used for matching. e main challenge here is that the description often used by the merchant may not match with the description at the target comparison shopping site. Hence, matching products “correctly”requiresseveralchallengestobeaddressed. e hardness is further exacerbated in the case of products where the underlying product descriptionisoftenaconcatenationofseveralattributevalues.eindividualvaluesmaythem-

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.