ebook img

Cleansing/Matching Data Using the SAS Data Quality Toolset PDF

22 Pages·2015·1.72 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Cleansing/Matching Data Using the SAS Data Quality Toolset

Cleansing/Matching Data Using the SAS Data Quality Toolset Nigel Light, Data Governance and Quality Analyst Ecclesiastical Insurance © Ecclesiastical Insurance Office plc 2011 Ecclesiastical Insurance – niches where we conduct business • Charitable Insurer with over 125 years of experience (aim of giving £50M to charity in next 3 years) • Insure a diverse mix of organisations and risks aligned to our core values © Ecclesiastical Insurance Office plc 2011 Data Quality tools – why buy one? Data Quality doesn’t need a toolset. Without one, using standard database/spreadsheet software, you can : • Match and deduplicate data • Write code to • Profile data • Reformat data • Identify where data does not match expected patterns or values But, with a data quality toolset, this is all generally available ‘out of the box’ (and there are often additional features too) © Ecclesiastical Insurance Office plc 2011 Cleansing data using SAS Dataflux – the basics Things that a Data Quality tool, such as SAS, can do…. • Parsing • ie Breaking out a string of data into its standard elements Eg Address string “123 Brookside Close, Henleze, Bristol, BS2 7BJ” Can be ‘parsed’ into House Number : 123 individual data elements Street : Brookside Close Address line 2 : Henleze City : Bristol Postcode : BS2 7BJ © Ecclesiastical Insurance Office plc 2011 Cleansing data using SAS – the basics Also…. • Standardisation • Putting data into a defined standard form (for the defined data element) Eg1 Phone number 07891425687 standardised to (07891) 425867 ie defined 01452 678923 (01452) 678923 standard 078 123 98756 (07812) 398756 form Eg2 Suffix Ltd, Lmtd, Ltd., Limited standardised to Ltd. Eg3 Name Bob, Bobby, Rob, Robert etc standardised to Robert © Ecclesiastical Insurance Office plc 2011 Cleansing data using SAS– the basics As well it can…. • Pattern identification of invalid data items • Identification on inappropriate or incomplete field contents based on defined element values Eg Postcode must be one of 6 formats XX99 9XX, XX9 9XX etc • A postcode of eg G19L 9P2 would be recognised as invalid. • Ditto GL21 3 (incomplete) NB It cannot identify invalid entries eg postcodes, of valid format, which do not exist (need to match to a reference dataset eg Royal Mail postcode file) © Ecclesiastical Insurance Office plc 2011 Cleansing data using SAS– the basics And…. • Profile data • Gain an understanding and insight of data from a specified source Eg For a particular field • What are the 5 most common values? The 5 least common? • What is the maximum, minimum values? • What is the type of data in the field (eg alphabetic?, numeric? date?) • What is the longest/shortest value? (alphabetic field) • What is the average value? (numeric field) etc © Ecclesiastical Insurance Office plc 2011 Cleansing data using SAS – the basics How? • SAS - Quality Knowledge Base (QKB) • Set of pre-defined ‘out-of-the-box’ templates • Target specific • Location specific • Can also define you our own values eg to accommodate clergy salutations The Most Reverend and Right Honourable the Lord Archbishop of Canterbury (www.crockford.org.uk) © Ecclesiastical Insurance Office plc 2011 Cleansing data using SAS – the basics BUT • SAS Data Quality cannot do ‘magic’ and fix all data problems • Data needs to be of a certain ‘standard’ for the QKB to work satisfactorily • Eg A phone number entry of ‘Ext 2378’ cannot be standardised • The UK QKB may also find certain Eastern European and Asian name standardisation hard © Ecclesiastical Insurance Office plc 2011 Cleansing data using SAS Data Quality – grouping and deduplication Grouping De-duplication © Ecclesiastical Insurance Office plc 2011

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.