Enterprise Knowledge Management: The Data Quality Approach

491 Pages·2001·6.484 MB·English
PREFACE

While data quality problems are widespread, it is rare for an event to take place that provides a high-profile example of how questionable information quality can have a worldwide business effect. The 2000 US Presidential election and the subsequent confusion around the Florida recount highlight the business need for high-quality data. The winner of this election is in a position to influence economies around the world. The uncertainty associated with the lack of a clear winner had an immediate effect the day after the election, when stock prices plummeted. Whether it is unintuitive data presentation, questions about the way information is aggregated, or the method by which information policy regulates the use of data, valuable lessons regarding at least six data quality issues can be learned from the election.

POOR DATA REPRESENTATION

A poor decision with respect to data presentation resulted in voter confusion. The use of the butterfly ballot in Palm Beach County, FL is an example of how the presentation of information did not correlate to users' expectations, leading to a large number of voting errors. In fact, many voters later claimed that they were dismayed to learn they may have voted for Pat Buchanan, a right-wing conservative candidate, instead of Al Gore, the Democratic party candidate.

DATA VALIDATION

With no built-in mechanism to validate the data before it enters the system, the use of punch cards and the "butterfly ballot" leads to problems with vote validation. When using a punch card ballot (which, according to the LA Times, was used by more than 37 percent of registered voters nationwide in 1996), the voter selects a candidate by poking out the chad, the perforated section that should be ejected when the hole is punctured. The cards are read by a tabulation machine, which counts a vote when it reads the hole in the card. The validation issue occurs when the chad is not completely ejected.
The automated tabulation of both "hanging chads" (chads that are still partially attached) and "pregnant chads" (chads that are bulging but not punched out) is questionable, and so it is not clear whether all votes are counted. What constitutes a valid vote selection is primarily based on whether the tabulation machine can read the card. In the case of recounts, the cards are passed through the reader multiple times. In that process some of the hanging chads are shaken free, which leads to different tallies after each recount. In addition, if someone mistakenly punches out more than one selection, the vote is automatically nullified. It is claimed that 19,000 ballots were disqualified because more than one vote for president had been made on a single ballot. This is an example where a policy to pre-qualify the ballot before it is sent to be counted could be instituted. Since the rules for what constitutes a valid vote are well described, it should be possible to have a machine evaluate the punch card to determine whether it is valid or not, and notify the voter that the ballot would be invalidated before it is cast.

INVALID ANALYTICAL MODELS

For many people on election eve, it is customary to sit in front of their televisions and watch as their favorite newscasters predict the allocation of electoral votes. These predictions are based on the results of exit polls and election results provided by an organization called the Voter News Service (VNS), which is jointly owned by a coalition of news companies in order to cut the cost of data collection. Typically, the VNS feeds both vote counts and winner predictions to all the news media simultaneously, which is why all the different broadcasters seem to predict the winners at around the same time. In the case of the 2000 election, the networks were led to predict the winner of Florida incorrectly, not just once, but twice.
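The ballot pre-qualification policy described in the DATA VALIDATION discussion above can be sketched in a few lines of code. This is a hypothetical illustration, not any real voting system; the only rule encoded is the one stated in the text, that exactly one fully punched selection per office is valid.

```python
# Hypothetical sketch of ballot pre-qualification: before the ballot is
# cast, a machine applies the well-described validity rule (exactly one
# selection per office) and warns the voter, instead of a tabulator
# silently disqualifying the vote later.

def prequalify_office(punched, candidates):
    """Check one office on a punch-card ballot.

    punched    -- set of chad positions read as fully punched
    candidates -- set of chad positions mapped to candidates for this office
    """
    selections = punched & candidates
    if not selections:
        return False, "undervote: no selection detected (hanging chad?)"
    if len(selections) > 1:
        return False, "overvote: ballot would be disqualified"
    return True, "valid"
```

Under such a policy, a voter who punched two presidential candidates would be warned on the spot rather than contributing to the 19,000 disqualified ballots mentioned above.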
The first error occurred because predicting elections is based on statistical models generated from past voting behavior that (1) were designed to catch vote swings an order of magnitude greater than the actual (almost final) tallies and (2) did not take changes in demographics into account. This meant that the prediction of Gore's winning Florida was retracted about two hours after it was made.

CONFLICTING DATA SOURCES

By 2:00 A.M. the VNS (and consequently, the reporting organizations) switched their allocation of Florida's electoral votes from Gore to Bush, and declared Bush to have enough electoral votes to win the election. However, a second retraction occurred when actual vote tallies were disputed. While the VNS report indicated that Bush led in Florida by 29,000 votes, information posted on the Florida Board of Elections web site indicated that Bush's lead was closer to 500 votes, with the gap narrowing quickly. A computer glitch in Volusia County led to an overestimation of Bush's total by more than 25,000 votes.

EXPECTATION OF ERROR

According to Title IX, Chapter 102 of Florida law, "if the returns for any office reflect that a candidate was defeated or eliminated by one-half of a percent or less of the votes cast for such office...the board responsible for certifying the results of the vote...shall order a recount of the votes..." This section of the law contains the data-accuracy implication that there is an expected margin of error of one-half of one percent of the votes. The automatic recount is a good example where the threshold for potential error is recognized and where there is defined governance associated with a data quality problem.

TIMELINESS

Timeliness is an important aspect of information quality. We have come to expect that the results of a national election are decided by the time we wake up the next day.
Even in a close election when the results are inconclusive, there are timeliness constraints for the reporting and certification of votes.

DATA QUALITY ON THE PERSONAL LEVEL

My personal connection to data quality began at a very early age, although I did not realize the connection until recently. My parents, through family coercion, decided to give me the first name "Howard," although they had wanted to give me the name "David." So David became my middle name, but even though they named me Howard, I have always been called, by my parents and, subsequently, everyone else, David. Hence the source of the problem. This fixed-field, database-oriented world is ill prepared to deal with a person who is called by his middle name. Everywhere you go, you are asked for your first name and middle initial. So, officially, I am Howard, but I always go by the name David. At school, at camp, at college, at work, filling out applications, opening bank accounts, filing tax returns, and so on, I fill out my name in its official form. People try to bend the rules: "Put down David as your first name and H as your middle initial." "Just use David." "Scratch out first name and change it to middle name." Unfortunately, these ideas are too radical for the poor data entry people, so I either end up as David H. Loshin or I am stuck with being called Howard. This really happened: At a doctor's office recently, the receptionist insisted that if my name were officially Howard D. Loshin on my insurance card, they were only going to call me Howard. I have three different credit cards, all with different names on them. Not only that, my last name, Loshin, sounds the same as "lotion," and I find my last name is consistently misspelled: Lotion, Loshen, Loshion, Loshian. When it comes to how my mail is addressed, I never know what to expect, except for one thing: I will get at least two of every direct marketing sales letter.
Despite my inherited data quality connection, it was not until a few years ago that I found new motivation in data quality. I was associated with a securities processing group at a large financial services company that was analyzing its accounts database. What was interesting was that up to that point they had considered their accounts as just that: accounts. Over a period of time, however, some people there became convinced of the benefits of looking at the people associated with those accounts as customers, and a new project was born that would turn the accounts database inside out. My role in that project was to interpret the different information paradigms that appeared in the accounts database name and address field. For it turned out that a single customer might be associated with many different accounts, in many different roles: as an owner, a trustee, an investment advisor, and so forth. I learned two very interesting things from this project. The first was that the knowledge that can be learned from combining multiple databases is much greater than the sum total of what can be learned by analyzing the databases individually. The second was the realization that the problems I saw at this organization were not limited to this company; in fact, these problems are endemic, not only within the financial industry but in any industry that uses information to run its business. The insight that brought full circle the world of data quality was this: Every business process that uses data has some inherent assumptions and expectations about the data. These assumptions and expectations can be expressed in a formal way, and this formality can expose much more knowledge than simple database schemas and Cobol programs. So I left that company and formed a new company, Knowledge Integrity Incorporated (www.knowledge-integrity.com), whose purpose is to understand, expose, and correct data quality problems.
Our goal is to create a framework for evaluating the impacts that can be caused by low data quality, to assess the state of data quality within an enterprise, to collect the assumptions and expectations about the data that is used, and to recast those assumptions and expectations as a set of data quality and business rules. In turn, these rules are incorporated as the central core of a corporate knowledge management environment, to capture corporate knowledge and manage it as content. This book is the product of that goal. In it, we elaborate on our philosophy and methods for evaluating data quality problems and how we aim to solve them. I believe that the savvy manager understands the importance of high-quality data as a means for increasing business effectiveness and productivity, and this book puts these issues into the proper context. I hope the reader finds this book helpful, and I am certainly interested in hearing about others' experiences. Please feel free to contact me at [email protected] and let me know how your data quality projects are moving along!

I have been collecting what I call "Data Quality Horror Stories" and placing them on our corporate Web site (www.knowledge-integrity.com/horror.htm). If you have any interesting personal experiences, or if you see news stories that demonstrate how poor data quality has serious (or comical) effects, please e-mail them to me at [email protected].

I would like to thank the people who have helped make this book possible. First and foremost, my wife, Jill, and my children, Kira and Jonah, are always there when I need them. Ken Morton acted as acquisitions editor and general enabler. Thomas Park, who took over the project from Ken, was invaluable in helping me get this project completed. I thank Thomas Redman, with whom I worked for a short period of time and consulted on some of the concepts in this book.
Thank you to Mary O'Brien, who read through early drafts of the proposal and was a big supporter. Thanks also go to Bob Shelly, whose experienced eye validated my content, and to the rest of the Morgan Kaufmann staff involved in this project. Sheri Dean and Julio Esperas also provided significant help in the preparation of the book. I also must thank Justin Kestelyn of Intelligent Enterprise magazine, who has vetted some of my ideas by publishing abridged versions of a few chapters. Thanks also go to Dennis Shasha at New York University, who gave me the opportunity to teach this material as a special topic graduate class at the Courant Institute. I also thank my wife's parents, Marty and Phyllis Fingerhut, who are two of my biggest supporters. Last, I want to thank and remember my mother, Betty Loshin, who was a source of inspiration and who passed away earlier this year.

1 INTRODUCTION

Without even realizing it, everyone is affected by poor data quality. Some are affected directly in annoying ways, such as receiving two or three identical mailings from the same sales organization in the same week. Some are affected in less direct ways, such as the 20-minute wait on hold for a customer service department. Some are affected more malevolently through deliberate fraud, such as identity theft. But whenever poor data quality, inconsistencies, and errors bloat both companies and government agencies and hamper their ability to provide the best possible service, everyone suffers.

Data quality seems to be a hazy concept, but the lack of data quality severely hampers the ability of organizations to effectively accumulate and manage enterprise-wide knowledge. The goal of this book is to demonstrate that data quality is not an esoteric notion but something that can be quantified, measured, and improved, all with a strict focus on return on investment.
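To make "quantified and measured" concrete before proceeding, here is a minimal sketch of the kind of rule-based checking developed later in the book. The field names and the two rules are invented for illustration only; the book's own rule formalism is far richer.

```python
# Minimal sketch: expectations about data recast as named, checkable rules.
# Each rule is a (name, predicate) pair applied to every record, so that
# quality can be counted and measured rather than guessed at.

def not_null(field):
    return lambda rec: rec.get(field) not in (None, "")

def in_domain(field, allowed):
    return lambda rec: rec.get(field) in allowed

RULES = [
    ("account_id is present", not_null("account_id")),
    ("role is a known role", in_domain("role", {"owner", "trustee", "advisor"})),
]

def audit(records):
    """Return (record index, rule name) for every rule violation found."""
    return [(i, name)
            for i, rec in enumerate(records)
            for name, check in RULES
            if not check(rec)]
```

The violation count divided by the record count gives a crude but concrete quality score per rule, which is the starting point for the return-on-investment arguments made in later chapters.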
Our approach is that knowledge management is a pillar that must stand securely on a pedestal of data quality, and by the end of this book, the reader should be able to build that pedestal. This book covers these areas.

• Data ownership paradigms
• The definition of data quality
• An economic framework for data quality, including steps in building a return on investment model to justify the costs of a data quality program
• The dimensions of data quality
• Using statistical process control as a tool for measurement
• Data domains and mappings between those domains
• Data quality rules and business rules
• Measurement and current state assessment
• Data quality requirements analysis
• Metadata and policy
• Rules-based processing
• Discovery of metadata and data quality and business rules
• Data cleansing
• Root cause analysis and supplier management
• Data enhancement
• Putting it all into practice

The end of the book summarizes the processes discussed and the steps to building a data quality practice. Before we dive into the technical components, however, it is worthwhile to spend some time looking at some real-world examples for motivation. In the next section, you will see some examples of "data quality horror stories": tales of the adverse effects of poor data quality.

1.1 DATA QUALITY HORROR STORIES

1.1.1 Bank Deposit?

In November of 1998, it was reported by the Associated Press that a New York man allegedly brought a dead deer into a bank in Stamford, Connecticut, because he was upset with the bank's service. Police say the 70-year-old argued with a teller over a clerical mistake with his checking account. Because he was apparently unhappy with the teller, he went home, got the deer carcass, and brought it back to the branch office.

1.1.2 CD Mail Fraud

Here is a news story taken from the Associated Press newswire. The text is printed with permission.
Newark — For four years a Middlesex County man fooled the computer fraud programs at two music-by-mail clubs, using 1,630 aliases to buy music CDs at rates offered only to first-time buyers.

David Russo, 33, of Sayreville, NJ, admitted yesterday that he received 22,260 CDs by making each address (even if it listed the same post office box) different enough to evade fraud-detection computer programs. Among his methods: adding fictitious apartment numbers, unneeded direction abbreviations, and extra punctuation marks. (Emphasis mine)

The scam is believed to be the largest of its kind in the nation, said Assistant U.S. Attorney Scott S. Christie, who prosecuted the case. The introductory offers typically provided nine free CDs with the purchase of one CD at the regular price, plus shipping and handling. Other CDs then had to be purchased later to fulfill club requirements.

Russo paid about $56,000 for CDs, said Paul B. Brickfield, his lawyer, or an average of $2.50 each. He then sold the CDs at flea markets for about $10 each, Brickfield said. Russo pleaded guilty to a single count of mail fraud. He faces about 12 to 18 months in prison and a fine of up to $250,000.

1.1.3 Mars Orbiter

The Mars Climate Orbiter, a key part of NASA's program to explore the planet Mars, vanished in September 1999 after rockets were fired to bring it into orbit of the planet. It was later discovered by an investigative board that NASA engineers failed to convert English measures of rocket thrusts to newtons, the metric measure of rocket force, and that was the root cause of the loss of the spacecraft. The orbiter smashed into the planet instead of reaching a safe orbit. This discrepancy between the two measures, which was relatively small, caused the orbiter to approach Mars at too low an altitude. The result was the loss of a $125 million spacecraft and a significant setback in NASA's ability to explore Mars.
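The CD-club scam in section 1.1.2 worked because the clubs compared addresses as raw strings. The standard countermeasure is to normalize an address before comparing; the sketch below is illustrative only (the noise-stripping rules are invented, not a real postal standard, and production matching engines use far more elaborate rule sets).

```python
import re

# Illustrative sketch: normalize an address so that trivial variations
# (extra punctuation, odd spacing, a fictitious apartment designator)
# collapse to a single comparison key.

def normalize(address):
    s = re.sub(r"[^a-z0-9]+", " ", address.lower()).strip()
    s = re.sub(r"\bapt \S+ ?", "", s)  # drop apartment designators entirely
    return s.strip()

def duplicate_keys(addresses):
    """Map each normalized key to the raw variants that produced it."""
    groups = {}
    for a in addresses:
        groups.setdefault(normalize(a), []).append(a)
    return groups
```

Under this scheme, "123 Main St., Apt. 4B" and "123 Main St Apt 17" both normalize to "123 main st", so the second "first-time buyer" order would be flagged as a repeat of the first.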
1.1.4 Credit Card Woes

After having been a loyal credit card customer for a number of years, I had mistakenly missed a payment when the bill was lost during the move to our new house. I called the customer service department and explained the omission, and they were happy to remove the service charge, provided that I sent in my payment right away, which I did. A few months later, I received a letter indicating that "immediate action" was required. Evidently, I had a balance due of $0.00, and because of that, the company had decided to revoke my charging privileges! Not only that, I was being reported to credit agencies as being delinquent.

Needless to say, this was ridiculous, and after some intense conversations with a number of people in the customer service department, they agreed to mark my account as being paid in full. They notified the credit reporting agencies that I was not, and never had been, delinquent on the account (see Figure 1.1).

1.1.5 Open or Closed Account?

Three months after canceling my cellular telephone service, I continue to receive bills from my former service provider indicating that I was being billed for $0.00: "Do not remit."

1.1.6 Business Credit Card

A friend of mine is the president of a small home-based business. He received an offer from a major charge card company for a corporate charge card with no annual fee. He accepted, and a short time later, he received his card in the mail. Not long after that, he began to receive the same offer from the same company, but those offers were addressed differently. Evidently, his name had been misspelled on one of his magazine subscriptions, and that version had been submitted to the credit card company as a different individual. Not only that, his wife started to receive offers too.
Six months later, this man still gets four or five mail offers per week from the same company, which evidently not only cannot figure out who he is but also can't recognize that he is already a customer!

1.1.7 Direct Marketing

One would imagine that if any business might have the issue of data quality on top of its list, it would be the direct marketing industry. Yet, I
