
Data Mining: Exploring the Data PDF

22 Pages·1997·0.365 MB·English


TECH TOPIC 6
DATA MINING: EXPLORING THE DATA, PART 2
by W.H. Inmon
© 1997 William H. Inmon, all rights reserved

[This Tech Topic is part 2 in the series of Tech Topics on data mining and data exploration. It is assumed that the reader has already read Part 1.]

Data Correlation

The basis for data mining and data exploration is the correlation of data. When data can be correlated both mathematically and on a business basis, assumptions can be made and the basis for commercial exploitation is formed. The groundwork for correlation is the mathematical relationship between two or more variables. There is a wide variety of correlations of data. Figure 2.1 shows some simple types.

[Figure 2.1: different strengths of correlation: a perfect correlation between two variables, a strong correlation between two variables, and a weak correlation between two variables]

The first correlation in Figure 2.1 is a perfect correlation of data. In this case, for every occurrence of A there is an occurrence of B, and vice versa. Such a correlation seldom occurs, but when it does, it forms a very sound basis for exploitation.

A more normal case is a strong correlation of data, represented by the second set of data in Figure 2.1. In most cases, where there is an A there is also a B. But in a few cases A will exist where there is no B, and B will sometimes exist where there is no A. This correlation is fairly common and also forms a sound basis for further exploitation.

The third correlation shown in Figure 2.1 is a weak correlation. In some cases where there is an A there will also be a B, and in some cases where there is a B there will also be an A. But in many cases A and B will exist independently. As shown, the weak correlation between A and B does not form a particularly good case for commercial exploitation. And, if that is all there is, then nothing more can be done.

But in a way, weak correlations are the most interesting of all. Their allure is that they may point to important trends that are as yet undiscovered and, because they are undiscovered, can lead to opportunities for exploitation that are as yet unknown. Two aspects of weak correlations are therefore worth exploring:

- Is the correlation growing more significant over time?
- Is the correlation very strong for a subset of either A or B?

If the weak correlation is growing stronger over time, it may be a harbinger of large new trends that are just now developing. If so, there may be massive opportunities for exploitation. Of course, how weak the correlation is and how fast its strength is increasing are both very relevant. If a correlation is increasing in strength at glacial speed, it will be very difficult to exploit. And if the correlation starts out so weak as to be useless, it will be a long time before it becomes strong enough to exploit.

The second case where a weak correlation is of interest is where the correlation is weak for the entire population but quite strong for a subset of the population. In other words, if other characteristics of A can be used to select a subpopulation of A, and if the correlation between A and B is much stronger within that subpopulation, then there will most likely be a significant opportunity for commercial exploitation by targeting the subpopulation of A that strongly correlates to B.

For these reasons, weak correlations are some of the most interesting of all the different ways that data can be correlated. Of course, where there is no correlation of data at all, there is little chance for exploitation through the techniques of data mining.
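The two questions about weak correlations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text names no formula for the strength of an existence-based correlation, so the sketch substitutes the Jaccard index (records with both A and B, divided by records with either), which equals 1.0 exactly when A and B always occur together. The record layout and the "year" and "segment" attributes are hypothetical.

```python
from collections import defaultdict

def cooccurrence_strength(records):
    """Jaccard index of two existence flags: |A and B| / |A or B|.

    Assumed measure, not taken from the text; 1.0 means A and B
    always occur together, 0.0 means they never do.
    """
    both = sum(1 for r in records if r["a"] and r["b"])
    either = sum(1 for r in records if r["a"] or r["b"])
    return both / either if either else 0.0

def strength_by(records, key):
    """Recompute the strength separately for each value of `key`."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {k: cooccurrence_strength(v) for k, v in sorted(groups.items())}

# Hypothetical records: "year" and "segment" are invented attributes.
records = [
    {"year": 1995, "segment": "urban", "a": True,  "b": True},
    {"year": 1995, "segment": "rural", "a": True,  "b": False},
    {"year": 1995, "segment": "rural", "a": False, "b": True},
    {"year": 1995, "segment": "rural", "a": True,  "b": False},
    {"year": 1996, "segment": "urban", "a": True,  "b": True},
    {"year": 1996, "segment": "urban", "a": True,  "b": True},
    {"year": 1996, "segment": "rural", "a": False, "b": True},
    {"year": 1996, "segment": "rural", "a": True,  "b": True},
]

overall = cooccurrence_strength(records)      # weak for the whole population
by_year = strength_by(records, "year")        # is it growing over time?
by_segment = strength_by(records, "segment")  # is it strong for a subset?

print(overall, by_year, by_segment)
```

In this invented data the overall strength is modest, but it rises from one year to the next and is perfect within the "urban" subset, which is the pattern the text suggests looking for.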
There are, of course, types of correlation other than those based on existence criteria. The correlations discussed so far are based on whether two variables exist in the presence of each other. Another very common type of correlation is not based on existence at all, but on values. As an example, suppose A and B always exist in each other's presence. When A has a value greater than 50, B has a value greater than 100; when A has a value greater than 100, B has a value greater than 200; and so forth. The correlation is measured not in terms of the existence of the variables, but in terms of their values compared to each other.

Multivariate Analysis

Of course, there are correlations among more than two variables. Figure 2.2 suggests a simple form of multivariate analysis.

[Figure 2.2: multivariate analysis]

The relationships that can be divined through multivariate analysis can be as interesting to the DSS analyst doing data mining and data exploration as those found through the more common dual variable analysis. When the DSS analyst discovers a multivariate correlation, there is the potential for exploitation just as there is in the case of a correlation between two variables. However, discovering a multivariate correlation in a data warehouse is difficult for two very important reasons:

- the volume of data in the data warehouse makes analysis of multiple variables very difficult, and
- the number of variables is usually so large that it is not clear which combinations of variables should be analyzed together.

In addition, even when a multivariate correlation is discovered, the underlying business case may be spurious.

There is, then, great potential in doing multivariate analysis in a data mining and data exploration effort. However, the DSS analyst should be aware of the obstacles that await.
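A value-based correlation and the "which combination of variables" problem can be illustrated with a small sketch. It assumes the Pearson correlation coefficient as the measure (the text does not name one) and screens every pair of columns as a first step; this is pairwise screening, not a full multivariate method. The column names and values are hypothetical, with B chosen to be exactly twice A.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_pairs(columns):
    """Score every pair of variables and list the strongest pairs first."""
    scored = [
        (abs(pearson(columns[a], columns[b])), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(scored, reverse=True)

# Hypothetical variables: B is always twice A, C is unrelated noise.
columns = {
    "A": [40, 60, 90, 120, 150],
    "B": [80, 120, 180, 240, 300],
    "C": [5, 3, 6, 2, 4],
}

ranking = rank_pairs(columns)
for strength, first, second in ranking:
    print(f"{first}-{second}: {strength:.2f}")
```

The number of pairs grows as n(n-1)/2 with the number of variables n, and combinations of three or more variables grow faster still, which is one way to see why the choice of variables to analyze together is a real obstacle in a large warehouse.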
The Spectrum of Correlation

Whether two variables or multiple variables are correlated, the result will be a spectrum of strengths of correlation. Figure 2.3 depicts the spectrum of correlation that lies ahead for the DSS analyst.

[Figure 2.3: there is a large spectrum of correlation between variables, from very strong correlation to no correlation whatsoever]

Corresponding to the spectrum of correlation is the spectrum of opportunity that awaits the DSS analyst who would exploit a correlation. As a general rule, the stronger the correlation, the greater the chance that it is already well known. The greater the chance the correlation is well known, the greater the chance it is already being exploited. In other words, for well known correlations, even though the correlation is strong, the opportunity for exploitation is minimal because the competition is already making use of the relationship. However, if a strong relationship that is not well known develops between two variables, then there is a major opportunity for exploitation.

For correlations that are not so strong there is a real opportunity for exploitation, if the correlation:

- has not been widely discovered,
- is growing in strength, or
- is very strong for an identifiable subset of the population being studied.

The DSS analyst should be aware of the spectrum of strength of correlation and of what opportunities are possible based on the strength of the correlation.

The Business Relationship

Even where a valid mathematical relationship has been discovered, there is no guarantee that this relationship will lead to an opportunity for exploitation. Figure 2.4 shows that, after the mathematical correlation has been established, there must be an analysis of the underlying business relationship.
[Figure 2.4: mathematical relationship versus business relationship. Just because there is a mathematical relationship between two or more variables does not necessarily mean that there is a business relationship. If there is no business relationship, exploitation will be very difficult to do.]

Some of the possibilities in analyzing the underlying business relationship are:

- the correlation is a false positive and has no underlying business relationship whatsoever,
- the correlation is a previously unknown relationship and is ripe for exploitation,
- the correlation has a business relationship at its basis, but the business relationship is well known and is already being exploited to the fullest, and
- the correlation is mathematically valid and has no underlying business relationship, but the relationship is so strong that the correlation will still present opportunities for exploitation.

Each of these circumstances will be discussed.

In the case where there is a false positive and no business relationship whatsoever, it is important that the DSS analyst know, because the analyst can save a large amount of resources by not attempting to exploit an opportunity that is not viable.

In the case where the DSS analyst has discovered a previously unknown correlation and there is a business basis for the relationship, the DSS analyst has a rare and powerful opportunity for exploitation.

In the case where there is both a mathematical relationship and a business basis for the relationship, and where the relationship is well known, it is unlikely that there will be an opportunity for exploitation, simply because the relationship has already been exploited.

In the case where there is a strong mathematical correlation but no apparent business basis, there are plenty of opportunities. One opportunity is to examine the business basis very carefully to discover whether there really is a basis, however subtle.
Where there is a subtle business basis, there will be plenty of opportunity for exploitation because it is unlikely that this business relationship will have been found. Even where there is no discernible business basis for a correlation, if the mathematical relationship is strong enough and consistent enough, there will still be an opportunity for exploitation through the sheer strength of the relationship.

Looking for Correlations

The easiest place to start looking for correlations is in the most obvious places. Consider the scenario shown in Figure 2.5.

[Figure 2.5: a five and dime and a bus. Foot traffic at the five and dime is heavier on rainy days; the starting place to look for correlations is the obvious place.]

Figure 2.5 shows that on rainy days, shops near bus stops get more walk-in traffic. There may not be any purchases, but there will be an abundance of foot traffic. There are many cases where the correlation is obvious: in the summer, a lot of beer is consumed; in the winter, snow chains are frequently sold; in the spring, lawn mowers are popular; and so forth. In the case of obvious correlations, the opportunity lies not so much in the discovery of the correlation as in the creation of novel ways to exploit it.

As an example of a macro analysis of summary data and its use to discover correlations, consider the graph in Figure 2.6.

[Figure 2.6: retail sales by month, January through December, on a scale from $100 mil to $1 bil. Retail sales peak at Christmas time: an obvious pattern.]

Figure 2.6 shows that for a retailer, sales progress throughout the year and peak in December, at the time of the Christmas holidays.
The summary data suggests that when this year's highs are greater than last year's highs and this year's lows are less than last year's lows, there might be some interesting correlations of data that could be observed.

At an even more macro level, consider the profitability over a long period of time of several insurance companies, as seen in Figure 2.7.

[Figure 2.7: comparing the profitability of insurance companies A, B, C, and D over time, 1987 through 1996, on a scale from -50 mil to 100 mil. How have they all been alike? How have they been different?]

In Figure 2.7, the profitability of four insurance companies is shown over a decade. The obvious points of interest (and the most likely places to look for interesting correlations) are:

- points where one company falls lower than all other companies or where one company soars higher than other companies, and
- points where one company is operating on a different trend line than other companies.

These macro observations lead to places where productive correlations can be found.

Finding Interesting Correlations at the Micro Level

Looking at macro indicators is a good way to find where in the grand scheme of things interesting correlations will reside, but analysis at the macro level can never get to the level of detail that will satisfy the DSS analyst. Analysis must proceed to the micro level in order to be productive. Figure 2.8 shows an accumulation of a large number of sales records.
[Figure 2.8: find all sales amounts greater than 10.00. Sale amounts shown:]

2.98, 4.65, 4.98, 5.56, 3.09, 1.71, 65.98, 3.65, 3.22, 3.09,
3.37, 3.09, 3.87, 2.87, 2.01, 5.92, 1.87, 2.86, 48.76, 3.97,
3.97, 4.46, 4.92, 2.19, 3.97, 2.87, 2.72, 3.31, 17.29, 3.19,
2.76, 78.32, 4.16, 1.29, 2.33, 2.37, 1.87, 3.54, 3.84, 3.97,
3.97, 2.18, 3.75, 4.33, 1.76, 1.79, 1.79, 1.98, 4.97, 4.72,
3.77, 3.29, 3.28, 2.07, 5.01, 3.34, 1.22, 1.56, 2.38, .98,
2.98, 1.32, 4.42, 2.87, 1.29, 1.62, 4.65, 3.37, 3.20, 2.97,
3.41, 3.82, 3.97, 3.97, 3.97, 3.97, 1.87, 28.27, 2.86, 1.45,
1.53, 1.72, 1.75, .93, 1.86, 2.22, 3.66, 3.97, 2.96, 1.67,
2.97, 8.42, 1.10, 2.73, 3.19, 2.97

One of the ways to find interesting correlations is to ask: where are the exceptions to the norm? In the case of the data in Figure 2.8, the DSS analyst has homed in on all sales greater than $10.00. There are five sales greater than $10.00.
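The exception query can be sketched as a simple filter. This is a minimal illustration, not a description of any particular DSS tool; the function name is hypothetical, and the list holds only the first twenty sale amounts from Figure 2.8, so it contains two of the five large sales.

```python
def exceptions_above(amounts, threshold):
    """Return the amounts that exceed the threshold, in their original order."""
    return [amt for amt in amounts if amt > threshold]

# First twenty sale amounts from Figure 2.8 (a sample, not the full figure).
sample_sales = [
    2.98, 4.65, 4.98, 5.56, 3.09, 1.71, 65.98, 3.65, 3.22, 3.09,
    3.37, 3.09, 3.87, 2.87, 2.01, 5.92, 1.87, 2.86, 48.76, 3.97,
]

large_sales = exceptions_above(sample_sales, 10.00)
print(large_sales)  # [65.98, 48.76]
```

In practice each amount would be part of a fuller sales record, so that the exceptional rows can be examined for what else they have in common.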
In fact, the sales greater than $10.00 are so much larger than all the other sales that there must be something interesting about them. The DSS analyst is tipped off to look at such things as:

- What items were purchased?
- Were the items purchased as a group?
- Who purchased the items?
- When were the items purchased?

These questions may not lead to any interesting observations, but there is something unusual about the purchases greater than $10.00. Discovering what other data correlates to the sale amount is a productive place to start the analysis of the data.

In the same vein, finding the smallest and the largest sales may well lead to productive results. Figure 2.9 shows this simple analysis.

[Figure 2.9: find the largest and the smallest (same sale amounts as Figure 2.8)]

And along the same line, looking for the five largest (or the five smallest) may be productive. Figure 2.10 shows this criterion.

[Figure 2.10: find the five largest (same sale amounts as Figure 2.8)]

Yet another way to analyze the sales data, to understand it and to determine where there might be interesting correlations, is to create a profile of the data. Certainly averages and medians (i.e., midpoints) can be calculated, and those numbers are meaningful. But another meaningful way to characterize the data is in terms of a "profile." Figure 2.11 depicts a simple profile of the sales data.

[Figure 2.11: create a profile (same sale amounts as Figure 2.8)]

from  .00 to  .99     2
from 1.00 to 1.99    21
from 2.00 to 2.99    23
from 3.00 to 3.99    31
from 4.00 to 4.99    10
from 5.00 to 5.99     3
from 6.00 to 6.99     0
greater than 6.99     6

The profile is useful for determining whether anything unusual is happening to the data. Within the population of the data itself there may be hidden trends and ratios. The profile in Figure 2.11 shows that the vast majority of sales are between $1.00 and $4.00. Anything outside of that range is an anomaly. The sales form a slightly skewed bell curve around the $3.00 mark. The purpose of a profile is to characterize the masses of data from a perspective that is not immediately obvious from examining the details of the data directly.

Still another way to do a detailed analysis of data is to use a scatter chart. Figure 2.12 shows a simple scatter chart.

[Figure 2.12: correlating the shelf time of a product against the cost of the item at final sale. Scatter chart with days on the vertical axis and dollars on the horizontal axis. Data shown as shelf time in days, cost:]

5, 10.99; 4, 12.67; 7, 16.82; 1, .75; 1, 1.19; 3, 5.98;
20, 89.95; 22, 65.98; 24, 4.82; 3, 1.75; 2, 2.21; 4, 3.56;
21, 65.00; 27, 90.00; 18, 156.33; 10, 15.95; 10, 16.00; 13, 14.98;
3, 59.99; 32, 65.98; 27, 129.34; 35, 5.95; 4, 6.95; 14, 4.21;
13, 3.75; 8, 3.79; 6, 5.88; 5, 2.99; 3, 2.76; 9, 6.98;
18, 3.76; 6, 3.87; 3, 1.29; 16, 17.96; 13, 19.44; 18, 89.99;
17, 2.87; 1, 1.77; 2, 2.86; 8, 8.95; 3, 2.61; 3, 12.87;
10, 17.75; 11, 49.95; 15, 56.99; 12, 98.00; 23, 97.34; 19, 65.49;
6, 6.97; 3, 8.97; 4, 7.99; 4, 2.99; 2, 4.55; 1, 5.98

In Figure 2.12, the shelf time of an item that has been sold is correlated with the price of the item. Two noticeable trends emerge from the scatter chart. Figure 2.13 shows those two trends.

[Figure 2.13: identifying the major trends using a scatter chart]
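The simpler micro-level queries from Figures 2.9 through 2.11 (the extremes, the five largest, and the profile) can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions: the function name and bucket labels are hypothetical, amounts are assumed to be non-negative, and the list holds only the first twenty sale amounts from Figure 2.8, so the counts differ from the full profile in Figure 2.11.

```python
import heapq

def profile(amounts, top=7):
    """Count amounts per one-dollar bucket.

    Amounts of `top` dollars or more share a single overflow bucket,
    mirroring the "greater than 6.99" row of the profile.
    Assumes non-negative amounts.
    """
    counts = {f"{d}.00 to {d}.99": 0 for d in range(top)}
    overflow = f"greater than {top - 1}.99"
    counts[overflow] = 0
    for amt in amounts:
        dollars = int(amt)  # whole-dollar part selects the bucket
        key = f"{dollars}.00 to {dollars}.99" if dollars < top else overflow
        counts[key] += 1
    return counts

# First twenty sale amounts from Figure 2.8 (a sample, not the full figure).
sample_sales = [
    2.98, 4.65, 4.98, 5.56, 3.09, 1.71, 65.98, 3.65, 3.22, 3.09,
    3.37, 3.09, 3.87, 2.87, 2.01, 5.92, 1.87, 2.86, 48.76, 3.97,
]

smallest, largest = min(sample_sales), max(sample_sales)
five_largest = heapq.nlargest(5, sample_sales)
sales_profile = profile(sample_sales)

print(smallest, largest)
print(five_largest)
for bucket, count in sales_profile.items():
    print(f"{bucket:>18} {count:3d}")
```

Even on this small sample the profile has the shape the text describes: most sales cluster around $3.00, with a few far outside the usual range.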
