
DTIC ADA281793: A Combined Biased-Robust Estimator for Dealing with Influence and Collinearity in Regression PDF

29 Pages·1.5 MB·English


Report Documentation Page (Standard Form 298, Rev. 2-89). AD-A281 793
Performing organization: AFIT Student Attending: [illegible]. Report number: AFIT/CI/CIA-
Sponsoring/monitoring agency: Department of the Air Force, AFIT/CI, 2950 P Street, Wright-Patterson AFB OH 45433-7765
Distribution/availability statement: Approved for Public Release IAW 190-1. Distribution Unlimited. Michael M. Bricker, SMSgt, USAF, Chief Administration

Research Proposal on
A Combined Biased-Robust Estimator for Dealing with Influence and Collinearity in Regression

James R. Simpson
Arizona State University

THE PROBLEM

Introduction and Background of the Problem

Regression analysis is a statistical tool that has earned widespread use in nearly every field that seeks to fit a model to a set of data. Although there are several methods of estimating the model parameters, the least squares method is used most often because of its general acceptance, elegant statistical properties and ease of computation. Unfortunately, the mathematical elegance that makes least squares so popular depends on a number of fairly strong and often unrealistic assumptions.
The assumption that makes least squares so attractive for hypothesis testing and confidence intervals on the parameter estimates is that the errors are normally (Gaussian) distributed. This assumption can be violated if one or more sufficiently outlying observations are present in the data, resulting in less than optimal estimates of the parameters. The second problem that can ruin the accuracy of least squares estimates is correlated regressors. Highly correlated regressors can cause large variances in the coefficient estimates, sometimes resulting in coefficients of the wrong magnitude or even the wrong sign.

Outliers occur often in real data and for many reasons, including typing or computation errors, interchanged values, inadvertent observations from different populations and transient effects. Outliers can also be due to genuinely long-tailed distributions. Hampel et al. (1986) summarized the results of numerous studies of the frequency of outliers in real data and concluded that, altogether, 1-10% outliers in routine data are the rule rather than the exception. Outliers can be found in the response variable (y-variable) or the regressor variables (x-variables). Regardless of the origin, a single, sufficiently outlying observation in a data set can render least squares estimation useless.

Robust estimation methods can deal with outliers relatively easily. Ronchetti (1987) points out that the goal of a robust selection procedure is to choose a model which fits the majority of the data, taking into account that the errors may not be normally distributed. A number of robust regression estimation techniques have been proposed and some have been successfully used in practice.

Often when fitting a model to data, analysts find that some of the regressor variables are highly correlated with each other.
This condition, known as multicollinearity, can have detrimental effects on the least squares estimates of the coefficients. In general, multicollinearity tends to inflate the variance and absolute value of the least squares coefficients. In this case, the main problem with the least squares estimate is the restriction that the estimator be unbiased. Alternative estimation techniques accept a small amount of bias in exchange for large reductions in the variance of the estimates. Biased estimation methods, such as ridge regression, can provide stable coefficient estimates with computational ease.

Outliers and multicollinearity occur simultaneously in real data almost as often as each problem occurs separately. Relative to the amount of research in biased-only and robust-only techniques, the research in biased-robust regression has been sparse. Most of the advances in this area have been made in the last two decades by Holland (1973), Pariente and Welsch (1977), Hogg (1979), Askin and Montgomery (1980), Montgomery and Askin (1981), Pfaffenberger and Dielman (1984), Lawrence and Marsh (1984), Walker and Birch (1985, 1988), and Walker (1987). Askin and Montgomery (1984) and Pfaffenberger and Dielman (1990) have followed up the development of their techniques by performing Monte Carlo simulation studies to compare various approaches. The most common approach to biased-robust estimation is augmented weighted least squares, which allows a biased estimator and a robust estimator to be combined into a single biased-robust estimator. Many of the existing robust estimators can be easily combined with biased estimators using the augmented weighted least squares approach. In fact, several of the recently created biased-only and robust-only estimators are excellent candidates for an improved combined estimator.

Statement of the problem

Frequently, difficulties arise when practitioners try to apply appropriate regression estimation techniques.
The traditional view that least squares is robust to deviations (even gross ones) from the assumptions of normality and uncorrelated regressors discourages users from applying other methods. In instances where the model adequacy diagnostics reveal a poor least squares fit due to outliers and collinearity, the practitioner is often unable to fit a model properly because biased-robust estimation techniques are not known or available. The increasing presence of observational data with correlated regressors and abundant outliers makes advances in the state of the art of biased-robust estimation imperative. Although progress continues, there is a growing need for users to have tools available when least squares fails. A need exists to develop and test alternative approaches to the combined problem so that the community of practitioners is aware of the potential to accurately estimate regression model terms. The most recent advancements in robust-only and biased-only estimation warrant development of combined biased-robust estimators.

Research Objective

The objective of this research is to develop a biased-robust regression estimator and determine how the method performs in the presence of nonnormal errors (outliers) and multicollinear regressor variables. To accomplish this major objective, a number of investigative questions must be answered. The sub-questions listed below are elements of the major objective and will guide the details of the research effort.

I. How will the biased-robust estimator be developed?
   A. What characteristics are required of the two classes of estimators (robust and biased) in order to take a robust estimator and a biased estimator and form a combined biased-robust estimator? Specifically, for each class of estimator:
      1. What are the strengths and weaknesses associated with the available techniques?
      2. What are the properties most desirable in an estimator?
      3. Which estimator is the best relative to the desirable properties?
   B. What characteristics are required for the biased-robust estimator?
      1. What are the properties most desirable for the combined estimator?
      2. Which estimator is the best relative to the desirable properties?
      3. What are the challenges associated with combining the estimators?
II. Which estimators should be used for comparison in the performance test?
III. How will each of these estimators be computed?
   A. Is software available that generates some of the chosen estimators?
   B. Which estimators require coding?
   C. Which programming language is most appropriate to code the remaining estimators?
IV. How will the Monte Carlo simulation be developed to compare the estimators?
   A. What characteristics of the data are important to vary in the simulation?
   B. What type of design will be used in this experiment?
   C. How will the data be generated?
V. What criteria will be used to measure the performance of the biased-robust estimators?
   A. What performance indices are important?
   B. What measures can be calculated based on the simulation results?

Scope and limitations of the study

* A subset of the most promising robust and biased estimation techniques will be modeled and compared.
* Monte Carlo simulation will be used to compare the techniques. A designed experiment will be developed to test the estimation techniques in the presence of a number of different types of data.
* The primary purpose of this study is not only to identify an estimator with superior statistical properties. Certain statistical properties, such as a high breakdown point, are important and will be treated accordingly. In addition, estimators that have some asymptotic distributional properties are preferred because parametric tests of hypothesis can be performed. Of equal importance, though, is the method's ability to accurately estimate the model coefficients.
Overall assessments will be based on the combined knowledge of statistical properties and performance results against data from the experiment.

Outline of the remainder of the paper

* Review of various robust estimators, biased estimation techniques and biased-robust estimators
* Methodology detailing the proposed combined estimator and its properties
* Determination of computational procedures for the estimators, design of the experiment, generation of the data, and identification of the measures of performance used in the Monte Carlo simulation

REVIEW OF THE RELATED LITERATURE

In general, the majority of the research on alternatives to least squares estimation in the presence of outliers and correlated regressors has addressed either the nonnormality issue or the collinearity issue but has seldom addressed the combined problem. This review will cover the three topics in proportion to the amount of research available in the literature. There are two reasons: 1) in this case it is true that the more research that has been performed, the more significant are the findings, and 2) a thorough understanding of the biased-robust estimation problem is aided by becoming familiar with the work in robust-only and biased-only estimation. The contributions to biased-robust estimation follow naturally and will be discussed in detail concerning both the estimation approaches and the Monte Carlo simulation comparisons.

Robust Estimation

The problem of robustness in statistics goes back to the beginnings of statistics, especially in terms of measures of location. In fact, Rey (1983) notes that the Greek besiegers of antiquity switched from using the mean to a more robust measure, the median. Hampel et al. (1986) point out that rejection of outliers was considered by Bernoulli (1777) and Bessel and Baeyer (1838). Formal rejection rules were given by Peirce (1852) and Chauvenet (1863).
Thorough accounts of the early work can be found in papers by Harter (1974-1976), Huber (1972), and Stigler (1973). It was not until recent decades, though, that robust estimation became a true research area. The awareness was created by people such as E. S. Pearson, G. E. P. Box and J. W. Tukey. Box (1953) actually coined the term robustness, and Tukey (1960) demonstrated the drastic nonrobustness of the mean and presented robust alternatives. In the 1960s, papers by Huber (1964, 1965, 1968) and Hampel (1968) formed the basis for the theory of robust estimation and extended this theory to applications such as regression.

Since these pioneering papers on robust estimation in regression, many approaches have been presented, but no single approach is either optimum or superior to the others in all aspects. The important criteria used in the field to determine the strengths and weaknesses of an estimator will be introduced prior to the discussion of each of the techniques. Although some of the criteria are more important than others for a particular set of data, the optimum estimator would ideally have the positive characteristics of all criteria.

Equivariance: Refers to statistics that transform properly. It can be one of three types: affine, scale or regression equivariant (Rousseeuw and Leroy, p. 116). Affine equivariance means that, when the data undergo a linear transformation plus a fixed vector, the estimator is transformed in the same way. Scale equivariance means that if the observations are multiplied by a constant c, the estimators are also multiplied by c. Regression equivariance means that without loss of generality, [...]

High breakdown point: The breakdown point of an estimator is the amount of contamination allowed in the data (usually a percentage or fraction) until the estimate ceases to give information about the parameters.
Breakdown points can be as low as 0% (sometimes stated as 1/n), meaning that a single outlying observation can cause an estimator to be meaningless, as is the case with least squares. Breakdown points can be as high as 50%, meaning that up to half of the data can be contaminated and the estimator can still be useful.

Efficiency: Expressed as a percentage, the degree to which the estimator performs like least squares in the presence of Gaussian (normally distributed) errors. The term is computed as the mean squared error of the least squares fit divided by the mean squared error of the robust fit. Efficiencies near 90-95% are desirable.

X-space outlier: An unusual point in the x-direction. Its effect on a least squares estimator is very large because it "pulls" the least squares line in its direction. For this reason this observation is also called a leverage point.

Y-space outlier: An unusual point in the y-direction only. This point can have a large influence on the least squares line, but the nature and extent of the effect depend on its x-coordinates and the disposition of the other points. It is important to note that the most dangerous type of point is one that is an outlier in both directions (x- and y-space outlier).

Computational ease: Considerations include the complexity and availability of the method used to calculate the estimates. This measure also considers the potential for convergence problems.

Distributional properties: In order to test the adequacy of the estimation technique and choose the parameters which are significant in the model, hypothesis tests must be performed. These tests are more efficient if they are based on some, at least asymptotic, assumptions about the distribution of the estimator.

A graphic will be displayed next to each robust technique discussed that quickly highlights the strengths and weaknesses of the method using the criteria just mentioned. Strengths will be indicated by shading.
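The leverage and breakdown criteria above can be illustrated with a small numerical sketch. It is not part of the proposal: the data are invented for illustration (five clean points on the line y = 1 + 2x plus one bad point), and the helper name `least_squares_line` is hypothetical. It fits a one-regressor least squares line three times to show how a single point in x-space or y-space changes the fit.

```python
# Illustrative data only: five clean points on y = 1 + 2x, plus one bad point.

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

clean_x = [1.0, 2.0, 3.0, 4.0, 5.0]
clean_y = [1.0 + 2.0 * x for x in clean_x]

# Fit on uncontaminated data.
clean_fit = least_squares_line(clean_x, clean_y)

# One leverage point: far out in x, with a y-value far off the line.
leverage_fit = least_squares_line(clean_x + [50.0], clean_y + [0.0])

# One y-space outlier located at the center of the x-values.
y_outlier_fit = least_squares_line(clean_x + [3.0], clean_y + [100.0])

print("clean     (intercept, slope):", clean_fit)
print("leverage  (intercept, slope):", leverage_fit)
print("y-outlier (intercept, slope):", y_outlier_fit)
```

In this toy setting the single leverage point is enough to flip the sign of the fitted slope, while the y-space outlier placed at the mean of the x-values leaves the slope alone and shifts only the intercept, which is consistent with the remark that the effect of a y-space outlier depends on its x-coordinates.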
$L_1$-norm (least absolute values) estimation

Many alternative estimators have been proposed for regression. One of these approaches came from Edgeworth (1887), improving a proposal of Boscovich (1757). He proposed the $L_1$-norm or least absolute values (LAV) regression estimator, which is determined by

$$\min_{\beta} \sum_{i=1}^{n} |e_i| \qquad (1)$$

This approach minimizes the sum of the absolute errors. The LAV estimator is commonly solved with linear programming methods. Unfortunately, the breakdown point of LAV regression is still no better than 0%. The LAV estimator is robust to an outlier in the y-direction (unlike least squares). However, LAV regression does not protect against outlying x, where the effect of the leverage point is even stronger than on the least squares line. It turns out that when the leverage point is far enough away, the LAV line passes right through it. So a single erroneous point can totally offset the LAV estimator.

The $L_1$-norm and least squares ($L_2$-norm) are special cases of the $L_p$-norm regression problem. The objective in the general case is to

$$\min_{\beta} \sum_{i=1}^{n} |e_i|^{p} \qquad (2)$$

where $1 < p < 2$. This approach has been considered by Gentleman (1965), Forsythe (1972) and Sposito et al. (1977). Dodge (1984) suggested a regression estimator based on the convex combination of the $L_1$ and $L_2$ norms. All these proposals possess a zero breakdown point.

M-estimation

Huber (1973) introduced a class of estimators called "M-estimators". This method is the most popular of all robust estimators. The M-estimators are based on the idea of replacing the squared residuals by another function of the residuals, $\rho(e)$, where $\rho$ is a symmetric function with a unique minimum at zero:

$$\min_{\beta} \sum_{i=1}^{n} \rho(e_i) = \min_{\beta} \sum_{i=1}^{n} \rho(y_i - x_i'\beta) \qquad (3)$$

M-estimators are maximum likelihood estimators where the function $\rho$ is related to the likelihood function for an appropriate choice of the error distribution.
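As a brief worked illustration of the maximum likelihood connection (a standard result, written here under the added assumption that the errors are independent with a common density $f$, and ignoring the scale parameter): choosing $\rho(u) = -\log f(u)$ makes the minimization in (3) the same problem as maximizing the likelihood. Two familiar error distributions then recover the two estimators already discussed.

```latex
\hat{\beta}
  = \arg\min_{\beta} \sum_{i=1}^{n} \rho(y_i - x_i'\beta)
  = \arg\max_{\beta} \prod_{i=1}^{n} f(y_i - x_i'\beta),
  \qquad \rho(u) = -\log f(u)

% Gaussian errors: least squares
f(u) \propto \exp\!\left(-\tfrac{1}{2}u^{2}\right)
  \;\Longrightarrow\; \rho(u) = \tfrac{1}{2}u^{2} + \text{const}

% Double exponential (Laplace) errors: least absolute values, as in (1)
f(u) \propto \exp\!\left(-|u|\right)
  \;\Longrightarrow\; \rho(u) = |u| + \text{const}
```

The additive constants do not affect the minimizing $\beta$, so they can be dropped.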
Because the M-estimator is not scale invariant, the minimization problem is modified by dividing the residuals by a robust estimate of scale $s$, so the problem becomes

$$\min_{\beta} \sum_{i=1}^{n} \rho\left(\frac{e_i}{s}\right) = \min_{\beta} \sum_{i=1}^{n} \rho\left(\frac{y_i - x_i'\beta}{s}\right) \qquad (4)$$

A popular choice for $s$ is

$$s = \operatorname{median}\,|e_i - \operatorname{median}(e_i)| \,/\, 0.6745$$

The constant 0.6745 makes $s$ an approximately unbiased estimator of $\sigma$ when $n$ is large and the sample actually arises from a normal distribution.

The least squares estimator is the special case $\rho(u) = \tfrac{1}{2}u^2$. For a convex $\rho$, a condition equivalent to (4) is found by taking the first partial derivatives of (4) with respect to $\beta$ and setting the result equal to 0:

$$\sum_{i=1}^{n} \psi\left(\frac{e_i}{s}\right) x_{ij} = \sum_{i=1}^{n} \psi\left(\frac{y_i - x_i'\beta}{s}\right) x_{ij} = 0 \qquad (5)$$

for each regressor index $j$, where $\psi(u) = \rho'(u)$. These necessary conditions play the role of the normal equations. If $\psi(u) = u$, then (5) reduces to the normal equations, yielding the least squares estimator. However, in the case of robust estimation, $\psi(u)$ is not linear, so (5) defines a nonlinear system of equations which requires an appropriate iterative technique.

The $\psi(u)$ function controls the weight given to each residual and is very important in determining the robustness and efficiency properties of the estimator. Although a number of popular $\psi$-functions have been developed, they primarily belong to one of two categories: monotonic and redescending. The least squares $\psi$-function described earlier reveals its weakness in situations involving heavy-tailed distributions. The $\psi$-function $\psi(u) = u$ is unbounded, meaning large residuals receive heavy weights. The Huber function (Huber, 1964) is an example of a monotone $\psi$-function, defined as $\psi(u) = \min(c_H, \max(u, -c_H))$, which results in down-weighting the large residuals compared to least squares. Other $\psi$-functions redescend with increasing residual magnitude. The bisquare or biweight function of Beaton and Tukey (1974) is defined as $\psi(u) = u\,(1 - (u/c_B)^2)^2$ for $|u| \le c_B$ and $\psi(u) = 0$ for $|u| > c_B$.
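The scale estimate and the two ψ-functions defined above are short enough to write out directly. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the default tuning constants are the 95%-efficiency values 1.345 and 4.685 quoted in the text.

```python
from statistics import median

def mad_scale(residuals):
    """Robust scale estimate: median|e_i - median(e)| / 0.6745."""
    center = median(residuals)
    return median(abs(e - center) for e in residuals) / 0.6745

def psi_least_squares(u):
    """Unbounded psi-function of least squares: psi(u) = u."""
    return u

def psi_huber(u, c=1.345):
    """Monotone Huber psi-function: u clipped to the interval [-c, c]."""
    return min(c, max(u, -c))

def psi_bisquare(u, c=4.685):
    """Redescending bisquare (biweight) psi-function; zero beyond |u| > c."""
    if abs(u) > c:
        return 0.0
    return u * (1.0 - (u / c) ** 2) ** 2

# Compare how each psi-function treats small, moderate and large scaled residuals.
for u in (0.5, 2.0, 6.0):
    print(u, psi_least_squares(u), psi_huber(u), psi_bisquare(u))
```

For a scaled residual of 6, the least squares ψ keeps growing, the Huber ψ is capped at its tuning constant, and the bisquare ψ has already returned to zero, which is the monotone versus redescending distinction drawn above.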
The $c$ terms in both equations refer to tuning constants chosen to achieve desired efficiencies. The values $c_H = 1.345$ and $c_B = 4.685$ for the Huber and biweight $\psi$-functions, respectively, achieve 95% efficiency compared to the least squares estimator when the errors are actually normally distributed. For an excellent summary of different approaches to the $\psi$-functions, see Montgomery and Peck (1992).

The solution to (5) requires solving a system of equations using iteration schemes. Approaches include reweighted least squares and the so-called H-algorithm. Iteratively reweighted least squares ...
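The reweighting idea can be sketched for the simplest case of one regressor. This is an illustrative sketch, not the proposal's implementation, and it rests on several assumptions that are not in the source: Huber weights $w(u) = \psi(u)/u$, a least squares starting fit, the median-based scale recomputed at every pass, a simple stopping rule on the change in coefficients, and invented data with one y-space outlier. All function names are hypothetical.

```python
from statistics import median

def weighted_line(xs, ys, ws):
    """Weighted least squares fit of a line; returns (intercept, slope)."""
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def huber_weight(u, c=1.345):
    """Huber weight psi(u)/u: 1 inside [-c, c], c/|u| outside."""
    return 1.0 if abs(u) <= c else c / abs(u)

def irls_huber(xs, ys, c=1.345, max_iter=50, tol=1e-10):
    """Iteratively reweighted least squares with Huber weights (sketch)."""
    # Assumption: start from the ordinary least squares fit (all weights 1).
    intercept, slope = weighted_line(xs, ys, [1.0] * len(xs))
    for _ in range(max_iter):
        res = [y - intercept - slope * x for x, y in zip(xs, ys)]
        center = median(res)
        s = median(abs(e - center) for e in res) / 0.6745
        if s == 0.0:
            break  # scale collapsed (e.g. an exact fit); keep the current estimate
        ws = [huber_weight(e / s, c) for e in res]
        new_intercept, new_slope = weighted_line(xs, ys, ws)
        change = abs(new_intercept - intercept) + abs(new_slope - slope)
        intercept, slope = new_intercept, new_slope
        if change < tol:
            break
    return intercept, slope

# Invented data: y = 1 + 2x plus small deterministic noise, one gross y-outlier.
xs = [float(x) for x in range(1, 11)]
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1]
ys = [1.0 + 2.0 * x + e for x, e in zip(xs, noise)]
ys[8] += 40.0  # contaminate the observation at x = 9

ls_fit = weighted_line(xs, ys, [1.0] * len(xs))
robust_fit = irls_huber(xs, ys)
print("least squares (intercept, slope):", ls_fit)
print("Huber IRLS    (intercept, slope):", robust_fit)
```

On this toy data set the least squares slope is pulled well away from 2 by the contaminated point, while the reweighted fit stays close to it. Because Huber weights never reach zero and the starting fit is least squares, this sketch offers no protection against leverage points, in line with the x-space outlier discussion earlier.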
