DEPARTMENT OF STATISTICS
University of Wisconsin
Madison, Wisconsin 53706

TECHNICAL REPORT NO. 9
Work done 1964
Report issued 1973

CRITERIA FOR JUDGING ADEQUACY OF ESTIMATION BY AN APPROXIMATING RESPONSE FUNCTION

by George E. P. Box and John Wetz

This research was supported by the U. S. Navy through the Office of Naval Research under Contract No. ONR-N00014-63A-0128-0017.

CHAPTER 1
INTRODUCTION

It was realized by Sir Ronald Fisher in the early 1920's that conclusions to be drawn from routinely recorded data were of limited validity. To overcome the ambiguities and uncertainties connected with such analysis, he introduced the concept of designed experiments in which randomization, replication, blocking, and orthogonality of design were central features (Fisher, 1958, 1939, 1960; Fisher and Yates, 1963).

To understand some of the difficulties and how Fisher's ideas overcame them, it is convenient to think in terms of a specific example. Suppose one were interested in studying the dependence of a response $y$, which might be, for example, the yield of wheat, upon a number of variables $x_1, x_2, \ldots, x_k$ such as the amount of potassium, nitrogen, and phosphorus in the soil, the rainfall, or temperatures at various times of year. Imagine that $x_1, x_2, \ldots, x_N$, which includes the subset $x_1, x_2, \ldots, x_k$, represents a completely comprehensive listing of all underlying variables which might affect $y$. Denote the functional relationship connecting the levels of $x$ with the response $y$ by

$$y = f(x_1, \ldots, x_N) \tag{1.1}$$

Further suppose that over the range in which the $x$'s vary, a linear functional relationship adequately represents the dependence. That is, for the range of variation concerned we can write in place of (1.1)

$$y = \sum_{i=1}^{N} \theta_i x_i$$

1.1 Nonsense Correlations Eliminated by Randomization

In practice all the $N$ factors affecting $y$ are not known, and our study is confined to a certain subset. Let this subset consist of the first $k$ factors. Then we can write

$$y = \sum_{i=1}^{k} \theta_i x_i + \varepsilon$$

where $\varepsilon$ may be called an error term that includes the influence of all the unknown variables $x_{k+1}, \ldots, x_N$. Suppose there are $n$ observations from this system so that in matrix notation the $n$-dimensional vector $\mathbf{y}$ of observations may be written

$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\theta}_1 + \mathbf{X}_2\boldsymbol{\theta}_2 \tag{1.5}$$

where $\mathbf{X}_2\boldsymbol{\theta}_2$ is the "error term" contributed by the unknown variables. Now if we write a model

$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\theta}_1 + \boldsymbol{\varepsilon}$$

noting that the components of $\boldsymbol{\varepsilon}$ are linear functions of the variables $x_{k+1}, \ldots, x_N$, then from the Central Limit Theorem with the usual provisos, the distribution of $\boldsymbol{\varepsilon}$ will tend to normality as the number of components $x_{k+1}, \ldots, x_N$ becomes large. If, however, the parameters $\boldsymbol{\theta}_1$ in which we are specifically interested are estimated by least squares, we will not in general obtain unbiased estimates unless the variables in $\mathbf{X}_2$ are uncorrelated with those in $\mathbf{X}_1$. The nature of the biases can be seen if the model is rewritten

$$\mathbf{y} = \mathbf{X}_1(\boldsymbol{\theta}_1 + \mathbf{A}\boldsymbol{\theta}_2) + (\mathbf{X}_2 - \mathbf{X}_1\mathbf{A})\boldsymbol{\theta}_2$$

where

$$\mathbf{A} = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2 \tag{1.8}$$

The elements of the vector $\boldsymbol{\theta}_1 + \mathbf{A}\boldsymbol{\theta}_2$ are in fact the expected values of the least squares estimates, for

$$E(\hat{\boldsymbol{\theta}}_1) = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'(\mathbf{X}_1\boldsymbol{\theta}_1 + \mathbf{X}_2\boldsymbol{\theta}_2) = \boldsymbol{\theta}_1 + \mathbf{A}\boldsymbol{\theta}_2 \tag{1.10}$$

To illustrate by means of a topical example, let $\mathbf{y}$ be an $n$-dimensional vector of 0's and 1's and consider the $u$th element $y_u$. Let $y_u = 1$ indicate that the $u$th subject had, by a particular date, died of lung cancer; and let $y_u = 0$ indicate that he had not died of lung cancer. Let the $n \times 2$ matrix $\mathbf{X}_1$ contain a column of 1's corresponding to a location parameter and a second column whose $u$th element is the average number of packs of cigarettes smoked daily by the $u$th subject.
Let $\mathbf{X}_2$ be an $n$-dimensional vector whose $u$th element is "the strength of a genetic factor in the $u$th subject which predisposes the person to lung cancer and also produces a desire to smoke." Then $\mathbf{A}$ would contain the regression coefficient of the genetic factor on the number of packs smoked. If this regression were large and positive, the finding of a large positive regression of cancer on smoking could occur even when the true regression coefficient was zero, due to the influence of $\mathbf{A}\boldsymbol{\theta}_2$.

Nonsense correlation, of which this is an example, can be eliminated only by ensuring in some way that $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent. Fisher achieved this by introducing the concept of randomized design into experiments. He emphasized that wherever possible, information should be obtained not from records of what had just happened but from a carefully and deliberately staged trial at which the levels of the variables in $\mathbf{X}_1$ were at the choice of the experimenter.

In general, randomization would consist of the following. Suppose that for a particular experiment the actual matrix $\mathbf{X}_1$ is chosen from a set of $M$ matrices $\{\mathbf{X}_{1i}\}$ for which $\mathbf{X}_{1i}'\mathbf{X}_{1i} = \mathbf{C}$, where $i = 1, 2, \ldots, M$ and $\mathbf{C}$ is a constant matrix. Associated with each element $\mathbf{X}_{1i}$ of the set is a probability $p_i$ which is the chance that $\mathbf{X}_{1i}$ will be selected for this particular experiment. Now suppose we choose the set $\{\mathbf{X}_{1i}\}$ and an associated set $\{p_i\}$ so that

$$E_r(\mathbf{X}_1) = \sum_{i=1}^{M} p_i \mathbf{X}_{1i} = \mathbf{0}$$

where $E_r$ represents the expectation over the randomization set. Since $\mathbf{X}_1$ is chosen quite independently of $\mathbf{X}_2$, then $E_r(\mathbf{X}_1'\mathbf{X}_2) = \mathbf{0}$, and if $\mathbf{X}_1$ is selected from a set for which $\mathbf{X}_{1i}'\mathbf{X}_{1i} = \mathbf{C}$, then $E_r(\hat{\boldsymbol{\theta}}_1) = \boldsymbol{\theta}_1$. The effect of unknown variables $\mathbf{X}_2$ in inducing nonsense correlations would thus be eliminated.

1.2 Replication to Supply a Valid Estimate of Error

The fact that the estimates $\hat{\boldsymbol{\theta}}_1$ in the previous example were unbiased would be, by itself, of little value if we could not measure them against some reasonably precise estimate of error. Now the $u$th component of the error term $\boldsymbol{\varepsilon} = \mathbf{X}_2\boldsymbol{\theta}_2$ is

$$\varepsilon_u = \sum_{i=k+1}^{N} \theta_i x_{ui}$$

in which $x_{u,k+1}, \ldots, x_{uN}$ are regarded as random variables. This has variance

$$V(\varepsilon_u) = \sum_{i=k+1}^{N} \theta_i^2 \sigma_i^2 + 2\sum_{j<h} \theta_j \theta_h \rho_{jh} \sigma_j \sigma_h$$

where $\sigma_i$ is the standard deviation of $x_i$ and $\rho_{jh}$ is the coefficient of correlation between the $j$th and $h$th variables. We can obtain a reliable estimate of $V(\varepsilon_u)$ by making a series of runs in which $\mathbf{X}_1$ is kept fixed but $\mathbf{X}_2$ is allowed to vary freely. However, if any of the $x$'s in $\mathbf{X}_2$ are constrained in any manner we shall not have a true estimate of $V(\varepsilon_u)$.

For example, suppose in a chemical experiment runs are made at $m$ distinct levels of $x_1, \ldots, x_k$. An estimate of error might be obtained by making $r$ observations at each of the $m$ sets of reaction conditions and computing the variance estimate $s^2$ based on $m(r-1)$ degrees of freedom. Clearly, however, $s^2$ will not be an unbiased estimate of $V(\varepsilon_u)$ unless all the sources of uncontrolled variation $x_{k+1}, \ldots, x_N$ are allowed to vary in the same unrestricted manner between the $r$ replicates made at the same set of conditions as between observations made at different sets of conditions.

For example, suppose uncontrollable changes occurring from day to day in external variables, such as ambient temperature and humidity, might be important elements in $\mathbf{X}_2$, and that a series of duplicated experiments were being made over a period of several weeks. Suppose that two experimental runs could be made on any given day. If duplicate runs were always made on the same day, the effect of the external variables would not be taken into account in the estimate of error. However, their effect would be accounted for if the duplicates were made randomly in time.
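The day-to-day example can be checked with a small simulation. The sketch below is only illustrative and rests on assumptions not in the report: the error is taken to be the sum of a normally distributed day effect and an independent within-day disturbance, "randomly in time" is approximated by giving each duplicate its own independently drawn day effect, and the function names and numerical values are hypothetical.

```python
import random


def pooled_duplicate_variance(pairs):
    """Pooled variance from duplicate pairs (r = 2).

    Each pair contributes (a - b)^2 / 2 on one degree of freedom, so the
    pooled estimate over m pairs has m(r - 1) = m degrees of freedom.
    """
    return sum((a - b) ** 2 / 2.0 for a, b in pairs) / len(pairs)


def simulate_duplicates(same_day, m=2000, sigma_day=2.0, sigma_within=1.0, seed=0):
    """Simulate m duplicated runs and return the pooled error estimate s^2.

    The effect of the experimental conditions is omitted because it is the
    same for both members of a pair and cancels in the difference a - b.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(m):
        day_a = rng.gauss(0.0, sigma_day)
        # Same-day duplicates share one day effect; otherwise each run
        # receives its own independently drawn day effect (an assumption).
        day_b = day_a if same_day else rng.gauss(0.0, sigma_day)
        a = day_a + rng.gauss(0.0, sigma_within)
        b = day_b + rng.gauss(0.0, sigma_within)
        pairs.append((a, b))
    return pooled_duplicate_variance(pairs)


# Under these assumptions V(eps_u) = sigma_day^2 + sigma_within^2.
true_error_variance = 2.0 ** 2 + 1.0 ** 2
s2_same_day = simulate_duplicates(same_day=True)
s2_random_days = simulate_duplicates(same_day=False)

print("true V(eps_u):        ", true_error_variance)
print("s^2, same-day pairs:  ", round(s2_same_day, 3))
print("s^2, randomized pairs:", round(s2_random_days, 3))
```

Under these assumptions the same-day estimate recovers only the within-day component of variance, while the randomized estimate is close to the full error variance.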
Thus, as Fisher pointed out, a valid estimate of error could be obtained by deliberately arranging circumstances so that runs made at the same experimental conditions were subjected to the same error influences as were runs made at different experimental conditions.

1.3 Blocking

As we have explained, unbiased estimates with valid standard errors may be obtained by the use of randomization and suitable replication, and it becomes appropriate to analyze the experiment as if an adequate model were

$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\theta}_1 + \boldsymbol{\varepsilon} \tag{1.14}$$

with the elements of the vector $\boldsymbol{\varepsilon}$ independent of each other and of the elements of $\mathbf{X}_1$. However, it may be that although the estimates are valid, they are also very imprecise. Techniques are needed, therefore, which reduce the errors of the estimates to a minimum while preserving validity.

It frequently happens that some of the sources of error in $\boldsymbol{\varepsilon} = \mathbf{X}_2\boldsymbol{\theta}_2$ are recognizable although not controllable. For example, in chemical experiments it is often necessary to make experimental runs using different pieces of equipment. It may be known that slight differences between, say, one reactor and another will cause differences which will inflate the error term in a completely randomized design. (In our model the use of different pieces of equipment could be represented by making an appropriate number of the $x$'s in $\mathbf{X}_2$ so-called indicator variables. If $x_h$ is an indicator variable in $\mathbf{X}_2$, it might take the value 1 when the $h$th reactor was in operation, which would contribute an increment $\theta_h$ to $y$, and the value 0 when this reactor was not in operation.) Since it is known which reactor is in operation at any given time, allowance for the reactor effects in the analysis can be made by removing this part of the model from the error term and putting it in the part to be estimated.
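The effect of moving recognizable reactor differences out of the error term and into the fitted part of the model can be illustrated numerically. The following is a minimal sketch under assumptions not made in the report: each of m conditions is run once in each of r reactors, reactor effects and residual errors are normally distributed, and all names and numerical values are hypothetical.

```python
import random


def simulate_block_experiment(rng, m=6, r=5, sigma_reactor=3.0, sigma_e=1.0):
    """Return an m x r table: each of m conditions run once in each of r reactors."""
    condition_effect = [float(i) for i in range(m)]  # hypothetical condition effects
    reactor_effect = [rng.gauss(0.0, sigma_reactor) for _ in range(r)]
    return [[condition_effect[i] + reactor_effect[j] + rng.gauss(0.0, sigma_e)
             for j in range(r)]
            for i in range(m)]


def residual_sums_of_squares(y):
    """Residual sums of squares ignoring and allowing for reactor effects."""
    m, r = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (m * r)
    cond_mean = [sum(row) / r for row in y]
    reactor_mean = [sum(y[i][j] for i in range(m)) / m for j in range(r)]
    # Reactors left in the error term: deviations from condition means only.
    ss_unblocked = sum((y[i][j] - cond_mean[i]) ** 2
                       for i in range(m) for j in range(r))
    # Reactors fitted as indicator variables: two-way additive residuals.
    ss_blocked = sum((y[i][j] - cond_mean[i] - reactor_mean[j] + grand) ** 2
                     for i in range(m) for j in range(r))
    # Sum of squares attributed to the reactors.
    ss_reactors = m * sum((c - grand) ** 2 for c in reactor_mean)
    return ss_unblocked, ss_blocked, ss_reactors


rng = random.Random(1)
m, r, reps = 6, 5, 200
ms_unblocked_total = 0.0
ms_blocked_total = 0.0
max_identity_gap = 0.0
for _ in range(reps):
    y = simulate_block_experiment(rng, m, r)
    ss_u, ss_b, ss_r = residual_sums_of_squares(y)
    # The unblocked residual splits exactly into block and remaining parts.
    max_identity_gap = max(max_identity_gap, abs(ss_u - (ss_b + ss_r)))
    ms_unblocked_total += ss_u / (m * (r - 1))
    ms_blocked_total += ss_b / ((m - 1) * (r - 1))

mean_ms_unblocked = ms_unblocked_total / reps
mean_ms_blocked = ms_blocked_total / reps
print("mean error mean square, reactors ignored:", round(mean_ms_unblocked, 3))
print("mean error mean square, reactors fitted: ", round(mean_ms_blocked, 3))
```

In this balanced layout the condition means are the same whether or not the reactor terms are fitted; only the error mean square against which they are judged changes.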
Thus if

$$\mathbf{X}_2\boldsymbol{\theta}_2 = \mathbf{Z}\boldsymbol{\delta} + \mathbf{W}\boldsymbol{\gamma}$$

where $\mathbf{Z}\boldsymbol{\delta}$ represents the effect of "block" variables, such as reactors or days, whose influence can be recognized but not controlled, we may consider the model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\delta} + \mathbf{e}$$

where

$$\mathbf{e} = \mathbf{W}\boldsymbol{\gamma} \tag{1.18}$$

is that part of the error which remains after removing the systematic effects. The influence of $\mathbf{e}$ will be smaller than that of the original error term. Furthermore, if it is arranged in advance that $\mathbf{X}$ and $\mathbf{Z}$ are orthogonal so that $\mathbf{X}'\mathbf{Z} = \mathbf{0}$, then the $\beta$'s can be estimated as though the block variables did not exist. Since the block variables do not affect the estimation, they no longer contribute to the error, and as Fisher indicated, their influence is thus eliminated. An example of a randomized block design employing this principle would be one in which the $m$ sets of experimental conditions were run in each of $r$ reactors.

1.4 Orthogonality

In general, the least squares estimates $\hat{\boldsymbol{\beta}}$ will be correlated. If $\beta_i$ and $\beta_j$ are two quantities in the group $\boldsymbol{\beta}$ to be estimated, this will mean in particular that the estimate of $\beta_i$ for given $\beta_j$, which we may write $\hat{\beta}_{i|j}$, is a function of $\beta_j$; that is,

$$\hat{\beta}_{i|j} = f(\beta_j)$$

This dependence of the estimate $\hat{\beta}_i$ on the values of the other parameters is undesirable. Furthermore, other things being equal, such correlation is associated with an increase in the error variance. For example, it has been shown by Hotelling (1944), Tocher (1952), and Box (1952) that for fixed diagonal elements of $\mathbf{X}'\mathbf{X}$ the minimum variance estimates of the elements of $\boldsymbol{\beta}$ are obtained only when all off-diagonal elements are zero. Fisher therefore recommended the use of orthogonal designs. That is to say, the set of matrices $\{\mathbf{X}\}$ from which a random selection is to be made will normally be a set of matrices for which the columns are orthogonal. For example, if

$$\eta_u = \beta_0 x_{0u} + \beta_1 x_{1u} + \beta_2 x_{2u} + \beta_3 x_{3u} \tag{1.20}$$

where $x_{0u} = 1$, a $2^3$ factorial design might be employed, in which case the set $\{\mathbf{X}\}$ could be the $8!$ matrices of dimension $8 \times 4$ whose rows are $(1, -1, -1, -1), \ldots, (1, 1, 1, 1)$ in all possible permutations. Thus $\mathbf{X}'\mathbf{X} = 8\mathbf{I}$.

1.5 The Analysis of Variance

Associated with Fisher's methods for designing experiments was his analysis of variance (ANOVA) technique for analyzing results. In the case of the design employing orthogonal blocks mentioned above, consider the model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\delta} + \mathbf{e}$$

The observations $\mathbf{y}$ can be thought of as a vector in the $n$-dimensional sample space, and each of the columns of $\mathbf{X}$ and $\mathbf{Z}$ provides another such vector. The complete set of such vectors in $\mathbf{X}$ defines what we shall call a "treatment" hyperplane. The complete set of vectors in $\mathbf{Z}$ defines what we shall call a "block" hyperplane. The values estimated by the fitted least squares model are the coordinates of the projection of $\mathbf{y}$ into the hyperplane defined by the vectors of $\mathbf{X}$ and $\mathbf{Z}$. The least squares estimates