Statistics 512: Applied Linear Models
Topic 7

Topic Overview

This topic will cover

• Two-way Analysis of Variance (ANOVA)
• Interactions

Chapter 19: Two-way ANOVA

The response variable $Y$ is continuous. There are now two categorical explanatory variables (factors). Call them factor $A$ and factor $B$ instead of $X_1$ and $X_2$. (We will have enough subscripts as it is!)

Data for Two-way ANOVA

• $Y$, the response variable
• Factor $A$ with levels $i = 1$ to $a$
• Factor $B$ with levels $j = 1$ to $b$
• A particular combination of levels is called a treatment or a cell. There are $ab$ treatments.
• $Y_{i,j,k}$ is the $k$th observation for treatment $(i,j)$, $k = 1$ to $n$

In Chapter 19, we assume for now equal sample sizes in each treatment combination ($n_{i,j} = n > 1$; $n_T = abn$). This is called a balanced design. In later chapters we will deal with unequal sample sizes, but that case is more complicated.

Notation

For $Y_{i,j,k}$ the subscripts are interpreted as follows:

• $i$ denotes the level of factor $A$
• $j$ denotes the level of factor $B$
• $k$ denotes the $k$th observation in cell or treatment $(i,j)$

$i = 1, \ldots, a$ levels of factor $A$
$j = 1, \ldots, b$ levels of factor $B$
$k = 1, \ldots, n$ observations in cell $(i,j)$

KNNL Example

• KNNL page 832 (nknw817.sas)
• The response $Y$ is the number of cases of bread sold.
• Factor $A$ is the height of the shelf display; $a = 3$ levels: bottom, middle, top.
• Factor $B$ is the width of the shelf display; $b = 2$ levels: regular, wide.
• $n = 2$ stores for each of the $3 \times 2 = 6$ treatment combinations ($n_T = 12$)

Read the data

    data bread;
      infile 'h:\System\Desktop\CH19TA07.DAT';
      input sales height width;
    run;
    proc print data=bread;
    run;

    Obs    sales    height    width
      1      47        1        1
      2      43        1        1
      3      46        1        2
      4      40        1        2
      5      62        2        1
      6      68        2        1
      7      67        2        2
      8      71        2        2
      9      41        3        1
     10      39        3        1
     11      42        3        2
     12      46        3        2

Model Assumptions

We assume that the response variable observations are independent and normally distributed, with a mean that may depend on the levels of the factors $A$ and $B$ and a variance that does not (it is constant).

Cell Means Model

$$Y_{i,j,k} = \mu_{i,j} + \epsilon_{i,j,k}$$

where

• $\mu_{i,j}$ is the theoretical mean or expected value of all observations in cell $(i,j)$.
• the $\epsilon_{i,j,k}$ are iid $N(0, \sigma^2)$
• $Y_{i,j,k} \sim N(\mu_{i,j}, \sigma^2)$, independent

There are $ab + 1$ parameters of the model: $\mu_{i,j}$, for $i = 1$ to $a$ and $j = 1$ to $b$; and $\sigma^2$.

Parameter Estimates

• Estimate $\mu_{i,j}$ by the mean of the observations in cell $(i,j)$: $\bar{Y}_{i,j\cdot} = \frac{\sum_k Y_{i,j,k}}{n}$.
• For each $(i,j)$ combination, we can get an estimate of the variance $\sigma^2_{i,j}$: $s^2_{i,j} = \frac{\sum_k (Y_{i,j,k} - \bar{Y}_{i,j\cdot})^2}{n - 1}$.
• Combine these to get an estimate of $\sigma^2$, since we assume they are all equal.
• In general we pool the $s^2_{i,j}$, using weights proportional to the df, $n_{i,j} - 1$.
• The pooled estimate is $s^2 = \frac{\sum_{i,j} (n_{i,j} - 1) s^2_{i,j}}{\sum_{i,j} (n_{i,j} - 1)} = \frac{\sum_{i,j} (n_{i,j} - 1) s^2_{i,j}}{n_T - ab}$.
• Here, $n_{i,j} = n$, so $s^2 = \frac{\sum_{i,j} s^2_{i,j}}{ab} = MSE$.

Investigate with SAS

Note that we are including an interaction term, which is denoted as the product of A and B. It is not literally the product of the levels, but it would be if we used indicator variables and did regression. Using proc reg we would have had to create such a variable with a data step. In proc glm we can simply include A*B in the model statement, and it understands that we want the interaction included.
    proc glm data=bread;
      class height width;
      model sales=height width height*width;
      means height width height*width;
    run;

    The GLM Procedure

    Class Level Information
    Class     Levels    Values
    height         3    1 2 3
    width          2    1 2

    Number of observations    12

means statement: height

    The GLM Procedure

    Level of         ------------sales------------
    height     N           Mean          Std Dev
    1          4     44.0000000       3.16227766
    2          4     67.0000000       3.74165739
    3          4     42.0000000       2.94392029

means statement: width

    Level of         ------------sales------------
    width      N           Mean          Std Dev
    1          6     50.0000000       12.0664825
    2          6     52.0000000       13.4313067

means statement: height $\times$ width

    Level of    Level of         ------------sales------------
    height      width      N           Mean          Std Dev
    1           1          2     45.0000000       2.82842712
    1           2          2     43.0000000       4.24264069
    2           1          2     65.0000000       4.24264069
    2           2          2     69.0000000       2.82842712
    3           1          2     40.0000000       1.41421356
    3           2          2     44.0000000       2.82842712

Code the factor levels and plot

(We're just doing this for a nice plot; it is not necessary for the analysis.)

    data bread;
      set bread;
      /* label each treatment: B/M/T = bottom/middle/top, R/W = regular/wide */
      if height eq 1 and width eq 1 then hw='1_BR';
      if height eq 1 and width eq 2 then hw='2_BW';
      if height eq 2 and width eq 1 then hw='3_MR';
      if height eq 2 and width eq 2 then hw='4_MW';
      if height eq 3 and width eq 1 then hw='5_TR';
      if height eq 3 and width eq 2 then hw='6_TW';
    run;
    title2 'Sales vs. treatment';
    symbol1 v=circle i=none;
    proc gplot data=bread;
      plot sales*hw;
    run;

Put the means in a new dataset

    /* the by statement requires data sorted by height and width; */
    /* the bread data are already in that order                   */
    proc means data=bread;
      var sales;
      by height width;
      output out=avbread mean=avsales;
    run;
    proc print data=avbread;
    run;

    Obs    height    width    _TYPE_    _FREQ_    avsales
      1       1        1         0         2         45
      2       1        2         0         2         43
      3       2        1         0         2         65
      4       2        2         0         2         69
      5       3        1         0         2         40
      6       3        2         0         2         44

Plot the means

Recall the plotting syntax to get two separate lines for the two width levels. We can also do a plot of sales vs. width with three lines for the three heights. This type of plot is called an "interaction plot" for reasons that we will see later.
    symbol1 v=square i=join c=black;
    symbol2 v=diamond i=join c=black;
    symbol3 v=circle i=join c=black;
    proc gplot data=avbread;
      plot avsales*height=width;
      plot avsales*width=height;
    run;

The Interaction Plots

[Figures: avsales vs. height with one line per width, and avsales vs. width with one line per height.]

Questions

Does the height of the display affect sales? If yes, compare top with middle, top with bottom, and middle with bottom.

Does the width of the display affect sales?

Does the effect of height on sales depend on the width? Does the effect of width on sales depend on the height? If yes to the last two, that is an interaction.

Notice that these questions are not straightforward to answer using the cell means model.

Factor Effects Model

For the one-way ANOVA model, we wrote $\mu_i = \mu + \tau_i$, where $\tau_i$ was the factor effect. For the two-way ANOVA model, we have $\mu_{i,j} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{i,j}$, where

• $\mu$ is the overall (grand) mean (it is $\mu_{\cdot\cdot}$ in KNNL)
• $\alpha_i$ is the main effect of Factor $A$
• $\beta_j$ is the main effect of Factor $B$
• $(\alpha\beta)_{i,j}$ is the interaction effect between $A$ and $B$.

Note that $(\alpha\beta)_{i,j}$ is the name of a parameter all on its own and does not refer to the product of $\alpha$ and $\beta$.

Thus the factor effects model is

$$Y_{i,j,k} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{i,j} + \epsilon_{i,j,k}.$$

A model without the interaction term, i.e. $\mu_{i,j} = \mu + \alpha_i + \beta_j$, is called an additive model.

Parameter Definitions

The overall mean is $\mu = \mu_{\cdot\cdot} = \frac{\sum_{i,j} \mu_{i,j}}{ab}$ under the zero-sum constraint (or $\mu = \mu_{a,b}$ under the "last = 0" constraint).

The mean for the $i$th level of $A$ is $\mu_{i\cdot} = \frac{\sum_j \mu_{i,j}}{b}$, and the mean for the $j$th level of $B$ is $\mu_{\cdot j} = \frac{\sum_i \mu_{i,j}}{a}$.
Writing $\mu_{i\cdot}$ and $\mu_{\cdot j}$ for the means of the $i$th level of $A$ and the $j$th level of $B$, the main effects are the deviations of these level means from the overall mean: $\alpha_i = \mu_{i\cdot} - \mu$ and $\beta_j = \mu_{\cdot j} - \mu$, so $\mu_{i\cdot} = \mu + \alpha_i$ and $\mu_{\cdot j} = \mu + \beta_j$.

Note that the $\alpha$'s and $\beta$'s act like the $\tau$'s in the single-factor ANOVA model.

$(\alpha\beta)_{i,j}$ is the difference between $\mu_{i,j}$ and $\mu + \alpha_i + \beta_j$:

$$\begin{aligned}
(\alpha\beta)_{i,j} &= \mu_{i,j} - (\mu + \alpha_i + \beta_j) \\
&= \mu_{i,j} - (\mu + (\mu_{i\cdot} - \mu) + (\mu_{\cdot j} - \mu)) \\
&= \mu_{i,j} - \mu_{i\cdot} - \mu_{\cdot j} + \mu
\end{aligned}$$

These equations also spell out the relationship between the cell means $\mu_{i,j}$ and the factor effects model parameters.

Interpretation

$$\mu_{i,j} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{i,j}$$

• $\mu$ is the overall mean
• $\alpha_i$ is an adjustment for level $i$ of $A$.
• $\beta_j$ is an adjustment for level $j$ of $B$.
• $(\alpha\beta)_{i,j}$ is an additional adjustment that takes into account both $i$ and $j$.
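As a concrete illustration of these definitions, the following Python sketch (not part of the course's SAS code) decomposes a table of cell means into an overall mean, main effects, and interaction effects. It uses the six observed cell means from the bread example as stand-ins for the $\mu_{i,j}$; the function name `factor_effects` is hypothetical and chosen only for this example.

```python
# Hypothetical helper for illustration: split a table of cell means into the
# overall mean, the main effects of A and B, and the interaction effects.
def factor_effects(cell_means):
    """cell_means[i][j] is the mean for level i of A and level j of B."""
    a, b = len(cell_means), len(cell_means[0])
    mu = sum(sum(row) for row in cell_means) / (a * b)        # overall mean
    mu_i = [sum(row) / b for row in cell_means]                # level means of A
    mu_j = [sum(cell_means[i][j] for i in range(a)) / a       # level means of B
            for j in range(b)]
    alpha = [m - mu for m in mu_i]                             # main effects of A
    beta = [m - mu for m in mu_j]                              # main effects of B
    ab = [[cell_means[i][j] - mu_i[i] - mu_j[j] + mu          # interaction effects
           for j in range(b)] for i in range(a)]
    return mu, alpha, beta, ab


# Observed bread cell means (rows: bottom, middle, top; columns: regular, wide),
# used here in place of the unknown population means.
means = [[45, 43], [65, 69], [40, 44]]
mu, alpha, beta, ab = factor_effects(means)
print(mu)     # 51.0
print(alpha)  # [-7.0, 16.0, -9.0]
print(beta)   # [-1.0, 1.0]
print(ab)     # [[2.0, -2.0], [-1.0, 1.0], [-1.0, 1.0]]
```

Each cell mean can be rebuilt as `mu + alpha[i] + beta[j] + ab[i][j]`; for example, the middle/wide cell is 51 + 16 + 1 + 1 = 69. The effects computed this way sum to zero over each index, which corresponds to the zero-sum parameterization.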
Zero-sum Constraints

As in the one-way model, we have too many parameters and now need several constraints:

$$\alpha_{\cdot} = \sum_i \alpha_i = 0$$
$$\beta_{\cdot} = \sum_j \beta_j = 0$$
$$(\alpha\beta)_{\cdot j} = \sum_i (\alpha\beta)_{i,j} = 0 \quad \forall j \text{ (for all } j)$$
$$(\alpha\beta)_{i\cdot} = \sum_j (\alpha\beta)_{i,j} = 0 \quad \forall i \text{ (for all } i)$$

Estimates for Factor-effects Model

$$\hat{\mu} = \bar{Y}_{\cdot\cdot\cdot} = \frac{\sum_{i,j,k} Y_{i,j,k}}{abn}$$
$$\hat{\mu}_{i\cdot} = \bar{Y}_{i\cdot\cdot} \quad \text{and} \quad \hat{\mu}_{\cdot j} = \bar{Y}_{\cdot j\cdot}$$
$$\hat{\alpha}_i = \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot} \quad \text{and} \quad \hat{\beta}_j = \bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot}$$
$$\widehat{(\alpha\beta)}_{i,j} = \bar{Y}_{i,j\cdot} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot j\cdot} + \bar{Y}_{\cdot\cdot\cdot}$$

SS for ANOVA Table

$$SSA = \sum_{i,j,k} \hat{\alpha}_i^2 = \sum_{i,j,k} (\bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2 = nb \sum_i (\bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2 \quad \text{(factor A sum of squares)}$$
$$SSB = \sum_{i,j,k} \hat{\beta}_j^2 = \sum_{i,j,k} (\bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2 = na \sum_j (\bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2 \quad \text{(factor B sum of squares)}$$
$$SSAB = \sum_{i,j,k} \widehat{(\alpha\beta)}_{i,j}^2 = n \sum_{i,j} \widehat{(\alpha\beta)}_{i,j}^2 \quad \text{(AB interaction sum of squares)}$$
$$SSE = \sum_{i,j,k} (Y_{i,j,k} - \bar{Y}_{i,j\cdot})^2 = \sum_{i,j,k} e_{i,j,k}^2 \quad \text{(error sum of squares)}$$
$$SST = \sum_{i,j,k} (Y_{i,j,k} - \bar{Y}_{\cdot\cdot\cdot})^2 \quad \text{(total sum of squares)}$$
$$SSM = SSA + SSB + SSAB = SST - SSE \quad \text{(model sum of squares)}$$
$$SST = SSA + SSB + SSAB + SSE = SSM + SSE$$

df for ANOVA Table

$$df_A = a - 1$$
$$df_B = b - 1$$
$$df_{AB} = (a - 1)(b - 1)$$
$$df_E = ab(n - 1)$$
$$df_T = abn - 1 = n_T - 1$$
$$df_M = a - 1 + b - 1 + (a - 1)(b - 1) = ab - 1$$

MS for ANOVA Table

(no surprises)

$$MSA = SSA/df_A$$
$$MSB = SSB/df_B$$
$$MSAB = SSAB/df_{AB}$$
$$MSE = SSE/df_E$$
$$MST = SST/df_T$$
$$MSM = SSM/df_M$$

Hypotheses for two-way ANOVA

Test for Factor A Effect

$H_0$: $\alpha_i = 0$ for all $i$
$H_a$: $\alpha_i \neq 0$ for at least one $i$

The $F$ statistic for this test is $F_A = MSA/MSE$, and under the null hypothesis this follows an $F$ distribution with $df_A$, $df_E$.
Test for Factor B Effect

$H_0$: $\beta_j = 0$ for all $j$
$H_a$: $\beta_j \neq 0$ for at least one $j$

The $F$ statistic for this test is $F_B = MSB/MSE$, and under the null hypothesis this follows an $F$ distribution with $df_B$, $df_E$.

Test for Interaction Effect

$H_0$: $(\alpha\beta)_{i,j} = 0$ for all $(i,j)$
$H_a$: $(\alpha\beta)_{i,j} \neq 0$ for at least one $(i,j)$

The $F$ statistic for this test is $F_{AB} = MSAB/MSE$, and under the null hypothesis this follows an $F$ distribution with $df_{AB}$, $df_E$.

F-statistics for the tests

Notice that the denominator is always $MSE$ and the denominator df is always $df_E$; the numerators change depending on the test. This is true as long as the effects are fixed. That is to say, the levels of our variables are of intrinsic interest in themselves: they are fixed by the experimenter and not considered to be a sample from a larger population of factor levels. For random effects we would need to do something different (more later).

p-values

• p-values are calculated using the $F_{df_{\text{Numerator}},\, df_{\text{Denominator}}}$ distributions.
• If $p \leq 0.05$ we conclude that the effect being tested is statistically significant.

ANOVA Table

proc glm gives the summary ANOVA table first (model, error, total), then breaks down the model into its components A, B, and AB.
    Source    df            SS      MS      F
    Model     ab-1          SSM     MSM     MSM/MSE
    Error     ab(n-1)       SSE     MSE
    Total     abn-1         SST     MST
    A         a-1           SSA     MSA     MSA/MSE
    B         b-1           SSB     MSB     MSB/MSE
    AB        (a-1)(b-1)    SSAB    MSAB    MSAB/MSE

KNNL Example: ANOVA with GLM

    proc glm data=bread;
      class height width;
      model sales=height width height*width;
    run;

    The GLM Procedure
    Dependent Variable: sales

                                     Sum of
    Source             DF           Squares     Mean Square    F Value    Pr > F
    Model               5       1580.000000      316.000000      30.58    0.0003
    Error               6         62.000000       10.333333
    Corrected Total    11       1642.000000

    Source             DF         Type I SS     Mean Square    F Value    Pr > F
    height              2       1544.000000      772.000000      74.71    <.0001
    width               1         12.000000       12.000000       1.16    0.3226
    height*width        2         24.000000       12.000000       1.16    0.3747

    Source             DF       Type III SS     Mean Square    F Value    Pr > F
    height              2       1544.000000      772.000000      74.71    <.0001
    width               1         12.000000       12.000000       1.16    0.3226
    height*width        2         24.000000       12.000000       1.16    0.3747

Sums of Squares

• Type I SS are again the sequential sums of squares (variables added in order). Thus height explains 1544, width explains 12 of what is left, and the interaction explains 24 of what is left after that.
• Type III SS is like Type II SS (variable added last), but it also adjusts for differing $n_{i,j}$. So if all cells have the same number of observations, SS1, SS2, and SS3 will all be the same. (Balanced designs are nice: the variables height and width in our example are independent, so there is no multicollinearity!)
• More details on SS later.

Other output

    R-Square    Coeff Var    Root MSE    sales Mean
    0.962241     6.303040    3.214550      51.00000

Results

• The interaction between height and width is not statistically significant ($F = 1.16$; $df = (2,6)$; $p = 0.37$). NOTE: Check the interaction FIRST! If it is significant, then the main effects are left in the model, even if they are not significant themselves! We may now go on to examine main effects since our interaction is not significant.
• The main effect of height is statistically significant ($F = 74.71$; $df = (2,6)$; $p = 5.75 \times 10^{-5}$).
• The main effect of width is not statistically significant ($F = 1.16$; $df = (1,6)$; $p = 0.32$).

Interpretation

• The height of the display affects sales of bread.
• The width of the display has no apparent effect.
• The effect of the height of the display is similar for both the regular and the wide widths.

Additional Analyses

• We will need to do additional analyses to understand the height effect (factor A).
• There were three levels: bottom, middle, and top. Based on the interaction picture, it appears the middle shelf increases sales.
• We could rerun the data with a one-way ANOVA and use the methods we learned in the previous chapters to show this (e.g. Tukey).
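The sums of squares and F statistics in the proc glm output can be cross-checked by hand from the balanced-design formulas. The sketch below does this in plain Python (it is not part of the course's SAS code, and all variable and function names are illustrative). The p-value helper uses the closed form $P(F_{2,\nu} > f) = (1 + 2f/\nu)^{-\nu/2}$, which holds only when the numerator df is 2, so it is applied to the height and interaction tests but not to the width test.

```python
# Bread sales from the data listing: sales[i][j] holds the n = 2 stores for
# height i (bottom, middle, top) and width j (regular, wide).
sales = [[[47, 43], [46, 40]],
         [[62, 68], [67, 71]],
         [[41, 39], [42, 46]]]
a, b, n = len(sales), len(sales[0]), len(sales[0][0])


def mean(values):
    values = list(values)
    return sum(values) / len(values)


grand = mean(y for row in sales for obs in row for y in obs)
cell = [[mean(obs) for obs in row] for row in sales]
height_mean = [mean(y for obs in row for y in obs) for row in sales]
width_mean = [mean(y for row in sales for y in row[j]) for j in range(b)]

# Sums of squares, following the balanced-design formulas.
ssa = n * b * sum((m - grand) ** 2 for m in height_mean)
ssb = n * a * sum((m - grand) ** 2 for m in width_mean)
ssab = n * sum((cell[i][j] - height_mean[i] - width_mean[j] + grand) ** 2
               for i in range(a) for j in range(b))
sse = sum((y - cell[i][j]) ** 2
          for i in range(a) for j in range(b) for y in sales[i][j])

df_a, df_b, df_ab, df_e = a - 1, b - 1, (a - 1) * (b - 1), a * b * (n - 1)
mse = sse / df_e
f_a = (ssa / df_a) / mse
f_b = (ssb / df_b) / mse
f_ab = (ssab / df_ab) / mse


def p_value_num_df_2(f, df_den):
    """Upper-tail area of F(2, df_den); valid only for numerator df = 2."""
    return (1 + 2 * f / df_den) ** (-df_den / 2)


p_a = p_value_num_df_2(f_a, df_e)    # height has 2 numerator df
p_ab = p_value_num_df_2(f_ab, df_e)  # interaction has 2 numerator df

print(ssa, ssb, ssab, sse)                           # 1544.0 12.0 24.0 62.0
print(round(f_a, 2), round(f_b, 2), round(f_ab, 2))  # 74.71 1.16 1.16
print(f"{p_a:.2e}", round(p_ab, 4))                  # 5.75e-05 0.3747
```

The four sums of squares add up to the corrected total of 1642, and the F statistics agree with the Type I and Type III tables, which coincide here because the design is balanced.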