
IS 7300: Methods of Regression and Correlation PDF

21 Pages·2003·1.7 MB·English


Internet Standard

Disclosure to Promote the Right To Information

Whereas the Parliament of India has set out to provide a practical regime of right to information for citizens to secure access to information under the control of public authorities, in order to promote transparency and accountability in the working of every public authority, and whereas the attached publication of the Bureau of Indian Standards is of particular interest to the public, particularly disadvantaged communities and those engaged in the pursuit of education and knowledge, the attached public safety standard is made available to promote the timely dissemination of this information in an accurate manner to the public.

"The Right to Information, The Right to Live" (Mazdoor Kisan Shakti Sangathan)
"Step Out From the Old to the New" (Jawaharlal Nehru)
"Invent a New India Using Knowledge" (Satyanarayan Gangaram Pitroda)
"Knowledge is such a treasure which cannot be stolen" (Bhartṛhari, Nītiśatakam)

IS 7300 (2003): Methods of Regression and Correlation [MSD 3: Statistical Methods for Quality and Reliability]

Indian Standard
METHODS OF REGRESSION AND CORRELATION
(Second Revision)
ICS 03.120.30
© BIS 2003
BUREAU OF INDIAN STANDARDS
MANAK BHAVAN, 9 BAHADUR SHAH ZAFAR MARG, NEW DELHI 110002
November 2003, Price Group 7

Statistical Method for Quality and Reliability Sectional Committee, MSD 3

FOREWORD

This Indian Standard (Second Revision) was adopted by the Bureau of Indian Standards, after the draft finalized by the Statistical Method for Quality and Reliability Sectional Committee had been approved by the Management and Systems Division Council.

The study of the relationship between two variables is of fundamental importance in industry. For example, in the building industry, while studying the properties of cement, it may be necessary to estimate the effect of curing time on the compressive strength. In such problems, where one variable is of particular interest for studying the effect of the other variable on it, the concept of regression is quite useful. The regression technique is also helpful for the purpose of prediction.

In some problems, the relationship between two variables may be of great interest. For example, in the case of steel, one can study tensile strength by using a hardness test, as the latter has a strong relationship with the former. The determination of the extent of relationship between two variables leads to the concept of correlation.

This standard was originally published in 1974 to cover the statistical methods of regression and correlation in the case of two variables. This standard was revised in 1995 to include the concept of 'scatter diagram' more elaborately. In view of the experience gained with the use of the standard in course of years, it was felt necessary to further revise it. In the revised version, the following changes have been made:

a) A table which gives the values of the correlation coefficient (r) for different selected sample sizes has been included so that the calculated sample correlation coefficient may be compared directly with the tabulated value to test whether the population correlation coefficient is zero or not,
b) Confidence limits for the population regression line, with an example, have been included,
c) Many editorial corrections have been incorporated, and
d) The concepts at many places have been elaborated for better understanding.

The composition of the Committee responsible for the formulation of this standard is given in Annex F.
IS 7300:2003

Indian Standard
METHODS OF REGRESSION AND CORRELATION
(Second Revision)

1 SCOPE

This standard covers the statistical methods of linear regression and correlation in the case of two variables. The computations have been illustrated with examples.

2 REFERENCES

The following standards contain provisions, which through reference in this text constitute provisions of this standard. At the time of publication, the editions indicated were valid. All standards are subject to revision and parties to agreements based on this standard are encouraged to investigate the possibility of applying the most recent editions of the standards indicated below:

IS No.                 Title
6200 (Part 1) : 1995   Statistical tests of significance: Part 1 t-Normal and F-tests (second revision)
7920 (Part 1) : 1994   Statistical vocabulary and symbols: Part 1 Probability and general statistical terms (second revision)
8900 : 1978            Criteria for the rejection of outlying observations
9300 (Part 1) : 1979   Statistical models for industrial applications: Part 1 Discrete models

3 TERMINOLOGY

For the purpose of this standard the definitions given in IS 7920 (Part 1) shall apply.

4 BASIC CONCEPTS

4.1 Scatter Diagram

4.1.1 The scatter diagram is useful to know the presence of the relationship or the nature of the relationship between two variables, if any. The relationship can be a cause and effect relationship, a relationship between one cause and the other, or a relationship between one effect and the other.

4.1.2 A scatter diagram can even be used by the operators to find the relationship between two variables, if any. This may lead to taking appropriate actions for quality improvement.

4.1.3 A scatter diagram is prepared by plotting the paired data in an X-Y plane. It is desirable to have more than 30 pairs of data. Of the two variables, one is said to be independent and the other dependent, and it is usual to regard the independent variable as x and the dependent variable as y. Since the range of data varies widely, the origin of zero is sometimes inconvenient for preparing a well-balanced scatter diagram. The data ranges are suitably presented on convenient scales so that the spread is close to a square and large enough for individual perception.

4.1.4 The problem of outliers is encountered in the actual preparation of scatter diagrams. Outliers are points too widely separated from the rest of the data set. If there are few outliers, they should be eliminated from the data. For guidance on the criteria for rejection of outliers, reference is invited to IS 8900. If there are many (generally more than 25 percent) outliers, the causes for the same should be investigated and corrective action taken. Thereafter, fresh data needs to be collected for plotting the scatter diagram.

4.1.5 Interpretation of a Scatter Diagram

When a scatter diagram is prepared, it is important to interpret it accurately and take necessary measures. For this purpose, the scatter diagram should be carefully observed for the relationship between two variables. The interpretation of the scatter diagrams is explained as follows:

a) Positive relationship: In a scatter diagram, if y increases with increase in x, then the relationship is said to be positive. When the points are close to a straight line [see Fig. 1 (a)], the relationship is called a positive linear relationship. Under such conditions control on y (the dependent variable) can be achieved by exercising control on x (the independent variable).
b) Negative relationship: In a scatter diagram, if y decreases with increase in x, then the relationship is said to be negative [see Fig. 1 (b)]. In this case, similar interpretation as given for (a) holds good.
c) Weak relationship: Sometimes the relationships may not be as clearly evident as in (a) or (b) [see Fig. 1 (c)]. Further investigations may be required to find out the reasons, if any, for the wider scatter. Possibly one factor alone is not sufficient to explain the relationship fully or there could be wide measurement errors. The relationship may not be useful for control purposes in such a situation.
d) No relationship: In a scatter diagram [see Fig. 1 (d)], no relationship can be noted between x and y. If the presence of relationship is expected on technological considerations, the causes/effects may be examined from other viewpoints. In such a situation the possibility of stratifying the data may also be looked into [see 4.1.5 (e)].
e) Relationship revealed by stratification: The scatter diagram [see Fig. 1 (e)] shows no relationship at a glance, but if the data is classified into some different groups a relationship may be possible. In this diagram, the presence of relationship can be confirmed definitely by stratifying the data into three groups marked with •, ▲ and ×.
f) Non-linear relationship: In a scatter diagram there may be a relationship between x and y that is non-linear. For example, in Fig. 1 (f), y increases with an increase in x until a certain point, but decreases with an increase in x beyond that point. Such a relationship is called a non-linear relationship and has to be treated differently from the linear case. In such a situation, it is convenient to locate the optimal combination for x and y.
g) Insufficient data range: When attention is paid only to the points marked with ▲, there seems to be no relationship between x and y, as shown in Fig. 1 (g), but a positive linear relationship is noted when points are observed in a little wider range. Accordingly, it is necessary to examine carefully the appropriateness of the range of x even when no relationship is suggested in the diagram prepared for the first time.

FIG. 1 VARIOUS SCATTER DIAGRAMS [seven panels, (a) to (g), each plotting y against x]

4.2 Regression

Regression deals with situations when one variable is dependent on the other variable. For example, the two variables may be the quantities of carbon steel and alloy steel produced from the same raw material or charge, elongation of boiler plate and the amount of tension applied, amount of rainfall and the yield of a crop, and so on. Of the two variables, one is independent (generally measurable) and the other is dependent (desired to be controlled). Thus, the production of alloy steel depends on the production of carbon steel, so that the quantity of carbon steel produced could be considered as the independent variable and that of alloy steel as the dependent variable.

4.3 Correlation

Correlation deals with the relationship between two factors or variables. The degree or intensity of the linear relationship is measured by the correlation coefficient. In the study of correlation, the intention is not to find the effect of one variable over the other, as in the case of regression analysis, but to find the degree to which the variables vary together owing to influences which affect both of them. However, the mere existence of a high value of the correlation coefficient is not necessarily indicative of an underlying relationship between the two variables. Such a value can at times be purely accidental, the two variables having no connection whatsoever. In such cases, the correlation coefficient may be spurious.

4.4 Before carrying out any regression or correlation study, it is desirable to look at the scatter diagram to locate the outliers, if any, and eliminate them.

5 REGRESSION ANALYSIS

5.1 Regression Coefficient

In a scatter diagram of type [see 4.1.5 (a) or (b)] a straight line could be fitted to the observed values, of the form $y = a + bx$, where y is the dependent variable and x the independent variable. The quantity a in the above equation represents the value of y when x = 0, and b denotes the slope of the line and is known as the regression coefficient, which may be negative or positive depending on the orientation of the line with respect to the axes. Physically, b indicates the rate of increase or decrease in the value of y for unit increase or decrease in the value of x. The regression line is also used for prediction purposes. Normally, extrapolation is not recommended, and when necessary, it should be used cautiously.

5.1.1 The relationship of the type $y = a + bx$ encountered in regression analysis is not generally reversible and is based on the status of the variables concerned. Therefore, this type of relationship should not be used for predicting x for given y. However, mathematically it is possible to find a relationship of the type $x = a' + b'y$, and then the two regression lines intersect at the point $(\bar{x}, \bar{y})$ in the x, y plane.

5.2 Method of Calculation (Ungrouped Data)

5.2.1 Let there be n pairs of observations for x and y corresponding to the items in the sample. For fitting the regression line the following expressions are then calculated:

a) Average of x: $\bar{x} = \sum x / n$
b) Average of y: $\bar{y} = \sum y / n$
c) Corrected sum of squares for x: $\sum (x - \bar{x})^2 = \sum x^2 - [(\sum x)^2 / n]$
d) Corrected sum of squares for y: $\sum (y - \bar{y})^2 = \sum y^2 - [(\sum y)^2 / n]$
e) Corrected sum of products: $\sum (x - \bar{x})(y - \bar{y}) = \sum xy - [(\sum x)(\sum y) / n]$

NOTE: A suitable proforma as given in Annex A may be helpful in the above computations.

5.2.2 From the above quantities the regression coefficient b or b' is calculated as:

$b = \dfrac{\text{Corrected sum of products}}{\text{Corrected sum of squares for } x}$

$b' = \dfrac{\text{Corrected sum of products}}{\text{Corrected sum of squares for } y}$

Also the constant a or a' of the regression equation is obtained as:

$a = \bar{y} - b\bar{x}$

$a' = \bar{x} - b'\bar{y}$

5.2.3 When the regression model is not of the linear type and involves powers or exponentials, the model may be reduced to the linear type with the help of the logarithmic transformation. Thereafter, the fitting of the regression line is exactly similar to the one explained in 5.2.2.

5.2.4 Example

Table 1 gives the Brinell hardness number and the tensile strength (expressed in units of megapascals) for 15 specimens of cold drawn copper. Consider Brinell hardness number as the independent variable (x) and tensile strength as the dependent variable (y). It is intended to fit a regression line to the data.

5.2.4.1 Plotting the data given in Table 1 as a scatter diagram wherein the Brinell hardness number is measured along the X-axis and the tensile strength along the Y-axis, Fig. 2 is obtained, from which the linear trend of the points is self-evident. For the sake of better understanding, the regression line applicable to the data is also drawn in Fig. 2.

Table 1 Hardness and Tensile Strength Values of Cold Drawn Copper
(Clauses 5.2.4, 5.2.4.1 and 5.2.4.2)

Sl No.   Specimen No.   Brinell Hardness, x   Tensile Strength, y (MPa)
(1)      (2)            (3)                   (4)
i)       1              104.2                 268.0
ii)      2              106.1                 278.6
iii)     3              105.6                 275.0
iv)      4              106.3                 281.5
v)       5              101.7                 232.4
vi)      6              104.4                 272.2
vii)     7              102.0                 227.5
viii)    8              103.8                 255.1
ix)      9              104.0                 259.5
x)       10             101.5                 229.0
xi)      11             101.9                 233.8
xii)     12             100.6                 205.9
xiii)    13             104.9                 272.0
xiv)     14             106.2                 280.3
xv)      15             103.1                 242.2

5.2.4.2 From the data in Table 1, various computations are obtained as follows:

$\sum x = 1556.3$, $\bar{x} = 103.75$
$\sum y = 3813.0$, $\bar{y} = 254.2$
$\sum x^2 = 161520.87$
$\sum xy = 396226.39$

Corrected sum of squares for x $= 161520.87 - [(1556.3)^2 / 15] = 161520.87 - 161471.31 = 49.56$

Corrected sum of products $= 396226.39 - (1556.3 \times 3813 / 15) = 396226.39 - 395611.46 = 614.93$

$b = 614.93 / 49.56 = 12.4$

$a = \bar{y} - b\bar{x} = 254.2 - 1286.5 = -1032.3$

Hence the regression line is obtained as

$y = -1032.3 + 12.4x$

FIG. 2 SCATTER DIAGRAM ALONG WITH THE REGRESSION LINE [x-axis: Brinell Hardness No. (x), scale 100 to 108; y-axis: tensile strength]

5.2.4.3 For simplifying the computational work involved in fitting a regression line, change of origin is often helpful in one or both the variables. Thus, for the example worked out in 5.2.4.2, if the variables x and y are changed to u and v such that $u = x - 100$ and $v = y - 250$, then the computations would be as follows:

$\sum u = 56.3$, $\bar{u} = 3.75$
$\sum v = 63.0$, $\bar{v} = 4.20$
$\sum u^2 = 260.87$
$\sum uv = 851.39$

$\sum (u - \bar{u})^2 = \sum u^2 - [(\sum u)^2 / n] = 260.87 - 211.31 = 49.56$

$\sum (u - \bar{u})(v - \bar{v}) = \sum uv - [(\sum u)(\sum v) / n] = 851.39 - 236.46 = 614.93$

$b = 614.93 / 49.56 = 12.4$ and $\bar{v} - b\bar{u} = 4.2 - 46.5 = -42.3$

Hence the regression line is obtained as $v = -42.3 + 12.4u$, which when transformed to the original variables comes out as:

$(y - 250) = -42.3 + 12.4(x - 100)$

that is, $y = -1032.3 + 12.4x$

NOTE: It would be of interest to observe that the regression coefficient b is not affected by the change of origin of either or both the variables.

5.2.4.4 From this equation the expected value of tensile strength for any given Brinell hardness number could be obtained. Thus, when the hardness number is known as 105, the corresponding expected value of tensile strength would be 269.7 megapascals.

5.2.5 Construction of Confidence Limits for the Regression Line

The model is $y = \alpha + \beta x + \text{error}$.

The estimates b of $\beta$ and a of $\alpha$ are obtained for the example as:

$a = -1032.3$, $b = 12.4$

The error mean square $\hat{\sigma}^2_{y.x}$ (the error sum of squares divided by its n - 2 degrees of freedom) is given by:

$\hat{\sigma}^2_{y.x} = \sum (y - a - bx)^2 / (15 - 2) = 30.813$

For a particular value of X = x (Brinell hardness = x), the predicted value of the tensile strength ($\hat{y}$) is:

$\hat{y} = -1032.3 + 12.4x$

The standard deviation of $\hat{y}$ given X = x is

$s(\hat{y} \mid x) = \hat{\sigma}_{y.x} \left[ \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum (x - \bar{x})^2} \right]^{1/2}$

Therefore, for a given x the confidence limits on the value of y are

$\hat{y} \pm t \, s(\hat{y} \mid x)$

where t is the value of a t distribution with (n - 2 = 13) degrees of freedom.

Since we are interested in the confidence limits for the whole of the regression line, these limits for individual $\hat{y}$ have to be relaxed. The appropriate multiplier is $(2F)^{1/2}$, where F is the upper 5 percent tail of the F distribution with degrees of freedom $\nu_1 = 2$ and $\nu_2 = (n - 2) = (15 - 2) = 13$. From the tables of F, the value of F(2, 13) at 5 percent level of significance is 3.80. So the multiplier is $(2 \times 3.8)^{1/2} = 2.76$.

So the confidence limits for the regression line are given by:

$a + bx + 2.76 \sqrt{30.813 \left\{ \dfrac{1}{15} + \dfrac{(x - \bar{x})^2}{49.56} \right\}}$

and

$a + bx - 2.76 \sqrt{30.813 \left\{ \dfrac{1}{15} + \dfrac{(x - \bar{x})^2}{49.56} \right\}}$

Therefore, the confidence limits of the regression line for the data given in Table 1 have been calculated from the above expressions and are given in Table 2.

NOTE: The upper and lower limits for the regression line form a hyperbolic curve. When x is close to $\bar{x}$ the contribution of the term $(x - \bar{x})^2$ is small. As x deviates from $\bar{x}$, the contribution of this term increases.

Table 2 Confidence Limits for Regression Line
(Clause 5.2.5)

x        Predicted y   Upper Limit   Lower Limit
(1)      (2)           (3)           (4)
100.6    215.14        223.05        207.23
101.5    226.30        232.59        220.01
102.0    232.50        237.99        227.01
103.1    246.14        250.36        241.91
103.8    254.82        258.78        250.85
104.2    259.78        263.86        255.70
104.9    268.46        273.14        263.78
105.6    277.14        282.18        271.50
106.3    285.82        292.63        279.01

5.3 Method of Calculation (Grouped Data)

5.3.1 Sometimes, the observations on the two variables x and y are presented in the form of a frequency table. In such situations the range of each variate is divided into a number of class intervals of equal width (say $l_x$ for p classes of the independent variable x and $l_y$ for q classes of the dependent variable y). The class widths for x and y need not be equal, and the frequency $f_{x_i y_j}$ in the cell is determined by the ith class interval of the first variate and the jth class interval of the second variate. This would result in a bivariate frequency distribution table (see Annex B).

5.3.2 As a first step for calculating the regression line, another proforma (see Annex C) is to be prepared.

5.3.3 The different entries in the above proforma are explained below:

a) In the top row are given the mid-values of the class intervals for the independent variable x, whereas in the first column are given the mid-values of the class intervals for the dependent variable y. In the column $f_y$ are given the total frequencies of the corresponding rows, whereas in the row corresponding to $f_x$ are given the totals of the corresponding frequencies in the various columns.
b) In the row corresponding to u are given the transformed variables for x, which are obtained by subtracting an arbitrary quantity $x_0$ (preferably the value of x closest to the median) from each of the mid-values of the class intervals for the x variate and dividing these differences by the width of the class intervals for the x variate. That is, $u = (x - x_0) / l_x$, where $l_x$ is the width of the class interval for x. A similar transformed variable v is given for the variate y in the respective column, $v = (y - y_0) / l_y$.
c) The next two rows, namely, $u f_x$ and $u^2 f_x$, are self-explanatory. So also the two columns corresponding to $v f_y$ and $v^2 f_y$.
d) The row corresponding to V is obtained as the sum of the products of v and the corresponding frequency in the column. So
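The ungrouped-data method of 5.2.1 and 5.2.2 and the confidence limits of 5.2.5 can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the standard: the function names (`corrected_sums`, `fit_line`, `confidence_band`) are hypothetical, the data are the 15 pairs of Table 1, and the multiplier 2.76 is copied from the text rather than computed from an F table. The sketch keeps unrounded coefficients, so it gives b of about 12.41 and a of about -1033.2, whereas the worked example rounds b to 12.4 before computing a = -1032.3. Band limits therefore differ slightly from Table 2.

```python
from math import sqrt

# Table 1 of the standard: Brinell hardness number (x) and tensile strength in MPa (y).
x = [104.2, 106.1, 105.6, 106.3, 101.7, 104.4, 102.0, 103.8,
     104.0, 101.5, 101.9, 100.6, 104.9, 106.2, 103.1]
y = [268.0, 278.6, 275.0, 281.5, 232.4, 272.2, 227.5, 255.1,
     259.5, 229.0, 233.8, 205.9, 272.0, 280.3, 242.2]


def corrected_sums(xs, ys):
    """Return n, both means and the corrected sums of 5.2.1 c), d) and e)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs) - sum_x ** 2 / n
    syy = sum(v * v for v in ys) - sum_y ** 2 / n
    sxy = sum(p * q for p, q in zip(xs, ys)) - sum_x * sum_y / n
    return n, sum_x / n, sum_y / n, sxx, syy, sxy


def fit_line(xs, ys):
    """Return (a, b) of the regression line y = a + b*x, as in 5.2.2."""
    _, x_bar, y_bar, sxx, _, sxy = corrected_sums(xs, ys)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def confidence_band(x0, xs, ys, multiplier=2.76):
    """Return (lower, predicted, upper) for the regression line at x0, as in 5.2.5.

    multiplier is sqrt(2 * F). The default 2.76 is the value the text derives
    from F(2, 13) = 3.80, so it is an assumption that only holds for n = 15
    at the 5 percent level.
    """
    n, x_bar, y_bar, sxx, syy, sxy = corrected_sums(xs, ys)
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Residual sum of squares of the least-squares line is syy - b * sxy.
    residual_variance = (syy - b * sxy) / (n - 2)
    half_width = multiplier * sqrt(
        residual_variance * (1 / n + (x0 - x_bar) ** 2 / sxx)
    )
    predicted = a + b * x0
    return predicted - half_width, predicted, predicted + half_width


if __name__ == "__main__":
    a, b = fit_line(x, y)
    print(f"y = {a:.1f} + {b:.2f} x")
    for x0 in (100.6, 103.8, 106.3):
        lower, predicted, upper = confidence_band(x0, x, y)
        print(f"x = {x0}: {lower:.2f} <= {predicted:.2f} <= {upper:.2f}")
```

Computing the corrected sums once and deriving a, b and the residual variance from them mirrors the order of the proforma in the text. For a different sample size the multiplier would have to be recomputed from the appropriate F value.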
