Chapter 21

Logistic Regression

Contents

21.1 Introduction
 21.1.1 Difference between standard and logistic regression
 21.1.2 The Binomial Distribution
 21.1.3 Odds, risk, odds-ratio, and probability
 21.1.4 Modeling the probability of success
 21.1.5 Logistic regression
21.2 Data Structures
21.3 Assumptions made in logistic regression
21.4 Example: Space Shuttle - Single continuous predictor
21.5 Example: Predicting Sex from physical measurements - Multiple continuous predictors
21.6 Retrospective and Prospective odds-ratio
21.7 Example: Parental and student usage of recreational drugs - 2×2 table
21.8 Example: Effect of selenium on tadpole deformities - 2×k table
21.9 Example: Pet fish survival - Multiple categorical predictors
21.10 Example: Horseshoe crabs - Continuous and categorical predictors
21.11 Assessing goodness of fit
21.12 Variable selection methods
 21.12.1 Introduction
 21.12.2 Example: Predicting creditworthiness
21.13 Complete Separation in Logistic Regression
21.14 Final Words
 21.14.1 Zero counts
 21.14.2 Choice of link function
 21.14.3 More than two response categories
 21.14.4 Exact logistic regression with very small datasets
 21.14.5 More complex experimental designs
 21.14.6 Yet to do
The suggested citation for this chapter of notes is:

Schwarz, C.J. (2015). Logistic Regression. In Course Notes for Beginning and Intermediate Statistics.
Available at http://www.stat.sfu.ca/~cschwarz/CourseNotes. Retrieved 2015-08-20.
21.1 Introduction
21.1.1 Difference between standard and logistic regression
In regular multiple-regression problems, the Y variable is assumed to have a continuous distribution with the vertical deviations around the regression line being independently normally distributed with a mean of 0 and a constant variance σ². The X variables are either continuous or indicator variables.

In some cases, the Y variable is a categorical variable, often with two distinct classes. The X variables can be either continuous or indicator variables. The object is now to predict the CATEGORY in which a particular observation will lie.

For example:
• The Y variable is over-winter survival of a deer (yes or no) as a function of the body mass, condition factor, and winter severity index.

• The Y variable is fledging (yes or no) of birds as a function of distance from the edge of a field, food availability, and predation index.

• The Y variable is breeding (yes or no) of birds as a function of nest density, predators, and temperature.
Consequently, the linear regression model with normally distributed vertical deviations really doesn’t make much sense – the response variable is a category and does NOT follow a normal distribution. In these cases, a popular methodology that is used is logistic regression.
There are a number of good books on the use of logistic regression:

• Agresti, A. (2002). Categorical Data Analysis. Wiley: New York.

• Hosmer, D.W. and Lemeshow, S. (2000). Applied Logistic Regression. Wiley: New York.

These should be consulted for all the gory details on the use of logistic regression.
21.1.2 The Binomial Distribution
A common probability model for outcomes that come in only two states (e.g. alive or dead, success or failure, breeding or not breeding) is the Binomial distribution. The Binomial distribution counts the number of times that a particular event will occur in a sequence of observations.[1] The binomial distribution is used when a researcher is interested in the occurrence of an event, not in its magnitude. For instance, in a clinical trial, a patient may survive or die. The researcher studies the number of
survivors, and not how long the patient survives after treatment. In a study of bird nests, the number in the clutch that hatch is measured, not the length of time to hatch.

[1] The Poisson distribution is a close cousin of the Binomial distribution and is discussed in other chapters.
In general the binomial distribution counts the number of events in a set of trials, e.g. the number of deaths in a cohort of patients, the number of broken eggs in a box of eggs, or the number of eggs that hatch from a clutch. Other situations in which binomial distributions arise are quality control, public opinion surveys, medical research, and insurance problems.
It is important to examine the assumptions being made before a Binomial distribution is used. The conditions for a Binomial Distribution are:

• n identical trials (n could be 1);

• all trials are independent of each other;

• each trial has only one outcome, success or failure;

• the probability of success is constant for the set of n trials. Some books use p to represent the probability of success; other books use π to represent the probability of success;[2]

• the response variable Y is the number of successes[3] in the set of n trials.

[2] Following the convention that Greek letters refer to the population parameters, just like µ refers to the population mean.

[3] There is great flexibility in defining what is a success. For example, you could count either the number of successful eggs that hatch or the number of eggs that failed to hatch in a clutch. You will get the same answers from the analysis after making the appropriate substitutions.

However, not all experiments that on the surface look like binomial experiments satisfy all the assumptions required. Typical failures of assumptions include non-independence (e.g. the first bird that hatches destroys remaining eggs in the nest), or changing p within a set of trials (e.g. measuring genetic abnormalities for a particular mother as a function of her age; for many species, older mothers have a higher probability of genetic defects in their offspring as they age).
The probability of observing Y successes in n trials, if each success has a probability p of occurring, can be computed using:

p(Y = y | n, p) = \binom{n}{y} p^y (1 - p)^{n - y}
where the binomial coefficient is computed as

\binom{n}{y} = \frac{n!}{y!(n - y)!}

and where n! = n(n−1)(n−2)...(2)(1).
For example, the probability of observing Y = 3 eggs hatch from a nest with n = 5 eggs in the clutch, if the probability of success is p = .2, is

p(Y = 3 | n = 5, p = .2) = \binom{5}{3} (.2)^3 (1 - .2)^{5-3} = .0512
Fortunately, we will have little need for these probability computations. There are many tables that tabulate the probabilities for various combinations of n and p – check the web.
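For readers who want to check such a value with software rather than tables, here is a minimal Python sketch (not part of the original notes; it only assumes the Python standard library):

from math import comb

def binom_prob(y, n, p):
    # P(Y = y) for a Binomial(n, p) random variable
    return comb(n, y) * p**y * (1 - p)**(n - y)

# probability that 3 of 5 eggs hatch when each hatches with probability 0.2
print(binom_prob(3, n=5, p=0.2))   # approximately 0.0512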
There are two important properties of a binomial distribution that will serve us in the future. If Y is Binomial(n, p), then:
• E[Y] = np

• V[Y] = np(1−p) and the standard deviation of Y is \sqrt{np(1-p)}

For example, if n = 20 and p = .4, then the average number of successes in these 20 trials is E[Y] = np = 20(.4) = 8.
If an experiment is observed, and a certain number of successes is observed, then the estimator for the success probability is found as:

\hat{p} = \frac{Y}{n}

For example, if a clutch of 5 eggs is observed (the set of trials) and 3 successfully hatch, then the estimated proportion of eggs that hatch is p̂ = 3/5 = .60. This is exactly analogous to the case where a sample is drawn from a population and the sample average Ȳ is used to estimate the population mean µ.
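As a quick numerical check of these two properties and of the estimator (again a sketch that is not part of the original notes):

from math import sqrt

n, p = 20, 0.4
mean_Y = n * p                 # E[Y] = np = 8
sd_Y = sqrt(n * p * (1 - p))   # sqrt(np(1-p)), about 2.19

y_observed, n_trials = 3, 5    # 3 of 5 eggs hatched
p_hat = y_observed / n_trials  # estimated success probability = 0.60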
21.1.3 Odds, risk, odds-ratio, and probability

The odds of an event and the odds ratio of events are very common terms in logistic contexts. Consequently, it is important to understand exactly what these say and don't say.
The odds of an event are defined as:

Odds(event) = \frac{P(\text{event})}{P(\text{not event})} = \frac{P(\text{event})}{1 - P(\text{event})}
The notation used is often a colon separating the odds values. Some sample values are tabulated below:

Probability   Odds
.01           1:99
.1            1:9
.5            1:1
.6            6:4 or 3:2 or 1.5
.9            9:1
.99           99:1
For very small odds, the probability of the event is approximately equal to the odds. For example, if the odds are 1:99, then the probability of the event is 1/100, which is roughly equal to 1/99.
The odds ratio (OR) is, by definition, the ratio of two odds:

OR_{A\ vs.\ B} = \frac{odds(A)}{odds(B)} = \frac{P(A)/(1 - P(A))}{P(B)/(1 - P(B))}
For example, if the probability of an egg hatching under condition A is 1/10 and the probability of an egg hatching under condition B is 1/20, then the odds ratio is OR = (1:9)/(1:19) = 2.1:1. Again, for very small odds, the odds ratio is approximately equal to the ratio of the probabilities. An odds ratio of 1 would indicate that the probabilities of the two events are equal.

In many studies, you will hear reports that the odds of an event have doubled. This gives NO information about the base rate. For example, did the odds increase from 1:million to 2:million or from 1:10 to 2:10?
It turns out that it is convenient to model probabilities on the log-odds scale. The log-odds (LO), also known as the logit, is defined as:

logit(A) = \log_e(odds(A)) = \log_e\left(\frac{P(A)}{1 - P(A)}\right)
We can extend the previous table to compute the log-odds:

Probability   Odds                Logit
.01           1:99                −4.59
.1            1:9                 −2.20
.5            1:1                 0
.6            6:4 or 3:2 or 1.5   .41
.9            9:1                 2.20
.99           99:1                4.59
Notice that the log-odds is zero when the probability is .5, and that the log-odds of .01 is symmetric with the log-odds of .99.

It is also easy to go back from the log-odds scale to the regular probability scale in two equivalent ways:

p = \frac{e^{\text{log-odds}}}{1 + e^{\text{log-odds}}} = \frac{1}{1 + e^{-\text{log-odds}}}

Notice the minus sign in the second back-translation. For example, a LO = 10 translates to p = .9999; a LO = 4 translates to p = .98; a LO = 1 translates to p = .73; etc.
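These conversions are easy to script. Here is a small Python sketch (not part of the original notes) that reproduces some of the values above:

from math import log, exp

def odds(p):
    # odds of an event that has probability p
    return p / (1 - p)

def logit(p):
    # log-odds (logit) of an event that has probability p
    return log(p / (1 - p))

def inv_logit(lo):
    # back-transform a log-odds value to a probability
    return 1 / (1 + exp(-lo))

print(logit(0.01))              # about -4.59
print(inv_logit(4))             # about 0.98
print(odds(0.10) / odds(0.05))  # odds ratio of about 2.1 for the egg-hatching example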
21.1.4 Modeling the probability of success

Now if the probability of success was the same for all trials, the analysis would be trivial: simply tabulate the total number of successes and divide by the total number of trials to estimate the probability of success. However, what we are really interested in is the relationship of the probability of success to some covariate X such as temperature or condition factor.

For example, consider the following (hypothetical) example of an experiment where various bird nests were found at various heights above the ground. For each nest, it was recorded if the nest was successful (at least one bird fledged) or not successful.
CAUTION: Beware of pseudo-replication, especially when dealing with logistic regression. Notice that the experimental unit is the NEST (and not the individual chicks) because the explanatory variable (height) operates at the nest level and not the chick level.[4] The measurement is taken at the nest level (successful or not) rather than at the chick level. If measurements were taken on individual chicks within the nest, then the experimental unit (the nest) is different from the observational unit (the chick) and more advanced methods must be used.

[4] For example, one could imagine that individual nests are randomized to different heights. It is much more difficult to imagine that individual chicks were randomized to different heights.
Height   # Nests   # Successful   p̂
2.0 4 0 0.00
3.0 3 0 0.00
2.5 5 0 0.00
3.3 3 2 0.67
4.7 4 1 0.25
3.9 2 0 0.00
5.2 4 2 0.50
10.5 5 5 1.00
4.7 4 2 0.50
6.8 5 3 0.60
7.3 3 3 1.00
8.4 4 3 0.75
9.2 3 2 0.67
8.5 4 4 1.00
10.0 3 3 1.00
12.0 6 6 1.00
15.0 4 4 1.00
12.2 3 3 1.00
13.0 5 5 1.00
12.9 4 4 1.00
Notice that the probability of a successful nest seems to increase with height above the ground (potentially reflecting distance from predators?).

We would like to model the probability of (nest) success as a function of height. As a first attempt, suppose that we plot the estimated probability of success (p̂) as a function of height and try to fit a straight line to the plotted points.
The data is available in the JMP data file nestsuccess.jmp available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Here is a portion of the data file:

The column labeled p-hat is the empirical probability of (nest) success at each height and is found as p̂ = #successful / #nests. The column labelled Not-Succesfull is the number of nests that were not successful
and is computed using a formula variable. The column labeled empirical log-odds is the estimated log-odds from the data and is explained later.
Let us start by plotting the probability of a successful nest as a function of height and fitting an ordinary (least-squares) regression line. The Analyze->Fit Y-by-X platform was used, and p̂ was treated as the Y variable and Height as the X variable:
This plot is not entirely satisfactory for a number of reasons:

• The data points seem to follow an S-shaped relationship with probabilities of success near 0 at lower heights and near 1 at higher heights.

• The fitted line gives predictions for the probability of success that are more than 1 and less than 0, which is impossible.

• The fitted line cannot deal properly with the fact that the probability of success is likely close to 0% for a wide range of small heights and essentially close to 100% for a wide range of taller heights.

• The assumption of a normal distribution for the deviations from the fitted line is not tenable as the p̂ are essentially discrete for the small number of nests found in this experiment.

• While not apparent from this graph, the variability of the response changes over the different parts of the regression line. For example, when the true probability of success is very low (say 0.1), the standard deviation of the number of successful nests in a set of 5 nests is found as \sqrt{5(.1)(.9)} = .67, while the standard deviation of the number of successful nests in a set of 5 nests with a probability of success of 0.5 is \sqrt{5(.5)(.5)} = 1.1, which is almost twice as large as the previous standard deviation.
For these (and other) reasons, the analysis of this type of data is commonly done on the log-odds
(also called the logit) scale. The odds of an event are computed as:

ODDS = \frac{p}{1 - p}

and the log-odds is found as the (natural) logarithm of the odds:

LO = \log\left(\frac{p}{1 - p}\right)
This transformation converts the 0-1 scale of probability to a −∞ to +∞ scale as illustrated below:
p LO
0.001 -6.91
0.01 -4.60
0.05 -2.94
0.1 -2.20
0.2 -1.39
0.3 -0.85
0.4 -0.41
0.5 0.00
0.6 0.41
0.7 0.85
0.8 1.39
0.9 2.20
0.95 2.94
0.99 4.60
0.999 6.91
Notice that the log-odds scale is symmetrical about 0, and that for moderate values of p, changes on the p-scale correspond to nearly constant changes on the log-odds scale. For example, going from .5 → .6 → .7 on the p-scale corresponds to moving from 0 → .41 → .85 on the log-odds scale.
It is also easy to go back from the log-odds scale to the regular probability scale:

p = \frac{e^{LO}}{1 + e^{LO}} = \frac{1}{1 + e^{-LO}}

For example, a LO = 10 translates to p = .9999; a LO = 4 translates to p = .98; a LO = 1 translates to p = .73; etc.
We can now return to the previous data. At first glance, it would seem that the log-odds is simply estimated as:

\widehat{LO} = \log\left(\frac{\hat{p}}{1 - \hat{p}}\right)
but this doesn’t work well with small sample sizes (it can be shown that the simple logit function is biased) or when values of p̂ are close to 0 or 1 (the simple logit function hits ±∞). Consequently, in small samples or when the observed probability of success is close to 0 or 1, the empirical log-odds is often computed as:

\widehat{LO}_{empirical} = \log\left(\frac{n\hat{p} + .5}{n(1 - \hat{p}) + .5}\right) = \log\left(\frac{\hat{p} + .5/n}{1 - \hat{p} + .5/n}\right)
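A minimal Python sketch (not part of the original notes) of this adjusted, or empirical, log-odds, checked against a few rows of the nest data:

from math import log

def empirical_log_odds(successes, n):
    # adjusted logit: log((y + 0.5) / (n - y + 0.5))
    return log((successes + 0.5) / (n - successes + 0.5))

# (height, # nests, # successful) for three of the nests in the table that follows
for height, n, y in [(2.0, 4, 0), (3.3, 3, 2), (10.5, 5, 5)]:
    print(height, round(empirical_log_odds(y, n), 2))
# prints -2.2, 0.51, and 2.4, matching the tabulated values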
We compute the empirical log-odds for the hatching data:

Height   # Nests   # Successful   p̂   empirical LO
2.0 4 0 0.00 -2.20
3.0 3 0 0.00 -1.95
2.5 5 0 0.00 -2.40
3.3 3 2 0.67 0.51
4.7 4 1 0.25 -0.85
3.9 2 0 0.00 -1.61
5.2 4 2 0.50 0.00
10.5 5 5 1.00 2.40
4.7 4 2 0.50 0.00
6.8 5 3 0.60 0.34
7.3 3 3 1.00 1.95
8.4 4 3 0.75 0.85
9.2 3 2 0.67 0.51
8.5 4 4 1.00 2.20
10.0 3 3 1.00 1.95
12.0 6 6 1.00 2.56
15.0 4 4 1.00 2.20
12.2 3 3 1.00 1.95
13.0 5 5 1.00 2.40
12.9 4 4 1.00 2.20
and now plot the empirical log-odds against height:
The fit is much nicer, the relationship has been linearized, and now, no matter what the prediction, it can always be translated back to a probability between 0 and 1 using the inverse transform seen earlier.
21.1.5 Logistic regression
But this is still not enough. Even on the log-odds scale the data points are not normally distributed around the regression line. Consequently, rather than using ordinary least-squares to fit the line, a technique called generalized linear modeling is used to fit the line.

In generalized linear models a method called maximum likelihood is used to find the parameters of the model (in this case, the intercept and the regression coefficient of height) that give the best fit to the data. While details of maximum likelihood estimation are beyond the scope of this course, they are closely related to weighted least squares in this class of problems. Maximum Likelihood Estimators (often abbreviated as MLEs) are, under fairly general conditions, guaranteed to be the “best” (in the sense of having smallest standard errors) in large samples. In small samples, there is no guarantee that MLEs are optimal, but in practice, MLEs seem to work well. In most cases, the calculations must be done numerically – there are no simple formulae as in simple linear regression.[5]
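The notes fit this model in JMP. As a rough illustration of the same maximum-likelihood fit outside JMP, here is a Python sketch using the statsmodels package (the use of numpy/statsmodels is an assumption; the data values are those of the nest table above):

import numpy as np
import statsmodels.api as sm

# grouped nest data: height of nest, number of successful nests, total nests found
height  = np.array([2.0, 3.0, 2.5, 3.3, 4.7, 3.9, 5.2, 10.5, 4.7, 6.8,
                    7.3, 8.4, 9.2, 8.5, 10.0, 12.0, 15.0, 12.2, 13.0, 12.9])
success = np.array([0, 0, 0, 2, 1, 0, 2, 5, 2, 3, 3, 3, 2, 4, 3, 6, 4, 3, 5, 4])
nests   = np.array([4, 3, 5, 3, 4, 2, 4, 5, 4, 5, 3, 4, 3, 4, 3, 6, 4, 3, 5, 4])

# binomial GLM with the (default) logit link, fitted by maximum likelihood
X = sm.add_constant(height)                      # intercept + height
y = np.column_stack([success, nests - success])  # (successes, failures) per row
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.summary())   # estimated intercept and slope are on the log-odds scale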
In order to fit a logistic regression using maximum likelihood estimation, the data must be in a standard format. In particular, both successes and failures must be recorded along with a classification variable that is nominally scaled. For example, the set of nests (at 2.0 m) will generate two lines of data – one for the successful nests and one for the unsuccessful nests. If the count for a particular outcome is zero, it can be omitted from the data table, but I prefer to record a value of 0 so that there is no doubt that all nests were examined and none of this outcome were observed.

A new column was created in JMP for the number of nests that failed to fledge, and after stacking the revised data set, the data set in JMP that can be used for logistic regression looks like:[6]
[5] Other methods that are quite popular are non-iterative weighted least squares and discriminant function analysis. These are beyond the scope of this course.

[6] This stacked data is available in the nestsuccess2.jmp data set available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.