Chapter 21

Logistic Regression

Contents

21.1 Introduction
 21.1.1 Difference between standard and logistic regression
 21.1.2 The Binomial Distribution
 21.1.3 Odds, risk, odds-ratio, and probability
 21.1.4 Modeling the probability of success
 21.1.5 Logistic regression
21.2 Data Structures
21.3 Assumptions made in logistic regression
21.4 Example: Space Shuttle - Single continuous predictor
21.5 Example: Predicting Sex from physical measurements - Multiple continuous predictors
21.6 Retrospective and Prospective odds-ratio
21.7 Example: Parental and student usage of recreational drugs - 2×2 table
21.8 Example: Effect of selenium on tadpole deformities - 2×k table
21.9 Example: Pet fish survival - Multiple categorical predictors
21.10 Example: Horseshoe crabs - Continuous and categorical predictors
21.11 Assessing goodness of fit
21.12 Variable selection methods
 21.12.1 Introduction
 21.12.2 Example: Predicting creditworthiness
21.13 Complete Separation in Logistic Regression
21.14 Final Words
 21.14.1 Zero counts
 21.14.2 Choice of link function
 21.14.3 More than two response categories
 21.14.4 Exact logistic regression with very small datasets
 21.14.5 More complex experimental designs
 21.14.6 Yet to do
The suggested citation for this chapter of notes is:

Schwarz, C.J. (2015). Logistic Regression. In Course Notes for Beginning and Intermediate Statistics.
Available at http://www.stat.sfu.ca/~cschwarz/CourseNotes. Retrieved 2015-08-20.
21.1 Introduction
21.1.1 Difference between standard and logistic regression
In regular multiple-regression problems, the Y variable is assumed to have a continuous distribution with the vertical deviations around the regression line being independently normally distributed with a mean of 0 and a constant variance σ². The X variables are either continuous or indicator variables.

In some cases, the Y variable is a categorical variable, often with two distinct classes. The X variables can be either continuous or indicator variables. The object is now to predict the CATEGORY in which a particular observation will lie.

For example:
• The Y variable is over-winter survival of a deer (yes or no) as a function of the body mass, condition factor, and winter severity index.

• The Y variable is fledging (yes or no) of birds as a function of distance from the edge of a field, food availability, and predation index.

• The Y variable is breeding (yes or no) of birds as a function of nest density, predators, and temperature.
Consequently, the linear regression model with normally distributed vertical deviations really doesn’t make much sense – the response variable is a category and does NOT follow a normal distribution. In these cases, a popular methodology that is used is logistic regression.
There are a number of good books on the use of logistic regression:

• Agresti, A. (2002). Categorical Data Analysis. Wiley: New York.

• Hosmer, D.W. and Lemeshow, S. (2000). Applied Logistic Regression. Wiley: New York.

These should be consulted for all the gory details on the use of logistic regression.
21.1.2 The Binomial Distribution
A common probability model for outcomes that come in only two states (e.g. alive or dead, success or failure, breeding or not breeding) is the Binomial distribution. The Binomial distribution counts the number of times that a particular event will occur in a sequence of observations.[1] The binomial distribution is used when a researcher is interested in the occurrence of an event, not in its magnitude. For instance, in a clinical trial, a patient may survive or die. The researcher studies the number of
survivors, and not how long the patient survives after treatment. In a study of bird nests, the number in the clutch that hatch is measured, not the length of time to hatch.

[1] The Poisson distribution is a close cousin of the Binomial distribution and is discussed in other chapters.
In general the binomial distribution counts the number of events in a set of trials, e.g. the number of deaths in a cohort of patients, the number of broken eggs in a box of eggs, or the number of eggs that hatch from a clutch. Other situations in which binomial distributions arise are quality control, public opinion surveys, medical research, and insurance problems.
It is important to examine the assumptions being made before a Binomial distribution is used. The conditions for a Binomial Distribution are:

• n identical trials (n could be 1);

• all trials are independent of each other;

• each trial has only one outcome, success or failure;

• the probability of success is constant for the set of n trials. Some books use p to represent the probability of success; other books use π to represent the probability of success;[2]

• the response variable Y is the number of successes[3] in the set of n trials.

[2] Following the convention that Greek letters refer to the population parameters, just like µ refers to the population mean.

[3] There is great flexibility in defining what is a success. For example, you could count either the number of successful eggs that hatch or the number of eggs that failed to hatch in a clutch. You will get the same answers from the analysis after making the appropriate substitutions.

However, not all experiments that on the surface look like binomial experiments satisfy all the assumptions required. Typical failures of assumptions include non-independence (e.g. the first bird that hatches destroys remaining eggs in the nest), or changing p within a set of trials (e.g. measuring genetic abnormalities for a particular mother as a function of her age; for many species, older mothers have a higher probability of genetic defects in their offspring as they age).
The probability of observing Y successes in n trials, if each success has a probability p of occurring, can be computed using:

p(Y = y | n, p) = \binom{n}{y} p^y (1 - p)^{n - y}
where the binomial coefficient is computed as

\binom{n}{y} = \frac{n!}{y!(n - y)!}

and where n! = n(n−1)(n−2)...(2)(1).
For example, the probability of observing Y = 3 eggs hatch from a nest with n = 5 eggs in the clutch, if the probability of success is p = .2, is

p(Y = 3 | n = 5, p = .2) = \binom{5}{3} (.2)^3 (1 - .2)^{5-3} = .0512
Fortunately, we will have little need for these probability computations. There are many tables that tabulate the probabilities for various combinations of n and p – check the web.
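For readers who want to check such a value with software rather than tables, here is a minimal Python sketch (not part of the original notes; it only assumes the Python standard library):

from math import comb

def binom_prob(y, n, p):
    # P(Y = y) for a Binomial(n, p) random variable
    return comb(n, y) * p**y * (1 - p)**(n - y)

# probability that 3 of 5 eggs hatch when each hatches with probability 0.2
print(binom_prob(3, n=5, p=0.2))   # approximately 0.0512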
There are two important properties of a binomial distribution that will serve us in the future. If Y is Binomial(n, p), then:
• E[Y] = np

• V[Y] = np(1−p) and the standard deviation of Y is \sqrt{np(1-p)}

For example, if n = 20 and p = .4, then the average number of successes in these 20 trials is E[Y] = np = 20(.4) = 8.
If an experiment is observed, and a certain number of successes is observed, then the estimator for the success probability is found as:

\hat{p} = \frac{Y}{n}

For example, if a clutch of 5 eggs is observed (the set of trials) and 3 successfully hatch, then the estimated proportion of eggs that hatch is p̂ = 3/5 = .60. This is exactly analogous to the case where a sample is drawn from a population and the sample average Ȳ is used to estimate the population mean µ.
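As a quick numerical check of these two properties and of the estimator (again a sketch that is not part of the original notes):

from math import sqrt

n, p = 20, 0.4
mean_Y = n * p                 # E[Y] = np = 8
sd_Y = sqrt(n * p * (1 - p))   # sqrt(np(1-p)), about 2.19

y_observed, n_trials = 3, 5    # 3 of 5 eggs hatched
p_hat = y_observed / n_trials  # estimated success probability = 0.60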
21.1.3 Odds, risk, odds-ratio, and probability

The odds of an event and the odds ratio of events are very common terms in logistic contexts. Consequently, it is important to understand exactly what these say and don't say.
The odds of an event are defined as:

Odds(event) = \frac{P(\text{event})}{P(\text{not event})} = \frac{P(\text{event})}{1 - P(\text{event})}
The notation used is often a colon separating the odds values. Some sample values are tabulated below:

Probability   Odds
.01           1:99
.1            1:9
.5            1:1
.6            6:4 or 3:2 or 1.5
.9            9:1
.99           99:1
For very small odds, the probability of the event is approximately equal to the odds. For example, if the odds are 1:99, then the probability of the event is 1/100, which is roughly equal to 1/99.
The odds ratio (OR) is, by definition, the ratio of two odds:

OR_{A\ vs.\ B} = \frac{odds(A)}{odds(B)} = \frac{P(A)/(1 - P(A))}{P(B)/(1 - P(B))}
For example, if the probability of an egg hatching under condition A is 1/10 and the probability of an egg hatching under condition B is 1/20, then the odds ratio is OR = (1:9)/(1:19) = 2.1:1. Again, for very small odds, the odds ratio is approximately equal to the ratio of the probabilities. An odds ratio of 1 would indicate that the probabilities of the two events are equal.

In many studies, you will hear reports that the odds of an event have doubled. This gives NO information about the base rate. For example, did the odds increase from 1:million to 2:million or from 1:10 to 2:10?
It turns out that it is convenient to model probabilities on the log-odds scale. The log-odds (LO), also known as the logit, is defined as:

logit(A) = \log_e(odds(A)) = \log_e\left(\frac{P(A)}{1 - P(A)}\right)
We can extend the previous table to compute the log-odds:

Probability   Odds                Logit
.01           1:99                −4.59
.1            1:9                 −2.20
.5            1:1                 0
.6            6:4 or 3:2 or 1.5   .41
.9            9:1                 2.20
.99           99:1                4.59
Notice that the log-odds is zero when the probability is .5, and that the log-odds of .01 is symmetric with the log-odds of .99.

It is also easy to go back from the log-odds scale to the regular probability scale in two equivalent ways:

p = \frac{e^{\text{log-odds}}}{1 + e^{\text{log-odds}}} = \frac{1}{1 + e^{-\text{log-odds}}}

Notice the minus sign in the second back-translation. For example, a LO = 10 translates to p = .9999; a LO = 4 translates to p = .98; a LO = 1 translates to p = .73; etc.
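These conversions are easy to script. Here is a small Python sketch (not part of the original notes) that reproduces some of the values above:

from math import log, exp

def odds(p):
    # odds of an event that has probability p
    return p / (1 - p)

def logit(p):
    # log-odds (logit) of an event that has probability p
    return log(p / (1 - p))

def inv_logit(lo):
    # back-transform a log-odds value to a probability
    return 1 / (1 + exp(-lo))

print(logit(0.01))              # about -4.59
print(inv_logit(4))             # about 0.98
print(odds(0.10) / odds(0.05))  # odds ratio of about 2.1 for the egg-hatching example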
21.1.4 Modeling the probability of success

Now if the probability of success was the same for all trials, the analysis would be trivial: simply tabulate the total number of successes and divide by the total number of trials to estimate the probability of success. However, what we are really interested in is the relationship of the probability of success to some covariate X such as temperature or condition factor.

For example, consider the following (hypothetical) example of an experiment where various bird nests were found at various heights above the ground. For each nest, it was recorded if the nest was successful (at least one bird fledged) or not successful.
CAUTION: Beware of pseudo-replication, especially when dealing with logistic regression. Notice that the experimental unit is the NEST (and not the individual chicks) because the explanatory variable (height) operates at the nest level and not the chick level.[4] The measurement is taken at the nest level (successful or not) rather than at the chick level. If measurements were taken on individual chicks within the nest, then the experimental unit (the nest) is different from the observational unit (the chick) and more advanced methods must be used.

[4] For example, one could imagine that individual nests are randomized to different heights. It is much more difficult to imagine that individual chicks were randomized to different heights.
Height   # Nests   # Successful   p̂
2.0 4 0 0.00
3.0 3 0 0.00
2.5 5 0 0.00
3.3 3 2 0.67
4.7 4 1 0.25
3.9 2 0 0.00
5.2 4 2 0.50
10.5 5 5 1.00
4.7 4 2 0.50
6.8 5 3 0.60
7.3 3 3 1.00
8.4 4 3 0.75
9.2 3 2 0.67
8.5 4 4 1.00
10.0 3 3 1.00
12.0 6 6 1.00
15.0 4 4 1.00
12.2 3 3 1.00
13.0 5 5 1.00
12.9 4 4 1.00
Notice that the probability of a successful nest seems to increase with height above the ground (potentially reflecting distance from predators?).

We would like to model the probability of (nest) success as a function of height. As a first attempt, suppose that we plot the estimated probability of success (p̂) as a function of height and try to fit a straight line to the plotted points.
The data is available in the JMP data file nestsuccess.jmp available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Here is a portion of the data file:

The column labeled p-hat is the empirical probability of (nest) success at each height and is found as p̂ = #successful / #nests. The column labelled Not-Succesfull is the number of nests that were not successful
and is computed using a formula variable. The column labeled empirical log-odds is the estimated log-odds from the data and is explained later.
Let us start by plotting the probability of a successful nest as a function of height and fitting an ordinary (least-squares) regression line. The Analyze->Fit Y-by-X platform was used, and p̂ was treated as the Y variable and Height as the X variable:
This plot is not entirely satisfactory for a number of reasons:

• The data points seem to follow an S-shaped relationship with probabilities of success near 0 at lower heights and near 1 at higher heights.

• The fitted line gives predictions for the probability of success that are more than 1 and less than 0, which is impossible.

• The fitted line cannot deal properly with the fact that the probability of success is likely close to 0% for a wide range of small heights and essentially close to 100% for a wide range of taller heights.

• The assumption of a normal distribution for the deviations from the fitted line is not tenable as the p̂ are essentially discrete for the small number of nests found in this experiment.

• While not apparent from this graph, the variability of the response changes over the different parts of the regression line. For example, when the true probability of success is very low (say 0.1), the standard deviation of the number of successful nests in a set of 5 nests is found as \sqrt{5(.1)(.9)} = .67, while the standard deviation of the number of successful nests in a set of 5 nests with a probability of success of 0.5 is \sqrt{5(.5)(.5)} = 1.1, which is almost twice as large as the previous standard deviation.
For these (and other) reasons, the analysis of this type of data is commonly done on the log-odds
(also called the logit) scale. The odds of an event are computed as:

ODDS = \frac{p}{1 - p}

and the log-odds is found as the (natural) logarithm of the odds:

LO = \log\left(\frac{p}{1 - p}\right)
This transformation converts the 0-1 scale of probability to a −∞ to +∞ scale as illustrated below:
p LO
0.001 -6.91
0.01 -4.60
0.05 -2.94
0.1 -2.20
0.2 -1.39
0.3 -0.85
0.4 -0.41
0.5 0.00
0.6 0.41
0.7 0.85
0.8 1.39
0.9 2.20
0.95 2.94
0.99 4.60
0.999 6.91
Notice that the log-odds scale is symmetrical about 0, and that for moderate values of p, changes on the p-scale correspond to nearly constant changes on the log-odds scale. For example, going from .5 → .6 → .7 on the p-scale corresponds to moving from 0 → .41 → .85 on the log-odds scale.
It is also easy to go back from the log-odds scale to the regular probability scale:

p = \frac{e^{LO}}{1 + e^{LO}} = \frac{1}{1 + e^{-LO}}

For example, a LO = 10 translates to p = .9999; a LO = 4 translates to p = .98; a LO = 1 translates to p = .73; etc.
We can now return to the previous data. At first glance, it would seem that the log-odds is simply estimated as:

\widehat{LO} = \log\left(\frac{\hat{p}}{1 - \hat{p}}\right)
but this doesn’t work well with small sample sizes (it can be shown that the simple logit function is biased) or when values of p̂ are close to 0 or 1 (the simple logit function hits ±∞). Consequently, in small samples or when the observed probability of success is close to 0 or 1, the empirical log-odds is often computed as:

\widehat{LO}_{empirical} = \log\left(\frac{n\hat{p} + .5}{n(1 - \hat{p}) + .5}\right) = \log\left(\frac{\hat{p} + .5/n}{1 - \hat{p} + .5/n}\right)
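A minimal Python sketch (not part of the original notes) of this adjusted, or empirical, log-odds, checked against a few rows of the nest data:

from math import log

def empirical_log_odds(successes, n):
    # adjusted logit: log((y + 0.5) / (n - y + 0.5))
    return log((successes + 0.5) / (n - successes + 0.5))

# (height, # nests, # successful) for three of the nests in the table that follows
for height, n, y in [(2.0, 4, 0), (3.3, 3, 2), (10.5, 5, 5)]:
    print(height, round(empirical_log_odds(y, n), 2))
# prints -2.2, 0.51, and 2.4, matching the tabulated values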
We compute the empirical log-odds for the hatching data:

Height   # Nests   # Successful   p̂   empirical LO
2.0 4 0 0.00 -2.20
3.0 3 0 0.00 -1.95
2.5 5 0 0.00 -2.40
3.3 3 2 0.67 0.51
4.7 4 1 0.25 -0.85
3.9 2 0 0.00 -1.61
5.2 4 2 0.50 0.00
10.5 5 5 1.00 2.40
4.7 4 2 0.50 0.00
6.8 5 3 0.60 0.34
7.3 3 3 1.00 1.95
8.4 4 3 0.75 0.85
9.2 3 2 0.67 0.51
8.5 4 4 1.00 2.20
10.0 3 3 1.00 1.95
12.0 6 6 1.00 2.56
15.0 4 4 1.00 2.20
12.2 3 3 1.00 1.95
13.0 5 5 1.00 2.40
12.9 4 4 1.00 2.20
and now plot the empirical log-odds against height:
The fit is much nicer, the relationship has been linearized, and now, no matter what the prediction, it can always be translated back to a probability between 0 and 1 using the inverse transform seen earlier.
21.1.5 Logistic regression
But this is still not enough. Even on the log-odds scale the data points are not normally distributed around the regression line. Consequently, rather than using ordinary least-squares to fit the line, a technique called generalized linear modeling is used to fit the line.

In generalized linear models a method called maximum likelihood is used to find the parameters of the model (in this case, the intercept and the regression coefficient of height) that give the best fit to the data. While details of maximum likelihood estimation are beyond the scope of this course, they are closely related to weighted least squares in this class of problems. Maximum Likelihood Estimators (often abbreviated as MLEs) are, under fairly general conditions, guaranteed to be the “best” (in the sense of having smallest standard errors) in large samples. In small samples, there is no guarantee that MLEs are optimal, but in practice, MLEs seem to work well. In most cases, the calculations must be done numerically – there are no simple formulae as in simple linear regression.[5]
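The notes fit this model in JMP. As a rough illustration of the same maximum-likelihood fit outside JMP, here is a Python sketch using the statsmodels package (the use of numpy/statsmodels is an assumption; the data values are those of the nest table above):

import numpy as np
import statsmodels.api as sm

# grouped nest data: height of nest, number of successful nests, total nests found
height  = np.array([2.0, 3.0, 2.5, 3.3, 4.7, 3.9, 5.2, 10.5, 4.7, 6.8,
                    7.3, 8.4, 9.2, 8.5, 10.0, 12.0, 15.0, 12.2, 13.0, 12.9])
success = np.array([0, 0, 0, 2, 1, 0, 2, 5, 2, 3, 3, 3, 2, 4, 3, 6, 4, 3, 5, 4])
nests   = np.array([4, 3, 5, 3, 4, 2, 4, 5, 4, 5, 3, 4, 3, 4, 3, 6, 4, 3, 5, 4])

# binomial GLM with the (default) logit link, fitted by maximum likelihood
X = sm.add_constant(height)                      # intercept + height
y = np.column_stack([success, nests - success])  # (successes, failures) per row
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.summary())   # estimated intercept and slope are on the log-odds scale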
In order to fit a logistic regression using maximum likelihood estimation, the data must be in a standard format. In particular, both successes and failures must be recorded along with a classification variable that is nominally scaled. For example, the set of nests (at 2.0 m) will generate two lines of data – one for the successful nests and one for the unsuccessful nests. If the count for a particular outcome is zero, it can be omitted from the data table, but I prefer to record a value of 0 so that there is no doubt that all nests were examined and none of this outcome were observed.

A new column was created in JMP for the number of nests that failed to fledge, and after stacking the revised data set, the data set in JMP that can be used for logistic regression looks like:[6]
[5] Other methods that are quite popular are non-iterative weighted least squares and discriminant function analysis. These are beyond the scope of this course.

[6] This stacked data is available in the nestsuccess2.jmp data set available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.