Boosting hazard regression with time-varying covariates

Donald K.K. Lee* and Ningyuan Chen
Yale University and HKUST

February 14, 2017

*Correspondence: Donald Lee ([email protected]), Yale School of Management, New Haven CT 06520, USA

Abstract

Consider a left-truncated right-censored survival process whose evolution depends on time-varying covariates. Given functional data samples from the process, we propose a gradient boosting procedure for estimating its log-intensity function in a flexible manner to capture time-covariate interactions. The estimator is shown to be consistent if the model is correctly specified. Alternatively, an oracle inequality can be demonstrated for tree-based models. We use the procedure to shed new light on a question from the operations literature concerning the effect of workload on service rates in an emergency department.

To avoid overfitting, boosting employs several regularization devices. One of them is step-size restriction, but the rationale for this is somewhat mysterious from the viewpoint of consistency: in theoretical treatments of classification and regression problems, unrestricted greedy step-sizes appear to suffice. Given that the partial log-likelihood functional for hazard regression has unbounded curvature, our study suggests that step-size restriction might be a mechanism for preventing the curvature of the risk from derailing convergence.

Keywords: Survival analysis, gradient boosting, time-varying covariates, functional data, step-size restriction, regression trees, partial likelihood, queuing transition intensities, emergency departments.

MSC 2010 subject classifications: 62N02, 62G05, 90B22

1 Introduction

Consider the following left-truncated right-censored (LTRC) survival process: let U be a random failure time and X(·) = {X(t) ∈ X ⊆ R^p} an external, time-varying covariate process. Conditional on surviving up to time t−, the probability of failing in [t, t+dt) is λ(t−, X(t−))dt, where λ(t, x) is the intensity function of interest. The survival status of U and the value of X(t) are only observable inside the time window [L, C] defined by the truncation and censoring times respectively, with censoring occurring no later than a fixed time that is normalized to 1. Thus, given N realizations {L_i, U_i, C_i, X_i(·)}_{i=1}^N of the process, those for which failure occurred before truncation (U_i < L_i) are unobserved. The remaining n ≤ N realizations generate the samples {L_i, T_i, Δ_i, X_i(t) for t ∈ [L_i, T_i]}_{i=1}^n, where T_i = min(U_i, C_i) and Δ_i = I(L_i ≤ U_i ≤ C_i) is the censoring indicator. It is assumed that U and the truncation-censoring pair (L, C) are conditionally independent given X(·).

In addition to being of vital importance to the health sciences, the process above also underpins a wide swath of other applications, ranging from credit default models to customer transition dynamics in a queuing network. Having accurate models for these enables one to build high-resolution simulations and/or to evaluate the goodness-of-fit of simpler parametric forms. To this end, we wish to estimate the log-intensity F(t, x) = log λ(t, x) from a given set of functional data samples {L_i, T_i, Δ_i, X_i(t) for t ∈ [L_i, T_i]}_{i=1}^n. To capture possible interactions in the time-covariate domain of F(t, x), a flexible estimator is desired. A few nonparametric approaches exist in the literature and are described at the end of this section. Some of them are suited to scenarios involving a small number of covariates, while others focus on the analysis of theoretical estimators.
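For computational concreteness, here is a minimal sketch of how one such functional data sample could be represented, assuming purely for illustration that each covariate path is stored as a piecewise-constant step function; the class and field names below are ours, not the paper's.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LTRCSample:
    """One left-truncated right-censored sample: (L, T, Delta, X(t) for t in [L, T])."""
    L: float                 # truncation (entry) time
    T: float                 # observed end time, min(failure time U, censoring time C)
    delta: int               # 1 if the failure was observed (L <= U <= C), else 0
    jump_times: np.ndarray   # sorted times in [L, T] at which the covariate path changes
    jump_values: np.ndarray  # covariate vectors; row k holds X(t) for t >= jump_times[k]

    def X(self, t: float) -> np.ndarray:
        """Covariate value at time t in [L, T] (piecewise-constant interpolation)."""
        idx = int(np.searchsorted(self.jump_times, t, side="right")) - 1
        return self.jump_values[max(idx, 0)]
```

Only the attributes L, T, delta and the evaluation map X(t) are used in the sketches that follow, so any container exposing the same interface would serve equally well.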
In this paper we propose a practical boosting procedure for estimating F(t, x) in some class of functions defined on the time-covariate domain [0,1] × X. Gradient boosting (Breiman [7], Friedman [15], Mason et al. [39]) is a regularized functional descent algorithm for minimizing risk, which in this case is the negative partial log-likelihood

\[
\hat{R}_n(F) = \frac{1}{n}\sum_{i=1}^{n}\left\{\int_{L_i}^{T_i} e^{F(t,X_i(t))}\,dt \;-\; \Delta_i F(T_i, X_i(T_i))\right\}. \tag{1.1}
\]

(Strictly speaking, the term F(T_i, X_i(T_i)) in the likelihood should be F(T_i−, X_i(T_i−)); however, for our purposes it suffices to work with the modification (1.1), as its expectation is also uniquely minimized at log λ; see Lemma 2.)

Given that R̂_n(F) can be unbounded below, which can in turn lead to overfitting, three types of regularization are employed: i) a weak learner approximation to the gradient function is used as the descent direction (e.g. coordinate descent or a shallow regression tree fit [15]), which can also induce variable selection; ii) a small step-size is used for the line update [15]; and iii) the number of descent iterations is capped. Algorithm 1 describes the generic gradient boosting procedure, while our proposed procedure for hazard regression will be presented in section 2 (Algorithm 2).

Algorithm 1 Generic gradient boosting
1. Initialize F_0(t, x).
2. For m = 0 to m̂:
   • Compute a simple approximation g̃_m(t, x) to the gradient of R̂_n(F) at F_m.
   • Perform the updates F_{m+1}(t, x) ← F_m(t, x) − θ_m g̃_m(t, x), m ← m + 1, using a small step-size θ_m.
3. The boosting estimator is F̂_m̂(t, x) = F_0(t, x) − Σ_{m=0}^{m̂−1} θ_m g̃_m(t, x).

In contrast to common classification and regression problems that involve non-functional data, the way that R̂_n(F) depends on F poses a significant challenge to implementing boosting for hazard regression: we will show that the gradient of R̂_n(F) cannot be computed from its canonical representation (1.1). The key to rectifying this is to carefully specify the domain for R̂_n(F). In section 2 we construct a sample-dependent domain on which (1.1) can be re-expressed as an integral with respect to a random measure μ̂_n on [0,1] × X. Specifically, given a linear space F of functions on [0,1] × X, we use its completion in L²(μ̂_n), denoted (F, ⟨·,·⟩_{μ̂_n}), as the domain. The tradeoff is that, unlike (1.1), the integral representation is not coordinate-free, and this has implications for computations. To ensure that R̂_n(F) > −∞ on (F, ⟨·,·⟩_{μ̂_n}), we require each function in (F, ⟨·,·⟩_{μ̂_n}) to be bounded on the support of μ̂_n. Note that F must then be finite-dimensional because the resulting inclusion (F, ⟨·,·⟩_{μ̂_n}) ↪ L^∞(μ̂_n) is an embedding. Under regularity conditions our estimator is consistent if F is correctly specified. On the other hand, if F comprises flexible piecewise constant functions like regression trees, then the plug-in intensity estimator satisfies an oracle inequality.
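To make the empirical risk (1.1) concrete, the following sketch evaluates it for a candidate log-intensity F by trapezoidal quadrature over each observation window [L_i, T_i], reusing the LTRCSample container sketched above; the grid size and quadrature rule are arbitrary illustration choices, not part of the paper.

```python
import numpy as np

def empirical_risk(samples, F, grid_size=200):
    """Evaluate (1.1): (1/n) sum_i [ int_{L_i}^{T_i} exp(F(t, X_i(t))) dt
                                     - Delta_i * F(T_i, X_i(T_i)) ]."""
    total = 0.0
    for s in samples:
        ts = np.linspace(s.L, s.T, grid_size)
        vals = np.exp([F(t, s.X(t)) for t in ts])
        integral = 0.5 * np.sum((vals[1:] + vals[:-1]) * np.diff(ts))  # trapezoid rule
        total += integral - s.delta * F(s.T, s.X(s.T))
    return total / len(samples)
```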
The number of boosting iterations and the degree of step-size restriction employed in our procedure will be justified in section 3. The former is set using the framework of Zhang and Yu [58], which shows for general loss functions that early stopping serves to prevent the estimator from growing too large too fast. The role of step-size restriction, on the other hand, is more mysterious. While [58] demonstrates that small step-sizes are needed to minimize empirical risk in order to prove consistency, it appears that unrestricted greedy step-sizes are already small enough for commonly encountered learning problems: for the usual classification losses, Telgarsky [52] and the works cited therein show that shrinkage is only needed for margin maximization. For other losses that are strongly convex and have uniformly bounded curvatures, the appendix of [58] shows that greedy line-search suffices for minimizing risk. In our study of hazard regression, where the empirical risk has unbounded curvature, we contribute to clarifying when explicit step-size restrictions are necessary: we find that step-size restriction serves as a mechanism for counterbalancing the curvature of the risk. If the curvature is unbounded then the step-sizes may need to be explicitly controlled to ensure empirical risk minimization. Our parsimonious bounds make it easy to see that smaller step-sizes always lead to faster convergence (in terms of sample size) at the expense of more iterations, with diminishing returns in performance. This matches the empirical observations in [15] and holds regardless of whether the risk has bounded curvature or not.

The remainder of this paper is organized as follows. Related approaches in the literature are described below. The proposed estimator is constructed in section 2 and its guarantees derived in section 3. In section 4 we apply our procedure to a dataset from an emergency department to shed new light on a question from the operations literature concerning the effect of workload on service rates. Concluding remarks can be found in section 5.

Related nonparametric approaches. Since R̂_n(F) can be unbounded below, some restrictions need to be imposed on the function class F or on F(t, x) itself to prevent overfitting. One way is to impose shape constraints on the intensity function, as first suggested in Grenander [17]. Although there are methods for shape-constrained multivariate density estimation like Cule et al. [12], no analogue appears to exist for hazard regression in the presence of covariates.

Another way to obtain restrictions is to control the dimension of F, allowing it to grow slowly with sample size. The resulting sequence of MLEs forms a sieve estimator that can handle time-fixed covariates (Dohler and Ruschendorf [14], Karr [24], Kooperberg et al. [30]) as well as time-varying ones (Huang and Stone [23]). However, for a given sample size, the theory provides no procedure for computing the functional MLE or even which F to use. The MLE is particularly challenging to compute when additional constraints are assumed as in [14]. In practice, researchers resort to using a stepwise feature selection heuristic to construct an estimator from products of univariate splines (Kooperberg and Clarkson [28], Kooperberg et al. [29]). (An alternative is to cast sieve estimation as a model selection problem where F̂ and dim F are jointly chosen to minimize the sum of the risk and a penalty on dim F. However, this requires solving a separate minimization problem for each candidate function class, which is computationally burdensome except for cases where the sequence of minimizers has orthogonal increments.)

An opposite approach to sieves is to fix a large or flexible function class and directly control the complexity of the estimator itself. Both boosting and penalized likelihood methods are similar in this sense. However, beyond the case of a single time-fixed covariate (Gu [18], O'Sullivan [44], Senthilselvan [50]), recent literature on penalized likelihood has moved from the nonparametric setting toward some form of separability.

For methods not based on likelihood maximization, kernel estimators can be obtained from smoothing in time-covariate space (Li and Doss [32], McKeague and Utikal [41], Nielsen and Linton [43], Perez et al. [45], Spierdijk [51], van Keilegom and Veraverbeke [56]). If the covariates are high-dimensional then kernel smoothing is susceptible to the curse of dimensionality, so semiparametric restrictions are needed (Linton et al. [34]).
Boosting semiparametric hazard models. Gradient boosting has previously been applied to both parametric and semiparametric models with time-fixed covariates: Ridgeway [46], Li and Luan [33], and Hothorn et al. [22] considered the Cox proportional hazards model, while Schmid and Hothorn [49] examined accelerated failure time models. For additive hazards models, high-dimensional covariates can be handled using either principal components regression (Ma et al. [37]) or the Lasso (Martinussen and Scheike [38]). Boosting can certainly be applied as well, given that the gradient approximation step in Algorithm 1 can be used to select variables.

2 Constructing the estimator

The chief difficulty with implementing gradient boosting for hazard regression is that the gradient of the likelihood risk (1.1) cannot be determined from its directional derivative

\[
\frac{d}{d\theta}\hat{R}_n(F+\theta f)\Big|_{\theta=0}
= \frac{1}{n}\sum_{i=1}^{n}\int_{L_i}^{T_i} e^{F(t,X_i(t))} f(t,X_i(t))\,dt
\;-\; \frac{1}{n}\sum_{i=1}^{n}\Delta_i f(T_i, X_i(T_i)).
\]

While this can be written as the difference of two different inner products ⟨e^F, f⟩_† − ⟨1, f⟩_‡, these cannot be combined into a single inner product of f with some gradient function. This is intuitive since the first one involves time integrals while the second consists of pointwise evaluations. To remedy this, we will equip the class F of candidate estimators with an appropriate reproducing kernel by carefully specifying a sample-dependent domain for R̂_n(F). The likelihood risk can then be re-expressed as a smooth convex functional whose gradient is readily obtainable. An analogous representation also exists for the expected risk, which forms the basis for analyzing the estimator.

Laying the groundwork. We begin by stating four conditions A1-A4 that will be used throughout the paper. Condition A1 below is the same as Condition 1(iv) of [23].

A1 λ(t, x) is bounded within some interval [Λ_L, Λ_H] ⊂ (0, ∞) on the time-covariate domain [0,1] × X.

A2 {L_i, U_i, C_i, X_i(·)}_{i=1}^N are independent and identically distributed. (In the context of the application in section 4, simulation results suggest that our procedure is potentially robust to non-IID covariates.)

Next, some regularity conditions on the covariate process X(·) are needed for the existence of the expected likelihood risk (Lemma 2). Let T denote the Borel algebra for [0,1], and (Ω, S, P) the probability space for L, U, C, and X(·). Implicit in this statement is that X(·) is a random function, which we require to satisfy

A3 The map (ω, t) ↦ X(t; ω) is S ⊗ T-measurable.

The condition ensures that {t, X(t)} is S ⊗ T-measurable and {T, X(T)} is S-measurable, and it is automatically satisfied if the sample trajectories of X(t) are either left-continuous or right-continuous (progressive measurability). Denoting the indicator function as I(·), we can then define the following population and empirical sub-probability measures on [0,1] × X:

\[
\mu(B) = E\int_0^1 I(L\le t\le T)\, I[\{t, X(t)\}\in B]\,dt, \tag{2.1}
\]

\[
\hat{\mu}_n(B) = \frac{1}{n}\sum_{i=1}^{n}\int_{L_i}^{T_i} I[\{t, X_i(t)\}\in B]\,dt
= \frac{1}{n}\sum_{i=1}^{N}\int_0^1 I(L_i\le t\le T_i)\, I[\{t, X_i(t)\}\in B]\,dt. \tag{2.2}
\]

Intuitively, μ̂_n measures the denseness of the sample time-covariate paths on [0,1] × X. Bearing in mind that n = Σ_{i=1}^N I(L_i ≤ U_i) is random, E{(n/N) μ̂_n(B)} = μ(B) under A2. Furthermore, for any integrable f,

\[
\int f\,d\mu = E\int_0^1 I(L\le t\le T)\, f(t, X(t))\,dt, \tag{2.3}
\]

\[
\int f\,d\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^{n}\int_{L_i}^{T_i} f(t, X_i(t))\,dt. \tag{2.4}
\]

This allows us to define the following (random) norms and inner product

\[
\|f\|_{\hat{\mu}_n,1} = \int |f|\,d\hat{\mu}_n, \qquad
\|f\|_{\hat{\mu}_n,2} = \Big(\int f^2\,d\hat{\mu}_n\Big)^{1/2},
\]
\[
\|f\|_{\infty} = \sup\{|f(t,x)| : (t,x)\in[0,1]\times\mathcal{X}\}, \qquad
\langle f, f'\rangle_{\hat{\mu}_n} = \int f f'\,d\hat{\mu}_n,
\]

and note that ‖·‖_{μ̂_n,1} ≤ ‖·‖_{μ̂_n,2} ≤ ‖·‖_∞ because μ̂_n([0,1] × X) ≤ 1.
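Under the same illustrative assumptions as before, integrals against μ̂_n as in (2.4), and hence the empirical norms and inner product above, reduce to sums of one-dimensional quadratures over the observation windows. A hedged sketch:

```python
import numpy as np

def integral_mu_hat(samples, f, grid_size=200):
    """(2.4): int f dmu_hat_n = (1/n) sum_i int_{L_i}^{T_i} f(t, X_i(t)) dt."""
    total = 0.0
    for s in samples:
        ts = np.linspace(s.L, s.T, grid_size)
        vals = np.array([f(t, s.X(t)) for t in ts])
        total += 0.5 * np.sum((vals[1:] + vals[:-1]) * np.diff(ts))  # trapezoid rule
    return total / len(samples)

def inner_mu_hat(samples, f, g):
    """Empirical inner product <f, g>_{mu_hat_n}."""
    return integral_mu_hat(samples, lambda t, x: f(t, x) * g(t, x))

def norm2_mu_hat(samples, f):
    """Empirical norm ||f||_{mu_hat_n, 2}."""
    return float(np.sqrt(inner_mu_hat(samples, f, f)))
```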
By design, μ̂_n allows us to specify a natural domain for R̂_n(F). Let {φ_j}_{j=1}^d be a set of bounded normalized functions [0,1] × X ↦ [−1,1] that are linearly independent, in the sense that ∫_{[0,1]×X} (Σ_j c_j φ_j)² dt dx = 0 if and only if c_1 = ··· = c_d = 0. (If some of the covariates are discrete-valued then dx should be interpreted as the product of a counting measure and a Lebesgue measure.) The span of the functions is

\[
\mathcal{F} = \left\{\sum_{j=1}^{d} c_j\phi_j : c_j\in\mathbb{R}\right\}.
\]

When F is equipped with ⟨·,·⟩_{μ̂_n} we obtain the following sample-dependent subspace of L²(μ̂_n):

\[
(\mathcal{F}, \langle\cdot,\cdot\rangle_{\hat{\mu}_n}). \tag{2.5}
\]

Since (F, ⟨·,·⟩_{μ̂_n}) comprises equivalence classes [F] = {F′ ∈ F : F′ = F μ̂_n-a.e.} rather than actual functions, dim(F, ⟨·,·⟩_{μ̂_n}) ≤ d. Furthermore, the pointwise values of [F] need to be defined in order for the likelihood risk (1.1) to be defined on (F, ⟨·,·⟩_{μ̂_n}). Henceforth, given a basis {φ̂_j(t,x)}_j ⊂ F for (F, ⟨·,·⟩_{μ̂_n}) (computed from applying Gram-Schmidt to φ_1(t,x), ..., φ_d(t,x), for example), we shall represent each class with the unique function of the form Σ_j c_j φ̂_j(t,x) contained within it. However, the resulting definition of R̂_n([F]) is not coordinate-free: two different bases may lead to different pointwise values for [F] (unless dim(F, ⟨·,·⟩_{μ̂_n}) = d, in which case each class contains exactly one function) and hence different values for R̂_n([F]). Nonetheless, the guarantees for our estimator in section 3 hold regardless of basis choice.

A simple example will help illustrate the discussion above. Consider the case of no covariates and only one observation, a failure at T = 1/4. Suppose F is spanned by

\[
\{\phi_1(t) = 1,\ \phi_2(t) = I_{[0,1/4)}(t)\}. \tag{2.6}
\]

Then ∫ f dμ̂_n = ∫_0^{1/4} f(t) dt, and two possible orthonormal bases for (F, ⟨·,·⟩_{μ̂_n}) are {φ̂_1(t) = 2φ_1(t)} and {ψ̂_1(t) = 2φ_2(t)}. In this case dim(F, ⟨·,·⟩_{μ̂_n}) < dim F because φ_1 and φ_2 belong to the same equivalence class in (F, ⟨·,·⟩_{μ̂_n}). Moreover, φ̂_1(t) and ψ̂_1(t) differ at the point t = 1/4 even though they also belong to the same class.
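An orthonormal basis of (F, ⟨·,·⟩_{μ̂_n}) of the kind used above can be produced by Gram-Schmidt under the empirical inner product. The sketch below reuses inner_mu_hat and norm2_mu_hat from the earlier sketch and drops numerically degenerate directions with an arbitrary tolerance, mirroring the fact that dim(F, ⟨·,·⟩_{μ̂_n}) may be smaller than d; it is an illustration under our assumptions, not the paper's implementation.

```python
def gram_schmidt_mu_hat(samples, phis, tol=1e-10):
    """Orthonormalize phi_1, ..., phi_d with respect to <.,.>_{mu_hat_n}."""
    basis = []
    for phi in phis:
        coeffs = [inner_mu_hat(samples, phi, b) for b in basis]

        def residual(t, x, phi=phi, coeffs=tuple(coeffs), basis=tuple(basis)):
            # subtract the projections onto the basis functions built so far
            return phi(t, x) - sum(c * b(t, x) for c, b in zip(coeffs, basis))

        nrm = norm2_mu_hat(samples, residual)
        if nrm > tol:  # keep only directions with non-negligible empirical norm
            basis.append(lambda t, x, r=residual, nrm=nrm: r(t, x) / nrm)
    return basis
```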
The final condition we impose is for {φ_j}_{j=1}^d to be linearly independent in L²(μ), that is, ‖Σ_j c_j φ_j‖²_{μ,2} = Σ_{ij} c_i (∫ φ_i φ_j dμ) c_j = 0 if and only if c_1 = ··· = c_d = 0. Since by construction {φ_j}_{j=1}^d is already linearly independent on [0,1] × X, the condition intuitively requires the set of all possible time-covariate trajectories to be adequately dense in [0,1] × X to intersect a sufficient amount of the support of every φ_j. This is weaker than conditions 1(ii)-1(iii) in [23], which require X(t) to have a positive joint probability density on [0,1] × X, thereby ruling out discrete and categorical covariates.

A4 The Gram matrix Σ_{ij} = ∫ φ_i φ_j dμ is positive definite.

Since F is finite-dimensional, all norms on it are equivalent, so the following is finite:

\[
\alpha_{\mathcal{F}} = 2\sup_{F\in\mathcal{F}:\|F\|_\infty=1}\frac{\|F\|_\infty}{\|F\|_{\mu,1}}
= 2\Big/\inf_{F\in\mathcal{F}:\|F\|_\infty=1}\|F\|_{\mu,1} > 1, \tag{2.7}
\]

and for all F ∈ F

\[
\|F\|_{\mu,1} \le \|F\|_{\mu,2} \le \|F\|_\infty \le \frac{\alpha_{\mathcal{F}}}{2}\|F\|_{\mu,1}. \tag{2.8}
\]

The factor of 2 in the definition (2.7) is arbitrary and can be replaced with anything greater than 1. It will be shown in Lemma 3 that with high probability there exists an empirical analogue to (2.8).

Integral representations for the likelihood risk. By design, (F, ⟨·,·⟩_{μ̂_n}) admits a reproducing kernel, which can be used to recast R̂_n(F) as a smooth convex functional on (F, ⟨·,·⟩_{μ̂_n}) with gradient (2.11). As discussed, the representation is not coordinate-free and will depend explicitly on the chosen basis {φ̂_j(t,x)}_j. Hence it is only valid for functions in the pointwise span of the basis.

Lemma 1. With respect to an orthonormal basis {φ̂_j(t,x)}_j of (F, ⟨·,·⟩_{μ̂_n}), for F, f ∈ span{φ̂_j(t,x)}_j the likelihood risk (1.1) can be written as

\[
\hat{R}_n(F) = \int (e^F - \hat{\lambda}F)\,d\hat{\mu}_n, \tag{2.9}
\]

where λ̂ ∈ (F, ⟨·,·⟩_{μ̂_n}) is the function

\[
\hat{\lambda}(t,x) = \frac{1}{n}\sum_j\left\{\sum_{i=1}^{n}\Delta_i\hat{\varphi}_j(T_i, X_i(T_i))\right\}\hat{\varphi}_j(t,x).
\]

Thus there exists ρ̂ ∈ (0,1) (depending on F and f) for which the Taylor representation

\[
\hat{R}_n(F+f) = \hat{R}_n(F) + \langle\hat{g}_F, f\rangle_{\hat{\mu}_n} + \frac{1}{2}\int e^{F+\hat{\rho}f}f^2\,d\hat{\mu}_n \tag{2.10}
\]

holds, where the gradient

\[
\hat{g}_F(t,x) = \sum_j\langle e^F, \hat{\varphi}_j\rangle_{\hat{\mu}_n}\hat{\varphi}_j(t,x) - \hat{\lambda}(t,x) \tag{2.11}
\]

of R̂_n(F) is the projection of e^F − λ̂ onto (F, ⟨·,·⟩_{μ̂_n}). Hence if ĝ_F = 0 then the infimum of R̂_n(F) over span{φ̂_j(t,x)}_j is uniquely attained at F.

Remark. To see what can go wrong when (2.9) is applied to functions outside span{φ̂_j(t,x)}_j, consider the example from (2.6). If the basis {φ̂_1(t) = 2φ_1(t)} is chosen, then φ_2(t) is not pointwise spanned by φ̂_1(t) even though φ_2 and φ̂_1/2 belong to the same equivalence class. Under (1.1) we have R̂_n(φ_2) = e/4, while ∫(e^{φ_2} − λ̂φ_2) dμ̂_n = e/4 − 1.

Remark. The infimum of R̂_n(F) is not always attainable, so the log-intensity MLE might not even exist: if f is non-positive and vanishes on the set {{T_i, X_i(T_i)} : Δ_i = 1}, then R̂_n(F + θf) = ∫(e^{F+θf} − λ̂F) dμ̂_n is decreasing in θ, so f is a direction of recession. This is however not an issue for our estimator because of early stopping.
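In coordinates, (2.11) says that the j-th coefficient of ĝ_F in the orthonormal basis is ⟨e^F, φ̂_j⟩_{μ̂_n} minus the j-th coefficient of λ̂, namely (1/n) Σ_i Δ_i φ̂_j(T_i, X_i(T_i)). A sketch, again reusing the helpers above and intended only as an illustration:

```python
import numpy as np

def gradient_coefficients(samples, basis, F):
    """Coefficients c_j of g_hat_F = sum_j c_j * phi_hat_j, per (2.11)."""
    n = len(samples)
    coeffs = []
    for b in basis:
        proj = inner_mu_hat(samples, lambda t, x: np.exp(F(t, x)), b)   # <e^F, phi_hat_j>
        lam_j = sum(s.delta * b(s.T, s.X(s.T)) for s in samples) / n    # j-th coeff of lambda_hat
        coeffs.append(proj - lam_j)
    return coeffs

def gradient_function(basis, coeffs):
    """Assemble g_hat_F(t, x) = sum_j c_j * phi_hat_j(t, x)."""
    return lambda t, x: sum(c * b(t, x) for c, b in zip(coeffs, basis))
```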
(2.12) N n 8 Furthermore the restriction of R(F) to F is coercive: 1 Λ R(F)≥ L(cid:107)F(cid:107) +Λ min{0,1−log(2Λ )}, (2.13) 2 α ∞ H H F and it attains its minimum at a unique point F∗. If F contains the underlying log-intensity function then F∗ =logλ. Remark. Coerciveness (2.13) implies that any F with expected risk R(F) less than R(0)≤1<3 is uniformly bounded: α (cid:107)F(cid:107) < F[3/2+Λ max{0,log(2Λ )−1}] ∞ ΛL H H (2.14) ≤α β F Λ where the constant 3/2+Λ max{0,log(2Λ )−1} β = H H (2.15) Λ min{1,Λ } L is by design no smaller than 1 in order to simplify subsequent analyses. The boosting procedure. In the generic Algorithm 1, three types of regularization mechanisms are employed. First, a weak learner approximation to the negative gradient −gˆ (2.11) is used as the descent direction. A few F choices that induce some degree of variable selection are: • Small (limited depth) regression trees that are correlated with −gˆ . This F provides some smoothing of the gradient function [15]. • The member of the basis {φ } most aligned with −gˆ . This is a variant j j F ofcoordinatedescent,seeforexampleChapter7.2ofSchapireandFreund [48]. • If coordinate descent is used and each φ depends on only one component j ofthecovariatevector,weobtaincomponentwiselearners(Bühlmannand Yu [11]). The resulting estimator is additive in the covariate components. To model a generic approximation to a non-zero gradient, we say that a unit vector gˆ(cid:15) ∈(F,(cid:104)·,·(cid:105) ) is an (cid:15)-gradient at F if for some 0≤(cid:15)<1, F µˆn (cid:28) (cid:29) gˆ F ,gˆ(cid:15) ≥(1−(cid:15)). (2.16) (cid:107)gˆ (cid:107) F F µˆn,2 µˆn The smaller (cid:15) is, the closer the alignment is between the gradient and the (cid:15)- gradient. In particular gˆ is the unique 0-gradient. On the other hand, very F simple descent directions will be less aligned with gˆ , resulting in a larger (cid:15). F Theothertwotypesofregularizationsarespecifiedbelowforourprocedure. These choices serve to simplify the results in section 3: 9 Algorithm 2 Boosting algorithm for hazard regression 1 Set Fˆ =0, m=0 and repeat the following steps: 0 • Determinethecurrentgradientgˆ (2.11). Ifgˆ =0thenexitloop,else compute a simple approximationFˆgˆm(cid:15) that satisFˆfimes (2.16). Fˆm • If (cid:13) (cid:13) (cid:13)(cid:13)(cid:13)Fˆm− mν+n 1gˆF(cid:15)ˆm(cid:13)(cid:13)(cid:13) <Ψn (2.19) ∞ then update Fˆm+1 ←Fˆm− mν+n1gˆF(cid:15)ˆm, m←m+1. Else exit loop. 2 Set mˆ ←m. The estimators for the log-intensity and intensity functions are respectively mˆ−1 Fˆmˆ =− (cid:88) mν+n 1gˆF(cid:15)ˆm, λˆboost =eFˆmˆ. m=0 • The number of boosting iterations mˆ is controlled by stopping the algo- rithmbeforetheuniformnormoftheestimator(cid:107)Fˆ (cid:107) reachesorexceeds mˆ ∞ Ψ =W(n1/4)→∞, (2.17) n where W(y) is the branch of the Lambert function that returns the real root of the equation zez =y for y >0. • Thestep-sizeforthem-thiterationissettoν /(m+1),whichiscontrolled n intwoways. First, itismadetodecreasewitheachiterationaccordingto theRobbins-Monro[47]conditionthatthesumofthestepsdivergeswhile the sum of squared steps converges. The specific rate 1/(m+1) above is chosen out of mere convenience. Second, the shrinkage factor ν makes n the step-sizes decay with n at rate ν2eΨn <1, ν2eΨn →0. (2.18) n n (cid:12) ThisactsasacounterbalancetoRˆ (F)’scurvature: d2 Rˆ (F +θf)(cid:12) = n dθ2 n (cid:12) (cid:82) eFf2dµˆn ≤eΨn if (cid:107)F(cid:107)∞ <Ψn and (cid:107)f(cid:107)µˆn,2 =1. 
3 Properties

Guarantees for our estimator can be derived for two scenarios of particular interest. In Proposition 5 we first consider models that are correctly specified
