Delta Theorem in the Age of High Dimensions ∗ 7 Mehmet Caner 1 0 Department of Economics 2 n Ohio State University a J January 24, 2017 0 2 ] T S Abstract . h t a We provide a new version of delta theorem, that takes into account of high dimensional m parameter estimation. We show that depending on the structure of the function, the limits of [ 1 functions of estimators have faster or slower rate of convergence than the limits of estimators. v 1 Weillustratethisviatwoexamples. First,weuseitfortestinginhighdimensions,andsecondin 1 9 estimating large portfolio risk. Our theorem works in the case of larger number of parameters, 5 0 p, than the sample size, n: p>n. . 1 0 7 1 : v i X r a ∗Department of Economics, 452 ArpsHall, TDA Columbus, OH,43210. email:[email protected] 1 1 Introduction Delta Method is one of the most widely used theorems in econometrics and statistics. It is a very simple and useful idea. It can provide limits for complicated functions of estimators as long as function is differentiable. Basically, the idea is the limit of the function of estimators can be obtained from the limit of the estimators, and with exactly the same rate of convergence. In the case of finite dimensional parameter estimation, due to derivative at the parameter value being finite, rates of convergence of both estimators and function of estimators are the same. In the case of high dimensional parameter estimation, we show that this is not the case, and the rates of convergence may change. We show that the structure of the function is the key, and depending on that functions of estimators may converge faster or slower than estimators. In this paper, we provide a new version of delta method which accounts for high dimensions, and generalizes the previous finite dimensional case. After that we illustrate our point in two examples: first by examining a linear function of estimators that is heavily used in econometrics, and second by analyzing the risk of a large portfolio of assets in finance. Section 2 provides new delta method. Section 3 has two examples. Appendix shows the proof. 2 High Dimensional Delta Theorem Let β0 = (β10, ,βp0)′ be a p 1 parameter vector with an estimator βˆ= (βˆ1, ,βˆp)′. Define a ··· × ··· function f(.), f :K Rp Rm defined at least on a neighborhood of β0, where p > m, m will be ⊂ → taken as constant for convenience, but p is allowed to increase when n increases. Furthermore the function f(.) is differentiable at β0, which means, with h = 0, 6 f(β0+h) f(β0) fd(β0)h 2 lim k − − k = o(1), h→0 h 2 k k where . 2 is the Euclidean norm for a generic vector, and fd(β0) is the m p matrix, which ijth kk × cell consists of ∂fi/∂βj evaluated at β0, for i = 1, ,m, j = 1, p. ··· ··· Before the theorem, we need the following matrix norm inequality. Take a generic matrix which is of dimension m×p. Denote the Frobenius norm for a matrix as k|Ak|2 = mi=1 pj=1a2ij. Note q P P 2 that in some of the literature such as Horn and Johnson (2013), this definition is not considered a matrix norm, due to lack of submultiplicativity. However, our results will not change regardless of matrix norm definitions, if we abide by Horn and Johnson (2013), our results can be summarized in algebraic form, rather than matrix norm format. Define a′ 1 . A= .. , a′ m where a is p 1 vector, and its transpose is a′, i = 1, ,m. Then for a generic p 1 vector x, i × i ··· × m m kAxk2 = v (a′ix)2 ≤ v kaik22kxk2 = k|Ak|2kxk2, (2.1) ui=1 ui=1 uX uX t t where the inequality is obtained by Cauchy-Schwarz inequality. Note that if we apply Horn and Johnson (2013) norm definition, this matrix norm inequality still holds, but we cannot use the matrix norm. In that case we have m 2 kAxk2 ≤ v kaik2kxk2. (2.2) ui=1 uX t Our new delta theorem is provided for high dimensional case. This generalizes Theorem 3.1 of van der Vaart (2000). Key element in our Theorem below is fd(β0) 2. We should note that this k| k| norm of matrix derivative depends on n, through p, which is the number of columns in fd(β0). Let r ,r∗ , as n , be the rate of convergence of estimators, and the functions of estimators, n n → ∞ → ∞ respectively in the theorem below. Theorem 2.1. Let a function f(β) : K Rp Rm, and differentiable at β0. Let βˆ be the ⊂ → estimators for β0, and βˆ= β0, assume we have the following result: 6 rn βˆ β0 2 = Op(1). k − k 3 a) Then if fd(β0) 2 = o(1), with fd(β0) 2 > 0, we get k| k| 6 k| k| rn∗kf(βˆ)−f(β0)k2 = Op(1), (2.3) where r r∗ = O n . (2.4) n (cid:18)k|fd(β0)k|2(cid:19) b) Then if fd(β0) 2 =o(1), we get k| k| rn f(βˆ) f(β0) 2 = op(1). (2.5) k − k Remarks. 1. Note that in part a), with r∗, we have a slower or the same rate of convergence as n in r . In part b), clearly, the function of estimators converge to zero in probability faster than the n rate of estimators themselves. This can be seen from noting that in part b), even though βˆ β0 2 = Op(1/rn), k − k for functions f(βˆ) f(β0) 2 = op(1/rn). k − k 2. Note that Horn and Johnson (2013) defines Frobenius norm only for square matrices unlike our case. If we use their approach, then our main result in part a) will be r r∗ = O n , (2.6) n mi=1kfdi(β0)k22 q P where we use (2.2) instead of (2.1) in the proof of Theorem 2.1a. Also note that fdi(β0) is the p 1 × vector, which is ∂fi(.)/∂β evaluated at β0. 3. Also see that this result (2.4) can be obtained in other matrix norms subject to the same 4 caveat in Remark 1. A simple Holder’s inequality provides Ax 1 A 1 x 1, (2.7) k k ≤ k| k| k k wherewedefinethemaximumcolumnsummatrixnorm: k|Ak|1 = max1≤j≤p mi=1|aij|,whereaij is theij thelementofAmatrix. ApplyingthisintheproofofTheorem2.1a,givePnrn βˆ β0 1 = Op(1) k − k and replacing everything with Frobenius norm for matrices with maximum column sum matrix norm, and l2 norm for vectors with l1 norm, we have r r∗ = O n . (2.8) n (cid:18)k|fd(β0)k|1(cid:19) Also part b) can be written in l1 norm as well. 4. We can also extend these results to another norm. A simple inequality provides Ax A x , (2.9) ∞ ∞ ∞ k k ≤ k| k| k k p where we define the maximum row sum matrix norm: k|Ak|∞ = max1≤i≤m j=1|aij|, where aij is the ij th element of A matrix. Applying this in the proof of Theorem 2.1aP, given rn βˆ β0 ∞ = k − k O (1) and replacing everything with Frobenius norm for matrices with maximum row sum matrix p norm, and l2 norm for vectors with l∞ norm, we have r r∗ = O n . (2.10) n (cid:18)k|fd(β0)k|∞(cid:19) Also part b) can be written in l norm as well. ∞ 5. What if we have m = p? or m > p, and m as n ? Then all our results will go → ∞ → ∞ through as well, this is clear from our proof. 3 Examples We now provide two examples that will highlight the contribution. First one is related to linear functions of estimators in part a), and the second one is related to risk of the large portfolios, and 5 part b). Example 1. Let us denote β0 as the true value of vector (p 1) of coefficients. The number of the true × nonzero coefficients are denoted by s0, and s0 > 0. A simple linear model is: yt =x′tβ0+ut, where t = 1, ,n, with u iid mean zero, finite variance error, and x is deterministic set of p t t ··· regressors for ease of analysis. The lasso estimator in a simple linear model is defined as n (y x′β)2 λ p βˆ= argminβ∈Rp t− t + βj0 , n n | | t=1 j=1 X X where λ is a positive tuning parameter, and it is established that λ = O( logp). Corollary 6.14 or n Lemma 6.10 of Buhlmann and van de Geer (2011) shows, for lasso estimaqtors βˆ, with p > n rn βˆ β0 2 = Op(1), (3.1) k − k where n 1 r = . (3.2) n logp√s0 r At this point, we will not go into detail such as what assumptions are needed to get (3.1), except to tell thatminimaladaptive restrictive eigenvalue is positive, andnoise reductionis achieved. Details can be seen in Chapter 6 in Buhlmann and van de Geer (2011). The issue is what if the researchers are interested in the asymptotics of D(βˆ β0), where − D : m p matrix. D matrix can be thought of putting restrictions on β0. We want to see whether × D(βˆ β0) has a different rate of convergence than βˆ β0. From our Theorem 2.1a, it is clear that − − fd(β0) = D. Assume that D 2 = o(1). So k| k| 6 rn∗kD(βˆ−β0)k = Op(1), 6 where r r∗ = O n . (3.3) n D 2 (cid:18)k| k| (cid:19) We know from matrix norm definition: m 2 D 2 = di 2, k| k| v k k ui=1 uX t and di 2 is the Euclidean norm for vector di which is p 1, and k k × d′ 1 . D = .. . d′ m Basically in the case of inference, this matrix and vectors show how many of β0 will be involved with restrictions. If we want to use s0 elements in each row of D to test m restrictions, then D 2 = O(√s0). Note that this corresponds to using s0 elements in β0 for testing m restrictions. k| k| So n 1 r∗ = , (3.4) n logp! s0 r (cid:18) (cid:19) which shows that even though we have fixed number of restrictions, m, using s0 of coefficients in testing will slow down the rate of convergence, rn∗ by √s0, compared with lasso estimators, rate of r . This can be seen by comparing (3.2) with (3.4). n Example 2. One of the cornerstones of the portfolio optimization is estimation of risk. If we denote the portfolio allocation vector by w (p 1) vector, and the covariance matrix of asset returns by Σ, the × risk is (w′Σw). We want to analyze risk estimation error which is (w′Σˆw)1/2 (w′Σw)1/2. Σˆ is − the sampple covariance matrix of asset returns. We could have analyzed risk estimation error with estimated weights, as in Fan etal (2015), (wˆΣˆwˆ)1/2 (wˆΣwˆ)1/2, but this extends the analysis with − more notation with the same results. A crucial step in assessing the accuracy of risk estimator is given in p.367 of Fan etal (2015), which is the term w′(Σˆ Σ)w. . Just to simplify the analysis, we will assume iid, sub-Gaussian − 7 asset returns. Also we will find the global minimum variance portfolio as in Example 3.1 of Fan etal (2015). So Σ−11 p w = , 1′Σ−11 p p where Σ is nonsingular, and 1 is the p vector of ones. Assume 0 < Eigmin(Σ−1) p ≤ Eigmax(Σ−1) < , where Eigmin(.),Eigmax(.) represents the minimal, maximal eigenvalues ∞ respectively of the matrix inside the parentheses. In Remark 3 of Theorem 3 in Caner etal (2016) w 1 = O(max√sj), (3.5) k k j where s is the number of nonzero cells in j th row of Σ−1 matrix, j = 1, ,p. Equation (3.5) j ··· represents case of growing exposure, which means we allow for extreme positions in our portfolio, since we allow s , as n . See that j → ∞ → ∞ logp Σˆ Σ = O ( ), (3.6) ∞ p k − k n r by van de Geer etal. (2014). Then clearly by (3.5)(3.6) w′(Σˆ Σ)w w 2 Σˆ Σ = O (max s logp/n). (3.7) 1 ∞ p j | − | ≤ k k k − k 1≤j≤p p This means taking β0 = w′Σw,βˆ= w′Σˆw, so m = 1 in Theorem 2.1b, r w′(Σˆ Σ)w = O (1), (3.8) n p | − | where n 1 r = . (3.9) n logpmax1≤j≤psj r But the main issue is to get risk estimation error, not the quantity in (3.8). To go in that direction see that O(max s )= w 2Eigmin(Σ) w′Σw w 2Eigmax(Σ) = O(max s ), (3.10) j 1 1 j 1≤j≤p k k ≤ | | ≤ k k 1≤j≤p 8 where we use (3.5) and Eigmax(Σ) < , Eigmin(Σ) > 0. ∞ Note that risk is f(β0) =(w′Σw)1/2, and fd(β0) = (w′Σw)−1/2 = O((maxjsj)−1/2) in Theorem 2.1b. So fd(β0) = o(1), since we allow sj . Then apply our delta theorem, Theorem 2.1b here → ∞ r [(w′Σˆw)1/2 (w′Σw)1/2] = o (1), (3.11) n p − Nowweseethatrateofconvergenceinriskestimationisfasterin(3.11),comparedto(3.8)-(3.9). REFERENCES Abadir, K.and J.R. Magnus (2005). Matrix Algebra. Cambridge University Press. Cambridge. Buhlmann, P. and S. van de Geer (2011). Statistics for High-Dimensional Data. Springer Verlag, Berlin. Caner, M., E. Ulasan, L. Callot, and O, Onder (2016). ”A relaxed approach to estimating large portfolios and gross exposure,” arXiv: 1611.07347v1. Fan, J., Y. Liao, and X. Shi (2015). ”Risks of large portfolios,” Journal of Econometrics, 186, 367-387. Horn, R.A. and C. Johnson (2013). Matrix Analysis. Second Edition. Cambridge University Press, Cambridge. Van de Geer, S., P. Buhlmann, Y. Ritov, and R. Dezeure (2014). ”On asymptotically optimal confidence regions and test for high-dimensional models,” The Annals of Statistics, 42, 1166-1202. Van der Vaart, A.W. (2000). Asymptotic Statistics. Cambridge University Press, Cambridge. 9 Appendix Proof of Theorem 2.1a. The first part of the theorem shows why the classical proof of delta theorem will not work in high dimensions. However, this is not a negative result since it will guide us through the second part which shows us the solution. Part 1. First by differentiability, and define l(.) as a vector function, via p.352 of Abadir and Magnus (2005) l(.) :D Rp Rm ⊂ → f(βˆ) f(β0) fd(β0)[βˆ β0] 2 = l(βˆ β0) 2. (3.12) k − − − k k − k and l(βˆ β0) 2 k − k = o (1), (3.13) βˆ β0 2 p k − k where we use Lemma 2.12 of van der Vaart (2000). Then with βˆ β0 2 = op(1), (3.13) implies k − k l(βˆ β0) 2 = op(1). (3.14) k − k Since we are given rn βˆ β0 2 = Op(1), by (3.13)(3.14) k − k rn l(βˆ β0) 2 = op(1). (3.15) k − k By (3.12)-(3.15) rn[f(βˆ) f(β0)] rn[fd(β0)][βˆ β0] 2 = op(1). (3.16) k − − − k But this is the same result as in regular delta method. (3.16) is mainly a simple extension of Theorem 3.1 in van der Vaart (2000) to Euclidean spaces so far. However the main caveat comes fromderivative matrix fd(β0)whichis of dimensionm p. Therateof thematrix plays a rolewhen × p as n . For example, both rn[fd(β0)][βˆ β0] and rn[f(βˆ) f(β0)] may bediverging, but → ∞ → ∞ − − rn βˆ β0 = Op(1). Hence the delta method is not that useful if our interest centers on getting k − k rates for estimators as well as functions of estimators that converge. In the fixed p case, this is not an issue, since the matrix derivative will not affect the rate of convergence at all, as long as this is bounded away from zero, and bounded from above. Note that boundedness assumptions may not 10