Testing for Neglected Nonlinearity Using Regularized Artificial Neural Networks

Tae-Hwy Lee, Zhou Xi, and Ru Zhang*
University of California, Riverside
April 2012

Abstract

The artificial neural network (ANN) test of Lee, White and Granger (LWG, 1993) uses the ability of the ANN activation functions in the hidden layer to detect neglected functional misspecification. As the estimation of the ANN model is often quite difficult, LWG suggested activating the ANN hidden units with randomly drawn activation parameters. To be robust to the random activations, a large number of activations is desirable. This leads to a situation in which regularization of the dimensionality is needed, by techniques such as principal component analysis (PCA), Lasso, Pretest, and partial least squares (PLS), among others. However, some regularization methods can lead to selection bias in testing if the dimensionality reduction is conducted by supervising the relationship between the ANN hidden layer activations of the inputs and the output variable. This paper demonstrates that while supervised regularization methods such as Lasso, Pretest, and PLS may be useful for forecasting, they should not be used for testing, because the supervised regularization creates the post-sample inference or post-selection inference (PoSI) problem. Our Monte Carlo simulation shows that the PoSI problem is especially severe with PLS and Pretest, while it seems relatively mild or even negligible with Lasso. This paper also demonstrates that the use of unsupervised regularization does not lead to the PoSI problem. LWG (1993) suggested a regularization by principal components, which is an unsupervised regularization. While supervised regularizations may be useful in forecasting, regularization should not be supervised in inference.

Keywords: Randomized ANN activations, Dimension reduction, Supervised regularization, Unsupervised regularization, PCA, Lasso, PLS, Pretest, PoSI problem.

JEL Classification: C12, C45

* Department of Economics, University of California, Riverside, CA 92521, USA. E-mails: [email protected], [email protected], [email protected].

1 Introduction

In this paper we explore the issues in testing for functional form, especially for neglected nonlinearity in parametric linear models. Many papers have appeared in the recent literature which deal with the issues of how to carry out various specification tests in parametric regression models. To construct the tests, various methods are used to estimate the alternative models. For example, Fan and Li (1996), Li and Wang (1998), Zheng (1996), and Bradley and McClelland (1996) use local constant kernel regression; Hjellvik, Yao and Tjøstheim (1998) and Tjøstheim (1999) use local polynomial kernel regression; Cai, Fan and Yao (2000) and Matsuda (1999) use nonparametric functional coefficient models; Poggi and Portier (1997) use a functional autoregressive model; White (1989), Lee, White and Granger (1993), Teräsvirta, Lin and Granger (1993), Granger and Teräsvirta (1993), Teräsvirta (1996), and Corradi and Swanson (2002) use neural network models; Eubank and Spiegelman (1990) use spline regression; Hong and White (1995) use series regression; Stengos and Sun (1998) use wavelet methods; and Hamilton (2001) uses a parametric flexible regression model.

There are also many papers which compare different approaches to testing for linearity.
For example, Lee, White, and Granger (1993), Teräsvirta, Lin and Granger (1993), Teräsvirta (1996), and Lee (2001) examine the neural network test and many other tests. Dahl (2002) and Dahl and González-Rivera (2000) study Hamilton's (2000) test and compare it with various tests including the neural network test. Blake and Kapetanios (1999, 2000) extend the neural network test by using a radial basis function as the neural network activation function instead of the typical logistic function used in Lee, White and Granger (1993).¹ Lee and Ullah (2001, 2003) examine the tests of Li and Wang (1998), Zheng (1996), Ullah (1985), Cai, Fan and Yao (2000), Härdle and Mammen (1993), and Aït-Sahalia, Bickel, and Stoker (2001). Fan and Li (2000) compare the tests of Li and Wang (1998), Zheng (1996), and Bierens (1990). Whang (2000) generalizes the Kolmogorov-Smirnov and Cramér-von Mises tests to the regression framework and compares them with the tests of Härdle and Mammen (1993) and Bierens and Ploberger (1997). Hjellvik and Tjøstheim (1995, 1996) propose tests based on nonparametric estimates of the conditional mean and variance and compare them with a number of tests such as the bispectrum test and the BDS test.

¹ For radial basis functions, see, e.g., Campbell, Lo, and MacKinlay (1997, p. 517).

This paper further investigates the artificial neural network (ANN) test. The ANN test is a conditional moment test whose null hypothesis consists of conditional moment conditions that hold if the linear model is correctly specified for the conditional mean. The ANN test differs from other tests in the choice of the "test function", which is taken to be the ANN's hidden layer activations. These activations can be checked for their correlation with the residuals from the linear regression model. The advantage of using an ANN model to test for nonlinearity is that the ANN model inherits the flexibility of a universal approximator of unknown functional form. Hornik, Stinchcombe and White (1989) show that a neural network is a nonlinear flexible functional form capable of approximating any Borel measurable function to any desired level of accuracy provided sufficiently many hidden units are available.

We consider an augmented single hidden layer feedforward neural network model in which the network output $y_t$ is determined given input $x_t$ as

$$y_t = x_t'\alpha + \sum_{j=1}^{q} \beta_j \Psi(x_t'\gamma_j) + u_t, \qquad (1)$$

where $t = 1,\ldots,T$, $x_t = (x_{1,t},\ldots,x_{N,t})'$, $\theta = (\alpha', \beta', \gamma_1',\ldots,\gamma_q')'$, $\alpha = (\alpha_1,\ldots,\alpha_N)'$, $\beta = (\beta_1,\ldots,\beta_q)'$, $\gamma_j = (\gamma_{j,1},\ldots,\gamma_{j,N})'$ for $j = 1,\ldots,q$, and $\Psi(\cdot)$ is an activation function. An example of the activation function is the logistic function $\Psi(z) = (1+\exp(-z))^{-1}$. Here $\alpha$ is a conformable column vector of connection strengths from the input layer to the output layer; $\gamma_j$ is a conformable column vector of connection strengths from the input layer to hidden unit $j$, $j = 1,\ldots,q$; $\beta_j$ is a (scalar) connection strength from hidden unit $j$ to the output unit, $j = 1,\ldots,q$; and $\Psi$ is a squashing function (e.g., the logistic squasher) or a radial basis function. Input units $x$ send signals to the intermediate hidden units; each hidden unit then produces an activation $\Psi$ that sends a signal toward the output unit. The integer $q$ denotes the number of hidden units added to the affine (linear) network.
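To make the architecture in (1) concrete, here is a minimal sketch in Python with NumPy (the paper itself contains no code; the function names and parameter values below are illustrative assumptions) that computes the network output for given $\alpha$, $\beta$, and randomly drawn $\gamma_j$'s, using the logistic squasher.

```python
import numpy as np

def logistic(z):
    # Logistic squasher used as the hidden-unit activation Psi(z)
    return 1.0 / (1.0 + np.exp(-z))

def ann_output(X, alpha, beta, gamma):
    """Conditional mean of the augmented single hidden layer network, eq. (1).

    X     : (T, N) matrix of inputs x_t
    alpha : (N,)   input-to-output (affine network) coefficients
    beta  : (q,)   hidden-to-output connection strengths
    gamma : (q, N) input-to-hidden connection strengths, one row per hidden unit
    """
    linear_part = X @ alpha                  # x_t' alpha
    activations = logistic(X @ gamma.T)      # (T, q) matrix of Psi(x_t' gamma_j)
    return linear_part + activations @ beta  # y_t without the error term u_t

# Illustrative use with arbitrary values (not taken from the paper)
rng = np.random.default_rng(0)
T, N, q = 200, 3, 10
X = rng.normal(size=(T, N))
alpha, beta = rng.normal(size=N), rng.normal(size=q)
gamma = rng.uniform(-2, 2, size=(q, N))      # randomly drawn activation parameters
y_mean = ann_output(X, alpha, beta, gamma)
```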
When $q = 0$, we have a two layer affine network $y_t = x_t'\alpha + u_t$.

It is well known that ANN models are generally hard to estimate and suffer from possibly large estimation errors, which can adversely affect their ability as a universal approximator. To alleviate the estimation errors of an ANN model, it is useful to note that, for given values of the $\gamma_j$'s, the ANN is linear in $x_t$ and in the activation functions $\Psi$, and therefore $(\alpha', \beta')'$ can be estimated by linear regression once $\gamma_1,\ldots,\gamma_q$ are estimated or given. As suggested in Lee, White and Granger (LWG 1993), a set of $\gamma$'s can be randomly generated. In this paper, we will generate a large set of $\gamma$'s such that $\sum_{j=1}^{q} \beta_j \Psi(x_t'\gamma_j)$ can capture the maximal nonlinear structure. The LWG statistic is designed to detect neglected nonlinearity in the linear model by checking for correlation between the residual from a linear model and the additional hidden activation functions with randomly generated $\gamma$'s. The additional hidden activation functions are hidden (or phantom) because they do not exist under the null hypothesis. The $\gamma$'s are randomly generated in testing because they are not identified under the null hypothesis. The set of randomly selected $\gamma$'s should be large enough so that it can be dense and make the ANN a universal approximator.

While the architecture of the ANN model makes it a universal approximator, it involves a very large number of parameters. Kock and Teräsvirta (2011) consider regularizing the complexity of an ANN model and demonstrate that regularization of the large dimension is crucial in using ANN models for out-of-sample forecasting. This motivates us to consider regularizing the ANN for testing for neglected nonlinearity. In fact, LWG (1993) uses an (unsupervised) regularization method, namely principal component analysis, for the randomly activated test functions. Kock and Teräsvirta (2011) consider two (supervised) regularization approaches. They insightfully notice that supervised regularizations will result in size distortion in inference, and they use these approaches only for forecasting.

One supervised regularization approach considered by Kock and Teräsvirta (2011) to select a small $q^*$ from a large number $q$ of $\gamma$'s is the simple-to-general algorithm, e.g., the QuickNet algorithm of White (2006), which adds one $\gamma_j$ and one activation function at a time to the ANN. QuickNet expands from 0 activations to $q^*$ activations, stopping when an additional hidden unit activation is not found to improve the network capability. The second supervised regularization approach considered by Kock and Teräsvirta (2011) is the general-to-simple approach. This approach, from a variable-selection perspective, reduces the number of activations from an initial large number $q$ (say, 1000) to a smaller number $q^*$ by penalizing the complexity of the ANN model. The penalized regression methods include the smoothly clipped absolute deviation penalty (SCAD) (Fan and Li 2001), the adaptive Lasso (Zou 2006), the adaptive elastic net (Zou and Zhang 2009), and the bridge estimator (Huang, Horowitz and Ma 2008), among others.
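As a rough illustration of the general-to-simple idea, and not the exact procedures of Kock and Teräsvirta (2011), the sketch below draws a large set of random activations and lets a cross-validated Lasso keep a small subset; scikit-learn is assumed to be available, and the SCAD, adaptive Lasso, or bridge penalties mentioned above would replace the plain Lasso step.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_select_activations(X, y, q=1000, support=(-2.0, 2.0), seed=0):
    """General-to-simple sketch: start from q randomly activated hidden units
    and keep only those to which the Lasso assigns a nonzero coefficient."""
    rng = np.random.default_rng(seed)
    T, N = X.shape
    gamma = rng.uniform(*support, size=(q, N))    # random activation parameters
    H = 1.0 / (1.0 + np.exp(-(X @ gamma.T)))      # (T, q) activations Psi(x_t' gamma_j)
    # For simplicity the linear regressors are penalized along with the
    # activations; in practice one may leave x_t unpenalized.
    lasso = LassoCV(cv=5).fit(np.hstack([X, H]), y)
    keep = np.flatnonzero(lasso.coef_[N:])        # indices of surviving hidden units
    return gamma[keep], H[:, keep], keep
```

Because the selection is driven by the fit to $y$, this is a supervised regularization; as the paper argues, it is useful for forecasting but problematic for testing.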
In the case where $q$ is larger than the degrees of freedom, the marginal bridge estimator (Huang, Horowitz and Ma 2008) or sure independence screening (SIS) (Fan and Lv 2008) may be used to reduce $q$ below the degrees of freedom before these estimation methods are applied.

The third approach is to follow LWG (1993) and compute the $q^*$ principal components of the $q$ additional hidden activation functions. Since the activation functions using randomly generated $\gamma$'s may be collinear with each other and with $x_t$, LWG used principal components of the $q$ additional hidden activation functions. Unlike the above two supervised approaches, the principal components are not supervised for the output $y$.

The purpose of this paper is to examine the effect of various regularizations on the ANN test for neglected nonlinearity when the ANN is activated with a large number of random activation parameters. We learn two points. First, when we consider the Lasso, the partial least squares (PLS) method, the Pretest method, and a method combining Lasso with principal components, these supervised regularization methods bring size distortion and the ANN test suffers from the post-sample inference or post-selection inference (PoSI) problem.² Second, when we use principal component analysis (PCA) as in LWG (1993), this unsupervised regularization of the dimension reduction does not bring the PoSI problem, works well even for a large $q$, and the asymptotic $\chi^2(q^*)$ distribution approximates the finite sample distribution of the ANN test statistic well. To sum up, while supervised regularizations are useful in forecasting, as studied by Bai and Ng (2008), Bair, Hastie, Paul and Tibshirani (2006), Inoue and Kilian (2008), Huang and Lee (2010), Hillebrand, Huang, Lee and Li (2011), Kock and Teräsvirta (2011), and Kock (2011), this paper shows that regularization should not be supervised in inference. Our Monte Carlo simulation shows that the PoSI problem is especially severe with PLS and Pretest, while it seems relatively mild or even negligible with Lasso. This paper also demonstrates that the use of unsupervised regularization by principal components does not lead to the PoSI problem.

² See Pötscher and Leeb (2009) and Berk, Brown, Buja, Zhang and Zhao (2011).

The plan of the paper is as follows. In Section 2 we review the ANN test. Section 3 introduces various regularizations of two types, unsupervised and supervised. Section 4 presents the simulation results which demonstrate the PoSI problem of the supervised methods. Section 5 concludes.

2 Testing for Neglected Nonlinearity Using ANN

Consider $Z_t = (y_t \; x_t')'$, where $y_t$ is a scalar and $x_t$ may contain a constant and lagged values of $y_t$. Consider the regression model

$$y_t = m(x_t) + \varepsilon_t, \qquad (2)$$

where $m(x_t) \equiv E(y_t \mid x_t)$ is the true but unknown regression function and $\varepsilon_t$ is the error term such that $E(\varepsilon_t \mid x_t) = 0$ by construction. To test for a parametric model $g(x_t;\theta)$ we consider

$$H_0: m(x_t) = g(x_t;\theta^*) \text{ almost everywhere (a.e.) for some } \theta^*, \qquad (3)$$

$$H_1: m(x_t) \neq g(x_t;\theta) \text{ on a set with positive measure for all } \theta. \qquad (4)$$

In particular, if we are to test for neglected nonlinearity in the regression model, set $g(x_t;\theta) = x_t'\alpha$, $\alpha \subset \theta$. Then under $H_0$, the process $\{y_t\}$ is linear in mean conditional on $x_t$, i.e.,
$$H_0: m(x_t) = x_t'\alpha^* \text{ a.e. for some } \alpha^*. \qquad (5)$$

The alternative of interest is the negation of the null hypothesis, that is,

$$H_1: m(x_t) \neq x_t'\alpha \text{ on a set with positive measure for all } \alpha. \qquad (6)$$

When the alternative is true, a linear model is said to suffer from "neglected nonlinearity" (Lee, White, and Granger 1993).

If a linear model is capable of an exact representation of the unknown function $m(x_t)$, then there exists a vector $\alpha^*$ such that (5) holds, which implies

$$E(\varepsilon_t^* \mid x_t) = 0 \text{ a.e.}, \qquad (7)$$

where $\varepsilon_t^* = y_t - x_t'\alpha^*$. By the law of iterated expectations, $\varepsilon_t^*$ is uncorrelated with any measurable function of $x_t$, say $h(x_t)$. That is,

$$E[h(x_t)\,\varepsilon_t^*] = 0. \qquad (8)$$

Depending on how we choose the "test function" $h(\cdot)$, various specification tests may be obtained. The specification tests based on these moment conditions, the so-called conditional moment tests, have been studied by Newey (1985), Tauchen (1985), White (1987, 1994), Bierens (1982, 1990), LWG (1993), Bierens and Ploberger (1997), and Stinchcombe and White (1998), among others. The ANN test exploits (8) with the test function $h(\cdot)$ chosen as the neural network hidden unit activation functions.

LWG (1993) considered the test of "linearity in conditional mean" using the ANN model. To test whether the process $y_t$ is linear in mean conditional on $x_t$, they used the following null and alternative hypotheses:

$$H_0: \Pr[E(y_t \mid x_t) = x_t'\alpha^*] = 1 \text{ for some } \alpha^*,$$

$$H_1: \Pr[E(y_t \mid x_t) = x_t'\alpha] < 1 \text{ for all } \alpha.$$

The procedure to construct the LWG test statistic is as follows. Under the null hypothesis that $y_t$ is linear in conditional mean, we first estimate a linear model of $y_t$ on $x_t$; if any nonlinearity is neglected in the OLS regression, it will be captured by the residual term $\hat{u}_t$. Since the ANN model inherits the flexibility of a universal approximator of unknown functional form, we can apply an ANN function to approximate any possible type of nonlinearity in the residual term $\hat{u}_t$.

The neural network test is based on a test function $h(x_t)$ chosen as the activations $\Psi(x_t'\gamma_j)$, $j = 1,\ldots,q$, of "phantom" hidden units, where the $\gamma_j$ are randomly generated column vectors independent of $x_t$. The $\gamma_j$'s are not identified under the null hypothesis of linearity, cf. Davies (1977, 1987), Andrews and Ploberger (1994), and Hansen (1996). That is,

$$E[\Psi(x_t'\gamma_j)\,\varepsilon_t^*] = 0, \quad j = 1,\ldots,q, \qquad (9)$$

under $H_0$, so that

$$E(\Psi_t\,\varepsilon_t^*) = 0, \qquad (10)$$

where

$$\Psi_t = (\Psi(x_t'\gamma_1),\ldots,\Psi(x_t'\gamma_q))' \qquad (11)$$

is the phantom hidden unit activation vector. Evidence of correlation of $\varepsilon_t^*$ with $\Psi_t$ is evidence against the null hypothesis that $y_t$ is linear in mean. If correlation exists, augmenting the linear network by including an additional hidden unit with activations $\Psi(x_t'\gamma_j)$ would permit an improvement in network performance. Thus the tests are based on the sample correlation of the affine network errors with the phantom hidden unit activations,

$$n^{-1}\sum_{t=1}^{n}\Psi_t\hat{\varepsilon}_t = n^{-1}\sum_{t=1}^{n}\Psi_t\,(y_t - x_t'\hat{\alpha}), \qquad (12)$$

where $\hat{\varepsilon}_t = y_t - x_t'\hat{\alpha}$ and $\hat{\alpha}$ is estimated by OLS.
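As a minimal sketch of the moment vector in (12), assuming NumPy, a logistic activation, and uniformly drawn $\gamma$'s with an illustrative support:

```python
import numpy as np

def phantom_moments(X, y, q=20, support=(-2.0, 2.0), seed=0):
    """Sample moment in eq. (12): average product of the phantom hidden unit
    activations with the OLS residuals of the affine (linear) network."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    alpha_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS of y on x
    resid = y - X @ alpha_hat                            # eps_hat_t
    gamma = rng.uniform(*support, size=(q, N))           # random gamma_j's
    Psi = 1.0 / (1.0 + np.exp(-(X @ gamma.T)))           # (n, q) activations
    m_hat = Psi.T @ resid / n                            # eq. (12)
    return m_hat, Psi, resid
```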
Under suitable regularity conditions it follows from the central limit theorem that $n^{-1/2}\sum_{t=1}^{n}\Psi_t\hat{\varepsilon}_t \overset{d}{\rightarrow} N(0, W^*)$ as $n \rightarrow \infty$, and if one has a consistent estimator of its asymptotic covariance matrix, say $\hat{W}_n$, then an asymptotic chi-square statistic can be formed as

$$\Bigl(n^{-1/2}\sum_{t=1}^{n}\Psi_t\hat{\varepsilon}_t\Bigr)'\,\hat{W}_n^{-1}\,\Bigl(n^{-1/2}\sum_{t=1}^{n}\Psi_t\hat{\varepsilon}_t\Bigr) \overset{d}{\rightarrow} \chi^2(q). \qquad (13)$$

To implement the test, construct the following auxiliary regression:

$$\hat{u}_t = x_t'\alpha + \sum_{j=1}^{q}\beta_j\Psi(x_t'\gamma_j) + v_t,$$

where $t = 1,\ldots,T$, $x_t = (x_{1,t},\ldots,x_{N,t})'$, $\theta = (\alpha',\beta',\gamma_1',\ldots,\gamma_q')'$, $\alpha = (\alpha_1,\ldots,\alpha_N)'$, $\beta = (\beta_1,\ldots,\beta_q)'$, $\gamma_j = (\gamma_{j,1},\ldots,\gamma_{j,N})'$ for $j = 1,\ldots,q$, and $\Psi(\cdot)$ is an activation function. LWG chose the logistic function $\Psi(z) = (1+\exp(-z))^{-1}$ as the activation function. If there is nonlinearity remaining in the residual, we expect the goodness of fit of the auxiliary regression to be high. However, one problem in estimating the auxiliary regression is that, when $q$ is large, there may be multicollinearity between the $\Psi(x_t'\gamma_j)$ and $x_t$ and among the $\Psi(x_t'\gamma_j)$ themselves. LWG suggested choosing $q^*$ principal components of the $q$ activation functions $\Psi(x_t'\gamma_j)$, with $q^* < q$, and then using these $q^*$ principal components to run the auxiliary regression. Under the null hypothesis that the sequence $y_t$ is linear conditional on $x_t$, the goodness of fit of the auxiliary regression will be low. LWG (1993) constructed an LM-type test statistic which has an asymptotic $\chi^2(q^*)$ distribution under the null hypothesis. In their simulations, LWG chose $q$ equal to 10 or 20 and $q^*$ equal to 2 or 3 in different data generating processes (DGP), with sample sizes of 50, 100, or 200. Moreover, they dropped the first principal component of $\Psi_t$ to avoid the multicollinearity problem. In this paper, we have tried the ANN test both with and without dropping the first principal component; the results do not change much. Thus we keep the original LWG method of dropping the first principal component for the ANN test in this paper.

In practice, we need to generate the $\gamma$'s carefully so that $\Psi(x_t'\gamma_j)$ stays within a suitable range. If the $\gamma$'s are chosen to be too small, then the activation functions $\Psi(x_t'\gamma_j)$ are approximately linear in $x_t$; we want to avoid this situation since they then cannot capture much nonlinearity. If the $\gamma$'s are too large, the activation functions $\Psi(x_t'\gamma_j)$ take values close to their maximum or minimum values, and we want to avoid this situation as well. In our study, for different $x$'s we generate the $\gamma$'s from uniform distributions with different supports so that the activation functions are neither too small nor too large.
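Putting the steps of this section together, here is a hedged sketch of the PCA-regularized ANN test, implemented in the standard LM form $nR^2$ from the auxiliary regression of the OLS residuals on $x_t$ and $q^*$ principal components of the activations (dropping the first component); the use of SciPy, the $\gamma$ supports, and these implementation details are assumptions of this sketch rather than specifics taken from LWG (1993).

```python
import numpy as np
from scipy import stats

def ann_pca_lm_test(X, y, q=20, q_star=2, support=(-2.0, 2.0), seed=0):
    """PCA-regularized ANN test sketch: LM statistic n * R^2 from the
    auxiliary regression of OLS residuals on x_t and q_star principal
    components of the phantom activations, compared with chi2(q_star).
    Assumes x_t contains a constant so the residuals have mean zero."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    # Step 1: OLS of y on x and its residuals u_hat
    alpha_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    u_hat = y - X @ alpha_hat
    # Step 2: q randomly activated phantom hidden units
    gamma = rng.uniform(*support, size=(q, N))
    Psi = 1.0 / (1.0 + np.exp(-(X @ gamma.T)))
    # Step 3: unsupervised regularization by PCA, dropping the first component
    Psi_c = Psi - Psi.mean(axis=0)
    _, _, Vt = np.linalg.svd(Psi_c, full_matrices=False)
    pcs = Psi_c @ Vt[1:1 + q_star].T              # components 2, ..., q_star + 1
    # Step 4: auxiliary regression of u_hat on [x, pcs]; LM statistic n * R^2
    Z = np.hstack([X, pcs])
    coef, *_ = np.linalg.lstsq(Z, u_hat, rcond=None)
    v_hat = u_hat - Z @ coef
    r2 = 1.0 - (v_hat @ v_hat) / (u_hat @ u_hat)
    lm = n * r2
    return lm, stats.chi2.sf(lm, df=q_star)       # statistic and p-value
```

Here the dimension reduction never looks at $y_t$ or $\hat{u}_t$, which is why, as argued above, the $\chi^2(q^*)$ approximation remains reliable.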
3 Regularizing the ANN Test

As discussed above, LWG (1993) regularized the large number of network activation functions using principal components in order to avoid a possible collinearity problem: $q^* < q$ principal components are used out of the $q$ activations. We note that the principal components are the linear combinations of the activations with the largest variance, yet they may not necessarily be the ones that best explain the residuals from the OLS regression, $\hat{u}_t$. In other words, these principal components are not "supervised" for $y_t$ and thus not for $\hat{u}_t$. The regularization may instead be supervised, so that activations uncorrelated with $\hat{u}_t$ are dropped and activations correlated with $\hat{u}_t$ are selected, in order to increase the power of the test. Such regularization methods include the Lasso method, the PLS method, the Pretest method, and the PCA-first-and-then-Lasso method, among others. We first review the PCA method in the next subsection, and then the other regularization methods in the following subsections.
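Before turning to those subsections, the following sketch illustrates what a simple supervised (Pretest-style) screen could look like; it is meant only to show why supervision ties the selected activations to $\hat{u}_t$, which is the source of the PoSI problem emphasized in the Introduction, and it is not the paper's exact Pretest implementation.

```python
import numpy as np

def pretest_screen(Psi, u_hat, t_crit=1.96):
    """Illustrative supervised screen: keep the activations whose slope in a
    univariate regression of the OLS residual u_hat on the (demeaned)
    activation has |t-statistic| above t_crit. Because the selection depends
    on u_hat, inference that ignores this step is distorted (PoSI problem)."""
    n, q = Psi.shape
    keep = []
    for j in range(q):
        z = Psi[:, j] - Psi[:, j].mean()
        b = (z @ u_hat) / (z @ z)                      # univariate OLS slope
        resid = u_hat - b * z
        se = np.sqrt((resid @ resid) / (n - 2) / (z @ z))
        if abs(b / se) > t_crit:                       # keep 'significant' units
            keep.append(j)
    return np.array(keep, dtype=int)
```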
