
Regression with Partially Observed Ranks on a Covariate: Distribution-Guided Scores for Ranks

Yuneung Kim, Johan Lim, Young-Geun Choi, Sujung Choi, and Do Hwan Park∗

Abstract

This work is motivated by a hand-collected data set from one of the largest Internet portals in Korea. This data set records the top 30 most frequently discussed stocks on its on-line message board. The frequencies are considered to measure the attention paid by investors to individual stocks. The empirical goal of the data analysis is to investigate the effect of this attention on trading behavior. For this purpose, we regress the (next-day) returns on the (partially) observed ranks of frequencies. In the regression, the ranks are transformed into scores, for which purpose the identity or linear scores are commonly used. In this paper, we propose a new class of scores (a score function) that is based on the moments of order statistics of a pre-decided random variable. The new score function, denoted by D-rank, is shown to be asymptotically optimal in maximizing the correlation between the response and score when the pre-decided random variable and the true covariate are in the same location-scale family. In addition, the least-squares estimator using the D-rank consistently estimates the true correlation between the response and the covariate, and asymptotically approaches the normal distribution. We additionally propose a procedure for diagnosing a given score function (equivalently, the pre-decided random variable $Z$) and selecting one that is better suited to the data. We numerically demonstrate the advantage of using a correctly specified score function over the identity scores (or other misspecified scores) in estimating the correlation coefficient. Finally, we apply our proposal to test the effects of investors' attention on their returns using the motivating data set.
Keywords: Concomitant variable; investors' attention; linear regression; moments of order statistics; optimal scaling; partially observed ranks

∗ Yuneung Kim, Johan Lim and Young-Geun Choi are with the Department of Statistics, Seoul National University, Seoul, 151-747, Korea. Sujung Choi is with the School of Business Administration, Soongsil University, Seoul, 156-743, Korea. Do Hwan Park is with the Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, MD, 21250, USA. All correspondence should be addressed to Johan Lim (E-mail: [email protected]).

1 Introduction

This paper is motivated by a hand-collected data set from Daum.net, the second-largest Internet portal in Korea. The Daum.net portal offers an on-line stock message board where investors can freely discuss specific stocks in which they might be interested. The portal also reports, on a daily basis, a ranked list of the top 30 stocks most frequently discussed by users. The data set was collected by the authors during the 537 trading days from October 4th, 2010, to November 23rd, 2012. Along with the rank data, we also collected financial data on individual companies from FnGuide (http://www.fnguide.com). These additional data include stock-day trading volumes classified in terms of different types of investors, stock prices, stock returns, and so on.

The purpose of analyzing the collected data is to investigate the shifts in stock returns caused by variations in investor attention. In finance, researchers are often interested in determining the motivations that drive buying and selling decisions in stock markets. It is commonly assumed that investors efficiently process relevant information in a timely manner, but in reality it is nearly impossible to be efficient because of information overload. In particular, individual investors are often less sophisticated than institutional investors and have a limited ability to process all relevant information.
For this reason, individual investors may pay attention only to a limited amount of information, perhaps that which is relatively easy to access. The phenomenon of limited attention is a well-documented cognitive bias in the psychological literature (Kahneman, 1973; Camerer, 2003). This phenomenon affects the information-processing capacities of investors and thus may affect asset prices on the financial market. To empirically prove the effect of investor attention on stock returns, we regress the observed stock returns on the partially observed ranks.

Regression on a (partially observed) rank covariate has not previously been extensively studied in the literature. A procedure that is commonly used in practice to address rank covariates is to (i) regroup the ranks into only a few groups (if the number of ranks is high) and (ii) treat the regrouped ranks as an ordinal categorical variable. Ordered categorical variables frequently arise in various applications and have been studied extensively in the literature. Score-based analysis is most commonly used for this purpose; see Hájek (1968), Hora and Conover (1984), Kimeldorf et al. (1992), Zheng (2008), Gertheiss (2014) and the references therein. Thus, this typical two-step procedure for addressing a rank covariate is equivalent to defining a score function for the ranks. However, as in the case of ordinal categorical variables, such a score-based approach suffers from an inherent drawback related to the choice of the score function; different choices of scores may lead to conflicting conclusions in the analysis (Graubard and Korn, 1987; Ivanova and Berger, 2001; Senn, 2007).
The recommendations in the literature for selecting the score function are (i) to choose meaningful scores for the ordinal categorical variable based on domain knowledge of the data, (ii) to use equally spaced scores if scientifically plausible scores are not available (see Graubard and Korn (1987)), and (iii) to find optimal-scaling transformed scores that maximize the correlation with the responses while preserving the assumed characteristics of the ordinal values (Linting et al., 2007; Costantini et al., 2010; de Leeuw and Mair, 2009; Mair and de Leeuw, 2010; Jacoby, 2016).

In this paper, we seek to provide an efficient tool for approach (i) described above, for the case in which some qualitative knowledge is available regarding the ranks or the ranking variable (the variable that is ranked). More specifically, we propose a new set of score functions, denoted by D-rank, and study their use in linear regression. The proposed score function is based on the moments of order statistics (MOS) of a pre-decided random variable $Z$. This score function has several interesting properties related to the regression model if the pre-decided random variable is correctly specified, as listed below. Here, correct specification means that $Z$ is within the same location-scale family as the true (unobserved) covariate $X$. First, the D-rank is asymptotically optimal in the sense that it maximizes the correlation between the response and score if the distribution of the D-rank is correctly specified. Second, the least-squares estimator using the D-rank consistently estimates the true correlation between the response and the covariate and asymptotically approaches the normal distribution. Finally, the residuals of the fitted regression allow us to diagnose the given score function (equivalently, the pre-decided random variable $Z$) and provide a tool for selecting a score function that is better suited to the data.

The remainder of this paper is organized as follows.
In Section 2, we study the properties of the proposed D-rank. In this section, we show that the proposed D-rank is asymptotically optimal in maximizing the correlation between the response and score. We also demonstrate the asymptotic equivalence between the proposed score function and the quantile function; the quantile function may provide a better illustration of the qualitative features of the score function. In Section 3, we apply the score function to estimate the regression coefficient of the linear model or, more precisely, to estimate the correlation coefficient between the response and the scoring variable $X$. We prove that the least-squares estimator using the D-rank consistently estimates the correlation coefficient and is asymptotically normally distributed. In addition, we discuss the procedure for selecting an appropriate score function using the residuals. In Section 4, we numerically demonstrate that using the correctly specified score function significantly reduces the mean square error in the estimation of the correlation coefficient. In Section 5, we analyze the motivating data set to investigate the existence of the attention effect. Finally, in Section 6, we briefly summarize the paper and discuss the application of the proposed scores to regression using other auxiliary covariates.

2 Distribution-Guided Scores for Ranks (D-rank)

We consider a simple regression model in which only partial ranks of a covariate are observed. Specifically, suppose that $\{(Y_i, X_i),\ i = 1,2,\ldots,n\}$ is the complete set of observations, where $Y_i$ is the variable of primary interest and $X_i$ is the covariate related to $Y_i$. For example, in our rank data from Daum.net, for $i = 1,2,\ldots,n$, $Y_i$ is a relevant outcome such as earning rate or trading volume, $X_i$ is the "unobserved" investors' attention on the $i$th company measured by the frequency of on-line discussions, and $R_i$ is the "observed" rank of $X_i$ among $X_1, X_2, \ldots, X_n$.
We make certain assumptions regarding the distributions of $X_i$ and $Y_i$. We assume that the linear model of the relationship between $X_i$ and $Y_i$ is
$$Y_i = \mu_Y + \rho\,\sigma_Y\,\frac{X_i - \mu_X}{\sigma_X} + \epsilon_i, \qquad (1)$$
where the $\epsilon_i$'s are IID values from a distribution with mean 0 and variance $\sigma_\epsilon^2$. The objective of this paper is the estimation and inference of $\rho = \mathrm{corr}(Y, X)$ (or the regression coefficient between $Y$ and $X$) based on the observed data $\{(Y_i, R_i),\ i = 1,2,\ldots,n\}$. To do so, we aim to define a good score function $S(r)$ for the observed rank $r$ and consider the regression of $Y_{[r:n]}$ on $S(r)$, where $Y_{[r:n]}$ is the response $Y_i$ for $R_i = r$.

The D-rank we propose in this paper is the set of MOS of a pre-decided random variable $Z$, which we assume is in the same location-scale family as the true covariate $X$. To be specific, suppose that $Z_1, Z_2, \ldots, Z_n$ are independent and identically distributed (IID) copies of the random variable $Z$ and that $Z_{(r:n)}$ is the corresponding $r$th order statistic for $r = 1,2,\ldots,n$. The D-rank defines the score of the rank $r$ as $S_n(r) = \alpha_{(r:n)} := E\big(Z_{(r:n)}\big)$ for $r = 1,2,\ldots,n$.

We first show that the D-rank asymptotically maximizes the sample correlation between $Y_{[r:n]}$ and $\alpha_{(r:n)}$, $r = 1,2,\ldots,n$, among all increasing functions $S_n(r): \{1,2,\ldots,n\} \to \mathbb{R}$. Let $\bar{S}_n(r)$ and $\bar{\alpha}_{(r:n)}$ be the standardized scores (of $S_n(r)$ and $\alpha_{(r:n)}$) such that $\sum_{r=1}^n \bar{S}_n(r) = \sum_{r=1}^n \bar{\alpha}_{(r:n)} = 0$ and $\sum_{r=1}^n \bar{S}_n^2(r) = \sum_{r=1}^n \bar{\alpha}_{(r:n)}^2 = 1$. Let $\mathcal{S}_n$ and $\bar{\mathcal{S}}_n$ be the collections of all increasing functions $S_n(r)$ and $\bar{S}_n(r)$, respectively.

Theorem 1. Under the linear model (1), if $Z$ is in the location-scale family of $X$, the D-rank maximizes, among $\bar{S}_n(r) \in \bar{\mathcal{S}}_n$, the limit of the sample correlation between $Y_{[r:n]}$ and $\bar{S}_n(r)$:
$$\lim_{n \to \infty}\ \frac{1}{n\,\hat{\sigma}_Y} \sum_{r=1}^{n} \bar{S}_n(r)\,\big(Y_{[r:n]} - \bar{Y}_n\big), \qquad (2)$$
where $\hat{\sigma}_Y^2 = \frac{1}{n}\sum_{r=1}^{n} \big(Y_{[r:n]} - \bar{Y}_n\big)^2$.

The proof of Theorem 1 is provided in the Appendix.
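The scores $\alpha_{(r:n)} = E(Z_{(r:n)})$ rarely have closed forms, but they are straightforward to approximate by simulation. The sketch below is our own illustration, not code from the paper, and assumes $Z$ is standard normal, one natural pre-decided choice:

```python
import random

def d_rank_scores(n, n_sim=5000, seed=0):
    """Monte Carlo estimate of the D-rank scores alpha_{(r:n)} = E[Z_{(r:n)}],
    r = 1, ..., n, taking Z to be standard normal (an assumed choice of Z)."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(n_sim):
        # One sample of the order statistics Z_{(1:n)} <= ... <= Z_{(n:n)}
        z = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        for r in range(n):
            totals[r] += z[r]
    return [t / n_sim for t in totals]

scores = d_rank_scores(10)
# The scores are increasing in r and, because Z is symmetric here, roughly
# antisymmetric about zero: alpha_{(r:n)} is close to -alpha_{(n+1-r:n)}.
```

Replacing `rng.gauss(0.0, 1.0)` with draws from another candidate $Z$ (an exponential, say) yields the D-rank scores for that choice.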
Theorem 1 shows the asymptotic optimality of the D-rank for the regression from the viewpoint of optimal scaling in the literature. Optimal scaling finds optimally transformed scores that best explain the assumed statistical model. It arises in various contexts, including the Gifi classification of non-linear multivariate analysis (de Leeuw and Mair, 2009), the aspect (correlational and non-correlational aspects) of multivariate data (Mair and de Leeuw, 2010), and non-linear principal component analysis (Linting et al., 2007; Costantini et al., 2010). Here, we adopt the idea of optimal scaling in Jacoby (2016) and find the transformation that maximizes the correlation between the response and the transformed scores. Theorem 1 above shows that the D-rank asymptotically maximizes this correlation if the pre-determined distribution for the D-rank is correctly specified.

The proposed score is closely related to the quantiles of the underlying distribution of $Z$. Let $F_Z(z)$ for $z \in \mathbb{R}$ and $Q_Z(q)$ for $q \in [0,1]$ be the cumulative distribution function (CDF) and the quantile function (QF), respectively, of $Z$. In the estimation of $F_Z(z)$ from $\{Z_i,\ i = 1,2,\ldots,n\}$, the $r$th order statistic $Z_{(r:n)}$ is the $(r/n) \times 100$-th percentile point of the empirical CDF, and thus its expected value is approximately equal to $Q_Z(r/n)$. More specifically, given $p_r = \frac{r}{n+1}$, $q_r = 1 - p_r$, and $Q_r = Q_Z(p_r)$, we can write
$$\alpha_{(r:n)} = Q_r + \frac{p_r q_r}{2(n+2)}\,Q_r^{(2)} + O\!\left(\frac{1}{n^2}\right),$$
where $Q_r^{(2)} = -f_Z'(Q_r)\big/\{f_Z(Q_r)\}^3$ and $f_Z(z)$ is the probability density function of $Z$, which is assumed differentiable. We refer the reader to David (2003, Section 4.6) for the details of the relationship between the MOS and the quantiles. Consideration of the QF may provide a better understanding of the qualitative features of the proposed score function.
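The expansion above can be checked numerically. The following sketch (again our own illustration, with $Z$ assumed standard normal) compares the leading term $Q_Z(p_r)$ with a Monte Carlo estimate of $\alpha_{(r:n)}$ for $n = 50$ and $r = 40$:

```python
import random
from statistics import NormalDist

n, r = 50, 40
p = r / (n + 1)
approx = NormalDist().inv_cdf(p)          # leading term Q_Z(p_r)

# Monte Carlo estimate of alpha_{(40:50)} = E[Z_{(40:50)}]
rng = random.Random(1)
n_sim = 5000
mc = sum(sorted(rng.gauss(0.0, 1.0) for _ in range(n))[r - 1]
         for _ in range(n_sim)) / n_sim
# approx and mc agree closely; the remaining gap is of the order of the
# p_r * q_r * Q_r^(2) / {2(n+2)} correction term in the expansion.
```
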
Suppose we expect the score function $S_n(r)$ to be convex in the tail (for $r \geq [nc]$ for a constant $c$ close to 1); in other words, $S_n(r+1) - S_n(r) \geq S_n(r) - S_n(r-1)$ for $r \geq [nc]$. From the equivalence between the MOS and the quantiles, the convexity of the scores $S_n(r)$ is approximately equivalent to that of the quantile function $Q_Z(p)$. Furthermore, the convexity of $Q_Z(p)$ for $p \geq c$ implies the following equivalent statements: (i) $F_Z(z)$ is concave in $z$, (ii) $f_Z'(z) \leq 0$, or (iii) $\log f_Z(z)$ is decreasing in $z$, all for $z \geq Q_Z(c)$.

3 Simple Linear Regression

In this section, we consider a simple regression model in which only partial ranks of a covariate are observed. Specifically, suppose that $\{(Y_i, X_i),\ i = 1,2,\ldots,n\}$ is the complete set of observations from the linear model (1), and $R_i$ is the rank of $X_i$ among $X_1, X_2, \ldots, X_n$. The rank $R_i$ of $X_i$ is indirectly measured by the frequency of on-line discussions of the $i$th company.

In this paper, we consider the case in which the ranks $R_i$ are partially observed, in the sense that we observe only $U_i = R_i\,I(R_i \leq m) + m^+ I(R_i > m)$ rather than $R_i$, where $m^+$ is an arbitrary constant that is greater than $m$. Finally, the observations are $\{(Y_i, U_i),\ i = 1,2,\ldots,n\}$. We let $Y_{[r:n]} = Y_i\,I(R_i = r)$ for $r = 1,2,\ldots,m$, and denote the above partially observed data by $\mathcal{Y}_{[m]}$ for notational simplicity.

The objective of this section is to identify a good estimator of $\rho = \mathrm{corr}(Y, X)$ (or the regression coefficient between $Y$ and $X$) and to test $H_0: \rho = 0$ versus $H_1: \rho \neq 0$ (or $\rho > 0$) using the observed data $\mathcal{Y}_{[m]}$.

3.1 Least-Squares Estimator

To estimate $\rho$, we recall the assumptions regarding the distributions of $X$ and $Y$. We assume that the linear model of the relationship between $X_i$ and $Y_i$ is
$$Y_i = \mu_Y + \rho\,\sigma_Y\,\frac{X_i - \mu_X}{\sigma_X} + \epsilon_i,$$
where the $\epsilon_i$'s are IID values from a distribution with mean 0 and variance $\sigma_\epsilon^2$.
By ordering on the $X_i$'s, we have, for $r = 1,\ldots,n$,
$$Y_{[r:n]} = \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\big(X_{(r:n)} - \mu_X\big) + \epsilon_{[r:n]}, \qquad (3)$$
where $\rho = \mathrm{corr}(Y, X)$ and
$$E\big(Y_{[r:n]}\big) = \mu_Y + \rho\,\sigma_Y\,\alpha_{(r:n)}, \qquad (4)$$
$$\mathrm{var}\big(Y_{[r:n]}\big) = \sigma_Y^2\big(\rho^2\beta_{(rr:n)} + 1 - \rho^2\big),$$
$$\mathrm{cov}\big(Y_{[r:n]}, Y_{[s:n]}\big) = \rho^2\sigma_Y^2\,\beta_{(rs:n)}, \quad r \neq s,$$
with
$$\alpha_{(r:n)} = E\!\left\{\frac{X_{(r:n)} - \mu_X}{\sigma_X}\right\} \quad \text{and} \quad \beta_{(rs:n)} = \mathrm{Cov}\!\left(\frac{X_{(r:n)} - \mu_X}{\sigma_X},\ \frac{X_{(s:n)} - \mu_X}{\sigma_X}\right)$$
for $r, s = 1,2,\ldots,n$ (David and Galambos, 1974; David, 2003).

Motivated by the identities (3) and (4) given above, we propose the least-squares estimator
$$\hat{\rho}(s) \equiv \frac{1}{\hat{\sigma}_Y} \cdot \frac{\sum_{r=1}^{[ns]} \alpha_{(r:n)}\big(Y_{[r:n]} - \hat{\mu}_Y\big)}{\sum_{r=1}^{[ns]} \alpha_{(r:n)}^2} \qquad (5)$$
as an estimator of $\rho$ with $s = m/n$, where $\hat{\mu}_Y = \sum_{i=1}^n Y_i/n$ and $\hat{\sigma}_Y^2 = \sum_{i=1}^n \big(Y_i - \hat{\mu}_Y\big)^2/n$ are the empirical estimators of the mean and variance, respectively, of $Y$.

We claim that, if $X$ is drawn from a location-scale family generated by $Z$, then the least-squares estimator $\hat{\rho}(s)$ with $s = m/n$ in (5), calculated based on the partial observations $\mathcal{Y}_{[m]}$, is consistent and asymptotically normally distributed with an appropriate scale, as shown in Theorem 2. Suppose that
$$\Psi_n^{I}(s) := \frac{1}{n}\sum_{r=1}^{[ns]} \alpha_{(r:n)}^2\,\sigma_{(r:n)}^2, \quad \Psi_n^{II}(s) := \frac{1}{n}\sum_{r_1=1}^{[ns]}\sum_{r_2=1}^{[ns]} \alpha_{(r_1:n)}\,\alpha_{(r_2:n)}\,\beta_{(r_1,r_2:n)}^2, \quad \text{and} \quad \Phi_n(s) := \frac{1}{n}\sum_{r=1}^{[ns]} \alpha_{(r:n)}^2,$$
where $\sigma_{(r:n)}^2 = \sigma^2\big(X_{(r:n)}\big)$, and let $\Psi_\infty^{I}(s)$, $\Psi_\infty^{II}(s)$ and $\Phi_\infty(s)$ be the limits of $\Psi_n^{I}(s)$, $\Psi_n^{II}(s)$ and $\Phi_n(s)$, respectively (under the assumption that they exist).

Theorem 2. Under the assumption that $X$ is drawn from a distribution of a location-scale family with a finite variance, the distribution of $\sqrt{n}\big(\hat{\rho}(s) - \rho\big)$ converges to the normal distribution with mean 0 and variance $\big\{\Psi_\infty^{I}(s)/\sigma_Y^2 + \rho^2\,\Psi_\infty^{II}(s)\big\}\big/\Phi_\infty^2(s)$.

The proof of Theorem 2 is provided in the Appendix. We conclude this section with two remarks regarding Theorem 2.
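The estimator (5) can be illustrated with a small simulation. The sketch below is ours, under assumed conditions rather than the paper's implementation: we take $X$ (and hence $Z$) standard normal with $\mu_Y = 0$ and $\sigma_Y = 1$, approximate the D-rank scores by the normal quantiles $Q_Z(r/(n+1))$, and observe only the first $m$ ranks under the order-statistic convention that rank $r$ corresponds to the $r$th smallest $X$:

```python
import random
from statistics import NormalDist, mean, pstdev

def estimate_rho(y_top, alpha, y_all):
    """Least-squares estimator (5): y_top[r-1] = Y_{[r:n]} for the m observed
    ranks, alpha[r-1] = alpha_{(r:n)} the D-rank scores, and y_all = all Y_i
    (used only for the empirical mean and standard deviation of Y).
    The argument names are our own, not the paper's."""
    mu_hat, sd_hat = mean(y_all), pstdev(y_all)
    num = sum(a * (y - mu_hat) for a, y in zip(alpha, y_top))
    den = sum(a * a for a in alpha)
    return num / (sd_hat * den)

rng = random.Random(0)
n, m, rho = 2000, 400, 0.6
x = sorted(rng.gauss(0.0, 1.0) for _ in range(n))     # X_{(1:n)} <= ... <= X_{(n:n)}
y = [rho * xi + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0) for xi in x]
alpha = [NormalDist().inv_cdf(r / (n + 1)) for r in range(1, m + 1)]
rho_hat = estimate_rho(y[:m], alpha, y)   # close to the true rho = 0.6
```

Even though only 20% of the ranks are used in the numerator of (5), the estimate recovers the true correlation, consistent with Theorem 2.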
First, in Theorem 2, from the tower property of the conditional expectation,
$$\mathrm{var}\big(\sqrt{n}\,\hat{\rho}\big) > \frac{1}{\frac{1}{n}\sum_{r=1}^{[ns]} \alpha_{(r:n)}^2} \geq \frac{1}{\mathrm{var}(X)} = 1,$$
and when $\rho = 0$, the asymptotic variance of $\sqrt{n}\,\hat{\rho}$ is larger than 1, which is the variance of the least-squares estimator in the case where $X$ is completely observed. Second, it is possible to test the hypothesis $H_0: \rho = 0$ using the statistic $T = \sqrt{n}\,\hat{\rho}$, which has an asymptotically normal distribution with mean 0 and variance $1\big/\Phi_\infty(s)$.

3.2 Residual Analysis

As in the classical linear model, the residuals can provide guidance for identifying a better model and score function. The residuals are defined as $e^*_{[r:n]} = \big(Y_{[r:n]} - \mu_Y\big)\big/\sigma_Y - \hat{\rho}\,\alpha_{(r:n)}$ for $r = 1,2,\ldots,[ns]$. Statistical properties of the residuals, which are analogous to those in the classical linear model, are summarized as follows.

Theorem 3. Under the assumptions of Theorem 2, the following statements are true for the residuals: (i) $E\big(e^*_{[r:n]}\big) = 0$; (ii)
$$\mathrm{var}\big(e^*_{[r:n]}\big) = \big\{\rho^2\beta_{(rr:n)} + \big(1 - \rho^2\big)\big\} + \alpha_{(r:n)}^2\,\frac{1}{n\sigma_Y^2}\,\frac{\Psi_n^{I}(s)}{\Phi_n^2(s)} - \frac{2}{\sum_{r=1}^{[ns]} \alpha_{(r:n)}^2}\left\{\rho^2 \sum_{w=1}^{[ns]} \alpha_{(w:n)}\,\alpha_{(r:n)}\,\beta_{(rw:n)} + \alpha_{(r:n)}^2\big(1 - \rho^2\big)\right\};$$
(iii) $E\big(e^*_{[r:n]}\,\alpha_{(r:n)}\big) = 0$; and (iv) $E\big(e^*_{[r:n]}\,\hat{Y}^*_{[r:n]}\big) = 0$, where $\hat{Y}^*_{[r:n]} = \mu_Y + \hat{\rho}\,\alpha_{(r:n)}$.

The proof of Theorem 3 requires only simple algebra and is thus omitted here. The theorem states that the residuals have mean 0 and finite variance, and also states that they are uncorrelated with the scores $\alpha_{(r:n)}$ and the predicted values $\hat{Y}_{[r:n]}$. Thus, the residual plots, which are the plots of (i) $r$ versus $e^*_{[r:n]}$, (ii) $\alpha_{(r:n)}$ versus $e^*_{[r:n]}$, and (iii) $\hat{Y}_{[r:n]}$ versus $e^*_{[r:n]}$, have the same interpretations as those of the classical linear model. In practice, we plug in $\mu_Y$ and $\sigma_Y$ with their empirical estimators and use $e_{[r:n]} = \big(Y_{[r:n]} - \hat{\mu}_Y\big)\big/\hat{\sigma}_Y - \hat{\rho}\,\alpha_{(r:n)}$.
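In practice, these diagnostics reduce to computing the empirical residuals and comparing their sum of squares across candidate score functions. The helper below is a hypothetical sketch of ours; the names `y_top`, `alpha`, and `rho_hat` are our own, not the paper's:

```python
from statistics import mean, pstdev

def residuals_and_rss(y_top, alpha, rho_hat, y_all):
    """Empirical residuals e_{[r:n]} = (Y_{[r:n]} - mu_hat)/sigma_hat
    - rho_hat * alpha_{(r:n)}, and their sum of squares (RSS), used to
    compare candidate score functions (candidate choices of Z)."""
    mu_hat, sd_hat = mean(y_all), pstdev(y_all)
    e = [(y - mu_hat) / sd_hat - rho_hat * a for y, a in zip(y_top, alpha)]
    rss = sum(r * r for r in e)
    return e, rss
```

A smaller RSS under one candidate $Z$ than under another suggests that the corresponding score function is better suited to the data, in line with the selection procedure described above.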
The residual sum of squares may be another useful tool for measuring the goodness of fit of the proposed model, as in the classical linear model. The residual sum of squares in our model is defined as
$$\mathrm{RSS} = \sum_{r=1}^{[ns]} \left(\frac{Y_{[r:n]} - \hat{\mu}_Y}{\hat{\sigma}_Y} - \hat{Y}_{[r:n]}\right)^2$$
and will be used along with the residual plots as a guide for selecting a better score function.

Finally, the proposed least-squares estimator (5) assumes that the regression line between $\alpha_{(r:n)}$ and $\big(Y_{[r:n]} - \hat{\mu}_Y\big)$ has an intercept (at the $y$-axis) of 0. Thus, if the model (or the score function) is correctly specified, then the intercept estimated by the regression (with intercept) should be close to 0, and the estimated intercept therefore serves as a measure for checking the correctness of the score function. Note that the regression (without intercept) performed in this paper is based on observations of the top $[ns]$ ranks and assumes that the line passes through the origin (see Figure 4).

3.3 An Estimator with Unranked Observations

The least-squares estimator presented in Section 3.1 does not fully use the information contained in $\big\{Y_{[r:n]} = Y_i\,I(R_i = r),\ r > m\big\}$; these observations are used only to estimate $\mu_Y$ and $\sigma_Y$, not to estimate $\rho$ itself. In this section, we briefly demonstrate how $\hat{\rho}$ can be modified to incorporate these unranked observations.
