Multivariate Confidence Intervals*

Jussi Korpela†, Emilia Oikarinen†, Kai Puolamäki†, Antti Ukkonen†

arXiv:1701.05763v1 [stat.AP] 20 Jan 2017

* This work was supported by Academy of Finland (decision 288814) and Tekes (Revolution of Knowledge Work). This document is an extended version of [10] to appear in the Proceedings of the 2017 SIAM International Conference on Data Mining (SDM 2017). Implementation of the algorithms (Sec. 5) and code to reproduce Figs. 1 and 5 are provided at https://github.com/jutako/multivariate-ci.git
† Finnish Institute of Occupational Health, Helsinki, Finland

Abstract

Confidence intervals are a popular way to visualize and analyze data distributions. Unlike p-values, they can convey information both about statistical significance as well as effect size. However, very little work exists on applying confidence intervals to multivariate data. In this paper we define confidence intervals for multivariate data that extend the one-dimensional definition in a natural way. In our definition every variable is associated with its own confidence interval as usual, but a data vector can be outside of a few of these, and still be considered to be within the confidence area. We analyze the problem and show that the resulting confidence areas retain the good qualities of their one-dimensional counterparts: they are informative and easy to interpret. Furthermore, we show that the problem of finding multivariate confidence intervals is hard, but provide efficient approximate algorithms to solve the problem.

Keywords: multivariate statistics; confidence intervals; algorithms

1 Introduction

Confidence intervals are a natural and commonly used way to summarize a distribution over real numbers. In informal terms, a confidence interval is a concise way to express what values a given sample mostly contains. They are used widely, e.g., to denote ranges of data, specify accuracies of parameter estimates, or in Bayesian settings to describe the posterior distribution. A confidence interval is given by two numbers, the lower and upper bound, and parametrized by the percentage of probability mass that lies within the bounds. They are easy to interpret, because they can be represented visually together with the data, and convey information both about the location as well as the variance of a sample.

There is a plethora of work on how to estimate the confidence interval of a distribution based on a finite-sized sample from that distribution, see [7] for a summary. However, most of these approaches focus on describing a single univariate distribution over real numbers. Also, the precise definition of a confidence interval varies slightly across disciplines and application domains.

In this paper we focus on confidence areas: a generalization of univariate confidence intervals to multivariate data. All our intervals and areas are such that they describe ranges of data and are of minimal width. In other words, they contain a given fraction of data within their bounds and are as narrow as possible. By choosing a confidence area with minimal size we essentially locate the densest part of the distribution. Such confidence areas are particularly effective for visually showing trends, patterns, and outliers.

Considering the usefulness of confidence intervals, it is surprising how little work exists on applying confidence intervals on multivariate data [11]. In multivariate statistics confidence regions are a commonly used approach, see e.g., [6], but these methods usually require making assumptions about the underlying distribution. Moreover, unlike confidence areas, most confidence regions cannot be described simply with an upper and a lower bound, e.g., confidence regions for multivariate Gaussian data are ellipsoids. Thus, these approaches do not fully capture two useful characteristics of one-dimensional confidence intervals: a) easy interpretability and b) lack of assumptions about the data.

Figure 1: Examples 1 and 2. Please see text for details. LEFT: Local anomalies in time series (solid black line) are easier to spot when computing the confidence area using the proposed method (orange lines) as opposed to existing approaches (green lines). RIGHT: Our approach results in a confidence area (orange "cross") that is a better representation of the underlying distribution than existing approaches (green rectangle). (Legend entries: $l = 25$, $l = 0$ and naive in the left panel; $l = 1$, $l = 0$ and naive in the right panel. The time points x and y are marked on the horizontal axis of the left panel.)

The simplest approach to construct multivariate confidence intervals is to compute one-dimensional intervals separately for every variable. While this naive approach satisfies conditions a) and b) above, it is easy to see how it fails in general. Assume, e.g., that we have 10 independently distributed variables, and have computed for each variable a 90% confidence interval. This means that when considering every variable individually, only 10% of the distribution lies outside of the respective interval, as desired. But taken together, the probability that an observation is outside at least one of the confidence intervals is as much as $1 - 0.9^{10} \approx 65\%$. This probability, however, depends strongly on the correlations between the variables. If the variables were perfectly correlated with a correlation coefficient of ±1, the probability of an observation being outside the confidence intervals would again be 10%. In general the correlation structure of the data affects in a strong and hard-to-predict way how the naive confidence intervals should be interpreted in a multivariate setting.
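The arithmetic behind the 65% figure can be checked directly. The following is a minimal sketch rather than anything taken from the paper: the helper name `prob_outside_any` is hypothetical, and it assumes that the variables are independent and that every interval has the same marginal coverage.

```python
def prob_outside_any(coverage: float, m: int) -> float:
    """Probability that an observation violates at least one of m intervals,
    assuming independent variables that each have the given marginal coverage."""
    return 1.0 - coverage ** m


# Ten independent variables, each with a 90% interval (the example in the text).
print(round(prob_outside_any(0.9, 10), 3))  # 0.651

# Perfectly correlated variables behave like a single variable,
# which corresponds to m = 1 in this simplified model.
print(round(prob_outside_any(0.9, 1), 3))   # 0.1
```

Real data usually falls between these two extremes, which is why the naive per-variable intervals are hard to interpret jointly.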
Ideally a multivariate confidence area should retain the simplicity of a one-dimensional confidence interval. It should be representable by upper and lower bounds for each variable and the semantics should be the same: each observation is either inside or outside the confidence area, and most observations of a sample should be inside. Particularly important is specifying when an observation in fact is within the confidence area, as we will discuss next.

Confidence areas for time series data have been defined [9, 11] in terms of the minimum width envelope (MWE) problem: a time series is within a confidence area if it is within the confidence interval of every variable. While this has desirable properties, it can result in very conservative confidence areas if there are local deviations from what constitutes "normal" behavior. The definition in [11] is in fact too strict by requiring an observation to be contained in all variable-specific intervals.

Thus, here we propose an alternative way to define the confidence area: a data vector is within the confidence area if it is outside the variable-specific confidence intervals at most $l$ times, where $l$ is a given integer. This formulation preserves easy interpretability: the user knows that any observation within the confidence area is guaranteed to violate at most $l$ variable-specific confidence intervals. In the special case $l = 0$, this new definition coincides with the MWE problem [11]. The following examples illustrate further properties and uses of the new definition.

Example 1: local anomalies in time series.

The left panel of Fig. 1 presents a number of time series over $m = 80$ time points, shown in light gray, together with three different types of 90% confidence intervals, shown in green, purple and orange, respectively. In this example, "normal" behavior of the time series is exemplified by the black dash-dotted curve that exhibits no noteworthy fluctuation over time. Two other types of time series are shown also in black: a clear outlier (dashed black line), and a time series that exhibits normal behavior most of the time, but strongly deviates from this for a brief moment (solid black line).

In the situation shown, we would like the confidence interval to only show what constitutes "normal" behavior, i.e., not be affected by strong local fluctuations or outliers. Such artifacts can be caused, e.g., by measurement problems, or other types of data corruption. Alternatively, such behavior can also reflect some interesting anomaly in the data generating process, and this should not be characterized as "normal" by the confidence interval. In Fig. 1 (left) the confidence area based on MWE [11] is shown by the green lines; recall that the MWE interval corresponds to setting $l = 0$. While it is unaffected by the outlier, it strongly reacts to local fluctuations. The naive confidence area, i.e., one where we have computed confidence intervals for every time point individually, is shown in purple. It is also affected by local peaks, albeit less than the $l = 0$ confidence area. Finally, the proposed method is shown in orange. The area is computed using $l = 25$, i.e., a time series is still within the confidence area as long as it lies outside in at most 25 time points. This variant focuses on what we would expect to be normal behavior in this case. Our new definition of a confidence area is thus nicely suited for finding local anomalies in time series data.

Example 2: representing data distributions.

On the other hand, the right panel of Fig. 1 shows an example where we focus only on the time points x and y as indicated in the left panel. Time point x resides in the region with a strong local fluctuation, while at y there are no anomalies. According to our definition, an observation, in this case an (x, y) point, is within the confidence area if it is outside the variable-specific confidence intervals at most $l$ times. We have computed two confidence areas using our method, one with $l = 0$ (green), and another with $l = 1$ (orange), as well as the naive confidence intervals (purple).

For $l = 0$, we obtain the green rectangle in Fig. 1 (right panel). The variable-specific confidence intervals have been chosen so that the green rectangle contains 90% of the data and the sum of the widths of the confidence intervals (sides of the rectangle) is minimized. For $l = 1$, we obtain the orange "cross" shaped area. The cross shape follows from allowing an observation to exceed the variable-specific confidence interval in $l = 1$ dimensions. Again, the orange cross contains 90% of all observations, and has been chosen by greedily minimizing the sum of the lengths of the respective variable-specific confidence intervals. It is easy to see that with $l = 0$, i.e., when using the MWE method [11], the observations do not occur evenly in the resulting confidence area (green rectangle). Indeed, the top right and bottom right parts of the rectangle are mostly empty. In contrast, with $l = 1$, the orange cross shaped confidence area is a much better description of the underlying data distribution, as there are no obvious "empty" parts. Our novel confidence area is thus a better representation of the underlying data distribution than the existing approaches.

Contributions

The basic problem definition we study in this paper is straightforward: for $m$-dimensional data and the parameters $\alpha \in [0, 1]$ and integer $l$, find a confidence interval for each of the variables so that a $1 - \alpha$ fraction of the observations lie within the confidence area, defined so that the sum of the lengths of the intervals is minimized, and an observation can break at most $l$ of the variable-specific confidence intervals. We make the following contributions in this paper:

• We formally define the problem of finding a multivariate confidence area, where observations have to satisfy most but not all of the variable-specific confidence intervals.
• We analyze the computational complexity of the problem, and show that it is NP-hard, but admits an approximation algorithm based on a linear programming relaxation.
• We propose two algorithms, which produce good confidence areas in practice.
• We conduct experiments demonstrating various aspects of multivariate confidence intervals.

The rest of this paper is organized as follows: Related work is discussed in Sec. 2. We define the ProbCI and CombCI problems formally in Sec. 3. In Sec. 4 we study theoretical properties of the proposed confidence areas, as well as problem complexity. Sec. 5 describes the algorithms used in the experiments in Sec. 6. Finally, Sec. 7 concludes this work.

2 Related work

Confidence intervals have recently gained more popularity, as they convey information both of statistical significance of the result as well as the effect size. In contrast, p-values give information only of the statistical significance: it is possible to have statistically significant results that are meaningless in practice due to the small effect size. The problem has been long and acutely recognized, e.g., in medical research [5]. Some psychology journals have recently banned the use of p-values [18, 16]. The proposed solution is not to report p-values at all, but use confidence intervals instead [14].

Simultaneous confidence intervals for time series data have been proposed [9, 11]. These correspond to the confidence areas in this paper when $l = 0$. The novelty here is the generalization of the confidence areas to allow outlying dimensions ($l > 0$), similarly to [17], and the related theoretical and algorithmic contributions, allowing for narrower confidence intervals and in some cases more readily interpretable results.
Simultaneous confidence intervals with $l = 0$ appear in [2, p. 154], and were studied in [13] by using the most extreme value within a data vector as a ranking criterion. Other examples of $l = 0$ type procedures include [12, 15]. In the field of information visualization and the visualization of time series, confidence areas have been used extensively; see [1] for a review. An interesting approach is to construct simultaneous confidence regions by inverting statistical multiple hypothesis testing methods, see e.g., [6].

The approach presented in this paper allows some dimensions of an observation to be partially outside the confidence area. This is in the same spirit as, but not equivalent to, the false discovery rate (FDR) in multiple hypothesis testing, which also allows a controlled fraction of positives to be "false positives". In comparison, the approach in [11] is more akin to the family-wise error rate (FWER), which controls the probability of at least one false discovery.

3 Problem definition

A data vector $x$ is a vector in $\mathbb{R}^m$ and $x(j)$ denotes the value of $x$ in the $j$th position. Let matrix $X \in \mathbb{R}^{n \times m}$ be a dataset of $n$ data vectors $x_1, \ldots, x_n$, i.e., the rows of $X$. We start by defining the confidence area, its size, and the envelope for a dataset $X$.

Definition 3.1. Given $X \in \mathbb{R}^{n \times m}$, a confidence area for $X$ is a doublet of vectors $(x_l, x_u)$, $x_l, x_u \in \mathbb{R}^m$, composed of lower and upper bounds, respectively, satisfying $x_l(j) \le x_u(j)$ for all $j$. The size of the confidence area is $A = \sum_{j=1}^{m} w(j)$, where $w(j) = x_u(j) - x_l(j)$ is the width of the confidence interval w.r.t. the $j$th position. The envelope of $X$ is a confidence area denoted by $\mathrm{env}(X) = (x_l, x_u)$, where $x_l(j) = \min_{i=1}^{n} x_i(j)$ and $x_u(j) = \max_{i=1}^{n} x_i(j)$.

We define the error of a data vector given a confidence area as the number of positions in which the vector lies outside the area, as follows.

Definition 3.2. Let $x$ be a data vector in $\mathbb{R}^m$ and $(x_l, x_u)$ a confidence area. The error of $x$ given the confidence area is defined as

$$V(x \mid x_u, x_l) = \sum_{j=1}^{m} I\left[ x(j) < x_l(j) \lor x_u(j) < x(j) \right].$$

The indicator function $I[\Box]$ is unity if the condition $\Box$ is satisfied and zero otherwise.

The main problem in this paper is as follows.

Problem 3.1 (ProbCI). Given $\alpha \in [0, 1]$, an integer $l$, and a distribution $F$ over $\mathbb{R}^m$, find a confidence area $(x_l, x_u)$ that minimizes $\sum_{j=1}^{m} w(j)$ subject to the constraint

$$\Pr_{x \sim F}\left[ V(x \mid x_u, x_l) \le l \right] \ge 1 - \alpha. \tag{3.1}$$

For this problem definition to make sense, the variables or at least their scales must be comparable. Otherwise variables with high variance will dominate the confidence areas. Therefore, some thinking and a suitable preprocessing, such as normalization of variables, should be applied before solving for the confidence area and interpreting it.

The combinatorial problem is defined as follows.

Problem 3.2 (CombCI). Given integers $k$ and $l$, and $n$ vectors $x_1, \ldots, x_n$ in $\mathbb{R}^m$, find a confidence area $(x_l, x_u)$ by minimizing $\sum_{j=1}^{m} w(j)$ subject to the constraint

$$\sum_{i=1}^{n} I\left[ V(x_i \mid x_u, x_l) \le l \right] \ge n - k. \tag{3.2}$$

Any confidence area satisfying (3.2) is called a $(k, l)$-confidence area. In the special case with $l = 0$, Problems 3.1 and 3.2 coincide with the minimum width envelope problem from [11]. The problem definition with non-vanishing $l$ is novel to the best of our knowledge. The relation of Problems 3.1 and 3.2 is as follows.

Definition 3.3. Problems 3.1 and 3.2 match for a given data from distribution $F$ and parameters $\alpha$, $k$, and $l$, if a solution of Problem 3.2 satisfies Eq. (3.1) with equality for a previously unseen test data from distribution $F$.

We can solve Problem 3.1 by solving Problem 3.2 with different values of $k$ to find a $k$ that matches the given $\alpha$, as done in Sec. 6.2 or [11].

4 Theoretical observations

4.1 Confidence areas for uniform data

It is instructive to consider the behavior of Problem 3.1 with the uniform distribution $F_{\mathrm{unif}} = U(0, 1)^m$. We show that a solution may contain very narrow confidence intervals and discuss the required number of data vectors to estimate a confidence area with a desired level of $\alpha$.

Consider Problem 3.1 in a simple case of two-dimensional data with $m = 2$, $l = 1$, and $\alpha = 0.1$. In this case an optimal solution to Problem 3.1 is given by confidence intervals with $w(1) = 0.9$ and $w(2) = 0$, resulting in a confidence area of size $\sum_{j=1}^{2} w(j) = 0.9$. A solution with confidence intervals of equal width, i.e., $w(1) = w(2) = 0.68$, would lead to a substantially larger area of 1.37. As this simple example demonstrates, if data is unstructured or if some of the variables have unusually high variance, we may obtain solutions where some of the confidence intervals are very narrow. In the case of the uniform distribution the choice of variables with zero width confidence intervals is arbitrary: e.g., in the example above we could as well have chosen $w(1) = 0$ and $w(2) = 0.9$. Such narrow intervals are not misleading, because they are easy to spot: for a narrow interval, such as the one discussed in this paragraph, only a small fraction of the values of the variable lie within it. In real data sets with non-trivial marginal distributions and correlation structure the confidence intervals are often of roughly similar width.
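The definitions of Sec. 3 and the two-dimensional uniform example can be checked numerically. The sketch below is illustrative only and is not the authors' implementation: the function names (`envelope`, `size`, `violations`, `coverage`) are hypothetical, the sample size and random seed are arbitrary, and the intervals are placed by hand at positions chosen purely for convenience.

```python
import numpy as np


def envelope(X):
    """env(X): per-variable minimum and maximum over the rows of X."""
    return X.min(axis=0), X.max(axis=0)


def size(lo, hi):
    """A = sum_j w(j), with w(j) = x_u(j) - x_l(j)."""
    return float(np.sum(hi - lo))


def violations(X, lo, hi):
    """V(x | x_u, x_l) for every row x of X: number of positions outside [lo, hi]."""
    return np.sum((X < lo) | (X > hi), axis=1)


def coverage(X, lo, hi, l):
    """Fraction of rows that violate at most l variable-specific intervals."""
    return float(np.mean(violations(X, lo, hi) <= l))


rng = np.random.default_rng(0)
X = rng.uniform(size=(100_000, 2))      # sample from U(0, 1)^2

# Unequal widths: w(1) = 0.9 and w(2) = 0 (a degenerate interval).
lo_a, hi_a = np.array([0.05, 0.5]), np.array([0.95, 0.5])

# Equal widths: 1 - (1 - w)^2 = 0.9 gives w = 1 - sqrt(0.1), about 0.68.
w = 1.0 - np.sqrt(0.1)
lo_b = np.full(2, (1.0 - w) / 2.0)
hi_b = lo_b + w

for lo, hi in [(lo_a, hi_a), (lo_b, hi_b)]:
    print(round(size(lo, hi), 2), round(coverage(X, lo, hi, l=1), 2))
```

Both areas cover about 90% of the sample with $l = 1$, but their sizes are 0.9 and roughly 1.37, which matches the comparison made in the text.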
Next we consider the behavior of the problem at the limit of high-dimensional data.

Lemma 4.1. The solution with confidence intervals of equal width $w = w_j = x_u(j) - x_l(j)$ corresponds to $\alpha$ of

$$\alpha = 1 - BC(l, m, w), \tag{4.3}$$

where $BC(l, m, w) = \sum_{j=0}^{l} \binom{m}{j} (1 - w)^j w^{m-j}$ is the cumulative binomial distribution.

Lemma 4.2. If $n$ vectors are sampled from $F_{\mathrm{unif}} = U(0, 1)^m$ then the expected width of the envelope for each variable is $w = \frac{n-1}{n+1}$. The probability of more than $l$ variables from a vector from $F_{\mathrm{unif}}$ being outside the envelope is given by Eq. (4.3) with $w = \frac{n-1}{n+1}$.

One implication of Lemma 4.2 is that there is a limit to the practical accuracy that can be reached with a finite number of samples. The smallest $\alpha$ we can hope to realistically reach is the one implied by the envelope of the data, unless we make some distributional assumptions of the shape of the distribution outside the envelope. Conversely, the above lemmas define a minimum number of samples needed for uniform data to find the confidence area for a desired level of $\alpha$.

In the case of $l = 0$, to reach the accuracy of $\alpha$ we have $w^m \approx 1 - \alpha$, or $n \approx -2m / \log(1 - \alpha) \approx 2m / \alpha$, where we have ignored higher order terms in $1/m$ and $\alpha$. For a typical choice of $\alpha = 0.1$ and $m = 100$ this would imply that at least $n \approx 2000$ data vectors are needed to estimate the confidence area; the number of data vectors needed increases linearly with the dimensionality $m$. On the other hand, for a given $\beta \in\, ]0, 1[$, if we let $l = \lfloor \beta m \rfloor$, a solution where the width of the envelope is $w \approx 1 - \beta$ is asymptotically sufficient when the dimensionality $m$ is large, in which case the number of data vectors required is $n \approx 2 / \beta$. For a value of $\beta = 0.1$ only $n \approx 20$ data vectors would therefore be needed even for large $m$. As hinted by the toy example in Fig. 1, a non-vanishing parameter $l$ leads, at least in this example, to substantially narrower confidence intervals and, hence, makes it possible to compute the confidence intervals with smaller data sets.

4.2 Complexity and approximability

We continue by briefly addressing the computational properties of the CombCI problem. The proofs are provided in the appendix.

Theorem 4.1. CombCI is NP-hard for all $k$.

For $k > 0$ the result directly follows from [11, Theorem 3], and for $k = 0$ and $l > 0$ a reduction from Vertex-Cover can be used.

Now, there exists a polynomial time approximation algorithm for solving a variant of CombCI in the special case of $k = 0$. In particular, we consider here a one-sided confidence area, defined only by the upper bound $x_u$; the lower bound $x_l$ is assumed to be fixed, e.g., at zero, or any other suitable value. This complements the earlier result that the complement of the objective function of CombCI is hard to approximate for $k > 0$ and $l = 0$ [11, Theorem 3].

Theorem 4.2. There is a polynomial time $(l + 1)$-approximation algorithm for the one-sided CombCI problem with $k = 0$.

The proof uses a linear programming relaxation of an integer linear program corresponding to the $k = 0$ variant of the one-sided CombCI and the approximation ratio obtained from the solution given by the relaxation.

Finding a bound that does not depend on $l$ is an interesting open question, as well as extending the proof to the two-sided CombCI. It is unlikely that the problem admits a better approximation bound than 2, since this would immediately result in a new approximation algorithm for the Vertex-Cover problem. This is because in the proof of Theorem 4.2 we describe a simple reduction from Vertex-Cover to CombCI with $l = 1$ and $k = 0$. This reduction preserves approximability, as it maps the Vertex-Cover instance to an instance of our problem in a straightforward manner. For Vertex-Cover it is known that approximation ratios below 2 are hard to obtain in the general case. Indeed, the best known bound is equal to $2 - \Theta(1 / \sqrt{\log |V|})$ [8].

5 Algorithms

We present two algorithms for $(k, l)$-confidence areas.

5.1 Minimum intervals (mi)

Our first method is based on minimum intervals. A standard approach to define a confidence interval for univariate data is to consider the minimum length interval that contains a fraction $1 - a$ of the observations. This can be generalized for multivariate data by treating each variable independently, i.e., for a given $a$, we set $x_l(j)$ and $x_u(j)$ equal to the lower and upper limit of such a minimum length interval for every $j$. The mi algorithm solves the CombCI problem for given $k$ and $l$ by adjusting the parameter $a$ such that the resulting $(x_u, x_l)$ satisfies the constraint in Eq. (3.2) in the training data set. The resulting confidence area may have a larger size than the optimal solution, since all variables use the same $a$. Note that the mi solution differs from the naive solution mentioned in the introduction because the naive solution does not adjust $a$ but simply sets it to $a = k/n$. The time complexity of mi with binary search for the correct $a$ is $O(mn \log k \log n)$.
5.2 Greedy algorithm (gr)

Our second method is a greedy algorithm. The greedy search could be done either bottom-up (starting from an empty set of included data vectors and then adding $n - k$ data vectors) or top-down (starting from $n$ data vectors and excluding $k$ data vectors). Since typically $k$ is smaller than $n - k$, we will consider here a top-down greedy algorithm.

The idea of the greedy algorithm is to start from the envelope of the whole data and sequentially find $k$ vectors to exclude by selecting at each iteration the vector whose removal reduces the envelope by the largest amount. In order to find the envelope with respect to the relaxed condition allowing $l$ positions from each vector to be outside, at each iteration the set of included data points needs to be (re)computed. This is done by implementing a greedy algorithm solving the CombCI problem for $k = 0$. Here one removes individual points that result in a maximal decrease in the envelope size so that at most $l$ points from each data vector are removed, thus obtaining the envelope with respect to the $l$ criterion. After this envelope has been computed, the data vector whose exclusion yields a maximal decrease in the size of the confidence area is excluded. For this, a slightly modified variant of the greedy MWE algorithm from [11] with $k = 1$ is used. After $k$ data vectors have been excluded, the final set of points included in the envelope is returned. The time complexity of gr is $O(mkn \log n)$.

6 Empirical evaluation

We present here an empirical evaluation of the algorithms. In the following mi and gr refer to the Minimum intervals and Greedy algorithm, respectively.

6.1 Datasets

We make experiments using various datasets that reflect different properties of interest from the point of view of fitting confidence areas. In particular, we want to study the effects of correlation (autocorrelation in the case of time series or regression curves), number of variables, and distributional qualities.

Artificial data. We use artificial multivariate (i.i.d. variables) datasets sampled from the uniform distribution (in the interval [0, 1]), the standard normal distribution, and the Cauchy distribution (location parameter 0, scale parameter $\gamma = 2$), with varying $n$ and $m$ to study some theoretical properties of multivariate confidence areas.

Kernel regression data. We use the Ozone and South African heart disease (SA heart) datasets (see, e.g., [4]) to compute confidence areas for bootstrapped kernel regression estimates. We use a simple Nadaraya-Watson kernel regression estimate to produce the vectors $X$, and fit confidence intervals to these using our algorithms. By changing the number of bootstrap samples we can vary $n$, and by altering the number of points in which the estimate is computed we can vary $m$.

Stock market data. We obtained daily closing prices of $n = 400$ stocks for years 2011–2015 ($m = 1258$ trading days). The time series are normalized to reflect the absolute change in stock price with respect to the average price of the first five trading days in the data. The data is shown in Fig. 5.

6.2 Finding the correct k

Our algorithms both solve the CombCI problem (Problem 3.2) in which we must specify the number of vectors $k$ that are allowed to violate the confidence area. To obtain a matching $\alpha$ in ProbCI (Definition 3.3) we must choose the parameter $k$ appropriately. We study here how this can be achieved.

Fig. 2 shows $\alpha$ as a function of $k/n$ in data with $m = 10$ independent normally distributed variables for four combinations of $n$ and $l$. The dashed green line shows the situation with $k/n = \alpha$. (This is a desirable property as it means fine-tuning $k/n$ is not necessary to reach some particular value of $\alpha$.) We observe from Fig. 2 that when the data is relatively small ($n = 250$), for a given $k/n$ both gr as well as mi tend to find confidence areas that are somewhat too narrow for the test data, leading to $\alpha > k/n$.
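The matching of $k$ to a target $\alpha$ can be illustrated with a small search loop. This is a hedged sketch, not the procedure used in the paper: `empirical_alpha`, `match_k` and `toy_solver` are hypothetical names, and `toy_solver` is a deliberately simple placeholder (it is neither mi nor gr) that only serves to make the example self-contained.

```python
import numpy as np


def empirical_alpha(X_test, lo, hi, l):
    """Fraction of test vectors that violate more than l intervals, cf. Eq. (3.1)."""
    V = np.sum((X_test < lo) | (X_test > hi), axis=1)
    return float(np.mean(V > l))


def toy_solver(X, k, l):
    """Placeholder solver, NOT mi or gr: drop the k rows farthest from the
    column-wise median (in max-norm) and return the envelope of the rest.
    The kept rows have no violations, so the result is a (k, l)-confidence area."""
    dist = np.max(np.abs(X - np.median(X, axis=0)), axis=1)
    keep = np.argsort(dist)[: len(X) - k]
    return X[keep].min(axis=0), X[keep].max(axis=0)


def match_k(solver, X_train, X_test, l, alpha):
    """Scan k and return the value whose held-out alpha is closest to the target."""
    best_k, best_gap = 0, float("inf")
    for k in range(len(X_train)):        # k = 0, ..., n - 1 keeps at least one row
        lo, hi = solver(X_train, k, l)
        gap = abs(empirical_alpha(X_test, lo, hi, l) - alpha)
        if gap < best_gap:
            best_k, best_gap = k, gap
    return best_k


rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 10))
X_test = rng.normal(size=(1000, 10))
k = match_k(toy_solver, X_train, X_test, l=0, alpha=0.1)
lo, hi = toy_solver(X_train, k, 0)
print(k / len(X_train), empirical_alpha(X_test, lo, hi, 0))
```

Any solver with the same `(X, k, l) -> (lo, hi)` signature could be passed to `match_k`; comparing the returned $k/n$ with the target $\alpha$ is the kind of comparison that Fig. 2 reports for mi and gr.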
Figure 2: Comparison of $k/n$ used when fitting the confidence interval on training data and the observed level of $\alpha$ in a separate test data. Both data are normal with $m = 10$, $n_{\mathrm{test}} = 1000$, $n_{\mathrm{train}} \in \{250, 1000\}$ and $l \in \{0, 2\}$. The dashed green line indicates $k/n = \alpha$. Shown are the averages over 25 independent trials over different randomly generated training and test instances. Note the log-scale on the vertical axis. (Panels: $n = 250$, $l = 0$; $n = 250$, $l = 2$; $n = 1000$, $l = 0$; $n = 1000$, $l = 2$. Horizontal axis: $k/n$; vertical axis: $\alpha$; curves: GR and MI.)

To obtain some particular $\alpha$, we must thus set $k/n$ to a lower value. For example, with $n = 250$ and $l = 0$, to have $\alpha = 0.1$ we must let $k/n = 0.05$. As the number of training examples is increased to $n = 1000$, we find that both algorithms are closer to the ideal situation of $k/n = \alpha$. Interestingly, this happens also when $l$ increases from $l = 0$ to $l = 2$. The relaxed notion of confidence area we introduce in this paper is thus somewhat less prone to overfitting issues when compared against the basic confidence areas with $l = 0$ of [11]. On the other hand, for $n = 1000$ we also observe how gr starts to "underfit" as $k/n$ increases, meaning that we have $\alpha < k/n$. Errors in this direction, however, simply mean that the resulting confidence area is conservative and will satisfy the given $k/n$ by a margin.

The dependency between $k/n$ and $\alpha$ and the other observations made above are qualitatively identical for uniform and Cauchy distributed data (not shown).

6.3 Effect of l on α in test data

Note that the value of $\alpha$ that we compute for a given confidence area also depends on the value of $l$ used when evaluating. We continue by showing how $\alpha$ depends on the value of $l$ used when evaluating the confidence area on test data. In this experiment we train confidence areas using the Ozone data (with $n = 500$ and $m = 10$ or $m = 50$) and adjust $k/n$ so that we have $\alpha = 0.1$ for a given value of $l_{\mathrm{train}}$, where $l_{\mathrm{train}} \in \{0, 0.1m, 0.2m, 0.3m\}$ is the parameter given to our algorithms (mi, gr) to solve Problem 3.2. Then we estimate $\alpha$ for all $l_{\mathrm{test}} \in \{1, \ldots, m\}$ using Eq. (3.1) and a previously unseen test data set.

Figure 3: Effect of the value of $l_{\mathrm{test}}$ on the observed $\alpha$ for different values of $l_{\mathrm{train}}$. On the left $l_{\mathrm{train}}$ is equal to 0 (black), 1 (red), 2 (green) and 3 (blue), while on the right $l_{\mathrm{train}}$ is 0 (black), 5 (red), 10 (green), and 15 (blue). In every case the confidence area was trained to have $\alpha = 0.1$ for the given $l_{\mathrm{train}}$. (Panels: MI and GR for $m = 10$ and $m = 50$. Horizontal axis: $l$ (test); vertical axis: $\alpha$.)

Results are shown in Fig. 3. We find that in every case the line for a given value of $l_{\mathrm{train}}$ intersects the thin dashed (red) line (indicating $\alpha = 0.1$) at the correct value $l_{\mathrm{test}} = l_{\mathrm{train}}$. More importantly, however, we also find that $l_{\mathrm{test}}$ has a very strong effect on the observed $\alpha$. This means that if we relax the condition under which an example still belongs in the confidence area (i.e., increase $l_{\mathrm{test}}$), $\alpha$ drops at a very fast rate, meaning that a confidence area trained for a particular value of $l_{\mathrm{train}}$ will be rather conservative when evaluated using $l_{\mathrm{test}} > l_{\mathrm{train}}$. Obviously the converse holds as well, i.e., when $l_{\mathrm{test}} < l_{\mathrm{train}}$, the confidence area will become much too narrow very fast. This implies that $l_{\mathrm{train}}$ should be set conservatively (to a low value) when it is important to control the false positive rate, e.g., when the "true" number of noisy dimensions is unknown.

Table 1: Comparison between mi and gr.

                        mi                      gr
dataset    l     m      A/m     t     k/n       A/m     t       k/n
unif       0     10     1.0     0.1   0.06      1.0     0.7     0.07
unif       2     10     0.9     0.2   0.08      0.9     14.0    0.11
unif       10    100    0.9     0.7   0.05      0.9     32.1    0.03
unif       50    500    0.9     2.1   0.03      0.9     220.2   0.03
norm       0     10     5.1     0.1   0.07      5.3     0.8     0.07
norm       2     10     3.2     0.2   0.09      3.2     12.0    0.09
norm       10    100    3.6     0.8   0.06      3.6     33.1    0.03
norm       50    500    3.5     2.1   0.03      3.5     209.7   0.03
Cauchy     0     10     262.3   0.1   0.08      325.1   0.7     0.07
Cauchy     2     10     21.8    0.1   0.09      25.6    10.7    0.08
Cauchy     10    100    36.8    0.7   0.07      38.5    31.4    0.03
Cauchy     50    500    30.5    2.2   0.05      31.5    203.3   0.03
ozone      0     10     5.1     0.1   0.08      5.1     0.8     0.07
ozone      2     10     3.4     0.1   0.09      3.5     9.0     0.09
ozone      5     50     4.2     0.3   0.09      4.3     11.4    0.05
ozone      15    50     3.0     0.3   0.09      3.1     48.6    0.06
SAheart    0     10     5.3     0.1   0.06      5.4     0.7     0.06
SAheart    2     10     3.3     0.2   0.08      3.7     15.5    0.14
SAheart    5     50     4.3     0.2   0.07      4.5     8.6     0.04
SAheart    15    50     2.9     0.3   0.08      3.3     66.7    0.06

Figure 4: Using gr and mi on bootstrapped kernel regression estimates ($n = 250$, $m = 30$, $l \in \{0, 5\}$). Top: Probability of coronary heart disease as a function of systolic blood pressure in the SAheart data. Bottom: Ozone level as a function of observed radiation. (Panel titles: South African CHD data; Ozone data. Curves: GR and MI, each with $l = 0$ and $l = 5$.)
6.4 Algorithm comparison

Next we briefly compare the mi and gr algorithms in terms of the confidence area size (given as A/m, where m is the number of variables) and running time t (in seconds). Results for artificial data sets as well as the two regression model data are shown in Table 1. The confidence level (in test data) was set to α = 0.1 in every case, and all training data had n = 500 examples. All numbers are averages of 25 trials. We can observe that mi tends to produce confidence areas that are consistently somewhat smaller than those found by gr. Also, mi is substantially faster, although our implementation of gr is by no means optimized. Finally, the k/n column shows the confidence level that was used when fitting the confidence area to obtain α = 0.1. Almost without exception, we have k/n < α for both algorithms, with mi usually requiring a slightly larger k than gr. Also worth noting is that for extremely heavy-tailed distributions, e.g., Cauchy, the confidence area shrinks rapidly as l is increased from zero.

6.5 Application to regression analysis

We can model the accuracy of an estimate given by a regression model by resampling the data points, e.g., by the bootstrap method, and then refitting the model for each of the resampled data sets [3]. The estimation accuracy or spread of values for a given independent variable can be readily visualized using confidence intervals.

Fig. 4 shows examples of different confidence intervals fitted to bootstrapped kernel regression estimates on two different datasets using both l = 0 and l = 5 (n = 250 and m = 30). In both cases k was adjusted so that α = 0.1 on separate test data. We find that qualitatively there is very little difference between mi and gr when l = 5. For l = 0, gr tends to produce a somewhat narrower area. In general, this example illustrates the effect of l on the resulting confidence area in visual terms. By allowing the examples to lie outside the confidence bounds for a few variables (l = 5) we obtain substantially narrower confidence areas that still reflect very well the overall data distribution.

6.6 Stock market data

The visualization for the stock market data of Fig. 5 has been constructed using the gr algorithm with parameters k = 40 and l = 125. The figure shows in yellow the stocks that are clearly outliers and among the k = 40 anomalously behaving observations ignored in the construction of the confidence area. The remaining n − k = 360 stocks (shown in blue) remain within the confidence area at least (m − l)/m = 90% of the time. However, they are allowed to deviate from the confidence area 10% of the time. Fig. 5 shows several such stocks, one of them highlighted with red. By allowing these excursions, the confidence area is smaller and these potentially interesting deviations are easy to spot. E.g., the red line represents Mellanox Technologies, and market analyses from fall 2012 report the stock being overpriced at that time. The black line in Fig. 5 represents Morningstar, an example staying inside the confidence area. If we do not allow any deviations outside the confidence intervals, i.e., we set l = 0, then the confidence area will be larger and such deviations might be missed.

7 Concluding remarks

The versatility of confidence intervals stems from their simplicity. They are easy to understand and to interpret, and therefore often used in presentation and interpretation of multivariate data. The application of confidence intervals to multivariate data is, however, often done in a naive way, disregarding effects of multiple variables. This may lead to false interpretations of results if the user is not careful.

In this paper we have presented a generalization of confidence intervals to multivariate data vectors. The generalization is simple and understandable and does not sacrifice the interpretability of one-dimensional confidence intervals. The confidence areas defined this way behave intuitively and offer insight into the data. The problem of finding confidence areas is computationally hard. We present two efficient algorithms to solve the problem and show that even a rather simple approach (mi) can produce very good results.

Confidence intervals are an intuitive and useful tool for visually presenting and analyzing data sets, spotting trends, patterns, and outliers. The advantage of confidence intervals is that they give at the same time information about both the statistical significance of the result and the size of the effect. In this paper, we have shown several examples demonstrating the usefulness of multivariate confidence intervals, i.e., confidence areas.

In addition to visual tasks, the confidence intervals can also be used in automated algorithms as simple and robust distributional estimators. As the toy example of Fig. 1 shows, the confidence areas with l > 0 can be a surprisingly good distributional estimator if the data is sparse, i.e., a majority of variables is close to the mean and in each data vector only a small number of variables have significant deviations from the average.

With p-values there are established procedures to deal with multiple hypothesis testing. Indeed, a proper treatment of the multiple comparisons problem is required, e.g., in scientific reporting. Our contribution to the discussion about reporting scientific results is to point out that it is indeed possible to treat multidimensional data with confidence intervals in a principled way.

References

[1] W. Aigner, S. Miksch, H. Schumann, and C. Tominski. Visualization of Time-Oriented Data. Springer, 2011.
[2] A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Application. 1997.
[3] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.
[4] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning. Springer, 2001.
[5] M. J. Gardner and D. G. Altman. Confidence intervals rather than P values: estimation rather than hypothesis testing. Brit. Med. J., 292:746–750, 1986.
[6] O. Guilbaud. Simultaneous confidence regions corresponding to Holm's step-down procedure and other closed-testing procedures. Biom. J., 50(5):678, 2008.
[7] R. J. Hyndman and Y. Fan. Sample quantiles in statistical packages. Am. Stat., 50(4):361–365, 1996.
[8] G. Karakostas. A better approximation ratio for the vertex cover problem. ACM Trans. Algorithms, 5(4):41:1–41:8, 2009.
[9] D. Kolsrud. Time-simultaneous prediction band for a time series. J. Forecasting, 26(3):171–188, 2007.
[10] J. Korpela, E. Oikarinen, K. Puolamäki, and A. Ukkonen. Multivariate confidence intervals. In Proc. SIAM Int. Conf. Data Min., 2017. To appear.
[11] J. Korpela, K. Puolamäki, and A. Gionis. Confidence bands for time series data. Data Min. Knowl. Discov., 28(5-6):1530–1553, 2014.
[12] W. Liu, M. Jamshidian, Y. Zhang, F. Bretz, and X. Han. Some new methods for the comparison of two linear regression models. J. Stat. Plan. Inference, 137(1):57–67, 2007.
[13] M. Mandel and R. Betensky. Simultaneous confidence intervals based on the percentile bootstrap approach. Comput. Stat. Data Anal., 52(4):2158–2165, 2008.
[14] R. Nuzzo. Scientific method: Statistical errors. Nature, 506:150–152, 2014.
[15] R. Schüssler and M. Trede. Constructing minimum-width confidence bands. Econ. Lett., 145:182–185, 2016.
[16] D. Trafimow and M. Marks. Editorial. Basic Appl. Soc. Psych., 37(1):1–2, 2015.
[17] M. Wolf and D. Wunderli. Bootstrap joint prediction regions. J. Time Ser. Anal., 36(3):352–376, 2015.
[18] C. Woolston. Psychology journal bans P values. Nature, 519:9, 2015.

Figure 5: Visualization of the relative closing values (Price, USD) of 400 stocks from Jan. 2011 to Dec. 2015 (1258 days) compared to the starting days. The confidence area with parameters k = 40 and l = 125 is shown with blue lines. An example of a valuation of a stock that temporarily deviates (in less than l time points) from the confidence area is shown in red, and an example of a valuation for a stock observing "normal" development is shown in black.

A Appendix

A.1 Proof of Theorem 4.1

The case of k > 0 follows directly from [11, Theorem 3]. In the case k = 0 we use a reduction from Vertex-Cover. In the Vertex-Cover problem we are given a graph G and an integer K, and the task is to cover every edge of G by selecting a subset of vertices of G of size at most K. A reduction from Vertex-Cover to the CombCI problem maps every edge e = (a, b) of the input graph G into the vector x_e, where x_e(a) = x_e(b) = 1, and x_e(w) = 0 when w ∉ {a, b}. Furthermore, we add two vectors x_a and x_b which satisfy x_a(1) = x_b(2) = −m − 1 and which have a value of zero otherwise. Moreover, we set k = 0 and l = 1.

For the optimal confidence area, we must allow the first element of vector a and the second element of vector b to be outside the confidence area, resulting in a lower bound x_l(j) = 0 for all j. Thus, it suffices to consider a variant of the problem where the input vectors x_i are constrained to reside in {0,1}^m and to seek an upper bound x_u ∈ {0,1}^m. To minimize the size of the confidence area we need to minimize the number of 1s in x_u. We consider a decision variant of the CombCI problem where we must decide if there exists an x_u with Σ_j x_u(j) ≤ K for a given integer K. An optimal vertex cover is obtained simply by selecting the vertices j with x_u(j) = 1 in the optimal upper bound x_u.

A.2 Proof of Theorem 4.2

To prove Theorem 4.2 we use a linear programming relaxation of an integer linear program (ILP) corresponding to the k = 0 variant of the one-sided CombCI, which we define as follows. Let X be a matrix representing the data vectors x_1, ..., x_n. For the moment, consider the jth position of every vector x_i in X, and let this be sorted in decreasing order of x_i(j). Let i′ denote the vector that follows vector i in this sorted order. Moreover, at every position j we only consider vectors that are strictly above the lower bound x_l(j).

The ILP we define uses binary variables q to express whether a given vector is inside the one-sided confidence area. We have q_i(j) = 1 when vector i is contained in the confidence area at position j, and q_i(j) = 0 otherwise.

To compute the size of the confidence area, we introduce a set of coefficients, denoted by ∆. We define ∆_i(j) = x_i(j) − x_{i′}(j) as the difference between the values of vectors i and i′ at position j. For the vector i that appears last in the order, i.e., there is no corresponding i′, we let ∆_i(j) = x_i(j) − x_l(j), where x_l(j) is the value of the lower side of the confidence interval at position j. Using this, we can write the difference between the largest and smallest value at position j as the sum Σ_{i=1}^{n} ∆_i(j), and the size of the envelope that contains all of X is equal to Σ_{j=1}^{m} Σ_{i=1}^{n} ∆_i(j).

Now, given a feasible assignment of values to q, i.e., an assignment that satisfies the constraints that we will define shortly, we can compute the size of the
