Covariate-powered weighted multiple testing with false discovery rate control

Nikolaos Ignatiadis and Wolfgang Huber

Department of Statistics, Stanford University
e-mail: [email protected]

European Molecular Biology Laboratory, Heidelberg, Germany
e-mail: [email protected]

Abstract: Consider a multiple testing setup where we observe mutually independent pairs $((P_i, X_i))_{1 \le i \le m}$ of p-values $P_i$ and covariates $X_i$, such that $P_i \perp X_i$ under the null hypothesis. Our goal is to use the information potentially available in the covariates to increase power compared to conventional procedures that only use the $P_i$, while controlling the false discovery rate (FDR). To this end, we recently introduced independent hypothesis weighting (IHW), a weighted Benjamini-Hochberg method, in which the weights are chosen as a function of the covariate $X_i$ in a data-driven manner. We showed empirically in simulations and datasets from genomics and proteomics that IHW leads to a large power increase, while controlling the FDR. The key idea was to use hypothesis splitting to learn the weight-covariate function without overfitting. In this paper, we provide a survey of IHW and related approaches by presenting them under the lens of the two-groups model, when it is valid conditionally on a covariate. Furthermore, a slightly modified variant of IHW is proposed and shown to enjoy finite sample FDR control. The same ideas can also be applied for finite-sample control of the family-wise error rate (FWER) via IHW-Bonferroni.

1. Introduction

1.1. Motivation

Statistical analysis of modern high-throughput datasets invariably invokes the issue of multiple testing. The problem has been well studied, and many solutions have been proposed. The most commonly used approaches start with a list of p-values $P_i$, one for each hypothesis $H_i$, and reject all hypotheses with a p-value below a (possibly random, that is, data-dependent) threshold $t^*$. The goal is to control a measure of type-I error at level $\alpha$. Traditionally, this measure has been the family-wise error rate (FWER), but for many applications this is too stringent, and over the last 20 years the false discovery rate (FDR) has become a popular choice [5], as it is more permissive and adaptive [4]. Nevertheless, practitioners often think of the multiple testing problem as a burden – a necessary price to pay for doing high-throughput exploratory work [44]. However, multiple testing also presents an unparalleled new opportunity: seeing the results from many tests simultaneously allows us to infer properties of our data that we could never learn from a single test. One might argue that the perception of multiple testing as a burden is exacerbated by shortcomings of existing statistical methods that do not make use of all the information available.

The case in point explored here is that often, beyond p-values, side information, represented by covariates $X_i$, is available for each hypothesis. This side information may be related to the different power of the tests, or to different prior probabilities of the null hypothesis being true. Such covariates are often apparent to statisticians or to domain scientists [26]. Yet, existing methods working only with p-values ignore this information – possibly because this side information was irrelevant in the context of (frequentist) single hypothesis testing. Such covariate-adjusted FDR estimation and control had already been considered a decade ago [19, 35], but has recently resurfaced as an active research topic.
In this paper we elaborate on an idea we recently introduced to the computational biology community [26]: We proposed independent hypothesis weighting (IHW), a modification of the Benjamini-Hochberg procedure [5] (the most frequently used method for FDR control) that can increase power by incorporating such covariates $X_i$. Each $X_i$ is assumed to be independent of the p-value $P_i$ under the null hypothesis, yet informative about power. The basic idea is simple: assign a weight to each hypothesis based on the value of its covariate $X_i$ by approximating the optimal decision boundary in a data-driven way. However, doing this in a naive way can result in overfitting and loss of FDR control. This is avoided by employing randomization in the form of hypothesis splitting into $k$ folds: learn the weights of the hypotheses in a given fold from the other $k-1$ folds. The ideas are widely applicable and can be generalized to other multiple testing procedures that can make use of weights, i.e., non-negative numbers that average to one and that signal different priorities for different hypotheses. As an example of such a generalization, we also show the applicability of our methods to the Bonferroni procedure.

1.2. Outline

In Section 2 we quickly review the Benjamini-Hochberg (BH) procedure in terms of the two-groups model. We also introduce the weighted BH procedure. In Section 3 the two-groups model is generalized to the situation in which we have covariates (conditional two-groups model), and a range of illustrative applications are mentioned. Subsequently, this model is used to motivate and describe IHW (Section 4). We prove finite sample FDR control of a further modification of this procedure in Section 5. Finally, we provide a survey of related approaches in Section 6.

2. Multiple testing background

Consider testing $m$ distinct hypotheses $H_1, \ldots, H_m$ based on p-values $P_1, \ldots, P_m$. We will write $H_i = 0$ for the null hypotheses and $H_i = 1$ for the alternatives. In this paper, the p-values are assumed to be uniformly distributed under the null (or stochastically larger than uniform), and jointly independent.

A multiple testing procedure will reject $R$ hypotheses, and $V$ of these will be nulls, i.e., it will commit $V$ type-I errors. The generalized type-I error is usually the expectation of a function of $V$ and $R$. For example, the family-wise error rate (FWER) is defined as $\mathrm{FWER} := \mathbb{P}[V \ge 1]$. Here we will mainly focus on the false discovery rate (FDR), defined as the expectation of the false discovery proportion (FDP):

\[
\mathrm{FDR} := \mathbb{E}[\mathrm{FDP}] := \mathbb{E}\left[\frac{V}{R \vee 1}\right]
\]

2.1. The two-groups model and the Benjamini-Hochberg procedure

The two-groups model [17] starts with a Bayesian flavour: instead of deterministic $H_i$, each hypothesis has the same prior probability $\pi_0 = \mathbb{P}[H_i = 0]$ of being null. In addition, p-values are uniformly distributed under the null, and alternative p-values have a distribution $F_{\mathrm{alt}}$ that is stochastically smaller than uniform:

\[
\begin{aligned}
H_i &\sim \mathrm{Bernoulli}(1 - \pi_0) \\
P_i \mid H_i = 0 &\sim U[0,1] \\
P_i \mid H_i = 1 &\sim F_{\mathrm{alt}}
\end{aligned}
\tag{1}
\]

Under this setup, a natural quantity to control is the posterior probability of being a null. This is also sometimes referred to as the Bayesian false discovery rate, denoted by Fdr:

\[
\mathrm{Fdr}(t) = \mathbb{P}[H_i = 0 \mid P_i \le t] = \frac{\pi_0\, t}{F(t)},
\]

where $F(t) := \pi_0\, t + (1 - \pi_0) F_{\mathrm{alt}}(t)$ is the marginal distribution of the p-values under the two-groups model.

The Benjamini-Hochberg procedure now proceeds as follows: First, conservatively estimate $\pi_0$ by $1$, estimate $F(t)$ by the empirical CDF $\hat F(t)$ and then estimate $\mathrm{Fdr}(t)$ by plugging in the above estimators, i.e., $\widehat{\mathrm{Fdr}}(t) = \frac{t}{\hat F(t)}$. In the second step, it chooses $\hat t^* \in [0,1]$ as large as possible, such that $\widehat{\mathrm{Fdr}}(\hat t^*) \le \alpha$, for some pre-specified $\alpha \in (0,1)$. Finally, it rejects all hypotheses with p-value $P_i \le \hat t^*$.
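To make the plug-in recipe concrete, here is a minimal Python sketch (an illustration only, not the authors' implementation): since $\hat F(t) = \#\{i : P_i \le t\}/m$, we have $\widehat{\mathrm{Fdr}}(t) = m\,t / \#\{i : P_i \le t\}$, and it suffices to evaluate this estimate at the observed p-values.

```python
import numpy as np

def bh_threshold(pvals, alpha=0.1):
    """Largest order statistic t with Fdr_hat(t) = m * t / #{P_i <= t} <= alpha.

    Rejecting all P_i <= t reproduces the rejections of the BH step-up procedure.
    """
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    fdr_hat = m * p / np.arange(1, m + 1)   # Fdr_hat evaluated at P_(1) <= ... <= P_(m)
    ok = np.nonzero(fdr_hat <= alpha)[0]
    return p[ok[-1]] if ok.size > 0 else 0.0

# toy example: 900 null p-values and 100 p-values from a (hypothetical) alternative
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=900), rng.beta(0.3, 4.0, size=100)])
t_hat = bh_threshold(pvals, alpha=0.1)
print(t_hat, np.sum(pvals <= t_hat))
```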
When the two-groups model holds with $\pi_0 < 1$ (and a reasonable alternative distribution), then asymptotically ($m \to \infty$) the procedure controls the Bayesian Fdr and $\hat t^*$ converges almost surely to $t^*$, where $t^* := \sup\left\{t \in [0,1] \,\middle|\, \frac{t}{F(t)} \le \alpha\right\}$ [45].

More importantly, the procedure also controls the frequentist FDR, even when the two-groups model does not hold, each alternative p-value has a different distribution, and the $H_i$ are considered deterministic. This shows the robustness of the $\widehat{\mathrm{Fdr}}(t)$ estimator and is related to the nonparametric properties of the empirical distribution function. In fact, it is a surprising result that the FDR is controlled despite the greedy choice of the threshold $\hat t^*$ [17].

2.2. The weighted Benjamini-Hochberg procedure

Despite the favourable properties of the BH procedure, one of its major shortcomings is that it ignores heterogeneity across hypotheses. For example, consider the following generalization of the two-groups model in which each hypothesis has its own prior probability $\pi_{0,i}$ and alternative distribution $F_{\mathrm{alt},i}$:

\[
\begin{aligned}
H_i &\sim \mathrm{Bernoulli}(1 - \pi_{0,i}) \\
P_i \mid H_i = 0 &\sim U[0,1] \\
P_i \mid H_i = 1 &\sim F_{\mathrm{alt},i}
\end{aligned}
\tag{2}
\]

By ignoring this heterogeneity, the BH procedure pays a price in terms of power loss. However, we emphasize that type-I error control is still guaranteed.

While there are multiple ways to exploit such heterogeneity (Section 6), a very general approach is to assign a weight $W_i$ to each hypothesis. The $W_i$ satisfy $W_i \ge 0$ and $\sum_{i=1}^m W_i = m$. The weighted BH procedure (Algorithm 1, [21]) then simply applies the ordinary BH procedure to the weighted p-values $P_i / W_i$. Thus hypotheses with $W_i > 1$ get prioritized.

Algorithm 1: The weighted Benjamini-Hochberg method
Input: a nominal level $\alpha \in (0,1)$, a vector of p-values $P_1, \ldots, P_m$ and weights $W_1, \ldots, W_m \ge 0$ with $\sum_{i=1}^m W_i = m$
1. Let $Q_i = \frac{P_i}{W_i}$ (with $Q_i = 0$ for $P_i = 0$, and $Q_i = \infty$ for $W_i = 0$, $P_i \ne 0$)
2. Let $Q_{(1)}, \ldots, Q_{(m)}$ be the order statistics of $Q_1, \ldots, Q_m$ and let $Q_{(0)} := 0$
3. Let $k^* = \max\left\{k \,\middle|\, Q_{(k)} \le \frac{k\alpha}{m},\ k \ge 0\right\}$
4. Reject all hypotheses with $Q_i \le Q_{(k^*)}$

If the weights are chosen a-priori, i.e., without looking at the p-values, then the weighted BH procedure also controls the frequentist FDR [21]. In particular, most of the literature to date only considers the case of deterministic weights [8, 23, 40]. These results are very valuable, since weighted multiple testing procedures have been shown to be robust to weight misspecification [21]: choosing good weights can lead to huge increases in power, yet "bad" weights will only slightly decrease power compared to the unweighted procedure. This has led to numerous papers heuristically suggesting weights for specific scientific applications (e.g. [12, 29, 36, 54]). As an example of principled prior guessing, in [14, 20] an elegant approach is developed for hypothesis weighting based on effect size information from a prior experiment.

In contrast, the major goal of this work is to allow the weights to depend also on the p-values in a data-driven way, while still controlling the FDR. Indeed, the weights will have to depend on the p-values if we want to develop a powerful procedure that is practical and can be used out-of-the-box by practitioners (e.g., computational biologists) without laborious modeling being a prerequisite (but see also Subsection 6.5 for a counter-point).
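A corresponding Python sketch of Algorithm 1 (again only an illustration, not the implementation of the IHW package) is given below; with all weights equal to one it reduces to the ordinary BH procedure.

```python
import numpy as np

def weighted_bh(pvals, weights, alpha=0.1):
    """Weighted Benjamini-Hochberg (Algorithm 1): ordinary BH applied to Q_i = P_i / W_i."""
    pvals = np.asarray(pvals, dtype=float)
    weights = np.asarray(weights, dtype=float)
    m = len(pvals)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), m)
    with np.errstate(divide="ignore", invalid="ignore"):
        q = np.where(pvals == 0, 0.0, np.where(weights == 0, np.inf, pvals / weights))
    q_sorted = np.sort(q)
    ok = np.nonzero(q_sorted <= alpha * np.arange(1, m + 1) / m)[0]
    if ok.size == 0:
        return np.zeros(m, dtype=bool)      # no rejections
    k_star = ok[-1] + 1                     # the index k* of Algorithm 1
    return q <= q_sorted[k_star - 1]        # reject all hypotheses with Q_i <= Q_(k*)

# with unit weights this is the ordinary BH procedure
rng = np.random.default_rng(1)
p = np.concatenate([rng.uniform(size=900), rng.beta(0.3, 4.0, size=100)])
print(weighted_bh(p, np.ones(1000)).sum())
```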
3. Conditional two-groups model

The starting point for developing a new procedure for data-driven hypothesis weighting is an extension of the two-groups model (1), so that it better approximates the very general model in (2), wherein each hypothesis has its own prior probability $\pi_{0,i}$ and alternative distribution $F_{\mathrm{alt},i}$. The idea is that differences in $\pi_{0,i}$ and $F_{\mathrm{alt},i}$ across hypotheses can be explained by an observed random covariate $X_i \in \mathcal{X}$ (see also [27]). For example, we model the prior probability $\pi_0$ as a function $\mathcal{X} \to [0,1]$. We call this the conditional two-groups model (3):

\[
\begin{aligned}
X_i &\sim \mathbb{P}^X \\
H_i \mid X_i &\sim \mathrm{Bernoulli}(1 - \pi_0(X_i)) \\
P_i \mid (H_i = 0, X_i) &\sim U[0,1] \\
P_i \mid (H_i = 1, X_i) &\sim F_{\mathrm{alt} \mid X_i}
\end{aligned}
\tag{3}
\]

Marginalizing over $H_i$, we get that

\[
P_i \mid X_i = x \sim F(\cdot \mid x), \qquad F(t \mid x) := \pi_0(x)\, t + (1 - \pi_0(x))\, F_{\mathrm{alt} \mid X_i = x}(t).
\]

Thus, in the conditional two-groups model, the triples $(P_i, X_i, H_i)$ are assumed to be exchangeable, and since we observe $X_i$ we have more flexibility in our modeling than in the two-groups situation. Later we will model the weights as a function of $X_i$.

As with the two-groups model, we only require the conditional two-groups model to be a reasonable approximation to the truth. In practice, the critical component is the conditional independence of $X_i$ and $P_i$ under the null hypothesis. In addition, we can only expect power gains when $F_{\mathrm{alt} \mid X_i = x}$ and $\pi_0(x)$ are indeed not constant as functions of $x$.

We now demonstrate that the existence of such covariates $X_i$ is a weak assumption, since they are available in many applications.

3.1. Domain-specific covariates: "Co-data"

In many scientific applications, such covariates are apparent to domain scientists: these covariates are usually related to the true effect sizes of the individual hypotheses or to their prior probability of being true. This relation is the result of either a known causal relationship between the covariate and the hypotheses being tested, or of previous, related studies having shown an association. In any case, for such domain-specific covariates to be independent of the p-values under the null hypothesis, the critical aspect is that they should not have been used in any way for the marginal hypothesis testing. Here we mention some examples; see [26] for additional ones:

• In neuroscience, we are now able to simultaneously measure the activity of hundreds of neurons. A scientific question of interest is whether two neurons are in synchrony [43]. Here, we know that neurons which are in close proximity to each other are a-priori more likely to be interacting. Thus, the geometric distance between the neurons can be used as a covariate for rejecting the null hypothesis that the neurons are not interacting.

• There might exist p-values from a previous, but related experiment: for example, in [20] data from previous, independent genome-wide association studies (GWAS) for related diseases are used to increase the power of a longevity-related GWAS study.

The widespread existence of such covariates was also observed in [53], who used the term "co-data" to describe them and developed a weighted ridge regression procedure, with data-driven weights based on the co-data.

3.2. Statistical covariates

In single hypothesis testing, classical theory often dictates which test statistics should be used under optimality considerations. All other information can essentially be discarded or should be conditioned on. In this section, we want to illustrate that such information, which is irrelevant in single hypothesis testing, can be embedded in the conditional two-groups framework and can help increase the power of the resulting multiple testing procedure; sometimes dramatically so.
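Before turning to the specific examples, the following minimal simulation sketch illustrates the structure of the conditional two-groups model (3); the particular choices of $\pi_0(x)$ and of the alternative distribution below are hypothetical and serve only to show that null p-values remain uniform given the covariate, while the marginal distribution $F(t \mid x)$ varies with $x$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
m = 10_000

x = rng.uniform(size=m)                 # covariate X_i ~ P^X
pi0 = 0.95 - 0.5 * x                    # hypothetical prior pi_0(x), decreasing in x
h = rng.binomial(1, 1.0 - pi0)          # H_i | X_i ~ Bernoulli(1 - pi_0(X_i))
mu = 1.0 + 2.0 * x                      # hypothetical effect size, increasing in x
z = rng.normal(loc=h * mu)              # test statistics
p = 1.0 - stats.norm.cdf(z)             # one-sided p-values

# under the null, the p-values are U[0,1] regardless of the covariate ...
print(stats.kstest(p[h == 0], "uniform").pvalue)
# ... but the marginal F(t | x) is stochastically smaller for large x
print(np.mean(p[x > 0.8] <= 0.01), np.mean(p[x < 0.2] <= 0.01))
```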
3.2.1. Sample size $N_i$

A very generic covariate, with many applications, is the sample size $N_i$, when it differs across tests. Note that if the test statistic is continuous and the null hypothesis is simple, then the p-value under the null will still be uniformly distributed independently of $N_i$. It is also reasonable to assume that the prior probability of a hypothesis being true does not depend on $N_i$, i.e., $\pi_0(N_i) = \pi_0$. However, the alternative distribution does depend on $N_i$: for higher $N_i$ we have more power. For example, consider a one-sided z-test in which we observe independent $X^i_1, \ldots, X^i_{N_i} \sim \mathcal{N}(\mu H_i, 1)$ with $\mu > 0$ and use $P_i = 1 - \Phi\left(\sqrt{N_i}\, \bar X^i\right)$ as our statistic, where $\bar X^i$ is the sample mean. Then the conditional two-groups model applies with $\pi_0(N_i) = \pi_0$ and

\[
F_{\mathrm{alt}}(t \mid N_i) = 1 - \Phi\left(\Phi^{-1}(1 - t) - \sqrt{N_i}\, \mu\right).
\]

While this example is trivial, it is instructive: if one wants to maximize discoveries, then one expects that hypotheses with large sample sizes should be prioritized. The methods described here will be able to accomplish this automatically and hence increase power. Yet, in some cases, this might not necessarily be desirable: p-values are often criticized for reflecting sample size more than effect size, and optimal weights would amplify this effect.

Remark 1. Technically, optimal weighting automatically adjusts to very large sample sizes [22, 34, 37]: hypotheses with a very large sample size (or effect size) should be down-weighted; this phenomenon is called size-investing. However, in many practically relevant situations, it will still be the case that a larger sample size attracts a larger optimal weight to a hypothesis.

3.2.2. Overall variance (independent of label) in two-sample tests

For a more interesting example, consider two-sample testing for equality of means. To simplify the discussion, assume that for the $i$-th hypothesis we observe

\[
X_{i,1}, \ldots, X_{i,n} \sim \mathcal{N}(\mu_{X,i}, \sigma_i^2) \quad \text{and} \quad Y_{i,1}, \ldots, Y_{i,n} \sim \mathcal{N}(\mu_{Y,i}, \sigma_i^2)
\]

(everything independent). We are interested in testing $H_i : \mu_{X,i} = \mu_{Y,i}$, and we do not know $\sigma_i$. The optimal test statistic for this situation is the two-sample t-statistic

\[
T_i = \sqrt{\frac{n}{2}}\; \frac{\bar X_i - \bar Y_i}{\sqrt{\dfrac{S^2_{X,i} + S^2_{Y,i}}{2}}}.
\]

Here $\bar X_i$ (resp. $\bar Y_i$) is the sample mean of $X_{i,1}, \ldots, X_{i,n}$ (resp. $Y_{i,1}, \ldots, Y_{i,n}$). Similarly, $S^2_{X,i}$ (resp. $S^2_{Y,i}$) is the corresponding sample variance.

In addition, denote by $\hat\mu_i := \frac{1}{2}\left(\bar X_i + \bar Y_i\right)$ and $S_i^2$ the sample mean and sample variance after pooling all observations ($X_{i,1}, \ldots, X_{i,n}$ and $Y_{i,1}, \ldots, Y_{i,n}$) and forgetting their labels.

Now note that under the null hypothesis $\mu_{X,i} = \mu_{Y,i} =: \mu_i$ and we have $X_{i,1}, \ldots, X_{i,n}, Y_{i,1}, \ldots, Y_{i,n} \sim \mathcal{N}(\mu_i, \sigma_i^2)$ i.i.d. Then $(\hat\mu_i, S_i^2)$ is a complete sufficient statistic for the experiment, while $T_i$ is ancillary. Thus, by Basu's theorem, when the $i$-th hypothesis is null ($H_i = 0$), $(\hat\mu_i, S_i^2)$ is independent of $T_i$ and we can use it as a covariate. As mentioned above, while irrelevant in single hypothesis testing, $(\hat\mu_i, S_i^2)$ can be extremely useful in a multiple testing setting.

First, consider $S_i^2$ and note that under the null it follows a scaled $\chi^2$ distribution. On the other hand, under the alternative, we expect $S_i^2$ to take larger values with high probability, especially for large values of $|\mu_{X,i} - \mu_{Y,i}|$. Therefore, if we are doing $m$ t-tests, each with unknown variance $\sigma_i^2$, and if we assume $\sigma_i \sim G$ for some concentrated distribution $G$, then hypotheses with high $S_i^2$ are more likely to be true alternatives (and also likely to be alternatives with high power). Thus, the overall variance (ignoring sample labels) is independent of the p-values under the null hypothesis, yet informative about the alternative, and can lead to a large power increase in simultaneous two-sample t-tests [10, 26].
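Both facts, independence under the null (via Basu's theorem) and informativeness under the alternative, can be checked in a small simulation sketch; the sample size and effect size below are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 20, 5_000

def one_test(delta):
    """Return (two-sample t-test p-value, pooled variance ignoring labels)."""
    x = rng.normal(0.0, 1.0, size=n)
    y = rng.normal(delta, 1.0, size=n)
    return stats.ttest_ind(x, y).pvalue, np.var(np.concatenate([x, y]), ddof=1)

null = np.array([one_test(0.0) for _ in range(reps)])
alt = np.array([one_test(1.5) for _ in range(reps)])

# under the null, the pooled variance carries no information about the p-value
print(np.corrcoef(null[:, 0], null[:, 1])[0, 1])   # approximately zero
# under the alternative, large pooled variances go together with small p-values
print(np.corrcoef(alt[:, 0], alt[:, 1])[0, 1])     # clearly negative
```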
For a second example of the usefulness of $(\hat\mu_i, S_i^2)$ in this setting, consider the screening statistic $\frac{|\hat\mu_i|}{S_i}$. This can be interpreted as a statistic for the null hypothesis $\mu_{X,i} = \mu_{Y,i} = 0$. If we believe a-priori that for many of the hypotheses $i$ with $\mu_{X,i} = \mu_{Y,i}$ a sparsity condition holds, so that in fact $\mu_{X,i} = \mu_{Y,i} = 0$ [30], then large values of this statistic are more likely to correspond to alternatives. Note that we did not have to actually re-specify our null hypothesis from $\mu_{X,i} = \mu_{Y,i}$ to $\mu_{X,i} = \mu_{Y,i} = 0$. We instead just used the latter to derive our covariate and are still testing for $\mu_{X,i} = \mu_{Y,i}$.

3.2.3. Ratio of the number of observations in each group in two-sample tests

For yet another example, revisit the two-sample situation, but now assume that for the $i$-th hypothesis we have $n_{1,i}$ observations from the first population and $n_{2,i}$ observations from the second population, such that $n_{1,i} + n_{2,i} = n_i$. Then

\[
\frac{n_{1,i}\,(n_i - n_{1,i})}{n_i^2}
\]

is a statistic which is related to the alternative distribution, with values close to $\frac{1}{4}$ implying higher power [39]. This statistic is also related to the Minor Allele Frequency (MAF) in genome-wide association studies [9].

3.2.4. Sign of estimated effect size

As a final example of a statistical covariate, consider a two-sided test, where the null distribution is symmetric and the test statistic is the absolute value of a symmetric statistic $T_i$. Then the sign of $T_i$ is independent of the p-value under the null hypothesis. However, we might a-priori believe that among the alternatives, more hypotheses have a true positive effect size rather than a negative one, or vice versa. Thus, the sign could also be used as an informative covariate. The idea of using the sign to improve power, while controlling the FDR, can actually be traced back to the early days of microarrays, where it was implemented in the SAM (significance analysis of microarrays) procedure [51].

3.3. p-value covariates: Multivariate p-values

In many situations, we have bivariate (or even multivariate) p-values for each hypothesis which are independent under the null. For a toy example, we revisit the two-sample t-test of Section 3.2.2, this time with known variance $\sigma_i^2 = 1$. For testing we again use $T_i$ as our statistic (which is suboptimal in this case). Then the overall variance (ignoring labels) can be converted into a second p-value for $H_i$, noting that it is $\chi^2$-distributed under the null hypothesis [15]. In a more realistic setting, multivariate p-values are also available in heterogeneous multi-omics experiments in biology [1] (e.g., a p-value based on transcriptomics measurements and one based on proteomics for each hypothesis).

In such a situation, the methods described here can be applied by considering one p-value as primary and the rest as secondary (i.e., as covariates). Nevertheless, for this situation, specialized methods have been developed [1, 15] and are preferable, since they consider the components of the multidimensional p-values in a symmetric way and can also profit from the additional distributional properties inherent to p-values.
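As a toy sketch of this conversion (known variance $\sigma_i^2 = 1$, a single simulated null hypothesis), the pooled variance of the $2n$ observations can be turned into a secondary p-value via its null $\chi^2_{2n-1}$ distribution; the numbers below are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 20                                   # observations per group, known sigma^2 = 1

x = rng.normal(0.0, 1.0, size=n)
y = rng.normal(0.0, 1.0, size=n)         # a null hypothesis: equal means

# primary p-value: two-sample t-test (suboptimal here since sigma is known)
p_primary = stats.ttest_ind(x, y).pvalue

# secondary p-value: under the null, (2n - 1) * S_i^2 ~ chi^2 with 2n - 1 degrees of freedom
s2_pooled = np.var(np.concatenate([x, y]), ddof=1)
p_secondary = stats.chi2.sf((2 * n - 1) * s2_pooled, df=2 * n - 1)

print(p_primary, p_secondary)
```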
4. Data-driven hypothesis weighting

4.1. An oracle procedure: IHWoracle

We will develop the IHW (Independent Hypothesis Weighting) procedure using an analogous motivation to that of the BH procedure. We consider a Bayesian oracle which knows the conditional two-groups model (3). Rather than returning a single threshold $t \in [0,1]$ and rejecting hypotheses with $P_i \le t$, the new oracle returns a function $g : \mathcal{X} \to [0,1]$ in a function class $\mathcal{G}$ and rejects hypotheses with $P_i \le g(X_i)$. In this setting, the posterior probability of being a null (conditionally on rejection) is

\[
\mathbb{P}[H_i = 0 \mid H_i \text{ rejected}] = \mathbb{P}[H_i = 0 \mid P_i \le g(X_i)] = \frac{\int_{\mathcal{X}} \pi_0(x)\, g(x)\, d\mathbb{P}^X}{\int_{\mathcal{X}} F(g(x) \mid x)\, d\mathbb{P}^X}.
\tag{4}
\]

The BH-type oracle maximized $t$ subject to a constraint on the posterior probability. This is equivalent to maximizing power ($F(t) = \mathbb{P}[P_i \le t]$). Note that in the conditional two-groups model the power satisfies

\[
\mathbb{P}[P_i \le g(X_i)] = \int_{\mathcal{X}} F(g(x) \mid x)\, d\mathbb{P}^X.
\tag{5}
\]

This leads us to consider the threshold function $g$ that maximizes power subject to a constraint on the posterior probability of making a type-I error:

\[
\begin{aligned}
\underset{g \in \mathcal{G}}{\text{maximize}} \quad & \int_{\mathcal{X}} F(g(x) \mid x)\, d\mathbb{P}^X \\
\text{subject to} \quad & \frac{\int_{\mathcal{X}} \pi_0(x)\, g(x)\, d\mathbb{P}^X}{\int_{\mathcal{X}} F(g(x) \mid x)\, d\mathbb{P}^X} \le \alpha
\end{aligned}
\tag{6}
\]

Replacing $\mathbb{P}^X$ by the empirical measure we get:

\[
\begin{aligned}
\underset{g \in \mathcal{G}}{\text{maximize}} \quad & \frac{1}{m} \sum_{i=1}^m F(g(X_i) \mid X_i) \\
\text{subject to} \quad & \frac{\frac{1}{m} \sum_{i=1}^m \pi_0(X_i)\, g(X_i)}{\frac{1}{m} \sum_{i=1}^m F(g(X_i) \mid X_i)} \le \alpha
\end{aligned}
\tag{7}
\]

As in the case of the BH procedure, we would like a procedure that (a) approximates the oracle procedure when the assumptions of the conditional two-groups model are met and (b) is robust to misspecification of the conditional two-groups model. In particular, even given an adversary that manipulates our oracle, our procedure should only be affected in its power, but it should still control the FDR. The goal is, of course, that even if the conditional two-groups model is a coarse approximation to the truth and we have a good guess of that approximation, we will still be able to gain power compared to BH and control the FDR.

It turns out that to construct such a procedure (called IHWoracle henceforth, see Algorithm 2), we simply need to rescale the oracle threshold function from (7) and then apply the weighted BH procedure. We call the procedure IHWoracle since we assume we have some prior source (which could be the optimal oracle, but not necessarily) which informs us about $F(t \mid x)$ and $\pi_0(\cdot)$.

Algorithm 2: The IHWoracle procedure
1. Let $g$ be a solution of optimization problem (7)
2. For $i \in \{1, \ldots, m\}$, set $W_i = 1$ if $g(X_i) = 0$ for all $i$; otherwise set $W_i = \frac{m\, g(X_i)}{\sum_{i=1}^m g(X_i)}$
3. Apply the weighted BH procedure (Algorithm 1) to $((P_i, W_i))_i$

In particular, while motivated by the conditional two-groups model (3), the validity of this method will depend on the following more general distributional assumption:

Assumption 1 (Distributional setting). Let $(P_i, X_i)$, $i \in \{1, \ldots, m\}$, be mutually independent (p-value, covariate) pairs and let $\mathcal{H}_0 \subset \{1, \ldots, m\}$ be the index set of null hypotheses. Also assume that for $i \in \mathcal{H}_0$ it holds that $P_i \perp X_i$ (independence) and that $P_i$ is (super)uniform, i.e., $\mathbb{P}[P_i \le t] \le t$.

To show the validity of the IHWoracle procedure, we first show:

Theorem 1. Consider a measurable weighting function $\mathbf{W} = (W_1, \ldots, W_m) : \mathcal{X}^m \to [0, m]^m$ that depends only on the covariates $\mathbf{X} = (X_1, \ldots, X_m)$ and is such that $\sum_{i=1}^m W_i(\mathbf{X}) = m$ almost surely. Then, under Assumption 1, the weighted BH procedure (Algorithm 1) with p-values $\mathbf{P} = (P_1, \ldots, P_m)$ and weights $\mathbf{W}(\mathbf{X})$ controls the FDR at the pre-specified level $\alpha \in (0,1)$.

Proof. All the steps of the proof of Theorem 2 with $\tau = 1$ go through essentially unchanged.
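Before stating the corollary, here is a minimal sketch of the rescaling step of Algorithm 2, which is what makes Theorem 1 applicable: any threshold function of the covariates is turned into non-negative weights that depend only on the covariates and average to one. The threshold function used below is a hypothetical stand-in, not a solution of (7).

```python
import numpy as np

def thresholds_to_weights(g_of_x):
    """Step 2 of Algorithm 2: rescale g(X_1), ..., g(X_m) into weights that average to one."""
    g = np.asarray(g_of_x, dtype=float)
    if g.sum() == 0.0:
        return np.ones_like(g)           # degenerate case: fall back to the unweighted procedure
    return len(g) * g / g.sum()

# hypothetical stand-in for a solution of (7): larger thresholds for larger covariate values
x = np.random.default_rng(5).uniform(size=1000)
w = thresholds_to_weights(0.01 * x ** 2)
print(w.min(), w.mean(), w.max())        # weights are >= 0 and average to exactly one

# since these weights are a function of the covariates only, Theorem 1 guarantees that
# the weighted BH procedure (Algorithm 1) applied to (P_i, w_i) controls the FDR
```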
Corollary 1 (IHWoracle). Let $F(t \mid x)$ be an arbitrary (but deterministic) conditional distribution function and $\pi_0(\cdot)$ an arbitrary (deterministic) prior probability function, as specified in the conditional two-groups model (3). Then, under Assumption 1, the IHWoracle procedure controls the FDR at the nominal level $\alpha$:

\[
\mathrm{FDR}_{\mathrm{IHWoracle}} \le \alpha
\]

Proof. Just observe that the IHWoracle rule of assigning weights exactly fulfills the conditions specified in Theorem 1.

We note that Theorem 1 is applicable more generally. For instance, let us revisit the two-sample t-test from Section 3.2.2, where for simplicity we assume $m = 2m'$, $m' \in \mathbb{N}$. Now, consider the following procedure: apply the BH procedure to the $m'$ hypotheses with the highest overall sample variance (independent of sample label). This is equivalent to assigning weight $W_i = 2$ to hypotheses above that cutoff, $W_i = 0$ to the rest, and applying the weighted BH procedure. Theorem 1 now gives a rigorous justification for the validity of such a procedure: something that has been observed in previous work [10, 49] and shown to lead to a strong power increase in applications [10].

4.2. Bird's eye view of IHW

So far, it might seem like we have not made a lot of progress: we have shifted the a-priori guessing from the weights [21] to guessing $\pi_0(x)$ and $F(t \mid x)$. However, the latter will enable us to develop a data-driven procedure.

The first approach that comes to mind is to replace $\pi_0(x)$ and $F(t \mid x)$ in the IHWoracle procedure (Algorithm 2) in a plug-in fashion by corresponding estimators. However, such a procedure will have bad finite sample properties, even in cases where it might be asymptotically consistent [26]. This is the case because we are overfitting by learning $\pi_0(x)$ and $F(t \mid x)$ from the same $(P_i, X_i)$ pairs to which we then apply the weighted BH procedure.

To overcome this, we introduce randomization to the IHWoracle procedure by means of hypothesis splitting. The resulting algorithm, called IHW, is presented in its general form in Algorithm 3. The subsections below will further elaborate on the individual components of IHW, as well as provide a discussion and guidelines for practical implementations.

Algorithm 3: IHW (Independent Hypothesis Weighting)
(IHWsplit) Randomly split the hypotheses into $K$ folds
for $l \in \{1, \ldots, K\}$ do
  Let $I_l \subset \{1, \ldots, m\}$ be the index set of hypotheses in fold $l$
  (IHW1) Estimate $\pi_0(x)$ and $F(t \mid x)$ using $\{(P_i, X_i) \mid i \in \{1, \ldots, m\} \setminus I_l\}$
  (IHW2) Let $g(\cdot)$ be a solution of optimization problem (7), plugging in the estimated $\pi_0(x)$ and $F(t \mid x)$ from the previous step and $\{X_i \mid i \in I_l\}$
  (IHW3) For $i \in I_l$, set $W_i = 1$ if $g(X_i) = 0$ for all $i \in I_l$; otherwise set $W_i = \frac{|I_l|\, g(X_i)}{\sum_{i \in I_l} g(X_i)}$
end
(wBH) Apply the weighted BH procedure (Algorithm 1) to $((P_i, W_i))_i$

4.3. Hypothesis splitting approach

As has been recently demonstrated in a plethora of papers [16, 32, 48, 52], introducing external randomness to statistical procedures can often be the key to tractable inference for high-dimensional problems. The randomization protects against overfitting and leads to increased power. The hypothesis splitting step (IHWsplit) in Algorithm 3 is in exactly this spirit. To further motivate the hypothesis splitting approach, we first consider a cross-validation procedure.

4.3.1. Cross-validation

Assume that rather than controlling the FDR, our goal is to estimate the true threshold function $g$, which is the solution of (6). Also assume that we have an estimator which depends on some tuning parameters $\lambda$.
Then we could use cross-validation: split the hypotheses into $K$ folds and then, for each fold, apply the estimator to the held-out (training) folds and use the test fold to estimate the quality of the estimator. For example, one could use the threshold function learned from the training folds to derive weights for the test fold as in step (IHW3) of Algorithm 3. Then one could apply the weighted BH procedure to the test fold and calculate the number of discoveries. Finally, one would choose the tuning parameter $\lambda$ that, across the $K$ test folds, led on average to the highest number of discoveries.

Such a procedure can lead to reasonable estimates of the weight function (see also Section 4.4.4). However, cross-validation is not related to our goal of controlling the FDR: even with a cross-validated weight function, there is no guarantee that the resulting procedure controls the FDR and does not overfit.
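For reference, the fold structure of Algorithm 3 discussed above can be sketched in Python as follows; the weight-learning step is a deliberately crude, hypothetical stand-in (binning the covariate range and scoring bins by the fraction of small p-values in the other folds), not the estimation of $\pi_0(x)$ and $F(t \mid x)$ that IHW actually performs.

```python
import numpy as np

def learn_threshold_function(p_other, x_other, n_bins=5):
    """Hypothetical stand-in for (IHW1)-(IHW2): score covariate bins of the *other*
    folds by their fraction of small p-values and use the scores as g(x)."""
    edges = np.quantile(x_other, np.linspace(0.0, 1.0, n_bins + 1))
    scores = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (x_other >= lo) & (x_other <= hi)
        scores.append(np.mean(p_other[in_bin] <= 0.05) if in_bin.any() else 0.0)
    scores = np.asarray(scores) + 1e-8          # keep scores strictly positive

    def g(x_new):                               # piecewise-constant function of the covariate
        idx = np.clip(np.searchsorted(edges, x_new, side="right") - 1, 0, n_bins - 1)
        return scores[idx]

    return g

def ihw_fold_weights(pvals, covariates, n_folds=5, seed=0):
    """Fold structure of Algorithm 3: weights for fold l are learned from the other folds only."""
    pvals, covariates = np.asarray(pvals), np.asarray(covariates)
    m = len(pvals)
    fold = np.random.default_rng(seed).integers(0, n_folds, size=m)            # (IHWsplit)
    weights = np.ones(m)
    for l in range(n_folds):
        held_out = fold == l
        g = learn_threshold_function(pvals[~held_out], covariates[~held_out])  # (IHW1)-(IHW2)
        g_vals = g(covariates[held_out])
        if g_vals.sum() > 0:                     # (IHW3): weights average to one within the fold
            weights[held_out] = held_out.sum() * g_vals / g_vals.sum()
    return weights                               # (wBH): pass (P_i, W_i) to Algorithm 1
```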