ebook img

Limits on Support Recovery with Probabilistic Models: An Information-Theoretic Framework PDF

1.2 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Limits on Support Recovery with Probabilistic Models: An Information-Theoretic Framework

1 Limits on Support Recovery with Probabilistic Models: An Information-Theoretic Framework Jonathan Scarlett and Volkan Cevher Abstract—The support recovery problem consists of deter- where S 1,...,p represents the set of relevant vari- mining a sparse subset of a set of variables that is relevant in ables, X ⊆R{p is a m}easurement vector, X (respectively, S generating a set of observations, and arises in a diverse range ∈ β ) is the subvector of X (respectively, β ) containing the S S of settings such as compressive sensing, and subset selection entries indexed by S, and P is a given probability 6 in regression, and group testing. In this paper, we take a Y|XSβS 1 unified approach to support recovery problems, considering distribution. Given a collection of measurements Y Rn 0 generalprobabilisticmodelsrelatingasparsedatavectortoan andthecorrespondingmeasurementmatrixX Rn×p∈(with 2 observation vector. We study the information-theoretic limits each row containing a single measurement vec∈tor), the goal of both exact and partial support recovery, taking a novel g is to find the conditions under which the support S can approach motivated by thresholding techniques in channel u be recovered either perfectly or partially. In this paper, coding. We provide general achievability and converse bounds A characterizingthetrade-offbetweentheerrorprobabilityand we study the information-theoretic limits for this problem, numberofmeasurements,andwespecializethesetothelinear, characterizing the number of measurements n required in 0 3 1-bit, and group testing models. In several cases, our bounds terms of the sparsity level k and ambient dimension p not only provide matching scaling laws in the necessary and regardlessofthecomputationalcomplexity.Suchstudiesare sufficient number of measurements, but also sharp thresholds ] useful for assessing the performance of practical techniques T with matching constant factors. Our approach has several I advantages over previous approaches: For the achievability and determining to what extent improvements are possible. . part, we obtain sharp thresholds under broader scalings of Before proceeding, we state some important examples of s c the sparsity level and other parameters (e.g., signal-to-noise models that are captured by (1). [ ratio)comparedtoseveralpreviousworks,andfortheconverse Linear Model: The linear model [5], [6] is ubiquitous part, we not only provide conditions under which the error 3 probability fails to vanish, but also conditions under which it in signal processing, statistics, and machine learning, and v tends to one. in itself covers an extensive range of applications. Each 0 observation takes the form 4 Index Terms—Support recovery, sparsity pattern recovery, 4 information-theoretic limits, compressive sensing, non-linear Y = X,β +Z, (2) 7 models, 1-bitcompressive sensing, grouptesting, phase transi- (cid:104) (cid:105) 0 tions, strong converse where , denotes the inner product, and Z is additive (cid:104)· ·(cid:105) . noise. An important quantity in this setting is the signal-to- 1 0 noise ratio (SNR) E[(cid:104)X,β(cid:105)2], and in the context of support I. INTRODUCTION E[Z2] 5 recovery,thesmallestnon-zeroabsolutevalueβ inβ has min 1 The support recovery problem consists of determining a also been shown to play a key role [5], [7], [8]. : v sparsesubsetofasetofvariablesthatisrelevantinproduc- Quantized Linear Models: Quantized variants of the i ingasetofobservations,andarisesfrequentlyindisciplines linear model are of significant interest in applications with X such as group testing [1], [2], compressive sensing (CS) hardware limitations. An example that we will consider in r a [3], and subset selection in regression [4]. The observation this paper is the 1-bit model [9], given by models can vary significantly among these disciplines, and (cid:0) (cid:1) Y =sign X,β +Z , (3) it is of considerable interest to consider these in a unified (cid:104) (cid:105) fashion. This can be done via probabilistic models relating where the sign function equals 1 if its argument is non- the sparse vector β Rp to a single observation Y R in negative, and 1 if it is negative. the following manne∈r: ∈ Group Testi−ng: Studies of group testing problems began several decades ago [10], [11], and have recently regained (Y S =s,X =x,β =b) P ( x ,b ), (1) | ∼ Y|XSβS ·| s s significant attention [2], [12], with applications including medical testing, database systems, computational biology, The authors are with the Laboratory for Information and Inference andfaultdetection.Thegoalistodetermineasmallnumber Systems(LIONS),ÉcolePolytechniqueFédéraledeLausanne(EPFL)(e- mail:{jonathan.scarlett,volkan.cevher}@epfl.ch). of “defective” items within a larger subset of items. The This work was supported in part by the European Commission under itemsinvolvedinasingletestareindicatedbyX 0,1 p, Grant ERC Future Proof, SNF 200021-146750 and SNF CRSII2-147633, ∈{ } and each observation takes the form andEPFLFellowsHorizon2020grant665667. (cid:26) (cid:27) ThisworkwaspresentedinpartattheIEEEInternationalSymposiumon (cid:91) InformationTheory(2015),andattheACM-SIAMSymposiumonDiscrete Y =1 Xi =1 Z, (4) { } ⊕ Algorithms(2016). i∈S 2 with S representing the defective items, Y indicating 2. We explicitly provide the constant factors in our whether the test contains at least one defective item, and Z bounds, allowing for more precise characterizations representing possible noise (here denotes modulo-2 addi- of the performance compared to works focusing on ⊕ tion). In this setting, one can think of β as deterministically scaling laws (e.g., see [5], [8], [20]). In several cases, having entries equaling one on S, and zero on Sc. the resulting necessary and sufficient conditions on the The above examples highlight that (1) captures both number of measurements coincide up to a multiplica- discrete and continuous models. Beyond these examples, tive 1 + o(1) term, thus providing exact asymptotic several other non-linear models are captured by (1), includ- thresholds (sometimes referred to as phase transitions ing the logistic, Poisson, and gamma models. [24], [30]) on the number of measurements. 3. As evidenced in our examples outlined below, our framework often leads to such exact or near-exact A. Previous Work and Contributions thresholds for significantly more general scalings of k, Numerous previous works on the information-theoretic SNR, etc. compared to previous works. limits of support recovery have focused on the linear model 4. The majority of previous works have developed con- [5],[7],[8],[13]–[22].Themainaimoftheseworks,andof verse results using Fano’s inequality, leading to nec- thatthepresentpaper,istodevelopnecessaryandsufficient essary conditions for P[error] 0. In contrast, → conditions for which an “error probability” vanishes as our converse results provide necessary conditions for p . However, there are several distinctions that can P[error] 1. The distinction between these two con- → ∞ (cid:54)→ be made, including: ditions is important from a practical perspective: One • Randommeasurementmatrices[5],[7],[8],[13]vs.ar- may not expect a condition such as P[error] 10−10 bitrary measurement matrices [16], [18], [19]; to be significant, whereas the condition P[err≥or] 1 → • Exact support recovery [5], [7], [8], [13] vs. partial is inarguably so. support recovery [15], [16], [20]; Contributions for Specific Models: An overview of our • Minimax characterizations for β in a given class [5], bounds for specific models is given in Table I, where we [7], [8], [13] vs. average performance bounds for ran- state the derived bounds with the asymptotically negligible dom β [14], [16], [21]. terms omitted. All of the models and their parameters are Perhaps the most widely-studied combination of these is defined precisely in Section IV; in particular, the functions thatofminimaxcharacterizationsforexactsupportrecovery f ,...,f and the remainder terms (∆ ,∆ ) are given ex- 1 9 1 9 with random measurement matrices. In this setting, within plicitly, and are easy to evaluate. We proceed by discussing the class of vectors β whose non-zero entries have an these contributions in more detail, and comparing them to absolute value exceeding some threshold β , necessary various existing results in the literature: min and sufficient conditions on n are available with matching 1. (Linear model) In the case of exact recovery, we scaling laws [7], [8]. See also [23], [24] for information- recover the exact thresholds on the required number theoretic studies of the linear model with a mean square of measurements given by Jin et al. [17], as well error criterion. as handling a broader range of scalings of β := min Compared to the linear model, research on the min β : β = 0 (see Section IV-A for details) i i information-theoretic limits of support recovery for non- and{st|ren|gthening(cid:54) the}conversebyconsideringthemore linearmodelsisrelativelyscarce.Thesystemmodelthatwe stringent condition P[error] 1. Our results for haveadoptedfollowsthoseofalineofworksseekingmutual partial recovery provide near-m→atching necessary and information characterizations of sparsity problems [2], [11], sufficient conditions under scalings with k = o(p), [14], [25], though we make use of significantly different thus complementing the extensive study of the scaling analysis techniques. Similarly to these works, we focus on k =Θ(p) by Reeves and Gastpar [15], [16]. random measurement matrices and random non-zero entries 2. (1-bit model) We provide two surprising observations of β. Other works considering non-linear models have used regarding the 1-bit model: Corollary 3 provides a low- vastlydifferentapproachessuchasregularizedM-estimators SNR setting where the quantization only increases the [26], [27] and approximate message passing [28]. asymptotic number of measurements by a factor of π, 2 High-level Contributions: We consider an approach whereasCorollary4providesahigh-SNRsettingwhere using thresholding techniques akin to those used in the scaling law is strictly worse than the linear model. information-spectrum methods [29], thus providing a new Similar behavior will be observed for partial recovery alternative to previous approaches based on maximum- (Corollaries 2 and 5) by numerically comparing the likelihood decoding and Fano’s inequality. Our key contri- bounds for various SNR values. butionsandtheadvantagesofourframeworkareasfollows: 3. (Grouptesting)Asymptoticthresholdsforgrouptesting 1. Consideringbothexactandpartialsupportrecovery,we with k = Θ(1) were given previously by Malyu- provide non-asymptotic performance bounds applying tov [11] and Atia and Saligrama [2]. However, for togeneralprobabilisticmodels,alongwithaprocedure the case that k , the sufficient conditions of → ∞ forapplyingthemtospecificmodels(cf.SectionIII-B). [2] that introduced additional logarithmic factors. In 3 Model Result Parameters Distributions SufficientnforP[error]→0 NecessarynforP[error](cid:54)→1 Cor.1 k=o(p) DGiasucsrseitaenβXS (cid:96)=m1a,.x..k(1+∆f11)(lo(cid:96))g(cid:0)p−(cid:96)k(cid:1) (cid:96)=m1a,.x..klogf(cid:0)1p−((cid:96)k(cid:96))+(cid:96)(cid:1) Linear (∆1→0forvariousscalings) k→∞,k=o(p) Gaussianβ αklogp (α−α∗)klogp Cor.2 Partialrecoveryof GaussianXS max k max k proportion1−α∗ α∈[α∗,1] f2(α) α∈[α∗,1] f2(α) (cid:96)logp (cid:96)logp k=Θ(1) Discreteβ max max Cor.3 LowSNR GaussianXS (cid:96)=1,...k f3((cid:96)) (cid:96)=1,...k f3((cid:96)) (withinafactor π oflinearmodel) (withinafactor π oflinearmodel) 2 √2 1-bit Cor.4 k=Θ(p) FixedβS - Ω(p logp) HighSNR GaussianX (comparedtoΘ(p)forlinearmodel) k→∞,k=o(p) Gaussianβ αklogp (α−α∗)klogp Cor.5 Partialrecoveryof GaussianXS max k max k proportion1−α∗ α∈[α∗,1] f5(α) α∈[α∗,1] f5(α) Cor.6 k=Θ(pθ) FixedβS klogkp klogkp BernoulliX f6(θ) log2 (f6(θ)=log2forθ≤ 13) Grouptesting Cor.7 Nokisy=(cΘro(pssθo)ver FixedβS klogkp klogkp probabilityρ) BernoulliX (f7(θ)=log2f−7(Hθ2)(ρ)forsmallθ) log2−H2(ρ) Cor.8 kPa→rtia∞lr,ecko=veroy(po)f FixedβS klogkp (1−α∗)(cid:0)klogkp(cid:1) proportion1−α∗ BernoulliX log2−H2(ρ) log2−H2(ρ) General log(cid:0)p−k+(cid:96)(cid:1) discrete Cor.9 Arbitrary Arbitrary - max (cid:96) observations (cid:96)=1,...k f9((cid:96))+∆9 Table I: Overview of main results for exact or partial support recovery under various observation models. In the necessary and sufficient number of measurements, asymptotically negligible terms have been omitted. All quantities are defined precisely in Section IV. contrast, we obtain matching Θ(cid:0)klog p(cid:1) scaling laws in Section IV. The proofs of the general bounds are given k for any sublinear scaling of the form k = O(pθ) in Section V, and conclusions are drawn in Section VI. (θ (0,1)). Moreover, for sufficiently small θ we ∈ obtain exact thresholds. In particular, for the noiseless setting we show that n klog p measurements are C. Notation both necessary and suffi≈cient fo2rkθ 1. This is in ≤ 3 Weuseupper-caselettersforrandomvariables,andlower- fact the same threshold as that for adaptive group case variables for their realizations. A non-bold character testing [31], thus proving that non-adaptive Bernoulli may be a scalar or a vector, whereas a bold character refers measurement matrices are asymptotically optimal even to a collection of n scalars (e.g., Y Rn) or vectors (e.g., when adaptivity is allowed; this was previously known X Rn×p).Wewriteβ todenoteth∈esubvectorofβ atthe S only in the limit as θ 0 [32]. For the noisy case, we ∈ columns indexed by S, and X to denote the submatrix of → S prove an analogous claim for sufficiently small θ. A X containing the columns indexed by S. The complement shortened and simplified version of this paper focusing with respect to 1,...,p is denoted by ()c. exclusively on group testing can be found in [33]. { } · The symbol means “distributed as”. For a given joint 4. (General discrete observations) Our converse for the ∼ distribution P , the corresponding marginal distributions XY case of general discrete observations (Corollary 9) are denoted by P and P , and similarly for conditional X Y recovers that of Tan and Atia [25] for the case that marginals (e.g., P ). We write P[] for probabilities, E[] β is fixed, strengthens it due to a smaller remainder Y|X · · S for expectations, and Var[] for variances. We use usual term ∆ , and provides a generalization to the case that · 9 notations for the entropy (e.g., H(X)) and mutual infor- β is random. S mation (e.g., I(X;Y)), and their conditional counterparts (e.g., H(X Z), I(X;Y Z)). Note that H may also denote B. Structure of the Paper thedifferen|tialentropyf|orcontinuousrandomvariables;the In Section II, we introduce our system model. In Section distinction will be clear from the context. We define the bi- III, we present our main non-asymptotic achievability and naryentropyfunctionH2(ρ):= ρlogρ (1 ρ)log(1 ρ), converse results for general observation models, and the and the Q-function Q(x):=P[W− x] (−W −N(0,1))−. ≥ ∼ procedure for applying them to specific problems. Several We make use of the standard asymptotic notations O(), · applications of our results to specific models are presented o(), Θ(), Ω() and ω(). We define the function []+ = · · · · · 4 max 0, , and write the floor function as . The function [A1] The support set S is uniform on the (cid:0)p(cid:1) subsets of { ·} (cid:98)·(cid:99) k log has base e. 1,...,p of size k, and the measurement matrix X is { } i.i.d. on some distribution P . X II. PROBLEMSETUP [A2] The non-zero entries βS are distributed according to P , and this distribution is permutation-invariant and A. Model and Assumptions βS the same for all realizations of S. Recall that p denotes the ambient dimension, k denotes [A3] The observation vector Y is conditionally i.i.d. ac- the sparsity level, and n denotes the number of measure- cording to P , and this distribution is the same Y|XSβS ments. We let be the set of subsets of 1,...,p having for all realizations of S, and invariant to common S { } cardinalityk.Thekeyrandomvariablesinoursetuparethe permutations of the columns of X and entries of β . S S supportsetS ,thedatavectorβ Rp,themeasurement [A4] The decoder is given (X,Y), and also knows the matrix X R∈n×Sp, and the observati∈on vector Y Rn.1 system model including k, P , and P . ∈ ∈ Y|XSβS βS The support set S is assumed to be equiprobable on Our main goal is to derive necessary and sufficient the (cid:0)p(cid:1) subsets within . Given S, the entries of β k S Sc conditions on n and k (as functions of p) such that Pe or are deterministically set to zero, and the remaining entries P (d ) vanishes as p . Moreover, when considering e max are generated according to some distribution β P . →∞ S ∼ βS converseresults,wewillnotonlybeinterestedinconditions We assume that these non-zero entries follow the same under which P 0, but also conditions under which the distribution for all of the (cid:0)p(cid:1) possible realizations of S, and e (cid:54)→ k stronger statement Pe 1 holds. that this distribution is permutation-invariant. → Inparticular,weintroducetheterminologythatthestrong The measurement matrix X is assumed to have i.i.d. val- converse holds if there exists a sequence of values n∗, ues on some distribution PX. We write PXn×p, to denote indexedbyp,suchthatforallη >0,wehavePe 0when the corresponding i.i.d. distributions for matrices, and we n n∗(1+η), and P 1 when n n∗(1 η→). This is e write PXk as a shorthand for PXk×1. Given S, X, and β, rel≥ated to the notion of a→phase transi≤tion [24−], [30]. More each entry of the observation vector Y is generated in a generally, we will refer to conditions under which P 1 e conditionally independent manner, with the i-th entry Y(i) as strong impossibility results, not necessarily requi→ring distributed according to matching achievability bounds. That is, the strong converse (Y(i) S =s,X(i) =x(i),β =b) P ( x(i),b ), conclusively gives a sharp threshold between failure and | ∼ Y|XSβS ·| s s success, whereas a strong impossibility result may not. (5) It will prove convenient to work with random variables forsomeconditionaldistributionP .Weagainassume Y|XSβS that are implicitly conditioned on a fixed value of S, say symmetrywithrespecttoS,namely,thatP doesnot Y|XSβS s= 1,...,k . We write P and P in place of P depend on the specific realization, and that the distribution { } βs Y|Xsβs βS andP toemphasizethatS =s.Moreover,wedefine is invariant when the columns of XS and the entries of βS Y|XSβS the corresponding joint distribution undergo a common permutation. Given X and Y, a decoder forms an estimate Sˆ of S. P (b ,x ,y):=P (b )Pk(x )P (y x ,b ), Similarly to previous works studying information-theoretic βsXsY s s βs s X s Y|Xsβs | s s (8) limits on support recovery, we assume that the decoder and its multiple-observation counterpart knows the system model. We consider two related perfor- P (b ,x ,y):=P (b )Pn×k(x )Pn (yx ,b ). mance measures. In the case of exact support recovery, the βsXsY s s βs s X s Y|Xsβs | s s (9) error probability is given by where Pn ( ,b ) is the n-fold product of Pe :=P[Sˆ(cid:54)=S], (6) PY|Xsβs(·|Y·,|bXss)β.s ·|· s Except where stated otherwise, the random variables and is taken with respect to the realizations of S, β, X, (β ,X ,Y) and (β ,X ,Y) appearing throughout this pa- and Y; the decoder is assumed to be deterministic. We also s s s s per are distributed as consideralessstringentperformancecriterionrequiringthat only k d entries of S are successfully recovered, for (β ,X ,Y) P (10) − max s s ∼ βsXsY some d 1,...,k 1 . Following [15], [16], the error max (β ,X ,Y) P , (11) probability ∈is{given by − } s s ∼ βsXsY withtheremainingentriesofthemeasurementmatrixbeing Pe(dmax):=P(cid:2)|S\Sˆ|>dmax∪|Sˆ\S|>dmax(cid:3). (7) distributed as Xsc ∼ PXn×(p−k), and with βsc = 0 deter- NotethatifbothS andSˆhavecardinalitykwithprobability ministically. That is, we condition on a fixed S =s except where stated otherwise. one,thenthetwoeventsintheunionareidentical,andhence Fornotationalconvenience,themainpartsofouranalysis either of the two can be removed. arepresentedwithP ,P andP representingprob- For clarity, we formally state our main assumptions as βs X Y|Xsβs ability mass functions (PMFs), and with the corresponding follows: averages written using summations. However, except where 1ExtensionstomoregeneralalphabetsbeyondRarestraightforward. stated otherwise, our analysis is directly applicable to case 5 that these distributions instead represent probability density ChannelState functions (PDFs), with the summations replaced by inte- βS∼PβS grals where necessary. The same applies to mixed discrete- Message Codeword Output Estimate continuous distributions. S Encoder XS Channel Y∼PYn|XSβS Decoder Sˆ B. Information-Theoretic Definitions Figure 1: Connection between support recovery and coding Before introducing the required definitions for support over a mixed channel. recovery, it is instructive to discuss thresholding techniques inchannelcodingstudies.Thesecommencedinearlyworks such as [34], [35], and have recently been used extensively For fixed s and a corresponding pair (s ,s ), we dif eq in information-spectrum methods [29], [36]. ∈ S introduce the notation 1) Channel Coding: We first recall the mutual informa- tion, which is ubiquitous in information theory: PY|XsdifXseq(y|xsdif,xseq):=PY|Xs(y|xs) (15) I(X;Y):=(cid:88)P (x,y)logPY|X(y|x). (12) PY|XsdifXseqβs(y|xsdif,xseq,bs):=PY|Xsβs(y|xs,bs), XY P (y) (16) Y x,y where P is the marginal distribution of (9). While the In deriving asymptotic and non-asymptotic performance Y|Xs left-handsidesof(15)–(16)representthesamequantitiesfor bounds, it is common to work directly with the logarithm, any such (s ,s ), it will still prove convenient to work dif eq PY|X(y x) withtheseinplaceoftheright-handsides.Inparticular,this ı(x;y):=log | , (13) P (y) allows us to introduce the marginal distributions Y which is commonly known as the information density. P (yx ) The thresholding techniques work by manipulating prob- Y|Xseq | seq (cid:88) a(cid:80)binil=it1ieıs(Xoif;Yevi)ent>s oγf.thFeorfotrhme f(cid:80)ornim=e1rı,(Xoni;eYic)an≤peγrfoarnmd :=xsdif PXn×(cid:96)(xsdif)PY|XsdifXseq(y|xsdif,xseq) (17) a change of measure from the conditional distribution Y P (y x ,b ) Y|Xseqβs | seq s given X to the unconditional distribution of Y, with a (cid:88) := P(cid:96)(x )P (y x ,x ,b ), (18) multiplicativeconstante−γ.Forthelatter,onecansimilarly X sdif Y|XsdifXseqβs | sdif seq s perform a change of measure from Y to (Y X). Hence, in xsdif | bothcases,thereisasimplerelationbetweentheconditional where (cid:96) := s . Using the preceding definitions, we dif | | and unconditional probabilities of the output sequences. introducetwoinformationdensities.Thefirstcontainsprob- Usingthesemethods,onecangetupperandlowerbounds abilities averaged over β , s on the error probability such that the dominant term is P (yx ,x ) (cid:20) n (cid:21) ı(x ;yx ):=log Y|XsdifXseq | sdif seq , (19) P n1 (cid:88)ı(Xi;Yi)≤I(X;Y)+ζn (14) sdif | seq PY|Xseq(y|xseq) i=1 whereas the second conditions on β =b : s s forsomeζ =o(1).Assumingthat (X ,Y ) n hassome formofi.i.nd.structure,onecananaly{zethiiseix}pir=e1ssionusing ın(x ;yx ,b ):=(cid:88)n ı(x(i) ;y(i) x(i),b ), (20) tools from probability theory. The law of large numbers sdif | seq s sdif | seq s i=1 yields the channel capacity C = max I(X;Y), and PX where the single-letter information density is refined characterizations can be obtained using variations of the central limit theorem [37]. P (y x ,x ,b ) Amongthechannelcodingliterature,ouranalysisismost ı(xsdif;y|xseq,bs):=log Y|XPsdifXseqβs (y|xsdif,bs)eq s . similar to that of mixed channels [29, Sec. 3.3], where Y|Xseqβs | seq s (21) the relation between the input and output sequences is not As mentioned above, we will generally work with discrete i.i.d., but instead conditionally i.i.d. given another random random variables for clarity of exposition, in which case variable. In our setting, β will play the role of this random s the ratio is between two PMFs. In the case of continuous variable. See Figure 1 for a depiction of this connection. observations the ratio is instead between two PDFs, and 2) Support Recovery: As in [2], [14], we will consider more generally this can be replaced by the Radon-Nikodym partitions of the support set s into two sets s = dif derivative as in the channel coding setting [37]. ∈ S (cid:54) ∅ and s . As will be seen in the proofs, s will typically eq eq Averaging (21) with respect to the random variables in correspond to an overlap between s and some other set s (10) conditioned on β = b yields a conditional mutual s s (i.e., s s), whereas s will correspond to the indices in dif information, which we denote by ∩ one set but not the other (e.g., s s). There are 2k 1 ways \ − of performing such a partition with s = . I (b ):=I(X ;Y X ,β =b ). (22) dif (cid:54) ∅ sdif,seq s sdif | seq s s 6 This quantity will play a key role in our bounds, which will Moreover, this can be strengthened by noting from the typically have the form proof of Theorem 1 that γ may depend on β , and s (cid:20) (cid:0) p (cid:1) (cid:21) choosing γ(bs)=logPβs1(bs) accordingly. P P n max |sdif| , (23) 2. Defining e ≈ ≤(sdif,seq)Isdif,seq(βs) I :=I(β ;Y X ) (27) 0 s s | as will be made more precise in the subsequent sections. (cid:20) P (Y X ,β )(cid:21) V :=Var log Y|Xs,βs | s s , (28) 0 P (Y X ) Y|Xs | s III. GENERALACHIEVABILITYANDCONVERSEBOUNDS we have for any δ >0 that 0 In this section, we provide general results holding for (cid:114) V arbitrarymodelssatisfyingtheassumptionsgiveninSection γ =I + 0 = P (γ) δ . (29) II. Each of the results for exact recovery has a direct 0 δ0 ⇒ 0 ≤ 0 counterpart for partial recovery. For clarity, we focus on This follows directly from Chebyshev’s inequality. the former throughout Sections III-A and III-B, and then 3. Defining proceed with the latter in Section III-C. (cid:20)(cid:20) P (Y X ,β )(cid:21)+(cid:21) I :=E log Y|Xsβs | s s , (30) 0,+ P (Y X ) A. Initial Non-Asymptotic Bounds Y|Xs | s we have for any δ >0 that 0 Here we provide our main non-asymptotic upper and I lower bounds on the error probability. These bounds bear γ = 0,+ = P (γ) δ . (31) astrongresemblancetoanalogousboundsfromthechannel δ0 ⇒ 0 ≤ 0 coding literature [29]; in each case, the dominant term This follows directly from Markov’s inequality. involves tail probabilities of the information density given The proof of Theorem 1 is based on a decoder the in (20). The mean of the information density is the mutual searches for a unique support set s such that information in (22), which thus arises naturally in the subsequent necessary and sufficient conditions on n upon ı(x ;yx )>γ (32) sdif | seq |sdif| showingthatthedeviationfromthemeanissmallwithhigh for some γ k and all 2k 1 partitions (s ,s ) of s probability. The procedure for doing this given a specific { (cid:96)}(cid:96)=1 − dif eq with s = . Since the numerator in (19) is the likelihood model will be given in Section III-B. dif (cid:54) ∅ of y given (x ,x ), this decoder can be thought of a We start with our achievability result. Here and through- sdif seq weakenedversionofthemaximum-likelihood(ML)decoder. out this section, we make use of the random variables LiketheMLdecoder,computationalconsiderationsmakeits defined in (11). implementation intractable. Theorem 1. For any constants δ1 >0 and γ, there exists a Thefollowingtheoremprovidesageneralnon-asymptotic decoder such that converse bound. (cid:20) (cid:26) P P (cid:91) ın(X ;Y X ,β ) Theorem 2. Fix δ1 > 0, and let (sdif(bs),seq(bs)) be e ≤ sdif | seq s an arbitrary partition of s = 1,...,k (with sdif = ) (sdif,seq):sdif(cid:54)=∅ depending on b Rk. For any{decoder,}we have (cid:54) ∅ s (cid:18)p k(cid:19) (cid:18)k2(cid:18) k (cid:19)2(cid:19) (cid:27)(cid:21) ∈ log − +log +γ +P (γ)+2δ , (cid:20) ≤ |sdif| δ12 |sdif| 0 1(24) Pe ≥P ın(Xsdif(βs);Y|Xseq(βs),βs) (cid:18) (cid:19) (cid:21) p k+ s (β ) where log − | dif s | +logδ δ . (33) ≤ s (β ) 1 − 1 dif s (cid:20) P (Y X ,β ) (cid:21) | | P (γ):=P log Y|Xs,βs | s s >γ . (25) Proof: See Section V-B. 0 P (Y X ) Y|Xs | s The proof of Theorem 2 is based on Verdú-Han type Proof: See Section V-A. bounding techniques [36]. Remark 1. TheprobabilityinthedefinitionofP (γ)isnot 0 B. Techniques for Applying Theorems 1 and 2 ani.i.d.sum,andthetechniquesforensuringthatP (γ) 0 0 → vary between different settings. The following approaches The bounds presented in the preceding theorems do not will suffice for all of the applications in this paper: directly reveal the number of measurements required to achieving a vanishing error probability. In this subsection, 1. In the case that P is discrete, P (yx ) = (cid:80) βS Y|Xs | s we present the steps that can be used to obtain such P (b )P (yx ,b ), and it follows that bs βs s Y|Xs,βs | s s conditions. We provide examples in Section IV. 1 Theideaistouseaconcentrationinequalitytoboundthe γ =log = P (γ)=0. (26) min P (b ) ⇒ 0 first term in (24) (or (33)), which is possible due to the fact bs βs s 7 that each summation ın is conditionally i.i.d. given β . We approaches the probability, with respect to P , that (37) s βs proceed by providing the details of these steps separately fails to hold. In many cases, the second logarithm in the fortheachievabilityandconverse.Westartwiththeformer. numerator therein is dominated by the first. It should be 1. Observe that, conditioned on β = b , the notedthattheconditionthatthesecondtermin(39)vanishes s s mean of ın(X ;Y X ,β ) is nI (b ), where can also impose conditions on n. For most of the examples I (b ) issddieffine|d isneq(22s). sdif,seq s presented in Section IV, the condition in (37) will be the sdif,seq s 2. Fix δ (0,1), and suppose that for a fixed value b dominant one; however, this need not always be the case, 2 s of β , w∈e have for all (s ,s ) that and it depends on the concentration inequality used in (35). s dif eq TheapplicationofTheorem2isdoneusingsimilarsteps, (cid:18)p k(cid:19) (cid:18)k2(cid:18) k (cid:19)2(cid:19) soweprovidelessdetail.Fixδ2 >0,andsupposethat,fora log − +log +γ |sdif| δ12 |sdif| fiisxseudcvhaltuheatbs ofβs,thepair(sdif,seq)=(sdif(bs),seq(bs)) n(1 δ )I (b ), (34) ≤ − 2 sdif,seq s (cid:18)p k+ s (cid:19) and log − s | dif| −logδ1 ≥n(1+δ2)Isdif,seq(bs), (40) dif | | P(cid:104)ın(Xsdif;Y|Xseq,bs)≤n(1−δ2)Isdif,seq(bs)(cid:12)(cid:12)βs =bs(cid:105) and ≤ψ|sdif|(n,δ2) (35) P(cid:104)ın(Xsdif;Y|Xseq,bs)≤n(1+δ2)Isdif,seq(bs)(cid:12)(cid:12)βs =bs(cid:105) forsomefunctions{ψ(cid:96)}k(cid:96)=1 (e.g.,thesemayarisefrom ≥1−ψ|(cid:48)sdif|(n,δ2) (41) Chebyshev’s inequality or Bernstein’s inequality [38, for some function ψ(cid:48) . Combining these conditions, we Ch. 2]). Combining these conditions with the union |sdif| seethatthefirstprobabilityin(33),withanaddedcondition- bound, we obtain ingonβ =b ,islowerboundedby1 ψ(cid:48) (n,δ ).Inthe P(cid:20) (cid:91) (cid:26)ın(X ;Y X ,β ) case thatsψ(cid:96)(cid:48) iss defined for multiple (cid:96)−valu|sedsif|corre2sponding sdif | seq s to different values of bs, we can further lower bound this (sdif,s(cid:18)eqp):sdkif(cid:19)(cid:54)=∅ (cid:18)k2(cid:18) k (cid:19)2(cid:19) (cid:27)(cid:12) (cid:21) byN1e−xt,mwaex(cid:96)oψbs(cid:96)(cid:48)e(rnv,eδ2th).at (40) holds if and only if log − +log +γ (cid:12)β =b ≤ |sdif| δ12 |sdkif|(cid:18) (cid:19) (cid:12) s s n log(cid:0)p−k|s+di|fs|dif|(cid:1)−logδ1. (42) ≤(cid:88) k(cid:96) ψ(cid:96)(n,δ2). (36) ≤ Isdif,seq(bs)(1+δ2) (cid:96)=1 Recallingthatthepartition(s ,s )isanarbitraryfunction dif eq 3. Observe that the condition in (34) can be written as of βs, we can ensure that this coincides with n≥ log(cid:0)|ps−diIkfs|(cid:1)dif+,selqo(gb(cid:16)s)kδ(1221(cid:0)−|sdkδif2|)(cid:1)2(cid:17)+γ. (37) n≤(sdif,smeq)a:xsdif(cid:54)=∅logIs(cid:0)dpif−,sk|es+qd(i|fsb|dsi)f|((cid:1)1−+lδo2g)δ1 (43) by choosing each pair (s ,s ) as a function of b to We summarize the preceding findings in the following. dif eq s achieve this maximum. Theorem 3. For any constants δ1 > 0, δ2 (0,1) and γ, Finally, we note that the maximum over (cid:96) in the above- and functions {ψ(cid:96)}k(cid:96)=1 (ψ(cid:96) :Z×R→R), d∈efine the set derivedterm1−max(cid:96)ψ(cid:96)(cid:48)(n,δ2)mayberestrictedtoanyset 1,...,k provided that s is constrained similarly B(δ1,δ2,γ):=(cid:8)bs : (35) and (37) hold for all Lin⊆(43{); one si}mply chooses th|edpifa|rtition (sdif(bs),seq(bs)) (cid:9) (s ,s ) with s = . (38) so that (cid:96)= s always lies in this set. Putting everything dif eq dif dif (cid:54) ∅ | | together, we have the following. Then we have Theorem 4. For any set 1,...,k , constants δ >0 Pe ≤P(cid:2)βs ∈/ B(δ1,δ2,γ)(cid:3)+(cid:88)k (cid:18)k(cid:96)(cid:19)ψ(cid:96)(n,δ2)+P0(γ)+2δ1. adnefidneδ2th>e s0e,t and functionLs⊆{ψ{(cid:96)(cid:48)}(cid:96)∈L (ψ}(cid:96)(cid:48) : Z×R →1 R), (cid:96)=1 (39) (cid:48)(δ ,δ ):=(cid:8)b : (41) and (42) hold for all 1 2 s Remark 2. The preceding arguments remain unchanged B (cid:9) (s ,s ) with s . (44) when δ2 also depends on (cid:96)= sdif . We leave this possible dif eq | dif|∈L | | dependence implicit throughout this section, since a fixed Then we have value will suffice for all but one of the models considered P P(cid:2)β (cid:48)(δ ,δ )(cid:3)(cid:16)1 maxψ(cid:48)(n,δ )(cid:17) δ . (45) in Section IV. e ≥ s ∈B 1 2 − (cid:96)∈L (cid:96) 2 − 1 In the case that (35) holds for all b (or more generally, Ifthepair(s ,s )hadbeenfixedinTheorem2,asop- s dif eq within a set whose probability under P tends to one) and posedtobeingafunctionofβ ,thenwewouldhaveonlyob- βs s the final three terms in (39) vanish, the overall upper bound tainedaweakerresultwiththestatement“forall(s ,s )” dif eq 8 Procedure 1:StepsforObtainingNecessaryandSufficient tion inequalities such as Bernstein’s inequality, due to the Conditions on n from Theorems 3 and 4 combinatorial terms in (39). 1. (Identify a Typical Set) Construct a sequence of “typ- ical” sets Rk of non-zero entries, indexed by β p, such thTat P⊆(cid:2)β (cid:3) 1, thus restricting the C. Extensions to Partial Recovery s β ∈ T → vectors bs for which ı(Xsdif;Y|Xseq,bs) needs to be We now turn to the partial support recovery criterion characterized. in (7). The changes in the analysis required to generalize 2. (Bound the Information Density Tail Probabilities) Us- Theorems 1 and 2 to this setting are given in Section V-C; ing a concentration inequality for i.i.d. summations rather than repeating each of these, we focus our attention (e.g., Chebyshev, Bernstein), bound the tail probabil- on the resulting analogues of Theorems 3 and 4. ities in (35) and (41) for each (s ,s ) and b , dif eq s β ∈T Theorem 5. For any constants δ > 0, δ (0,1) and withafixedconstantδ .Uponmakingthesedependent 1 2 2 γ > 0, and functions ψ k (ψ : Z∈ R R), on (sdif,seq,bs) only through (cid:96) := sdif , the bounds { (cid:96)}(cid:96)=dmax+1 (cid:96) × → are denoted by ψ (n,δ ) and ψ(cid:48)(n,δ| ). | define the set (cid:96) 2 (cid:96) 2 3. (Control the Remainder Terms) By suitable rearrange- (cid:8) (δ ,δ ,γ):= b : (35) and (37) hold for all ments, find conditions on n under which the terms 1 2 s B (cid:80)(cid:96)(cid:0)k(cid:96)(cid:1)ψ(cid:96)(n,δ2) and max(cid:96)∈Lψ(cid:96)(cid:48)(n,δ2) in (39) and (sdif,seq) with |sdif|∈{dmax+1,...,k}(cid:9). (46) (45) vanish, thus ensuring that their contribution is Then we have negligible.Similarly,chooseδ tovanishwithpsothat 1 its contribution is negligible, and for the achievability P (d ) P(cid:2)β / (δ ,δ ,γ)(cid:3) part, choose γ such that the remainder term P0(γ) e max ≤ s ∈B 1 2 vanishes (cf. Remark 1). (cid:88)k (cid:18)k(cid:19) + ψ (n,δ )+P (γ)+2δ , (47) 4. (Combine and Simplify) Combine the previous steps as (cid:96) (cid:96) 2 0 1 follows: (cid:96)=dmax+1 a) Construct the set of non-zero entries (δ ,δ ,γ) where P is defined in (25). 1 2 0 Rk (respectively, (cid:48)(δ ,δ )) in (38)B(respectivel⊆y, B 1 2 For the converse part, (42) is replaced by (44)); b) Deducefrom(39)(respectively,(45))andStep3that log(cid:0)p−k+|sdif|(cid:1) log(cid:80)dmax(cid:0)p−k(cid:1)(cid:0)|sdif|(cid:1) logδ P P[β / (δ ,δ ,γ)]+o(1) (respectively, P n |sdif| − d=0 d d − 1, e s 1 2 e P[β≤s ∈B(cid:48)(δ∈1,Bδ2)]+o(1)); ≥ ≥ Isdif,seq(bs) (48) c) From the properties of the typical set in Steps 1– Tβ and we have the following analog of Theorem 4. 2, deduce that P 0 (respectively, P 1) when e e → → n satisfies (37) (respectively, (42)) for all b ; Theorem 6. For any set d +1,...,k , constants s β max d) Augment this condition on n with Step 3. ∈T δ > 0 and δ (0,1), aLnd⊆f{unctions ψ(cid:48) }(ψ(cid:48) : Z R1 R), defin2e∈the set { (cid:96)}(cid:96)∈L (cid:96) × → (cid:48)(δ ,δ ):=(cid:8)b : (41) and (48) hold for all in(44)replacedbyafixedpair.Assumingthattheremainder B 1 2 s (cid:9) terms in (45) are insignificant, this weaker result is of the (sdif,seq) with sdif . (49) form P (cid:38) max P(cid:2)n f(s ,s ,β )(cid:3) rather than | |∈L P (cid:38)Pe(cid:2)n ma(xsdif,seq) f(s≤ ,s d,ifβ )e(cid:3)q. Thsis can lead to Then we have siegnificantl≤ydiffer(esnditf,bseoqu)ndsodinf theeqsamsplecomplexity,and P (d ) P(cid:2)β (cid:48)(δ ,δ )(cid:3)(cid:16)1 maxψ(cid:48)(n,δ )(cid:17) δ . thedistinctioniscrucialinourapplicationsinSectionIV.As e max ≥ s ∈B 1 2 − (cid:96)∈L (cid:96) 2 − 1 (50) describedintheproofinSectionV,thekeytoobtainingthis difference is in applying a refined version of an argument The applications of Theorems 5 and 6 follow identical based on a genie. steps to Procedure 1. However, it will be seen that the The general steps in applying Theorems 3 and 4 to restriction s > d can in fact considerably simplify dif max | | specific problems are outlined in Procedure 1. thesesteps,sinceitremovestheneedtoobtainconcentration In our experience, the choice of in the first step inequalities for smaller values of s . β dif T | | of Procedure 1 usually comes naturally given the specific model. On the other hand, it is often less straightforward D. Comparison to Fano’s Inequality to find a sufficiently powerful concentration inequality in Step 2. A simple choice is Chebyshev’s inequality, which Most previous works on the information-theoretic limits expresses ψ and ψ(cid:48) in terms of I (b ) (see (22)) and ofsparsityrecoveryhavemadeuseofFano’sinequality[39, (cid:96) (cid:96) sdif,seq s the corresponding variances of the information densities. Sec. 2.11]. For this reason, we provide here a discussion on This choice is usually effective for the converse, wheres the relative merits of this approach and our approach. To the achievability part typically requires sharper concentra- this end, we consider the following bound, which can be 9 obtained by combining the analysis of [2], [14] with our β be a uniformly random permutation of a fixed vector s refined genie argument: (b ,...,b ), and we choose P N(0,1). Since both 1 k X ∼ P (cid:88)P (b )max(cid:26)0,1 nIsdif(bs),seq(bs)(bs)+1(cid:27) tmheutumaelaisnufroermmeantitonmaintri(c2e2s)ainsdgitvheennboyis[e39a,reCGh.a1u0ss]ian, the e ≥ βS s − log(cid:0)p−k+|sdif(bs)|(cid:1) bs |sdif(bs)| (51) I (b )= 1log(cid:16)1+ 1 (cid:88) b2(cid:17). (56) sdif,seq s 2 σ2 i in the notation of Theorem 2. By analyzing this bound i∈sdif similarly to Section III-B, we obtain for any δ >0 that 2 Throughout this subsection, we denote b := min b min i i Pe ≥δ2P(cid:2)βs ∈BF(cid:48)ano(δ2)(cid:3)− log(p 1k+1), (52) abndbm=axΘ:(=b ma)xia|nbdi|.0W<eabssum=etOha(t1σ);2n=otΘe(t1h)a,tabndt|h=a|t min max min min − o(1) is allowed. The steps of Procedure 1 are as follows. where Step 1: We trivially choose the typical set to contain β (cid:26) log(cid:0)p−k+|sdif|(cid:1) all vectors on the support of P . T (cid:48) (δ ):= b : n |sdif| (1 δ ) βS BFano 2 s ≤ I (b ) − 2 Step 2: We make use of the following concentration sdif,seq s (cid:27) inequality based on Bernstein’s inequality. for all (s ,s ) with s = . (53) dif eq dif (cid:54) ∅ Proposition 1. Under the preceding setup for the linear model, we have for all (s ,s ) and b that A similar result for partial recovery can also be derived by dif eq s incAorspodrisactiunsgsetdheianrgtuhmeeinnttsrofdroumcti[o1n6,]tahnedktheyepardevsaennttapgaepeorf. P(cid:20)(cid:12)(cid:12)ın(Xsdif;Y|Xseq,bs)−nIsdif,seq(bs)(cid:12)(cid:12)≥nδ(cid:12)(cid:12)(cid:12)βs =bs(cid:21) Theorem 4 is that it provides a more precise characteri- (cid:18) δ2n (cid:19) zation of how far the error probability is from zero, and 2exp , (57) ≤ − 2(4α2 +δα ) in particular, the conditions under which Pe 1 (strong sdif sdif → impossibility results). On the other hand, the bound on Pe where 2σ (σ+σ ) in (52) is always bounded away from one for fixed δ2, and α := sdif sdif (58) becomes increasingly weak for small δ2. sdif σ2+σs2dif The advantage of Fano’s inequality is that it only re- with σ2 :=(cid:80) b2. quires the mutual information to be computed, whereas our sdif i∈sdif i approach also requires the application of a concentration Proof: See Appendix B. inequality.This,inturn,typicallyrequiresthevarianceofthe Settingδ =δ2Isdif,seq(bs),itfollowsthatin(35)and(41) informationdensitytobecharacterized,whichisnotalways we can set straightforward. However, as discussed following Procedure ψ (n,δ )=ψ(cid:48)(n,δ )=2 max 1, the main difficulty associated with these concentration (cid:96) 2 (cid:96) 2 (sdif,seq,bs):|sdif|=(cid:96) inequalities is typically in finding one which is sufficiently (cid:18) (δ I (b ))2n (cid:19) powerfulfortheachievabilitypart.Thus,theaddeddifficulty exp 2 sdif,seq s . (59) − 2(4α +δ I (b ))α in the converse may not add to the overall difficulty in sdif 2 sdif,seq s sdif deriving matching achievability and converse bounds. Step 3: In accordance with the first item of Remark 1, we set γ as in (26) so that P (γ)=0. 0 IV. APPLICATIONSTOSPECIFICMODELS We focus on the conditions on n under which the term (cid:80)k (cid:0)k(cid:1)ψ (n,δ ) in (39) vanishes; the term containing ψ(cid:48) In this section, we present applications of Theorems 3–6 (cid:96)=1 (cid:96) (cid:96) 2 (cid:96) in (45) can be handled in a similar yet simpler fashion. to the linear [5], 1-bit [9], and group testing [2] models, By the assumptions σ2 = Θ(1) and b = Θ(b ), we andtomoregeneralmodelswithdiscreteobservations[25]. max min readilyobtainI (b )=Θ(log(1+(cid:96)b2 ))andα2 = Throughout the section, we make use of general concentra- sdif,seq s min sdif Θ(min 1,(cid:96)b2 ) using (56) and (58), where (cid:96) = s . tion inequalities given in Appendix A. We also make use of { min} | dif| Usingthesegrowthratesandupperboundingthesummation the following variance quantity: in(39)byk timesthecorrespondingmaximum,weseethat Vsdif,seq(bs):=Var(cid:2)ı(Xsdif;Y|Xseq,bs)(cid:12)(cid:12)βs =bs(cid:3). (54) (cid:80)k(cid:96)=1(cid:0)k(cid:96)(cid:1)ψ(cid:96)(n,δ2) → 0 provided that the following holds for some sufficiently small constant ζ (depending on δ ): 2 A. Linear Model with Discrete β s nlog2(1+(cid:96)b2 ) Here we consider the linear model, where each observa- min(cid:112) ζ min 1,(cid:96)b2 +log(1+(cid:96)b2 ) min 1,(cid:96)b2 tion takes the form { min} min { min} k (cid:96)log logk (60) Y = X,β +Z, (55) − (cid:96) − →∞ (cid:104) (cid:105) where Z N(0,σ2) for some σ >0. for all (cid:96). We now treat two cases separately: ∼ Without loss of generality, we consider the fixed support • If (cid:96)b2min = o(1), the first term in (60) behaves as set s = 1,...,k . Following the setup of [17], we let Θ(n(cid:96)b2 );byrearranging,weconcludethatitsuffices { } min 10 that n and n=Ω(cid:0)logk(cid:1) with a sufficiently large In cases (i)–(iii), we have logk = o(logp), and it implied→co∞nstant. b2min immediatelyfollowsthatthenumeratorin(64)isdominated • If(cid:96)b2min =Ω(1),thefirsttermin(60)behavesasΩ(n), by the first term, and hence, the others can be factored into and it thus suffices that n and n=Ω(k) with a η in (62). Moreover, in case (i), both conditions in (61) are →∞ sufficiently large implied constant. dominated by the objective in (64) with (cid:96) := sdif = 1, Thus, the overall condition that we require is n and which behaves as Θ(cid:0)lbo2gp(cid:1). In cases (ii)–(iii)|, th|e first →∞ condition in (61) is agaminindominated by the term in (64) (cid:16)logk(cid:17) n=Ω and n=Ω(k), (61) with (cid:96)=1. The second condition is dominated by the term b2min with(cid:96)=k,whichbehavesasΘ(cid:0) klogp (cid:1)=Ω(cid:0)klogp(cid:1). log(1+kb2 ) logk with sufficiently large implied constants. For the converse, In case (iv), the first term in the nmuinmerator of (64) the analogous condition to (60) contains only the first term may not be dominant for small (cid:96) := s , since logk = dif | | ontheleft-handside(thedifferencebeingduetothefactthat Θ(logp).However,byobservingthattheobjectivescalesas the combinatorial term in (39) is not present in (45)), and a Θ(cid:0) (cid:96)logp (cid:1)andusingtheassumedscalingofb2 ,itis similar argument reveals that it suffices that n=ω(cid:0) 1 (cid:1). readloigly(1v+e(cid:96)rbi2mfiien)dthatthemaximumcanonlybeachievmeindwith Step 4: Combining the preceding steps and apbp2mliyning (cid:96) = Θ(k). For any such maximizer, we have log(cid:0)p−k(cid:1) = (cid:96) asymptotic simplifications, we obtain the following. Θ(klogp), and hence, the second term in the numerator of (64) can be factored into η, as it behaves as O(k). The two Corollary1. Undertheprecedingsetupforthelinearmodel conditions in (61) are identical under the given scaling of with σ2 = Θ(1), b = Θ(b ), b2 = O(1), k = o(p), min max min b2 ,andaredominatedbytheobjectivein(62)with(cid:96)=k, and mβ distinct elements in (b1,...,bk), we have Pe → 0 wmhiinch behaves as Θ(cid:0) klogk (cid:1). as p provided that loglogk →∞ In the case that bmin = Θ(1), the thresholds given in log(cid:0)p−k(cid:1) Corollary 1 coincide with those given in the main results of n max (cid:16) |sdif| (cid:17)(1+η), (62) [17].Ourframeworkhastheadvantageofhandlingthecase ≥sdif(cid:54)=∅ 12log 1+ σ12 (cid:80)i∈sdif b2i that bmin = o(1), as well as providing the strong converse (P 1) instead of the weak converse (P 0). However, under any one of the following additional conditions: (i) e e → (cid:54)→ itshouldbenotedthattheachievabilitypartsof[17]havethe k = Θ(1); (ii) k = o(logp) and m = Θ(1); (iii) k = β notable advantage of using a decoder that does not depend O((logp)θ) for some θ > 0, and m = 1; (iv) k = Θ(pθ) β for some θ (0,1), b2 =Θ(cid:0)logk(cid:1), and m =1. on the distribution of βs. ∈ min k β Onfirstglance,theboundsin(62)–(63)mayappeartobe Conversely, without any additional conditions, we have difficulttoevaluate,sincethemaximizationsareover2k 1 Pe 1 as p whenever − → →∞ non-emptysubsetss .However,itisinfactonlykofthem dif log(cid:0)p−k+|sdif|(cid:1) that need to be computed, since for any given (cid:96)= s the n max (cid:16) |sdif| (cid:17)(1 η) (63) maximizing s is the one with the smallest corres|pdoinf|ding ≤sdif(cid:54)=∅ 12log 1+ σ12 (cid:80)i∈sdif b2i − value of (cid:80)i∈sddifif b2i. Comparison to the LASSO: Conditions for the support for some η >0. recovery of the computationally tractable LASSO algorithm Proof:Theconversepartfollowsfrom(42)withδ1 0 were given by Wainwright [6]. Several comparisons to the sufficiently slowly. To check the condition n = ω(cid:0) →1 (cid:1) information-theoretic limits were given in [5], [6] in terms b2 stated following (61), we may assume without lossminof of scaling laws; here we complement these comparisons by generality that (63) holds with equality, since the decoder briefly discussing the corresponding constant factors. For canalwayschoosetoignoreadditionalmeasurements.When simplicity,wefocusonthecasethatthenon-zeroentriesare equality holds, we observe that for the worst-case sdif with all equal to a common value b0 = ckβ (for some constant cβ (cid:96) = 1, the denominator therein behaves as O(b2 ) (since representingtheper-sampleSNR)andk ispoly-logarithmic min b2 = O(1)) and the numerator behaves as Θ(logp), and in p, corresponding to case (iii) of Corollary 1. hmenince, the condition n=ω(cid:0) 1 (cid:1) is satisfied. The results of [6] state that LASSO requires at least b2 For the achievability part, wmeinfirst use (37) to obtain (2klogp)(1+o(1))measurementsregardlessofc ,andthat β log(cid:0)p−k(cid:1)+2log(cid:0)k(cid:0) k (cid:1)(cid:1)+klogm thisboundisalsoachievableinthelimitascβ →∞.Onthe n max |sdif| |sdif| β(1+η), otherhand,Corollary1revealsthatfortheoptimaldecoder, (cid:16) (cid:17) ≥sdif(cid:54)=∅ 1log 1+ 1 (cid:80) b2 the coefficient to klogp can be arbitrarily small provided 2 σ2 i∈sdif i that c is large enough. More precisely, applying some (64) β simple manipulations to (62), we find that the coefficient to wherethefinalterminthenumeratorarisesfrom(26)since klogpissup α ,whereαrepresentsthera- Pβs(bs)isthesameforallpermutationsof(b1,...,bk),and α∈(0,1] 12log(1+cβα) islowerboundedbym−k.Observethatthefirstterminthe tio |sdif|.Itiseasytoverifythatthemaximumisachievedat β k innumtheeractoorrolbleahryavsetsateams eΘn(t|,sadnifd|ltohgeps)ecfoorndeatcehrmofbethheavceassaess αLA=SS1O,ypierlodvianbglythyeieclodnssatasnutbloogp(t1i2+mcaβl)c.oWnsetacnotnwclhuednectha>tth1e, Θ(cid:0)logk+|sdif|log|sdkif|(cid:1). andfailstoachievetheoptimallogarithmicdecay.Hoβwever,

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.