Relabeling and Summarizing Posterior Distributions in Signal Decomposition Problems when the Number of Components is Unknown

Alireza Roodaki, Julien Bect and Gilles Fleury

Abstract—This paper addresses the problems of relabeling and summarizing posterior distributions that typically arise, in a Bayesian framework, when dealing with signal decomposition problems with an unknown number of components. Such posterior distributions are defined over a union of subspaces of differing dimensionality and can be sampled from using modern Monte Carlo techniques, for instance the increasingly popular RJ-MCMC method. No generic approach is available, however, to summarize the resulting variable-dimensional samples and extract from them component-specific parameters. We propose a novel approach, named Variable-dimensional Approximate Posterior for Relabeling and Summarizing (VAPoRS), to this problem, which consists in approximating the posterior distribution of interest by a "simple"—but still variable-dimensional—parametric distribution. The distance between the two distributions is measured using the Kullback-Leibler divergence, and a Stochastic EM-type algorithm, driven by the RJ-MCMC sampler, is proposed to estimate the parameters. Two signal decomposition problems are considered, to show the capability of VAPoRS both for relabeling and for summarizing variable-dimensional posterior distributions: the classical problem of detecting and estimating sinusoids in white Gaussian noise on the one hand, and a particle counting problem motivated by the Pierre Auger project in astrophysics on the other hand.

Index Terms—Bayesian inference; Signal decomposition; Trans-dimensional MCMC; Label-switching; Stochastic EM.

Alireza Roodaki is with LTCI, CNRS, Télécom ParisTech, Paris, France. Julien Bect and Gilles Fleury are with E3S—SUPELEC Systems Sciences, Department of Signal Processing and Electronic Systems, SUPELEC, Gif-sur-Yvette, France. The results presented here are also part of the Ph.D. thesis of the first author [1].

I. INTRODUCTION

Nowadays, owing to the advent of Markov Chain Monte Carlo (MCMC) sampling methods [2–5], Bayesian data analysis is considered a conventional approach in machine learning, signal and image processing, and data mining problems, to name but a few. Nevertheless, in many applications, practical challenges remain in the process of extracting, from the generated samples, quantities of interest to summarize the posterior distribution.

Summarization consists, loosely speaking, in providing a few simple yet interpretable parameters and/or graphics to the end-user of a statistical method. For instance, in the case of a scalar parameter with a unimodal posterior distribution, measures of location and dispersion (e.g., the empirical mean and the standard deviation, or the median and the interquartile range) are typically provided in addition to a graphical summary of the distribution (e.g., a histogram or a kernel density estimate). In the case of multimodal distributions, summarization becomes more difficult but can be carried out using, for instance, an approximation of the posterior by a Gaussian Mixture Model (GMM) [6]. Summarizing or approximating posterior distributions has also been used in designing proposal distributions of Metropolis-Hastings (MH) samplers in an adaptive MCMC framework; see, e.g., [7–9].

This paper addresses the problem of summarizing posterior distributions in the case of some trans-dimensional problems (i.e., "problems in which the number of things that we don't know is one of the things that we don't know" [10, 11]). More specifically, we concentrate on the problem of signal decomposition when the number of components is unknown, which is an important case of trans-dimensional problem. Examples of such problems include the detection and estimation of sinusoids in white Gaussian noise [12] and the related problem of estimating directions of arrival in array processing [13], the detection of objects in images [14, 15], and the detection of physical particles (neutrons, muons, ...) using noisy data from various types of sensors, for instance in spectroscopy [16] or astrophysics [17, 18].

Let y = (y_1, y_2, ..., y_N)^t be a vector of N observations, where the superscript t stands for vector transposition. In signal decomposition problems, the model space is a finite or countable set of models, M = {M_k, k ∈ K}, where k denotes the number of components and K ⊂ ℕ the set of its possible values. It is assumed here that, under M_k, there are k components with vectors of component-specific parameters θ_{1:k} = (θ_1, ..., θ_k) ∈ Θ^k, where Θ ⊆ ℝ^d and Θ^0 = {∅}. One feature that the problems we are considering have in common is the invariance of the likelihood p(y | k, θ_{1:k}) with respect to permutations (relabeling) of the components, which is called the "label-switching" issue in the literature; see, e.g., [19–23]. We will discuss this issue further in Section I-A.

In a Bayesian framework, a joint posterior density f(k, θ_{1:k}) ≜ p(k, θ_{1:k} | y) is obtained through Bayes' formula for the number k of components and the vector of component-specific parameters, after assigning prior distributions on them:

    f(k, θ_{1:k}) ∝ p(y | k, θ_{1:k}) p(θ_{1:k} | k) p(k),    (1)

where ∝ indicates proportionality. This density is defined over a variable-dimensional space Θ̄, which is a union of subspaces of differing dimensionality, i.e., Θ̄ = ∪_{k≥0} {k} × Θ^k.

The posterior density (1) completely describes the information (and the associated uncertainty) provided by the data y about the candidate models and the vector of unknown parameters. Since it is only known up to a normalizing constant in most cases, Monte Carlo simulation methods, such as the Reversible Jump MCMC (RJ-MCMC) sampler [10], have been widely used to approximate it.

A. The label-switching issue

One of the most challenging issues when attempting to summarize posterior distributions, which occurs even in fixed-dimensional situations, is the label-switching phenomenon (see, e.g., [19–23]), which is caused by the invariance of both the likelihood and the prior distribution under permutations of the components. As a consequence, the component-specific marginal posterior distributions are all equal, and therefore useless for the purpose of summarizing the information contained in the posterior distribution about individual components.

The simplest way of dealing with the label-switching issue is to introduce an Identifiability Constraint (IC), such as sorting the components with respect to one of their parameters; see [19] for more discussion concerning the use of ICs in the problem of Bayesian analysis of GMMs. However, in most practical examples, choosing an appropriate IC manually is not feasible. Many relabeling algorithms have therefore been developed to "undo" the label-switching effect automatically, but all of them are restricted to the case of fixed-dimensional posterior distributions; see [23–25] for recent advances and references.

In variable-dimensional posterior distributions, there is an extra uncertainty about the "presence" of components, in addition to their location. This challenging problem has hindered previous attempts to undo label-switching in the variable-dimensional scenario, where, according to [26], "the meaning of individual components is vacuous". This argument will be clarified in the following illustrative example.

B. Illustrative example: joint Bayesian detection and estimation of sinusoids in white Gaussian noise

In this example, it is assumed that under M_k the observed signal y is composed of k sinusoidal components observed in white Gaussian noise. That is, under M_k,

    y[i] = Σ_{j=1}^{k} ( a_{c,j} cos(ω_j i) + a_{s,j} sin(ω_j i) ) + n[i],

where a_{c,j} and a_{s,j} are the cosine and sine amplitudes, and ω_j is the radial frequency of the jth sinusoidal component. Moreover, n is a white Gaussian noise of variance σ². The unknown parameters are the number k of sinusoidal components, the vectors θ_j = (a_{c,j}, a_{s,j}, ω_j) of component-specific parameters, 1 ≤ j ≤ k, and the noise variance σ². Thus, Θ = ℝ² × (0, π) and Θ̄ = (∪_{k≥0} {k} × Θ^k) × ℝ₊. We use the hierarchical model, prior distributions, and the RJ-MCMC sampler proposed in [12] for this problem; the interested reader is thus referred to [10, 12] for more details¹.

[Footnote 1: In fact, the acceptance ratio of the "Birth-or-Death" moves provided in the seminal paper [12] is erroneous. See [1, Chapter 1] or [27] for a justification and the true expression of the acceptance ratio, which is used in this paper.]

Figure 1: Posterior distributions of k (left) and of the sorted radial frequencies ω_{1:k} given k (right), from 100 000 output RJ-MCMC samples. The true number of components is three. The vertical dashed lines in the right panel locate the true radial frequencies.

Figure 1 represents the posterior distributions of both the number k of components and the sorted radial frequencies ω_{1:k} = (ω_1, ..., ω_k)^t given k, obtained using 100 000 samples generated by the RJ-MCMC sampler. Note that, here, we used sorting to mitigate the effect of label-switching for visualization. Each row is dedicated to one value of k, for 2 ≤ k ≤ 4. Observe that other models have negligible posterior probabilities, since p(2 ≤ k ≤ 4 | y) = 0.981. In the experiment, the observed signal of length N = 64 consists of three sinusoids with energies A_{1:k} = (20, 6.32, 20)^t, where A_j = a²_{c,j} + a²_{s,j}, phases φ_{1:k} = (0, π/4, π/3)^t, where φ_j = −arctan(a_{s,j}/a_{c,j}), and true radial frequencies ω_{1:k} = (0.63, 0.68, 0.73)^t. The SNR, ‖D a_{1:k}‖² / (N σ²), where a_{1:k} = (a_{c,1}, a_{s,1}, ..., a_{c,k}, a_{s,k})^t and D is the N × 2k "design matrix" of sines and cosines associated to ω_{1:k}, is set to the moderate value of 7 dB.

Roughly speaking, two approaches co-exist in the literature for summarizing variable-dimensional posterior distributions: Bayesian Model Selection (BMS) and Bayesian Model Averaging (BMA). The BMS approach ranks models according to their posterior probabilities p(k | y), selects one model, denoted by M_{k_MAP} here, where MAP stands for Maximum A Posteriori, and then summarizes the posterior distribution of the component-specific parameters under the (fixed-dimensional) selected model. This comes at the price of losing the valuable information provided by the other (discarded) models. For instance, in the example of Figure 1, all information about the small—and therefore harder to detect—middle component is lost by selecting the most a posteriori probable model M_2. On the other hand, the BMA approach consists in reporting results that are averaged over all possible models. Although the BMA approach is suitable for signal reconstruction and prediction purposes (see, e.g., [28] and references therein), it is not appropriate for studying component-specific parameters, the number of which changes from model to model². More information concerning these two approaches can be found in [10, 28] and references therein.
VAPORS concerningthesetwoapproachescanbefoundin[10,28]and We assume that the target posteriordistribution, defined on references therein. the variable-dimensionalspace QQQ = {k}×Q k, admits a Tothebestofourknowledge,nogenericmethodiscurrently k∈K probability density function (pdf) fSwith respect to the kd- available that would allow to summarize the information that dimensional Lebesgue measure on each {k}×Q k, k∈K . is so easily read on Figure 1 for this very simple example: Ourobjectiveis to approximatethe true posteriordensity f namely, that there seem to be three sinusoidal components in using a “simple” parametric model qhhh , where hhh is the vector the observed noisy signal, the middle one having a smaller of parameters defining the model. The pdf qhhh will also be “probability of presence” than the others. defined on the variable-dimensional space QQQ (i.e., it is not a fixed-dimensional approximation as in the BMS approach). We assume that a Monte Carlo sampling method—e.g., an C. Outline of the paper RJ-MCMC sampler [10]—is available to generate M sam- Inthispaper,weproposeanovelapproach,namedVariable- ples from f, which we denote by qqq (i) = k(i),qqq (i) , for dimensional Approximate Posterior for Relabeling and Sum- 1:k(i) i = 1,...,M. (cid:0) (cid:1) marizing(VAPoRS),forrelabelingandsummarizingposterior distributionsdefined overvariable-dimensionalsubspaces that typically arise in signal decomposition problems when the A. Variable-dimensionalparametric model number of components is unknown. It consists in approxi- Insteadoftryingtodescribetheproposedparametricfamily mating the true posteriordistributionwith a parametricmodel of densities {qh } directly, let us now adopt a generative (ofvarying-dimensionality),byminimizationoftheKullback- point of view, i.e., let us describe how to sample an QQQ - Leibler (KL) divergence between the two distributions. 
A valued random variable qqq =(k,qqq ) from the corresponding 1:k Stochastic Expectation Maximization (SEM)-type algorithm probabilitydistribution.We assumethata positiveintegerLis [29–31], driven by the output of an RJ-MCMC sampler, is given, which represents the number of “components” present used to estimate the parameters of the approximate model. in the posterior. VAPoRS shares some similarities with the relabeling algo- First we generate, independently for each of the L com- rithms proposed in [20, 24, 25] to solve the label switching ponents, a binary indicator variable x ∈{0,1} drawn from l problem, and also with the EM-type algorithm used in [8] in the Bernoulli distribution Ber(p l), where x l = 1 indicates the context of adaptive MCMC algorithms (both in a fixed- that the corresponding component is present (otherwise, it is dimensional setting). The main contribution of this paper is absent) in qqq . The actual number k of componentsin the gen- theintroductionofanoriginalvariable-dimensionalparametric erated samples is thus defined as k=(cid:229) L x . The parameter l=1 l model,whichallowstotackledirectlythedifficultproblemof p ∈(0,1]willbecalledthe“probabilityofpresence”ofthelth l approximatingadistributiondefinedoveraunionofsubspaces component. of differingdimensionality—andthusprovidesa first solution Second, given the vector of indicator variables xxx = to the “trans-dimensional label-switching” problem, so to (x ,...,x ), a Q -valued random vector is generated for each 1 L speak. component that is present (i.e., for each l such that x =1). l Perhaps, the algorithm that we propose can be seen as a Thisrandomvectorisgeneratedaccordingtosomeprobability realization of the idea that M. 
Stephens had in mind when he distributionassociatedtothecomponent,thatwillbeassumed stated [32, page 94]: to be a Q -truncated d-dimensional Gaussian distribution with “This raises the question of whether we might be able to mean mmm and covariance matrix SSS in this paper3. In order l l obtainanalternativeviewofthe[variable-dimensional]poste- to achieve the required invariance with respect to component riorbycombiningtheresultsforalldifferentk’s,andgrouping relabeling, the generated vectors are randomly4 arranged in a together components which are “similar”, in that they have vector qqq =(qqq ,...,qqq ). 1:k 1 k similar predictive density estimates. However, attempts to do Contemplatingtheposteriordistributionsofthesortedradial this have failed to produce an easily interpretable results.” frequenciesdepictedintherightpanelofFigure1,particularly The paper is organized as follows. Section II introduces the plots related to the models with three and four sinusoidal the proposed model and stochastic algorithm for relabeling components, it can be observed that there are “diffuse parts” and summarizing variable dimensionalposterior distributions. intheRJ-MCMCoutputsamplesresultingintheheavyasym- Section III illustrates the performance of VAPoRS using two metrictailsofsomecomponents.Itisclearthatamodelmade signal decomposition examples, namely, the problem of joint of Gaussian components only is not capable of describing BayesiandetectionandestimationofsinusoidsinwhiteGaus- these diffuse samples, at least not in a parsimonious way. sian noise and the problem of joint Bayesian detection and These abnormal observations, with respect to the bulk of the estimation of particles in the Auger project (in astrophysics). observeddata, or, simply outliers, can adverselyinfluencethe Section IV confirms the performances of VAPoRS using a Monte Carlo experiment. 
Finally, Section V concludes the 3Note that any d-dimensional parametric family of distributions could be used at this point. As often in the literature [8, 20, 24, 25], the Gaussian paper and gives directions for future work. distribution is chosen here as a convenient mean of describing a “compact” andunimodald-dimensional distribution, nothingmore. 2See, however, the intensity plot provided in Section III (Figure 7) as an 4More precisely, a permutation of the k components that are present is exampleofaBMAsummaryrelated toacomponent-specific parameter. drawnuniformlyinthesetofallpermutations. 4 l p The distribution of the allocation vector z is l qhhh (z) = qhhh (z|xxx )qhhh (xxx ), (3) x x mmm SSS where qhhh (xxx ) is given in (2). Note that xxx is a deterministic L+1 l l l function of z: xxx =n(z), with n(z)=(cid:229) k 1 , for 1≤l≤ l j=1 zj=l L+1. To compute the first term of (3), remember that the l=1,2,···,L points generated by the components of the parametric model arerandomlyarrangedinqqq .Therefore,forall xxx ∈{0,1}L× 1:k N such that (cid:229) L+1x =k, qqq l=1 l x ! Figure 2: Proposed variable-dimensional parametric model in a qhhh (z|xxx ) = Lk+!1 1xxx =n(z), (4) generative viewpoint. since two arrangements that differ only by the position of the points corresponding to the PPP give rise to the same allocation vector. processoffittingtheapproximateposteriortothetrueposterior distribution of interest and consequently lead to meaningless The conditional distribution qhhh (qqq |z) reads parameter estimates. k Toovercomethisrobustnessissue,weproposetoincludein qhhh (qqq |z) = (cid:213) qhhh (qqq j|zj), (5) themodela“noise-like”PoissonPointProcess(PPP;see,e.g., j=1 [33]) to account for the presence of outliers in the observed where samples. We assume that the PPP is homogeneous5 on Q , bwyiththeinPtePnPsittyhuls/f|oQ ll|o.wTshae Pnouimssboenr doifstcriobmutpioonnewntisthgmeneearnateld. 
qhhh (qqq j|zj) = N (cid:16)qqq j|mmm zj,SSS zj(cid:17) if zj ≤ L, (6) To be consistent with our previous notations, we denote by |Q1| if zj = L+1. x L+1∈Nthisnumber;notethatx 1,...,x Lstilltaketheirvalues Therefore, from Equations (2) to (6), we have in{0,1}.The(extended)vectorxxx thusfollowstheprobability distributpio(nxxx |ppp ,l ) = e−l ·l xL+1 (cid:213) L p xl(1−p )(1−xl). (2) qhhh (qqq ,z) = ek−!l (cid:18)|Ql |(cid:19)xL+1 1≤(cid:213) j≤k N (cid:16)qqq j|mmm zj,SSS zj(cid:17) x L+1! l=1 l l zj6=L+1 L Given xxx , x L+1 random samples are generated uniformly × (cid:213) p lxl(1−p l)(1−xl) 1Z (z), (7) on Q and inserted randomly among the samples drawn from l=1 rthhhhane=dGo(mahhhu1s,vs.ai.ar.nia,hhhbcloLe,mlqqqp)o=nane(ndkts,hhh.qqqlW1:=ke)(dmmmthelna,toSSS tlie,spblty)h.uqsFhhhiggutehrneeer2padtepfdro,ovfwidtihethes wnvelh(czet)orer∈s({x(i01.,e,1..,.}.t,h,fexoLrs+e11t)≤o=flan≤l(lzLz))∈a.n∪dkZ∈Ki{skt}he×sLetkofsuacllhatlhloactaxtlio=n the directed acyclic graph of the model. C. Estimating the model parameters B. Distribution of the labeled samples A random variable qqq = (k,(qqq ,...,qqq )) drawn from the We propose to fit the parametric distribution qhhh to the 1 k posterior f of interest by minimizing a divergence measure density qhhh can be thoughtof as an “unlabeled sample”, since the label l ∈ L , {1,...,L+1} of the component from from f to qhhh . We use the KL divergence as a divergence measure in this paper, though other divergence measures can which each qqq (1 ≤ j ≤ k) originates cannot be recovered j be used as well6. fromqqq itself.Letusnowintroducethe(variable-dimensional) Denoting the KL divergence from f to qhhh by allocation vector DKL(f(qqq )kqhhh (qqq )), we define the criterion to be minimized z = (k,(z1,...,zk)) ∈ {k}×Lk, as k∈[K f(qqq ) which provides the missing piece of information: zj = l J(hhh ) , DKL(f(qqq )kqhhh (qqq )) = ZQQQ f(qqq )logqhhh (qqq ) dqqq . 
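Note that evaluating the density q_η at a sample θ requires, by (5)–(7), summing the joint density q_η(θ, z) over all valid allocation vectors z. For small L and k this sum can be computed exhaustively, as in the following univariate sketch. This is our own illustration: the Θ-truncation of the Gaussian components is ignored, and all names are ours.

```python
import numpy as np
from itertools import product
from math import factorial, exp, pi as PI

def gauss_pdf(x, m, s):
    # Univariate Gaussian density N(x | m, s^2).
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def q_marginal(theta, mu, sd, pi_l, lam, theta_size=PI):
    """Evaluate q_eta(theta) by summing (7) over all valid allocations z."""
    k, L = len(theta), len(mu)
    total = 0.0
    for z in product(range(L + 1), repeat=k):     # label L stands for the PPP
        gauss = [l for l in z if l < L]
        if len(set(gauss)) < len(gauss):          # each Gaussian used at most once
            continue
        w = (exp(-lam) / factorial(k)) * (lam / theta_size) ** (k - len(gauss))
        for x, l in zip(theta, z):                # Gaussian factors of (7)
            if l < L:
                w *= gauss_pdf(x, mu[l], sd[l])
        for l in range(L):                        # Bernoulli presence factors
            w *= pi_l[l] if l in gauss else 1.0 - pi_l[l]
        total += w
    return total
```

For larger L and k, the number of terms in this sum grows combinatorially, which is precisely the computational difficulty discussed next.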
Using the samples generated by the RJ-MCMC sampler, this criterion can be approximated as

    J(η) ≈ Ĵ_M(η) = −(1/M) Σ_{i=1}^{M} log q_η(θ^(i)) + C,    (8)

where C is a constant that does not depend on η. One should note that minimizing Ĵ_M(η) amounts to choosing

    η̂ = argmax_η Σ_{i=1}^{M} log q_η(θ^(i)).    (9)

To estimate the model parameters η, one of the most extensively used algorithms for Maximum Likelihood (ML) parameter estimation in latent variable models is the EM algorithm proposed by [35]. However, it turns out that the EM algorithm, which has been used in similar works [8, 20, 24], is not appropriate for solving this problem, as computing the expectation in the E-step is intricate. More explicitly, in our problem the computational burden of the summation in the E-step over the set of all possible allocation vectors z increases very rapidly with both L and k. In fact, even for moderate values of L and k, say, L = 15 and k = 10, the summation is far too expensive to compute, as it involves Σ_{m=0}^{k} L!/(L−k+m)! ≈ 1.3·10^10 terms.

In this paper, we propose to use the SEM algorithm [29–31], a variant of the EM algorithm in which the E-step is substituted with stochastic simulation of the latent variables from their conditional posterior distributions given the previous estimates of the unknown parameters. In other words, at iteration r+1 of the SEM algorithm, denoting the estimated parameters at iteration r by η̂^(r), the allocation vectors z^(i), for i = 1, ..., M, are drawn from q_{η̂^(r)}(· | θ^(i)). This step is called the Stochastic (S-)step. Then, these random samples are used to construct the so-called pseudo-completed log-likelihood.

Exact sampling from q_{η̂^(r)}(· | θ^(i)), as required by the S-step of the SEM-type algorithm, is unfortunately not feasible—not even using the accept-reject algorithm, owing to the heavily combinatorial expression of the normalizing constant q_{η̂^(r)}(θ^(i)). Instead, since

    q_{η̂^(r)}(z^(i) | θ^(i)) ∝ q_{η̂^(r)}(θ^(i), z^(i))

can be computed up to a normalizing constant, we choose to use an Independent Metropolis-Hastings (IMH) step with q_{η̂^(r)}(z^(i) | θ^(i)) as its stationary distribution; see [1, Algorithm 2.2] for more details. The proposed SEM-type algorithm is summarized in Figure 3.

Figure 3: Proposed SEM-type algorithm.
  At the (r+1)th iteration, do:
  (S-step) Draw allocation vectors z^(i,r+1), 1 ≤ i ≤ M, using an IMH step with target q_{η̂^(r)}(· | θ^(i)).
  (E-step) Construct the pseudo-completed log-likelihood Ĵ_M(η) = −Σ_{i=1}^{M} log q_η(θ^(i), z^(i,r+1)).
  (M-step) Estimate η̂^(r+1) = argmin_η Ĵ_M(η).

Remark 1. It would also be possible to assign prior distributions over the unknown parameters η and study their posterior distributions (for example, using an MCMC sampler with the latent variable z added to the state of the chain, in the spirit of the "data augmentation" algorithm [36]). This would, however, leave the label-switching issue unsolved (because of the invariance of q_η to permutations of its components).

Remark 2. Convergence results for the SEM algorithm in its general form are provided by [31] and, in the particular case of mixture analysis problems, by [37]. Unfortunately, the assumptions in [31, 37] do not hold in the problem we are dealing with since 1) the observed samples θ^(i) are correlated, owing to the fact that they are generated from the true posterior distribution using an MCMC method, e.g., the RJ-MCMC sampler, and 2) an IMH sampler is used to draw z^(i) from the conditional posterior distribution. Nevertheless, empirical evidence of the "good" convergence properties of the proposed SEM-type algorithm will be provided in the next two sections.

[Footnote 5: Homogeneity is assumed here for the sake of simplicity, but more elaborate (non-homogeneous) models are easily accommodated by our approach, if needed.]
[Footnote 6: See [1, Chapter 2], where another divergence measure, proposed by [34], has been used for this problem for robustness reasons.]

D. Robustified algorithm

Preliminary experiments with the SEM-type algorithm described in Figure 3 were not satisfactory, because the sample mean and (co)variance estimates in the M-step, obtained from minimizing the KL divergence from the posterior distribution f to the parametric model q_η, still suffer from sensitivity to the outliers in the observed samples, even after including the Poisson point process component. As a workaround, we propose to use robust estimates [38] of the means and (co)variances of the Gaussian distributions instead of the empirical means and (co)variances in the M-step. For example, in the case of univariate Gaussian distributions, one can use the median and the interquartile range as robust estimators of the mean and variance, respectively. See [1, Section 2.5] for more discussion of the robustness issue, including an alternative solution using the "robust divergence" of [34].

Remark 3. Similar robustness concerns are widespread in the clustering literature; see, e.g., [39] and references therein.

III. ILLUSTRATIVE EXAMPLES

In this section, we investigate the capability of VAPoRS for summarizing variable-dimensional posterior distributions using two signal decomposition examples: 1) joint Bayesian detection and estimation of sinusoids in white Gaussian noise [12], and 2) joint Bayesian detection and estimation of astrophysical particles in the Auger project [17, 18]; see [1, Chapters 3 and 4] for more results and discussion. We emphasize again that the output of the trans-dimensional Monte Carlo sampler—e.g., the RJ-MCMC sampler in this paper—is considered as the observed data for VAPoRS.
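For the small values of L and k arising in the examples below, the conditional distribution q_η(z | θ) can in fact be sampled exactly, by enumerating all valid allocation vectors and weighting them according to (7); this sidesteps the IMH step. The following univariate sketch of one SEM-type iteration is built on that idea, together with the robust M-step (median and interquartile range) of Section II-D. It is our own simplified illustration, not the implementation of [1, Algorithm 2.2], and all names are ours.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(42)

def gauss_pdf(x, m, s):
    # Univariate Gaussian density N(x | m, s^2).
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def sem_iteration(samples, mu, sd, pi_l, lam, theta_size=np.pi):
    """One SEM-type iteration on variable-dimensional samples.

    S-step: exact sampling of each allocation vector z by enumerating all
    valid allocations and weighting them according to (7).
    M-step: robust updates (median, interquartile range), Section II-D.
    """
    L, M = len(mu), len(samples)
    pts = [[] for _ in range(L)]       # points allocated to each Gaussian
    present = np.zeros(L)              # presence counts per component
    n_noise = 0                        # points allocated to the PPP

    for theta in samples:
        k = len(theta)
        zs, ws = [], []
        for z in product(range(L + 1), repeat=k):   # label L stands for the PPP
            gauss = [l for l in z if l < L]
            if len(set(gauss)) < len(gauss):        # each Gaussian used at most once
                continue
            w = (lam / theta_size) ** (k - len(gauss))
            for x, l in zip(theta, z):
                if l < L:
                    w *= gauss_pdf(x, mu[l], sd[l])
            for l in range(L):                      # Bernoulli presence factors
                w *= pi_l[l] if l in gauss else 1.0 - pi_l[l]
            zs.append(z)
            ws.append(w)
        ws = np.asarray(ws)
        z = zs[rng.choice(len(zs), p=ws / ws.sum())]
        for x, l in zip(theta, z):
            if l < L:
                pts[l].append(x)
            else:
                n_noise += 1
        for l in set(z):
            if l < L:
                present[l] += 1

    for l in range(L):                 # robust M-step
        if len(pts[l]) >= 2:
            mu[l] = np.median(pts[l])
            q75, q25 = np.percentile(pts[l], [75, 25])
            sd[l] = max((q75 - q25) / 1.349, 1e-6)
    return mu, sd, present / M, n_noise / M
```

Iterating this function over the RJ-MCMC output, starting from the naive initialization described below, mimics the overall structure of the robustified algorithm; the factor 1.349 converts the interquartile range of a Gaussian into its standard deviation.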
Joint Bayesian detection and estimation of sinusoids in q (z(i) | qqq (i)) as its stationary distribution; see [1, Algo- white Gaussian noise hhhˆ(r) rithm2.2]formoredetails.TheproposedSEM-typealgorithm Let us consider the problem of detection and estimation is summarized in Figure 3. of sinusoidal components introduced in Section I-B where 6 m s2 Gaussian Comp. #1 Gaussian Comp. #2 0.75 10−2 40 m =0.62,s=0.017,p =1.00 m =0.68,s=0.021,p =0.22 0.7 20 10−3 0.65 df 20 df 0.6 p 10−4 l p p 10 1 0.5 0 0 0.75 0.5 0.75 1 0.5 0.75 1 0.25 Gaussian Comp. #3 point process Comp. 0.5 0.25 ˆ 0 m =0.73,s=0.011,p =0.97 4 l =0.33 JM 40 1 y 0 df nsit 2 p 20 e −1 nt i −2 −3 0 0 0 20 40 60 80 100 0.5 0.75 1 0 1 2 3 SEM iteration w w Figure 4: Evolution of the model parameters along with the crite- Figure 5: Histogram of the labeled samples, that is, the samples rionJˆM definedin(8)using100iterationsofVAPoRSwithL = 3 allocated to the Gaussian and Poisson point process components, on the RJ-MCMC output samples shown in Figure 1. versus the pdf’s of estimated Gaussian components in the model (black solid line) using VAPoRS on the sinusoid detection example. The estimated parameters of each component are presented in the corresponding panel. the unknown parameters are the number k of components, the component-specific parameters (a ,a ,w ), 1≤ j ≤k, c,j s,j j and the noise variance s 2. Since the amplitudes and the noise variance can be analytically integrated out, we focus on summarizing the joint posterior distribution p(k,www |y) of atrans-dimensionalsetting.Figures5showsthehistogramsof 1:k the labeled samples, i.e., (qqq (i),z(i)), with i=1,...,M, along the form illustrated in Figure 1. 
Therefore, we assume that with the pdf’s of the estimated Gaussian components (black the proposed parametric model introduced in Section II-A consists of univariate Gaussian components, with means m , solidline).Moreover,thesummariesprovidedbyVAPORSfor l variances s2, and probabilities of presence p , 1 ≤ l ≤ L, each componentare presented in its correspondingpanel. We l l used the average of the last 50 SEM iterations as parameter to be estimated. Moreover, the space of component-specific parameters is Q =(0,p )⊂R. estimates, as recommended in the SEM literature; see, for example, [30, 31]. Comparing the distributions of the labeled Before launching VAPoRS, we need first to initialize the samples with the ones of the posterior distributions of the parametric model. It is natural to deduce the number L of sortedradialfrequenciesgivenk=3showninFigure1,which Gaussian components from the posterior distribution of k. Here,we setitto the90th percentileof p(k|y)to keepallthe are highly multimodal, reveals the capability of VAPoRS in solving label-switching in a variable-dimensionalsetting. probablemodelsintheplay.ToinitializetheGaussiancompo- nents’parameters,i.e., m l ands2l,1≤l≤L,weusedtherobust Looking at the bottom right panel of Figure 5, the role estimatesofthemeansandvariancesofthemarginalposterior of the point process component in capturing the outliers in distributions of the sorted radial frequencies given k = L. the observed samples, which cannot be described by the Finally, we set p l=0.9, for 1≤l≤L, and l =0.1. Gaussiancomponents,becomesclearer.Notethat,withoutthe We ran the “robustified” stochastic algorithm introduced in pointprocess component,these outlierswould be allocated to Section II on the specific example shown in Figure 1, for the Gaussian components and would, consequently, induce a 100 iterations, with L= 3 Gaussian components (note that significant deterioration of the parameter estimates. 
the posterior probability of {k ≤ 3} is approximately 90.3%).

To assess the convergence of VAPoRS, Figure 4 illustrates the evolution of the model parameters θ together with the criterion J. Two substantial facts supporting the convergence of VAPoRS can be deduced from this figure: first, the decreasing behavior of the criterion Ĵ_M, which is almost constant after the 10th iteration; second, the convergence of the parameters of the parametric model, particularly the means m_l and the probabilities of presence p_l, 1 ≤ l ≤ L, even though we used a naive initialization procedure. Indeed, after the 40th iteration there is no significant move in the parameter estimates.

Table I presents the summaries provided by VAPoRS along with the ones obtained using the BMS approach. Contrary to the BMS approach, VAPoRS has enabled us to benefit from the information of all probable models to give summaries about the middle, harder-to-detect component. Turning to the results of VAPoRS, it can be seen that the estimated means are compatible with the true radial frequencies. Furthermore, the estimated probabilities of presence are consistent with the uncertainty visible in the variable-dimensional posterior shown in Figure 1.

    Comp.   m_l     s_l      p_l     m^BMS   s^BMS
    1       0.62    0.017    1.00    0.62    0.016
    2       0.68    0.021    0.22    --      --
    3       0.73    0.011    0.97    0.73    0.012

Table I: Summaries of the variable-dimensional posterior distribution shown in Figure 1; VAPoRS vs. the BMS approach.

As discussed in Section I, one of the main objectives of the algorithm we proposed is to solve the label-switching issue in trans-dimensional problems.

To better observe the "goodness-of-fit" of the estimated Gaussian components, the bottom panel of Figure 6 depicts their normalized densities^7, under the posterior distributions of the sorted radial frequencies given k. This figure can be used to validate the coherency of the estimated summaries with the information in the variable-dimensional posterior distribution. It can be seen from the figure that the shapes of the pdf's of the estimated Gaussian components are consistent, in both location and dispersion, with those of the posterior of the sorted radial frequencies.

Figure 6: Posterior distribution of the sorted radial frequencies ω_{1:k} given k (top) and normalized pdf's of the fitted Gaussian components (bottom).

^7 To obtain the normalized densities, we first normalized the estimated pdf's to have their maximum equal to one. Then, we multiplied the normalized pdf of each Gaussian component by its estimated probability of presence. Thus, the maximum of each normalized density is equal to the corresponding estimated probability of presence.

It is also useful, for validating the estimated summaries, to compare the intensity of the estimated parametric model q_θ̂, defined, in general, as

    h_θ(·) = Σ_{l=1}^{L} p_l · N(· | m_l, Σ_l),                        (10)

where we ignore the point-process component, with the histogram intensity of all radial frequencies obtained using the BMA approach (see [1, Chapter 2] for more information). Figure 7 shows such a comparison for the specific example of this section, where the solid black line indicates the intensity of the estimated parametric model. These figures also indicate the "goodness-of-fit" between the fitted approximate posterior and the true one.

Figure 7: Histogram intensity of all radial-frequency samples obtained using the BMA approach, along with the intensity of the fitted parametric model obtained using VAPoRS.

Finally, to validate both the estimated probabilities of presence of the Gaussian components and the mean parameter λ of the Poisson point-process component, Figure 8 illustrates the posterior distribution of the number k of components together with its approximated version obtained using VAPoRS. It can be seen from the figure that VAPoRS has captured well the information provided in the true posterior of the number k of components.

Figure 8: Posterior distribution of the number k of components (black) and its approximated version (gray) obtained from the fitted model.

B. Joint Bayesian detection and estimation of astrophysical particles in the Auger project

As the second illustrative example, we show results on a signal decomposition problem encountered in the international astrophysics collaboration called Auger [17, 18]. The Auger project is aimed at studying ultra-high-energy cosmic rays, with energies on the order of 10^19 eV, the most energetic particles found so far in the universe. The long-term objective of this project is to study the nature of those ultra-high-energy particles and determine their origin in the universe. Nevertheless, they are not observed directly. In fact, when they collide with the earth's atmosphere, a host of secondary particles are generated, some of which, mostly "muons", finally reach the ground. To detect them, the Pierre Auger Cosmic Ray Observatory was built, which consists of two independent detectors: an array of Surface Detectors (SD) and a number of Fluorescence Detectors (FD).

The number of muons and their arrival times can be used as indications of both the chemical composition and the origin of the primary particles (see [17, 18] for more information). Here, we concentrate on the signal decomposition problem, where the goal is to count the number of muons and estimate their individual parameters from the signals observed by the SD detectors. To show results, we use the Bayesian algorithm and the RJ-MCMC sampler developed in [9, 40, 41] for the trans-dimensional problem of joint detection and estimation of muons. In this section, we first briefly describe the problem and then use VAPoRS to relabel and summarize the variable-dimensional output samples of the RJ-MCMC sampler developed by [9, 40, 41].
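The two validation quantities used above, the mixture intensity (10) and the approximated posterior of k in Figure 8, can both be computed directly from the fitted parameters. The Python sketch below illustrates this under our reading of the parametric model (each Gaussian component present independently with its probability of presence, plus a Poisson number of extra components); the means, standard deviations and probabilities of presence are taken from Table I, while the point-process rate `lam = 0.05` is a purely hypothetical placeholder, not a value estimated in the paper.

```python
import math
import numpy as np

def mixture_intensity(w, means, stds, probs):
    """Intensity of the fitted parametric model, eq. (10):
    h(w) = sum_l p_l * N(w | m_l, s_l^2), point-process term ignored."""
    w = np.asarray(w, dtype=float)
    return sum(p * np.exp(-0.5 * ((w - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for m, s, p in zip(means, stds, probs))

def approx_k_posterior(probs, lam, k_max=10):
    """Distribution of the number k of components induced by the fitted
    model: each Gaussian component is present independently with its
    probability of presence p_l, and the Poisson point-process component
    contributes Poisson(lam) additional components."""
    pk = np.zeros(k_max + 1)
    pk[0] = 1.0
    for p in probs:  # convolve the independent Bernoulli(p_l) variables
        pk = (1.0 - p) * pk + p * np.concatenate(([0.0], pk[:-1]))
    pois = np.array([math.exp(-lam) * lam ** j / math.factorial(j)
                     for j in range(k_max + 1)])
    return np.convolve(pk, pois)[:k_max + 1]

# Values from Table I (sinusoid example); lam is a hypothetical placeholder.
means = [0.62, 0.68, 0.73]
stds = [0.017, 0.021, 0.011]
probs = [1.00, 0.22, 0.97]

grid = np.linspace(0.4, 1.0, 2001)
h = mixture_intensity(grid, means, stds, probs)   # curve compared in Figure 7
pk = approx_k_posterior(probs, lam=0.05)          # distribution compared in Figure 8
```

Integrating h over the support recovers the expected total number of Gaussian components, Σ_l p_l = 2.19, which is also the mean of the Bernoulli part of p̂(k); this is the kind of consistency check that Figures 7 and 8 perform graphically.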
When a muon crosses an SD tank, it generates photoelectrons (PE's) along its track that are then captured by detectors and create a discrete observed signal. We denote the vector of observed signals by n = (n_1, ..., n_N) ∈ N^N, where the element n_i indicates the number of PE's deposited by the muons in the time interval

    [t_{i−1}, t_i) := [t_0 + (i−1) t_Δ, t_0 + i t_Δ),

where t_0 is the absolute starting time of the signal and t_Δ = 25 ns is the signal resolution (the length of one bin).

Each muon has two component-specific parameters, namely, the arrival time t_μ and the signal amplitude a_μ. The absorption process of the photons generated by a muon is modeled by a non-homogeneous Poisson point process with intensity [41, Section 2.2]

    h(t | a_μ, t_μ) = a_μ p_{τ,t_d}(t − t_μ),                          (11)

where p_{τ,t_d}(t) is the time response distribution, t_d is the rise-time and τ is the exponential decay (both measured in ns); see Figure 9 (bottom) for such exponentially shaped intensities. Then, the expected number of PE's in the bin i is obtained by integrating the intensity (11) over the corresponding bin:

    n̄_i(a_μ, t_μ) = a_μ ∫_{t_{i−1}}^{t_i} p_{τ,t_d}(t − t_μ) dt.        (12)

Figure 9: (top) Observed signal n. (bottom) Intensity of the model h(t | a_μ, t_μ) defined in (11). There are k = 5 muons in the signal, with the true arrival times t_μ = (105, 169, 267, 268, 498) indicated by vertical dashed lines.

Conditioning on the number k of muons and the vectors of parameters t_μ = (t_{μ,1}, ..., t_{μ,k}) and a_μ = (a_{μ,1}, ..., a_{μ,k}), and assuming that the numbers of PE's in the bins are independent, the likelihood is written as

    p(n | k, t_μ, a_μ) = ∏_{i=1}^{N} p(n_i | n̄_i(k, a_μ, t_μ)),          (13)

where p(n_i | n̄_i(k, a_μ, t_μ)) is a Poisson distribution with mean n̄_i(k, a_μ, t_μ). Then, assuming independence of the muons, the expected number of PE's in the ith bin, i.e., n̄_i(k, a_μ, t_μ), given k, t_μ, and a_μ, becomes

    n̄_i(k, a_μ, t_μ) = Σ_{j=1}^{k} n̄_i(a_{μ,j}, t_{μ,j}).               (14)

We will now illustrate the performance of VAPoRS on a simulated PE counting signal (see [1, Chapter 4] for results on two other simulated experiments). The observed signal of the illustrative example considered here consists of five muons located at t_μ = (105, 169, 267, 268, 498) (see Figure 9). The posterior distributions of the number k of muons and of the sorted arrival times are shown in Figure 10. Note that, in this example, there are two muons with almost equal arrival times, namely the third and fourth muons.

Figure 10: Posterior distributions of the number k of muons (left) and of the sorted arrival times t_μ given k (right), constructed using 60000 RJ-MCMC output samples after discarding the burn-in period. The true number of components is five. The vertical dashed lines in the right panel locate the true arrival times.

Using the BMS approach, the model with four muons would be selected (p(k = 4 | n) = 0.4), although M_5 has an almost similar posterior probability of 0.38. Moreover, observe that the marginal posterior of the arrival time of the third component is bimodal under both M_4 and, more significantly, M_5. We ran VAPoRS with L = 6 Gaussian components on the RJ-MCMC output samples shown in Figure 10 (note that p(k ≤ 6 | n) = 0.94).

Figure 11 shows the histogram of the labeled samples and the estimated parameters of the components. From the figure, it can be seen that the bimodality caused by the label-switching exhibited in Figure 10 is removed completely and that the estimated Gaussian components enjoy reasonable variances. In the presented summary, there are four muons with high probabilities of presence, corresponding to the ones shown in the bottom row of Figure 10. There are also two other muons with comparatively low probabilities of presence.

Figure 11: Histogram of the labeled samples, along with the pdf's of the estimated Gaussian components in the model (black solid line), obtained using VAPoRS with L = 6 on the variable-dimensional posterior shown in Figure 10. The estimated parameters of each component are presented in the corresponding panel: #1 (m = 99.62, s = 4.022, p = 1.00), #2 (m = 172.98, s = 8.163, p = 0.18), #3 (m = 173.38, s = 5.194, p = 1.00), #4 (m = 237.13, s = 8.475, p = 0.37), #5 (m = 260.93, s = 7.154, p = 0.94), #6 (m = 504.36, s = 5.555, p = 1.00); point-process component: λ = 0.42.

In fact, the samples allocated to the point-process component, shown in the bottom row of Figure 11, can be regarded as the residuals of the fitted model, that is, the observed samples that the L Gaussian components in q_θ̂ have not been able to describe. These residuals can be used, as usual in statistics, as a tool for goodness-of-fit diagnostics and model choice. Figure 12 illustrates the histograms of the residuals of the fitted model for different values of L ∈ {3, 4, 6, 8}. It can be seen from the top left panel of Figure 12 that the distribution of the residuals corresponding to the case L = 3 contains a few "significant" peaks. The peaks are gradually removed by adding Gaussian components. When L = 4, a component is added at t_μ = 261 that captures the samples distributed around the most significant peak of the top left panel of Figure 12. However, there still exist a few peaks, particularly around t_μ = 173, which are captured when L ≥ 6. The distributions of the residuals for the cases L = 6 and L = 8 do not differ significantly. Note the decrease of the value of λ̂ as L increases (λ̂ = 1.91, 0.92, 0.42, and 0.24 for L = 3, 4, 6, and 8, respectively).

Figure 12: Histograms of the residuals of the fitted model using VAPoRS with different values of L ∈ {3, 4, 6, 8}.

Figure 13 compares the normalized intensities of the estimated Gaussian components for 6 ≤ L ≤ 9. It can be seen from the figure that changing L in a reasonable range, say 6 ≤ L ≤ 9, does not significantly influence the final inference. In all cases, the six Gaussian components that were estimated in the case L = 6 are present. Moving from L = 6 to L = 9, additional Gaussian components are added to the obtained summary with very low probabilities of presence, which improves the fit but does not change much the final inference.

Figure 13: Normalized pdf's of the fitted Gaussian components using VAPoRS with different values of 6 ≤ L ≤ 9.

IV. MONTE CARLO EXPERIMENT

The examples of Section III have illustrated the capability of VAPoRS to relabel and summarize variable-dimensional posterior distributions encountered in two signal decomposition problems. In order to confirm these findings, we will now investigate more systematically, by means of a Monte Carlo simulation experiment, how faithfully the approximate posterior distribution preserves certain features of the true posterior distribution.

One hundred realizations of the sinusoid detection experiment described in Section I-B (see Figure 1) were simulated and analyzed using the same RJ-MCMC sampler as before. The number of RJ-MCMC iterations was set to 100000, and the first 20000 samples were discarded as the burn-in period. Then, the samples were thinned to one in every five. To initialize the parametric model q_θ in a systematic fashion, we set L to the largest k such that its posterior probability is not less than 0.05. Then, during the process of the SEM-type algorithm, if a sufficient number of samples, say 10, is not allocated to a Gaussian component (or, equivalently, if its probability of presence fades to zero), we will remove it from the
parametric model and decrease L by one. Using this approach generally results in approximate posterior distributions that are "richer" than those provided by the BMS approach^8, in the sense that L ≥ k_MAP, where k_MAP = argmax_k p(k | y). To initialize the parameters of the Gaussian components, i.e., the means m_l and variances s_l^2, we used, as previously, robust estimates of the means and variances of the posterior distributions of the sorted radial frequencies given k = L.

^8 Later, in a post-processing step, since each Gaussian component has been endowed with a probability of presence p_l, 1 ≤ l ≤ L, one can decide to discard the ones with p_l smaller than a certain threshold; see [1, Section 3.4.3] for more discussion of this idea.

Figure 14 compares various features of the fitted approximate posterior distribution q_θ̂, obtained using 100 iterations of VAPoRS, with the corresponding features of the true variable-dimensional posterior distribution. These features are described in the rest of this section.

The scatter plots shown in panels (a), (b), and (c) compare the posterior distribution of the number k of components, i.e., p(k | y), with its approximated version, denoted here by p̂(k | y), over the 100 runs. We only show the posterior probabilities of k = 2 and k = 3 in this comparison, as the other probabilities were close to zero. The digits situated to the right of the points in panel (a) indicate the number of occurrences of the corresponding event in the 100 runs, where k̂_MAP = argmax_k p̂(k | y). It can be seen from these three panels that the information in p(k | y) was well preserved by the approximated posterior distributions.

Next, we compare the performance of VAPoRS with that of the "direct" BMA approach^9 in reconstructing the noiseless signal y_0 = D a_{1:k}. To this end, the estimated reconstructed noiseless signal is defined as

    ŷ_0 = E(y_0 | y) = Σ_{k∈K} ∫_{Θ_k} E(y_0 | k, θ_{1:k}, y) p(k, θ_{1:k} | y) dθ_{1:k}.     (15)

In the direct BMA approach, using the samples generated with the RJ-MCMC sampler, the above integral is approximated by

    ŷ_0^BMA = (1/M) Σ_{i=1}^{M} D^{(i)} â^{(i)}_{1:k^{(i)}},

where D^{(i)} is the design matrix built from the ith vector of sampled radial frequencies ω^{(i)}_{1:k^{(i)}}, and â^{(i)}_{1:k^{(i)}} is the posterior mean of the amplitudes given ω^{(i)}_{1:k^{(i)}} and its hyperparameters. To reconstruct the noiseless signal from the fitted approximate posterior q_θ̂ using VAPoRS, one can generate R pairs of samples (k^{(r)}, ω^{(r)}_{1:k^{(r)}}) as explained in Section II-A. Then, we set

    ŷ_0^VAPoRS = (1/R) Σ_{r=1}^{R} D^{(r)} â^{(r)}_{1:k^{(r)}}.

^9 By "direct", we mean that the posterior means are approximated using the RJ-MCMC samples directly, and not using the VAPoRS posterior.

Panel (d) compares the normalized reconstruction errors obtained when using VAPoRS with those of the direct BMA approach, in dB, defined as

    10 log_10 ( ‖ŷ_0 − y_0‖² / ‖y_0‖² ),                                (16)

where ‖·‖ is the L_2-norm and we set ŷ_0 = ŷ_0^BMA and ŷ_0 = ŷ_0^VAPoRS when using the BMA approach and VAPoRS, respectively. It can be seen from the figure that the normalized errors of the noiseless signals reconstructed using the compact summary provided by VAPoRS are quite comparable with those obtained using the BMA approach.

Finally, the scatter plots in the last two panels compare the expected number of components in the intervals (0, π/4) and (π/4, π/2) obtained using VAPoRS with, again, the ones obtained using the direct BMA approach. For the BMA approach, the expected number of components in an interval T ⊂ (0, π) is given by

    E(N(T) | y) = Σ_{k∈K} E(N(T) | k, y) p(k | y) ≈ (1/M) Σ_{i=1}^{M} N^{(i)}(T),

where N^{(i)}(T) is the number of radial frequencies observed in T in the ith sample. On the other hand, from the summary provided by VAPoRS, the expected number of components in the interval T is

    E_θ̂(N(T) | y) = Σ_{l=1}^{L} p̂_l N(T; θ̂_l) + λ̂ ‖Θ_T‖,

where N(T; θ̂_l) denotes the probability of T under the Gaussian distribution with parameters θ̂_l. The figures confirm that the expected numbers of components in the chosen intervals computed using both approaches are very similar.

The results shown in this section confirm that the approximate posterior distribution q_θ̂ obtained using VAPoRS faithfully preserves several important features of the true posterior distribution; see [1, Section 3.4] for more results in this vein, including a numerical comparison of the properties of estimators derived from VAPoRS.

V. CONCLUSION

In this paper, we have proposed a novel algorithm to relabel and summarize variable-dimensional posterior distributions encountered in signal decomposition problems when the number of components is unknown. For this purpose, a variable-dimensional parametric model has been designed to approximate the posterior of interest. The parameters of the approximate model have been estimated by means of an SEM-type algorithm, using samples from the true posterior distribution generated by a trans-dimensional Monte Carlo sampler, e.g., the RJ-MCMC sampler. Modifications of our initial SEM-type algorithm have been proposed in order to cope with the lack of robustness of maximum-likelihood-type estimates.

The relevance of the proposed algorithm, both for summarizing and for relabeling variable-dimensional posterior distributions, has been illustrated on two signal decomposition examples, namely, the problem of detection and estimation of sinusoids in white Gaussian noise and a particle counting problem motivated by the astrophysics project Auger. Most notably, VAPoRS has been shown to be the first approach in the literature capable of solving the label-switching issue in trans-dimensional problems. We have shown that the proposed