Identifiability and optimal rates of convergence for parameters of multiple types in finite mixtures

Nhat Ho and XuanLong Nguyen

Technical report 536
Department of Statistics
University of Michigan
January 9, 2015

arXiv:1501.02497v1 [math.ST] 11 Jan 2015

Abstract

This paper studies identifiability and convergence behaviors for parameters of multiple types in finite mixtures, and the effects of model fitting with extra mixing components. First, we present a general theory for strong identifiability, which extends from the previous work of Nguyen [2013] and Chen [1995] to address a broad range of mixture models and to handle matrix-variate parameters. These models are shown to share the same Wasserstein distance based optimal rates of convergence for the space of mixing distributions: $n^{-1/2}$ under $W_1$ for the exact-fitted and $n^{-1/4}$ under $W_2$ for the over-fitted setting, where $n$ is the sample size. This theory, however, is not applicable to several important model classes, including location-scale multivariate Gaussian mixtures, shape-scale Gamma mixtures and location-scale-shape skew-normal mixtures. The second part of this work is devoted to demonstrating that for these "weakly identifiable" classes, algebraic structures of the density family play a fundamental role in determining convergence rates of the model parameters, which display a very rich spectrum of behaviors. For instance, the optimal rate of parameter estimation in an over-fitted location-covariance Gaussian mixture is precisely determined by the order of a solvable system of polynomial equations; these rates deteriorate rapidly as more extra components are added to the model. The established rates for a variety of settings are illustrated by a simulation study.¹

1 Introduction

Mixture models are popular modeling tools for making inference about heterogeneous data [Lindsay, 1995, McLachlan and Basford, 1988]. Under mixture modeling, data are viewed as samples from a collection of unobserved or latent subpopulations, each of which posits its own distribution and associated parameters.
Learning about subpopulation-specific parameters is essential to understanding the underlying heterogeneity. Theoretical issues related to parameter estimation in mixture models, however, remain poorly understood: as noted in a recent textbook [DasGupta, 2008] (pg. 571), "mixture models are riddled with difficulties such as nonidentifiability". Research about parameter identifiability for mixture models goes back to the early work of Teicher [1961, 1963], Yakowitz and Spragins [1968] and others, and continues to attract much interest [Hall and Zhou, 2003, Hall et al., 2005, Elmore et al., 2005, Allman et al., 2009]. To address parameter estimation rates, a natural approach is to study the behavior of mixing distributions that arise in the mixture model. This approach is well-developed in the context of nonparametric deconvolution [Carroll and Hall, 1988, Zhang, 1990, Fan, 1991], but these results are confined to only a specific type of model, the location mixtures. Beyond location mixtures there have been far fewer results. In particular, for finite mixture models, a notable contribution was made by Chen, who proposed a notion of strong identifiability and established the convergence of the mixing distribution for a class of over-fitted finite mixtures [Chen, 1995]. Over-fitted finite mixtures, as opposed to exact-fitted ones, are mixtures that allow extra mixing components in their model specification, when the actual number of mixing components is bounded by a known constant.

¹ This research is supported in part by grants NSF CCF-1115769, NSF OCI-1047871, NSF CAREER DMS-1351362, and NSF CNS-1409303. The authors would like to acknowledge Elisabeth Gassiat, Xuming He, Judith Rousseau, Naisying Wang, Shuheng Zhou and several others for valuable discussions related to this work.
AMS 2000 subject classification: Primary 62F15, 62G05; secondary 62G20.
Keywords and phrases: mixture models, strong identifiability, weak identifiability, Wasserstein distances, minimax bounds, maximum likelihood estimation, system of polynomial equations.
Chen's work, however, was restricted to models that have only a single scalar parameter. This restriction was effectively removed by Nguyen, who showed that Wasserstein distances (cf. [Villani, 2009]) provide a natural source of metrics for deriving rates of convergence of mixing distributions [Nguyen, 2013]. He established rates of convergence of mixing distributions for a number of finite and infinite mixture models with multi-dimensional parameters. Rousseau and Mengersen studied over-fitted mixtures in a Bayesian estimation setting [Rousseau and Mengersen, 2011]. Although they did not focus on mixing distributions per se, they showed that the mixing probabilities associated with extra mixing components vanish at a standard $n^{-1/2}$ rate, subject to a strong identifiability condition on the density class. Finally, we mention a related literature in computer science, which focuses almost exclusively on the analysis of computationally efficient procedures for clustering with exact-fitted Gaussian mixtures (e.g., [Dasgupta, 1999, Belkin and Sinha, 2010, Kalai et al., 2012]).

Due to requirements of strong identifiability, the existing theories described above are applicable to only certain classes of mixture models, typically those that carry a single parameter type. Finite mixture models with multiple varying parameters (location, scale, shape, covariance matrix) are considerably more complex and many do not satisfy such strong identifiability assumptions. They include location-scale mixtures of Gaussians, shape-scale mixtures of Gammas, and location-scale-shape mixtures of skew-normals (also known as skew-Gaussians). A theory for such models remains open.

Setting. The goal of this paper is to establish rates of convergence for parameters of multiple types, including matrix-variate parameters, that arise in a variety of finite mixture models. Assume that each subpopulation is distributed by a density function (with respect to Lebesgue measure on a Euclidean space $\mathcal{X}$) that belongs to a known density class $\{f(x|\theta,\Sigma),\ \theta \in \Theta \subset \mathbb{R}^{d_1},\ \Sigma \in \Omega \subset S_{d_2}^{++},\ x \in \mathcal{X}\}$. Here, $d_1 \geq 1$, $d_2 \geq 0$, and $S_{d_2}^{++}$ is the set of all $d_2 \times d_2$ symmetric positive definite matrices. A finite mixture density with $k$ mixing components can be defined in terms of $f$ and a discrete mixing measure $G = \sum_{i=1}^{k} p_i \delta_{(\theta_i,\Sigma_i)}$ with $k$ support points as follows:

\[ p_G(x) = \int f(x|\theta,\Sigma)\,dG(\theta,\Sigma) = \sum_{i=1}^{k} p_i f(x|\theta_i,\Sigma_i). \]

Examples for $f$ studied in this paper include the location-covariance family (when $d_1 = d_2 \geq 1$) under Gaussian or some elliptical families of distributions, the location-covariance-shape family (when $d_1 > d_2$) under the generalized multivariate Gaussian, skew-Gaussian or the exponentially modified Student's t-distribution, and the location-rate-shape family (when $d_1 = 3$, $d_2 = 0$) under Gamma or other distributions. The combination of location parameter with covariance matrix, shape and rate parameters in mixture modeling enables a rich and more accurate description of heterogeneity, but the interaction among varying parameter types can be complex, resulting in varied identifiability and convergence behaviors. In addition, we shall treat the settings of exact-fitted mixtures and over-fitted mixtures separately, as the latter typically carries more complex behavior than the former.

As shown by Nguyen, the convergence of mixture model parameters can be measured in terms of a Wasserstein distance on the space of mixing measures $G$ [Nguyen, 2013]. Let $G = \sum_{i=1}^{k} p_i \delta_{(\theta_i,\Sigma_i)}$ and $G_0 = \sum_{i=1}^{k_0} p_i^0 \delta_{(\theta_i^0,\Sigma_i^0)}$ be two discrete probability measures on $\Theta \times \Omega$, which is equipped with metric $\rho$. Recall the Wasserstein distance of order $r$, for a given $r \geq 1$:

\[ W_r(G,G_0) = \left( \inf_{q} \sum_{i,j} q_{ij}\, \rho^r\big((\theta_i,\Sigma_i),(\theta_j^0,\Sigma_j^0)\big) \right)^{1/r}, \]

where the infimum is taken over all joint probability distributions $q$ on $[1,\ldots,k] \times [1,\ldots,k_0]$ such that, when expressing $q$ as a $k \times k_0$ matrix, the marginal constraints hold: $\sum_j q_{ij} = p_i$ and $\sum_i q_{ij} = p_j^0$.

Suppose that a sequence of mixing measures $G_n \to G_0$ under the $W_r$ metric at a rate $\omega_n = o(1)$. If all $G_n$ have the same number of atoms $k = k_0$ as that of $G_0$, then the set of atoms of $G_n$ converges to the $k_0$ atoms of $G_0$ at the same rate $\omega_n$ under the $\rho$ metric. If $G_n$ have a varying number $k_n \in [k_0,k]$ of atoms, where $k$ is a fixed upper bound, then a subsequence of $G_n$ can be constructed so that each atom of $G_0$ is a limit point of a certain subset of atoms of $G_n$; the convergence to each such limit also happens at rate $\omega_n$. Some atoms of $G_n$ may have limit points that are not among $G_0$'s atoms; the mass associated with those atoms of $G_n$ must vanish at the generally faster rate $\omega_n^r$.

In order to establish the rates of convergence for the mixing measure $G$, our strategy is to derive sharp bounds which relate the Wasserstein distance of mixing measures $G, G'$ and a distance between corresponding mixture densities $p_G, p_{G'}$, such as the variational distance $V(p_G,p_{G'})$. It is relatively simple to obtain upper bounds for the variational distance of mixture densities ($V$ for short) in terms of Wasserstein distances $W_r(G,G')$ (shorthanded by $W_r$). Establishing (sharp) lower bounds for $V$ in terms of $W_r$ is the main challenge. Such a bound may not hold, due to a possible lack of identifiability of the mixing measures: one may have $p_G = p_{G'}$, so clearly $V = 0$, but $G \neq G'$, so that $W_r \neq 0$.

General theory of strong identifiability. The classical identifiability condition requires that $p_G = p_{G'}$ entails $G = G'$. This amounts to the linear independence of elements $f$ in the density class [Teicher, 1963]. In order to establish quantitative lower bounds on a distance of mixture densities, we introduce several notions of strong identifiability, extending from the definition of Chen [1995] to handle multiple parameter types, including matrix-variate parameters. There are two kinds of strong identifiability. One such notion involves taking the first-order derivatives of the function $f$ with respect to all parameters in the model, and insisting that these quantities be linearly independent in a sense to be precisely defined.
This criterion will be called "strong identifiability in the first order", or simply first-order identifiability. When the second-order derivatives are also involved, we obtain the second-order identifiability criterion. It is worth noting that prior studies on parameter estimation rates tend to center primarily on the second-order identifiability condition or something even stronger [Chen, 1995, Liu and Shao, 2004, Rousseau and Mengersen, 2011, Nguyen, 2013]. We show that for exact-fitted mixtures, the first-order identifiability condition (along with some additional regularity conditions) suffices for obtaining that

\[ V(p_G,p_{G_0}) \gtrsim W_1(G,G_0), \tag{1} \]

when $W_1(G,G_0)$ is sufficiently small. Moreover, for a broad range of density classes, we also have $V \lesssim W_1$, for which we actually obtain $V(p_G,p_{G_0}) \asymp W_1(G,G_0)$. A consequence of this fact is that for any estimation procedure that admits the $n^{-1/2}$ convergence rate for the mixture density under the $V$ distance, the mixture model parameters also converge at the same rate under the Euclidean metric.

Turning to the over-fitted setting, second-order identifiability along with mild regularity conditions would be sufficient for establishing that for any $G$ that has at most $k$ support points, where $k \geq k_0+1$ and $k$ is fixed,

\[ V(p_G,p_{G_0}) \gtrsim W_2^2(G,G_0), \tag{2} \]

when $W_2(G,G_0)$ is sufficiently small. The lower bound $W_2^2(G,G_0)$ is sharp, i.e., we cannot improve the lower bound to $W_1^r$ for any $r < 2$ (notably, $W_2 \geq W_1$). A consequence of this result is that any standard estimation method (such as the MLE) which yields an $n^{-1/2}$ convergence rate for $p_G$ induces a rate of convergence for the mixing measure $G$ equal to the minimax optimal $n^{-1/4}$ under $W_2$. It also follows that the mixing probability mass converges at the $n^{-1/2}$ rate (which recovers the result of Rousseau and Mengersen [2011]), in addition to showing that the component parameters converge at the $n^{-1/4}$ rate.
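The step from inequalities (1) and (2) to parameter rates can be spelled out in one display. This is only a sketch under assumptions not stated in the passage: $\widehat{G}_n$ is a hypothetical estimator whose mixture density satisfies $V(p_{\widehat{G}_n}, p_{G_0}) \lesssim n^{-1/2}$ (in probability, ignoring logarithmic factors), and $\widehat{G}_n$ is already close enough to $G_0$ for the bounds to apply.

```latex
\begin{align*}
\text{exact-fitted, by (1):}\quad
  & W_1(\widehat{G}_n, G_0) \;\lesssim\; V(p_{\widehat{G}_n}, p_{G_0}) \;\lesssim\; n^{-1/2}, \\
\text{over-fitted, by (2):}\quad
  & W_2^2(\widehat{G}_n, G_0) \;\lesssim\; V(p_{\widehat{G}_n}, p_{G_0}) \;\lesssim\; n^{-1/2}
  \;\Longrightarrow\;
  W_2(\widehat{G}_n, G_0) \;\lesssim\; n^{-1/4}.
\end{align*}
```

The square on $W_2$ is what halves the exponent. Read as arithmetic, halving the $W_1$ error in the exact-fitted setting takes about 4 times as much data, while halving the $W_2$ error in the over-fitted setting takes about 16 times as much.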
We also show that there is a range of mixture models with varying parameters of multiple types that satisfy the developed strong identifiability criteria. All such models exhibit the same kind of rate for parameter estimation. In particular, the second-order identifiability criterion (thus the first-order identifiability) is satisfied by many density families $f$, including the multivariate Student's t-distribution and the exponentially modified multivariate Student's t-distribution. Second-order identifiability also holds for several mixture models with multiple types of (scalar) parameters. These results are presented in Section 3.2. The proofs of these characterization theorems are rather technical, but one useful insight one can draw from them is that the strong identifiability condition (in either the first or the second order) is essentially determined by the smoothness of the kernel density in question (which can be expressed in terms of how fast the corresponding characteristic function vanishes toward infinity).

Theory for weakly identifiable classes. We hasten to point out that many common density classes do not satisfy either or both strong identifiability criteria. The Gamma family of distributions (with both shape and scale parameters varying) is not identifiable in the first order. Neither is the family of skew-Gaussian distributions [Azzalini and Capitanio, 1999, Azzalini and Valle, 1996]. Convergence behavior for the mixture parameters of these two families is unknown, in both exact-fitted and over-fitted settings. The ubiquitous Gaussian family, when both location and scale/covariance parameters vary, is identifiable in the first order, but not in the second order. So, the general theory described above can be applied to analyze exact-fitted Gaussian mixtures, but not over-fitted Gaussian mixtures. It turns out that these classes of mixture models require a separate and novel treatment.
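Throughout this work, we shall call such density families "weakly identifiable classes", i.e., those that are identifiable in the classical sense, but not in the sense of strong identifiability taken in either the first or second order.

Weak identifiability leads to an extremely rich (and previously unreported) spectrum of convergence behavior. It is no longer possible to establish inequalities (1) and (2), because they do not hold in general. Instead, we shall be able to establish sharp bounds of the type $V \gtrsim W_r^r$ for some precise value of $r$, which depends on the specific class of density in consideration. This entails minimax optimal but non-standard rates of convergence for mixture model parameters. In our theory for these weakly identifiable classes, the algebraic structure of the density $f$, not merely its smoothness, will now play the fundamental role in determining the rates.

Gaussian mixtures: We will first discuss the Gaussian family of densities of the standard form $f(x|\theta,\Sigma)$, where $\theta \in \mathbb{R}^d$ and $\Sigma \in S_d^{++}$ are mean and covariance parameters, respectively. The lack of strong identifiability in the second order is due to the following identity:

\[ \frac{\partial^2 f}{\partial \theta^2}(x|\theta,\Sigma) = 2\,\frac{\partial f}{\partial \Sigma}(x|\theta,\Sigma), \]

which entails that the derivatives of $f$ taken with respect to the parameters up to the second order are not linearly independent. Moreover, this algebraic structure plays the fundamental role in our proof for the following inequality:

\[ V(p_G,p_{G_0}) \gtrsim W_{\bar r}^{\bar r}(G,G_0), \tag{3} \]

where $\bar r \geq 1$ is defined as the minimum value of $r \geq 1$ such that the following system of polynomial equations

\[ \sum_{j=1}^{k-k_0+1} \; \sum_{n_1+2n_2=\alpha} \frac{c_j^2\, a_j^{n_1}\, b_j^{n_2}}{n_1!\, n_2!} = 0 \quad \text{for all } 1 \leq \alpha \leq r \]

does not have any non-trivial real solution $\{(c_j,a_j,b_j)\}_{j=1}^{k-k_0+1}$. We emphasize that the lower bound in Eq. (3) is sharp, in that it cannot be replaced by $W_1^r$ (or $W_r^r$) for any $r < \bar r$. A consequence of this fact, by invoking standard results from asymptotic statistics, is that the minimax optimal rate of convergence for estimating $G$ is $n^{-1/(2\bar r)}$ under the $W_{\bar r}$ distance metric.

Table 1: Summary of results established in this paper. To be precise, all upper bounds for MLE rates are of the form $(\log n/n)^{\gamma}$, but the logarithmic term is removed in the table to avoid clutter. For each density class the entries give, in order: the bound for exact-fitted mixtures, the bound for over-fitted mixtures, the MLE rate for $G$ from an n-iid sample, and the minimax lower bound for $G$.

(I) First-order identifiable. Density classes: generalized Gaussian, Student's t, ...
    Exact-fitted mixtures: $V \gtrsim W_1$.
    MLE rate: exact-fit, $W_1 \lesssim n^{-1/2}$.
    Minimax lower bound: exact-fit, $W_1 \gtrsim n^{-1/2}$.

(II) Second-order identifiable. Density classes: Student's t, exponentially modified Student's t, ...
    Exact-fitted mixtures: same as (I).
    Over-fitted mixtures: $V \gtrsim W_2^2$.
    MLE rate: exact-fit, same as (I); over-fit, $W_2 \lesssim n^{-1/4}$.
    Minimax lower bound: exact-fit, same as (I); over-fit, $W_1 \gtrsim n^{-1/4}$.

Not second-order identifiable. Density class: location-scale multivariate Gaussian.
    Exact-fitted mixtures: same as (I).
    Over-fitted mixtures: $V \gtrsim W_{\bar r}^{\bar r}$, with $\bar r$ depending on $k-k_0$. If $k-k_0 = 1$, $\bar r = 4$. If $k-k_0 = 2$, $\bar r = 6$.
    MLE rate: exact-fit, same as (I); over-fit, $W_{\bar r} \lesssim n^{-1/(2\bar r)}$.
    Minimax lower bound: exact-fit, same as (I); over-fit, $W_1 \gtrsim n^{-1/(2\bar r)}$.

Not first-order identifiable. Density class: Gamma distribution.
    Generic case. Exact-fitted: $V \gtrsim W_1$. Over-fitted: $V \gtrsim W_2^2$. MLE rate: $W_1 \lesssim n^{-1/2}$ or $W_2 \lesssim n^{-1/4}$. Minimax lower bound: $W_1 \gtrsim n^{-1/2}$, $W_2 \gtrsim n^{-1/4}$.
    Pathological case. Exact-fitted: $V \not\gtrsim W_r^r$ for any $r \geq 1$. Over-fitted: $V \not\gtrsim W_r^r$ for any $r \geq 1$. MLE rate: unknown. Minimax lower bound: logarithmic, i.e., $W_r \gtrsim n^{-1/r}$ for all $r \geq 1$.

Not first-order identifiable. Density class: location-exponential distribution.
    Exact-fitted: $V \not\gtrsim W_1^r$ for all $r \geq 1$. Over-fitted: $V \not\gtrsim W_1^r$ for all $r \geq 1$. MLE rate: unknown. Minimax lower bound: logarithmic, $W_1 \gtrsim n^{-1/r}$ for all $r \geq 1$.

Not first-order identifiable. Density class: skew-Gaussian distribution.
    Generic case. Exact-fitted: $V \gtrsim W_1$. Over-fitted: $V \gtrsim W_{\bar m}^{\bar m}$, where $\bar m = \bar r$ or $\bar r + 1$. MLE rate (exact-fit): $W_1 \lesssim n^{-1/2}$. Minimax lower bound (exact-fit): $W_1 \gtrsim n^{-1/2}$.
    Pathological conformant case. Exact-fitted: $V \gtrsim W_2^2$. Over-fitted: unknown. MLE rate (exact-fit): $W_2 \lesssim n^{-1/4}$. Minimax lower bound (exact-fit): $W_2 \gtrsim n^{-1/4}$.
    Pathological non-conformant case. Exact-fitted: $V \gtrsim W_{\bar s}^{\bar s}$ for some $\bar s$. Over-fitted: unknown. MLE rate (exact-fit): $W_{\bar s} \lesssim n^{-1/(2\bar s)}$. Minimax lower bound (exact-fit): $W_3 \gtrsim n^{-1/6}$, or $W_4 \gtrsim n^{-1/8}$, or $W_5 \gtrsim n^{-1/10}$, or ...
    Otherwise. Exact-fitted: $V \not\gtrsim W_1^r$ for any $r \geq 1$. Over-fitted: unknown. MLE rate (exact-fit): unknown. Minimax lower bound (exact-fit): logarithmic.
    Over-fit. MLE rate: $n^{-1/(2\bar m)}$ or unknown. Minimax lower bound: unknown.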
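As a small sanity check on the system of polynomial equations, the sketch below evaluates its left-hand side exactly for one hand-picked candidate with $k - k_0 = 1$ (two triples). The function name `system_lhs` and the candidate values are illustrative choices, not taken from the paper, and "non-trivial" is assumed here to mean that every $c_j$ is nonzero and at least one $a_j$ is nonzero. The candidate satisfies the equations for $\alpha = 1, 2, 3$ but not for $\alpha = 4$, which is consistent with (though in no way a proof of) the value $\bar r = 4$ listed in Table 1 for one extra component.

```python
from fractions import Fraction
from math import factorial


def system_lhs(triples, alpha):
    """Left-hand side of the alpha-th equation, summed over triples (c_j, a_j, b_j).

    The inner sum runs over all non-negative integers (n1, n2) with n1 + 2*n2 == alpha.
    """
    total = Fraction(0)
    for c, a, b in triples:
        for n2 in range(alpha // 2 + 1):
            n1 = alpha - 2 * n2
            term = Fraction(c) ** 2 * Fraction(a) ** n1 * Fraction(b) ** n2
            total += term / (factorial(n1) * factorial(n2))
    return total


# Hypothetical candidate for k - k0 = 1, i.e. two triples (c_j, a_j, b_j).
candidate = [(1, 1, Fraction(-1, 2)), (1, -1, Fraction(-1, 2))]

# Exact residuals of equations alpha = 1, 2, 3, 4 at the candidate.
residuals = [system_lhs(candidate, alpha) for alpha in range(1, 5)]
print(residuals)
```

Exact rational arithmetic is used so that a zero residual really is zero rather than a rounding artefact. Showing that no non-trivial solution exists at all for a given order is a different and harder question, which the text addresses with Groebner bases.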
The authors find this correspondence quite striking: it links precisely the minimax optimal estimation rate of mixing measures arising from an over-fitted Gaussian mixture to the solvability of an explicit system of polynomial equations.

Determining the solvability of a system of polynomial equations is a basic question in (computational) algebraic geometry. For the system described above, there does not seem to be an obvious answer to the general value of $\bar r$. Since the number of variables in this system is $3(k-k_0+1)$, one expects that $\bar r$ keeps increasing as $k-k_0$ increases. In fact, using a standard method of Groebner bases [Buchberger, 1965], we can show that for $k-k_0 = 1$ and $2$, $\bar r = 4$ and $6$, respectively. In addition, if $k-k_0 \geq 3$, then $\bar r \geq 7$. Thus, the convergence rate of the mixing measure for over-fitted Gaussian mixtures deteriorates very quickly as more extra components are included in the model.

Gamma mixtures: We shall now briefly describe several other model classes studied in this paper. Gamma densities represent one such class: the Gamma density $f(x|a,b)$ has two positive parameters, $a$ for shape and $b$ for rate. This family is not identifiable in the first order. The lack of identifiability boils down to the fundamental identity (10). By exploiting this identity, we can show that there are particular combinations of the true parameter values which prevent the Gamma class from enjoying strong convergence properties. By excluding the measure-zero set of pathological cases of true mixing measures, the Gamma density class in fact can be shown to be strongly identifiable in both orders. Thus, this class is almost strongly identifiable, using the terminology of Allman et al. [2009]. The generic/pathological dichotomy in the convergence behavior within the Gamma class is quite interesting: in the measure-one generic set of true mixing measures, the mixing measure can be estimated at the standard rate (i.e., $n^{-1/2}$ under $W_1$ for exact-fitted and $n^{-1/4}$ under $W_2$ for over-fitted mixtures).
The pathological cases are not so forgiving: even for exact-fitted mixtures, one can do no better than a logarithmic rate of convergence.

Location-exponential mixtures: Lest some wonder whether this unusually slow rate for the exact-fitted mixture setting can happen only in the measurably negligible (pathological) cases, we also introduce a location-extension of the Gamma family, the location-exponential class: $f(x|\theta,\sigma) := \frac{1}{\sigma}\exp\left(-\frac{x-\theta}{\sigma}\right) 1(x > \theta)$. We show that the minimax lower bound for estimating the mixing measure in an exact-fitted mixture of location-exponentials is no faster than a logarithmic rate.

Skew-Gaussian mixtures: The most fascinating example among those studied is perhaps the skew-Gaussian distributions. This density class generalizes the Gaussian distributions by having an extra parameter, shape, which controls density skewness. The skew-Gaussian family exhibits an extremely broad spectrum of behavior, some of which is shared with the Gamma family, some with the Gaussian, but this family is really in a league of its own. It is not identifiable in the first order, for a reason that is somewhat similar to that of the Gamma family described above. As a consequence, one can construct a full-measure set of generic cases for the true mixing measures on which the exact-fitted mixture model admits strong identifiability and the standard convergence rate (as in the general theory). Within the seemingly benign setting of exact-fitted mixtures, the pathological cases for the skew-Gaussian carry a very rich structure, resulting in a variety of behaviors: for some subset of true mixing measures, the convergence rate is tied to the solvability of a certain system of polynomial equations; for some other subset, the convergence is poor, and the rate can be logarithmic at best.

Turning to over-fitted mixtures of skew-Gaussian distributions, unfortunately our theory remains incomplete.
The culprit lies in the fundamental identity (13), which shows that the first and second order derivatives of the skew-Gaussian densities are dependent in a nonlinear manner. This is in contrast to the linear dependence that characterizes Gaussian and Gamma densities. Thus, the method of proof that works well for the previous examples is no longer adequate, and the rates obtained are probably not optimal.

Key proof ideas. We now provide a brief description of our method of proofs for the results obtained in this paper, a summary of which is given in Table 1. There are two different theories: a general theory for the strongly identifiable classes and a specialized theory for weakly identifiable classes. Within each model class, the key technical objective is the same: to derive sharp inequalities of the form $V(p_G,p_{G_0}) \gtrsim W_r^r(G,G_0)$, where sharpness is expressed in the choice of $r$.

For strongly identifiable classes, either in the first or the second order, the starting point of our proof is an application of Taylor expansion on the mixture density difference $p_{G_n} - p_{G_0}$, where $G_n$ represents a sequence of mixing measures that tend to $G_0$ in Wasserstein distance $W_r$, where $r = 1$ or $2$, the assumed order of strong identifiability. The main part of the proof involves trying to force all the Taylor coefficients in the Taylor expansion to vanish according to the converging sequence of $G_n$. If that is proved to be impossible, then one can arrive at a bound of the form $V \gtrsim W_r^r$. Thus, our proof technique is similar to that of Nguyen [2013]. To show that the derived inequalities are sharp, we resort to careful constructions of a "worst-case" sequence of $G_n$.

For weakly identifiable classes, the Taylor expansion technique continues to provide the proof's backbone, but the key issue now is determining the "correct" order up to which the Taylor expansion is exercised. Since high-order derivatives of the density $f$ are no longer independent, the dependence has to be taken into account before one can fall back on a similar technique afforded by the general theory described above.
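The kind of dependence among derivatives referred to here can be seen concretely in the univariate Gaussian case, where the identity $\partial^2 f/\partial\theta^2 = 2\,\partial f/\partial\Sigma$ relates the second derivative in the mean to the first derivative in the variance. The sketch below checks this numerically by finite differences. The function names, the step sizes and the parametrisation by the variance `v` are assumptions made for illustration only.

```python
import math


def gauss(x, theta, v):
    """Univariate Gaussian density with mean theta and variance v."""
    return math.exp(-(x - theta) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)


def d2_dtheta2(x, theta, v, h=1e-4):
    """Central second difference of the density in the mean parameter."""
    return (gauss(x, theta + h, v) - 2.0 * gauss(x, theta, v)
            + gauss(x, theta - h, v)) / h ** 2


def d_dv(x, theta, v, h=1e-6):
    """Central first difference of the density in the variance parameter."""
    return (gauss(x, theta, v + h) - gauss(x, theta, v - h)) / (2.0 * h)


def identity_gap(x, theta, v):
    """Absolute difference between the two sides of d2f/dtheta2 = 2 * df/dv."""
    return abs(d2_dtheta2(x, theta, v) - 2.0 * d_dv(x, theta, v))


# Largest gap over a few evaluation points; small up to finite-difference error.
max_gap = max(identity_gap(x, 0.3, 1.7) for x in (-2.0, -0.5, 0.0, 1.0, 3.0))
print(max_gap)
```

Because the two derivative fields coincide up to a constant factor, a second-order Taylor expansion in $(\theta, \Sigma)$ contains fewer linearly independent terms than a naive count of the derivatives would suggest.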
If the high-order derivatives are linearly dependent, as is the case for Gaussian densities, it is possible to reduce the original Taylor expansion in terms of only a subset of such derivative quantities that are linearly independent. This reduction process paves the way for a system of polynomial equations to emerge. It follows then that the right exponent $r$ in the desired bound described above can be linked to the order of such a system which admits a non-trivial solution.

Practical implications. Problematic convergence behaviors exhibited by widely utilized models such as Gaussian mixtures may have long been observed in practice, but to our knowledge, most of the obtained convergence rates are established for the first time in this paper, particularly those of weakly identifiable classes. The results established for the popular Gaussian class present a formal reminder about the limitation of Gaussian mixtures when it comes to assessing the quality of parameter estimation, but only when the number of mixing components is unknown. Since a tendency in practice is to "over-fit" the mixture generously with many more extra mixing components, our theory warns against this practice, because the convergence rate for subpopulation-specific parameters deteriorates rapidly with the number of redundant components. In particular, we expect that the value $\bar r$ in the rate $n^{-1/(2\bar r)}$ tends to infinity as the number of redundant Gaussian components increases to infinity. To complete the spectrum of rates, we note the logarithmic rate $(\log n)^{-1/2}$ of convergence of the mixing measure in infinite Gaussian location mixtures, via a Bayes estimate [Nguyen, 2013] or kernel-based deconvolution [Caillerie et al., 2011].

For Gamma and skew-Gaussian mixtures (for applications, see, e.g., [Ghosal and Roy, 2011, Lee and McLachlan, 2013, Wiper et al., 2001]), our theory paints a wide spectrum of convergence behaviors within each model class.
We hope that the theoretical results obtained here may hint at practically useful ways of determining benign scenarios in which the mixture models enjoy strong identifiability properties and favorable convergence rates, and of identifying pathological scenarios that practitioners would do well to avoid.

Paper organization. The rest of the paper is organized as follows. Section 2 provides some preliminary background and facts. Section 3 presents a general theory of strong identifiability, by addressing the exact-fitted and over-fitted settings separately before providing a characterization of density classes for which the general theory is applicable. Section 4 is devoted to a theory for weakly identifiable classes, by treating each of the three described density classes separately. Section 5.1 contains easy consequences of the theory developed earlier; this includes minimax bounds and the convergence rates of maximum likelihood estimation, which are optimal in many cases. The theoretical bounds are illustrated via simulations in Section 5.2. Self-contained proofs of representative theorems are given in Section 6, while proofs of the remaining results are presented in the Appendix.

Notation. Divergence distances studied in this paper include the total variational distance $V(p_G,p_{G'}) = \frac{1}{2}\int |p_G(x) - p_{G'}(x)|\,d\mu(x)$ and the Hellinger distance $h^2(p_G,p_{G'}) = \frac{1}{2}\int \left(\sqrt{p_G(x)} - \sqrt{p_{G'}(x)}\right)^2 d\mu(x)$. For $K, L \in \mathbb{N}$, the first derivative of a real function $g : \mathbb{R}^{K \times L} \to \mathbb{R}$ of a matrix $\Sigma$ is defined as a $K \times L$ matrix whose $(i,j)$ element is $\partial g/\partial \Sigma_{ij}$. The second derivative of $g$, denoted by $\frac{\partial^2 g}{\partial \Sigma^2}$, is a $K^2 \times L^2$ matrix made of $KL$ blocks of $K \times L$ matrices, whose $(i,j)$-block is given by $\frac{\partial}{\partial \Sigma}\left(\frac{\partial g}{\partial \Sigma_{ij}}\right)$. Additionally, for $N \in \mathbb{N}$ and a function $g_2 : \mathbb{R}^N \times \mathbb{R}^{K \times L} \to \mathbb{R}$ defined on $(\theta,\Sigma)$, the joint derivative between the vector component and the matrix component, $\frac{\partial^2 g_2}{\partial \theta \partial \Sigma} = \frac{\partial^2 g_2}{\partial \Sigma \partial \theta}$, is a $(KN) \times L$ matrix of $KL$ blocks for $N$-columns, whose $(i,j)$-block is given by $\frac{\partial}{\partial \theta}\left(\frac{\partial g_2}{\partial \Sigma_{ij}}\right)$. Finally, for any symmetric matrix $\Sigma \in \mathbb{R}^{d \times d}$, $\lambda_1(\Sigma)$ and $\lambda_d(\Sigma)$ respectively denote its smallest and largest eigenvalue.

2 Preliminaries

First of all, we need to define our notion of distances on the space of mixing measures $G$. In this paper, we restrict ourselves to the space of discrete mixing measures with exactly $k_0$ distinct support points on $\Theta \times \Omega$, which is denoted by $\mathcal{E}_{k_0}(\Theta \times \Omega)$, and the space of discrete mixing measures with at most $k$ distinct support points on $\Theta \times \Omega$, which is denoted by $\mathcal{O}_k(\Theta \times \Omega)$. In addition, let $\mathcal{G}(\Theta \times \Omega) = \cup_{k \in \mathbb{N}} \mathcal{E}_k(\Theta \times \Omega)$ be the set of all discrete measures with finite support points. Consider a mixing measure $G = \sum_{i=1}^{k} p_i \delta_{(\theta_i,\Sigma_i)}$, where $p = (p_1,p_2,\ldots,p_k)$ denotes the proportion vector and $(\theta,\Sigma) = ((\theta_1,\Sigma_1),\ldots,(\theta_k,\Sigma_k))$ denotes the supporting atoms in $\Theta \times \Omega$. Likewise, let $G' = \sum_{i=1}^{k'} p_i' \delta_{(\theta_i',\Sigma_i')}$. A coupling between $p$ and $p'$ is a joint distribution $q$ on $[1,\ldots,k] \times [1,\ldots,k']$, which is expressed as a $k \times k'$ matrix $q = (q_{ij})_{1 \leq i \leq k,\, 1 \leq j \leq k'} \in [0,1]^{k \times k'}$ and admits the marginal constraints $\sum_{i=1}^{k} q_{ij} = p_j'$ and $\sum_{j=1}^{k'} q_{ij} = p_i$ for any $i = 1,2,\ldots,k$ and $j = 1,2,\ldots,k'$. We call $q$ a coupling of $p$ and $p'$, and use $\mathcal{Q}(p,p')$ to denote the space of all such couplings.

As in Nguyen [2013], our tool for analyzing the identifiability and convergence of parameters in a mixture model is the Wasserstein distance, which can be defined as the optimal cost of moving mass from one probability measure to another [Villani, 2009]. For any $r \geq 1$, the $r$-th order Wasserstein distance between $G$ and $G'$ is given by

\[ W_r(G,G') = \left( \inf_{q \in \mathcal{Q}(p,p')} \sum_{i,j} q_{ij} \left( \|\theta_i - \theta_j'\| + \|\Sigma_i - \Sigma_j'\| \right)^r \right)^{1/r}. \]

In the above display, $\|\cdot\|$ denotes either the $\ell_2$ norm for elements in $\mathbb{R}^d$ or the entrywise $\ell_2$ norm for matrices. A central theme of the paper is the relationship between the Wasserstein distances of mixing measures $G, G'$ and distances of the corresponding mixture densities $p_G, p_{G'}$.
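For intuition, the Wasserstein distance just defined can be computed directly in the simplest case. The sketch below assumes scalar atoms (a location parameter only, with no covariance component), so that the ground metric is the absolute difference. On the real line the cost $|x - y|^r$ with $r \geq 1$ is minimised by the monotone coupling that matches mass in sorted order, so no linear program is needed. The function name and the two example measures are hypothetical.

```python
def wasserstein_1d(atoms_g, weights_g, atoms_h, weights_h, r=1.0):
    """W_r between two discrete probability measures with scalar atoms.

    Uses the monotone (sorted-order) coupling, which is optimal on the real
    line for the cost |x - y|**r when r >= 1.
    """
    g = sorted(zip(atoms_g, weights_g))
    h = sorted(zip(atoms_h, weights_h))
    i = j = 0
    mass_g, mass_h = g[0][1], h[0][1]
    cost = 0.0
    tol = 1e-12  # treat leftover mass below this as exhausted
    while i < len(g) and j < len(h):
        moved = min(mass_g, mass_h)  # mass shipped from atom i of G to atom j of H
        cost += moved * abs(g[i][0] - h[j][0]) ** r
        mass_g -= moved
        mass_h -= moved
        if mass_g <= tol:
            i += 1
            if i < len(g):
                mass_g = g[i][1]
        if mass_h <= tol:
            j += 1
            if j < len(h):
                mass_h = h[j][1]
    return cost ** (1.0 / r)


# A two-atom measure G0 and an over-fitted three-atom measure G near it.
g0_atoms, g0_weights = [0.0, 2.0], [0.5, 0.5]
g_atoms, g_weights = [0.0, 1.9, 2.1], [0.5, 0.25, 0.25]

w1 = wasserstein_1d(g_atoms, g_weights, g0_atoms, g0_weights, r=1)
w2 = wasserstein_1d(g_atoms, g_weights, g0_atoms, g0_weights, r=2)
print(w1, w2)
```

In this example the atom of $G_0$ at 2 is split into two nearby atoms of $G$, which is the typical over-fitted configuration. The general matrix-variate case needs a transportation solver over the full cost matrix and is not covered by this sketch.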
Recall that the mixture density $p_G$ is obtained by combining a mixing measure $G \in \mathcal{G}(\Theta \times \Omega)$ with a family of density functions $\{f(x|\theta,\Sigma),\ \theta \in \Theta,\ \Sigma \in \Omega\}$:

\[ p_G(x) = \int f(x|\theta,\Sigma)\,dG(\theta,\Sigma) = \sum_{i=1}^{k} p_i f(x|\theta_i,\Sigma_i). \]

Clearly if $G = G'$ then $p_G = p_{G'}$. Intuitively, if $W_1(G,G')$ or $W_2(G,G')$ is small, so is a distance between $p_G$ and $p_{G'}$. This can be quantified by establishing an upper bound for the distance of $p_G$ and $p_{G'}$ in terms of $W_1(G,G')$ or $W_2(G,G')$. A general notion of distance between probability densities defined on a common space is the f-divergence (or Ali-Silvey distance) [Ali and Silvey, 1966]: an f-divergence between two probability density functions $f$ and $g$ is defined as $\rho_\phi(f,g) = \int \phi\left(\frac{g}{f}\right) f\,d\mu$, where $\phi : \mathbb{R} \to \mathbb{R}$ is a convex function. Similarly, the f-divergence between $p_G$ and $p_{G'}$ is $\rho_\phi(p_G,p_{G'}) = \int \phi\left(\frac{p_{G'}}{p_G}\right) p_G\,d\mu$. With $\phi(x) = \frac{1}{2}(\sqrt{x} - 1)^2$, we obtain the squared Hellinger distance ($\rho_h^2 \equiv h^2$). With $\phi(x) = \frac{1}{2}|x - 1|$, we obtain the variational distance ($\rho_V \equiv V$).

A simple way of establishing an upper bound for an f-divergence between $p_G$ and $p_{G'}$ is via the "composite transportation distance" between mixing measures $G, G'$:

\[ d_{\rho_\phi}(G,G') = \inf_{q \in \mathcal{Q}(p,p')} \sum_{i,j} q_{ij}\, \rho_\phi(f_i,f_j'), \]

where $f_i = f(x|\theta_i,\Sigma_i)$ and $f_j' = f(x|\theta_j',\Sigma_j')$ for any $i,j$. The following inequality regarding the relationship between $\rho_\phi(p_G,p_{G'})$ and $d_{\rho_\phi}(G,G')$ is a simple consequence of Jensen's inequality [Nguyen, 2013]:

\[ \rho_\phi(p_G,p_{G'}) \leq d_{\rho_\phi}(G,G'). \]

It is straightforward to derive upper bounds for $d_{\rho_\phi}(G,G')$ in terms of Wasserstein distances $W_r$, by taking into account specific structures of the density family $f$, and then to combine them with the inequality in the previous display to arrive at upper bounds for $\rho_\phi(p_G,p_{G'})$ in terms of Wasserstein distances. Here are a few examples.

Example 2.1. (Multivariate generalized Gaussian distribution [Zhang et al., 2013])
The density family $f$ takes the form $f(x|\theta,m,\Sigma) = \frac{m\Gamma(d/2)}{\pi^{d/2}\Gamma(d/(2m))|\Sigma|^{1/2}} \exp\left(-\left((x-\theta)^T \Sigma^{-1}(x-\theta)\right)^m\right)$, where $\theta \in \mathbb{R}^d$, $m > 0$, and $\Sigma \in S_d^{++}$.
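If $\Theta_1$ is a bounded subset of $\mathbb{R}^d$, $\Theta_2 = \{m \in \mathbb{R}^+ : 1 \leq \underline{m} \leq m \leq \overline{m}\}$, and $\Omega = \{\Sigma \in S_d^{++} : \underline{\lambda} \leq \sqrt{\lambda_1(\Sigma)} \leq \sqrt{\lambda_d(\Sigma)} \leq \overline{\lambda}\}$, where $\underline{\lambda}, \overline{\lambda} > 0$, then for any $G_1, G_2 \in \mathcal{G}(\Theta_1 \times \Theta_2 \times \Omega)$, we obtain $h^2(p_{G_1},p_{G_2}) \lesssim W_2^2(G_1,G_2)$ and $V(p_{G_1},p_{G_2}) \lesssim W_1(G_1,G_2)$.

Example 2.2. (Multivariate Student's t-distribution)
The density family $f$ takes the form $f(x|\theta,\Sigma) = C_\nu \left(\nu + (x-\theta)^T \Sigma^{-1}(x-\theta)\right)^{-(\nu+d)/2}$, where $\nu$ is a fixed positive degree of freedom and $C_\nu = \frac{\Gamma((\nu+d)/2)\,\nu^{\nu/2}}{\Gamma(\nu/2)\,\pi^{d/2}}$. If $\Theta$ is a bounded subset of $\mathbb{R}^d$ and $\Omega = \{\Sigma \in S_d^{++} : \underline{\lambda} \leq \sqrt{\lambda_1(\Sigma)} \leq \sqrt{\lambda_d(\Sigma)} \leq \overline{\lambda}\}$, then for any $G_1, G_2 \in \mathcal{G}(\Theta \times \Omega)$, we obtain $h^2(p_{G_1},p_{G_2}) \lesssim W_2^2(G_1,G_2)$ and $V(p_{G_1},p_{G_2}) \lesssim W_1(G_1,G_2)$.

Example 2.3. (Exponentially modified multivariate Student's t-distribution)
Let $f(x|\theta,\lambda,\Sigma)$ be the density function of $X = Y + Z$, where $Y$ follows a multivariate t-distribution with location $\theta$, covariance matrix $\Sigma$ and fixed positive degree of freedom $\nu$, and $Z$ is distributed by the product of $d$ independent exponential distributions with combined shape $\lambda = (\lambda_1,\ldots,\lambda_d)$. If $\Theta$ is a bounded subset of $\mathbb{R}^d \times \mathbb{R}^d_+$, where $\mathbb{R}^d_+ = \{x \in \mathbb{R}^d : x_i > 0\ \forall i\}$, and $\Omega = \{\Sigma \in S_d^{++} : \underline{\lambda} \leq \sqrt{\lambda_1(\Sigma)} \leq \sqrt{\lambda_d(\Sigma)} \leq \overline{\lambda}\}$, then for any $G_1, G_2 \in \mathcal{G}(\Theta \times \Omega)$, $h^2(p_{G_1},p_{G_2}) \lesssim W_2^2(G_1,G_2)$ and $V(p_{G_1},p_{G_2}) \lesssim W_1(G_1,G_2)$.

Example 2.4. (Modified Gaussian-Gamma distribution)
Let $f(x|\theta,\alpha,\beta,\Sigma)$ be the density function of $X = Y + Z$, where $Y$ is distributed by a multivariate Gaussian distribution with mean $\theta$ and covariance matrix $\Sigma$, and $Z$ is distributed by the product of independent Gamma distributions with combined shape vector $\alpha = (\alpha_1,\ldots,\alpha_d)$ and combined rate vector $\beta = (\beta_1,\ldots,\beta_d)$. If $\Theta$ is a bounded subset of $\mathbb{R}^d \times \mathbb{R}^d_+ \times \mathbb{R}^d_+$ and $\Omega = \{\Sigma \in S_d^{++} : \underline{\lambda} \leq \sqrt{\lambda_1(\Sigma)} \leq \sqrt{\lambda_d(\Sigma)} \leq \overline{\lambda}\}$, then for any $G_1, G_2 \in \mathcal{G}(\Theta \times \Omega)$, $h^2(p_{G_1},p_{G_2}) \lesssim V(p_{G_1},p_{G_2}) \lesssim W_1(G_1,G_2)$.

3 General theory of strong identifiability

The objective of this section is to develop a general theory according to which a small distance between mixture densities $p_G$ and $p_{G'}$ entails a small Wasserstein distance between mixing measures $G$ and $G'$.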
The classical identifiability criterion requires that $p_G = p_{G'}$ entail $G = G'$, which is essentially equivalent to a linear independence requirement for the density family $\{f(x|\theta,\Sigma) \mid \theta \in \Theta,\ \Sigma \in \Omega\}$. To obtain quantitative bounds, we need stronger notions of identifiability, ones which involve higher order derivatives of the density function $f$, taken with respect to the multivariate and matrix-variate parameters present in the mixture model. The advantage of this theory, which extends from the work of Nguyen [2013] and Chen [1995], is that it holds generally for a broad range of mixture models, which allow for the same bounds on the Wasserstein distances of mixing measures to hold. This in turn leads to "standard" rates of convergence for the mixing measure. On the other hand, many popular mixture models such as the location-covariance Gaussian mixture, mixture of Gamma, and mixture of skew-Gaussian distributions do not submit to the general theory. Instead they require separate and fundamentally distinct treatments; moreover, such models also exhibit non-standard rates of convergence for the mixing measure. Readers interested in results for such models may skip directly to Section 4.

3.1 Definitions and general bounds

Definition 3.1. The family $\{f(x|\theta,\Sigma),\ \theta \in \Theta,\ \Sigma \in \Omega\}$ is identifiable in the first order if $f(x|\theta,\Sigma)$ is differentiable in $(\theta,\Sigma)$ and the following assumption holds:

A1. For any finite $k$ different pairs $(\theta_1,\Sigma_1),\ldots,(\theta_k,\Sigma_k) \in \Theta \times \Omega$, if we have $\alpha_i \in \mathbb{R}$, $\beta_i \in \mathbb{R}^{d_1}$ and
