A Geometric Framework For Density Modeling SutanoyDasgupta 1,DebdeepPati1,andAnujSrivastava1 ∗ 1DepartmentofStatistics,FloridaStateUniversity 7 1 0 2 n a J Abstract 0 2 Weintroduceageometricapproachforestimatingaprobabilitydensityfunction(pdf)givenitssam- ] E ples. The procedure involves obtaining an initial estimate of the pdf and then transforming it via a M warpingfunctiontoreachthefinalestimate. Theinitialestimateisintendedtobecomputationallyfast, . t a albeit suboptimal, but its warping creates a larger, flexible class of density functions, resulting in sub- t s stantially improved estimation. The warping is accomplished by mapping diffeomorphic functions to [ 1 thetangentspaceofaHilbertsphere,avectorspacewhoseelementscanbeexpressedusinganorthog- v 6 onal basis. Using a truncated basis expansion, we estimate the optimal warping and, thus, the optimal 5 6 densityestimate. Thisframeworkisintroducedforunivariate,unconditionalpdfestimationandthenex- 5 tendedtoconditionalpdfestimation. Theapproachavoidsmanyofthecomputationalpitfallsassociated 0 . 1 with current methods without losing on estimation performance. In presence of irrelevant predictors, 0 7 theapproachachievesbothstatisticalandcomputationalefficiencycomparedtoclassicalapproachesfor 1 : conditional density estimation. We derive asymptotic convergence rates of the density estimator and v i X demonstratethisapproachusingsyntheticdatasets,andacasestudytounderstandassociationofatoxic r metaboliteonpretermbirth. a KEYWORDS: conditionaldensity;densityestimation;Hilbertsphere;irrelevantpredictors;sieveestimation; tangentspace;weightedlikelihoodmaximization ∗CorrespondingAuthor:[email protected] 1 1. INTRODUCTION Estimatingaprobabilitydensityfunction(pdf)isanimportantandwellstudiedfieldofresearchinstatis- tics. The most basic problem in this area is that of univariate pdf estimation from iid samples, henceforth referred to as unconditional density estimation. Another problem of utmost importance is conditional den- sityestimation. Hereoneneedstocharacterizethebehavioroftheresponsevariableforeachofthedifferent values of the predictors. A natural extension of the these problems is that of estimating the joint pdf for multiplerandomvariables. Giventheimportanceofpdfestimationinstatisticsandrelateddisciplines, alargenumberofsolutions have been proposed for each of these problems. While the earliest works focused on parametric solutions, thetrendoverthelastthreedecadesistouseanonparametricapproachasitminimizesmakingassumptions about the underlying density (and the relationships between variables for conditional and joint densities). The most common nonparametric techniques are kernel based; refer to Rosenblatt et al. [1956], Hall et al. [1991], Sheather and Jones [1991], Li and Racine [2007] for a narrative of works. Related to these ap- proaches are “tilting” or “data sharpening” techniques for unconditional density estimation, for example Hjort and Glad [1995], Doosti and Hall [2016] and the references therein. Kernel methods are very pow- erful in univariate setting. However, as the number of variables involved gets higher, the methods become computationallyinefficientbecauseofthecomplexitiesinvolvedinbandwidthselection, especiallyincon- ditional density estimation setup. Another common approach [Leonard, 1978, Lenk, 1988, 1991, Tokdar et al., 2010, Tokdar, 2012] for pdf estimation is a two step estimation procedure, where one first estimates an initial pdf, say f , from the data, perhaps restricting to a parametric family. Then, one improves upon p this estimate by considering a function g(x) depending on the initial estimate f and an unknown func- p (cid:82) tion W, to reach a final estimate of the type exp g(x) / exp g(x) dx. In a classical paradigm one can { } x { } chooseexp g(x) asW(x)f (x)whereW(x)isapositivefunction. InaBayesiancontext,thefunctiong p { } is assigned a Gaussian process prior W with mean f . Often the calculation of the normalization constant p makes the computation very cumbersome. On the other hand, two step procedures for conditional density estimation are based on estimating the conditional mean function, and then the conditional density of the residuals [Hansen, 2004]. Over the recent years, Bayesian methods for estimating pdfs based on mixture modelsandlatentvariableshavereceivedalotofattentionbecauseoftheirexcellentpracticalperformances andanincreasinglyrichsetofalgorithmictoolsforposteriorcomputationusingMarkovChainMonteCarlo 2 (MCMC) methods. References include Escobar and West [1995], Mu¨ller et al. [1996], MacEachern and Mu¨ller [1998], Kalli et al. [2011], Jain and Neal [2012], Kundu and Dunson [2014], Bhattacharya et al. [2010] among others. However these results also come at a very high computational cost associated with implementing the MCMC algorithms. Applications of flexible Bayesian models for conditional densities arediscussedinMacEachern[1999],DeIorioetal.[2004],GriffinandSteel[2006],Dunsonetal.[2007], ChungandDunson[2009],NoretsandPelenis[2012], amongothers. Althoughtheliteraturesuggeststhat suchmethodsbasedonmixturemodelshave severalattractiveproperties, theylackinterpretabilityandthe MCMCsolutionsformodelfittingareoverlycomplicatedandexpensive. Inthisarticleageometricallyintuitiveframeworkforatwostepapproachisproposedthatisapplicable tobothconditionalandunconditionaldensityestimation. Themotivationhereistoavoidthecomputational costassociatedwithtraditionaltechniques, whilenotlosingoutonpracticalperformance. Assumethatwe have a strictly positive univariate density f on the interval [0,1]; f serves as an initial estimate as in the p p traditionaltwostepmethods. LetΓbethesetofallpositivediffeomorphismsfrom[0,1]toitself: Γ = γ γ issmooth,γ−1 issmooth,γ˙ > 0,γ(0) = 0,γ(1) = 1 (1) { | } TheelementsofΓplaytheroleofwarpingfunctions,ortransformationsoff . Givenaγ Γ,thetransfor- p ∈ mationoff isdefinedby: p (f ,γ) = (f γ)γ˙ . (2) p p ◦ Henceforth this transformation is referred to as warping and the resulting pdf f as a warped density. This mapping is comprehensive in the sense that it forms a transitive group action on the space of all strictly positive pdfs on [0,1]. This allows one to go from any positive pdf to any other positive pdf using an appropriate γ. Note that since (cid:82)1f (γ(x))γ˙(x)dx = 1, there is no need to normalize this transformation. 0 p However, since Γ is not a vector space, one has to map the elements of Γ, using a nonlinear mapping, to a vectorspacetofacilitateexpansionusinganorthogonalbasis. Thediffeomorphismsastransformationsofa probability density shape have been used in the past, albeit with a different setup and scope [Saoudi et al., 1994, 1997]. Also, the notion of transformation between pdfs has been used in the literature on optimal transport [Tabak and Turner, 2013, Tabak and Trigila, 2014]. The transport is achieved using an iterated compositionofmapsandisthusverydifferentfromestimatingadiffeomorphism. In this article, we develop the framework for estimating an unconditional, univariate pdf defined on 3 [0,1]. This simple setting helps explain and illustrate the main ingredients of the framework. Besides, the proposedgeometricframeworkisnaturallyunivariateinthesensethatthetransformationdefinedin(2)acts on univariate density shapes making it a logical starting point for developments. In this simple setup, the approach delivers excellent performance while avoiding heavy computational cost, and is comparable to standardkernelmethods, evenatverylowsamplesizes. Theframeworkisextendedtounivariatedensities withunknownsupportbyscalingtheobservationdomainto[0,1]. Adefiningcharacteristicofthiswarping transformation is that it takes any initial density estimate to any final desired estimate, as long as they are both strictly positive. Hence the initial estimate can be constructed in anyway – parametric (e.g. gaussian) ornonparametric(e.g. kernelestimate),andisallowedtobeasub-optimalestimateofthetruedensity. The second part of the article focuses on extending the framework to estimation of conditional density f(y x)from (yi,xi) : i = 1,...,n,y R,x Rd,d 1 . Theapproachistostartwithanonparametric | { ∈ ∈ ≥ } meanregressionmodeloftheformy = m(x )+(cid:15) ,(cid:15) (0,σ2),wherem( )isestimatedusingastandard i i i i ∼ N · nonparametricestimator,toobtainaninitialconditionaldensityestimatef (mˆ(x),σˆ2)atthelocation p,x ≡ N x. Then f is warped using a warping function γ into a final conditional density estimate. Importantly, p,x x the set Γ and the warping transformation remain the same as the unconditional density estimation, and do not involve the predictors. Only the choice of the element γ Γ varies with the predictor x. This makes x ∈ the estimation process computationally efficient even in the presence of many predictors. The selection of γ is based on a weighted-likelihood objective function that borrows information from the neighborhood x ofthelocationxatwhichtheconditionaldensityisevaluated. Sincethetransformationoperationdoesnot involvethepredictors,thefinalestimateisnaturallyresistanttotheeffectofpredictorsthatareirrelevantto theresponse. Therestofthispaperisorganizedasfollows. Section2introducestheformalmethodologyandframe- work on a univariate unconditional density estimation setting. Section 3 presents notations and the con- vergence rate of the density estimator. Section 4 focuses on the practical implementation of the method andoutlinestheassociatednumericaltechniques. Illustrationsofunivariatedensityestimationonsimulated datasets are presented in Section 5. In section 6 the theory for conditional density estimation is developed andthepropertiesoftheproposedmethodareillustratedusingsimulateddatasets. Theapplicationofcon- ditional density estimation using the proposed framework on a real dataset is also presented. Section 7 includesadiscussiononhowtheideaspresentedinearliersectionscanbeusedtoperformbivariateuncon- 4 T1(S )= ∞ Let (fp,γ) F theunderlyingtruedensityand as the second step of the estimation procedure. This way the )1 | {qv)=exp−11(q): nonpanraomnpeatrriacm),esturicch),thsuacth.afnFtthuoarptthtaiemnrmaolpoftrFipem,aleltFp cthoemnpqo=sed√oγ˙f,tcwaollesdteiptsssaqsufaorlelo-rwoso.tvFeirlostc,iwtyefuunticlitzioenth(SeRfaVcFt)t,hiastaifpointontheunitHilbert anyparametricfamilywhospelepsacraommeintegrsfrcoamnbeeasilycomputedisagoodcandidatefor sphereS ⊂ ):S ∞ Figure 1: Left: The true pdf f is estimated by transforming an initial estimate f by the warping function t p γ. The larger the set of allowed γs, the better the estimate is. Right: Representing warping function γ as elementofthetangentspaceT1(S+). ∞ ditionaldensityestimation,thechallengesinvolvedinhigherdimensions,andasimplebivariateillustration. Finally,theproofsoftheconvergenceratepresentedinSection3isderivedinanappendix. 2. PROPOSEDFRAMEWORK In this section we develop a two-step framework for estimating univariate, unconditional pdf, and start by F introducingsomenotation. Let bethesetofallstrictlypositive,univariateprobabilitydensityfunctions on [0,1]. Let f F denote the underlying true density and x f , i = 1,2,...,n be independent t i t ∈ ∼ samples from f . Furthermore, let F be a pre-determined subset of F, such that an optimal element t p (defined appropriately) f F is relatively easy to compute. For instance, any parametric family with p p ∈ simple estimation procedure is a good candidate for f . Similarly, kernel density estimates are also good p sincetheyarecomputationallyefficientandrobustinunivariatesetups. Next, we define a warping-based transformation of elements of F , using elements of Γ defined in p (1). Note that Γ is an infinite-dimensional set that has a group structure under composition as the group operation. That is, for any γ ,γ Γ, the composition γ γ Γ. The identity element of Γ is given by 1 2 1 2 ∈ ◦ ∈ γ (t) = t, and for every γ Γ, there is a function γ 1 such that γ γ 1 = γ . For any f F and id − − id p p ∈ ◦ ∈ γ Γ,definethemapping(f ,γ) = (f γ)γ˙ asgivenin(2). Theimportanceofthismappingcomesfrom p p ∈ ◦ thefollowingresult. Proposition1. ThemappingF Γ F,specifiedabove,formsanactionofΓonF. Furthermore,this × → 5 actionistransitive. Proof: We can verify the two properties in the definition of a group action: (1) For any γ ,γ Γ and 1 2 ∈ f F,wehave((f,γ ),γ ) = (((f γ )γ˙ ) γ )γ˙ = (f,γ γ ). (2)Foranyf F,(f,γ ) = f. To 1 2 1 1 2 2 1 2 id ∈ ◦ ◦ ◦ ∈ showtransitivity, weneedtoshowthatgivenanyf ,f F, thereexistsaγ Γ, suchthat(f ,γ) = f . 1 2 1 2 ∈ ∈ If F and F denote the cumulative distribution functions associated with f and f , respectively, then the 1 2 1 2 desiredγ issimplyF1−1 ◦F2. Sincef1 isstrictlypositive,F1−1 iswelldefinedandγ isuniquelyspecified. Furthermore,sincef isstrictlypositive,wehaveγ˙ > 0. (cid:50) 2 Thisresultimpliesthattogetherthepair(f ,γ)spansthefullsetF,ifγ ischosenfreelyfromΓ. One p can reach any f F, from any f F using an appropriate element of Γ. However, if one uses Γ , t p p J ∈ ∈ a finite-dimensional submanifold of Γ, instead of the full Γ, we may not reach the desired f , but only get t F closer. ThisisdepictedpictoriallyintheleftpanelofFigure1wheretheinnerdiskdenotestheset . The p increasingringsaroundF representtheset (f ,γ) f F ,γ Γ foranincreasingsequenceofΓ s. p p p p J J { | ∈ ∈ } AsΓ approachesΓ,thecorrespondingapproximationapproachesf . MoredetailsareincludedinSection J t A.1. 2.1 Finite-DimensionalRepresentationofWarpingFunctions Solving an optimization problem, say maximum-likelihood estimation, over Γ faces two main challenges. First, Γ is a nonlinear manifold, and second, it is infinite-dimensional. We handle the nonlinearity by forming a bijective map from Γ to a tangent space of the unit Hilbert sphere S (the tangent space is a ∞ vector space), and infinite dimensionality by selecting a finite-dimensional subspace of this tangent space. Together, thesetwostepsareequivalenttofindingafamilyoffinite-dimensionalsubmanifoldsΓ thatcan J beflattenedintovectorspaces. Thisallowsforarepresentationofγ usingorthogonalbasis. Oncewehave afinite-dimensionalrepresentationofγ,wecanoptimizeoverthisrepresentationofγ usingthemaximum- likelihoodcriterion. (cid:112) Define a function q : [0,1] R, q(t) = γ˙(t), as the square-root slope function (SRSF) of a γ Γ. → ∈ (ForadiscussiononSRSFsofgeneralfunctions,pleaserefertoChapter4ofSrivastavaandKlassen[2016]). For any γ Γ, its SRSF q is an element of the interior of the positive orthant of the unit Hilbert sphere ∈ S L2, denoted by S+. This is because q 2 = (cid:82)1q(t)2dt = (cid:82)1γ˙(t)dt = γ(1) γ(0) = 1. We ∞ ⊂ ∞ (cid:107) (cid:107) 0 0 − have a positive orthant, boundaries excluded, because by definition, q is a strictly positive function. The 6 mapping between Γ and S+ is a bijection, with its inverse given by γ(t) = (cid:82)tq(s)2ds. The unit Hilbert 0 ∞ sphere is a smooth manifold with known geometry under the L2 Riemannian metric [Lang, 2012]. It is not a vector space but a manifold with a constant curvature, and can be easily flattened into a vector space locally. The chosen vector space is a tangent space of S+. A natural choice for reference, to select the ∞ tangent space, is the point 1 S+, a constant function with value 1, which is the SRSF corresponding to ∈ ∞ γ = γid(t) = t.ThetangentspaceofS+ at1isaninfinite-dimensionalvectorspacegivenby: T1(S+) = ∞ ∞ v L2([0,1],R) (cid:82)1v(t)dt = v,1 = 0 . { ∈ | 0 (cid:104) (cid:105) } Next,wedefineamappingthattakesanarbitraryelementofS+ tothistangentspace. Forthisretraction, ∞ wewillusetheinverseexponentialmapthattakesq S+ toT1(S+)accordingto: ∈ ∞ ∞ θ exp−11(q) : S+∞ → T1(S+∞), v = exp−11(q) = sin(θ)(q−1cos(θ)), (3) whereθ = cos 1( 1,q )isthearc-lengthfromqto1. TherightpanelofFig. 1showsapictorialillustration − (cid:104) (cid:105) ofthemappingfromS+ tothetangentspaceT1(S+). ∞ ∞ WeimposeanaturalHilbertstructureonT1(S+∞)usingthestandardinnerprodu(cid:113)ct: (cid:104)v1,v2(cid:105) = (cid:82)01v1(t)v2(t)dt. Itiseasytocheckthatsinceq ∈ S+∞,θ = cos−1((cid:104)1,q(cid:105)) < π/4,andhence(cid:107)v(cid:107) = (cid:82)01v(t)2dt = θ < π/4, where v = exp−11(q). Thus the range of the inverse exponential map is not the entire T1(S+), but an ∞ open subset T10(S+∞) = {v ∈ T1(S+∞) : (cid:107)v(cid:107) < π/4}. Further, we can select any orthogonal basis = bj,j = 1,2,... of the set T1(S+) to express its elements v by their corresponding coefficients; B { } ∞ (cid:80) that is, v(t) = ∞j=1cjbj(t), where cj = (cid:104)v,bj(cid:105). The only restriction on the basis elements bj’s is that theymustbeorthogonalto1,thatis, b ,1 = 0. Inordertomappointsbackfromthetangentspacetothe j (cid:104) (cid:105) Hilbertsphere,weusetheexponentialmap,givenby: sin( v ) exp(v) : T1(S+) S , exp(v) = cos( v )1+ (cid:107) (cid:107) . (4) ∞ → ∞ (cid:107) (cid:107) v (cid:107) (cid:107) If we restrict the domain of the exponential map to the subset T0(S+), then the range of this map is S+. 1 ∞ ∞ Usingthesetwosteps,wespecifythefinite-dimensional(approximate)representationofdiffeomorphisms. WedefineacompositemapH : Γ RJ,pictoricallyillustratedinFigure2,as → γ ∈ Γ −S−R−S−→F q = (cid:112)γ˙ ∈ S+∞ −e−x−p−−1→1 v ∈ T10(S+∞) −{−bj→} {cj = (cid:104)v,bj(cid:105)} ∈ RJ . (5) TherangeofH isVπ = {c ∈ RJ : (cid:107)(cid:80)Jj=1cjbj(cid:107) < π/4} ⊂ RJ. Now,wedefineG : RJ → Γ,as {cj} ∈ RJ −{−bj→} v = (cid:88)J cjbj ∈ T1(S+∞) −e−x−p→1 q = exp1(v) ∈ S∞ →− γ(t) = (cid:90)0tq(s)2ds. (6) j=1 7 )1 | v { composed of two steps as follows. First, we utilize the fact that if then q = √γ˙, called its square-root velocity function (SRVF), is a point on the unit Hilbert sphere S ⊂ Figure2: AgraphicrepresentationofEqn. 5leadingtoabijectivemapbetweenΓandV . π If we restrict the domain of G to V , then G is invertible and its inverse is H. Since our attention can be π restrictedonlytothesetVπ,ratherthantheentirespaceRJ,weidentifythefunctionGasH−1 henceforth, restrictingitsdomaintobeV . Foranyc V ,letγ denotethediffeomorphismH 1(c). ForanyfixedJ, π π c − ∈ thesetH 1(V )isafinite-dimensionalsubmanifoldofΓ,onwhichweposetheestimationproblem. AsJ − π goestoinfinity,thissubmanifoldconvergestothefullgroupΓ. Withthissetting,wecanrewritetheestimateoftheunknowndensityf ,givenaninitialdensityestimate t f ,asfˆ(t) = f (γ (t))γ˙ (t),t [0,1],whereγ = H 1(cˆ)and p p cˆ cˆ cˆ − ∈ (cid:32) (cid:34) (cid:35)(cid:33) n (cid:88) cˆ= argmax log(fp(γc(xi))γ˙c(xi)) , γc = H−1(c). (7) c∈RJ i=1 Choice of Basis Functions: One can choose from a wide range of basis elements for T1(S ). Since the ∞ proposed analysis is on [0,1], one can use the Fourier basis elements (excluding 1 of course). However, other bases such as splines and Legendre polynomials can also be used. In the experimental studies, we demonstrate an example using the Meyer wavelets. Meyer wavelets have attractive properties of infinite differentiability and support over all reals. Vermehren and de Oliveira [2015] provides a closed-form ex- pressionforMeyerwaveletsandscalefunctioninthetimedomain,whichenablesustoeasilyusethebasis set for representation. However, Meyer wavelets are not naturally orthogonal to 1 and so they need to be orthogonalizedfirst. 8 2.2 AdvantagesOverDirectApproximations We have used the geometry of Γ to develop a natural, local flattening of this nonlinear manifold, and then to approximate it with a finite subspace. Other, seemingly simpler, choices are also possible. For instance, since any γ can also be viewed as a nonnegative function in L2 with appropriate constraints, it may be tempting to use γ ≈ (cid:80)∞j=1cjbj(t) for some orthogonal basis B = {bj,j = 1,2,...} of L2[0,1]. This seems easier than our approach as it avoids going through the non-linear transformations. However the problemisthepresenceofbothγ andγ˙ inthefinalestimate. Agoodapproximationofγ doesnotimplya goodapproximationofγ˙,butthereverseholdstrue,asarguedbelow. Proposition 2. For any γ Γ, let γ˙ be an approximation of γ˙, and let γ be the integral of γ˙ . app app app ∈ For all x (0,1] consider intervals I of the form [0,x ]. Then, on all intervals I , γ γ 0 ∈ x0 0 x0 (cid:107) − app(cid:107) ≤ ∞ γ˙ γ˙ . app (cid:107) − (cid:107) ∞ Proof: Lett I . γ(t) γ (t) = (cid:82)tγ˙(s)ds (cid:82)tγ˙ (s)ds (cid:82)t γ˙(s) γ˙ (s) ds γ˙ γ˙ .t ∈ x0 | − app | | 0 − 0 app | ≤ 0 | − app | ≤ (cid:107) − app(cid:107) ≤ ∞ γ˙ γ˙ .x γ˙ γ˙ (cid:50) app 0 app (cid:107) − (cid:107) ≤ (cid:107) − (cid:107) ∞ ∞ Thispropositionstatesthatagoodapproximationofγ˙ ensuresagoodapproximationofγ,andsupports ourapproachofapproximatingγ viatheinverseexponentialtransformationofitsSRSFtothetangentspace T1(S+). ∞ 2.3 Estimationofdensitieswithunknownsupport So far we have restricted to the interval [0,1] for representing a pdf. However, the framework extends naturally to pdfs with unknown support. For that, we simply scale the observations to [0,1] and carry out the original procedure. Let X ,X ,...,X f where f has unknown support. Then we transform the 1 2 n t t ∼ data as Yi = XBi−AA, where A and B are the estimated boundaries of the density. Following Turnbull and − Ghosh [2014], we take A = X s /√n, and A = X +s /√n where X and X are the first (1) X (n) X (1) (n) − and last order statistics of X, and s is the sample standard deviation of the observed samples. Using the X scaleddata,wecanfindtheestimatedpdff on[0,1]andthenundothescalingtoreachthefinalsolution: w (cid:18) (cid:19) f (t) 1 f t A . TurnbullandGhosh[2014]provideajustificationforthechoiceofAandB as w (cid:55)→ B A w B−A − − theestimatesfortheboundsofthedensity. Theyalsodiscussanalternatewayofestimatingtheboundaries using ideas presented in De Carvalho [2011], and suggest that the Carvalho method produces wider and moreconservativeboundaryestimates. 9 3. CONVERGENCERATEBOUNDS We have represented an arbitrary pdf as a function of the coefficients w.r.t a basis set of the tangent space. F Wenotethatinordertorepresenttheentirespace ,weneedaHilbertbasiswithinfinitelymanyelements. However, in practice, we use only a finite number J of basis elements. Hence, we are actually optimizing overasubsetofthespaceofdensityfunctionsbasedononlyafewbasiselementsandusingittoapproximate the true density. This subset is called the approximating space. Since we are performing maximum likeli- hoodestimationoveranapproximatingspaceforpdfs,ourestimationisakintothesieveMLE,discussedin WongandShen[1995]. First,weintroducesomenotations. RecallthatF isthespaceofallunivariate,strictlypositivepdfson [0,1] and zero elsewhere. Let F be the approximating space of F when using k basis elements for the n n tangent space T1(S+), where kn is some function of the number of observations n. Let fp Fp F be ∞ ∈ ⊂ aninitialestimate,andletFn = fp(γ)γ˙,γ = H−1(c)) c Vπ Rkn ,whereH andVπ aredefinedin { | ∈ ⊂ } Section 2.1. As n ,k . So F F as n . Let η be a sequence of positive numbers n n n → ∞ → ∞ → → ∞ converging to 0. Let Y (n) be the space of n observed points. We call an estimator pˆ: Y (n) F an η n n → sieveMLEif n 1 (cid:88) 1 (cid:88) nlogpˆ(Y ) sup logp(Y ) η i i n n i=1 ≥ p∈Fnn i=1 − In the proposed method, the estimated pdf is exactly sup 1 (cid:80)n logp(Y ). Therefore, this estimate is a n i=1 i p Fn ∈ sieve MLE with η 0. Let p denote the true density which is assumed to be β-smooth for β > 0. The n 0 ≡ followingtheoremprovidesconvergenceratesofthesieveestimators. Theorem1. Undertheassumptionslistedabove P( pˆ1/2 p1/2 (cid:15) ) c exp c n((cid:15) )2 (cid:107) − 0 (cid:107)2 ≥ ∗n ≤ 1 {− 2 ∗n } forsomeconstantsc ,c > 0,where(cid:15) = Mn β/(2β+1)√lognforalargeconstantM > 0. 1 2 ∗n − TheproofofTheorem1isdeferredtotheAppendix. 4. IMPLEMENTATIONISSUES In this sectionwe outline the estimation procedure anddiscuss some implementation issues. We discretize density functions using a dense uniform partition, T = 100 equidistant points over the interval [0,1]. For 10