Table Of ContentA Geometric Framework For Density Modeling
SutanoyDasgupta 1,DebdeepPati1,andAnujSrivastava1
∗
1DepartmentofStatistics,FloridaStateUniversity
7
1
0
2
n
a
J
Abstract
0
2
Weintroduceageometricapproachforestimatingaprobabilitydensityfunction(pdf)givenitssam-
]
E ples. The procedure involves obtaining an initial estimate of the pdf and then transforming it via a
M
warpingfunctiontoreachthefinalestimate. Theinitialestimateisintendedtobecomputationallyfast,
.
t
a albeit suboptimal, but its warping creates a larger, flexible class of density functions, resulting in sub-
t
s
stantially improved estimation. The warping is accomplished by mapping diffeomorphic functions to
[
1 thetangentspaceofaHilbertsphere,avectorspacewhoseelementscanbeexpressedusinganorthog-
v
6 onal basis. Using a truncated basis expansion, we estimate the optimal warping and, thus, the optimal
5
6 densityestimate. Thisframeworkisintroducedforunivariate,unconditionalpdfestimationandthenex-
5
tendedtoconditionalpdfestimation. Theapproachavoidsmanyofthecomputationalpitfallsassociated
0
.
1 with current methods without losing on estimation performance. In presence of irrelevant predictors,
0
7 theapproachachievesbothstatisticalandcomputationalefficiencycomparedtoclassicalapproachesfor
1
: conditional density estimation. We derive asymptotic convergence rates of the density estimator and
v
i
X demonstratethisapproachusingsyntheticdatasets,andacasestudytounderstandassociationofatoxic
r metaboliteonpretermbirth.
a
KEYWORDS: conditionaldensity;densityestimation;Hilbertsphere;irrelevantpredictors;sieveestimation;
tangentspace;weightedlikelihoodmaximization
∗CorrespondingAuthor:s.dasgupta@stat.fsu.edu
1
1. INTRODUCTION
Estimatingaprobabilitydensityfunction(pdf)isanimportantandwellstudiedfieldofresearchinstatis-
tics. The most basic problem in this area is that of univariate pdf estimation from iid samples, henceforth
referred to as unconditional density estimation. Another problem of utmost importance is conditional den-
sityestimation. Hereoneneedstocharacterizethebehavioroftheresponsevariableforeachofthedifferent
values of the predictors. A natural extension of the these problems is that of estimating the joint pdf for
multiplerandomvariables.
Giventheimportanceofpdfestimationinstatisticsandrelateddisciplines, alargenumberofsolutions
have been proposed for each of these problems. While the earliest works focused on parametric solutions,
thetrendoverthelastthreedecadesistouseanonparametricapproachasitminimizesmakingassumptions
about the underlying density (and the relationships between variables for conditional and joint densities).
The most common nonparametric techniques are kernel based; refer to Rosenblatt et al. [1956], Hall et al.
[1991], Sheather and Jones [1991], Li and Racine [2007] for a narrative of works. Related to these ap-
proaches are “tilting” or “data sharpening” techniques for unconditional density estimation, for example
Hjort and Glad [1995], Doosti and Hall [2016] and the references therein. Kernel methods are very pow-
erful in univariate setting. However, as the number of variables involved gets higher, the methods become
computationallyinefficientbecauseofthecomplexitiesinvolvedinbandwidthselection, especiallyincon-
ditional density estimation setup. Another common approach [Leonard, 1978, Lenk, 1988, 1991, Tokdar
et al., 2010, Tokdar, 2012] for pdf estimation is a two step estimation procedure, where one first estimates
an initial pdf, say f , from the data, perhaps restricting to a parametric family. Then, one improves upon
p
this estimate by considering a function g(x) depending on the initial estimate f and an unknown func-
p
(cid:82)
tion W, to reach a final estimate of the type exp g(x) / exp g(x) dx. In a classical paradigm one can
{ } x { }
chooseexp g(x) asW(x)f (x)whereW(x)isapositivefunction. InaBayesiancontext,thefunctiong
p
{ }
is assigned a Gaussian process prior W with mean f . Often the calculation of the normalization constant
p
makes the computation very cumbersome. On the other hand, two step procedures for conditional density
estimation are based on estimating the conditional mean function, and then the conditional density of the
residuals [Hansen, 2004]. Over the recent years, Bayesian methods for estimating pdfs based on mixture
modelsandlatentvariableshavereceivedalotofattentionbecauseoftheirexcellentpracticalperformances
andanincreasinglyrichsetofalgorithmictoolsforposteriorcomputationusingMarkovChainMonteCarlo
2
(MCMC) methods. References include Escobar and West [1995], Mu¨ller et al. [1996], MacEachern and
Mu¨ller [1998], Kalli et al. [2011], Jain and Neal [2012], Kundu and Dunson [2014], Bhattacharya et al.
[2010] among others. However these results also come at a very high computational cost associated with
implementing the MCMC algorithms. Applications of flexible Bayesian models for conditional densities
arediscussedinMacEachern[1999],DeIorioetal.[2004],GriffinandSteel[2006],Dunsonetal.[2007],
ChungandDunson[2009],NoretsandPelenis[2012], amongothers. Althoughtheliteraturesuggeststhat
suchmethodsbasedonmixturemodelshave severalattractiveproperties, theylackinterpretabilityandthe
MCMCsolutionsformodelfittingareoverlycomplicatedandexpensive.
Inthisarticleageometricallyintuitiveframeworkforatwostepapproachisproposedthatisapplicable
tobothconditionalandunconditionaldensityestimation. Themotivationhereistoavoidthecomputational
costassociatedwithtraditionaltechniques, whilenotlosingoutonpracticalperformance. Assumethatwe
have a strictly positive univariate density f on the interval [0,1]; f serves as an initial estimate as in the
p p
traditionaltwostepmethods. LetΓbethesetofallpositivediffeomorphismsfrom[0,1]toitself:
Γ = γ γ issmooth,γ−1 issmooth,γ˙ > 0,γ(0) = 0,γ(1) = 1 (1)
{ | }
TheelementsofΓplaytheroleofwarpingfunctions,ortransformationsoff . Givenaγ Γ,thetransfor-
p
∈
mationoff isdefinedby:
p
(f ,γ) = (f γ)γ˙ . (2)
p p
◦
Henceforth this transformation is referred to as warping and the resulting pdf f as a warped density. This
mapping is comprehensive in the sense that it forms a transitive group action on the space of all strictly
positive pdfs on [0,1]. This allows one to go from any positive pdf to any other positive pdf using an
appropriate γ. Note that since (cid:82)1f (γ(x))γ˙(x)dx = 1, there is no need to normalize this transformation.
0 p
However, since Γ is not a vector space, one has to map the elements of Γ, using a nonlinear mapping, to a
vectorspacetofacilitateexpansionusinganorthogonalbasis. Thediffeomorphismsastransformationsofa
probability density shape have been used in the past, albeit with a different setup and scope [Saoudi et al.,
1994, 1997]. Also, the notion of transformation between pdfs has been used in the literature on optimal
transport [Tabak and Turner, 2013, Tabak and Trigila, 2014]. The transport is achieved using an iterated
compositionofmapsandisthusverydifferentfromestimatingadiffeomorphism.
In this article, we develop the framework for estimating an unconditional, univariate pdf defined on
3
[0,1]. This simple setting helps explain and illustrate the main ingredients of the framework. Besides, the
proposedgeometricframeworkisnaturallyunivariateinthesensethatthetransformationdefinedin(2)acts
on univariate density shapes making it a logical starting point for developments. In this simple setup, the
approach delivers excellent performance while avoiding heavy computational cost, and is comparable to
standardkernelmethods, evenatverylowsamplesizes. Theframeworkisextendedtounivariatedensities
withunknownsupportbyscalingtheobservationdomainto[0,1]. Adefiningcharacteristicofthiswarping
transformation is that it takes any initial density estimate to any final desired estimate, as long as they are
both strictly positive. Hence the initial estimate can be constructed in anyway – parametric (e.g. gaussian)
ornonparametric(e.g. kernelestimate),andisallowedtobeasub-optimalestimateofthetruedensity.
The second part of the article focuses on extending the framework to estimation of conditional density
f(y x)from (yi,xi) : i = 1,...,n,y R,x Rd,d 1 . Theapproachistostartwithanonparametric
| { ∈ ∈ ≥ }
meanregressionmodeloftheformy = m(x )+(cid:15) ,(cid:15) (0,σ2),wherem( )isestimatedusingastandard
i i i i
∼ N ·
nonparametricestimator,toobtainaninitialconditionaldensityestimatef (mˆ(x),σˆ2)atthelocation
p,x
≡ N
x. Then f is warped using a warping function γ into a final conditional density estimate. Importantly,
p,x x
the set Γ and the warping transformation remain the same as the unconditional density estimation, and do
not involve the predictors. Only the choice of the element γ Γ varies with the predictor x. This makes
x
∈
the estimation process computationally efficient even in the presence of many predictors. The selection of
γ is based on a weighted-likelihood objective function that borrows information from the neighborhood
x
ofthelocationxatwhichtheconditionaldensityisevaluated. Sincethetransformationoperationdoesnot
involvethepredictors,thefinalestimateisnaturallyresistanttotheeffectofpredictorsthatareirrelevantto
theresponse.
Therestofthispaperisorganizedasfollows. Section2introducestheformalmethodologyandframe-
work on a univariate unconditional density estimation setting. Section 3 presents notations and the con-
vergence rate of the density estimator. Section 4 focuses on the practical implementation of the method
andoutlinestheassociatednumericaltechniques. Illustrationsofunivariatedensityestimationonsimulated
datasets are presented in Section 5. In section 6 the theory for conditional density estimation is developed
andthepropertiesoftheproposedmethodareillustratedusingsimulateddatasets. Theapplicationofcon-
ditional density estimation using the proposed framework on a real dataset is also presented. Section 7
includesadiscussiononhowtheideaspresentedinearliersectionscanbeusedtoperformbivariateuncon-
4
T1(S )=
∞
Let
(fp,γ) F
theunderlyingtruedensityand
as the second step of the estimation procedure. This way the
)1
| {qv)=exp−11(q):
nonpanraomnpeatrriacm),esturicch),thsuacth.afnFtthuoarptthtaiemnrmaolpoftrFipem,aleltFp cthoemnpqo=sed√oγ˙f,tcwaollesdteiptsssaqsufaorlelo-rwoso.tvFeirlostc,iwtyefuunticlitzioenth(SeRfaVcFt)t,hiastaifpointontheunitHilbert
anyparametricfamilywhospelepsacraommeintegrsfrcoamnbeeasilycomputedisagoodcandidatefor sphereS ⊂
):S
∞
Figure 1: Left: The true pdf f is estimated by transforming an initial estimate f by the warping function
t p
γ. The larger the set of allowed γs, the better the estimate is. Right: Representing warping function γ as
elementofthetangentspaceT1(S+).
∞
ditionaldensityestimation,thechallengesinvolvedinhigherdimensions,andasimplebivariateillustration.
Finally,theproofsoftheconvergenceratepresentedinSection3isderivedinanappendix.
2. PROPOSEDFRAMEWORK
In this section we develop a two-step framework for estimating univariate, unconditional pdf, and start by
F
introducingsomenotation. Let bethesetofallstrictlypositive,univariateprobabilitydensityfunctions
on [0,1]. Let f F denote the underlying true density and x f , i = 1,2,...,n be independent
t i t
∈ ∼
samples from f . Furthermore, let F be a pre-determined subset of F, such that an optimal element
t p
(defined appropriately) f F is relatively easy to compute. For instance, any parametric family with
p p
∈
simple estimation procedure is a good candidate for f . Similarly, kernel density estimates are also good
p
sincetheyarecomputationallyefficientandrobustinunivariatesetups.
Next, we define a warping-based transformation of elements of F , using elements of Γ defined in
p
(1). Note that Γ is an infinite-dimensional set that has a group structure under composition as the group
operation. That is, for any γ ,γ Γ, the composition γ γ Γ. The identity element of Γ is given by
1 2 1 2
∈ ◦ ∈
γ (t) = t, and for every γ Γ, there is a function γ 1 such that γ γ 1 = γ . For any f F and
id − − id p p
∈ ◦ ∈
γ Γ,definethemapping(f ,γ) = (f γ)γ˙ asgivenin(2). Theimportanceofthismappingcomesfrom
p p
∈ ◦
thefollowingresult.
Proposition1. ThemappingF Γ F,specifiedabove,formsanactionofΓonF. Furthermore,this
× →
5
actionistransitive.
Proof: We can verify the two properties in the definition of a group action: (1) For any γ ,γ Γ and
1 2
∈
f F,wehave((f,γ ),γ ) = (((f γ )γ˙ ) γ )γ˙ = (f,γ γ ). (2)Foranyf F,(f,γ ) = f. To
1 2 1 1 2 2 1 2 id
∈ ◦ ◦ ◦ ∈
showtransitivity, weneedtoshowthatgivenanyf ,f F, thereexistsaγ Γ, suchthat(f ,γ) = f .
1 2 1 2
∈ ∈
If F and F denote the cumulative distribution functions associated with f and f , respectively, then the
1 2 1 2
desiredγ issimplyF1−1 ◦F2. Sincef1 isstrictlypositive,F1−1 iswelldefinedandγ isuniquelyspecified.
Furthermore,sincef isstrictlypositive,wehaveγ˙ > 0. (cid:50)
2
Thisresultimpliesthattogetherthepair(f ,γ)spansthefullsetF,ifγ ischosenfreelyfromΓ. One
p
can reach any f F, from any f F using an appropriate element of Γ. However, if one uses Γ ,
t p p J
∈ ∈
a finite-dimensional submanifold of Γ, instead of the full Γ, we may not reach the desired f , but only get
t
F
closer. ThisisdepictedpictoriallyintheleftpanelofFigure1wheretheinnerdiskdenotestheset . The
p
increasingringsaroundF representtheset (f ,γ) f F ,γ Γ foranincreasingsequenceofΓ s.
p p p p J J
{ | ∈ ∈ }
AsΓ approachesΓ,thecorrespondingapproximationapproachesf . MoredetailsareincludedinSection
J t
A.1.
2.1 Finite-DimensionalRepresentationofWarpingFunctions
Solving an optimization problem, say maximum-likelihood estimation, over Γ faces two main challenges.
First, Γ is a nonlinear manifold, and second, it is infinite-dimensional. We handle the nonlinearity by
forming a bijective map from Γ to a tangent space of the unit Hilbert sphere S (the tangent space is a
∞
vector space), and infinite dimensionality by selecting a finite-dimensional subspace of this tangent space.
Together, thesetwostepsareequivalenttofindingafamilyoffinite-dimensionalsubmanifoldsΓ thatcan
J
beflattenedintovectorspaces. Thisallowsforarepresentationofγ usingorthogonalbasis. Oncewehave
afinite-dimensionalrepresentationofγ,wecanoptimizeoverthisrepresentationofγ usingthemaximum-
likelihoodcriterion.
(cid:112)
Define a function q : [0,1] R, q(t) = γ˙(t), as the square-root slope function (SRSF) of a γ Γ.
→ ∈
(ForadiscussiononSRSFsofgeneralfunctions,pleaserefertoChapter4ofSrivastavaandKlassen[2016]).
For any γ Γ, its SRSF q is an element of the interior of the positive orthant of the unit Hilbert sphere
∈
S L2, denoted by S+. This is because q 2 = (cid:82)1q(t)2dt = (cid:82)1γ˙(t)dt = γ(1) γ(0) = 1. We
∞ ⊂ ∞ (cid:107) (cid:107) 0 0 −
have a positive orthant, boundaries excluded, because by definition, q is a strictly positive function. The
6
mapping between Γ and S+ is a bijection, with its inverse given by γ(t) = (cid:82)tq(s)2ds. The unit Hilbert
0
∞
sphere is a smooth manifold with known geometry under the L2 Riemannian metric [Lang, 2012]. It is
not a vector space but a manifold with a constant curvature, and can be easily flattened into a vector space
locally. The chosen vector space is a tangent space of S+. A natural choice for reference, to select the
∞
tangent space, is the point 1 S+, a constant function with value 1, which is the SRSF corresponding to
∈ ∞
γ = γid(t) = t.ThetangentspaceofS+ at1isaninfinite-dimensionalvectorspacegivenby: T1(S+) =
∞ ∞
v L2([0,1],R) (cid:82)1v(t)dt = v,1 = 0 .
{ ∈ | 0 (cid:104) (cid:105) }
Next,wedefineamappingthattakesanarbitraryelementofS+ tothistangentspace. Forthisretraction,
∞
wewillusetheinverseexponentialmapthattakesq S+ toT1(S+)accordingto:
∈ ∞ ∞
θ
exp−11(q) : S+∞ → T1(S+∞), v = exp−11(q) = sin(θ)(q−1cos(θ)), (3)
whereθ = cos 1( 1,q )isthearc-lengthfromqto1. TherightpanelofFig. 1showsapictorialillustration
−
(cid:104) (cid:105)
ofthemappingfromS+ tothetangentspaceT1(S+).
∞ ∞
WeimposeanaturalHilbertstructureonT1(S+∞)usingthestandardinnerprodu(cid:113)ct: (cid:104)v1,v2(cid:105) = (cid:82)01v1(t)v2(t)dt.
Itiseasytocheckthatsinceq ∈ S+∞,θ = cos−1((cid:104)1,q(cid:105)) < π/4,andhence(cid:107)v(cid:107) = (cid:82)01v(t)2dt = θ < π/4,
where v = exp−11(q). Thus the range of the inverse exponential map is not the entire T1(S+), but an
∞
open subset T10(S+∞) = {v ∈ T1(S+∞) : (cid:107)v(cid:107) < π/4}. Further, we can select any orthogonal basis
= bj,j = 1,2,... of the set T1(S+) to express its elements v by their corresponding coefficients;
B { } ∞
(cid:80)
that is, v(t) = ∞j=1cjbj(t), where cj = (cid:104)v,bj(cid:105). The only restriction on the basis elements bj’s is that
theymustbeorthogonalto1,thatis, b ,1 = 0. Inordertomappointsbackfromthetangentspacetothe
j
(cid:104) (cid:105)
Hilbertsphere,weusetheexponentialmap,givenby:
sin( v )
exp(v) : T1(S+) S , exp(v) = cos( v )1+ (cid:107) (cid:107) . (4)
∞ → ∞ (cid:107) (cid:107) v
(cid:107) (cid:107)
If we restrict the domain of the exponential map to the subset T0(S+), then the range of this map is S+.
1
∞ ∞
Usingthesetwosteps,wespecifythefinite-dimensional(approximate)representationofdiffeomorphisms.
WedefineacompositemapH : Γ RJ,pictoricallyillustratedinFigure2,as
→
γ ∈ Γ −S−R−S−→F q = (cid:112)γ˙ ∈ S+∞ −e−x−p−−1→1 v ∈ T10(S+∞) −{−bj→} {cj = (cid:104)v,bj(cid:105)} ∈ RJ . (5)
TherangeofH isVπ = {c ∈ RJ : (cid:107)(cid:80)Jj=1cjbj(cid:107) < π/4} ⊂ RJ. Now,wedefineG : RJ → Γ,as
{cj} ∈ RJ −{−bj→} v = (cid:88)J cjbj ∈ T1(S+∞) −e−x−p→1 q = exp1(v) ∈ S∞ →− γ(t) = (cid:90)0tq(s)2ds. (6)
j=1
7
)1
| v
{
composed of two steps as follows. First, we utilize the fact that if
then q = √γ˙, called its square-root velocity function (SRVF), is a point on the unit Hilbert
sphere S
⊂
Figure2: AgraphicrepresentationofEqn. 5leadingtoabijectivemapbetweenΓandV .
π
If we restrict the domain of G to V , then G is invertible and its inverse is H. Since our attention can be
π
restrictedonlytothesetVπ,ratherthantheentirespaceRJ,weidentifythefunctionGasH−1 henceforth,
restrictingitsdomaintobeV . Foranyc V ,letγ denotethediffeomorphismH 1(c). ForanyfixedJ,
π π c −
∈
thesetH 1(V )isafinite-dimensionalsubmanifoldofΓ,onwhichweposetheestimationproblem. AsJ
− π
goestoinfinity,thissubmanifoldconvergestothefullgroupΓ.
Withthissetting,wecanrewritetheestimateoftheunknowndensityf ,givenaninitialdensityestimate
t
f ,asfˆ(t) = f (γ (t))γ˙ (t),t [0,1],whereγ = H 1(cˆ)and
p p cˆ cˆ cˆ −
∈
(cid:32) (cid:34) (cid:35)(cid:33)
n
(cid:88)
cˆ= argmax log(fp(γc(xi))γ˙c(xi)) , γc = H−1(c). (7)
c∈RJ i=1
Choice of Basis Functions: One can choose from a wide range of basis elements for T1(S ). Since the
∞
proposed analysis is on [0,1], one can use the Fourier basis elements (excluding 1 of course). However,
other bases such as splines and Legendre polynomials can also be used. In the experimental studies, we
demonstrate an example using the Meyer wavelets. Meyer wavelets have attractive properties of infinite
differentiability and support over all reals. Vermehren and de Oliveira [2015] provides a closed-form ex-
pressionforMeyerwaveletsandscalefunctioninthetimedomain,whichenablesustoeasilyusethebasis
set for representation. However, Meyer wavelets are not naturally orthogonal to 1 and so they need to be
orthogonalizedfirst.
8
2.2 AdvantagesOverDirectApproximations
We have used the geometry of Γ to develop a natural, local flattening of this nonlinear manifold, and then
to approximate it with a finite subspace. Other, seemingly simpler, choices are also possible. For instance,
since any γ can also be viewed as a nonnegative function in L2 with appropriate constraints, it may be
tempting to use γ ≈ (cid:80)∞j=1cjbj(t) for some orthogonal basis B = {bj,j = 1,2,...} of L2[0,1]. This
seems easier than our approach as it avoids going through the non-linear transformations. However the
problemisthepresenceofbothγ andγ˙ inthefinalestimate. Agoodapproximationofγ doesnotimplya
goodapproximationofγ˙,butthereverseholdstrue,asarguedbelow.
Proposition 2. For any γ Γ, let γ˙ be an approximation of γ˙, and let γ be the integral of γ˙ .
app app app
∈
For all x (0,1] consider intervals I of the form [0,x ]. Then, on all intervals I , γ γ
0 ∈ x0 0 x0 (cid:107) − app(cid:107) ≤
∞
γ˙ γ˙ .
app
(cid:107) − (cid:107)
∞
Proof: Lett I . γ(t) γ (t) = (cid:82)tγ˙(s)ds (cid:82)tγ˙ (s)ds (cid:82)t γ˙(s) γ˙ (s) ds γ˙ γ˙ .t
∈ x0 | − app | | 0 − 0 app | ≤ 0 | − app | ≤ (cid:107) − app(cid:107) ≤
∞
γ˙ γ˙ .x γ˙ γ˙ (cid:50)
app 0 app
(cid:107) − (cid:107) ≤ (cid:107) − (cid:107)
∞ ∞
Thispropositionstatesthatagoodapproximationofγ˙ ensuresagoodapproximationofγ,andsupports
ourapproachofapproximatingγ viatheinverseexponentialtransformationofitsSRSFtothetangentspace
T1(S+).
∞
2.3 Estimationofdensitieswithunknownsupport
So far we have restricted to the interval [0,1] for representing a pdf. However, the framework extends
naturally to pdfs with unknown support. For that, we simply scale the observations to [0,1] and carry out
the original procedure. Let X ,X ,...,X f where f has unknown support. Then we transform the
1 2 n t t
∼
data as Yi = XBi−AA, where A and B are the estimated boundaries of the density. Following Turnbull and
−
Ghosh [2014], we take A = X s /√n, and A = X +s /√n where X and X are the first
(1) X (n) X (1) (n)
−
and last order statistics of X, and s is the sample standard deviation of the observed samples. Using the
X
scaleddata,wecanfindtheestimatedpdff on[0,1]andthenundothescalingtoreachthefinalsolution:
w
(cid:18) (cid:19)
f (t) 1 f t A . TurnbullandGhosh[2014]provideajustificationforthechoiceofAandB as
w (cid:55)→ B A w B−A
− −
theestimatesfortheboundsofthedensity. Theyalsodiscussanalternatewayofestimatingtheboundaries
using ideas presented in De Carvalho [2011], and suggest that the Carvalho method produces wider and
moreconservativeboundaryestimates.
9
3. CONVERGENCERATEBOUNDS
We have represented an arbitrary pdf as a function of the coefficients w.r.t a basis set of the tangent space.
F
Wenotethatinordertorepresenttheentirespace ,weneedaHilbertbasiswithinfinitelymanyelements.
However, in practice, we use only a finite number J of basis elements. Hence, we are actually optimizing
overasubsetofthespaceofdensityfunctionsbasedononlyafewbasiselementsandusingittoapproximate
the true density. This subset is called the approximating space. Since we are performing maximum likeli-
hoodestimationoveranapproximatingspaceforpdfs,ourestimationisakintothesieveMLE,discussedin
WongandShen[1995].
First,weintroducesomenotations. RecallthatF isthespaceofallunivariate,strictlypositivepdfson
[0,1] and zero elsewhere. Let F be the approximating space of F when using k basis elements for the
n n
tangent space T1(S+), where kn is some function of the number of observations n. Let fp Fp F be
∞ ∈ ⊂
aninitialestimate,andletFn = fp(γ)γ˙,γ = H−1(c)) c Vπ Rkn ,whereH andVπ aredefinedin
{ | ∈ ⊂ }
Section 2.1. As n ,k . So F F as n . Let η be a sequence of positive numbers
n n n
→ ∞ → ∞ → → ∞
converging to 0. Let Y (n) be the space of n observed points. We call an estimator pˆ: Y (n) F an η
n n
→
sieveMLEif
n
1 (cid:88) 1 (cid:88)
nlogpˆ(Y ) sup logp(Y ) η
i i n
n i=1 ≥ p∈Fnn i=1 −
In the proposed method, the estimated pdf is exactly sup 1 (cid:80)n logp(Y ). Therefore, this estimate is a
n i=1 i
p Fn
∈
sieve MLE with η 0. Let p denote the true density which is assumed to be β-smooth for β > 0. The
n 0
≡
followingtheoremprovidesconvergenceratesofthesieveestimators.
Theorem1. Undertheassumptionslistedabove
P( pˆ1/2 p1/2 (cid:15) ) c exp c n((cid:15) )2
(cid:107) − 0 (cid:107)2 ≥ ∗n ≤ 1 {− 2 ∗n }
forsomeconstantsc ,c > 0,where(cid:15) = Mn β/(2β+1)√lognforalargeconstantM > 0.
1 2 ∗n −
TheproofofTheorem1isdeferredtotheAppendix.
4. IMPLEMENTATIONISSUES
In this sectionwe outline the estimation procedure anddiscuss some implementation issues. We discretize
density functions using a dense uniform partition, T = 100 equidistant points over the interval [0,1]. For
10