Overlapping Cover Local Regression Machines
Mohamed Elhoseiny¹ · Ahmed Elgammal²

¹ Facebook AI Research
² Department of Computer Science, Rutgers University
E-mail: elhoseiny@fb.com, elgammal@cs.rutgers.edu

Received: date / Accepted: date

Abstract  We present the Overlapping Domain Cover (ODC) notion for kernel machines: a set of overlapping subsets of the data that covers the entire training set and is optimized to be as spatially cohesive as possible. We show how this notion benefits local kernel machines for regression, improving speed while minimizing the prediction error. We propose an efficient ODC framework, which is applicable to various regression models and in particular reduces the complexity of Twin Gaussian Processes (TGP) regression from cubic to quadratic. Our notion is also applicable to several other kernel methods (e.g., Gaussian Process Regression (GPR) and IWTGP regression, as shown in our experiments). We also theoretically justify the idea behind our method of improving local prediction through the overlapping cover. We validated and analyzed our method on three benchmark human pose estimation datasets, and interesting findings are discussed.

1 Introduction

Estimation of a continuous real-valued or a structured-output function from input features is one of the critical problems that appears in many machine learning applications. Examples include predicting the joint angles of the human body from images, head pose, object viewpoint, illumination direction, and a person's age and gender. Typically, these problems are formulated as a regression model. Recent advances in structured regression encouraged researchers to adopt it for formulating various problems with high-dimensional output spaces, such as segmentation, detection, and image reconstruction, as regression problems. However, the computational complexity of the state-of-the-art regression algorithms limits their applicability for big data. In particular, kernel-based regression algorithms such as Ridge Regression [12], Gaussian Process Regression (GPR) [18], and the Twin Gaussian Processes (TGP) [2] require inversion of kernel matrices (O(N^3), where N is the number of training points), which limits their applicability for big data. We refer to these non-scalable versions of GPR and TGP as full-GPR and full-TGP, respectively.

Khandekar et al. [13] discussed properties and benefits of overlapping clusters for minimizing the conductance from a spectral perspective. These properties of overlapping clusters also motivate studying scalable local prediction based on overlapping kernel machines. Figure 1 illustrates the notion by starting from a set of points, dividing them into either disjoint or overlapping subsets, and finally learning a kernel prediction function on each (i.e., f_i(x_*) for subset i, where x_* is a test point). In summary, the main question we address in this paper is how local kernel machines with overlapping training data can help speed up the computations while achieving accurate predictions. We achieved considerable speedup and good performance on GPR, TGP, and IWTGP (Importance Weighted TGP) applied to 3D pose estimation datasets. To the best of our knowledge, our framework is the first to achieve quadratic prediction complexity for TGP. The ODC concept is also novel in the context of kernel machines and is shown here to be successfully applicable to multiple kernel machines; in this work we study the GPR, TGP, and IWTGP kernel machines. The remainder of this paper is organized as follows: Sections 2 and 4 present some motivating kernel machines and the related work. Section 5 presents our approach and a theoretical justification for our ODC concept. Sections 6 and 7 present our experimental validation and conclusion.
Fig. 1: Top: Left: 24 points; Middle: Overlapping Cover; Right: disjoint kernel machines of 8 points (evaluating x_* near the middle of a kernel machine). Bottom: Left: disjoint kernel machine evaluation on a boundary; Right: 6 overlapping kernel machines of 8 points. f_i(x_*) is the i-th kernel machine prediction for the test point x_*.
2 Background on Full GPR and TGP Models

In this section, we show example kernel machines that motivated us to propose the ODC framework to improve their performance and scalability. Specifically, we review GPR for single-output regression and TGP for structured-output regression. We selected the GPR and TGP kernel machines for their increasing interest and impact. However, our framework is not restricted to them.

GPR [18] assumes a linear model in the kernel space with Gaussian noise in a single-valued output, i.e., y = f(x) + N(0, σ_n^2), where x ∈ R^{d_X} and y ∈ R. Given a training set {x_i, y_i, i = 1 : N}, the posterior distribution of y given a test point x_* is:

p(y | x_*) = N( μ_y = k(x_*)^T (K + σ_n^2 I)^{-1} f,
                σ_y^2 = k(x_*, x_*) − k(x_*)^T (K + σ_n^2 I)^{-1} k(x_*) )        (1)

where k(x, x') is a kernel defined in the input space, K is an N × N matrix such that K(l, m) = k(x_l, x_m), k(x_*) = [k(x_*, x_1), ..., k(x_*, x_N)]^T, I is an identity matrix of size N, σ_n^2 is the variance of the measurement noise, and f = [y_1, ..., y_N]^T. GPR can predict a structured output y ∈ R^{d_Y} by training a GPR model for each dimension. However, this means that GPR does not capture the dependency between output dimensions, which limits its performance.

TGP [2] encodes the relation between both inputs and outputs using GP priors. This is achieved by minimizing the Kullback-Leibler divergence between the marginal GP of outputs (e.g., poses) and observations (e.g., features). Hence, the TGP prediction is given by:

ŷ(x_*) = argmin_y [ k_Y(y, y) − 2 k_Y(y)^T (K_X + λ_X I)^{-1} k_X(x_*)
                     − η log( k_Y(y, y) − k_Y(y)^T (K_Y + λ_Y I)^{-1} k_Y(y) ) ]        (2)

where η = k_X(x_*, x_*) − k_X(x_*)^T (K_X + λ_X I)^{-1} k_X(x_*); k_X(x, x') = exp(−‖x − x'‖^2 / (2 ρ_x^2)) and k_Y(y, y') = exp(−‖y − y'‖^2 / (2 ρ_y^2)) are Gaussian kernel functions for the input feature x and the output vector y, with ρ_x and ρ_y the kernel bandwidths for the input and the output. k_Y(y) = [k_Y(y, y_1), ..., k_Y(y, y_N)]^T, where N is the number of training examples, k_X(x_*) = [k_X(x_*, x_1), ..., k_X(x_*, x_N)]^T, and λ_X and λ_Y are regularization parameters to avoid overfitting. This optimization problem can be solved using a quasi-Newton optimizer with cubic polynomial line search [2]; we denote the number of steps to convergence as l_2.
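To make the cost of full-GPR concrete, the following is a minimal NumPy sketch of Eq. 1 (ours, not the authors' code; the names rbf_kernel, gpr_fit, and gpr_predict are illustrative). The O(N^3) factorization of the N × N kernel matrix is the step that the ODC framework later localizes.

```python
import numpy as np

def rbf_kernel(A, B, rho):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * rho ** 2))

def gpr_fit(X, y, rho, sigma_n):
    # Precompute the O(N^3) part of Eq. 1: (K + sigma_n^2 I)^{-1} and its product with y.
    K = rbf_kernel(X, X, rho)
    A_inv = np.linalg.inv(K + (sigma_n ** 2) * np.eye(len(X)))
    return {"X": X, "rho": rho, "A_inv": A_inv, "alpha": A_inv @ y}

def gpr_predict(model, x_star):
    # Predictive mean and variance for a single test point (Eq. 1).
    k_star = rbf_kernel(x_star[None, :], model["X"], model["rho"]).ravel()
    mean = k_star @ model["alpha"]
    var = 1.0 - k_star @ model["A_inv"] @ k_star   # k(x*, x*) = 1 for the RBF kernel
    return mean, var

# Toy usage: N training points in d_X dimensions, single-valued output.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
model = gpr_fit(X, y, rho=1.0, sigma_n=0.1)
print(gpr_predict(model, rng.normal(size=5)))
```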
Table 1: Comparison of the computational complexity of training and testing for Full, NN (Nearest Neighbor), FIC, Local-RPC, and our ODC. Training time includes all computations that do not depend on the test data (including clustering in some of these methods); testing includes the computations needed only for prediction.

Training for GPR and TGP: Ekmeans Clustering | RPC Clustering | Model training; Testing for each point: GPR-Y | GPR-Var | TGP-Y.

Full:                                -  |  -  |  O(N^3 + N^2 d_X)                 ||  O(N (d_X + d_Y))  |  O(N^2 d_Y)   |  O(l_2 N^2 d_Y)
NN [2]:                              -  |  -  |  -                                 ||  O(M^3 d_Y)        |  O(M^3 d_Y)   |  O(M^3 + l_2 M^2 d_Y)
FIC (GPR only, d_Y = 1) [22]:        -  |  -  |  O(M^2 (N + d_X))                  ||  O(M d_X)          |  O(M^2)       |  -
Local-RPC (GPR only, d_Y = 1) [5]:   -  |  O(N log(N/M) d_X)  |  O(M^2 (N + d_X))  ||  O(M d_X)          |  O(M^2)       |  -
ODC (our framework):                 O(N · (N/((1−p)M)) · d_X · l_1)  |  O(N log(N/((1−p)M)) d_X)  |  O(M^2 (N/(1−p) + d_X))  ||  O(K' M (d_X + d_Y))  |  O(K' M^2 d_Y)  |  O(l_2 K' M^2 d_Y)
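As a rough illustration of Table 1 (a sketch, not from the paper), the snippet below plugs representative values into the dominant per-test-point TGP terms; apart from N and M, which match the HumanEva experiments reported later, the chosen constants (d_Y, l_2, K') and the function names are ours.

```python
# Dominant per-test-point cost of TGP prediction under the schemes in Table 1
# (constant factors ignored; l2 = quasi-Newton steps, dY = output dimensionality).
def tgp_test_cost_full(N, dY, l2):
    return l2 * N**2 * dY

def tgp_test_cost_nn(M, dY, l2):
    return M**3 + l2 * M**2 * dY        # includes the per-test-point M x M inversions

def tgp_test_cost_odc(M, K_prime, dY, l2):
    return l2 * K_prime * M**2 * dY     # inversions are precomputed at training time

# Example with HumanEva-like sizes used later in the experiments (N ~ 9630, M = 800).
N, M, dY, l2, K_prime = 9630, 800, 30, 10, 1
print(tgp_test_cost_full(N, dY, l2) / tgp_test_cost_odc(M, K_prime, dY, l2))   # ~145x fewer operations
print(tgp_test_cost_nn(M, dY, l2) / tgp_test_cost_odc(M, K_prime, dY, l2))     # NN also pays the extra M^3 term
```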
3 Importance Weighted Twin Gaussian Processes (IWTGP)

Yamada et al. [26] proposed the importance-weighted variant of twin Gaussian processes [2], called IWTGP. The weights are calculated using RuLSIF [27] (relative unconstrained least-squares importance fitting). The weights are modeled as w_α(x, θ) = Σ_{l=1}^{n_te} θ_l k(x, x_l) to minimize E_{p_te(x)}[(w_α(x, θ) − w_α(x))^2], where k(x, x_l) = exp(−‖x − x_l‖^2 / (2 τ^2)), w_α(x) = p_te(x) / ((1 − α) p_te(x) + α p_tr(x)), and 0 ≤ α ≤ 1. Compared with plain importance weighting, which can be unstable, using the α-relative weight w_α with 0 ≤ α ≤ 1 is practically useful for stabilizing the covariate shift adaptation, even though it cannot give an unbiased model under covariate shift [27]. According to [26], the optimal θ̂ vector is computed in closed form as follows:

θ̂ = (Ĥ + ν I)^{-1} ĥ        (3)

where Ĥ_{l,l'} = ((1−α)/n_te) Σ_{i=1}^{n_te} k(x_i^te, x_l^te) k(x_i^te, x_{l'}^te) + (α/n_tr) Σ_{j=1}^{n_tr} k(x_j^tr, x_l^te) k(x_j^tr, x_{l'}^te), ĥ is an n_te-dimensional vector whose l-th element is ĥ_l = (1/n_te) Σ_{i=1}^{n_te} k(x_i^te, x_l^te), I is an n_te × n_te identity matrix, and n_te and n_tr are the numbers of testing and training points, respectively. Model selection for RuLSIF is based on cross-validation with respect to the squared-error criterion J in [27]. Having computed θ̂, each input and output example is simply re-weighted by w_α^{1/2} [26]. Therefore, the output of the importance weighted TGP (IWTGP) is given by

ŷ = argmin_y [ K_Y(y, y) − 2 k_y(y)^T u_w
               − η_w log( K_Y(y, y) − k_y(y)^T W^{1/2} (W^{1/2} K_Y W^{1/2} + λ_y I)^{-1} W^{1/2} k_y(y) ) ]        (4)

where u_w = W^{1/2} (W^{1/2} K_X W^{1/2} + λ_x I)^{-1} W^{1/2} k_x(x) and η_w = k_X(x, x) − k_x(x)^T u_w. Similar to TGP, IWTGP can also be solved using a second-order BFGS quasi-Newton optimizer with cubic polynomial line search for optimal step-size selection.

Table 1 shows the training and testing complexity of the full GPR and TGP models, where d_Y is the dimensionality of the output. Table 1 also summarizes the computational complexity of the related approximation methods, discussed in the following section, and of our method.

4 Related Work on Approximation Methods

Various approximation approaches have been presented to reduce the computational complexity in the context of GPR. As detailed in [16], approximation methods for Gaussian Processes may be categorized into three trends: matrix approximation, likelihood approximation, and localized regression. The matrix approximation trend is inspired by the observation that the kernel matrix inversion is the major part of the expensive computation; thus, the matrix is approximated by a lower-rank version, M ≪ N (e.g., the Nyström Method [25]). While this approach reduces the computational complexity from O(N^3) to O(N M^2) for training, there is no guarantee on the non-negativity of the predictive variance [18]. In the second trend, likelihood approximation is performed on testing and training examples, given M artificial examples known as inducing inputs, selected from the training set (e.g., Deterministic Training Conditional (DTC) [19], Fully Independent Conditional (FIC) [22], Partially Independent Conditional (PIC) [21]). The drawback of this trend is the dilemma of selecting the M inducing points, which might be distant from the test point, resulting in a performance decay; see Table 1 for the complexity of FIC.

A third trend, localized regression, is based on the belief that distant observations are almost unrelated. The prediction of a test point is achieved through its M nearest points. One technique to implement this notion is to decompose the training points into disjoint clusters during training, where a prediction function is learned for each of them [16]. At test time, the prediction function of the closest cluster is used to predict the corresponding output. While this method is efficient, it introduces discontinuity problems on the boundaries of the subdomains. Another way to implement local regression is through Mixture of Experts (MoE), an ensemble method that makes the final prediction by combining the outputs of local predictors called experts (see a study on MoE methods [28]). Examples include the Bayesian committee machine (BCM [23]), local probabilistic regression (LPR [24]), mixture of Tree of Gaussian Processes (GPs) [9], and Mixture of GPs [18]. While these approaches overcome the discontinuity problem by the combination mechanism, they suffer from intensive complexity at test time, which limits their applicability in large-scale settings; e.g., Tree of GPs and Mixture of GPs
involve complicated integration, approximated by computationally expensive sampling or Monte Carlo simulation.

Park et al. [16] proposed a large-scale approach for GPR by domain decomposition on up to a 2D grid on the input, where a local regression function is inferred for each subdomain such that the functions are consistent on boundaries. This approach obviously lacks a solution for high-dimensional input data because the size of the grid increases exponentially with the dimensions, which limits its applicability. More recently, [5] proposed a Recursive Projection Clustering scheme (RPC) to decompose the data into non-overlapping equal-size clusters, and built a GPR on each cluster. They showed that this local scheme gives better performance than FIC [22] and other methods. However, this partitioning scheme obviously lacks consistency on the boundaries of the partitions, and it was restricted to single-output GPR. Table 1 shows the complexity of this scheme, denoted Local-RPC, for GPR.

Beyond GPR, we found that local regression was adopted differently in structured regression models like Twin Gaussian Processes (TGP) [2] and a data-bias version of it, denoted IWTGP [26]. TGP and IWTGP outperform not only GPR in this task, but also various regression models including the Hilbert-Schmidt Independence Criterion (HSIC) [10], Kernel Target Alignment (KTA) [6], and Weighted-KNN [18]. Both TGP and IWTGP have no closed-form expression for prediction. Hence, the prediction is made by gradient descent on a function that needs to compute the inverse of both the input and output kernel matrices, an O(N^3) computation. Practically, both approaches have been applied by finding the M ≪ N Nearest Neighbors (NN) of each test point in [2] and [26]. The prediction of a test point is then O(M^3) due to the inversion of the M × M input and output kernel matrices. However, the NN scheme has three drawbacks: (1) a regression model is computed for each test point, which results in a scalability problem in prediction (i.e., matrix inversions on the NN of each test point); (2) the number of neighbors might not be large enough to create an accurate prediction model, since it is constrained by the first drawback; (3) it is inefficient compared with the other schemes used for GPR. Table 1 shows the complexity of this NN scheme.

Table 2: Contrast against the most relevant methods

                                         [16]                           FIC/PIC [22]   NN [2]   ODC
Accurate                                 No for high input dimension    No             Yes      Yes
Efficient                                No                             Yes            No       Yes
Scalable to arbitrary input dimension    No (2D)                        Yes            Yes      Yes
Consistent on boundaries                 Yes                            No             Yes      Yes
Supported kernel machines                GPR                            GPR            TGP      GPR, TGP, IWTGP and others
Easy to parallelize                      No                             No             Yes      Yes

5 ODC Framework

The problems of the existing approaches, presented above, motivated us to develop an approach that satisfies the properties listed in Table 2. The table also shows which of these properties are satisfied by the relevant methods. In order to satisfy all the properties, we present the Overlapping Domain Cover (ODC) notion. We define the ODC as a collection of overlapping subsets of the training points, denoted subdomains, such that they are as spatially coherent as possible. During training, an ODC is computed such that each subdomain overlaps with the neighboring subdomains. Then, a local prediction model (kernel machine) is created for each subdomain, and the computations that do not depend on the test data are factored out and precomputed (e.g., inversion of matrices). The nature of the ODC generation makes these kernel machines consistent in the overlapped regions, which are the boundaries, since we constrain the subdomains to be coherent. This is motivated by the notion that data lives on a manifold with local properties and consistent connections between its neighboring regions. At prediction time, the output is calculated as a reduction function of the predictions of the closest subdomain(s). Table 1 (last row) shows the complexity of our generalized ODC framework, detailed in Sections 5.1 and 5.2. In contrast to the prior work, our ODC framework is designed to cover the structured regression setting, d_Y > 1, and to be applicable to GPR, TGP, and many other models.

Fig. 2: ODC Framework

Notations. Given a set of input data X = {x_1, ..., x_N}, our prediction framework first generates a set of non-overlapping equal-size partitions, C = {C_1, ..., C_K}, such that ∪_i C_i = X and |C_i| = N/K. The ODC is then defined based on them as D = {D_1, ..., D_K}, such that |D_i| = M ∀i and D_i = C_i ∪ O_i ∀i. O_i is the set of points that overlap with the other partitions, i.e., O_i = {x_j : x_j ∈ ∪_{j≠i} C_j}, such that |O_i| = p·M and |C_i| = (1−p)·M, where 0 ≤ p ≤ 1 is the ratio of points in each overlapping subdomain D_i that belong to (overlap with) partitions other than its own C_i.

It is important to note that the ODC can be specified by two parameters, M and p, which are the number of points in each subdomain and the ratio of overlap, respectively; this is since K = N/((1−p)M). This parameterization of the ODC generation is reasonable for the following reasons. First, M defines the number of points that are used to train each local kernel machine, which controls the performance of the local prediction. Second, given M and that K = N/((1−p)M), p defines how coarse or fine the distribution of kernel machines is. It is not hard to see that as p goes to 0, the generated ODC reduces to the set of non-overlapping clusters. Similarly, as p approaches 1 − 1/M, the ODC reduces to generating a cluster at each point with maximum overlap with the other clusters, i.e., K = N, |C_i| = 1, and |O_i| = M − 1. Our main claim is twofold. First, precomputing local kernel machines (e.g., GPR, TGP, IWTGP) during training on the ODC significantly increases the speedup at prediction time. Second, given a fixed M and N, as p increases, the local prediction performance increases, theoretically supported by Lemma 5.1.

Lemma 5.1. Under the ODC notion, as the overlap p increases, the nearest model to an arbitrary test point gets closer and it is more likely that this model was trained on a big neighborhood of the test point.

Proof. We start by outlining the main idea behind the proof, which is directly connected to the fact that K = N/((1−p)M), i.e., the number of local models increases as p increases for fixed N and M. Under the assumption that the local models are spatially cohesive, p → 1 theoretically indicates that there is a local model centered at each point in the space (i.e., K = ∞). Hence, as p increases, the distribution of the kernel machines becomes finer and a test point is more likely to find its closest kernel machine trained on a big neighborhood of it, leading to more accurate prediction. Meanwhile, as p goes to 0, the distribution is coarsest and a test point is less likely to find a closest kernel machine trained on a big neighborhood.

Let us assume that each kernel machine is defined on M points that are spatially cohesive, covering the space of N points with K = N/((1−p)M) machines. Let the center of the M points in kernel machine i be μ_i and the covariance matrix of these points be Σ_i. Hence

p(x | D_i) = N(μ_i, Σ_i) = (2π)^{-d_X/2} |Σ_i|^{-1/2} exp( −(1/2)(x − μ_i)^T Σ_i^{-1} (x − μ_i) )        (5)

where N(μ_i, Σ_i) is a normal distribution with mean μ_i and covariance matrix Σ_i.

Let us assume that there are two ODCs, ODC_1 and ODC_2, defined on the same N points, the first with overlap p_1 and the second with overlap p_2, such that p_2 > p_1. Let the numbers of kernel machines in ODC_1 and ODC_2 be K_1 and K_2, respectively. Hence,

K_1 = N / ((1−p_1) M),   K_2 = N / ((1−p_2) M)        (6)

Since p_2 > p_1, 0 ≤ p_1 < 1, and 0 ≤ p_2 < 1, we have K_2 > K_1, which indicates that the number of kernel machines in ODC_2, with the higher overlap, is bigger than the number of kernel machines in ODC_1. Let us assume that there is a test point x_*, and define the probability that x_* is captured by the ODC to be proportional to the maximum probability of x_* among the domains:

p(x_*) = Σ_{i=1}^{K} p(x_*, D_i)
       = Σ_{i=1}^{K} p(x_* | D_i) δ( p(x_* | D_i) − max_{j=1}^{K} p(x_* | D_j) )
       = max_{i=1}^{K} p(x_* | D_i)
       = (2π)^{-d_X/2} max_{i=1}^{K} |Σ_i|^{-1/2} exp( −(1/2)(x_* − μ_i)^T Σ_i^{-1} (x_* − μ_i) )        (7)

where δ(0) = 1 and δ(·) = 0 otherwise. The reason behind this definition of p(x_*) is that our method selects the domain of prediction based on argmax_{i=1}^{K} p(x_* | D_i). Hence p_{ODC_1}(x_*) = max_{i=1}^{K_1} p_{ODC_1}(x_* | D_i) and p_{ODC_2}(x_*) = max_{i=1}^{K_2} p_{ODC_2}(x_* | D_i).

We start with the case where the points are uniformly distributed in the space. Under this condition, and assuming a spatially cohesive domain cover, p(x_* | D_i) ≈ N(μ_i, Σ) ∀i, where Σ_1 = Σ_2 = ... = Σ_K = Σ. Hence

p(x_* | D_i) ∝ exp( −(1/2)(x_* − μ_i)^T Σ^{-1} (x_* − μ_i) ),
ln p(x_* | D_i) ∝ −(x_* − μ_i)^T Σ^{-1} (x_* − μ_i)        (8)

Then

p(x_*) = max_{i=1}^{K} p(x_* | D_i) = (2π)^{-d_X/2} |Σ|^{-1/2} max_{i=1}^{K} exp( −(1/2)(x_* − μ_i)^T Σ^{-1} (x_* − μ_i) )
       ∝ max_{i=1}^{K} exp( −(1/2)(x_* − μ_i)^T Σ^{-1} (x_* − μ_i) ),
ln p(x_*) ∝ max_{i=1}^{K} −(x_* − μ_i)^T Σ^{-1} (x_* − μ_i)        (9)

Hence, p(x_*) gets maximized as x_* gets closer to one of the domain centers μ_i defined by the ODC. It is not hard to see that the chance of x_* being close to one of the centers covered by ODC_2 is higher than for ODC_1, especially when p_2 ≫ p_1. This is since K_1 = N/((1−p_1)M) and K_2 = N/((1−p_2)M); hence K_2 ≫ K_1 when p_2 ≫ p_1. For instance, when p_1 = 0 and p_2 = 0.9, ODC_1 generates K_1 = N/M domains, while ODC_2 generates K_2 = 10·N/M = 10 K_1, i.e., ten times more domains and centers. The fact that there are many more domains when K_2 ≫ K_1, together with the fact that these domains are spatially cohesive, leads to max_{i=1}^{K_2} −(x_* − μ_i^2)^T Σ^{-1} (x_* − μ_i^2) ≫ max_{i=1}^{K_1} −(x_* − μ_i^1)^T Σ^{-1} (x_* − μ_i^1). This statement derives from the fact that max_{i=1}^{K} −(x_* − μ_i)^T Σ^{-1} (x_* − μ_i) is maximized (1) when x_* gets very close to one of the μ_i, i = 1 : K, and (2) by a smaller variance |Σ|, which is minimized by the nature by which the ODC is created, since each domain i is created from points neighboring its center. This directly leads to the conclusion that if K_2 ≫ K_1 then max_{i=1}^{K_2} −(x_* − μ_i^2)^T Σ^{-1} (x_* − μ_i^2) ≫ max_{i=1}^{K_1} −(x_* − μ_i^1)^T Σ^{-1} (x_* − μ_i^1). Hence, p_{ODC_2}(x_*) ≫ p_{ODC_1}(x_*).

Even if the points are not uniformly distributed, it is still more likely that an ODC with higher overlap has a higher p(x_*), since x_* is, in expectation, close to one of the centers when more spatially cohesive domains are generated, which is the case with higher overlap. Our experiments also show that the ODC concept generalizes on three real datasets where the training points are not distributed uniformly.

5.1 Training

There are several overlapping clustering methods (e.g., [17] and [3]) that look relevant for our framework. However, these methods do not fit our purpose of both spatial cohesion and equal-size constraints for the local kernel machines. We also found them very slow in practice, because their complexity varies from cubic to quadratic (with a big constant factor) in the size of the training set. These problems motivated us to propose a practical method that builds overlapping local kernel machines with spatial and equal-size constraints. These constraints are critical for our purpose since the number of points in each kernel machine determines its local performance. Hence, our training phase has two steps: (1) the training data is split into K = N/((1−p)M) equal-size clusters of (1−p)M points; (2) an ODC with K overlapping subdomains is generated by augmenting each cluster with p·M points from the neighboring clusters.

5.1.1 Equal-size Clustering

There are recent algorithms that deal with size constraints in clustering. For example, [29] formulated the problem of clustering with size constraints as a linear programming problem. However, such algorithms are not computationally efficient, especially for large-scale datasets (e.g., Human3.6M). We study two efficient ways to generate equal-size clusters; see Table 1 (last row) for their ODC complexity.

Recursive Projection Clustering (RPC) [5]. In this method, the training data is partitioned to perform GPR prediction. Initially, all data points are put in one cluster. Then, two points are chosen randomly and the orthogonal projection of all the data onto the line connecting them is computed. Depending on the median value of the projections, the data is then split into two equal-size subsets. The same process is applied to each cluster, generating 2^l clusters after l repetitions. The iterations stop once 2^l > K. As indicated, the number of clusters in this method has to be a power of two, and it might produce long, thin clusters.

Equal-Size K-means (EKmeans). We propose a variant of k-means clustering [11] to generate equal-size clusters. The goal is to obtain a disjoint partitioning of X into clusters C = {C_1, ..., C_K}, similar to the k-means objective, minimizing the within-cluster sum of squared Euclidean distances, C = argmin_C J(C) = min Σ_{j=1}^{K} Σ_{x_i ∈ C_j} d(x_i, μ_j), where μ_j is the mean of cluster C_j and d(·,·) is the squared distance. Optimizing this objective is NP-hard, and k-means iterates between the assignment and update steps as a heuristic to achieve a solution; l_1 denotes the number of k-means iterations. We add equal-size constraints: ∀(1 ≤ i ≤ K), |C_i| = N/K = (1−p)M.

In order to achieve this partitioning, we propose an efficient heuristic algorithm, denoted Assign and Balance (AB) EKmeans. It mainly modifies the assignment step of k-means to bound the size of the resulting clusters. We first assign the points to their closest cluster center, as typically done in the assignment step of k-means. We use C(x_p) to denote the cluster assignment of a given point x_p. This results in three types of clusters: balanced, overfull, and underfull. Then some of the points in the overfull clusters are redistributed to the underfull clusters by assigning each of these points to its closest underfull cluster. This is achieved by initializing a pool of overfull points defined as X̃ = {x_p : x_p ∈ C_l, |C_l| > N/K}; see Figure 3.

Let us denote the set of underfull clusters by C̃ = {C_j : |C_j| < N/K}. We compute the distances d(x_i, μ_j), ∀ x_i ∈ X̃ and C_j ∈ C̃. Iteratively, we pick the minimum-distance pair (x_p, μ_l) and assign x_p to cluster C_l instead of cluster C(x_p). The point is then removed from the overfull pool. Once an underfull cluster becomes full, it is removed from the underfull pool; once an overfull cluster is balanced, the remaining points of that cluster are removed from the overfull pool. The intuition behind this algorithm is that the cost associated with the initial optimal assignment (given the computed means) is minimally increased by each swap, since we pick the minimum-distance pair in each iteration. Hence the cost is kept as low as possible while balancing the clusters. We refer to this algorithm as Assign and Balance (AB) EKmeans. Algorithm 1 illustrates the overall assignment step and Fig. 4 visualizes the balancing step.
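The following is a minimal NumPy sketch (ours, not the authors' implementation) of the assign-and-balance heuristic described above and summarized in Algorithm 1: a standard nearest-center assignment followed by moving points from overfull clusters to their closest underfull clusters, always taking the globally smallest remaining distance.

```python
import numpy as np

def ab_ekmeans_assign(X, centers):
    """One Assign-and-Balance assignment step: returns labels with |C_j| <= ceil(N/K)."""
    N, K = len(X), len(centers)
    cap = int(np.ceil(N / K))                        # ideal (maximum) cluster size N/K
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # N x K squared distances
    labels = dist.argmin(1)                          # plain k-means assignment
    sizes = np.bincount(labels, minlength=K)

    overfull_pts = [p for p in range(N) if sizes[labels[p]] > cap]
    underfull = {j for j in range(K) if sizes[j] < cap}
    # Repeatedly move the overfull point with the smallest distance to any underfull center.
    while underfull and overfull_pts:
        pairs = [(dist[p, j], p, j) for p in overfull_pts for j in underfull]
        _, p, j = min(pairs)
        sizes[labels[p]] -= 1
        if sizes[labels[p]] <= cap:                  # source cluster became balanced:
            overfull_pts = [q for q in overfull_pts  # its remaining points leave the pool
                            if labels[q] != labels[p] or q == p]
        labels[p] = j
        sizes[j] += 1
        overfull_pts.remove(p)
        if sizes[j] >= cap:                          # target cluster is now full
            underfull.discard(j)
    return labels

# Toy usage: 300 2D points into K = 5 equal-size clusters (centers from a plain k-means run).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
centers = X[rng.choice(300, 5, replace=False)]
labels = ab_ekmeans_assign(X, centers)
print(np.bincount(labels))                           # sizes are all close to 300 / 5 = 60
```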
Fig. 3: AB-EKmeans on 300,000 2D points, K = 57

5.1.2 Overlapping Domain Cover (ODC) Model

Having generated the disjoint equal-size clusters, we generate the ODC subdomains based on the overlapping ratio p, such that p·M points are selected from the neighboring clusters. Let us assume that we select points only from the r closest clusters to each cluster, where C_j is considered closer to C_i than C_k if ‖μ_i − μ_j‖ < ‖μ_i − μ_k‖. It is important to note that r must be greater than p/(1−p) in order to supply the required p·M points; this is since the number of points in each cluster is (1−p)M. Hence, the minimum value of r is ⌈(p·M)/((1−p)·M)⌉ = ⌈p/(1−p)⌉ clusters, and we parametrize r as r = ⌈t·p/(1−p)⌉, t ≥ 1. We study the effect of t in the experimental results section. Having computed r from p and t, each subdomain D_i is then created by merging the points in cluster C_i with p·M points retrieved from the r neighboring clusters. Specifically, the points are selected by sorting the points in each of the r clusters by their distance to μ_i. The number of points retrieved from each of the r neighboring clusters is inversely proportional to the distance of its center to μ_i. If a subset of the r clusters is requested to retrieve more than its capacity (i.e., (1−p)M points), the set of extra points is requested from the remaining clusters, giving priority to the closer clusters (i.e., starting from the nearest neighboring cluster to the cluster on which the subdomain is created). As t = 1 and p increases, all points that belong to the r clusters tend to be merged with C_i. In our framework, we used FLANN [15] for fast NN-retrieval; see the pseudo-code of the ODC generation in Appendix C.
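Since Appendix C is not reproduced here, the following is a rough NumPy sketch (ours) of the subdomain-generation step just described: it computes r from (p, t), then augments each cluster with p·M points drawn from its r nearest neighboring clusters, sorted by distance to the cluster center. It omits the inverse-distance-proportional allocation across neighbors and the FLANN-based retrieval used in the paper.

```python
import numpy as np

def build_odc(X, labels, centers, M, p, t=1.0):
    """Augment each equal-size cluster with ~p*M points from its r nearest clusters."""
    K = len(centers)                                   # K = N / ((1 - p) * M)
    n_overlap = int(round(p * M))                      # points borrowed per subdomain
    r = max(int(np.ceil(t * p / (1.0 - p))), 1)        # number of neighboring clusters used
    members = [np.flatnonzero(labels == j) for j in range(K)]
    center_d = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)

    subdomains = []
    for i in range(K):
        neighbors = np.argsort(center_d[i])[1:r + 1]   # r closest clusters (excluding i)
        pool = np.concatenate([members[j] for j in neighbors])
        # Simplification: take the n_overlap pool points closest to this cluster's center.
        d_to_center = ((X[pool] - centers[i]) ** 2).sum(-1)
        overlap = pool[np.argsort(d_to_center)[:n_overlap]]
        subdomains.append(np.concatenate([members[i], overlap]))   # |D_i| ~= M
    return subdomains

# Toy usage with the AB-EKmeans labels from the previous sketch (M = 100, p = 0.4):
# subdomains = build_odc(X, labels, centers, M=100, p=0.4, t=1.0)
```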
Fig. 4: AB-EKmeans: Balancing Step

Input: X (N × d_X), {μ_i}_{i=1}^{K}
Output: labels
1- Assign the points initially to their closest centers; this puts the clusters into 3 groups: (1) balanced clusters, (2) overfull clusters, (3) underfull clusters.
2- Create a matrix D ∈ R^{N×K}, where D[i, j] is the distance between the i-th point and the j-th cluster center; rows are restricted to points that belong to overfull clusters; columns are restricted to underfull cluster centers.
3- Get the coordinates (i*, j*) of the smallest distance in D.
4- Remove the i*-th row from matrix D and mark the point as assigned to the j*-th cluster.
5- If the size of cluster j* reaches the ideal size (i.e., N/K), remove the j*-th column from matrix D.
6- Go to step 3 if there are still unassigned points.
Algorithm 1: Assign and Balance (AB) k-means: Assignment Step

After the ODC is generated, we compute the sample normal distribution using the points that belong to each subdomain. Then, a local kernel machine is trained for each of the overlapping subdomains. We denote the point-set normal distribution of subdomain i as p(x | D_i) = N(μ'_i ∈ R^{d_X}, Σ'_i ∈ R^{d_X × d_X}); Σ'^{-1}_i is precomputed during training for later use during prediction. Finally, we factor out all the computations that do not depend on the test point (for GPR, TGP, IWTGP) and store them with each subdomain as its local kernel machine. We denote the training model for subdomain i as M^i, which is computed as follows for GPR and TGP, respectively.

GPR. Firstly, we precompute (K^i_j + σ_{n_j}^2 I)^{-1}, where K^i_j is an M × M kernel matrix defined on the input points in D_i. Each dimension j of the output can have its own hyper-parameters, which results in a different kernel matrix K^i_j for each dimension. We also precompute (K^i_j + σ_{n_j}^2 I)^{-1} y_j for each dimension. Hence M^i_GPR = {(K^i_j + σ_{n_j}^2 I)^{-1}, (K^i_j + σ_{n_j}^2 I)^{-1} y_j, j = 1 : d_Y}.
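As an illustration (ours, with a single shared noise variance and bandwidth as simplifying assumptions), the per-subdomain GPR precomputation and the resulting cheap predictive-mean evaluation could look like the following; the subdomain mean and inverse covariance are stored here as well, since they are needed for selecting the closest subdomain at prediction time.

```python
import numpy as np

def rbf(A, B, rho):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * rho ** 2))

def train_local_gpr(X, Y, subdomains, rho, sigma_n):
    """Precompute, for each subdomain D_i, the quantities that do not depend on the test point."""
    models = []
    for idx in subdomains:                       # idx: indices of the ~M points of D_i
        Xi, Yi = X[idx], Y[idx]                  # Y is N x d_Y (one GPR per output dimension)
        A_inv = np.linalg.inv(rbf(Xi, Xi, rho) + sigma_n ** 2 * np.eye(len(idx)))
        models.append({
            "X": Xi,
            "A_inv": A_inv,                      # (K^i + sigma_n^2 I)^{-1}, reused for the variance
            "alpha": A_inv @ Yi,                 # (K^i + sigma_n^2 I)^{-1} Y_i, reused for the mean
            "mu": Xi.mean(0),                    # subdomain distribution, used to pick the
            "Sigma_inv": np.linalg.pinv(np.cov(Xi, rowvar=False)),  # closest subdomain at test time
        })
    return models

def local_gpr_mean(model, x_star, rho):
    # O(M * d_X) kernel vector plus O(M * d_Y) products per test point (Table 1, ODC row).
    k_star = rbf(x_star[None, :], model["X"], rho).ravel()
    return k_star @ model["alpha"]
```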
TGP. The local kernel machine for each subdomain in the TGP case is defined as M^i_TGP = {(K^i_X + λ^i_X I)^{-1}, (K^i_Y + λ^i_Y I)^{-1}}, where K^i_X and K^i_Y are M × M kernel matrices
defined on the input points and the corresponding output points, respectively, which belong to domain i.

IWTGP. It is not obvious how to factor out the computations that do not depend on the test data in the case of IWTGP, since the computationally expensive factors (i.e., (W^{i 1/2} K^i_X W^{i 1/2} + λ_x I)^{-1} and (W^{i 1/2} K^i_Y W^{i 1/2} + λ_y I)^{-1}) depend on the test set, because W^i (a diagonal matrix) is computed at test time. To help factor out the computation, we used linear algebra to show that

(D A D + λ I)^{-1} = D^{-1} A^{-1} D^{-1} − (λ D^{-2} A^{-2} D^{-2}) / (1 + λ · tr(D^{-1} A^{-1} D^{-1}))        (10)

where D is a diagonal matrix, I is the identity matrix, and tr(B) is the trace of matrix B.

Proof. Kenneth Miller [14] proposed the following lemma on matrix inverses:

(G + H)^{-1} = G^{-1} − (1 / (1 + tr(G^{-1} H))) G^{-1} H G^{-1}        (11)

Applying Miller's lemma with G = D A D and H = λ I leads directly to Eq. 10.

Mapping D to W^{i 1/2} and A to either K^i_X or K^i_Y, we precompute M^i = {(K^i_X)^{-1}, (K^i_Y)^{-1}}. Having computed W^i at test time, (W^{i 1/2} K^i_X W^{i 1/2} + λ_x I)^{-1} and (W^{i 1/2} K^i_Y W^{i 1/2} + λ_y I)^{-1} can then be computed in quadratic time given M^i, following Eq. 10, since the inverse and the powers of W^{i 1/2} have linear computational complexity because it is diagonal.

5.2 Prediction

ODC prediction is performed in three steps.

(1) Finding the closest subdomains. The closest K′ ≪ K subdomains are determined based on the covariance norm of the displacement of the test input from the means of the subdomain distributions (i.e., ‖x − μ'_i‖_{Σ'^{-1}_i}, i = 1 : K, where ‖x − μ'_i‖_{Σ'^{-1}_i} = (x − μ'_i)^T Σ'^{-1}_i (x − μ'_i)). The reason behind using the covariance norm is that it captures the details of the density of the distribution in all dimensions. Hence, it better models p(x | D_i), indicating a better prediction of x on D_i.

(2) Closest subdomains prediction. Having determined the closest subdomains, predictions are made for each of the closest clusters. We denote these predictions by {Y^i_{x_*}}_{i=1}^{K′}. Each of these predictions is computed according to the selected kernel machine. For GPR, the predictive mean and variance are O(M·d_X) and O(M^2·d_Y), respectively. For TGP, the prediction is O(l_2·M^2·d_Y); see Eq. 2.

(3) Subdomain weighting and final prediction. The final prediction is formulated as Y(x_*) = Σ_{i=1}^{K′} a_i Y^i_{x_*}, with a_i > 0 and Σ_{i=1}^{K′} a_i = 1. The weights {a_i}_{i=1}^{K′} are computed as follows. Let {D^i_{x_*} = ‖x − μ'_i‖_{Σ'^{-1}_i}}_{i=1}^{K′} denote the distances to the closest subdomains; then, with {L^i_{x_*} = 1 / D^i_{x_*}}_{i=1}^{K′}, a_i = L^i_{x_*} / Σ_{i=1}^{K′} L^i_{x_*}.

It is not hard to see that when K′ = 1, the prediction step reduces to regression using the closest subdomain to the test point. While it is common in most of the prior work to make the prediction using only the closest model, we generalized it to the K′ closest kernel machines and combine their predictions, so as to study how the consistency of the combined prediction behaves as the overlap p increases; see the experiments.
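Putting the three steps together, a minimal sketch (ours) could look like the following; it reuses the per-subdomain models produced by a training routine like the one above and assumes GPR-style local predictors passed in through local_predict.

```python
import numpy as np

def odc_predict(models, x_star, rho, k_prime=1, local_predict=None):
    """Three-step ODC prediction: rank subdomains by the covariance norm, predict with the
    K' closest local machines, and combine with inverse-distance weights."""
    # Step 1: covariance (Mahalanobis) norm to every subdomain distribution.
    d = np.array([(x_star - m["mu"]) @ m["Sigma_inv"] @ (x_star - m["mu"]) for m in models])
    closest = np.argsort(d)[:k_prime]

    # Step 2: local predictions of the K' closest kernel machines.
    preds = np.stack([local_predict(models[i], x_star, rho) for i in closest])

    # Step 3: inverse-distance weighting, a_i = L_i / sum_j L_j with L_i = 1 / d_i.
    L = 1.0 / np.maximum(d[closest], 1e-12)
    a = L / L.sum()
    return a @ preds

# Usage with the local GPR machines from the previous sketch:
# y_hat = odc_predict(models, x_star, rho=1.0, k_prime=2, local_predict=local_gpr_mean)
```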
6 Experimental Results

Equal-Size Kmeans Step Experiment. We also tried another variant of EKmeans that we call Iterative Minimum-Distance Assignments EKmeans (IMDA-EKmeans). Note that the algorithm presented earlier in the paper is denoted Assign and Balance EKmeans (AB-EKmeans). The IMDA-EKmeans algorithm works as follows. We initialize a pool of unassigned points X̃ = X and initialize all clusters as empty. Given the means computed from the previous update step, we compute the distances d(x_i, μ_j) for all point/center pairs. We iteratively pick the minimum-distance pair (x_p, μ_l): d(x_p, μ_l) ≤ d(x_i, μ_j) ∀ x_i ∈ X̃ and |C_l| < N/K, and assign point x_p to cluster l. The point is then removed from the pool of unassigned points. If |C_l| = N/K, the cluster is marked as balanced and no longer considered. The process is repeated until the pool is empty; see Algorithm 2.

Input: X (N × d_X), {μ_i}_{i=1}^{K}
Output: labels
1- Create a matrix D ∈ R^{N×K}, where D[i, j] is the distance between the i-th point and the j-th cluster center.
2- Get the coordinates (i*, j*) of the smallest distance in D.
3- Remove the i*-th row from matrix D and mark the point as assigned to the j*-th cluster.
4- If the size of cluster j* reaches the ideal size (i.e., N/K), remove the j*-th column from matrix D.
5- Go to step 2 if there are still unassigned points.
Algorithm 2: Iterative Minimum-Distance Assignments (IMDA) k-means: Assignment Step

Table 3 presents the average cost over 10 runs of the IMDA-EKmeans and AB-EKmeans algorithms. We initialize both the AB-EKmeans and IMDA-EKmeans algorithms with the cluster centers computed by running standard k-means.
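The quantity compared in Table 3 is just the within-cluster sum of squared distances given the final assignment; a minimal sketch (ours):

```python
import numpy as np

def kmeans_cost(X, labels, centers):
    # J(C) = sum over all points of the squared Euclidean distance to the assigned center.
    return float(((X - centers[labels]) ** 2).sum())
```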
As illustrated in Table 3, AB-EKmeans outperforms IMDA-EKmeans in these experiments, which motivated us to use AB-EKmeans, presented earlier in the paper, rather than IMDA-EKmeans under our ODC prediction framework. Our interpretation of these results is that AB-EKmeans initializes the assignment with an assignment that minimizes the cost J(C) = min Σ_{j=1}^{K} Σ_{x_i ∈ C_j} d(x_i, μ_j) given the cluster centers, and then balances the clusters. In all the following experiments, we use AB-EKmeans due to its clearly superior performance compared to IMDA-EKmeans.

Table 3: J(C) of AB-kmeans and IMDA-kmeans on a dataset of 10,000 random 2D points, averaged over 10 runs

                   K=5       K=10      K=50
AB-kmeans          1077.3    540.241   105.505
IMDA-kmeans        1290.6    657.446   122.006
Error Reduction    16.53%    17.83%    13.52%

Fig. 5: Datasets, Representations, and Features

Datasets and Setup. We evaluated our framework on three human pose estimation datasets: Poser, HumanEva, and Human3.6M; see Fig. 5 for a summary of the setup and representation for each. The Poser dataset [1] consists of 1927 training and 418 test images. The image features correspond to a bag-of-words representation with silhouette-based shape-context features. The error is measured by the root mean-square error (in degrees), averaged over all joint angles, and is given by Error(ŷ, y*) = (1/54) Σ_{m=1}^{54} ‖(ŷ_m − y*_m) mod 360°‖, where ŷ ∈ R^54 is an estimated pose vector and y* ∈ R^54 is the true pose vector. The HumanEva dataset [20] contains synchronized multi-view video and MoCap data of 3 subjects performing multiple activities. We use the HOG features [7] (∈ R^270) proposed in [2]. We use the training and validation subsets of HumanEva-I and only utilize data from the 3 color cameras, with a total of 9630 image-pose frames for each camera. This is consistent with the experiments in [2] and [26]. We use half of the data for training and half for testing. Human3.6M [4] is a dataset of millions of human poses. We managed to evaluate our proposed ODC framework on six subjects (S1, S2, S6, S7, S8, S9) from it, which is ≈ 0.5 million poses. We split them into 67% training and 33% testing. HOG features are extracted from 4 image views for each pose and concatenated into a 3060-dimensional vector. The error for each pose, in both HumanEva (in mm) and Human3.6M (in cm), is measured as Error(ŷ, y*) = (1/L) Σ_{m=1}^{L} ‖ŷ_m − y*_m‖.

There are four control parameters in our ODC framework: M, p, t, and K′. Figure 6 shows our parameter analysis with different values of p, t, and K′ on the HumanEva dataset for GPR and TGP as local regression machines, where M = 800. Each sub-figure consists of six plots in two rows. The first row shows the results using the AB-EKmeans clustering scheme, while the second row shows the results for the RPC clustering scheme. Each row has three plots, one for K′ = 1, 2, and 3, respectively. Each plot shows the error for different values of t against p from 0 to 0.95; i.e., it shows how the overlap affects the performance for different values of t. Each plot shows, in its top caption, the minimum-overlap and maximum-overlap regression errors for t → 1. Looking at these plots, there are a number of observations:

(1) As t → 1 (the solid red line), the error tends to decrease as p, i.e., the overlap, increases.

(2) Comparing different K′, the behavior of the error indicates that combining multiple predictions (i.e., K′ = 2 and K′ = 3) gives poor performance, compared with K′ = 1, when the overlap is small. However, all of them, K′ = 1, 2, and 3, perform well as p → 1; see columns 2 and 3 in Fig. 6 and Fig. 8. This indicates consistent predictions of neighboring subdomains as p increases; see also Fig. 7 for a side-by-side comparison of different K′. The main reason behind this behavior is that, as p increases, the local models of the neighboring subdomains share more training points on their boundaries, which acts as shared constraints during the training of these models, making them more consistent in prediction.

(3) Comparing the first row to the second row in each sub-figure, it is not hard to see that our AB-EKmeans partitioning scheme consistently outperforms RPC [5]; e.g., the error in the case of GPR (M=800) is 47.48 mm for AB-EKmeans and 50.66 mm for RPC, and for TGP (M=800) it is 38.8 mm for AB-EKmeans and 39.8 mm for RPC. This gap is even bigger when using a smaller M; e.g., the error in the case of TGP (M=400) is 39.5 mm for EKmeans and 47.5 mm for RPC; see a detailed plot for M=400 in Fig. 9.

(4) TGP gives better predictions than GPR (i.e., 38 mm using TGP compared with 47 mm using GPR).

(5) As M increases, the prediction error decreases. For instance, when M = 200, the best-performance error for TGP increased to 43.88 mm, compared with 38.9 mm for M = 800.
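For reference, the two error measures used in the comparisons above and in Table 4 below are straightforward to compute; a minimal sketch (ours) follows, where the angular variant wraps each joint-angle difference into a proper angular distance, in the spirit of the Poser error definition.

```python
import numpy as np

def mean_joint_error(y_hat, y_true):
    # HumanEva / Human3.6M style error: mean Euclidean distance over the L joints,
    # with poses given as (L, 3) arrays of joint positions (mm or cm) -- an assumed layout.
    return np.linalg.norm(y_hat - y_true, axis=1).mean()

def mean_angular_error(y_hat, y_true):
    # Poser style error: mean absolute joint-angle difference (degrees), wrapped modulo 360.
    diff = np.abs(y_hat - y_true) % 360.0
    return np.minimum(diff, 360.0 - diff).mean()
```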
[Figure 6 plots omitted; only the panel titles and caption are recoverable. Each panel plots Error against p for t = 1.0, 1.5625, 2, 3, 4.]
(a) GPR-ODC (M=800). AB EKmeans panels: K'=1, Err=[51.60, 48.52]; K'=2, Err=[130.23, 47.66]; K'=3, Err=[173.13, 47.49]. RPC panels: K'=1, Err=[54.39, 50.93]; K'=2, Err=[67.75, 50.54]; K'=3, Err=[128.13, 50.67].
(b) TGP-ODC (M=800). AB EKmeans panels: K'=1, Err=[41.12, 38.79]; K'=2, Err=[123.75, 38.79]; K'=3, Err=[166.79, 38.85]. RPC panels: K'=1, Err=[41.62, 39.49]; K'=2, Err=[58.60, 39.44]; K'=3, Err=[122.05, 39.80].
Fig. 6: ODC framework parameter analysis of GPR and TGP on the HumanEva dataset
Table 4: Error & time on the Poser and HumanEva datasets (Intel Core i7, 2.6 GHz), M = 800

                                            Poser                                           HumanEva
                                            Error (deg)  Training time     Prediction time  Error (mm)  Training time      Prediction time
TGP   NN [2]                                5.43         -                 188.99 sec       38.1        -                  6364 sec
      ODC (p=0.9, t=1, K'=1)-EKmeans        5.40         (3.7+25.1) sec    16.5 sec         38.9        (2001+45.4) sec    298 sec
      ODC (p=0,   t=1, K'=1)-EKmeans        7.60         (3.9+1.33) sec    14.8 sec         41.87       (240+4.9) sec      257 sec
      ODC (p=0.9, t=1, K'=1)-RPC            5.60         (0.23+41.6) sec   15.8 sec         39.9        (0.45+49.1) sec    277 sec
      ODC (p=0,   t=1, K'=1)-RPC            7.70         (0.15+1.7) sec    13.89 sec        42.32       (0.19+5.2) sec     242 sec
GPR   NN                                    6.77         -                 24 sec           54.8        -                  618 sec
      ODC (p=0.9, t=1, K'=1)-EKmeans        6.27         (3.7+11.1) sec    0.56 sec         49.3        (2001+42.85) sec   79 sec
      ODC (p=0.0, t=1, K'=1)-EKmeans        7.54         (3.9+1.38) sec    0.35 sec         49.6        (240+6.4) sec      48 sec
      ODC (p=0.9, t=1, K'=1)-RPC            6.45         (0.23+17.3) sec   0.52 sec         52.8        (0.49+46.06) sec   64 sec
      ODC (p=0.0, t=1, K'=1)-RPC = [5]      7.46         (0.15+1.5) sec    0.27 sec         54.6        (0.26+4.6) sec     44 sec
      FIC [22]                              7.63         (-+20.63) sec     0.3106 sec       68.36       -                  102 sec
We found these observations to also be consistent on the Poser dataset.

This analysis led us to recommend choosing t close to 1 and a big overlap (p close to 1); K′ = 1 is sufficient for accurate prediction.

Having completed the performance analysis, which comprehensively interprets our parameters, we used the recommended settings to compare the performance with other methods and show the benefits of this framework. Figure 10 shows the speedup gained by retrieving the precomputed matrix inverses at test time, compared with computing them at test time as in the NN scheme. The figure shows a significant speedup from precomputing the local kernel machines.

Table 4 shows the error, training time, and prediction time of NN, FIC, and different variations of ODC on the Poser and HumanEva datasets. Training time is formatted as (t_c + t_p), where t_c is the clustering time and t_p is the remaining training time excluding clustering. As indicated in the top part of Table 4, TGP under our ODC framework can significantly speed up the prediction compared with the NN scheme