APPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, DEC. 2016
Compositional Model Based Fisher Vector
Coding for Image Classification
Lingqiao Liu, Peng Wang, Chunhua Shen, Lei Wang, Anton van den Hengel, Chao Wang, Heng Tao Shen
Abstract—Deriving from the gradient vector of a generative model of local features, Fisher vector coding (FVC) has been identified as an effective coding method for image classification. Most, if not all, FVC implementations employ the Gaussian mixture model (GMM) as the generative model for local features. However, the representative power of a GMM can be limited because it essentially assumes that local features can be characterized by a fixed number of feature prototypes, and the number of prototypes is usually small in FVC. To alleviate this limitation, in this work we break the convention which assumes that a local feature is drawn from one of a few Gaussian distributions. Instead, we adopt a compositional mechanism which assumes that a local feature is drawn from a Gaussian distribution whose mean vector is composed as a linear combination of multiple key components, and the combination weight is a latent random variable. In doing so we greatly enhance the representative power of the generative model underlying FVC. To implement our idea, we design two particular generative models following this compositional approach. In our first model, the mean vector is sampled from the subspace spanned by a set of bases and the combination weight is drawn from a Laplace distribution. In our second model, we further assume that a local feature is composed of a discriminative part and a residual part. As a result, a local feature is generated by the linear combination of discriminative part bases and residual part bases. The decomposition of the discriminative and residual parts is achieved via the guidance of a pre-trained supervised coding method. By calculating the gradient vector of the proposed models, we derive two new Fisher vector coding strategies. The first is termed Sparse Coding-based Fisher Vector Coding (SCFVC) and can be used as a substitute for traditional GMM-based FVC. The second is termed Hybrid Sparse Coding-based Fisher Vector Coding (HSCFVC) since it combines the merits of both pre-trained supervised coding methods and FVC. Using pre-trained Convolutional Neural Network (CNN) activations as local features, we experimentally demonstrate that the proposed methods are superior to traditional GMM-based FVC and achieve state-of-the-art performance in various image classification tasks.

Index Terms—Fisher Vector Coding, Sparse Coding, Hybrid Sparse Coding, Convolutional Networks, Generic Image Classification.
• L. Liu, P. Wang, C. Shen and A. van den Hengel are with the School of Computer Science, University of Adelaide, SA, Australia.
  E-mail: {lingqiao.liu, chunhua.shen, anton.vandenhengel}@adelaide.edu.au
• L. Wang and C. Wang are with the School of Computing and Information Technology, University of Wollongong, NSW, Australia.
  E-mail: {leiw, chaow}@uow.edu.au
• H. T. Shen is with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China.
• The first two authors contributed equally to this work. Correspondence should be addressed to C. Shen.
CONTENTS

1 Introduction
2 Related Work
  2.1 Fisher vector coding
  2.2 FVC with CNN local features
  2.3 Supervised coding and FVC
3 Background
  3.1 Fisher vector coding
  3.2 Gaussian mixture model-based FVC
4 Our approaches
  4.1 Compositional generative model
    4.1.1 Approach I
    4.1.2 Approach II
  4.2 Fisher vector derivation
    4.2.1 Fisher vector derivation for approach I (SCFVC)
    4.2.2 Fisher vector derivation for approach II (HSCFVC)
  4.3 Inference and learning
  4.4 Implementation details
    4.4.1 Local features
    4.4.2 Pooling and normalization
    4.4.3 Supervised coding
5 Experiment
  5.1 Experimental setting
  5.2 Main results
  5.3 Analysis of SCFVC
    5.3.1 GMM FVC vs. SCFVC: the impact of local feature dimensions
    5.3.2 GMM FVC vs. SCFVC: codebook size and feature dimensionality trade-off
  5.4 Analysis of HSCFVC
    5.4.1 The classification accuracy vs. the value of λ
    5.4.2 The impact of the residual part Fisher vector G^X_{B_c} on classification performance
6 Conclusion
References
7 Appendix: Matching pursuit based optimization for Eq. (16)
1 INTRODUCTION

In the bag-of-features model, Fisher vector coding (FVC) [1], [2] is a coding method derived from the Fisher kernel [3], which was originally proposed to compare two samples induced by a generative model. The basic idea of FVC is to first construct a generative model of local features and use the gradient of the log-likelihood of a particular feature with respect to the model parameters as the feature's coding vector. When applied as an image representation, the FVC vectors of local features are aggregated by a pooling operation and normalization [2] to generate the final image representation. FVC has been established as one of the most powerful local feature encoding and image representation generation methods. In most visual classification systems with FVC, a Gaussian mixture model (GMM) is adopted as the generative model for the local features. The GMM essentially assumes that each local feature is generated from one of the Gaussian distributions in the mixture, and intuitively the mean of each Gaussian distribution serves as a prototype for the local features. Since the dimensionality of the image representation resulting from GMM-based FVC is the product of the local feature dimensionality and the number of Gaussians, to keep the image representation dimensionality tractable the number of Gaussians is usually chosen to be a few hundred.

With the recent development in feature learning [4], higher-dimensional local features such as the activations of a pre-trained deep neural network [5], [6], [7], [8] have become increasingly popular. However, modeling these local features with the GMM for FVC is challenging. This is due to two factors: (1) The dimensionality of these local features can be much higher than that of traditional local features, e.g., SIFT. As a result, the feature space spanned by these local features can be very large, and a limited number of Gaussian distributions can be insufficient to accurately model the true feature distribution. (2) The number of Gaussian distributions cannot be made large because, combined with the high local feature dimensionality, this would lead to a prohibitively large image-level representation.

To tackle the challenge of using high-dimensional local features in FVC, we propose two alternative solutions for building the generative model. Both solutions rely on the idea of compositional modeling, which assumes that a local feature is better modeled as the composition of multiple components than by a single prototype. For many recently proposed local features, such as CNN activations on local image regions, the image area that a local feature covers is relatively large. In this case, compositional modeling is a more natural choice than single-prototype modeling because the visual pattern within the local region is clearly a combination of multiple object/scene parts. Mathematically, we formulate this idea as a two-stage generative process: in the first stage, the combination coefficients of multiple bases are drawn from a distribution and a linear combination of the bases is generated; in the second stage, a local feature is drawn from a Gaussian distribution whose mean vector is the combined vector generated in the first stage. The compositional components in the proposed methods are treated as model parameters which are learned subsequently.

The difference between the two proposed approaches lies in the way a local feature is decomposed. The first approach adopts a single basis matrix and assumes that each combination coefficient is drawn from a Laplace distribution. The second approach takes the further step of assuming that a local feature may be decomposed into a discriminative part and a residual part. The discriminative part represents those patterns which are found to be discriminative, and the residual part depicts the patterns which are not well captured by the identified discriminative part. To achieve such a decomposition, we rely on a pre-trained supervised coding method and use its coding vector as our guide. The motivation for using decomposition-based modeling is twofold: (1) The decomposition enables part of the generative model to focus on the discriminative part and thus better capture class-specific information. (2) On the other hand, the discriminative part identified by the pre-trained supervised coding method may not capture all the useful patterns in the local features, due to the imperfection of supervised encoder training.¹ In this case, the part of the generative model which models the residual provides a second chance to distill the missing information and thus compensates for the discriminative part modeling. Due to the complementary nature of the discriminative and residual parts, as well as the high dimensionality of Fisher vectors, it is expected that the Fisher vector derived from our second model preserves more useful information than both our first FVC and the supervised coding method that guides the decomposition.

¹ This may be due to poor local minima caused by training on a nonconvex objective function, or to overfitting caused by the difficulty of regularizing a deeply trained supervised encoder.

We also show that, under certain approximations, the inference and learning problems of both methods can be converted into variants of the sparse coding problem, which can be readily solved with an off-the-shelf sparse coding solver. For this reason, we name the FVC derived from the first and the second models Sparse Coding-based Fisher Vector Coding (SCFVC) and Hybrid Sparse Coding-based Fisher Vector Coding (HSCFVC), respectively. To accelerate the calculation, we also develop efficient approximate solutions based on the matching pursuit algorithm [9]. Through extensive experimental evaluation on object classification, scene classification, and fine-grained image classification problems, we demonstrate that the proposed methods are superior to traditional GMM-based FVC. HSCFVC further demonstrates state-of-the-art classification performance on the evaluated datasets.

A preliminary version of the first proposed method was published in [7]. In this paper we extend this approach significantly; in particular, we develop HSCFVC, which generalizes the framework of SCFVC and leads to further improved classification performance. We release the code of this paper at https://bitbucket.org/chhshen/scfvc.

2 RELATED WORK

2.1 Fisher vector coding

The concept of Fisher vectors was originally proposed in [3] as a framework to build a discriminative classifier
from a generative model. It was later applied to image classification [1] by modeling the image as a bag of local features sampled from an i.i.d. distribution. Later, several variants were proposed to improve the basic FVC. One of the first identified facts is that normalisation of Fisher vectors is essential to achieving good performance [2]. At the same time, several similar variants were developed independently from different perspectives [10], [11], [12]. The improved Fisher vector and its variants showed state-of-the-art performance in image classification and quickly became one of the most popular visual representation methods in computer vision. Numerous approaches have been developed to further enhance performance. For example, the work in [13] closely analysed particular implementation details of VLAD, a well-known variant of FVC. The work in [14] incorporated spatial information from local features into the Fisher vector framework. In [15], [16], the authors revisited the basic i.i.d. assumption of FVC and pointed out its limitation. They proposed a non-i.i.d. model and derived an approximated Fisher vector for image classification. FVC has also been widely applied to various applications and has demonstrated state-of-the-art performance in the related fields. For example, in combination with local trajectory features, FVC-based systems have achieved the state-of-the-art in video-based action recognition [17], [18].

2.2 FVC with CNN local features

Conventionally, most FVC implementations are applied to low-dimensional hand-crafted local features, such as SIFT [19]. With the recent development of deep learning, it has been observed that simply extracting neural activations from a pre-trained CNN model achieves significantly better performance [4]. However, it was soon discovered that directly using activations from a pre-trained CNN as global features is still not the optimal choice [5], [6], [7], [8], at least for small/medium-sized classification problems for which fine-tuning a CNN does not always improve performance significantly. Instead, it has been shown that it is beneficial to treat CNN activations as local features. In this case, the traditional local feature coding approaches, such as FVC, can be readily applied. The work in [5] points out that the fully-connected activation of a pre-trained CNN is not translation invariant. Thus, the authors propose to extract CNN activations from multiple regions of an image and use VLAD to encode these local features. In [6] and [20], the values of convolutional layer activations are analysed. They suggest that convolutional feature activations can be seen as a set of local features extracted on a dense grid. In particular, the work in [6] builds a texture classification system by applying FVC to the convolutional layer local features.

2.3 Supervised coding and FVC

The proposed HSCFVC combines the ideas of supervised coding and FVC. Here we briefly review the work on supervised coding and the attempts to combine it with FVC. Using supervised information to create an image representation is a popular idea in image classification. For example, supervised information has been utilized to learn discriminative codebooks for encoding local features [21], [22], [23], [24], [25], either via a separate codebook learning step [21], [23] or in an end-to-end fashion [22], [24]. Supervised information has also been applied to discover a set of middle-level discriminative patches [26], [27], [28] and to train patch detectors which are essentially local feature encoders. The CNN can also be seen as a special case of supervised coding methods if we view the responses of the filter bank in a convolutional layer as the coding vector of the convolutional activations of the previous layer. From this perspective, the deep CNN can be seen as a hierarchical extension of the supervised coding method.

Generally speaking, the aforementioned supervised coding and FVC represent two major methodologies for creating discriminative image representations. For supervised coding, the supervised information is passed through the early stage of a classification system, i.e. by learning a dictionary or coding function. For FVC, the information content of the local features is largely preserved in the corresponding high-dimensional signature; a simple classifier can then be used to extract the discriminative patterns for classification. There have been several works trying to combine the ideas of FVC and supervised coding. The work in [29] learns the model parameters of FVC in an end-to-end supervised training framework. In [30], multiple layers of Fisher vector coding modules are stacked into a deep architecture to form a deep network. In contrast to these works, our HSCFVC is based on the basic conceptual framework of FVC: first building a generative model and then deriving its gradient vector.

3 BACKGROUND

3.1 Fisher vector coding

Given two samples generated from a generative model, their similarity can be evaluated by the Fisher kernel [3]. The samples can take any form, including a vector or a vector set, as long as their generation process can be modeled. For the Fisher vector-based image classification approach, the sample is a set of local features extracted from an image, which we denote as X = {x_1, x_2, ..., x_T}. Assuming that x_i is drawn i.i.d. from the distribution P(x|λ), in the Fisher kernel a sample X can be described by the gradient vector of the log-likelihood function w.r.t. the model parameter λ:

  G^X_\lambda = \nabla_\lambda \log P(X|\lambda) = \sum_i \nabla_\lambda \log P(x_i|\lambda).   (1)

The Fisher kernel is then defined as K(X, Y) = (G^X_\lambda)^T F^{-1} G^Y_\lambda, where F is called the information matrix and is defined as F = E[G^X_\lambda (G^X_\lambda)^T]. In this paper, we follow [3] and omit it for computational simplicity. Alternatively, it can be approximated by whitening the dimensions of the gradient vector G_\lambda, as suggested in [2]. As a result, two samples can be directly compared by the linear kernel of their corresponding gradient vectors, which are often called Fisher vectors. From a bag-of-features model perspective, the evaluation of the Fisher kernel for two images can be seen as first calculating the gradient or Fisher vector of each local feature and then performing sum-pooling. In this sense, the Fisher vector of each local feature, \nabla_\lambda \log P(x_i|\lambda), can be seen as a coding vector, and we call it Fisher vector coding in this paper.
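To make Eq. (1) concrete, the following NumPy sketch computes a Fisher-vector-style encoding for the simplest possible generative model, a single Gaussian N(μ, σ²I), for which the per-feature gradient is (x − μ)/σ²; the gradients are then sum-pooled over the feature set X. The single-Gaussian choice and all variable names here are illustrative assumptions, not the model used in this paper.

```python
import numpy as np

def fisher_vector_single_gaussian(X, mu, sigma2=1.0):
    """Sum of per-feature gradients of log N(x; mu, sigma2*I) w.r.t. mu (cf. Eq. (1))."""
    grads = (X - mu) / sigma2      # shape (T, d): one gradient per local feature
    return grads.sum(axis=0)       # sum-pooling over the T local features

# toy usage
X = np.random.randn(100, 64)       # 100 local features of dimension 64
mu = X.mean(axis=0)                # a crude fit of the generative model
G = fisher_vector_single_gaussian(X, mu)
print(G.shape)                     # (64,)
```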
3.2 Gaussian mixture model-based FVC

To implement the Fisher vector coding framework introduced above, one needs to specify the distribution P(x|λ). In the literature, most works use a GMM to model the generation process of x, which can be described as follows:

• Draw a Gaussian model N(μ_k, Σ_k) from the prior distribution P(k), k = 1, 2, ..., m.
• Draw a local feature x from N(μ_k, Σ_k).

Generally speaking, the distribution of x resembles a Gaussian distribution only within a local region of the feature space. Thus, for a GMM, each Gaussian distribution in the mixture only models a small partition of the feature space, and intuitively each Gaussian distribution can be seen as a feature prototype. As a result, a number of Gaussian distributions are needed to accurately depict the whole feature space. For commonly used low-dimensional local features, such as SIFT [19], it has been shown that it is sufficient to choose the number of Gaussian distributions to be of the order of a few hundred. However, for higher-dimensional local features this number may be insufficient. This is because the volume of the feature space usually increases quickly with the feature dimensionality. Consequently, the same number of Gaussian distributions leaves a coarser partition resolution and leads to imprecise modeling.

To increase the partition resolution for higher-dimensional feature spaces, one straightforward solution is to increase the number of Gaussian distributions. However, it turns out that the partition resolution increases slowly with the number of Gaussian distributions (compared to our method, which will be introduced in the next section). In other words, a much larger number of Gaussian distributions would be needed, and this would result in a Fisher vector whose dimensionality is too high to be handled in practice.

4 OUR APPROACHES

4.1 Compositional generative model

Our solution to this issue is to adopt a compositional model which does not model local features via a fixed number of prototypes. Instead, it assumes that the prototype can be adaptively generated by the composition of multiple pre-learned components. In other words, we can essentially leverage an infinite number of prototypes to model the whole feature space. Thus, the representative power of the generative model can be substantially improved. Intuitively, our model is motivated by the fact that many visual patterns within a local image region, especially those in a relatively large local region, can be seen as the combination of multiple object or scene parts. The complexity of those visual patterns can be attributed to the large number of possible combinations of some elementary patterns. So it is more efficient to use those elementary patterns to model the visual patterns than to attempt to directly model all possible pattern combinations.

Based on this insight, in this work we propose a two-stage framework to model the generative process of a local feature, which can be expressed as follows:

• Draw a latent combination coefficient u from a pre-specified distribution P(u).
• Generate a prototype μ by linearly combining the elementary patterns B with the latent combination coefficient, that is, μ = Bu. Then draw a local feature from the Gaussian distribution N(μ, Σ).

In this model, B ∈ R^{d×m} denotes m elementary patterns and is treated as the model parameters. Also note that in this framework we do not treat the mean vector μ as a model parameter but as a mapping from the latent combination coefficient. Thus we can essentially generate an infinite number of Gaussian distributions by varying u. By doing so, we can significantly increase the representative power of the generative model while keeping the number of its parameters, which determines the dimensionality of the resulting Fisher vector, tractable.

One question remains: how to model P(u), the distribution of the latent combination coefficient. In this work, we propose two different ways to model this distribution.

4.1.1 Approach I

The first approach models P(u) as a Laplace distribution. In other words, it assumes that the combination weight is sparse. This choice follows the common belief that visual signals can be modeled by the sparse combination of over-complete bases. Once the combination coefficient is sampled, we generate the prototype μ via Bu. More specifically, the generative process is written as follows:

• Draw a coding vector u from a zero-mean Laplace distribution P(u) = \frac{1}{2\lambda} \exp(-\frac{|u|}{\lambda}).
• Draw a local feature x from the Gaussian distribution N(Bu, Σ).

Note that the above process resembles a sparse coding model. To show this relationship, let us first write the marginal distribution of x according to the above generative process:

  P(x) = \int_u P(x, u|B) \, du = \int_u P(x|u, B) P(u) \, du.   (2)

The above formulation involves an integral operator which makes the likelihood evaluation difficult. To simplify the calculation, we use the point-wise maximum within the integral term to approximate the likelihood², that is,

  P(x) \approx P(x|u^*, B) P(u^*),
  u^* = \arg\max_u P(x|u, B) P(u).   (3)

² Strictly speaking, due to this approximation the resulting descriptors do not exactly correspond to Fisher kernels. Instead they are Fisher vector-like encoding methods.

By assuming that Σ = diag(σ_1², ..., σ_m²) and setting σ_1² = ... = σ_m² = σ² as a constant, the negative logarithm of P(x) is written as

  -\log(P(x|B)) = \min_u \frac{1}{\sigma^2} \|x - Bu\|_2^2 + \lambda \|u\|_1,   (4)

which is exactly the objective value of a sparse coding problem. This relationship suggests that we can learn the model parameter B and infer the latent variable u by using off-the-shelf sparse coding solvers.
An obvious question with respect to the method described above is whether it improves the modeling accuracy significantly over simply increasing the number of Gaussian distributions in the traditional GMM. To answer this question, we design an experiment to compare these two schemes. In our experiment, we use the average distance (denoted by d) between a feature and its closest mean vector in the GMM or in the above model as the measurement of modeling accuracy. The larger d, the lower the accuracy. The comparison is shown in Figure 1. In Figure 1 (a), we increase the dimensionality of the local features³ and for each dimensionality we calculate d in a GMM with 100 Gaussian distributions. As can be seen, d increases quickly with the feature dimensionality. In Figure 1 (b), we see that it is possible to reduce d by introducing more Gaussian distributions into the GMM. However, as may be seen, d drops slowly with the increase of the number of mixtures. In contrast, with the proposed method we can achieve a much lower d using only 100 bases. This result demonstrates the motivation behind our method.

³ This is achieved by performing PCA on a 4096-dimensional CNN regional descriptor. For more details about the features used, please refer to Section 4.4.1.

4.1.2 Approach II

The second approach that we propose for modeling P(u) is based on a further decomposition of the local feature. In this approach, a local feature is assumed to be composed of a discriminative part and a residual part:

  x = x_d + x_r,   (5)

where x_d and x_r denote the discriminative part and the residual part respectively. The discriminative part indicates the visual pattern that is identified as informative for discrimination by an oracle method. The residual part in this decomposition can correspond to the patterns shared by many classes, the irrelevant visual patterns, or the remaining useful information which has not been successfully identified by the oracle method. The motivation for modeling them differently is to let the two parts contribute differently to the discriminative power of the resulting Fisher vector.

The problem of how to achieve this decomposition remains, however. Clearly, there are infinitely many possibilities for decomposing x into x_d and x_r. To solve this problem, we resort to the guidance of a pre-trained supervised coding method (we will discuss the specific choice in Section 4.4.3). The idea of the supervised coding method is illustrated in Fig. 2: the supervised coding method maps each local feature x to a coding vector c and pools the coding vectors from all local features to obtain the image-level representation. It encompasses a wide range of feature coding methods, such as those discussed in Section 2.3. In this paper we further assume that c is sparse. This is a reasonable assumption since many supervised encoding methods explicitly enforce the sparsity property [23], [24], and the coding vectors from many other methods can be sparsified by thresholding [26] or simply setting the top-k largest coding values to be nonzero [28]. For those kinds of supervised coding methods, the presence of a nonzero coding value essentially indicates the occurrence of a discriminative elementary pattern identified by the supervised coding method. In other words, each active (non-zero) coding dimension corresponds to one discriminative elementary pattern, and the discriminative part of the local feature is the combination of these patterns. Let B_d denote the collection of discriminative elementary patterns (bases) and u_d be their corresponding combination weights. The above insight motivates us to encourage u_d to share similar nonzero dimensions with c, that is, to require ‖u_d − c‖_0 to be small. However, the l_0 norm makes the Fisher vector derivation difficult. Thus we relax the l_0 norm to the l_2 norm in our approach.

To incorporate the above ideas into our two-stage feature generative process framework, we assume that x_d and x_r are drawn from Gaussian distributions whose mean vectors are the linear combinations of two bases B_d and B_r respectively. For the combination weight of the residual part, u_r, we still assume that it is drawn from a Laplace distribution. The combination weight of the discriminative part, u_d, however, is assumed to be drawn from a compound distribution which should encourage both sparsity and compatibility with the supervised coding c. More specifically, we propose the following generative process of x:

• Draw a coding vector u_d from the conditional distribution P(u_d|c).
• Draw a coding vector u_r from a zero-mean Laplace distribution P(u_r) = \frac{1}{2\lambda_1} \exp(-\frac{\|u_r\|_1}{\lambda_1}).
• Draw a local feature x from the Gaussian distribution N(B_d u_d + B_r u_r, Σ), where B_d and B_r are model parameters. Here we define Σ = diag(σ_1², ..., σ_m²) and set σ_1² = ... = σ_m² = σ² as a constant.

In the above process, P(u_d|c) is defined as \frac{1}{Z} \exp(-\frac{\|u_d\|_1}{\lambda_2} - \frac{\|u_d - c\|_2^2}{\lambda_3}) to meet its two requirements as discussed above, where Z = \int_{u_d} \exp(-\frac{\|u_d\|_1}{\lambda_2} - \frac{\|u_d - c\|_2^2}{\lambda_3}) \, du_d is a constant. Also note that we do not separately generate the discriminative and residual parts of x in practice, i.e. x_d ~ N(B_d u_d, Σ̄), x_r ~ N(B_r u_r, Σ̄) and x = x_d + x_r. This is because when both parts are generated from Gaussian distributions with the same covariance matrix, their summation is simply a Gaussian random variable with mean vector B_d u_d + B_r u_r and covariance matrix Σ = 2Σ̄.

Similar to approach I, we can derive the marginal probability of x from the above generative process as:

  P(x) = \iint_{u_d, u_r} P(x, u_d, u_r | B_d, B_r, c) \, du_d \, du_r
       = \iint_{u_d, u_r} P(x | u_d, u_r, B_d, B_r, c) P(u_r) P(u_d|c) \, du_d \, du_r.   (6)

This formulation involves an integral over the latent variables u_d and u_r, which makes the calculation difficult.
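To make the generative process above tangible, the following NumPy sketch draws one local feature from the Approach II model: u_d is sampled near the supervised coding c (a crude stand-in for P(u_d|c)), u_r from a Laplace distribution, and x from N(B_d u_d + B_r u_r, σ²I). The sampling of u_d is a simplification purely for illustration (exact sampling from the compound distribution would need, e.g., MCMC), and all sizes, scales and names are assumptions.

```python
import numpy as np

def sample_local_feature(B_d, B_r, c, lam1=1.0, lam3=0.1, sigma=0.1, rng=None):
    """Draw one local feature x from the Approach II generative model (illustrative)."""
    rng = rng or np.random.default_rng()
    # u_d: close to the supervised coding c (rough stand-in for P(u_d|c))
    u_d = c + np.sqrt(lam3 / 2.0) * rng.standard_normal(c.shape)
    # u_r: zero-mean Laplace combination weight for the residual bases
    u_r = rng.laplace(loc=0.0, scale=lam1, size=B_r.shape[1])
    # x ~ N(B_d u_d + B_r u_r, sigma^2 I)
    mean = B_d @ u_d + B_r @ u_r
    return mean + sigma * rng.standard_normal(mean.shape)

# toy usage: 64-dim features, 100 discriminative and 100 residual bases
rng = np.random.default_rng(0)
B_d, B_r = rng.standard_normal((64, 100)), rng.standard_normal((64, 100))
c = np.maximum(rng.standard_normal(100), 0)   # a sparse, nonnegative "supervised" code
x = sample_local_feature(B_d, B_r, c, rng=rng)
```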
[Figure 1: two panels plotting the average distance d. (a) d vs. the dimensionality of the regional local features for a GMM with 100 distributions; (b) d vs. the number of Gaussian distributions in the GMM, compared against the proposed model with 100 bases.]

Fig. 1: Comparison of two strategies to increase the modeling accuracy. (a) For a GMM, d, the average distance (over 500 sampled local features) between a local feature and its closest mean vector, increases with the local feature dimensionality when the number of Gaussians is fixed at 100. (b) d is reduced by two ideas: (1) simply increasing the number of Gaussian mixtures; (2) using the proposed generation process. As we see, the latter achieves a much lower d even with a small number of bases.
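For reference, the quantity d reported in Figure 1 can be computed along the following lines. This NumPy sketch is only an illustrative reading of the text (d as the mean distance to the closest GMM mean, versus the mean reconstruction distance ‖x − Bu*‖ under the proposed model), not the authors' evaluation code; `ista_sparse_code` refers to the illustrative solver sketched after Eq. (4).

```python
import numpy as np

def avg_distance_gmm(X, means):
    """Mean distance from each feature to its closest GMM mean vector."""
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # (T, K)
    return dists.min(axis=1).mean()

def avg_distance_compositional(X, B, lam=0.1):
    """Mean distance from each feature to its adaptively composed prototype B u*."""
    recon = [np.linalg.norm(x - B @ ista_sparse_code(x, B, lam)) for x in X]
    return float(np.mean(recon))
```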
[Figure 2: each local feature passes through a shared local feature encoder; the resulting coding vectors are pooled into the image representation.]

Fig. 2: Demonstration of the supervised coding method. In a supervised coding method, the supervision information is used to learn the encoder function. A supervised coding method is used to guide the decomposition of the discriminative part and the residual part of a local feature.
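The pipeline of Fig. 2 can be summarized in a few lines. The sketch below assumes the specific encoder form adopted later in Section 4.4.3, Eq. (18), c = f(Pᵀx + b) with f(a) = max(0, a), followed by sum-pooling over all local features; P, b and the array shapes are placeholders rather than learned values.

```python
import numpy as np

def supervised_encode(X, P, b):
    """Eq. (18): c_i = max(0, P^T x_i + b) for every local feature x_i (rows of X)."""
    return np.maximum(X @ P + b, 0.0)     # (T, d) x (d, m) -> (T, m) coding vectors

def image_representation(X, P, b):
    """Sum-pool the coding vectors of all local features (Fig. 2)."""
    return supervised_encode(X, P, b).sum(axis=0)

# toy usage: 100 local features of dimension 64, 100-dimensional coding vector
X = np.random.randn(100, 64)
P, b = np.random.randn(64, 100), np.zeros(100)
rep = image_representation(X, P, b)       # shape (100,)
```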
Again, we follow the simplification used in approach I and use the point-wise maximum within the integral term to approximate the likelihood:

  P(x) \approx P(x | u_d^*, u_r^*, B_d, B_r, c) P(u_r^*) P(u_d^*|c),
  (u_d^*, u_r^*) = \arg\max_{u_d, u_r} P(x | u_d, u_r, B_d, B_r, c) P(u_r) P(u_d|c).   (7)

The negative logarithm of the likelihood is then formulated as:

  -\log P(x | B_d, B_r, c) = \min_{u_d, u_r} \|x - B_d u_d - B_r u_r\|_2^2 + \lambda_1 \|u_r\|_1 + \lambda_2 \|u_d\|_1 + \lambda_3 \|u_d - c\|_2^2,   (8)

where the model parameters B_d and B_r can be learned by minimizing the negative logarithm of the likelihood in Eq. (8).

4.2 Fisher vector derivation

4.2.1 Fisher vector derivation for approach I (SCFVC)

Once the generative model is established, we can derive its Fisher vector coding for a local feature x by differentiating its negative log-likelihood w.r.t. the model parameters. By cross-referencing the log-likelihood definition of our first model in Eq. (4), the Fisher vector can be calculated as follows:

  C(x) = \frac{\partial \log(P(x|B))}{\partial B} = \frac{\partial \left[ \frac{1}{\sigma^2} \|x - Bu^*\|_2^2 + \lambda \|u^*\|_1 \right]}{\partial B},
  u^* = \arg\max_u P(x|u, B) P(u).   (9)

Note that the differentiation involves u^*, which implicitly interacts with B. To calculate this term, we notice that the sparse coding problem can be reformulated as a general quadratic programming problem by defining u^+ and u^- as the positive and negative parts of u; that is, the sparse coding problem can be rewritten as

  \min_{u^+, u^-} \frac{1}{\sigma^2} \|x - B(u^+ - u^-)\|_2^2 + \lambda \mathbf{1}^T (u^+ + u^-)
  s.t. \; u^+ \geq 0, \; u^- \geq 0.   (10)

By further defining u' = (u^+, u^-)^T, \log(P(x|B)) can be
expressed in the following general form,

  \log(P(x|B)) = L(B) = \max_{u'} u'^T v(B) - \frac{1}{2} u'^T P(B) u',   (11)

where P(B) and v(B) are a matrix term and a vector term depending on B, respectively. The derivative of L(B) has been studied in [31]. According to Lemma 2 in [31], we can differentiate L(B) with respect to B as if u' did not depend on B. In other words, we can first calculate u', or equivalently u^*, by solving the sparse coding problem and then obtain the Fisher vector \frac{\partial \log(P(x|B))}{\partial B} as

  \frac{\partial \left[ \frac{1}{\sigma^2} \|x - Bu^*\|_2^2 + \lambda \|u^*\|_1 \right]}{\partial B} = \frac{1}{\sigma^2} (x - Bu^*) {u^*}^T.   (12)

Note that the Fisher vector expressed in Eq. (12) has an interesting form: it is simply the outer product between the sparse coding vector u^* and the reconstruction residual term (x - Bu^*). In traditional sparse coding, only the kth dimension of a coding vector, u_k, is used to indicate the relationship between a local feature x and the kth basis. Here in Eq. (12), the coding value u_k multiplied by the reconstruction residual is used to capture their relationship. In the following sections, we call this Fisher coding method Sparse Coding based Fisher Vector Coding (SCFVC in short).

4.2.2 Fisher vector derivation for approach II (HSCFVC)

Using the same technique as in SCFVC, we can derive the Fisher vector coding for our second generative model:

  G^x_{B_d} = \frac{\partial \log(P(x|B_d, B_r, c))}{\partial B_d} = \frac{\partial \left[ \frac{1}{\sigma^2} \|x - B_d u_d^* - B_r u_r^*\|_2^2 + \lambda_1 \|u_r^*\|_1 + \lambda_2 \|u_d^*\|_1 + \lambda_3 \|u_d^* - c\|_2^2 \right]}{\partial B_d}   (13)

  G^x_{B_r} = \frac{\partial \log(P(x|B_d, B_r, c))}{\partial B_r} = \frac{\partial \left[ \frac{1}{\sigma^2} \|x - B_d u_d^* - B_r u_r^*\|_2^2 + \lambda_1 \|u_r^*\|_1 + \lambda_2 \|u_d^*\|_1 + \lambda_3 \|u_d^* - c\|_2^2 \right]}{\partial B_r}   (14)

  (u_d^*, u_r^*) = \arg\min_{u_d, u_r} \frac{1}{\sigma^2} \|x - B_d u_d - B_r u_r\|_2^2 + \lambda_1 \|u_r\|_1 + \lambda_2 \|u_d\|_1 + \lambda_3 \|u_d - c\|_2^2,   (15)

where u_d and u_r interact with B_d and B_r. Similar to SCFVC, we can calculate G^x_{B_d} and G^x_{B_r} as if u_d^*, u_r^* did not depend on B_d, B_r. In other words, we can solve the inference problem in Eq. (15) to obtain u_d^*, u_r^* first and then calculate G^x_{B_d} and G^x_{B_r}. In the following sections, we call this Fisher vector encoding method Hybrid Sparse Coding based Fisher Vector Coding (HSCFVC in short) since the creation of its final image representation involves the components of both supervised coding and Fisher vector coding.

Note that HSCFVC essentially combines two ideas for building a good classification system: (1) identifying the discriminative pattern at the early coding stage of an image classification pipeline, i.e. supervised coding; and (2) preserving as much information of the local features as possible in the high-dimensional image representation and relying on classifier learning to identify the discriminative pattern.

4.3 Inference and learning

To learn the model parameters and calculate the Fisher vector, we need to solve the optimization problems in Eq. (4) and Eq. (8). These two problems can be solved using existing sparse coding solvers. However, this can still be slow for high-dimensional local features in practice. In [9], it has been suggested that a matching pursuit algorithm can be adopted as a substitute for exact sparse coding in a local feature encoding approach. Thus, in this work we use the method in [9] to approximately solve Eq. (4).

We also develop a similar algorithm to approximately solve Eq. (8), which essentially solves the following variant of Eq. (8):

  \min_{u_d, u_r} \|x - B_d u_d - B_r u_r\|_2^2 + \lambda \|u_d - c\|_2^2
  s.t. \; \|u_d\|_0 \leq k_1, \; \|u_r\|_0 \leq k_2.   (16)

In the matching pursuit algorithm, Eq. (16) is solved sequentially by updating one dimension of u_d or u_r at each iteration while keeping the values of the other dimensions fixed. In our solution, we first update each dimension of u_d and then update u_r. The algorithm is described in Algorithm 1. For the derivation and more details of Algorithm 1, please refer to the Appendix.

To learn the model parameters, B in SCFVC, or B_d and B_r in HSCFVC, we employ an alternating algorithm which iterates between the following two steps: (1) fixing B in SCFVC, or B_d and B_r in HSCFVC, then solving for u, or u_d and u_r in HSCFVC; (2) fixing u, or u_d and u_r in HSCFVC, then updating B, or B_d and B_r in HSCFVC, through the solver proposed in [32].

4.4 Implementation details

4.4.1 Local features

Using the neuron activations of a pre-trained CNN model as local features has become popular recently [5], [6], [7], [8]. The local features can be extracted either from a fully-connected layer or from a convolutional layer. In the former case, a number of image regions are first sampled and each of them is passed through the deep CNN⁴ to extract the fully-connected layer activations, which are used as a local feature. In the latter case, the whole image is directly fed into a pre-trained CNN and the activations at each spatial location of a convolutional layer are extracted as a local feature [20]. It has been observed that the fully-connected layer feature is useful for generic object classification and the convolutional layer feature is useful for texture and fine-grained image classification (where the discriminative patterns are usually special types of textures). In this work, we use both kinds of local features in our experiments.

⁴ A faster and equivalent implementation is to convert the fully-connected layer to a convolutional layer to perform the local feature extraction process [33].

4.4.2 Pooling and normalization

From the i.i.d. assumption in Eq. (1), the Fisher vector of the whole image equals

  \frac{\partial \log(P(X|B))}{\partial B} = \sum_i \frac{\partial \log(P(x_i|B))}{\partial B}.   (17)
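Putting Eq. (12) and Eq. (17) together, computing the SCFVC image signature amounts to: sparse-code each local feature, form the outer product between the reconstruction residual and the code, and sum over all local features. The NumPy sketch below follows that recipe, with the power and intra-normalization of Section 4.4.2 applied afterwards. The sparse coding step reuses the illustrative `ista_sparse_code` solver sketched after Eq. (4), and the 1/σ² factor is dropped since it only rescales the descriptor; this is a sketch under those assumptions, not the released implementation.

```python
import numpy as np

def scfvc_image_signature(X, B, lam=0.1, eps=1e-12):
    """SCFVC signature: sum over features of (x - B u*) u*^T (cf. Eqs. (12) and (17))."""
    d, m = B.shape
    G = np.zeros((d, m))
    for x in X:                                  # X: (T, d) local features
        u = ista_sparse_code(x, B, lam)          # illustrative sparse coding step
        G += np.outer(x - B @ u, u)              # residual-times-code outer product
    G = np.sign(G) * np.sqrt(np.abs(G))          # power normalization [2]
    # intra-normalization [13]: l2-normalize the sub-vector associated with each basis
    G /= (np.linalg.norm(G, axis=0, keepdims=True) + eps)
    return G.ravel()                             # final d*m-dimensional image signature
```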
Algorithm 1 Matching pursuit based algorithm for inferring u_d, u_r in Eq. (16)
1: procedure MP (please see more details in the Appendix)
2:   Input: x, B_d, B_r, k_1, k_2, λ, c
3:   Output: u_d, u_r
4:   Initialize residual r = x, u_d^1 = 0, u_r^1 = 0
5:   Fixing u_r, inferring u_d:
6:   for t = 1 : k_1 do
7:     Solve \min_{e_{dj}, u_{dj}} \|x - B_d u_d^t - B_r u_r^t - B_d e_{dj} u_{dj}\|_2^2 + \lambda \|u_d^t - c + e_{dj} u_{dj}\|_2^2
8:     Update r ← r - B_d e_{dj}^* u_{dj}^*, u_d^{t+1} = u_d^t + e_{dj}^* u_{dj}^*
9:   end for
10:  Fixing u_d, inferring u_r:
11:  for t = 1 : k_2 do
12:    Solve \min_{e_{rj}, u_{rj}} \|x - B_d u_d^t - B_r u_r^t - B_r e_{rj} u_{rj}\|_2^2
13:    Update r ← r - B_r e_{rj}^* u_{rj}^*, u_r^{t+1} = u_r^t + e_{rj}^* u_{rj}^*
14:  end for
15: end procedure
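A compact NumPy rendering of Algorithm 1 is given below. For each greedy step on u_d, the one-dimensional problem of line 7 has the closed-form solution α_j = (b_jᵀ r + λ(c_j − u_{d,j})) / (‖b_j‖² + λ) for basis j, and the basis giving the largest objective decrease is selected; the u_r steps are plain matching pursuit. This is an unofficial sketch written from the pseudocode (the exact selection rule is derived in the Appendix), so treat the scoring and tie-breaking details as assumptions.

```python
import numpy as np

def hsc_matching_pursuit(x, B_d, B_r, c, k1, k2, lam):
    """Greedy inference of u_d then u_r for Eq. (16), following Algorithm 1."""
    u_d, u_r = np.zeros(B_d.shape[1]), np.zeros(B_r.shape[1])
    r = x.copy()                                     # current reconstruction residual
    for _ in range(k1):                              # lines 6-9: infer u_d
        nrm = (B_d ** 2).sum(axis=0) + lam
        alpha = (B_d.T @ r + lam * (c - u_d)) / nrm  # optimal coefficient per basis j
        j = int(np.argmax(nrm * alpha ** 2))         # largest decrease of the objective
        r -= B_d[:, j] * alpha[j]
        u_d[j] += alpha[j]
    for _ in range(k2):                              # lines 11-14: infer u_r
        nrm = (B_r ** 2).sum(axis=0) + 1e-12
        alpha = (B_r.T @ r) / nrm
        j = int(np.argmax(nrm * alpha ** 2))
        r -= B_r[:, j] * alpha[j]
        u_r[j] += alpha[j]
    return u_d, u_r
```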
This is equivalent to performing sum-pooling of the extracted Fisher coding vectors. However, it has been observed [2], [13] that the image signature obtained by sum-pooling tends to over-emphasize the information from the background [2] or from bursting visual words [13]. It is therefore important to apply some normalization operations when sum-pooling is used. In this paper, we apply intra-normalization [13] to normalize the pooled Fisher vectors. For example, in SCFVC we apply l_2 normalization to the sub-vectors \sum_i (x_i - B u_i^*) u_{i,k}^*, ∀k, where k indicates the kth dimension of the sparse coding u_i^*. Besides intra-normalization, we also utilize the power normalization as suggested in [2].

4.4.3 Supervised coding

A wide range of supervised coding methods can be adopted in the proposed HSCFVC. However, in this paper we only consider a particular one. Specifically, we encode a local feature x by using the following encoder:

  c = f(P^T x + b),   (18)

where c is the coding vector and f is a nonlinear function. Here we use the soft-threshold (or hinge) function f(a) = max(0, a) as suggested in [34]. The final image representation is obtained by performing sum-pooling over the coding vectors of all local features⁵. To learn the encoder parameters, we feed the image representation into a logistic regression module to calculate the posterior probability and employ the negative entropy as the loss function. Then P and b are jointly learned with the parameters of the logistic regressor in an end-to-end fashion through stochastic gradient descent. Note that this supervised encoder learning process is similar to performing fine-tuning on the last few layers of a convolutional neural network, with x being the activations of a CNN⁶.

⁵ We also apply the power normalization to be consistent with the proposed SCFVC and HSCFVC.
⁶ In fact, the performance of this approach is comparable with that of fine-tuning a CNN network.

5 EXPERIMENT

To evaluate the effectiveness of the two proposed compositional FVC approaches, we conduct experiments on three large datasets: Caltech-UCSD Birds-200-2011 (Birds-200 in short), MIT indoor scene-67 (MIT-67 in short) and Pascal VOC 2007 (Pascal-07 in short). These three datasets are commonly used evaluation benchmarks for fine-grained image classification, scene classification and object recognition. The focus of our experiments is to verify two aspects: (1) whether the proposed SCFVC outperforms the traditional GMM-based FVC (GMM-FVC in short); and (2) whether the proposed HSCFVC outperforms SCFVC and its guiding supervised coding method of Section 4.4.3 (denoted SupC in the following), since HSCFVC is expected to enjoy the merits of both SupC and SCFVC.

5.1 Experimental setting

As mentioned above, we use the activations of a pre-trained CNN as the local features, and activations from both the convolutional layer and the fully-connected layer are used. More specifically, we extract the fully-connected layer activations as the local features for Pascal-07 and MIT-67, because we empirically found that the fully-connected layer activations work better for scene and generic object classification. For Birds-200, we use the convolutional activations as local features, since it has been reported that convolutional layer activations lead to superior performance over the fully-connected layer activations when applied to the fine-grained image classification problem [20]. Throughout our experiments, we use the vgg-very-deep-19-layers CNN model [42] as the pre-trained CNN model. To extract the local features from the fully-connected layer activations, we first resize the input image to 512×512 pixels and 614×614 pixels. Then we extract regions of 224×224 pixels on a dense spatial grid with a step size of 32 pixels. These local regions are fed into the deep CNN and the 4096-dimensional activations of the first fully-connected layer are extracted as local features. To extract the local features from the convolutional layer, we resize the input images to 224×224 pixels and 448×448 pixels and then extract the convolutional feature activations from the "conv5-4" layer as local features (in such a setting, there are 14×14 + 28×28 local features per image).
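The dense-grid region sampling described above is simple to reproduce; the sketch below enumerates 224×224 crops at a 32-pixel stride from a resized image (shown here for the 512×512 case). It only illustrates the cropping geometry assumed from the text, not the authors' extraction code, and the CNN forward pass is left as a placeholder.

```python
import numpy as np

def dense_crops(image, crop=224, stride=32):
    """Yield (y, x, patch) for crop x crop regions on a dense grid with the given stride."""
    H, W = image.shape[:2]
    for y in range(0, H - crop + 1, stride):
        for x in range(0, W - crop + 1, stride):
            yield y, x, image[y:y + crop, x:x + crop]

image = np.zeros((512, 512, 3), dtype=np.float32)  # a resized input image (placeholder)
patches = [p for _, _, p in dense_crops(image)]     # each patch would be fed to the CNN
print(len(patches))                                  # 10 x 10 = 100 regions for 512x512
```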
To decouple the correlations between dimensions of the CNN features and to avoid the dimensionality explosion of the Fisher vector representation, for fully-connected layer features we apply PCA to reduce the dimensionality to 2000. For convolutional layer features, we do not perform dimensionality reduction but only use PCA for decorrelation.

Five comparing methods are implemented. Besides the proposed SCFVC, HSCFVC and the traditional GMM-based FVC, the supervised coding method which serves as the guiding coding method for HSCFVC is also compared, to verify whether additional performance improvement can be achieved via our HSCFVC. We also compare against a baseline from [4], [35], denoted CNN-Jitter, which averages the fully-connected layer activations from several transformed versions of an input image, i.e. crops of the four corners and the middle region of the input image. We also quote results of other methods reported in the literature for reference. However, since they may adopt different implementation details, their performance may not be directly comparable to ours.

Both the proposed methods and the baseline methods involve several hyper-parameters; their settings are as follows. In SCFVC, the codebook size of B is set to 200. In HSCFVC, the dimensionality of c and the codebook sizes of B_d and B_r are set to 100. Therefore, the dimensionalities of the image representations created by SCFVC and HSCFVC are identical. For GMM-FVC, we also set the number of Gaussian distributions to 200 to make a fair comparison. We employ the matching pursuit approximation to solve the inference problem in SCFVC and HSCFVC. The sparsity of the coding vector is controlled by the parameter k in Eq. (16). Both k_1 in HSCFVC and k in SCFVC have significant influence on performance. We select k_1 from {10, 20, 30} and k from {10, 20, 30, 40} via cross-validation. k_2 is fixed to 10 for simplicity. λ in Eq. (16) is fixed to 0.5 unless otherwise stated. Throughout our experiments, we use a linear SVM [43] as the classifier.

5.2 Main results

TABLE 1: Comparison of results on Birds-200. The lower part of the table lists some results from the literature.

Methods                          Classification Accuracy   Comments
HSCFVC (proposed)                80.8%
SCFVC (proposed)                 77.3%
GMM FVC                          70.1%
SupC                             69.5%
CNN-Jitter                       63.6%
Cross Layer [20]                 73.5%                     convolutional features, combines two resolutions
Global CNN-FT [35]               66.4%                     no parts, fine-tuning
Parts-RCNN-FT [36]               76.37%                    uses parts, fine-tuning
Parts-RCNN [36]                  68.7%                     uses parts, no fine-tuning
CNNaug-SVM [4]                   61.8%                     -
CNN-SVM [4]                      53.3%                     CNN global
DPD+CNN [37]                     65.0%                     uses parts
DPD [38]                         51.0%                     -
Bilinear CNN [39]                85.1%                     two networks, fine-tuning
Two-Level Attention [40]         77.9%
Unsupervised Part Model [41]     81.0%

Birds-200 Birds-200 is a commonly used benchmark for fine-grained image classification which contains 11788 images of 200 different bird species. The experimental results on this dataset are shown in Table 1. As can be seen, both the proposed SCFVC and HSCFVC outperform the traditional GMM-FVC by a large margin. The improvement can be as large as 10%. This observation clearly demonstrates the advantage of using a compositional mechanism for modeling local features. Also, HSCFVC achieves better performance than SCFVC, outperforming it by more than 3%. Recall that the difference between HSCFVC and SCFVC is that the former further decomposes a local feature into a discriminative part and a residual part; thus the superior performance of HSCFVC clearly verifies the benefit of adopting such modeling. To achieve this decomposition, HSCFVC uses a supervised coding method as guidance. Thus it is interesting to examine the performance relationship between HSCFVC and its guiding coding method. This comparison is also shown in Table 1. As can be seen, HSCFVC also outperforms its guiding supervised encoding by 11%. As discussed previously, this further performance boost is expected because the supervised coding method may not be able to extract all the discriminative patterns from the local features, and the missing information can be regained from the high-dimensional image signature generated by HSCFVC. It can also be seen that the CNN-Jitter baseline performs worst in comparison with all other methods. This suggests that, to build an image-level representation with a pre-trained CNN model, it is better to use the CNN to extract local features rather than global features as in the CNN-Jitter baseline. Finally, by cross-referencing the recently published performance on this dataset, we can conclude that the proposed method is on par with the state-of-the-art. Note that some methods achieve better performance by adopting strategies which have not been considered here but which can be readily incorporated into our method. For example, in [39] the CNN model is fine-tuned; we could use the same technique to improve our performance.

MIT-67 MIT-67 contains 6700 images in 67 indoor scene categories. This dataset is very challenging because the differences between some categories are very subtle. The comparison of classification results is shown in Table 2. Again, we observe that the proposed HSCFVC and SCFVC significantly outperform the traditional GMM FVC. The improvements of HSCFVC and SCFVC over GMM-FVC are around 7% and 5%, respectively. In addition, HSCFVC achieves superior performance to SCFVC and SupC. This again shows that HSCFVC is able to combine the benefit