APPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, DEC. 2016

arXiv:1601.04143v2 [cs.CV] 10 Dec 2016

Compositional Model Based Fisher Vector Coding for Image Classification

Lingqiao Liu, Peng Wang, Chunhua Shen, Lei Wang, Anton van den Hengel, Chao Wang, Heng Tao Shen

Abstract—Deriving from the gradient vector of a generative model of local features, Fisher vector coding (FVC) has been identified as an effective coding method for image classification. Most, if not all, FVC implementations employ the Gaussian mixture model (GMM) as the generative model for local features. However, the representative power of a GMM can be limited because it essentially assumes that local features can be characterized by a fixed number of feature prototypes, and the number of prototypes is usually small in FVC. To alleviate this limitation, in this work we break the convention which assumes that a local feature is drawn from one of a few Gaussian distributions. Instead, we adopt a compositional mechanism which assumes that a local feature is drawn from a Gaussian distribution whose mean vector is composed as a linear combination of multiple key components, and the combination weight is a latent random variable. In doing so we greatly enhance the representative power of the generative model underlying FVC.

To implement our idea, we design two particular generative models following this compositional approach. In our first model, the mean vector is sampled from the subspace spanned by a set of bases and the combination weight is drawn from a Laplace distribution. In our second model, we further assume that a local feature is composed of a discriminative part and a residual part. As a result, a local feature is generated by the linear combination of discriminative part bases and residual part bases. The decomposition of the discriminative and residual parts is achieved via the guidance of a pre-trained supervised coding method. By calculating the gradient vector of the proposed models, we derive two new Fisher vector coding strategies. The first is termed Sparse Coding-based Fisher Vector Coding (SCFVC) and can be used as a substitute for traditional GMM based FVC. The second is termed Hybrid Sparse Coding-based Fisher Vector Coding (HSCFVC) since it combines the merits of both pre-trained supervised coding methods and FVC. Using pre-trained Convolutional Neural Network (CNN) activations as local features, we experimentally demonstrate that the proposed methods are superior to traditional GMM based FVC and achieve state-of-the-art performance in various image classification tasks.

Index Terms—Fisher Vector Coding, Sparse Coding, Hybrid Sparse Coding, Convolutional Networks, Generic Image Classification.

• L. Liu, P. Wang, C. Shen and A. van den Hengel are with the School of Computer Science, University of Adelaide, SA, Australia. E-mail: {lingqiao.liu, chunhua.shen, anton.vandenhengel}@adelaide.edu.au
• L. Wang and C. Wang are with the School of Computing and Information Technology, University of Wollongong, NSW, Australia. E-mail: {leiw, chaow}@uow.edu.au
• H. T. Shen is with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China.
• The first two authors contributed equally to this work. Correspondence should be addressed to C. Shen.

CONTENTS

1 Introduction
2 Related Work
2.1 Fisher vector coding
2.2 FVC with CNN local features
2.3 Supervised coding and FVC
3 Background
3.1 Fisher vector coding
3.2 Gaussian mixture model-based FVC
4 Our approaches
4.1 Compositional generative model
4.1.1 Approach I
4.1.2 Approach II
4.2 Fisher vector derivation
4.2.1 Fisher vector derivation for approach I (SCFVC)
4.2.2 Fisher vector derivation for approach II (HSCFVC)
4.3 Inference and learning
4.4 Implementation details
4.4.1 Local features
4.4.2 Pooling and normalization
4.4.3 Supervised coding
5 Experiment
5.1 Experimental setting
5.2 Main results
5.3 Analysis of SCFVC
5.3.1 GMM FVC vs. SCFVC: the impact of local feature dimensions
5.3.2 GMM FVC vs. SCFVC: codebook size and feature dimensionality trade-off
5.4 Analysis of HSCFVC
5.4.1 The classification accuracy vs. the value of λ
5.4.2 The impact of the residual part Fisher vector G^X_{B_c} on classification performance
6 Conclusion
References
7 Appendix: Matching pursuit based optimization for Equ. (16)

1 INTRODUCTION

In the bag-of-features model, Fisher vector coding (FVC) [1], [2] is a coding method derived from the Fisher kernel [3], which was originally proposed to compare two samples induced by a generative model. The basic idea of FVC is to first construct a generative model of local features and use the gradient of the log-likelihood of a particular feature with respect to the model parameters as the feature's coding vector. When applied as an image representation, the FVC vectors of local features are aggregated by a pooling operation and normalization [2] to generate the final image representation. FVC has been established as one of the most powerful local feature encoding and image representation generation methods. In most visual classification systems with FVC, the Gaussian mixture model (GMM) is adopted as the generative model for the local features. The GMM essentially assumes that each local feature is generated from one of the Gaussian distributions in the mixture, and intuitively the mean of each Gaussian distribution serves as a prototype for the local features. Since the dimensionality of the image representation resulting from GMM based FVC is the product of the local feature dimensionality and the number of Gaussians, to keep the image representation dimensionality tractable, the number of Gaussians is usually chosen to be a few hundred.

With the recent development in feature learning [4], higher dimensional local features, such as the activations of a pre-trained deep neural network [5], [6], [7], [8], have become increasingly popular. However, modeling these local features with the GMM for FVC is challenging. This is due to two factors: (1) The dimensionality of these local features can be much higher than that of traditional local features, e.g., SIFT. As a result, the feature space spanned by these local features can be very large, and using a limited number of Gaussian distributions can be insufficient to accurately model the true feature distribution. (2) The number of Gaussian distributions cannot be large because, combined with the high local feature dimensionality, it would lead to a corresponding increase in the size of the image-level representation.

To tackle the challenge of using high-dimensional local features in FVC, we propose two alternative solutions for building the generative model. Both rely on the idea of compositional modeling, which assumes that a local feature can be better modeled as the composition of multiple components than by a single prototype. For many recently proposed local features, such as CNN activations on local image regions, the image area that a local feature covers is relatively large. In this case, compositional modeling is a more natural choice than single prototype modeling because the visual pattern within the local region is clearly a combination of multiple object/scene parts. Mathematically, we formulate this idea as a two-stage generative process: in the first stage, the combination coefficients of multiple bases are drawn from a distribution and a linear combination of bases is generated; in the second stage, a local feature is drawn from a Gaussian distribution whose mean vector is the combined vector generated in the first stage. The compositional components in the proposed methods are treated as model parameters which are learned subsequently.

The difference between the two proposed approaches lies in the way a local feature is decomposed. The first approach adopts a single basis matrix and assumes that each combination coefficient is drawn from a Laplace distribution. The second approach takes the further step of assuming that a local feature may be decomposed into a discriminative part and a residual part. The discriminative part represents those patterns which are found to be discriminative, and the residual part depicts the patterns which are not well captured by the identified discriminative part. To achieve such a decomposition, we rely on a pre-trained supervised coding method and use its coding vector as our guide. The motivation for using decomposition-based modeling is twofold: (1) The decomposition enables part of the generative model to focus on the discriminative part and thus to better capture class-specific information. (2) On the other hand, the discriminative part identified by the pre-trained supervised coding method may not capture all the useful patterns in the local features due to the imperfection of supervised encoder training [Footnote 1: This may be due to poor local minima caused by training on a nonconvex objective function, or to overfitting caused by the difficulty of regularizing a deeply trained supervised encoder.]. In this case, the part of the generative model which models the residual provides a second chance to distill the missing information and thus compensates for the discriminative part modeling. Due to the complementary natures of the discriminative and residual parts, as well as the high dimensionality of Fisher vectors, it is expected that the Fisher vector derived from our second model preserves more useful information than both our first FVC and the supervised coding method that guides the decomposition.

We also show that, under certain approximations, the inference and learning problems of both methods can be converted into variants of the sparse coding problem, which can be readily solved with an off-the-shelf sparse coding solver. For this reason, we name the FVC methods derived from the first and the second models Sparse Coding-based Fisher Vector Coding (SCFVC) and Hybrid Sparse Coding-based Fisher Vector Coding (HSCFVC), respectively. To accelerate the calculation, we also develop efficient approximate solutions based on the matching pursuit algorithm [9]. By conducting intensive experimental evaluation on object classification, scene classification, and fine-grained image classification problems, we demonstrate that the proposed methods are superior to the traditional GMM-based FVC. HSCFVC further demonstrates state-of-the-art classification performance on the evaluated datasets.

A preliminary version of the first proposed method was published in [7]. In this paper we extend this approach significantly; in particular, we develop HSCFVC, which generalizes the framework of SCFVC and leads to further improved classification performance. We release the code of this paper at https://bitbucket.org/chhshen/scfvc.

2 RELATED WORK

2.1 Fisher vector coding

The concept of Fisher vectors was originally proposed in [3] as a framework to build a discriminative classifier from a generative model. It was later applied to image classification [1] by modeling the image as a bag of local features sampled from an i.i.d. distribution.
Later, several variants were proposed to improve the basic FVC. One of the first identified facts is that normalisation of Fisher vectors is essential to achieving good performance [2]. At the same time, several similar variants were developed independently from different perspectives [10], [11], [12]. The improved Fisher vector and its variants showed state-of-the-art performance in image classification and quickly became one of the most popular visual representation methods in computer vision. Numerous approaches have been developed to further enhance performance. For example, the work in [13] closely analysed particular implementation details of VLAD, a well-known variant of FVC. The work in [14] tried to incorporate spatial information from local features into the Fisher vector framework. In [15], [16], the authors revisited the basic i.i.d. assumption of FVC and pointed out its limitation; they proposed a non-i.i.d. model and derived an approximated Fisher vector for image classification. FVC has also been widely applied in various applications and has demonstrated state-of-the-art performance in the related fields. For example, in combination with local trajectory features, FVC-based systems have achieved the state-of-the-art in video-based action recognition [17], [18].

2.2 FVC with CNN local features

Conventionally, most FVC implementations are applied to low-dimensional hand-crafted local features, such as SIFT [19]. With the recent development of deep learning, it has been observed that simply extracting neural activations from a pre-trained CNN model achieves significantly better performance [4]. However, it was soon discovered that directly using activations from a pre-trained CNN as global features is still not the optimal choice [5], [6], [7], [8], at least for small/medium sized classification problems for which fine-tuning a CNN does not always improve performance significantly. Instead, it has been shown that it is beneficial to treat CNN activations as local features. In this case, traditional local feature coding approaches, such as FVC, can be readily applied. The work in [5] points out that the fully-connected activations of a pre-trained CNN are not translation invariant; thus, the authors propose to extract CNN activations from multiple regions of an image and use VLAD to encode these local features. In [6] and [20], the values of convolutional layer activations are analysed. They suggest that convolutional feature activations can be seen as a set of local features extracted on a dense grid. In particular, the work in [6] builds a texture classification system by applying FVC to the convolutional layer local features.

2.3 Supervised coding and FVC

The proposed HSCFVC combines the ideas of supervised coding and FVC. Here we briefly review the work on supervised coding and the attempts to combine it with FVC. Using supervised information to create an image representation is a popular idea in image classification. For example, supervised information has been utilized to learn discriminative codebooks for encoding local features [21], [22], [23], [24], [25], either through a separate codebook learning step [21], [23] or in an end-to-end fashion [22], [24]. Supervised information has also been applied to discover a set of middle-level discriminative patches [26], [27], [28] to train patch detectors which are essentially local feature encoders. The CNN can also be seen as a special case of supervised coding methods if we view the responses of the filter bank in a convolutional layer as the coding vector of the convolutional activations of the previous layer. From this perspective, the deep CNN can be seen as a hierarchical extension of the supervised coding method.

Generally speaking, the aforementioned supervised coding and FVC represent two major methodologies for creating discriminative image representations. For supervised coding, the supervised information is passed into the early stage of a classification system, i.e., by learning a dictionary or coding function. For FVC, the information content of the local features is largely preserved in the corresponding high-dimensional signature, and a simple classifier can then be used to extract the discriminative patterns for classification. There have been several works trying to combine the ideas of FVC and supervised coding. The work in [29] learns the model parameters of FVC in an end-to-end supervised training framework. In [30], multiple layers of Fisher vector coding modules are stacked into a deep architecture to form a deep network. In contrast to these works, our HSCFVC is based on the basic conceptual framework of FVC: first building a generative model and then deriving its gradient vector.

3 BACKGROUND

3.1 Fisher vector coding

Given two samples generated from a generative model, their similarity can be evaluated by the Fisher kernel [3]. The samples can take any form, including a vector or a vector set, as long as their generation process can be modeled. For the Fisher vector-based image classification approach, the sample is a set of local features extracted from an image, which we denote as X = {x_1, x_2, ..., x_T}. Assuming that each x_i is drawn i.i.d. from the distribution P(x|λ), in the Fisher kernel a sample X can be described by the gradient vector of the log-likelihood function w.r.t. the model parameter λ:

G_λ^X = ∇_λ log P(X|λ) = Σ_i ∇_λ log P(x_i|λ).   (1)

The Fisher kernel is then defined as K(X, Y) = G_λ^X⊤ F^(−1) G_λ^Y, where F is called the information matrix and is defined as F = E[G_λ^X G_λ^X⊤]. In this paper we follow [3] and omit it for computational simplicity; however, one can also approximate it by whitening the dimensions of the gradient vector G_λ, as suggested in [2]. As a result, two samples can be directly compared by the linear kernel of their corresponding gradient vectors, which are often called Fisher vectors. From a bag-of-features model perspective, the evaluation of the Fisher kernel for two images can be seen as first calculating the gradient or Fisher vector of each local feature and then performing sum-pooling. In this sense, the Fisher vector of each local feature, ∇_λ log P(x_i|λ), can be seen as a coding vector, and we call it Fisher vector coding in this paper.

3.2 Gaussian mixture model-based FVC

To implement the Fisher vector coding framework introduced above, one needs to specify the distribution P(x|λ). In the literature, most works use a GMM to model the generation process of x, which can be described as follows:

• Draw a Gaussian model N(μ_k, Σ_k) from the prior distribution P(k), k = 1, 2, ..., m.
• Draw a local feature x from N(μ_k, Σ_k).

Generally speaking, the distribution of x resembles a Gaussian distribution only within a local region of the feature space. Thus, for a GMM, each Gaussian distribution in the mixture only models a small partition of the feature space, and intuitively each Gaussian distribution can be seen as a feature prototype. As a result, a number of Gaussian distributions are needed to accurately depict the whole feature space. For commonly used low-dimensional local features, such as SIFT [19], it has been shown that it is sufficient to choose the number of Gaussian distributions to be of the order of a few hundred. However, for higher dimensional local features this number may be insufficient. This is because the volume of the feature space usually increases quickly with the feature dimensionality. Consequently, the same number of Gaussian distributions will leave a coarser partition resolution and lead to imprecise modeling.
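To make the GMM-based coding above concrete, the following is a minimal NumPy sketch of the per-feature gradients in Eq. (1) for a diagonal-covariance GMM, restricted to gradients w.r.t. the means and with the Fisher information omitted (as in Section 3.1). It is an illustration with our own function and variable names, not the implementation evaluated in the paper.

```python
import numpy as np

def gmm_fv_means(X, w, mu, var):
    """Simplified GMM-based Fisher vector: sum-pooled gradients of the
    log-likelihood w.r.t. the Gaussian means only.
    X: (T, d) local features; w: (K,) mixture weights;
    mu: (K, d) means; var: (K, d) diagonal variances."""
    T, d = X.shape
    K = len(w)
    # log N(x_i | mu_k, diag(var_k)) for every feature/component pair -> (T, K)
    log_prob = np.stack(
        [-0.5 * (np.sum(np.log(2 * np.pi * var[k]))
                 + np.sum((X - mu[k]) ** 2 / var[k], axis=1))
         for k in range(K)], axis=1)
    log_post = log_prob + np.log(w)                  # unnormalized log posteriors
    gamma = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)        # responsibilities (T, K)
    # d/d(mu_k) log P(X) = sum_i gamma_ik * (x_i - mu_k) / var_k, sum-pooled
    G = np.stack([(gamma[:, [k]] * (X - mu[k]) / var[k]).sum(axis=0)
                  for k in range(K)])                # (K, d)
    return G.ravel()                                 # (K*d,) Fisher vector
```

Note that the resulting vector has dimensionality K·d, the product of the number of Gaussians and the local feature dimensionality, which is exactly the tractability constraint discussed above.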
To increase the partition resolution for higher dimensional feature spaces, one straightforward solution is to increase the number of Gaussian distributions. However, it turns out that the partition resolution increases slowly with the number of Gaussian distributions (compared with our method, which will be introduced in the next section). In other words, a much larger number of Gaussian distributions would be needed, and this would result in a Fisher vector whose dimensionality is too high to be handled in practice.

4 OUR APPROACHES

4.1 Compositional generative model

Our solution to this issue is to adopt a compositional model which does not model local features via a fixed number of prototypes. Instead, it assumes that the prototype can be adaptively generated by the composition of multiple pre-learned components. In other words, we can essentially leverage an infinite number of prototypes to model the whole feature space, so the representative power of the generative model can be substantially improved. Intuitively, our model is motivated by the fact that many visual patterns within a local image region, especially those in a relatively large local region, can be seen as the combination of multiple object or scene parts. The complexity of those visual patterns can be attributed to the large number of possible combinations of some elementary patterns. It is therefore more efficient to use those elementary patterns to model the visual patterns than to attempt to directly model all possible pattern combinations.

Based on this insight, in this work we propose a two-stage framework to model the generative process of a local feature, which can be expressed as follows:

• Draw a latent combination coefficient u from a pre-specified distribution P(u).
• Generate a prototype μ by linearly combining the elementary patterns B with the latent combination coefficient, that is, μ = Bu. Then draw a local feature from the Gaussian distribution N(μ, Σ).

In this model, B ∈ R^{d×m} denotes m elementary patterns and is treated as the model parameter. Also note that in this framework we do not treat the mean vector μ as a model parameter but as a mapping from the latent combination coefficient. Thus we can essentially generate an infinite number of Gaussian distributions by varying u. In doing so, we significantly increase the representative power of the generative model while keeping the number of its parameters, which determines the dimensionality of the resulting Fisher vector, tractable.

One question remains: how to model P(u), the distribution of the latent combination coefficient. In this work, we propose two different ways to model this distribution.

4.1.1 Approach I

The first approach models P(u) as a Laplace distribution. In other words, it assumes that the combination weight is sparse. This choice follows the common belief that visual signals can be modeled by the sparse combination of over-complete bases. Once the combination coefficient is sampled, we generate the prototype μ via Bu. More specifically, the generative process is written as follows:

• Draw a coding vector u from a zero mean Laplace distribution P(u) = (1/(2λ)) exp(−|u|/λ).
• Draw a local feature x from the Gaussian distribution N(Bu, Σ).

Note that the above process resembles a sparse coding model. To show this relationship, let us first write the marginal distribution of x according to the above generative process:

P(x) = ∫_u P(x, u|B) du = ∫_u P(x|u, B) P(u) du.   (2)

The above formulation involves an integral operator, which makes the likelihood evaluation difficult. To simplify the calculation, we use the point-wise maximum within the integral term to approximate the likelihood [Footnote 2], that is,

P(x) ≈ P(x|u*, B) P(u*),  u* = argmax_u P(x|u, B) P(u).   (3)

By assuming that Σ = diag(σ_1², ..., σ_m²) and setting σ_1² = ... = σ_m² = σ² as a constant, the negative logarithm of P(x) is written as

−log(P(x|B)) = min_u (1/σ²) ‖x − Bu‖²₂ + λ‖u‖₁,   (4)

which is exactly the objective value of a sparse coding problem. This relationship suggests that we can learn the model parameter B and infer the latent variable u by using off-the-shelf sparse coding solvers.
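Since Eq. (4) is a standard lasso objective, any off-the-shelf solver applies. As an illustration, the sketch below infers u* with ISTA (iterative soft-thresholding), a generic proximal-gradient lasso solver; it is not the solver used in the paper (which later substitutes matching pursuit, Section 4.3), and the names and iteration counts here are our own.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def infer_u(x, B, lam, sigma2=1.0, n_iter=200):
    """Approximate u* = argmin_u (1/sigma2)*||x - B u||^2_2 + lam*||u||_1
    with ISTA: a gradient step on the smooth term followed by
    soft-thresholding."""
    m = B.shape[1]
    u = np.zeros(m)
    # Lipschitz constant of the gradient of the smooth part
    L = 2.0 * np.linalg.norm(B, 2) ** 2 / sigma2
    for _ in range(n_iter):
        grad = (2.0 / sigma2) * B.T @ (B @ u - x)
        u = soft_threshold(u - grad / L, lam / L)
    return u
```

For an orthonormal B the solution reduces to element-wise soft-thresholding of B⊤x, which gives a quick sanity check of the solver.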
[Footnote 2: Strictly speaking, due to this approximation the resulting descriptors do not exactly correspond to Fisher kernels. Instead, they are Fisher vector-like encoding methods.]

An obvious question with respect to the method described above is whether it improves modeling accuracy significantly over simply increasing the number of Gaussian distributions in the traditional GMM. To answer this question, we design an experiment to compare these two schemes. In our experiment, we use the average distance (denoted by d) between a feature and its closest mean vector, in the GMM or in the above model, as the measurement of modeling accuracy: the larger d, the lower the accuracy. The comparison is shown in Figure 1. In Figure 1 (a), we increase the dimensionality of the local features [Footnote 3: This is achieved by performing PCA on a 4096-dimensional CNN regional descriptor. For more details about the feature used, please refer to Section 4.4.1.], and for each dimensionality we calculate d in a GMM model with 100 Gaussian distributions. As can be seen, d increases quickly with the feature dimensionality. In Figure 1 (b), we see that it is possible to reduce d by introducing more Gaussian distributions into the GMM model; however, d drops slowly with the increase of the number of mixtures. In contrast, with the proposed method we can achieve a much lower d using only 100 bases. This result demonstrates the motivation behind our method.

[Fig. 1: Comparison of two strategies to increase the modeling accuracy. (a) For the GMM, d, the average distance (over 500 sampled local features) between a local feature and its closest mean vector, increases with the local feature dimensionality when the number of Gaussians is fixed at 100. (b) d is reduced by two ideas: (1) simply increasing the number of Gaussian mixtures; (2) using the proposed generation process. As we see, the latter achieves a much lower d even with a small number of bases. Plots omitted.]

4.1.2 Approach II

The second approach that we propose for modeling P(u) is based on a further decomposition of the local feature. In this approach, a local feature is assumed to be composed of a discriminative part and a residual part:

x = x_d + x_r,   (5)

where x_d and x_r denote the discriminative part and the residual part, respectively. The discriminative part indicates the visual pattern that is identified as informative for discrimination by an oracle method. The residual part in this decomposition can correspond to the patterns shared by many classes, to irrelevant visual patterns, or to remaining useful information which has not been successfully identified by the oracle method. The motivation for modeling the two components separately, and modeling them differently, is to enhance the modeling power of the resulting Fisher vector.

The problem of how to achieve this decomposition remains, however. Clearly, there are infinitely many possibilities for decomposing x into x_d and x_r. To solve this problem, we resort to the guidance of a pre-trained supervised coding method (we will discuss the specific choice in Section 4.4.3). The idea of the supervised coding method is demonstrated in Fig. 2: the supervised coding method maps each local feature x to a coding vector c and pools the coding vectors from all local features to obtain the image-level representation. It encompasses a wide range of feature coding methods, such as those discussed in Section 2.3. In this paper we further assume that c is sparse. This is a reasonable assumption since many supervised encoding methods explicitly enforce the sparsity property [23], [24], and the coding vectors of many other methods can be sparsified by thresholding [26] or simply by keeping the top-k largest coding values as nonzero [28]. For those kinds of supervised coding methods, the presence of a nonzero coding value essentially indicates the occurrence of a discriminative elementary pattern identified by the supervised coding method. In other words, each active (nonzero) coding dimension corresponds to one discriminative elementary pattern, and the discriminative part of the local feature is the combination of these patterns. Let B_d denote the collection of discriminative elementary patterns (bases) and u_d be their corresponding combination weights. The above insight motivates us to encourage u_d to share similar nonzero dimensions with c, that is, to require ‖u_d − c‖₀ to be small. However, the ℓ₀ norm makes the Fisher vector derivation difficult. Thus we relax the ℓ₀ norm to the ℓ₂ norm in our approach.

[Fig. 2: Demonstration of the supervised coding method. In a supervised coding method, the supervision information is used to learn the encoder function. A supervised coding method is used to guide the decomposition of the discriminative part and the residual part of a local feature. Schematic omitted.]

To incorporate the above ideas into our two-stage feature generation framework, we assume that x_d and x_r are drawn from Gaussian distributions whose mean vectors are linear combinations of two sets of bases, B_d and B_r, respectively. For the combination weight of the residual part, u_r, we still assume that it is drawn from a Laplace distribution. The combination weight of the discriminative part, u_d, however, is assumed to be drawn from a compound distribution which encourages both sparsity and compatibility with the supervised coding vector c. More specifically, we propose the following generative process for x:

• Draw a coding vector u_d from the conditional distribution P(u_d|c).
• Draw a coding vector u_r from a zero mean Laplace distribution P(u_r) = (1/(2λ₁)) exp(−‖u_r‖₁/λ₁).
• Draw a local feature x from the Gaussian distribution N(B_d u_d + B_r u_r, Σ), where B_d and B_r are model parameters. Here we define Σ = diag(σ_1², ..., σ_m²) and set σ_1² = ... = σ_m² = σ² as a constant.

In the above process, P(u_d|c) is defined as

P(u_d|c) = (1/Z) exp(−‖u_d‖₁/λ₂ − ‖u_d − c‖²₂/λ₃)

to meet its two requirements as discussed above, where Z = ∫_{u_d} exp(−‖u_d‖₁/λ₂ − ‖u_d − c‖²₂/λ₃) du_d is a constant. Also note that we do not separately generate the discriminative and residual parts of x in practice, i.e., x_d ∼ N(B_d u_d, Σ̄), x_r ∼ N(B_r u_r, Σ̄) and x = x_d + x_r. This is because when both parts are generated from Gaussian distributions with the same covariance matrix, their sum is simply a Gaussian random variable with mean vector B_d u_d + B_r u_r and covariance matrix Σ = 2Σ̄.

Similar to approach I, we can derive the marginal probability of x from the above generative process as:

P(x) = ∬_{u_d,u_r} P(x, u_d, u_r|B_d, B_r, c) du_d du_r = ∬_{u_d,u_r} P(x|u_d, u_r, B_d, B_r, c) P(u_r) P(u_d|c) du_d du_r.   (6)

This formulation involves an integral over the latent variables u_d and u_r, which makes the calculation difficult. Again, we follow the simplification in approach I and use the point-wise maximum within the integral term to approximate the likelihood:

P(x) ≈ P(x|u_d*, u_r*, B_d, B_r, c) P(u_r*) P(u_d*|c),
(u_d*, u_r*) = argmax_{u_d,u_r} P(x|u_d, u_r, B_d, B_r, c) P(u_r) P(u_d|c).   (7)

The negative logarithm of the likelihood is then formulated as:

−log P(x|B_d, B_r, c) = min_{u_d,u_r} ‖x − B_d u_d − B_r u_r‖²₂ + λ₁‖u_r‖₁ + λ₂‖u_d‖₁ + λ₃‖u_d − c‖²₂,   (8)

where the model parameters B_d and B_r can be learned by minimizing the negative logarithm of the likelihood in Eq. (8).
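The MAP inference behind Eq. (8) can be approached with generic alternating block updates: fix u_r and take proximal-gradient steps on u_d (whose subproblem combines an ℓ₁ penalty with the smooth quadratic ‖u_d − c‖² term), then fix u_d and update u_r as a plain lasso. The sketch below is our own illustrative solver with hypothetical parameter defaults, not the matching-pursuit Algorithm 1 that the paper actually uses (Section 4.3).

```python
import numpy as np

def _soft(z, t):
    """Soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def infer_ud_ur(x, Bd, Br, c, lam1, lam2, lam3, n_outer=20, n_inner=50):
    """Illustrative solver for Eq. (8):
    min ||x - Bd ud - Br ur||^2 + lam1*||ur||_1 + lam2*||ud||_1
        + lam3*||ud - c||^2
    via alternating proximal-gradient updates on the two blocks."""
    ud = np.zeros(Bd.shape[1])
    ur = np.zeros(Br.shape[1])
    Ld = 2.0 * np.linalg.norm(Bd, 2) ** 2 + 2.0 * lam3  # Lipschitz, ud block
    Lr = 2.0 * np.linalg.norm(Br, 2) ** 2               # Lipschitz, ur block
    for _ in range(n_outer):
        r = x - Br @ ur              # target seen by the discriminative part
        for _ in range(n_inner):
            g = 2.0 * Bd.T @ (Bd @ ud - r) + 2.0 * lam3 * (ud - c)
            ud = _soft(ud - g / Ld, lam2 / Ld)
        r = x - Bd @ ud              # target seen by the residual part
        for _ in range(n_inner):
            g = 2.0 * Br.T @ (Br @ ur - r)
            ur = _soft(ur - g / Lr, lam1 / Lr)
    return ud, ur
```

Each block update cannot increase the objective of Eq. (8), so the alternation descends monotonically from the zero initialization.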
4.2 Fisher vector derivation

4.2.1 Fisher vector derivation for approach I (SCFVC)

Once the generative model is established, we can derive its Fisher vector coding for a local feature x by differentiating its negative log-likelihood w.r.t. the model parameters. By cross-referencing the log-likelihood definition of our first model in Eq. (4), the Fisher vector can be calculated as follows:

C(x) = ∂log(P(x|B))/∂B = ∂[(1/σ²)‖x − Bu*‖²₂ + λ‖u*‖₁]/∂B,
u* = argmax_u P(x|u, B) P(u).   (9)

Note that the differentiation involves u*, which implicitly interacts with B. To calculate this term, we notice that the sparse coding problem can be reformulated as a quadratic programming problem by defining u⁺ and u⁻ as the positive and negative parts of u; that is, the sparse coding problem can be rewritten as

min_{u⁺,u⁻} (1/σ²)‖x − B(u⁺ − u⁻)‖²₂ + λ1⊤(u⁺ + u⁻)
s.t. u⁺ ≥ 0, u⁻ ≥ 0.   (10)

By further defining u′ = (u⁺, u⁻)⊤, log(P(x|B)) can be expressed in the following general form:

log(P(x|B)) = L(B) = max_{u′} u′⊤v(B) − (1/2) u′⊤P(B)u′,   (11)

where P(B) and v(B) are a matrix term and a vector term depending on B, respectively. The derivative of L(B) has been studied in [31]. According to Lemma 2 in [31], we can differentiate L(B) with respect to B as if u′ did not depend on B. In other words, we can first calculate u′, or equivalently u*, by solving the sparse coding problem, and then obtain the Fisher vector ∂log(P(x|B))/∂B as

∂[(1/σ²)‖x − Bu*‖²₂ + λ‖u*‖₁]/∂B = (1/σ²)(x − Bu*) u*⊤.   (12)

Note that the Fisher vector expressed in Eq. (12) has an interesting form: it is simply the outer product between the sparse coding vector u* and the reconstruction residual term (x − Bu*). In traditional sparse coding, only the kth dimension of a coding vector, u_k, is used to indicate the relationship between a local feature x and the kth basis. Here in Eq. (12), the coding value u_k multiplied by the reconstruction residual is used to capture their relationship. In the following sections, we call this Fisher coding method Sparse Coding based Fisher vector coding (SCFVC in short).
The derivative of L(B) has been suggested that a matching pursuit algorithm can be beenstudiedin[31].AccordingtotheLemma2in[31],we adopted as a substitute for an exact sparse coding problem can differentiate L(B) with respect to B as if u(cid:48) did not for local feature encoding approach. Thus, in this work we dependonB.Inotherwords,wecanfirstlycalculateu(cid:48) or usethemethodin[9]toapproximatelysolveEq.(4). equivalently u∗ by solving the sparse coding problem and We also develop a similar algorithm to approximately thenobtaintheFishervector ∂log(P(x|B)) as solve Eq. (8) which essentially solves the following variant ∂B problemofEq.(8): ∂σ12(cid:107)x−Bu∂∗B(cid:107)22+λ(cid:107)u∗(cid:107)1 = σ12(x−Bu∗)u∗T. (12) umd,iunr(cid:107)x−Bdud−Brur(cid:107)22+λ(cid:107)ud−c(cid:107)22 (16) s.t.(cid:107)u (cid:107) ≤k , (cid:107)u (cid:107) ≤k . Note that the Fisher vector expressed in Eq. (12) has an d 0 1 r 0 2 interesting form: it is simply the outer product between In the matching pursuit algorithm, the Eq. (16) is sequen- thesparsecodingvectoru∗ andthereconstructionresidual tially solved by updating one dimension of ud and ur at term (x−Bu∗). In traditional sparse coding, only the kth eachiterationwhilekeepingthevaluesatotherdimensions dimension of a coding vector uk is used to indicate the fixed.Inoursolution,wefirstupdateeachdimensionofud relationship between a local feature x and the kth basis. and then update ur. The algorithm is described in Algo- HereinEq.(12),thecodingvalueuk multiplyingtherecon- rithm1.ForthederivationandmoredetailsofAlgorithm1, structionresidualisusedtocapturetheirrelationship.Inthe pleaserefertotheAppendixsection. followingsections,wecallthisFishercodingmethodSparse To learn the model parameters B in SCFVC, or Bd and CodingbasedFishervectorcoding(SCFVCinshort). 
Br inHSCFVC,weemployanalternatingalgorithmwhich iterates between the following two steps: (1) fixing B in 4.2.2 FishervectorderivationforapproachII(HSCFVC) SCFVC, or Bd and Br in HSCFVC, then solving u, or ud andur inHSCFVC; (2)fixingu,orud andur inHSCFVC, Using the same technique as SCFVC, we can derive the Fishervectorcodingforoursecondgenerativemodel: then updating B, or Bd and Br in HSCFVC through the solverproposedin[32]. ∂log(P(x|B ,B ,c)) Gx = d r Bd ∂Bd 4.4 Implementationdetails ∂ 1 (cid:107)x−B u∗−B u∗(cid:107)2+λ (cid:107)u∗(cid:107) +λ (cid:107)u∗(cid:107) +λ (cid:107)u∗−c(cid:107)2 = σ2 d d r r 2 1 r 1 2 d 1 3 d 24.4.1 Localfeatures ∂B d (13) Using the neuron activations of a pre-trained CNN model ∂log(P(x|B ,B ,c)) as local features has become popular recently [5], [6], [7], Gx = d r Br ∂B [8].Thelocalfeaturecanbeeitherextractedfromthefully- r ∂ 1 (cid:107)x−B u∗−B u∗(cid:107)2+λ (cid:107)u∗(cid:107) +λ (cid:107)u∗(cid:107) +λ (cid:107)u∗−c(cid:107)2connected layer or the convolutional layer. For the former = σ2 d d r r 2 1 r 1 2 d 1 3 d 2case, a number of image regions are firstly sampled and ∂B r (14) each of them will pass through the deep CNN 4 to extract 1 thefully-connectedlayeractivationswhichwillbeusedasa u∗,u∗ =argmin (cid:107)x−B u −B u (cid:107)2+λ (cid:107)u (cid:107) +λ (cid:107)u (cid:107) d r σ2 d d r r 2 1 r 1 2 d 1 localfeature.Forthelattercase,thewholeimageisdirectly ud,ur +λ3(cid:107)ud−c(cid:107)22, (15) fedintoapre-trainedCNNandtheactivationsateachspa- tiallocationofaconvolutionallayerareextractedasalocal where ud,ur interact with Bd,Br. Similar to SCFVC, we feature [20]. It has been observed that the fully-connected cBadn,Bcarlc.uInlaotethGerxBwdoarndds,GwxBercaansisfoulvde,uthrediindfenreontcdeeppreonbdleomn ltahyeecronfevaotulurteioinsaulslaeyfuelrffoeartugereneisriucsoefbujelcftorcltaesxstiufirceaatinodnfianned- in Eq. 
(15) to obtain u∗,u∗ first and then calculate Gx grainedimageclassification(thediscriminativepatternsare and Gx . In the followdingr sections, we call this FishBedr usuallyspecialtypesoftextures).Inthiswork,weuseboth vectoreBnrcodingmethodHybridSparseCodingbasedFisher kindsoflocalfeaturesinourexperiment. vector coding (HSCFVC in short) since the creation of its 4.4.2 Poolingandnormalization finalimagerepresentationinvolvesthecomponentsofboth supervisedcodingandFishervectorcoding. Fromthei.i.dassumptioninEq.(1),theFishervectorofthe Note that HSCFVC essentially combines two ideas of wholeimageequalsto building a good classification system: (1) identifying the ∂log(P(X|B)) (cid:88)∂log(P(xi|B)) discriminativepatternattheearlycodingstageofanimage = (17) ∂B ∂B classificationpipeline,i.e.supervisedcoding.(2)preserving i as much information of local features as possible into the 4.A faster and equivalent implementation is to convert the fully- high-dimensionalimagerepresentationandreliesonclassi- connectedlayertotheconvolutionallayertoperformthelocalfeature fierlearningtoidentifythediscriminativepattern. extractionprocess[33]. 
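Eqs. (9), (12) and (17) together define the SCFVC encoder: infer a sparse code for each local feature, take the outer product of the reconstruction residual with the code, and sum over the image. The sketch below is illustrative rather than the authors' implementation — it uses a plain greedy (matching-pursuit style) solver in place of exact l1 inference, in the same spirit as the approximation of Section 4.3, fixes sigma = 1, and uses arbitrary dictionary and feature sizes.

```python
import numpy as np

def greedy_sparse_code(x, B, k):
    """Greedy sparse coding of x over dictionary B: a simple stand-in for
    the exact inference of u* in Eq. (9)."""
    u = np.zeros(B.shape[1])
    support, r = [], x.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(B.T @ r)))   # basis most correlated with residual
        if j not in support:
            support.append(j)
        # refit the coefficients on the current support by least squares
        coef, *_ = np.linalg.lstsq(B[:, support], x, rcond=None)
        u[:] = 0.0
        u[support] = coef
        r = x - B @ u
    return u

def scfvc_encode(x, B, k=5, sigma=1.0):
    """Per-feature Fisher vector of Eq. (12): (1/sigma^2)(x - B u*) u*^T."""
    u = greedy_sparse_code(x, B, k)
    return np.outer(x - B @ u, u) / sigma**2

rng = np.random.default_rng(1)
B = rng.standard_normal((64, 32))              # bases (dim x codebook size)
X = rng.standard_normal((10, 64))              # 10 local features of one image
G_image = sum(scfvc_encode(x, B) for x in X)   # sum-pooling, Eq. (17)
```

G_image would then be power- and intra-normalized before being fed to the classifier, as discussed in Section 4.4.2.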
Algorithm 1: Matching Pursuit based algorithm for inferring u_d, u_r in Eq. (16)

1: procedure MP (please see more details in the Appendix)
2:   Input: x, B_d, B_r, k_1, k_2, λ, c
3:   Output: u_d, u_r
4:   Initialize residual r = x, u_d^1 = 0, u_r^1 = 0
5:   Fixing u_r, infer u_d:
6:   for t = 1 : k_1 do
7:     Solve min_{e_{dj}, u_{dj}} ‖x − B_d u_d^t − B_r u_r^t − B_d e_{dj} u_{dj}‖_2^2 + λ‖u_d^t − c + e_{dj} u_{dj}‖_2^2
8:     Update r ← r − B_d e_{dj}^* u_{dj}^*,  u_d^{t+1} = u_d^t + e_{dj}^* u_{dj}^*
9:   end for
10:  Fixing u_d, infer u_r:
11:  for t = 1 : k_2 do
12:    Solve min_{e_{rj}, u_{rj}} ‖x − B_d u_d^t − B_r u_r^t − B_r e_{rj} u_{rj}‖_2^2
13:    Update r ← r − B_r e_{rj}^* u_{rj}^*,  u_r^{t+1} = u_r^t + e_{rj}^* u_{rj}^*
14:  end for
15: end procedure

This is equivalent to performing sum-pooling over the extracted Fisher coding vectors. However, it has been observed [2], [13] that the image signature obtained by sum-pooling tends to over-emphasize the information from the background [2] or from bursting visual words [13]. It is therefore important to apply normalization operations when sum-pooling is used. In this paper, we apply intra-normalization [13] to normalize the pooled Fisher vectors. For example, in SCFVC we apply l2 normalization to the subvectors \sum_i (x_i - B u_i^*) u_{i,k}^* for all k, where k indicates the k-th dimension of the sparse coding vector u_i^*. Besides intra-normalization, we also utilize the power normalization as suggested in [2].

5 EXPERIMENT

To evaluate the effectiveness of the two proposed compositional FVC approaches, we conduct experiments on three large datasets: Caltech-UCSD Birds-200-2011 (Birds-200 in short), MIT indoor scene-67 (MIT-67 in short) and Pascal VOC 2007 (Pascal-07 in short). These three datasets are commonly used evaluation benchmarks for fine-grained image classification, scene classification and object recognition, respectively. The focus of our experiments is to verify two aspects: (1) whether the proposed SCFVC outperforms the traditional GMM based FVC (GMM-FVC in short); (2) whether the proposed HSCFVC outperforms SCFVC and its guiding supervised coding method of Section 4.4.3 (denoted as SupC in the following), since HSCFVC is expected to enjoy the merits of both SupC and SCFVC.
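Each single-coordinate subproblem in Algorithm 1 has a closed-form solution: with residual r, the best update v on coordinate j of u_d minimizes ‖r − b_j v‖_2^2 + λ((u_d − c)_j + v)^2, giving v = (b_j^T r − λ(u_d − c)_j)/(‖b_j‖_2^2 + λ); for u_r the λ terms drop out. The sketch below implements Algorithm 1 under that reading — it is a simplified reconstruction for illustration, not the authors' released code.

```python
import numpy as np

def mp_infer(x, Bd, Br, k1, k2, lam, c):
    """Greedy inference of (u_d, u_r) for Eq. (16), following Algorithm 1:
    k1 single-coordinate updates of u_d (u_r fixed), then k2 single-coordinate
    updates of u_r (u_d fixed)."""
    ud = np.zeros(Bd.shape[1])
    ur = np.zeros(Br.shape[1])
    r = x.copy()                            # residual x - Bd@ud - Br@ur
    for _ in range(k1):                     # lines 6-9: fix u_r, infer u_d
        num = Bd.T @ r - lam * (ud - c)     # per-coordinate optimal-step numerator
        den = (Bd ** 2).sum(axis=0) + lam   # per-coordinate denominator
        j = int(np.argmax(num ** 2 / den))  # coordinate with largest decrease
        v = num[j] / den[j]
        ud[j] += v
        r -= Bd[:, j] * v                   # line 8: update the residual
    for _ in range(k2):                     # lines 11-14: fix u_d, infer u_r
        num = Br.T @ r
        den = (Br ** 2).sum(axis=0)
        j = int(np.argmax(num ** 2 / den))
        v = num[j] / den[j]
        ur[j] += v
        r -= Br[:, j] * v
    return ud, ur

def objective(x, Bd, Br, ud, ur, lam, c):
    """Objective of Eq. (16) (the l0 constraints are enforced greedily)."""
    return np.sum((x - Bd @ ud - Br @ ur) ** 2) + lam * np.sum((ud - c) ** 2)

rng = np.random.default_rng(2)
x = rng.standard_normal(32)
Bd = rng.standard_normal((32, 20))
Br = rng.standard_normal((32, 20))
c = np.zeros(20)
ud, ur = mp_infer(x, Bd, Br, k1=5, k2=5, lam=0.5, c=c)
```

Because each greedy step can only lower the objective, the result respects the sparsity budgets k_1 and k_2 while monotonically decreasing the cost in Eq. (16).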
4.4.3 Supervised coding

A wide range of supervised coding methods can be adopted in the proposed HSCFVC. However, in this paper, we only consider a particular one. Specifically, we encode a local feature x by using the following encoder:

$$ c = f(P^T x + b), \tag{18} $$

where c is the coding vector and f is a nonlinear function. Here we use the soft-threshold (or hinge) function f(a) = max(0, a), as suggested in [34]. The final image representation is obtained by performing sum-pooling over the coding vectors of all local features (Footnote 5). To learn the encoder parameters, we feed the image representation into a logistic regression module to calculate the posterior probability and employ the negative entropy as the loss function. Then P and b are jointly learned with the parameters of the logistic regressor in an end-to-end fashion through stochastic gradient descent. Note that this supervised encoder learning process is similar to performing fine-tuning on the last few layers of a convolutional neural network, with x being the activations of a CNN (Footnote 6).

Footnote 5: We also apply the power normalization to be consistent with the proposed SCFVC and HSCFVC.
Footnote 6: In fact, the performance of this approach is comparable with that of fine-tuning a CNN network.

TABLE 1: Comparison of results on Birds-200. The lower part of this table lists some results from the literature.

Methods | Classification Accuracy | Comments
HSCFVC (proposed) | 80.8% |
SCFVC (proposed) | 77.3% |
GMM FVC | 70.1% |
SupC | 69.5% |
CNN-Jitter | 63.6% |
CrossLayer [20] | 73.5% | use convolutional features, combine two resolutions
Global CNN-FT [35] | 66.4% | no parts, fine-tuning
Parts-RCNN-FT [36] | 76.37% | use parts, fine-tuning
Parts-RCNN [36] | 68.7% | use parts, no fine-tuning
CNNaug-SVM [4] | 61.8% | -
CNN-SVM [4] | 53.3% | CNN global
DPD+CNN [37] | 65.0% | use parts
DPD [38] | 51.0% | -
Bilinear CNN [39] | 85.1% | two networks, fine-tuning
Two-Level Attention [40] | 77.9% |
Unsupervised Part Model [41] | 81.0% |

5.1 Experimental setting

As mentioned above, we use the activations of a pre-trained CNN as the local features, and activations from both the convolutional layer and the fully-connected layer are used. More specifically, we extract the fully-connected layer activations as the local features for Pascal-07 and MIT-67 because we empirically found that the fully-connected layer activations work better for scene and generic object classification. For Birds-200, we use the convolutional activations as local features, since it has been reported that convolutional layer activations lead to superior performance over fully-connected layer activations when applied to the fine-grained image classification problem [20]. Throughout our experiments, we use the vgg-very-deep-19-layers CNN model [42] as the pre-trained CNN model. To extract the local features from the fully-connected layer activations, we first resize the input image to 512×512 pixels and 614×614 pixels. Then we extract regions of size 224×224 pixels on a dense spatial grid with a step size of 32 pixels. These local regions are fed into the deep CNN and the 4096-dimensional activations of the first fully-connected layer are extracted as local features. To extract the local features from the convolutional layer, we resize input images to 224×224 pixels and 448×448 pixels and then extract the convolutional feature activations from the "conv5-4" layer as local features (in such a setting, there are 14×14 + 28×28 local features per image). To decouple the correlations between the dimensions of CNN features and to avoid a dimensionality explosion of the Fisher vector representation, for fully-connected layer
This observation clearly demonstrates the Forconvolutionallayerfeatures,wedonotperformdimen- advantageofusingcompositionalmechanismformodeling sionalityreductionbutonlyusePCAfordecorrelation. local features. Also, HSCFVC achieves better performance Five comparing methods are implemented. Besides the than SCFVC, which outperforms the latter by more than proposedSCFVC,HSCFVCandthetraditionalGMMbased 3%.RecallthatthedifferencebetweenHSCFVCandSCFVC FVC, the supervised coding method which serves as the lies in that the former further decomposes a local feature guiding coding method for HSCFVC is also compared into a discriminative part and a residual part, thus the to verify if additional performance improvement can be superiorperformanceofHSCFVCclearlyverifiesthebenefit achieved via our HSCFVC. Also, we compare a baseline of adopting such modeling. To achieve this decomposition, in [4], [35], denoted as CNN-Jitter, which averages the HSCFVC uses a supervised coding method as guidance. fully-connected layer activations from several transformed Thus it is interesting to examine the performance relation- versions of an input image, i.e. cropping the four corners ship between HSCFVC and its guiding coding method. andmiddleregionofaninputimage.Wealsoquoteresults This comparison is also shown in Table 1. As can be seen, ofothermethodsreportedfromtheliteratureforreference. HSCFVCalsooutperformsitsguidingsupervisedencoding However, since they may adopt different implementation by 11%. As discussed previously, this further performance details, their performance may not be directly comparable boost is expected because the supervised coding method toours. may not be able to extract all discriminative patterns from Both the proposed methods and baseline methods in- localfeaturesandthemissinginformationcanbere-gained volveseveralhyper-parameters,theirsettingsaredescribed from the high-dimensional image signature generated by asfollows.InSCFVC,thecodebooksizeofBissettobe200. HSCFVC. 
Also, it can be seen that the CNN-Jitter baseline InHSCFVC,thedimensionalityofcandthecodebooksize performsworstincomparisonwithallothermethods.This ofBd,Bc aresettobe100.Therefore,thedimensionalityof suggeststhattobuildimage-levelrepresentationwithapre- the image representation created by SCFVC and HSCFVC trainedCNNmodelitisbettertoadopttheCNNtoextract are identical. For GMM-FVC, we also set the number of localfeaturesratherthanglobalfeaturesasintheCNN-Jitter Gaussian distributions to be 200 to make fair comparison. baseline.Finally,bycross-referencingtherecentlypublished Weemploythematchingpursuitapproximationtosolvethe performance on this dataset, we can conclude that the inferenceproblemintheSCFVCandHSCFVC.Thesparsity proposed method is on par with the state-of-the-art. Note of coding vector is controlled by the parameter k in Eq. thatsomemethodsachievebetterperformancebyadopting (16). Both k1 in HSCFVC and k in SCFVC have significant strategies which have not been considered here but can influences on performance. We select k1 from {10,20,30} be readily incorporated into our method. For example, in and k from {10,20,30,40} via cross-validation. k2 is fixed [39], the CNN model is fine-tuned. We can use the same to 10 for simplicity. λ in Eq. (16) is fixed to be 0.5 unless techniquetoimproveourperformance. otherwise stated. Throughout our experiments, we use the MIT-67 MIT-67 contains 6700 images with 67 indoor linearSVM[43]astheclassifier. scene categories. This dataset is very challenging because the differences between some categories are very subtle. 5.2 Mainresults The comparison of classification results is shown in Table Birds-200Birds-200isacommonlyusedbenchmarkforfine- 2. Again, we observe that the proposed HGMFVC and grained image classification which contains 11788 images SCFVCsignificantlyoutperformtraditionalGMMFVC.The of 200 different bird species. 
The experimental results on improvementfromHSCFVCandSCFVCtoGMM-FVCare this dataset are shown in Table 1. As can be seen, both the around 7% and 5% respectively. In addition, the HSCFVC proposed SCFVC and HSCFVC outperform the traditional achievessuperiorperformancethanSCFVCandSupC.This GMM-FVC by a large margin. The improvement can be again shows that HSCFVC is able to combine the benefit
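For concreteness, the guiding supervised encoder SupC compared above (Eq. (18), Section 4.4.3) amounts to a single projection followed by a hinge nonlinearity, sum-pooled over local features. The sketch below shows only the forward pass — in the paper, P and b are learned jointly with a logistic regressor via SGD, which is omitted here — and all shapes are illustrative, not the experimental settings.

```python
import numpy as np

def supc_encode(X, P, b):
    """Eq. (18): c = f(P^T x + b) with f(a) = max(0, a), applied row-wise.
    X holds one local feature per row."""
    return np.maximum(0.0, X @ P + b)

def supc_image_representation(X, P, b):
    """Sum-pool the coding vectors of all local features of one image."""
    return supc_encode(X, P, b).sum(axis=0)

rng = np.random.default_rng(3)
X = rng.standard_normal((196, 512))   # e.g. a 14x14 grid of local features
P = rng.standard_normal((512, 100))   # projection, learned with supervision
b = rng.standard_normal(100)
rep = supc_image_representation(X, P, b)
```

Because f clips negative responses, every coding vector is non-negative, and the pooled representation is simply the column sum of the per-feature codes.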


