Comprehensive Feature-based Robust Video Fingerprinting Using Tensor Model

Xiushan Nie, Yilong Yin, and Jiande Sun

Abstract—Content-based near-duplicate video detection (NDVD) is essential for effective search and retrieval, and robust video fingerprinting is a good solution for NDVD. Most existing video fingerprinting methods use a single feature or concatenate different features to generate video fingerprints, and they show good performance under single-mode modifications such as noise addition and blurring. However, when they suffer combined modifications, their performance degrades to a certain extent because such features cannot characterize the video content completely. By contrast, the assistance and consensus among different features can improve the performance of video fingerprinting. Therefore, in the present study, we mine the assistance and consensus among different features based on a tensor model, and we present a new comprehensive feature to fully use them in the proposed video fingerprinting framework. We also analyze what the comprehensive feature really is for representing the original video. In this framework, the video is initially set as a high-order tensor that consists of different features, and the video tensor is decomposed via the Tucker model with a solution that determines the number of components. Subsequently, the comprehensive feature is generated from the low-order tensors obtained from the tensor decomposition. Finally, the video fingerprint is computed using this feature. A matching strategy for narrowing the search, based on the core tensor, is also proposed. The robust video fingerprinting framework is resistant not only to single-mode modifications, but also to combinations of them.

1 INTRODUCTION
WITH the development of information technology, the number of digital videos available on the Web has increased explosively, and the openness of networks has made access to video content considerably easier and cheaper. Therefore, many illegal and useless video copies or near-duplicates appear on the Web, generated by simple reformatting, transformations, and editing. Illegal and useless copies result in user inconvenience when surfing the Internet. A Web user may want to search for an interesting video but may end up with many near-duplicate videos with low-quality images, which is disappointing and time-consuming. In addition, most of these copies are pirated and infringe on the video producers' copyright. Therefore, the presence of massive numbers of copies imposes a strong demand for effective near-duplicate video detection (NDVD) in many applications, such as copyright enforcement, online video usage monitoring, and video database cleansing. NDVD is a broad topic with several goals, such as finding copies of brief excerpts, partial-frame copies, and near-duplicates of entire clips. In this study, we focus on near-duplicates of entire clips without supporting partial near-duplicate detection.

Watermarking is a traditional technology used to detect copies of images or videos. It embeds imperceptible watermarks into the media to prove their authenticity. However, the watermarks embedded into the media may cause distortion. By contrast, robust fingerprinting techniques extract the most important features of the media to calculate compact digests that allow for efficient content identification without modifying the media. Therefore, robust video fingerprinting has attracted an increasing amount of attention in the field of NDVD.

Many video fingerprinting methods have been developed in recent years; global [1-6] and local feature-based [8-11] methods are the two primary types. In global feature-based methods, the video is represented by a compact global feature, such as color space [2], histograms [3], and block ordinal ranking [4]. The temporal [5] and transformation-based methods, for example 3D-discrete cosine transform [6] and non-negative matrix factorization [7], can also be considered global feature-based approaches [8], whereas the local feature-based methods use descriptors of the local regions around interest points, such as the Harris interest point detector [9], the scale-invariant feature transform (SIFT) [10], the centroid of gradient orientations (CGO) [11], and speeded-up robust features (SURF) [12]. Global features can achieve fast retrieval speed but are less effective in handling video copies with layers of editing, such as caption/logo insertion, letter-box, and shifting. Local features are more effective for most video editing; however, they usually require more computation. Recently, Li et al. [13] used graph model embedding to generate video fingerprints. Almeida et al. [14] calculated horizontal and vertical DCT coefficients from video frames as the video fingerprint. Sun et al. [15] proposed a temporally visual weighting method based on visual attention to produce video fingerprints. Nie et al. [16] used graphs and video tomography [17] to compute video fingerprints. Although these methods use different features, they can be roughly classified into global or local feature-based methods.

Generally, many state-of-the-art methods use only a single type of feature, global or local, to represent the corresponding video, which causes difficulty in resisting various video attacks. For example, Fig. 1(a) shows an original picture, while Fig. 1(b) shows the copy after some geometric and photometric variations. Obviously, the color and brightness distributions differ between these two pictures. If we only use a global feature such as a color or brightness histogram to generate the video fingerprint, we will obtain inaccurate results. By contrast, if we also consider local features such as Harris interest points, we may detect the copy accurately.

Fig. 1. Original picture and its copy: (a) Original; (b) Copy.

Combinations of multiple feature-based algorithms have been presented in recent years because of the limitations of individual features. Song et al. [18] proposed a video hashing learning method by weighted multiple feature fusion. Jiang et al. [19] proposed a video copy detection system based on two-layer temporal pyramid matching. Li et al. [20, 21] used SIFT, SURF, DCT coefficients, and weighted ASF to detect video copies, and made the final decision by fusing the detection results from different detectors. Although these methods combine multiple features to a certain extent, they do not fully use the assistance and consensus among features, which is important in multiple feature fusion.

Each type of feature reflects specific information about the original video and can be taken as one view of the original video. According to multi-view theory [36], the mutual information of different views (features) can improve the performance of the task. Therefore, a fusion of different features is beneficial to the performance of video fingerprinting. Intuitively, concatenating different feature vectors one by one is a direct form of feature fusion; moreover, combining multiple features with different weights is also a solution. However, the concatenated or weighted vector representation of features weakens the power of propagation among the multiple features and even ignores their relations to a certain extent. For example, Fig. 2 illustrates a simple scenario of this issue. In this figure, two classes of right triangles exist, and the triangles within each class are similar to each other (we can call them near-duplicates) because only affine transformations (scaling and rotation) are performed. We take the two sides of each right triangle as Feature 1 and Feature 2, respectively. Then, the concatenated feature [Feature 1, Feature 2] is labeled as Feature 3. We use a nonlinear combination, the cosine, as the fourth feature, where cosine = (Feature 1)/sqrt((Feature 1)² + (Feature 2)²). Obviously, the fourth feature, the cosine, shows the best classification performance, which proves that correlation and consensus that lead to performance improvement exist among different features.

Fig. 2. The illustration of single-feature, concatenating-feature, and nonlinear combined-feature fingerprinting.

Moreover, concatenated or weighted feature vectors are difficult to handle when the features have different scales. Imagine a scenario of people-similarity evaluation in which we use two different features, namely the height and weight of people. If we concatenate these two features into a new feature [height, weight], we may not evaluate the similarity of people correctly. For example, given two people with feature vectors [175 cm, 65 kg] and [160 cm, 80 kg], the first person is 15 cm taller but 15 kg lighter than the second person. Thus, evaluating whether these two people are similar is difficult. Therefore, we should explore a new strategy to fuse multiple features, one that can capture the inherent characteristics of the original data and the assistance between different features.

• X. S. Nie was with the School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan, 250014, China.
• Y. L. Yin was with the School of Computer Science and Technology, Shandong University, Jinan, 250100, China.
• J. D. Sun was with the School of Information Science and Engineering, Shandong Normal University, Jinan, 250014, China.
E-mail: [email protected], [email protected], [email protected]
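The Fig. 2 scenario above can be checked numerically. The following is a minimal sketch with made-up side lengths: under a pure scaling, the concatenated feature changes, while the nonlinear cosine combination stays fixed, so the two near-duplicates coincide in the cosine feature space.

```python
import math

def cosine_feature(f1, f2):
    # Nonlinear combination from the Fig. 2 example:
    # cosine = Feature 1 / sqrt(Feature 1^2 + Feature 2^2).
    return f1 / math.sqrt(f1 ** 2 + f2 ** 2)

# Two near-duplicate right triangles: the second is the first scaled by 2.
tri_a = (3.0, 4.0)   # (Feature 1, Feature 2)
tri_b = (6.0, 8.0)

# Feature 3 (the concatenation) separates the near-duplicates...
print(list(tri_a) == list(tri_b))  # False

# ...while the cosine feature maps both triangles to the same value.
print(cosine_feature(*tri_a), cosine_feature(*tri_b))  # 0.6 0.6
```

The same scale-invariance is what makes the nonlinear combination robust to the affine transformations that produce near-duplicates in this toy setting.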
In addition, the environment of a network is increasingly complex, and combined modifications applied to videos are increasingly common. One of the challenges in NDVD is robustness under combined modifications. Generally, when traditional methods such as single feature-based and concatenation-based methods suffer combined modifications, their performance is not as good as under single-mode modifications. In fact, we can model each feature as Eq. (1). Assume V is the principal content of a video; then

V = f(i) + e(i)    (1)

where f(i) and e(i) are the ith feature of the video and its error, respectively. When the video suffers different modifications, the error is large or small, which indicates bad or good robustness. As said, each feature is a view of the original data. According to the bound from multi-view analysis [37], the connection between the consensus of two hypotheses on two views and their respective error rates is as follows:

P(f(i) ≠ f(j)) ≥ max{P_err(f(i)), P_err(f(j))}    (2)

By this inequality, the probability of a disagreement between two features upper-bounds the error rate of either feature. Thus, by exploring the maximized agreement and consensus of different features, the error rate of each feature will be minimized. Therefore, mining the assistance and consensus among multiple features and making full use of them are necessary for NDVD.

To address the aforementioned issues, we propose a new notion called the comprehensive feature to represent video content in this study. The comprehensive feature is not an original video feature, but a comprehensive, intrinsic, and transformed feature that can capture the assistance and consensus among multiple features. Compared with other types of features, the comprehensive feature has two advantages (a detailed analysis is in Section 2.2.4): (1) it consists of the principal components and intrinsic characteristics of the original video content, which can maximize the consensus of different features; (2) it is a compact and comprehensive representation that eliminates noise and useless information among different features. In this study, we propose a general comprehensive feature-mining scheme based on a tensor model. The tensor is the natural generalization of the vector and the matrix, and the algebra of higher-order tensors can consider the context of different features. Furthermore, the tensor representation and decomposition can express intra-feature context correlation intuitively and propagate the corresponding inter-feature context conveniently as well [22], because the tensor decomposition is a mixed staggered sampling, i.e., alternating sampling over the different components occurs rather than sampling from the components one by one. Therefore, the tensor model is one of the best tools for comprehensive feature generation. In summary, the main contributions of this study are the following:

(1) We present a comprehensive feature-based scheme to capture the assistance and consensus among different features using the tensor model, and it is feasible to fuse any given features using the proposed scheme. In this scheme, the intrinsic and latent characteristics of multiple features are expressed completely through the comprehensive feature. We also show what we mined in the comprehensive feature, and give a theoretical analysis of its robustness.

(2) We propose an auxiliary matching strategy based on the tensor model. In this strategy, the core tensor is used to narrow the search range, and then an existing matching algorithm can be implemented on the smaller obtained dataset to find a match. This matching strategy can accelerate fingerprint matching to a certain extent, especially for a large-scale video fingerprint database.

Security and binarization are also considered in the video fingerprinting system. Randomization strategies [23, 24, 29] via secret keys are popularly used in fingerprinting systems to enhance security, such as randomly selected image blocks [23] and overlapping sub-video cubes [23, 24]. The final fingerprint sequence can also be quantized to binary values via secret keys, and video fingerprinting is also called video hashing [13], especially after such quantization, as in [6, 11, 18]. These common strategies can definitely be applied in the proposed method; for example, we can randomly select sub-video cubes of the original videos in the tensor decomposition, and quantize the final fingerprint vector into binary values via a key. These strategies and their analysis have been described in detail in the existing methods, so we do not discuss these issues in the present study.

2 PROPOSED METHOD

In this section, we present a robust video fingerprinting scheme based on the comprehensive feature using the tensor model, and then we show an example to prove its performance. To make the scheme more understandable, we first list certain notations regarding the tensor [25].

2.1 Related Notations and Technologies

Tensor. A tensor is a multidimensional array. Formally, an Nth-order tensor is an element of the tensor product of N vector spaces, each with its own coordinate system. A vector and a matrix are first- and second-order tensors, respectively. Tensors of order three or higher are called higher-order tensors.

Subarrays. Similar to matrices, whose subarrays are rows and columns, subarrays are commonly used in tensor analysis. The fiber and the slice are two subarrays of a tensor. A fiber is defined by fixing every index but one. Slices are two-dimensional sections of a tensor, defined by fixing all but two indices.

Matricization. Matricization is the process of transforming an Nth-order tensor into a matrix. The mode-n matricization of a tensor χ ∈ ℜ^(I1×I2×···×IN) is denoted by X(n), and it arranges the mode-n fibers to be the columns of the resulting matrix.

n-Mode Product. The n-mode product can also be called tensor multiplication. The n-mode product of a tensor χ ∈ ℜ^(I1×I2×···×IN) with a matrix U ∈ ℜ^(J×In) is denoted by χ ×n U, with size I1×I2×···×I(n−1)×J×I(n+1)×···×IN. That is,

(χ ×n U)_{i1···i(n−1) j i(n+1)···iN} = Σ_{in=1..In} x_{i1 i2···iN} u_{j,in}    (3)

where x and u are the elements of the tensor χ and the matrix U, respectively. Each mode-n fiber is multiplied by the matrix U; thus, the idea can also be expressed as follows:

η = χ ×n U ⇔ Y(n) = U X(n)    (4)

where X(n) and Y(n) are the mode-n matricizations of the tensors χ and η, respectively.

High-order tensors can be approximated by a sum of low-rank tensors (the definition of tensor rank is described in [25]). The CANDECOMP/PARAFAC (CP) and Tucker decompositions are two popular methods for approximation, which factorize a tensor into a sum of low-rank component tensors.

CP model. The CP model factorizes a tensor into a sum of rank-one component tensors.
Given a third-order tensor χ ∈ ℜ^(I1×I2×I3), it can be concisely expressed as

χ ≈ ⟦λ; A(1), A(2), A(3)⟧ = Σ_{r=1..R} λ_r a_r(1) ∘ a_r(2) ∘ a_r(3)    (5)

where R is a positive integer, a_r(n) ∈ ℜ^In, and λ_r is a constant for r = 1, ..., R. A(n) = [a_1(n), a_2(n), ..., a_R(n)], with 1 ≤ n ≤ 3, are the factor matrices of the tensor. The symbol "∘" represents the vector outer product.

Tucker model. The Tucker model decomposes a tensor into a core tensor multiplied by a matrix along each mode. For example, a third-order tensor χ can be decomposed as follows:

χ ≈ ⟦κ; A(1), A(2), A(3)⟧ = κ ×1 A(1) ×2 A(2) ×3 A(3) = Σ_{j1=1..J1} Σ_{j2=1..J2} Σ_{j3=1..J3} κ_{j1 j2 j3} a_{j1}(1) ∘ a_{j2}(2) ∘ a_{j3}(3)    (6)

where A(n) = [a_1(n), a_2(n), ..., a_{Jn}(n)] ∈ ℜ^(In×Jn), 1 ≤ n ≤ 3, In and Jn are positive integers, and κ is the core tensor; A(n) is the factor matrix.

Tensors have been applied in many fields, such as computer vision and signal processing. In this study, the video is considered a third-order tensor in the tensor space constructed by different features.

2.2 Comprehensive Feature Mining-based Robust Video Fingerprinting

In the proposed scheme, we first use different features to generate a video tensor. Then, the comprehensive feature is mined to generate the video fingerprint using the Tucker model. In addition, a matching strategy based on the core tensor is presented to accelerate fingerprint matching.

2.2.1 Video Tensor Construction

We let M be the number of features and x_m ∈ ℜ^(dm×1) the mth feature, where m = 1, 2, ..., M and dm is the dimensionality of the mth video feature. The video is considered a third-order tensor constructed from the M features, as shown in Fig. 3.

Fig. 3. The diagram of the video tensor.

In the video tensor, the frontal slices hold the multiple feature values, whereas mode-3 is the time sequence. The length of each feature vector and the total number of features determine the size of the frontal slice. One feasible strategy is to take the mean length of all the feature vectors as the standard length, and then concatenate all feature vectors column by column in the frontal slice, with the standard length for each column. Then, we use tensor decomposition to generate the comprehensive feature.

2.2.2 Tensor Model Selection

The CP and Tucker models are the two main tensor decomposition models. We use the Tucker model in the proposed framework. The third-order Tucker model represents the data spanning the three modes by the vectors given by the columns of A(1), A(2), and A(3), as shown in Eq. (6). As a result, the Tucker model encompasses all possible linear interactions between vectors pertaining to the various modes of the data [28]. The CP model is a special case of the Tucker model where the size of each mode is the same, i.e., J1 = J2 = J3 in Eq. (6). In this study, we used the Tucker model instead of the CP model. The first reason is that the Tucker model is more flexible and scalable during decomposition, because it allows users to select different numbers of factors along each mode. The CP decomposition can only provide a particular tensor with a particular number of components; when the component loadings are highly correlated in all the modes, the results of the CP decomposition are mathematical artifacts that are not physically meaningful. There are strong effects among the various components decomposed by the CP model, which make the CP model unstable, slow to converge, and difficult to interpret [30]. More importantly, a core tensor, which considers all possible linear interactions between the components of each mode, can be obtained by the Tucker model, and it is stable [25, 30] to a certain extent. The core tensor is used for matching in the proposed framework.

Choosing the number of components for each mode (i.e., the values of J1, J2, and J3 in Eq. (6)) is a particular challenge in the Tucker model. In the present study, we use a Bayesian approach called automatic relevance determination (ARD), described in [28], to determine the number of components in each mode. ARD alternately optimizes the parameters to discover which components are relevant, and the parameters are modeled with either a Gaussian prior or a Laplace prior. The Gaussian prior P_G and the Laplace prior P_L on the parameter θ_d are the following:

P_G(θ_d | α_d) = Π_j (α_d/2π)^(1/2) exp(−(α_d/2) θ_{j,d}²)    (7)

P_L(θ_d | α_d) = Π_j (α_d/2) exp(−α_d |θ_{j,d}|)    (8)

We now describe how we use ARD to determine the values of J1, J2, and J3 in the proposed method. We let χ be the third-order video tensor; then, for the Tucker decomposition,

χ ≈ Γ = κ ×1 A(1) ×2 A(2) ×3 A(3)    (9)

The model can also be written elementwise as

χ_{i1,i2,i3} ≈ Γ_{i1,i2,i3} = Σ_{j1,j2,j3} κ_{j1,j2,j3} A(1)_{i1,j1} A(2)_{i2,j2} A(3)_{i3,j3}    (10)

where κ ∈ ℜ^(J1×J2×J3) and A(n) ∈ ℜ^(In×Jn). If we denote this Tucker model as T(J1, J2, J3), our goal is to find the optimal values of J1, J2, and J3. According to Eq. (9), the tensor χ can also be rewritten as

χ = Γ + ξ    (11)

where ξ is an error term whose distribution can be taken as independent and identically distributed (i.i.d.) Gaussian noise, i.e.,

P(ξ) = P(χ | Γ, σ) = Π_{i1,i2,i3} (1/√(2πσ²)) exp(−(χ_{i1,i2,i3} − Γ_{i1,i2,i3})² / (2σ²))    (12)

Therefore, we can explore the optimal J1, J2, and J3 by minimizing the least-squares objective ||χ − Γ||²_F. In the Bayesian framework, this corresponds to minimizing the negative log-likelihood. According to Eqs. (7) and (8), we obtain P_G(A(n) | α(n)), P_L(A(n) | α(n)), P_G(κ | ακ), and P_L(κ | ακ) from the Gaussian or Laplace prior, respectively, where 1 ≤ n ≤ 3. The parameters α(n) and ακ are given uniform priors for simplification in ARD. As a result, the posterior can be written as

L = P(κ, A(1), A(2), A(3) | χ, σ, ακ, α(1), α(2), α(3)) ∝ P(χ | Γ, σ²) P(κ | ακ) P(A(1) | α(1)) P(A(2) | α(2)) P(A(3) | α(3))    (13)

Subsequently, the negative log-likelihoods (−log L) under the Gaussian and Laplace priors are proportional to Eqs. (14) and (15), respectively:

C + (1/(2σ²)) ||χ − Γ||²_F + 0.5 (Σ_n Σ_d α_d(n) ||A_d(n)||²_F + ακ ||κ||²_F) + 0.5 I1 I2 I3 log σ² − 0.5 Σ_n Σ_d In log α_d(n) − 0.5 J1 J2 J3 log ακ    (14)

C + (1/(2σ²)) ||χ − Γ||²_F + Σ_n Σ_d α_d(n) ||A_d(n)||_1 + ακ ||κ||_1 + 0.5 I1 I2 I3 log σ² − Σ_n Σ_d In log α_d(n) − J1 J2 J3 log ακ    (15)

where C, ||·||_F, and ||·||_1 are a constant value, the F-norm, and the 1-norm, respectively.

We equate the derivatives of Eq. (14) with respect to σ², α_d(n), and ακ to zero, respectively. We then obtain the following parameters under the Gaussian prior:

σ² = ||χ − Γ||²_F / (I1 I2 I3),  α_d(n) = In / ||A_d(n)||²_F,  ακ = J1 J2 J3 / ||κ||²_F    (16)

The parameters under the Laplace prior can be obtained in the same way by equating the derivatives of Eq. (15) to zero, which gives

σ² = ||χ − Γ||²_F / (I1 I2 I3),  α_d(n) = In / ||A_d(n)||_1,  ακ = J1 J2 J3 / ||κ||_1    (17)

Generally, the first-row expressions of Eqs. (14) and (15) can be considered l2-regularized and l1-regularized problems, respectively, while the remaining terms are the normalization constants of the likelihood. The l2-regularized and l1-regularized problems are equivalent to regular ridge regression and sparse regression problems, respectively, which can be solved by existing algorithms. In Eqs. (16) and (17), σ² can be learned from the data or estimated by a signal-to-noise ratio (SNR)-based method [28]. Given an initialization of the parameters and A(n), the optimal values of J1, J2, and J3 are obtained by alternately estimating α_d(n), ακ, κ, and A(n) based on Eqs. (16)-(17) and the solutions of the two regularized problems until convergence. Which prior assumption (Gaussian or Laplace) is used to update the parameters corresponds to setting the parameters of the priors such that they match the posterior distribution [38].

2.2.3 Comprehensive Feature Mining

After the video tensor decomposition by the Tucker model, three low-rank factor matrices A(n) (n = 1, 2, 3), which fuse the different video features, are obtained. To make the final fingerprint vector more compact, we average the absolute values of the elements of each factor matrix A(n) by row to obtain a component vector, and then concatenate the three component vectors to obtain the comprehensive feature y, which is considered the final video fingerprint vector. The step-by-step description of the proposed framework is provided in Algorithm 1.

Algorithm 1 Comprehensive feature-based video fingerprinting
Input: Query video Vq.
Output: A video fingerprint vector y.
(1) Video tensor construction: Given multiple features such as global, local, and temporal features, a video tensor χ is constructed with size I1×I2×I3.
(2) Tensor decomposition: The video tensor is decomposed by the Tucker model with the optimized number of components selected by the ARD-based algorithm, and we obtain three modes of components, A(1) = (a(1)_{i1 j1})_{I1×J1}, A(2) = (a(2)_{i2 j2})_{I2×J2}, and A(3) = (a(3)_{i3 j3})_{I3×J3}, respectively.
(3) Fingerprint calculation: We average the absolute values of the elements of each component matrix A(n) (n = 1, 2, 3) by row to obtain a component vector, and then concatenate the three component vectors to obtain the comprehensive feature vector y, which is taken as the video fingerprint vector:
y = [ (1/J1) Σ_{j1=1..J1} |a(1)_{i1 j1}| ; (1/J2) Σ_{j2=1..J2} |a(2)_{i2 j2}| ; (1/J3) Σ_{j3=1..J3} |a(3)_{i3 j3}| ]

2.2.4 Analysis of Comprehensive Feature

What we mine in the comprehensive feature is important in evaluating the performance of the proposed method. As mentioned, the Tucker model is used in comprehensive feature mining. We first review the process of Tucker decomposition.
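Assuming the ranks (J1, J2, J3) have already been chosen, the pipeline of Algorithm 1 can be sketched end-to-end in NumPy. This is a simplified stand-in, not the paper's exact procedure: a one-pass HOSVD (each A(n) taken as the Jn leading left singular vectors of the mode-n unfolding, cf. the SVD-based update of the alternating scheme in Section 2.2.4) replaces the ARD-optimized Tucker fit, and the core tensor is formed by multiplying the transposed factors onto the tensor.

```python
import numpy as np

def unfold(T, n):
    # Mode-n matricization: the mode-n fibers become the columns.
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def tucker_factors(T, ranks):
    # HOSVD stand-in for the ARD-optimized Tucker fit:
    # A(n) = the Jn leading left singular vectors of the mode-n unfolding.
    factors = []
    for n, J in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, n), full_matrices=False)
        factors.append(U[:, :J])
    return factors

def core_tensor(T, factors):
    # kappa ~= chi x1 A(1)^T x2 A(2)^T x3 A(3)^T (n-mode products).
    K = T
    for n, A in enumerate(factors):
        K = np.moveaxis(np.tensordot(A.T, K, axes=(1, n)), 0, n)
    return K

def fingerprint(factors):
    # Algorithm 1, step (3): row-wise mean of |A(n)|, concatenated.
    return np.concatenate([np.abs(A).mean(axis=1) for A in factors])

# Toy video tensor (I1 x I2 feature slices over I3 frames), assumed ranks (2, 2, 2).
chi = np.random.default_rng(1).random((6, 5, 4))
A = tucker_factors(chi, (2, 2, 2))
print(core_tensor(chi, A).shape, fingerprint(A).shape)  # (2, 2, 2) (15,)
```

The resulting vector has length I1 + I2 + I3, one averaged component vector per mode, matching the concatenation in step (3).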
Given a video tensor χ ∈ ℜ^(I1×I2×I3), it can be decomposed as

χ ≈ κ ×1 A(1) ×2 A(2) ×3 A(3) = ⟦κ; A(1), A(2), A(3)⟧    (18)

where each A(n) (1 ≤ n ≤ 3) is an orthogonal matrix and κ is the core tensor. For distinct modes in a series of multiplications, the order of the multiplications is irrelevant [25]. Therefore, the core tensor κ is

κ ≈ χ ×1 A(1)T ×2 A(2)T ×3 A(3)T    (19)

where A(n)T is the transpose of A(n). We can use the following optimization problem to obtain κ and A(n):

min_{κ, A(1), A(2), A(3)} ||χ − ⟦κ; A(1), A(2), A(3)⟧||²
s.t. κ ∈ ℜ^(J1×J2×J3), A(n) ∈ ℜ^(In×Jn) and column-wise orthogonal    (20)

Consequently, the objective function of Eq. (20) can be rewritten as Eq. (21):

||χ − ⟦κ; A(1), A(2), A(3)⟧||²
= ||χ||² − 2⟨χ, ⟦κ; A(1), A(2), A(3)⟧⟩ + ||⟦κ; A(1), A(2), A(3)⟧||²
= ||χ||² − 2⟨χ ×1 A(1)T ×2 A(2)T ×3 A(3)T, κ⟩ + ||κ||²
= ||χ||² − 2⟨κ, κ⟩ + ||κ||²
= ||χ||² − ||κ||²
= ||χ||² − ||χ ×1 A(1)T ×2 A(2)T ×3 A(3)T||²    (21)

Furthermore, the optimization problem of Eq. (21) is equivalent to the following maximization problem, because ||χ||² is constant:

max_{A(n)} ||χ ×1 A(1)T ×2 A(2)T ×3 A(3)T||² ⇔ max_{A(n)} ||A(n)T W||
s.t. W = X(n) (A(3) ⊗ ··· ⊗ A(n+1) ⊗ A(n−1) ⊗ ··· ⊗ A(1)), 1 ≤ n ≤ 3    (22)

where "⊗" is the Kronecker product and we set A(4) = A(0) = I. This maximization can be solved using the ALS algorithm by setting A(n) to the Jn leading left singular vectors of W, and it will converge to a solution [35].

The final low-rank factor matrix A(n) is computed from the left singular vectors of W, which consists of the original video tensor χ. According to the properties of the singular value decomposition (SVD), the left matrix obtained from the SVD contains the intrinsic characteristics of the original matrix in the column space, which consists of the different features. From another perspective, each column of W is an approximation of a feature vector. If each feature vector is taken as a point in the original space, the principal components that represent the main information of the point distribution can be computed by principal component analysis (PCA), and they are obtained from the left singular matrix of W according to the relation between PCA and SVD. Therefore, the comprehensive feature mined from A(n) can be considered the intrinsic and principal information of the original data, where the noise and uselessness among the different features have been eliminated. Generally, the different features should reach a consensus in representing the video content [36]. The intrinsic and principal information of the original video exploited during comprehensive feature mining is only the consensus and assistance of the different features, which leads to improved performance in representing the video content.

Considering the preceding analysis, we can conclude that the information exploited during comprehensive feature mining is composed of the principal components and intrinsic characteristics of the original video tensor. However, comprehensive feature mining is different from SVD or PCA because the extraction of the principal components and intrinsic characteristics in comprehensive feature mining is iterated until convergence, whereas it is one-off in SVD or PCA. The iterative process and alternating optimization capture more intrinsic information and consensus, which is the reason we use the tensor model during comprehensive feature mining.

2.2.5 Robustness of Comprehensive Feature

Robustness is the most important issue in a video fingerprinting system. In this section, we present a qualitative analysis of the proposed framework. The experimental results in the next section prove the robustness of the comprehensive feature.

The video fingerprint vector of the proposed method is the comprehensive feature, which consists of the three modes of the video tensor. Therefore, without loss of generality, the mode-3 factor matrix A(3) is taken as an example to analyze the robustness. Given that χ is the tensor of a video, the approximation of this tensor through the Tucker model is

χ ≈ κ ×1 A(1) ×2 A(2) ×3 A(3) = γ ×3 A(3)    (23)

where γ = κ ×1 A(1) ×2 A(2). According to tensor multiplication and matricization, Eq. (23) can be written in matrix form:

X ≈ A(3) P    (24)

where X and P are the mode-3 matricizations of χ and γ, respectively. X is the multiple-feature matrix in the proposed method. Furthermore, we rewrite Eq. (24) as

PT A(3)T ≈ XT    (25)

Obviously, we can approximately consider Eq. (25) as a linear equation, Eq. (26), where PT, A(3)T, and XT are the coefficient matrix, the variable, and the constant term, respectively:

PT A(3)T = XT    (26)

Now, we can consider the robustness of the comprehensive feature A(3) as the stability of the solution of this linear equation. Therefore, if the solution of Eq. (26) is stable, A(3) is robust. The metric that measures the solution stability of a linear equation is the condition number of the coefficient matrix. Smaller condition numbers mean stronger stability of the solution. The condition number of the coefficient matrix in Eq. (26) is

condition2(PT) ≈ condition2((κ ×1 A(1) ×2 A(2))T)    (27)

where condition2(·) is the 2-norm condition number, defined as condition2(M) = ||M||2 ||M⁻¹||2. A(1) and A(2) are the factor matrices after the Tucker decomposition, which are column-wise orthogonal [25]. According to the property of the condition number that condition2(UM) = condition2(M) if U is an orthogonal matrix, Eq. (27) can be rewritten as

condition2((κ ×1 A(1) ×2 A(2))T) = condition2((A(2))T × (A(1))T × κT) = condition2(κT)    (28)

As mentioned, κT is the transposition of the core tensor, and its matricization is a non-singular matrix. Generally, its condition number is small. Fig. 4 shows the distribution of the condition numbers for a video dataset: most of the condition numbers are between 2 and 18, which are acceptable values for a well-conditioned problem. Therefore, Eq. (26) is a well-conditioned equation, and it has a stable solution when the coefficient matrix or the constant term changes. In other words, A(3), which is used to generate the final video fingerprint, is robust.

Fig. 4. Condition numbers of the transpositions of the video core tensors.

2.2.6 Matching Strategy

Fingerprint vector matching is one of the important steps in a video fingerprinting system. In a real application, the video website manages a video fingerprint database. When a user uploads a new video, the management first generates its fingerprint and then matches it against the fingerprints in the database. Many matching schemes exist in the literature, such as exhaustive matching, tree-based strategies, and inverted files. In this study, we provide an auxiliary strategy for the existing matching schemes. The goal of this strategy is to narrow the matching range in advance, so that the existing matching schemes can be executed on an obtained dataset of smaller size compared with the original. The proposed matching strategy is based on the core tensor κ, which represents the level of interaction between the different components.

In this strategy, we use the sum of the elements of the core tensor as a match tag, and pre-matching is conducted in the database using this tag. Given a query video fingerprint with match tag Tq, the fingerprints whose match tags lie between Tq − a·Tq and Tq + a·Tq are retrieved from the database in the pre-matching. Because the length of the match tag is one, the pre-matching is rapid. The parameter a ∈ [0, 1] is called the adjustment factor. We show how to select the adjustment factor in the next paragraph.

Given that S is the fingerprint dataset obtained after the pre-matching, Ps is the rate at which visually similar videos (the true near-duplicates we want to find) fall into S, while Pnd is the rate at which visually different videos (which are not near-duplicates) do not fall into S. Intuitively, we would like both Ps and Pnd to be high. However, a larger a causes a higher Ps but a lower Pnd. Conversely, a smaller a leads to a higher Pnd but a lower Ps. Therefore, a tradeoff between Ps and Pnd should be considered when choosing the adjustment factor a. Given a video dataset, we can use the curves of Ps and Pnd versus a to choose the adjustment factor. Fig. 5 shows Ps and Pnd versus the adjustment factor a for a video database. The intersection of the two curves is a good choice for the adjustment factor. However, we cannot use that point, because we should first consider avoiding a mismatch in the pre-matching, i.e., the fingerprint vectors similar to the query should be included in S as much as possible, and Ps is not sufficiently high (less than 90%) at the intersection, which leads to mismatches. Therefore, we choose a giving priority to Ps. Taking Fig. 5 as an example, we choose a = 0.3, where Ps is almost 1 and Pnd is more than 0.6. In other words, almost all of the fingerprint vectors of visually similar videos are included in S after the pre-matching, while less than 40% of the visually different fingerprint vectors fall into S. Therefore, the size of the dataset after pre-matching is reduced. Further matching can be performed by any existing search technology. Generally, assuming that N videos exist in the original database, the complexity of the core-tensor matching is reduced to O(K), where K is the number of fingerprint vectors after the pre-matching, and K < N.

Fig. 5. The values of Ps and Pnd under different adjustment factors.

2.2.7 Implementation of Proposed Scheme

In this section, we show an implementation of the proposed scheme using global, local, and temporal video features, i.e., M = 3. These three types of features are described as follows. However, fusing any type of features in the proposed scheme is feasible. In addition, we can also implement feature selection strategies before inputting the features to the system, to eliminate some similar or conflicting features, but this issue is beyond the scope of the proposed framework.

Histograms of different patterns, such as color, ordinal, and block-gradient histograms, are common global features. We use normalized gray histograms as the global feature. Other global features can also be used in the proposed method.

Many descriptors of interest points can be taken as local features. In this study, we use the SURF descriptor as the local feature, which is invariant to scale, rotation, and brightness variation [26]. Each SURF point in an image is associated with a 64-dimensional descriptor, which focuses on the spatial distribution of gradient information within the interest-point neighborhood.

In addition, we also compute the differences of the pixel histograms between adjacent frames and then normalize them as a feature, which represents the variance along the time evolution. This feature can be considered a kind of global feature; however, we list it separately in this study because it shows the temporal information of a video.

After extracting the multiple features of the video, we take them as inputs to the proposed framework.

The following existing methods are compared in the experiments.

LRTA [23, 24]: This method proposed a tensor-based hashing method (also called LRTA) and used spatial pixel values as the feature, which is a single feature type along the time axis, to generate the tensor. This method is not good at resisting local video editing, such as caption/logo insertion, picture-in-picture, letter-box, and shifting, which are common manipulations on UGC sites. It implements the tensor decomposition with the CP model, which has certain limitations.

CGO [11]: This method first divides a video frame into blocks and then combines the direction information of the pixels within each block into a single average or centroid statistic named CGO. Finally, the centroid statistics from each video frame are concatenated to form the final video fingerprint.

3-D DCT [6]: This method takes the video as a 3D sequence and uses the DCT to generate a fingerprint vector.

3.2 Performance Evaluation

To evaluate the performance of the proposed scheme, quantitative and statistical evaluations are shown in this section, where the F-score and the receiver operating characteristic (ROC) curve are used, respectively.
comprehensive feature is obtained for the final fingerprint vector.TheexperimentalresultsareshowninSection3. 3.2.1 QuantitativeEvaluation Robustness and discrimination are important for a video fingerprinting system. Generally, robustness and discrimi- 3 EXPERIMENTAL RESULTS AND ANALYSIS nation are expressed by recall and precision [32], respec- 3.1 ExperimentalSetting tively.Recallisthesameasthetruepositiverate(ameasure ofrobustnessofthesystem),whereasprecisionisameasure In the experiment, as mentioned, we take the global fea- ofdiscriminationandisdefinedasthepercentageofcorrect ture (normalized 64-bin histograms), local feature (SURF hitswithinalldetectedcopies.Basedonrecallandprecision, points) and temporal feature (the differences of 64-bin F-score which is a single combined metric is taken as a histograms between adjacent frames) as the inputs of the quantitativemeasureinthisstudy,andisdefinedasfollows scheme. Extensive experiments were conducted to evalu- [32]: ate the performance of the proposed scheme. The origi- P P nal videos were downloaded from the CC WEB VIDEO Fβ =(1+β2)β2Pp∗+rP (29) Dataset (vireo.cs.cityu.edu.hk/webvideo) and OV Dataset p r (www.open-video.org), and then we apply different mod- where Pp and Pr are precision and recall rate. β is a ifications on each video. Thus, almost 20,000 videos are parameter that defines how much weight should be given used in our database for the experiments. A total of 11 torecallversusprecision.F-scoreisbetween0and1,with1 single-modemodifications/attackswereappliedonthetest representingaperfectperformancethatiscompletelyrobust videos: (1) rotation; (2) Additive Gaussian White Noise and completely discriminant (100% precision and 100% (AGWN); (3) blurring; (4) contrast enhancement; (5) letter- recall). A low F-score close to 0 represents a poor system box; (6) logo/caption insertion; (7) cropping; (8) flipping; in terms of both robustness and discrimination. 
We used (9) picture in picture; (10) affine transformation; and (11) β = 0.5 in the experiments, which provides twice as much re-sampling. Some of the modifications are shown in Fig. importance to precision as to recall, because a video fin- 6. The characterization or parameters of various modi- gerprintingsystemshouldhavehighprecisiontominimize fications are provided in Table 1. Moreover, to evaluate the amount of humaninteraction required [32]. A larger F- the performance of the proposed scheme under combined score means better performance of the algorithm. The F- modifications, we applied four types of combinations of scoreresultsundervariousmodificationsareshowninTable more than one modification according to the ones defined 2. In contrast to other methods, the proposed method is by TRECVID [31], such as: (1) decrease in quality 1 (con- better under most modifications. The improvement under trast+change of gamma+AGWN); (2) decrease in quality combinedmodificationislargerthantheonesundersingle- 2 (contrast+change of gamma+AGWN+blur+frame drop- modemodification. ping); (3) post-production 1 (crop+flip+insertion of pat- terns);and(4)post-production2(crop+flip+insertionofpat- 3.2.2 StatisticalEvaluation terns+pictureinpicture+shift). The miss and false alarm probability are considered in the We compared the proposed scheme with the LRTA ROC curve. The miss probability is defined as the proba- method [23, 24]. Simultaneously, comparisons were also bility of true copies without detection, whereas false alarm madetotwoclassicalalgorithms,labeledtheCGO[11],and probability is defined as the probability of false positive the3-DDCT[6]methods. 
copies that are actually negative cases. Fig. 7 shows the ROC (log) curves under different attacks, and the performance of the proposed scheme is better than that of the other methods under almost all modifications, especially under some popular manipulations in user-generated videos and video post-production, such as caption insertion and picture-in-picture. In addition, the performance shows a significant improvement under the combined modifications, which can be observed in Fig. 7 (l)-(o). However, the performance under AWGN is not as good as the others. The main reason is that some noisy points are wrongly considered interest points by the SURF detector.

Fig. 6. The modifications on the original frame.

TABLE 1
Description of video attacks

Attacks | Parameter setting
Rotation | 5 deg counterclockwise
AGWN | σN = 110
Blurring | motion blur with 10 pixels
Contrast enhancement | 1% of data is saturated at low and high intensities
Letter-box | 10% of pixels are replaced by a black box at the top and bottom of the frame
Caption insertion | insert a line of text at the bottom of the frame
Picture in picture | insert a different picture with size 100x100 in the frame
Frame cropping | about 25% cropping
Frame re-sampling | about 5% of frames changing
Affine transformation | transformation matrix [1 0 0; 0.5 1 0; 0 0 1]

TABLE 2
F-score results

Modifications | Proposed | LRTA | 3D-DCT | CGO
Letter-box | 0.9715 | 0.9253 | 0.9708 | 0.9112
Logo insertion | 0.9476 | 0.8908 | 0.9234 | 0.9046
Noise | 0.9294 | 0.9820 | 0.9989 | 0.9083
Caption insertion | 0.9883 | 0.9618 | 0.9627 | 0.9411
Contrast | 0.9788 | 0.9403 | 0.9731 | 0.9784
Picture in picture | 0.9786 | 0.9612 | 0.9251 | 0.9411
Cropping | 0.8015 | 0.8226 | 0.6527 | 0.7036
Flipping | 0.9063 | 0.9051 | 0.6403 | 0.5464
Frame changing | 0.8428 | 0.8423 | 0.7148 | 0.4680
Combined modification 1 | 0.9655 | 0.9279 | 0.9097 | 0.9263
Combined modification 2 | 0.8725 | 0.8292 | 0.5745 | 0.5556

3.3 Threshold Analysis

In a real application, the video fingerprinting system should decide whether the query video is a modified copy. A common method is to set a threshold $\tau$ in advance. Based on the assumption that the original and query fingerprints are $F(V)$ and $F(V_A)$, respectively, a decision is made as follows:

$$\begin{cases} \|F(V) - F(V_A)\|_2 \le \tau & V_A \text{ is a copy of } V \\ \|F(V) - F(V_A)\|_2 > \tau & V_A \text{ and } V \text{ are different} \end{cases} \qquad (30)$$

The threshold is important in a real video fingerprinting system. A smaller threshold reduces the false alarm probability but increases the miss probability. By contrast, a larger threshold causes a lower miss probability, but the false alarm probability may be higher as a result. Therefore, the choice of the threshold should be considered carefully in a real system. To analyze the choice of this threshold, we first list some symbol notations in Table 3.

TABLE 3
Notations

Symbol | Definition
$w$ | L2-norm of the fingerprint distance between two fingerprints of visually different video contents
$v$ | L2-norm of the fingerprint distance between two fingerprints of visually similar video contents
$\tau$ | The threshold
$P(\bullet)$ | Distribution of $\bullet$

We assume that $w$ and $v$ are i.i.d. with Gaussian noise, because the distances between pairs of fingerprint vectors of video contents are independent of each other.
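The decision rule in Eq. (30) reduces to comparing an L2 distance between fingerprint vectors against the threshold $\tau$. A minimal sketch of this rule follows; the fingerprint values and the threshold are made-up illustrative numbers, not values from the paper's database:

```python
import numpy as np

def is_copy(f_orig, f_query, tau):
    """Decide whether a query video is a copy of the original by
    thresholding the L2 distance between fingerprint vectors (Eq. 30)."""
    dist = np.linalg.norm(np.asarray(f_orig) - np.asarray(f_query))
    return bool(dist <= tau)

# Toy fingerprints (illustrative values only).
f_v = np.array([0.12, 0.80, 0.33, 0.55])   # original video fingerprint
f_near = f_v + 0.01                        # slightly modified copy
f_other = np.array([0.90, 0.10, 0.70, 0.05])  # visually different video

print(is_copy(f_v, f_near, tau=0.1))   # small distance -> True (copy)
print(is_copy(f_v, f_other, tau=0.1))  # large distance -> False (different)
```

In a real system the same rule would be applied after the pre-matching step has narrowed the candidate set.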
Therefore, $w$ and $v$ can be modeled as

$$P(w) \sim N(\mu_w, \sigma_w) = \frac{1}{\sqrt{2\pi}\,\sigma_w}\, e^{-\frac{(w-\mu_w)^2}{2\sigma_w^2}} \qquad (31)$$

$$P(v) \sim N(\mu_v, \sigma_v) = \frac{1}{\sqrt{2\pi}\,\sigma_v}\, e^{-\frac{(v-\mu_v)^2}{2\sigma_v^2}} \qquad (32)$$

To prove the preceding distribution model assumption, we conducted an experiment on the video database under different modifications. Figs. 8 (a) and (b) show the values of $v$ and $w$, respectively. The bars of $v$ and $w$ are depicted in Figs. 8 (c) and (d), and the assumption of a normal distribution approximately fits the real data.

According to the model assumptions, we plot the fitting v

Fig. 7. The performance under different attacks (miss probability versus false alarm probability, log scale): (a) Rotation; (b) AWGN; (c) Letter-box; (d) Caption insertion; (e) Cropping; (f) Flipping; (g) Picture in picture; (h) Affine transformation; (i) Contrast enhancement; (j) Logo substitution; (k) temporal re-sampling; (l) contrast + change of gamma + AGWN; (m) contrast + change of gamma + AGWN + blur + frame dropping; (n) crop + flip + insertion of patterns; (o) crop + flip + insertion of patterns + picture in picture + shift.
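Under the Gaussian models of Eqs. (31) and (32), the miss and false alarm probabilities at a given threshold $\tau$ follow directly from the normal CDF: a miss occurs when a similar pair's distance $v$ exceeds $\tau$, and a false alarm when a different pair's distance $w$ falls below $\tau$. A sketch follows; the fitted parameters $\mu_v, \sigma_v, \mu_w, \sigma_w$ are hypothetical placeholders, not values measured from the paper's database:

```python
import math

def normal_cdf(x, mu, sigma):
    # CDF of N(mu, sigma^2), via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical fitted parameters; in practice they come from fitting
# Eqs. (31)-(32) to the measured distances v and w.
mu_v, sigma_v = 0.2, 0.05   # distances between visually similar pairs
mu_w, sigma_w = 0.8, 0.10   # distances between visually different pairs

def miss_probability(tau):
    # P(v > tau): a true copy whose distance exceeds the threshold
    return 1.0 - normal_cdf(tau, mu_v, sigma_v)

def false_alarm_probability(tau):
    # P(w <= tau): a different video whose distance falls under the threshold
    return normal_cdf(tau, mu_w, sigma_w)

for tau in (0.3, 0.4, 0.5):
    print(tau, miss_probability(tau), false_alarm_probability(tau))
```

Sweeping $\tau$ this way traces out the tradeoff described above: raising the threshold lowers the miss probability and raises the false alarm probability.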
