CO-REGULARIZED DEEP REPRESENTATIONS FOR VIDEO SUMMARIZATION

Olivier Morère*,1,2,3, Hanlin Goh*,1,3, Antoine Veillard*,2,3, Vijay Chandrasekhar1,3, Jie Lin1,3
1 I2R, 2 UPMC, 3 IPAL

* O. Morère, H. Goh and A. Veillard contributed equally to this work.
1. Institute for Infocomm Research, A*STAR, Singapore.
2. Université Pierre et Marie Curie, Paris, France.
3. Image & Pervasive Access Lab, UMI CNRS 2955, Singapore.

arXiv:1501.07738v1 [cs.CV] 30 Jan 2015

ABSTRACT

Compact keyframe-based video summaries are a popular way of generating viewership on video sharing platforms. Yet, creating relevant and compelling summaries for arbitrarily long videos with a small number of keyframes is a challenging task. We propose a comprehensive keyframe-based summarization framework combining deep convolutional neural networks and restricted Boltzmann machines. An original co-regularization scheme is used to discover meaningful subject-scene associations. The resulting multimodal representations are then used to select highly relevant keyframes. A comprehensive user study is conducted comparing our proposed method to a variety of schemes, including the summarization currently in use by one of the most popular video sharing websites. The results show that our method consistently outperforms the baseline schemes for any given number of keyframes, both in terms of attractiveness and informativeness. The lead is even more significant for smaller summaries.

Index Terms — Video summarization, deep convolutional neural networks, co-regularized restricted Boltzmann machines

Fig. 1. Deep co-regularized keyframe summary. Our method extracts diverse, representative and attractive keyframes.

1. INTRODUCTION

Video sharing websites measure user engagement through click rates and viewership. To make a novel video attractive for the audience, its video link is often presented as a thumbnail of either a single representative frame or a slideshow of several keyframes. In this work, we explore the problem of automatically generating diverse, representative and attractive keyframe-based summaries for videos.

Summarization techniques can be broadly divided into three categories: 1) keyframe-based, 2) skimming-based and 3) story-based. In keyframe-based summarization, the video is summarized using a small number of keyframes selected based on some criterion, such as low-level features like pixel data, motion features, optical flow and frame differences [1, 2, 3], or higher-level information like objects and faces [4, 5]. For this class of algorithms, clustering techniques such as k-means are popular: clustering or grouping is performed based on raw RGB pixels, or a combination of low- and high-level features [6, 7, 8, 9, 10]. The frames closest to the cluster centers are chosen to be part of the summary.

Skimming-based summarization is used to produce longer video summaries. The video is divided into smaller shots using shot boundary detection algorithms, and a series of shots is selected to form the summary video. Subshot selection is based on motion activity [11, 12, 13] and other high-level features, such as person and landmark descriptors [14].

Finally, in story-based summarization, algorithms take into account relationships between the different subshots [15]. This enables long egocentric videos to be summarized to gain an understanding of the underlying events.

Contributions. This work focuses on generating compact keyframe-based summaries, with the main contributions as follows:
• A comprehensive keyframe-based summarization framework combining deep convolutional neural networks (DCNNs) and restricted Boltzmann machines (RBMs).
• A co-regularization scheme for RBMs able to learn joint high-level subject-scene representations.
• A comprehensive user study comparing our method against various schemes, including the algorithm in use by the video sharing website Dailymotion.
2. CO-REGULARIZED DEEP REPRESENTATIONS

A good keyframe-based summary should consist of easily recognizable subjects in context-setting scenes. To achieve this, we generate frame-level descriptions by exploiting deep convolutional architectures to recognize subjects and scenes. Compact representations are then computed with a novel unsupervised co-regularization learning scheme to exhibit the high-level associations between subjects and scenes. Keyframes are subsequently generated from these compact representations.

2.1. Deep Convolutional Neural Networks

Deep convolutional neural networks (DCNNs) have recently been used to obtain astonishing performances in both image classification [16, 17] and image retrieval [18] tasks. For every frame sampled from the video at regular intervals, DCNN descriptors are extracted using the open-source Caffe framework [19] along with two pre-trained networks: VGG-ILSVRC-2014-D [20] and Places-CNN [21].

VGG-ILSVRC-2014-D is the best performing single network from the VGG team during the ILSVRC 2014 image classification and localization challenge using the ImageNet [22] dataset. This 138-million-parameter network is made of 16 layers: 13 convolutional layers followed by 3 fully-connected layers. It detects 1000 mostly subject-centric categories (e.g. animals, objects, plants, etc.).

Places-CNN is a 60-million-parameter network following the AlexNet [16] structure, with a total of 8 layers: 5 convolutional layers followed by 3 fully-connected layers. It is trained on the Places205 dataset, a scene-centric image dataset featuring 205 categories including indoor and outdoor sceneries.

For both DCNNs, descriptors are extracted from the last layer before the softmax operation, with a dimensionality of 1000 and 205 for VGG-ILSVRC-2014-D and Places-CNN, respectively.
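As an illustration of this per-frame feature extraction step, the following sketch samples frames at a fixed rate and runs each one through two pre-trained networks. This is only a sketch under assumptions, not the authors' code: `forward_to_presoftmax` stands for a hypothetical wrapper around whichever framework hosts the models (the paper uses Caffe), and the 1 fps rate matches the sampling frequency mentioned in Section 3.2.1.

```python
# Illustrative sketch: sample frames at regular intervals and extract
# subject / scene DCNN descriptors for each frame.
import cv2
import numpy as np

def sample_frames(video_path, every_n_seconds=1.0):
    """Yield frames sampled at regular intervals (default: 1 fps)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # fall back if FPS is unknown
    step = max(int(round(fps * every_n_seconds)), 1)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield frame                           # BGR frame as read by OpenCV
        idx += 1
    cap.release()

def extract_descriptors(video_path, subject_net, scene_net, forward_to_presoftmax):
    """Return (N, 1000) subject and (N, 205) scene descriptors, one row per sampled frame.

    subject_net / scene_net are assumed to be pre-trained subject-centric and
    scene-centric DCNNs; forward_to_presoftmax is a hypothetical helper that
    runs a forward pass and returns the last layer before the softmax.
    """
    x_o, x_p = [], []
    for frame in sample_frames(video_path):
        x_o.append(forward_to_presoftmax(subject_net, frame))   # 1000-d, subject-centric
        x_p.append(forward_to_presoftmax(scene_net, frame))     # 205-d, scene-centric
    return np.vstack(x_o), np.vstack(x_p)
```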
2.2. Co-Regularized Restricted Boltzmann Machines

To create the video summaries from the DCNN descriptions of subjects x_o and scenes x_p, we introduce a pair of concurrently trained restricted Boltzmann machines (RBMs) to learn their projections (z_o and z_p) to K units each, where K is the desired number of keyframes. An RBM is a bipartite network with a projection matrix W that maps between its input and output units. RBMs are trained through gradient descent on the approximate maximum likelihood objective, based on network states drawn from Gibbs sampling [23, 24].

In this work, we introduce co-regularization for RBMs. The subject RBM is regularized by scene representations and in turn regularizes the training of the scene RBM (Figure 2(a)). Given randomly sampled minibatches of subject and scene DCNN descriptors {X_o^i, X_p^i}_i, we introduce co-regularization cross-entropy penalties into the RBM objective functions:

\arg\min_{W_o} \sum_i \Big( -\log P(x_o^i, z_o^i) - \lambda_o \sum_{z_o^i \in Z_o^i} \sum_k \log P(\hat{z}_{p,k}^i \mid \hat{z}_{o,k}^i) \Big),    (1)

\arg\min_{W_p} \sum_i \Big( -\log P(x_p^i, z_p^i) - \lambda_p \sum_{z_p^i \in Z_p^i} \sum_k \log P(\hat{z}_{o,k}^i \mid \hat{z}_{p,k}^i) \Big),    (2)

where {Z_o^i, Z_p^i}_i are the RBM projections of {X_o^i, X_p^i}_i, {λ_o, λ_p} are the regularization constants, and {ẑ^i_{o,k}, ẑ^i_{p,k}}_i refer to unit k in the distribution-sparsified representations of the minibatch [25]. Sparsity across units helps avoid co-adaptation between the units and improves representational diversity across instances of frames. The co-regularization terms serve the purpose of binding a subject and the scene in which it occurs to the same unit position.

Fig. 2. A pair of co-regularized RBMs – one representing subjects and another representing scenes – is learned concurrently. (a) During training, a subject unit is regularized by its corresponding scene unit and vice versa. (b) The frame descriptor is a linear combination of the two co-regularized RBM descriptors, forming relevant subject-scene associations.

The frame descriptor is a linear combination of the two RBM descriptors (Figure 2(b)). The final set of keyframe timings t_k, k ∈ [1..K], is the ordered set of K timings that gives the maximum response for each unit of the frame descriptor:

\arg\max_{t} \; \alpha z_{o,k}^{t} + (1 - \alpha) z_{p,k}^{t},    (3)

where α ∈ [0, 1] is a balance hyperparameter that makes the summary more subject-centric or more scene-centric.

This proposed co-regularization method is not specific to subjects or scenes, and is generalizable to other concepts or modalities, such as faces or activities.
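A minimal sketch of the keyframe selection in Equation (3): assuming the two RBM projections are available as N×K arrays (one row per sampled frame, as produced by the trained subject and scene RBMs), each of the K units votes for the frame time with the largest blended response. Variable names are illustrative, not taken from the paper.

```python
import numpy as np

def select_keyframes(z_o, z_p, timestamps, alpha=0.5):
    """Keyframe selection following Eq. (3): for each of the K units, pick the
    frame whose blended subject/scene response is maximal, then order by time.

    z_o, z_p   -- (N, K) subject and scene RBM projections, one row per frame
    timestamps -- length-N array of frame times in seconds
    alpha      -- balance between subject-centric (1.0) and scene-centric (0.0)
    """
    blended = alpha * z_o + (1.0 - alpha) * z_p      # (N, K) combined frame descriptor
    best_frame_per_unit = blended.argmax(axis=0)     # strongest frame index for each unit
    keyframe_times = np.asarray(timestamps)[best_frame_per_unit]
    return np.sort(keyframe_times)                   # ordered set of K keyframe timings
```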
3. VIDEO SUMMARIZATION

Using our method, we summarized all 11 episodes from the BBC educational TV series Planet Earth¹. Each episode is approximately 50 minutes long. A sample of our results is shown in Figure 5(a).

3.1. Model Visualization

3.1.1. Balancing Subject- and Scene-Centricity

As shown in Figure 3, the bias towards subjects or scenes can be adjusted by tuning the α parameter from Equation 3. This flexibility allows for interesting functionalities such as customising content based on user profiling or explicit queries. The choice of α can also be made independently for each unit in order to generate the most visually attractive keyframe, for example based on vibrancy. In practice, setting the default value to α = 0.5 (as used in this empirical study) seems to produce satisfactory results.

Fig. 3. Actual keyframes selected by varying α (α = 1: more subject-centric; 0 < α < 1: a balance of both; α = 0: more scene-centric). Our model can be tuned to select keyframes that are more subject-centric (left), scene-centric (right) or a balance of both (middle).

3.1.2. Visualisation of Co-Regularized RBM Units

Although neural networks tend to be thought of as black boxes, visualization is often useful to decipher what has been learned [26]. To better understand our co-regularized model, we analysed the responses of each unit across the dataset. For this analysis, we trained a single K = 12 model across all 11 episodes. For each of the 24 RBM units, the top 100 frames that most strongly activate the unit were aggregated via a weighted average. The resulting graphical representation of each unit is shown in Figure 4. We observe that the visual appearances of frames corresponding to a subject-scene pair of units are consistently similar. There is also diversity across the units within an RBM.

The top 2 categories of each unit identified from the weight matrices are also shown in Figure 4. We notice that the correlation with the visual representation is strong and the subject-scene association is sensibly learned. We can also observe an interesting effect of co-regularization, where associations can be made between subjects (e.g. polar bear and king penguin) that occur in the same scene (iceberg) but never within the same frame.

Fig. 4. Visualization of the units for a K = 12 model; each unit pairs a subject category (e.g. polar bear, tiger shark, snow leopard) with a scene category (e.g. iceberg, underwater, snowy mountain). The visual representations of subject-scene pairs are well correlated. The categories of the two models are associated in a sensible way and correspond well with the visual representations.

3.2. User Engagement Study

3.2.1. Evaluation Framework

Our method is compared against three other keyframe-based summarisation schemes: naive uniform sampling, k-means clustering and the method currently in use by the video sharing website Dailymotion². Each summary is presented as a timeline of keyframes, as shown in Figure 5.

Uniform sampling takes K keyframes with evenly spaced timestamps, t_i = (d/K)(i − 1/2), i ∈ [1..K], where d is the total duration of the video. The k-means clustering scheme uses frames sampled at the same frequency as for our method (1 fps) and downsized to 32 × 32 RGB pixels. Lloyd's algorithm [27] is used to separate the data into K clusters. 100 runs with different centroid seeds are performed to mitigate the effects of local minima. For each cluster, the frame closest to its centroid is selected as the keyframe.
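For reference, the two simple baselines can be sketched as follows. This is an illustrative sketch, not the study's code: the uniform timestamps follow the formula as reconstructed above, and scikit-learn's KMeans (whose n_init parameter reruns Lloyd's algorithm with different centroid seeds and keeps the best result) stands in for whichever implementation was actually used.

```python
import numpy as np
from sklearn.cluster import KMeans

def uniform_keyframes(duration, K):
    """Evenly spaced timestamps, one per interval midpoint: t_i = (d/K)(i - 1/2)."""
    return np.array([duration / K * (i - 0.5) for i in range(1, K + 1)])

def kmeans_keyframes(frames_32x32, timestamps, K, n_runs=100):
    """Cluster 32x32 RGB thumbnails into K groups and return, for each cluster,
    the timestamp of the frame closest to its centroid."""
    X = frames_32x32.reshape(len(frames_32x32), -1).astype(np.float32)
    ts = np.asarray(timestamps)
    km = KMeans(n_clusters=K, n_init=n_runs).fit(X)   # 100 seeds, best run kept
    keyframe_times = []
    for c in range(K):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        keyframe_times.append(ts[members[dists.argmin()]])
    return np.sort(np.array(keyframe_times))
```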
Dailymotion proposes an 8-keyframe video summary (excluding the title frame), which was used as a black-box scheme to compare our method against. The evaluation videos were uploaded to the website and the proposed summary keyframes were then handpicked from the original footage.

¹ http://www.bbc.co.uk/programmes/b006mywy
² http://www.dailymotion.com/

Fig. 5. Eight-keyframe summaries for episode 1 of the TV series Planet Earth: (a) our method, (b) uniform sampling, (c) k-means, (d) Dailymotion.

The study was performed by showing pairs of summaries – our method against one of the three baseline schemes – to eight different testers who had not previously seen the videos. For each pair, they were asked to answer the two following questions:
• Q1: Which video would you rather watch? (attractiveness)
• Q2: Which summary was more informative? (informativeness)

Using all 11 Planet Earth episodes, summaries were generated for different numbers of keyframes, K = 4, 6, 8, except for Dailymotion, which imposes K = 8 by default. In total, 8 × 11 × 2 × 3 + 8 × 11 = 616 answers were collected for each question.

Uniform sampling appears as a natural choice for the wildlife documentaries used during this study, given the slow pace of the action and the high visual appeal of the average frame. K-means is expected to capture the diversity of the scenes well, whereas it may not perform as well with respect to subjects.

3.2.2. Results and Discussion

Table 1. How often our method is preferred over each of the three schemes (percentage) for different K.

           uniform                k-means           Dailymotion
  K      4      6      8       4      6      8          8
  Q1   79.55  82.95  76.14   97.73  82.95  75.00      77.27
  Q2   78.41  80.68  81.82   94.32  80.68  76.14      78.41

Table 1 aggregates the answers from the testers. Overall, our method was systematically found more attractive (75% to 97.73% of the time) and more informative (76.14% to 94.32%). Perceived attractiveness and informativeness are strongly correlated. Against Dailymotion's algorithm, our method scores favourably more than three times out of four, representing a marked improvement over the scheme currently used by the service.

For varying numbers K of keyframes, the improvement is rather consistent against uniform sampling, whereas against k-means the improvement is more pronounced when K is smaller. This indicates that our method, while performing well overall, is particularly well-suited for compact summaries.

4. CONCLUSIONS

Building upon recent advances in deep learning and image recognition, we proposed a comprehensive keyframe-based summarization framework combining DCNNs and RBMs. Through a comprehensive empirical study, we showed that our method is able to outperform a number of existing schemes. In addition, our novel co-regularization scheme, which discovers meaningful subject-scene associations, is generalizable to other concepts and modalities.

Beyond the selection of quality keyframes, our contribution represents a strong step towards the Holy Grail of text-based video summaries by introducing highly interpretable semantic representations.

ACKNOWLEDGEMENTS

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used for this research.

5. REFERENCES

[1] T. Liu, H. Zhang, and F. Qi, "A novel video key-frame-extraction algorithm based on perceived motion energy model," Circuits and Systems for Video Technology, vol. 13, no. 10, pp. 1006–1013, 2003.
[2] W. Wolf, "Key frame selection by motion analysis," in Acoustics, Speech, and Signal Processing, 1996.
[3] H. Zhang, J. Wu, D. Zhong, and S. W. Smoliar, "An integrated system for content-based video retrieval and browsing," Pattern Recognition, vol. 30, no. 4, pp. 643–658, 1997.
[4] D. Liu, G. Hua, and T. Chen, "A hierarchical visual model for video object summarization," Pattern Analysis and Machine Intelligence, vol. 32, no. 12, pp. 2178–2190, 2010.
[5] Y. J. Lee, J. Ghosh, and K. Grauman, "Discovering important people and objects for egocentric video summarization," in Computer Vision and Pattern Recognition, 2012.
[6] P. Mundur, Y. Rao, and Y. Yesha, "Keyframe-based video summarization using Delaunay clustering," International Journal on Digital Libraries, vol. 6, no. 2, pp. 219–232, 2006.
[7] Y. Hadi, F. Essannouni, and R. O. H. Thami, "Video summarization by k-medoid clustering," in ACM Symposium on Applied Computing, 2006.
[8] S. E. F. de Avila, A. P. B. Lopes, A. da Luz, Jr., and A. de Albuquerque Araújo, "VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method," Pattern Recognition Letters, vol. 32, no. 1, pp. 56–68, 2011.
[9] M. Furini, F. Geraci, M. Montangero, and M. Pellegrini, "VISTO: Visual storyboard for web video browsing," in ACM International Conference on Image and Video Retrieval, 2007.
[10] M. Furini, F. Geraci, M. Montangero, and M. Pellegrini, "STIMO: Still and moving video storyboard for the web scenario," Multimedia Tools and Applications, vol. 46, no. 1, pp. 47–69, 2010.
[11] C.-W. Ngo, Y.-F. Ma, and H.-J. Zhang, "Video summarization and scene detection by graph modeling," Circuits and Systems for Video Technology, vol. 15, no. 2, pp. 296–305, 2005.
[12] J. Nam and A. H. Tewfik, "Event-driven video abstraction and visualization," Multimedia Tools and Applications, vol. 16, no. 1, pp. 55–77, 2002.
[13] R. Laganière, R. Bacco, A. Hocevar, P. Lambert, G. Païs, and B. E. Ionescu, "Video summarization from spatio-temporal features," in ACM Multimedia, 2008.
[14] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, "Creating summaries from user videos," in European Conference on Computer Vision, 2014.
[15] Z. Lu and K. Grauman, "Story-driven summarization for egocentric video," in Computer Vision and Pattern Recognition, 2013.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[17] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in Computer Vision and Pattern Recognition, 2014.
[18] J. Lin, O. Morère, A. Veillard, V. Chandrasekhar, and H. Goh, "DeepHash: Getting regularization, depth and fine-tuning right," arXiv preprint arXiv:1501.04711, 2015.
[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in ACM International Conference on Multimedia, 2014.
[20] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[21] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using Places database," in Neural Information Processing Systems, 2014.
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009.
[23] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[24] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief networks," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[25] H. Goh, N. Thome, M. Cord, and J.-H. Lim, "Unsupervised and supervised visual codes with restricted Boltzmann machines," in European Conference on Computer Vision, 2012.
[26] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision, 2014.
[27] S. Lloyd, "Least squares quantization in PCM," Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
