Learning from Multiple Sources for Video Summarisation

Xiatian Zhu, Student Member, IEEE, Chen Change Loy, Member, IEEE, and Shaogang Gong

arXiv:1501.03069v2 [cs.CV] 6 Feb 2015

Abstract - Many visual surveillance tasks, e.g. video summarisation, are conventionally accomplished through analysing imagery-based features. Relying solely on visual cues for public surveillance video understanding is unreliable, since visual observations obtained from public space CCTV video data are often not sufficiently trustworthy and events of interest can be subtle. We believe that non-visual data sources such as weather reports and traffic sensory signals can be exploited to complement visual data for video content analysis and summarisation. In this paper, we present a novel unsupervised framework to learn jointly from both visual and independently-drawn non-visual data sources for discovering meaningful latent structure of surveillance video data. In particular, we investigate ways to cope with discrepant dimension and representation whilst associating these heterogeneous data sources, and derive an effective mechanism to tolerate missing and incomplete data from different sources. We show that the proposed multi-source learning framework not only achieves better video content clustering than state-of-the-art methods, but is also capable of accurately inferring missing non-visual semantics from previously-unseen videos. In addition, a comprehensive user study is conducted to validate the quality of video summarisation generated using the proposed multi-source model.

Index Terms - Multi-source data, heterogeneous data, visual surveillance, clustering, event recognition, video summarisation.

Visual features and descriptors are often carefully designed and exploited as the sole input for surveillance video content analysis and summarisation. For instance, optical or particle flow is typically employed in activity modelling [1], [2], [3], foreground pixel features are used for multi-camera video understanding [4], space-time image gradient is adopted for crowd analysis [5], and mixture of dynamic textures is used for video segmentation [6] and anomaly detection [7].

A critical task in visual surveillance is to automatically make sense of massive amounts of video data by summarising its content using higher-level intrinsic physical events¹ beyond low-level key-frame visual feature statistics and/or object detection counts. In most contemporary techniques, low-level imagery visual cues are typically exploited as the only information source for video summarisation [8], [9], [10], [11], [12]. On the other hand, in complex and cluttered public scenes there are intrinsically more interesting and salient higher-level events that can provide more meaningful and concise summarisation of the video data. However, such events may be neither visually well-defined (easily detectable) nor reliably detected by visual cues alone. In particular, surveillance visual data from public spaces is often inaccurate and/or incomplete due to uncontrollable sources of variation, changes in illumination, occlusion, and background clutter [13].

In this study, we wish to exploit non-visual auxiliary information to complement the unilateral perspective from visual observations. Examples of non-visual sources include weather reports, GPS-based traffic data, geo-location data, textual data from social networks, and on-line event schedules. The auxiliary data sources are beneficial to visual data modelling because, although visual and non-visual data may have very different characteristics and are of different natures, they depict the common physical phenomenon in a scene. They are intrinsically correlated, although the correlation may be mostly indirect in some latent space. Effectively discovering and exploiting such a latent correlation space can facilitate the underlying data structure discovery and bridge the semantic gap between low-level visual features and high-level semantic interpretation.

Challenges - Nevertheless, it is non-trivial to formulate a framework that exploits both visual and non-visual data for video content analysis and summarisation, both algorithmically and in practice.

Algorithmically, unsupervised mining of latent correlation and interaction between heterogeneous data sources faces a number of challenges: (1) Disparate sources significantly differ in representation (continuous or categorical), and largely vary in scale and covariance². In addition, the dimension of visual sources often exceeds that of non-visual information to a great extent (>2000 visual dimensions vs. <10 non-visual dimensions). Owing to this dimensionality discrepancy problem, a straightforward concatenation of features will result in a representation unfavourably inclined towards the imagery data. (2) Both visual and non-visual data in isolation can be inaccurate and incomplete.

• Xiatian Zhu and Shaogang Gong are with School of Electronic Engineering and Computer Science, Queen Mary University of London. E-mail: [email protected], [email protected]
• Chen Change Loy is with Department of Information Engineering, The Chinese University of Hong Kong. E-mail: [email protected]

1. Spatiotemporal combinations of human activity or interaction patterns, e.g. gathering, or environmental state changes, e.g. raining.
2. Also known as the heteroscedasticity problem [14].

[Figure 1 diagram. Training: multi-source data (visual data; non-visual data such as traffic speed and weather) are learned and correlated by the Multi-Source Clustering Forest, giving a multi-source cluster formation (visual data clustering with non-visual distributions). Deployment: previously-unseen video data are summarised by (1) clustering the previously-unseen video, (2) extracting key-clips, and (3) tag inference, giving a semantic video summary, e.g. "Typicality: usual; Weather: cloudy; Traffic: medium; Time: 05:00 am" and "Typicality: interesting; Weather: sunny; Traffic: medium; Time: 06:31 am".]

Fig. 1.
The overview of the proposed multi-source driven video summarisation framework. We consider a novel setting where multiple heterogeneous sources are present during the model training stage. The proposed Multi-Source Clustering Forest discovers and exploits latent correlations among heterogeneous visual and non-visual data sources, both of which can be inaccurate and not trustworthy. In deployment, our model uncovers visual content structures and infers semantic tags on previously-unseen video data for video summarisation.

In practice, auxiliary data sources, e.g. weather, traffic reports, and event time tables, may be rather unreliable in availability. Specifically, the reports may not be released on-the-fly at a synchronised time stamp with the surveillance video stream. In addition, existing video control rooms may not necessarily have direct access to these sources. This renders models that expect complete visual and non-visual information during deployment impractical.

Our solution - In this study, we address this multi-source learning problem in the context of video summarisation, conventionally based on visual feature analysis and object detection/segmentation. In particular, we formulate a novel framework that is capable of performing joint learning given heterogeneous multi-sources (Fig. 1). We consider visual data as the main source and non-visual data as the auxiliary sources, since we believe visual information still plays the main role in video content analysis. During training, we assume access to both visual and non-visual data. The model performs multi-source data clustering and discovers a set of visual clusters tagged along with non-visual data distribution, e.g. different weathers and traffic speeds. We term the model the multi-source model. During the deployment stage, we only assume the availability of previously-unseen video data since non-visual data may not be accessible due to the aforementioned limitations. Since the learned model has already captured the latent structure of heterogeneous types of data sources, the model can be used for semantic video clustering and non-visual tag inference on previously-unseen video sequences, even without the non-visual data. Subsequently, key clips are automatically selected from the discovered clusters. The final summary video can be produced by chronologically compositing these key clips enriched by the inferred tags.

Contributions - The main contributions of this work are:
1) We propose a unified multi-source learning framework capable of discovering semantic structures of video content collectively from heterogeneous visual and non-visual data. This is made possible by formulating a novel Multi-Source Clustering Forest (MSC-Forest) that seamlessly handles multiple heterogeneous data sources dissimilar in representation, distribution, and dimension. Although both visual and non-visual data in isolation can be inaccurate and incomplete, our model is capable of uncovering and subsequently exploiting the shared latent correlation for better data structure discovery.
2) The model is novel in its ability to accommodate partial or completely missing non-visual sources. In particular, we introduce a joint information gain function that is capable of dynamically adapting to an arbitrary amount of missing non-visual information during model learning. In model deployment, only visual input is required for inferring missing non-visual semantics.

Extensive comparative evaluations are conducted on two public surveillance videos captured from both indoor and outdoor environments. Comparative results show that the proposed model not only outperforms the state-of-the-art methods [15], [16] for video content clustering and structure discovery, but is also superior in predicting non-visual tags for previously-unseen videos. The robustness of the proposed model is further validated by a user study on video summary quality.

1 RELATED WORK

Multi-modality learning - There exist studies that exploit
different sensory or information modalities from a single source for data structure mining. For example, Cai et al. [17] propose to perform multi-modal image clustering by learning a commonly shared graph-Laplacian matrix from different visual feature modalities. Heer and Chi [18] linearly combine individual similarity matrices derived from multi-modal webpages for web user grouping. Karydis et al. [19] present a tensor based model to cluster music items with additional tags. In terms of video analysis, the auditory channel and/or transcripts have been widely explored for detecting semantic concepts from multimedia videos [20], [21], summarising highlights in news and broadcast programs [22], [23], or locating speakers [24]. User tags associated with web videos (e.g. YouTube) have also been utilised [25], [26], [27]. In contrast, surveillance videos captured from public spaces typically have neither auditory signals nor any synchronised transcripts and user tags available. Instead, we wish to explore alternative non-visual data drawn independently elsewhere from multiple sources, with inherent challenges of being inaccurate and incomplete, unsynchronised to, and possibly in conflict with, the observed visual data.

Multi-source learning - An alternative multi-source learning mechanism can be clustering ensemble [28], [29], where a collection of clustering instances is generated and then aggregated into the final clustering solution. Typically only a single data source is considered, but it can be easily extended to handle multi-source data, e.g. creating a respective clustering instance for each source. Nonetheless, cross-source correlation is ignored since the clustering instances are separately formed and no interaction between them is involved. A closer approach to ours is the Affinity Aggregation Spectral Clustering (AASC) [15], which learns data structure from multiple types of homogeneous information (visual features only). Their method independently generates multiple affinity matrices by exhaustive pairwise distance computation for every pair of samples in every data source. It suffers from unwieldy representation given high-dimensional data inputs. Importantly, although it seeks an optimal weighted combination of distinct affinity matrices, it does not consider correlation between different sources in model learning, similar to clustering ensemble [28], [29]. Differing from the above models, our Multi-Source Clustering Forest overcomes these problems by generating a unified single affinity matrix that captures latent correlations among heterogeneous types of data sources. Furthermore, our model has a unique advantage in handling missing non-visual data over [28], [29], [15].

Video summarisation - Contemporary video summarisation methods can be broadly classified into two paradigms, key-frame-based [11], [30], [31], [32], [33] and object-based [9], [10], [34] methods. The key-frame-based approaches select representative key-frames by analysing low-level imagery properties, e.g. optical flow [30] or image differences [31], object's appearance and motion [11], to form a storyboard of still images. Object-based techniques [9], [10], on the other hand, rely on object segmentation and tracking to extract object-centric trajectories/tubes, and compress those tubes to reduce spatiotemporal redundancy. Both the above schemes utilise solely visual information and make implicit assumptions about the completeness and accuracy of the visual data available in extracting features or object-centered representations. They are neither suitable for nor scalable to complex scenes where visual data are inherently incomplete and inaccurate, which is mostly the case in surveillance videos. Our work differs significantly from these studies in that we exploit not only visual data without object tracking, but also non-visual sources as complementary information. The summary generated by our approach is semantically enriched: it is labelled automatically with semantic tags, e.g. traffic condition, weather, or event. All these tags are learned from heterogeneous non-visual sources in an unsupervised manner during model training without any manual labels.

Random forests - Random forests [35], [16] have proven to be powerful models in the literature. Different variants of random forests have been devised, either supervised [36], [37], [38], [39], [40], or unsupervised [41], [42], [43], [44], [45]. Supervised models are not suitable for our problem since we do not assume the availability of ground truth labels during model training. Existing clustering forest models, on the other hand, assume only homogeneous data sources such as pure imagery-based features. No principled way of combining multiple heterogeneous and independent data sources in forest models is available.

2 MULTI-SOURCE CLUSTERING

Video summarisation by content abstraction aims to generate a compact summary composed of key/interesting content from a long previously-unseen video for achieving efficient holistic understanding [32]. A common way to establish a video summary is by extracting and then combining a set of key frames or shots. These key contents are usually discovered and selected from clusters of video frames or clips [32].

In this study, we follow the aforementioned approach but consider not only the visual content of video, but also a large corpus of non-visual data collected from heterogeneous independent sources (Fig. 2(a)). Specifically, through learning latent structure of multi-source data (Fig. 2(b-c)), we wish to make reference to and/or impose non-visual semantics directly into video clustering without any human manual annotation of video data (Fig. 2(d)). Formally, we consider the following different data sources that form a multi-source input feature space:

Visual features - We segment a training video into $N$ either overlapping or non-overlapping clips, each of which has a duration of $T_{clip}$ seconds. We then extract a $d$-dimensional visual descriptor from the $i$th video clip denoted by $x_i = (x_{i,1}, \dots, x_{i,d}) \in \mathbb{R}^d$, $i = 1, \dots, N$.

Non-visual data - Non-visual data are collected from heterogeneous independent sources. We collectively represent $m$ types of non-visual data associated with the $i$th clip as $y_i = (y_{i,1}, \dots, y_{i,m}) \in \mathbb{R}^m$, $i = 1, \dots, N$. Note that any (or all) dimensions of $y_i$ may be missing.

[Figure 2 diagram: (a) visual data $\{x_i\}$ and non-visual data $\{y_i\}$; (b) Multi-Source Clustering Forest (MSC-Forest) with trees $1, \dots, T_{clust}$, yielding the affinity matrix $A$; (c) graph partitioning into clusters $1, \dots, k$; (d) non-visual distributions $p(y_i|c=1), \dots, p(y_i|c=k)$.]

Fig. 2. Multi-source model training stage: The pipeline of performing multi-source clustering on visual and non-visual data with the proposed Multi-Source Clustering Forest (MSC-Forest).

We aim at formulating a unified clustering model capable of coping with the challenges as highlighted in Section . The model needs to be unsupervised since no ground truth is assumed. To mitigate the heteroscedasticity and dimension discrepancy problems, we require a model that can isolate the very different characteristics of visual and non-visual data, yet can still exploit their latent correlation in the clustering process. To handle noisy data, feature selection is needed.

In light of the above demands, we choose to start with the clustering random forest [35], [41], [42] due to (1) its unsupervised information gain optimisation, thus requiring no ground truth labels; (2) its flexible objective function for facilitating the modelling of multi-source data as well as the processing of missing data; and (3) its implicit feature selection mechanism for handling noisy features. Nevertheless, the conventional clustering forest is not well suited to solve these challenges since it expects a full concatenated representation as input during both model training and deployment. This does not conform to the assumption of only visual data being available during model deployment for previously-unseen videos. Moreover, due to its uniform variable selection mechanism [35] (e.g. each feature dimension has the same probability to be selected as a candidate optimal splitting variable), there is no principled way to ensure balanced contribution from individual visual and non-visual sources in the node splitting process. To overcome these limitations, we propose a new Multi-Source Clustering Forest (MSC-Forest) by introducing a new objective function allowing joint optimisation of individual information gains of different sources. We first describe the conventional forests prior to detailing the proposed MSC-Forest.

2.1 Conventional Random Forests

Classification forests - A general form of random forests is the classification forest. A classification forest [35], [46] is an ensemble of $T_{class}$ binary decision trees $\mathcal{T}(x): \mathcal{X} \to \mathbb{R}^K$, with $\mathcal{X} = \mathbb{R}^d$ the $d$-dimensional feature space, and $\mathbb{R}^K = [0,1]^K$ denoting the space of class probability distribution over the label space $\mathcal{L} = \{1, \dots, K\}$.

Decision trees are learned independently of each other, each with a random subset $X_t$ of the training samples $X = \{x_i\}$, i.e. bagging [35]. Growing a decision tree involves a recursive node splitting procedure until some stopping criterion is satisfied, e.g. leaf nodes are formed when no further split can be achieved given the objective function, or the number of training samples arriving at a node is smaller than the predefined node size, $\phi$. Small $\phi$ leads to deep trees. We set $\phi = 2$ in our experiments for capturing sufficiently fine-grained data structure. At each leaf node, the class probability distribution is then estimated based on the labels of the arrival samples.

The training of each internal/split node is a process of binary split function optimisation, defined as
\[ h(x, \vartheta) = \begin{cases} 0 & \text{if } x_{\vartheta_1} < \vartheta_2, \\ 1 & \text{otherwise.} \end{cases} \tag{1} \]
This split function is parameterised by two parameters $\vartheta = [\vartheta_1, \vartheta_2]$: (i) a feature dimension $x_{\vartheta_1}$ with $\vartheta_1 \in \{1, \dots, d\}$, and (ii) a feature threshold $\vartheta_2 \in \mathbb{R}$. All samples of a split node $s$ will be channelled to either the left ($l$) or right ($r$) child node, according to the output of Eqn. (1).

The optimal split parameter $\vartheta^*$ is chosen via
\[ \vartheta^* = \operatorname*{argmax}_{\Theta} \Delta\mathcal{I}_{class}, \tag{2} \]
where $\Theta = \{\vartheta^i\}_{i=1}^{m_{try}(|S|-1)}$ represents a parameter set over $m_{try}$ randomly selected features, with $S$ the sample set reaching the node $s$. The cardinality of a set is given by $|\cdot|$. Typically, a greedy search strategy is exploited to identify $\vartheta^*$. The information gain $\Delta\mathcal{I}_{class}$ is formulated as
\[ \Delta\mathcal{I}_{class} = \mathcal{I}_s - \frac{|L|}{|S|}\mathcal{I}_l - \frac{|R|}{|S|}\mathcal{I}_r, \tag{3} \]
where $L$ and $R$ denote the sets of data routed into $l$ and $r$, and $L \cup R = S$. The term $\mathcal{I}$ can be computed as either the entropy or the Gini impurity [47].

Clustering forests - In contrast to classification forests, clustering forests require no ground truth label information during the training phase. A clustering forest consists of $T_{clust}$ binary decision trees. The leaf nodes in each tree define a spatial partitioning of the training data. Interestingly, the training of a clustering forest can be performed using the classification forest optimisation approach by adopting the pseudo two-class algorithm [35], [41], [42]. Specifically, we add $N$ pseudo samples $\bar{x} = \{\bar{x}_1, \dots, \bar{x}_d\}$ (Fig. 3(b)) into the original data space $X$ (Fig. 3(a)), with $\bar{x}_i \sim Dist(x_i)$ sampled from certain distributions $Dist(x_i)$. In the proposed model, we adopt the empirical marginal distributions of the feature variables owing to its favourable performance [42]. With this data augmentation strategy, the clustering problem becomes a canonical classification problem that can be solved by the classification forest training method as discussed above. The key idea behind this algorithm is to partition the augmented data space into dense and sparse regions (Fig. 3(c-d)) [41].

[Figure 3: panels (a), (b), (c), (d).]

Fig. 3. An illustration of clustering toy data with a clustering forest. (a) Original toy data are labelled as class 1, whilst (b) the pseudo-points (red +) as class 2. (c) A clustering forest performs two-class classification in the augmented space. (d) The final data partitions on the original data.

2.2 Multi-Source Clustering Forest

Conventional clustering forests assume only homogeneous data sources such as pure imagery-based features. In contrast, the proposed Multi-Source Clustering Forest can take heterogeneous sources as input. In particular, the proposed model uses visual features as splitting variables to grow Multi-Source Clustering trees (MSC-trees) as in Eqn. (1), and exploits non-visual information as additional data to help determine $\vartheta = [\vartheta_1, \vartheta_2]$. In this way, auxiliary non-visual information is used, in addition to visual data, to guide the tree formation.

Formally, we define a new joint information gain function for node splitting during training MSC-trees as:
\[ \Delta\mathcal{I} = \underbrace{\alpha_v \frac{\Delta\mathcal{I}_v}{\mathcal{I}_{v0}}}_{\text{visual}} + \underbrace{\sum_{j=1}^{m} \alpha_j \frac{\Delta\mathcal{I}_j}{\mathcal{I}_{j0}}}_{\text{non-visual}} + \underbrace{\alpha_t \frac{\Delta\mathcal{I}_t}{\mathcal{I}_{t0}}}_{\text{temporal}}. \tag{4} \]
Similar to Eqn. (3), the optimal parameter corresponds to the split with the maximal $\Delta\mathcal{I}$. This formulation defines the best data split across the joint space of multi-source data, beyond the visual domain alone. All the terms in Eqn. (4) are interpreted as below.

Visual term: $\Delta\mathcal{I}_v = \Delta\mathcal{I}_{class}$ (Eqn. (3)) denotes the information gain in the visual domain. Precisely, this measure is computed from the pseudo class labels. Therefore, it reflects the visual data structure characteristics given that the pseudo data samples are drawn from the marginal feature distributions (Section 2.1). In this study we utilise the Gini impurity $\mathcal{G}$ [47] to estimate $\Delta\mathcal{I}_{class}$ by setting $\mathcal{I} = \mathcal{G}$ in Eqn. (3) due to its simplicity and efficiency. The Gini impurity is computed as $\mathcal{G} = \sum_{i \neq j} p_i p_j$, with $p_i$ and $p_j$ being the proportion of samples belonging to the $i$th and $j$th category in a split node $s$. A low value of $\mathcal{G}$ indicates a pure category distribution.

Non-visual term: This is a new term we introduce as auxiliary information on the visual term. More specifically, $\Delta\mathcal{I}_j$ denotes the information gain in the $j$th non-visual data. A non-visual source can be either categorical or continuous. For a categorical non-visual source, similar to the visual term we use the Gini impurity $\mathcal{G}$ as its data split measure criterion. In the case of a non-visual source with continuous values, we adopt least squares regression [47] to enforce continuity in the clustering space:
\[ \mathcal{R} = \frac{1}{|S|} \sum_{i=1}^{|S|} \Big( y_{i,j} - \frac{1}{|S|} \sum_{i'=1}^{|S|} y_{i',j} \Big)^2, \tag{5} \]
where $y_{i,j}$ represents the value in the $j$th non-visual space associated with the $i$th sample $x_i \in S$, and $S$ is the set of samples reaching node $s$. That is, $\Delta\mathcal{I}_j = \mathcal{R}$.

Temporal term: We also add a temporal smoothness gain $\Delta\mathcal{I}_t$ to encourage temporally adjacent video clips to be grouped together. This temporal information helps in mining visual data structure.

The information gains of different sources may live in very disparate ranges due to the different natures of the sources; each term of Eqn. (4) is therefore normalised by its initial data impurity, denoted by $\mathcal{I}_{v0}$, $\mathcal{I}_{j0}$, and $\mathcal{I}_{t0}$. These impurities are obtained at the root node of every MSC-tree. The source weights are denoted by $\alpha_v$, $\alpha_i$, and $\alpha_t$ accordingly, holding $\alpha_v + \sum_{i=1}^{m} \alpha_i + \alpha_t = 1$. We set $\alpha_v = 0.5$, obtained by cross-validation. A detailed analysis on $\alpha_v$ is given in Section 5.2. For non-visual and temporal information, we uniformly assign $\alpha_t = \alpha_i = \frac{1-\alpha_v}{m+1}$ since their importance is not known a priori, with $m$ the number of non-visual sources.

The role of different source data - Given the main role and much more stable provision of the visual source in video understanding, non-visual data are regarded as auxiliary information over the visual source. During the training of MSC-Forest, the split functions (Eqn. (1)) are defined on visual features, but $\vartheta = [\vartheta_1, \vartheta_2]$ is collectively determined by visual features and the associated non-visual as well as temporal information (i.e. the non-visual and temporal terms in Eqn. (4)). Alternatively, one can think of the main visual data source as 'completely-visible' to the MSC-Forest since it is needed during both forest training and evaluation, whilst the auxiliary non-visual data are 'half-visible' in that they are exploited as side information for embedding their knowledge into the MSC-tree growing during model training but are no longer required during the MSC-Forest evaluation (due to their restricted availability as explained in Section ).

Joint information gain - We interpret the intrinsic advantage of the joint information gain defined by Eqn. (4), with comparison against the naïve feature concatenation strategy. With the latter scheme, the information gain (Eqn. (3)) is directly estimated in a heterogeneous joint space where visual, non-visual and temporal data are mixed together. This would suffer from the heteroscedasticity problem, as discussed in Section . Instead, Eqn. (4) overcomes this challenge by modelling different sources via separate information gain terms, resulting in a more balanced exploitation of multi-source data. In this way, the proposed joint information gain of multi-source data encourages more appropriate visual data separation both visually and semantically. This formulation is the essential contribution of our proposed MSC-Forest model.

The merits of MSC-Forest - The formulation in Eqn. (4) brings two unique benefits: (A) Thanks to the information gain optimisation, the influences of visual and non-visual domains on data partitioning can be better balanced compared to naïve feature concatenation. (B) Eqn. (2) and Eqn. (4) together provide a mechanism to discover strongly correlated heterogeneous source pairs and to exploit joint information gain of such correlated pairs for data partitioning. In other words, only selective visual features (Eqn. (2)) that yield high information gain collectively with non-visual information (Eqn. (4)) will contribute to the MSC-tree growing. Such a mechanism cannot be realised using the conventional clustering forests [35], [41]. We shall demonstrate the multi-source correlation discovered by our proposed MSC-Forest in experiments (Section 5.4).

2.2.1 Coping with Partial/Missing Non-Visual Data

We introduce a new adaptive weighting mechanism to dynamically deal with the inevitable partial/missing non-visual data³. Specifically, when some non-visual data are missing, and supposing the missing proportion of the $i$th non-visual type in the training set $X_t$ for MSC-tree $t$ is $\delta_i$, we reduce its weight from $\alpha_i$ to $\alpha_i - \delta_i \alpha_i$. The total reduced weight $\sum_i \delta_i \alpha_i$ is then distributed evenly to the weights of all sources to ensure $\alpha_v + \sum_{i=1}^{m} \alpha_i + \alpha_t = 1$. This linear adaptive weighting method produces satisfactory results in our experiments.

3. There exist missing data filling algorithms utilised in conventional random forests, e.g. for the missing value of one feature in one class, the median value (continuous) or the most frequent category (discrete) of this feature over the current class can be used as the estimation [48]. Whilst a similar strategy is possible to apply on our MSC-Forest, we consider an alternative by proposing an effective adaptive weighting algorithm in order not to further introduce noisy training data.

2.2.2 Model Complexity

The upper-bound learning complexity of a whole MSC-Forest can be examined from its constituent parts, i.e. at tree- and node-levels. Formally, given a MSC-tree $t$, we denote the set of all the split nodes as $\Pi_t$ and the sample subset used for training a split node $j \in \Pi_t$ as $S_j$. The training complexity of the $j$-th node is given by $m_{try}(|S_j|-1)u$ when a greedy search algorithm is adopted, with $m_{try}$ the number of features attempted to partition $S_j$, and $u$ the running time of conducting one data splitting operation. Consequently, the overall computational cost of learning a MSC-Forest can be computed as
\[ \Omega = \sum_{t=1}^{T_{clust}} \sum_{j \in \Pi_t} m_{try}(|S_j|-1)u = m_{try}\, u \sum_{t=1}^{T_{clust}} \sum_{j \in \Pi_t} (|S_j|-1). \tag{6} \]
The value of parameter $m_{try}$ is identical across all MSC-trees. The learning time is thus determined by (1) the value of $u$, and (2) the factor that we name as tree fan-in
\[ \Phi(t) = \sum_{j \in \Pi_t} (|S_j| - 1). \tag{7} \]
Clearly, $u$ of a MSC-Forest is larger than that of conventional forests since we need to compute additional information gains of non-visual and temporal information (Eqn. (4)). On the other hand, the value of $\Phi(t)$ primarily relies on the tree structure/topological characteristics [49]: a balanced and shallower tree has smaller $\Phi(t)$, thus the tree shall be more efficient in training and inference on previously-unseen samples, in that the paths from the root to leaf nodes are relatively shorter. In Section 5.5, we will show that the additional non-visual information encourages more balanced and shallower decision trees than learning from a single visual source alone.

2.3 Latent Multi-Source Data Structure Discovery

The multi-source feature space has high dimension (over 2000 dimensions). This makes learning data structure by clustering computationally difficult. To this end, we consider spectral clustering on manifold to discover latent clusters in a lower dimensional space. Fig. 2 depicts the pipeline of our video data clustering approach based on the learned MSC-Forest.

The spectral clustering [50] groups data using eigenvectors of an affinity matrix derived from the data. The goodness of the resulting cluster formation primarily relies on the quality of the input affinity matrix, which reflects and embeds the essential data structures [45]. Below we describe the details of constructing a multi-source referenced affinity matrix from MSC-Forest. Intuitively, the multi-source learning nature of MSC-Forest renders its data similarity measure sensitive to the joint knowledge from diverse source data.

The learned MSC-Forest offers an effective way to derive the required affinity matrix. Specifically, each individual tree within the MSC-Forest partitions the training samples at its leaves $\ell(x): \mathbb{R}^d \to L \subset \mathbb{N}$, where $\ell$ represents a leaf index and $L$ refers to the set of all leaves in a given tree. For each MSC-tree, we first compute a tree-level $N \times N$ affinity matrix $A^t$ with elements defined as $A^t_{i,j} = \exp(-dist(x_i, x_j))$, where
\[ dist(x_i, x_j) = \begin{cases} 0 & \text{if } \ell(x_i) = \ell(x_j), \\ +\infty & \text{otherwise.} \end{cases} \tag{8} \]
We assign the maximum affinity (affinity = 1, distance = 0) between points $x_i$ and $x_j$ if they fall into the same leaf, and the minimum affinity (affinity = 0, distance = $+\infty$) otherwise. A smooth affinity matrix can be obtained through averaging all the tree-level affinity matrices
\[ A = \frac{1}{T_{clust}} \sum_{t=1}^{T_{clust}} A^t. \tag{9} \]
Eqn. (9) is adopted as the ensemble model of MSC-Forest due to its advantage of suppressing the noisy tree predictions, though other alternatives such as the product of tree-level predictions are possible [16]. We then construct a sparse k-NN graph, whose edge weights are defined by $A$ (Fig. 2(c)).

Subsequently, we symmetrically normalise $A$ to obtain $\mathcal{S} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, where $D$ denotes a diagonal degree matrix with elements $D_{i,i} = \sum_{j=1}^{N} A_{i,j}$. Given $\mathcal{S}$, we perform spectral clustering to discover the latent clusters of training clips, with the number of clusters automatically determined through analysing the eigenvector structure [50]. Each training clip $x_i$ is then assigned to a cluster $c_i \in \mathcal{C}$, with $\mathcal{C}$ the set of all clusters.

The learned clusters group similar clips both visually and semantically, with each of the clusters associated with a unique distribution for each non-visual data type (Fig. 2(d)). We denote the distribution of the $i$th non-visual data type of the cluster $c$ as
\[ p(y_i | c) \propto \sum_{x_j \in X_c} p(y_i | x_j), \tag{10} \]
where $X_c$ represents the set of training samples in $c$. These multi-source data clusters form a component of our multi-source model (Fig. 1).

3 SEMANTIC VIDEO SUMMARISATION

In Section 2 we presented multi-source data clustering by learning a Multi-Source Clustering Forest (MSC-Forest), resulting in a semantic cluster formation. Once this multi-source model is learned, it can be deployed for semantic video summarisation. Specifically, we follow the established approach of summarising videos by clustering [32] but with the introduction of two noticeable differences in our method.

First, our video summary is multi-source referenced. Specifically, the MSC-Forest is trained on heterogeneous sources; its optimised split functions $\{h\}$ (Eqn. (1)) therefore implicitly capture the complex multi-source structures. When one deploys the trained model for content summarisation of previously-unseen video data, the model only […] tagged as the result of model inference. This is made possible through exploiting the non-visual data distributions associated with the discovered clusters on the training data (see Eqn. (10) and Fig. 2(d)). Below we discuss the details of generating a semantic video summary.

3.1 Key-Clip Extraction and Composition

Suppose we are given a previously-unseen surveillance video footage without meta-data tagging/script. The video is pre-processed by segmenting it into a set of $M$ either overlapping or non-overlapping short clips $\{x^*_i\}_{i=1}^{M}$ with equal duration. Our aim is to first assign cluster membership to each previously-unseen clip using the trained multi-source model, and then select key-clips from the resulting clusters⁴. The chosen key-clips are then chronologically ordered to construct a video summary.

Clustering previously-unseen video clips - Inferring cluster memberships of previously-unseen clips is an intricate task. A straightforward method is to assign cluster membership by identifying the nearest cluster $c^* \in \mathcal{C}$ to a sample $x^*$, where $\mathcal{C}$ represents the set of clusters we discovered in Section 2.3. However, we found this hard cluster assignment strategy susceptible to outliers in $\mathcal{C}$ and source noise. To mitigate this problem, we consider an alternative approach by utilising the MSC-Forest tree structures for soft cluster assignment. This is more robust to either source noise or outliers.

Fig. 4 depicts the soft cluster assignment pipeline. First, we trace the leaf $\ell_t(x^*)$ of each tree $t$ where $x^*$ falls by channelling $x^*$ into the tree (Fig. 4(a)). This step is critical as it establishes a connection for $x^*$ with an appropriate training subset $X_{\ell_t(x^*)}$ using the split functions $\{h\}_t$ optimised by multi-source data. Here, $X_{\ell_t(x^*)}$ represents the set of training samples associated with $\ell_t(x^*)$. The set is consistent with $x^*$ both visually and semantically since they encompass identical response w.r.t. $\{h\}_t$.

Second, we retrieve the cluster membership $C_t = \{c_i\} \subset \mathcal{C}$ of $X_{\ell_t(x^*)}$, against which we search for the tree-level nearest cluster $c^*_t$ for $x^*$ (Fig. 4(b)) via
\[ c^*_t = \operatorname*{argmin}_{c \in C_t} \| x^* - \mu_c \|, \tag{11} \]
with $t$ the tree index, and $\mu_c$ the centroid of cluster $c$, estimated as
\[ \mu_c = \frac{1}{|X_c|} \sum_{x_i \in X_c} x_i, \tag{12} \]
where $X_c$ represents the set of training samples in $c$. Performing nearest cluster search within $C_t$ rather than the whole cluster space $\mathcal{C}$ brings a key benefit: since the search space is constrained by the MSC-tree, it is more meaningful and also less noisy than the entire space $\mathcal{C}$, leading to more accurate $c^*_t$ estimation.
C t needs to take visual inputs without any non-visual data sources.Andyetitisabletoinducevideocontentpartitions 4.It is worth noticing that the purpose of this clustering step is that not only correspond to visual feature similarities, but completely different from the multi-source data clustering during model training, as presented in Section 2.3. The latter is a component of our also are consistent with meaningful non-visual semantic multi-source model training pipeline (Fig. 2), whilst the former aims at interpretations.Second,ourvideosummaryisautomatically revealingthelatentstructureovertestingdataforvideosummarisation. 8 Multi-source referenced similarity graph Tree 1 Tree previously-unseen clip Affinity previously-unseen matrix (a) … clip representative previously-unseen clip Tree leaves (d) … (b) Extract representative clip Perform shortest Cluster for each previously-unseen path search Tree-level memberships d a t a c l uster between nearest cluster adjacent (f) Maximal vote of tree-level nearest cluster (c) Nearest cluster for (e) (g) Key-clips Fig.4. Thepipelineofourmulti-sourcereferencedkey-clipsdetectionalgorithm.(a)Channelaclipx∗intoMSC-trees.(b)Searchtree-level nearestclustersofx∗,hollowcircledenotescluster.(c)Predictthefinalnearestcluster.Ared(cid:63)depictsarepresentativepreviously-unseen clip. Onceweobtainalltree-levelnearestclustersfromallthe Algorithm 1: Infer non-visual tags of previously- trees in the forest, c∗ Tclust , the final nearest cluster c∗ is unseen clips. { t}t=1 obtained as the one with maximal votes from all the trees Input: A previously-unseen clip x∗, a trained (Fig. 4(c)) MSC-Forest, training data clusters ; C Output: Predicted tag yˆ; i c∗ =max c∗ Tclust (13) 1 Initialisation: { t}t=1 2 Compute p(yi c) for each training data cluster | (Eqn. (10)); Byrepeatingtheabovestepsonallpreviously-unseenclips 3 Compute cluster centroid µc (Eqn. 
(12)); x∗ M , we obtain their cluster labels as = c∗ M 4 Non-Visual Tag Inference: {(Fiig}.i4=(1e)). CL { i}i=1 5 for t 1 to Tclust do ← 6 Trace the leaf (cid:96)t(x∗) where x∗ falls (Fig. 4(a)); Extracting key-clips - With the assigned cluster member- 7 Retrieve the training samples X(cid:96)t(x∗) associated ships on all previously-unseen clips, the key-clip of a with (cid:96) (x∗); CL t pbryevtihoeuslrye-purnesseenentatvivideeoprdeavtiaouclsulyst-eurnsce∗ecnanclbipe rre∗pretsheantteids 98 SOebatracihn tthhee tcrleues-tleervseCl nte=ar{escti}cl⊂ustCerocf∗tXo(cid:96)ft(xx∗∗); closest to the cluster centroid µc∗ (Fig. 4(e)). Concate- within Ct (Eqn. (11)); nating these key-clips chronologically establishes a visual 10 end summary.Suchasummary,however,islikelytobediscon- 11 Estimate tag distribution p(yi x∗) (Eqn. (14)); | tinuous in preserving visual context therefore non-smooth 12 Compute the final tag yˆi (Eqn. (15)). visually due to abrupt changes between adjacent key-clips. Toenforcesomedegreesofsmoothnessinthevisualisation ofvideosummarywhilstminimisingredundancy,weadopt 3.2 VideoTagging a shortest path strategy [51] to induce an optimal path between two temporally-adjacent representative r∗ on a Summarising video with high-level interpretation requires graph G. This approach produces a visually more coherent plausible semantic content inference from video data x∗. video summary whilst discards as much redundancy as We derive a tree-structure aware tag inference algorithm possible. capable of predicting tag types same as training non-visual data, based on the learned MSC-Forest and discovered Moreprecisely,weconstructagraphG=(V,E),where training data clusters. Specifically, we first obtain the tree- V and E indicate the set of previously-unseen video clip level nearest cluster c∗ of a previously-unseen sample x∗ verticesandedges(Fig.4(d)).Theweightsofedgescanbe t using Eqn. (11). Second, the p(y c∗) associated with c∗ is efficiently estimated using Eqn. 
(8) and (9). Note that the i| t t utilised as the tree-level non-visual tag estimation for the graph G is also multi-source referenced since it is derived ith non-visual data type. To achieve a smooth prediction, fromourmulti-sourceMSC-Forestmodel.Wethenperform we average all p(y c=c∗) obtained from individual trees shortest path search between temporally-adjacent r∗ on G i| t as (Fig. 4(f)) and all the samples that lie on the shortest paths p(y x∗)= 1 (cid:88)Tclustp(y c∗). (14) compose the final key-clip set K (Fig. 4(g)). i| Tclust t=1 i| t 9 The final tag yˆ for the ith non-visual type is obtained as tableofcampuseventsincluding:NoScheduledEvent(No i Schd. Event), Cleaning, Career Fair, Gun Forum Control yˆ =argmax p(y x∗). (15) i yi i| and Gun Violence (Gun Forum), Group Studying, Scholar- With the above steps, we can estimate all m non-visual ship Competition (Schlr. Comp.), Accommodative Service tagsyˆswithi 1,...,m .Theprocedureofourtagging (Accom. Service), Student Orientation (Stud. Orient.). i algorithm is su∈mm{arised in}Algorithm 1. Note that other visual features and non-visual data types Given the extracted key-clips and automatic assign- can be considered without altering the training and infer- ment of non-visual semantic tagsK(Eqn. (15)), we can now encemethodsofourmodelinthattheMSC-Forestmodelis construct a video summary by chronologically concatenat- capable of coping with different families of visual features ing each clip x∗ with smooth inter-clip transition, as well as distinct types of non-visual sources. ∈ K e.g.crossfading, and labelling each clip with their inferred Baselines - To evaluate the proposed method for multi- semantic tags. 
source video clustering and tag inference, we compared the Visual + Non-Visual + MSC-Forest (VNV-MSC-Forest) model against the following baseline models: 4 EXPERIMENTAL SETTINGS 1) VO-Forest:aconventionalforest[35]trainedwithvi- Datasets - We conducted experiments on two datasets sualfeaturevectorsalone,todemonstratethebenefits collected from publicly accessible webcams that feature an from using non-visual sources7. outdoor and an indoor scene respectively: (1) the TImes 2) VNV-Kmeans:k-meansusingconcatenatedvectorsof Square Intersection (TISI) dataset, and (2) the Educational visual and non-visual features, to highlight the het- ResourceCentre(ERCe)dataset5.Thereareatotalof7324 eroscedasticity and dimensionality discrepancy prob- videoclipsspanningover14daysintheTISIdataset,whilst lem caused by heterogeneous visual and non-visual atotalof13817clipswerecollectedacrossaperiodoftwo data. monthsintheERCedataset.Eachcliphasadurationof20 3) VNV-Forest: a conventional forest [35] trained with seconds.Thedetailsofthedatasetsandtraining/deployment concatenated visual and non-visual feature vectors, partitions are given in Table 1. Example frames are shown to compare the effectiveness of MSC-Forest that in Fig. 5. exploits non-visual data during forest formation. The TISI dataset is challenging due to severe inter- 4) VNV-AASC: a state-of-the-art multi-source spectral object occlusion, complex behaviour patterns, and large clustering method [15] learned by treating each type illumination variations caused by both natural and artificial ofvisualornon-visualfeatureasanindividualsource, lightsourcesatdifferentdaytime.TheERCedatasetisnon- to demonstrate the superiority of MSC-Forest in trivial due to a wide range of physical events involved that handling diverse data representations and correlating are characterised by large changes in environmental setup, multiple sources. participants, crowdedness, and intricate activity patterns. 
5) VNV-MSC-Forest-hard: a variant of our model using hard cluster assignment strategy for inferring seman- TABLE1 tic tags of previously-unseen samples (Section 3.2), Detailsofdatasets.FPS=framespersecond. to highlight the effectiveness of the proposed tree - Resolution FPS #TrainingClip #DeploymentClip structure based tag inference algorithm. TISI 550×960 10 5819 1505 6) VT-MSC-Forest: a variant of our model using only ERCe 480×640 5 9387 4430 temporal information and visual data. In order to show the exact effectiveness of exploiting non-visual Visual and non-visual sources - We extracted the follow- data, the weight ratio between visual data and time ing set of visual features for representing visual content retainsthesameasinVNV-MSC-Forestwiththeonly in each clip: (a) colour features including RGB and HSV; differenceofdiscardingnon-visualdataduringmodel (b) local texture features based on Local Binary Pattern training. (LBP) [52]; (c) optical flow; (d) holistic features of the 7) VPNVρ-MSC-Forest: a variant of our model but with scene based on GIST [53]; and (e) person and vehicle6 ρ% of training samples having arbitrary number of detection [54]. missing non-visual types, to evaluate the robustness Wecollected10typesofnon-visualsourcesfortheTISI of MSC-Forest in coping with partial/missing non- dataset: (a) weather data extracted from the WorldWeath- visual data. erOnline with 9 elements: temperature, weather type, wind Implementation details - The clustering forest size T clust speed, wind direction, precipitation, humidity, visibility, was set to 1000, including both the conventional forest pressure, and cloud cover; (b) traffic speed data from the and the proposed MSC-Forest. We observed a slight in- GoogleMapswith4levelsoftrafficspeed:veryslow,slow, crease in performance given a larger forest size, which moderate,andfast.FortheERCedataset,wecollecteddata agrees with [16]. 
The training set X of the tth MSC- t from multiple independent on-line sources about the time tree was obtained by performing random selection with 5.Datasetsavailable:www.eecs.qmul.ac.uk/%7Exz303/download.html 7.Evaluatingaforestthattakesonlynon-visualinputsisnotpossible, 6.NovehicledetectionontheERCedataset. sincenon-visualdataisnotavailableforpreviously-unseenvideofootages. 10 (a) (b) Fig.5. Examplesofthe(a)TISIand(b)ERCedatasets. TABLE2 replacementfromtheaugmenteddataspace(Fig.3(b)).We Compareclusterpurityinmeanentropy.Loweris setm =√dwithdthedatafeaturedimension(Eqn.(2)). try better. This is typically practiced [35]. We employed linear data separation[16]asthetestfunctionfornodesplitting.Weset Dataset TISI ERCe thesamenumberofclustersacrossallmethods.Thiscluster p(y|c) trafficspeed weather event number was discovered automatically using the method VO-Forest 0.8675 1.0676 0.0616 presented in [50]. For each dataset, 75% out of the total VNV-Kmeans 0.9197 1.4994 1.2519 ∼ VNV-Forest 0.8611 1.0889 0.0811 datawasutilisedformodeltraining,andtheremainingwas VNV-AASC 0.7217 0.7039 0.0691 reserved for testing. Additional previously-unseen video VT-MSC-Forest 0.7275 0.9577 0.0580 datawascollectedfromtheTimeSquareIntersectionscene VNV-MSC-Forest 0.7262 0.6071 0.0024 VPNV10-MSC-Forest 0.7190 0.6261 0.0024 on a separate day for video summarisation. VPNV20-MSC-Forest 0.7283 0.6497 0.0090 5 EVALUATIONS when we increase the non-visual data missing proportion, 5.1 Multi-SourceClustering overall the VNV-MSC-Forest model copes well with par- To evaluate the effectiveness of different clustering models tial/missing non-visual data. With no aid of non-visual tag for multi-source video clustering, we compared the qual- information, VT-MSC-Forest forms much worse clusters. ity of their clusters formed on the training dataset. 
For Whilst the superiority of VT-MSC-Forest over VO-Forest determining clustering quality, we quantitively measured suggests the effectiveness of temporal information with the mean entropy [55] of non-visual distributions p(y c) i| MSC-Forest. Inferior performance of VO-Forest to VNV- (Eqn.(10))associatedwithtrainingdataclusterstoevaluate MSC-Forest suggests the importance of learning from aux- how coherent video content are partitioned, assuming all iliarynon-visualsources.Nevertheless,notallmethodsper- methods have access to non-visual data during the entropy form equally well when learning from the same visual and computation. non-visual sources: the k-means and AASC perform much It is evident from Table 2 that our VNV-MSC-Forest poorer in comparison to MSC-Forest. The results suggest achieves the best cluster purity on both datasets8. Despite the proposed joint information gain criterion (Eqn. (4)) that there are gradual degradations in clustering quality is more effective in handling heterogeneous data than the conventional clustering models. 8.VNV-MSC-Forest-hard shares the same clusters as VNV-MSC- Forest. For qualitative comparison, we show examples in Fig. 6