Learning from Multiple Sources for Video
Summarisation
Xiatian Zhu, Student Member, IEEE, Chen Change Loy, Member, IEEE, and Shaogang Gong
Abstract—Many visual surveillance tasks, e.g. video summarisation, are conventionally accomplished through analysing imagery-based features. Relying solely on visual cues for public surveillance video understanding is unreliable, since visual observations obtained from public space CCTV video data are often not sufficiently trustworthy and events of interest can be subtle. We believe that non-visual data sources such as weather reports and traffic sensory signals can be exploited to complement visual data for video content analysis and summarisation. In this paper, we present a novel unsupervised framework to learn jointly from both visual and independently-drawn non-visual data sources for discovering meaningful latent structure of surveillance video data. In particular, we investigate ways to cope with discrepant dimension and representation whilst associating these heterogeneous data sources, and derive an effective mechanism for tolerating missing and incomplete data from different sources. We show that the proposed multi-source learning framework not only achieves better video content clustering than state-of-the-art methods, but is also capable of accurately inferring missing non-visual semantics from previously-unseen videos. In addition, a comprehensive user study is conducted to validate the quality of the video summarisation generated using the proposed multi-source model.

Index Terms—Multi-source data, heterogeneous data, visual surveillance, clustering, event recognition, video summarisation.
Visual features and descriptors are often carefully designed and exploited as the sole input for surveillance video content analysis and summarisation. For instance, optical or particle flow is typically employed in activity modelling [1], [2], [3], foreground pixel features are used for multi-camera video understanding [4], space-time image gradients are adopted for crowd analysis [5], and mixtures of dynamic textures are used for video segmentation [6] and anomaly detection [7].

A critical task in visual surveillance is to automatically make sense of massive amounts of video data by summarising their content using higher-level intrinsic physical events¹, beyond low-level key-frame visual feature statistics and/or object detection counts. In most contemporary techniques, low-level imagery visual cues are typically exploited as the only information source for video summarisation [8], [9], [10], [11], [12]. On the other hand, in complex and cluttered public scenes there are intrinsically more interesting and salient higher-level events that can provide a more meaningful and concise summarisation of the video data. However, such events may not be visually well-defined (easily detectable) nor detected reliably by visual cues alone. In particular, surveillance visual data from public spaces is often inaccurate and/or incomplete due to uncontrollable sources of variation, changes in illumination, occlusion, and background clutter [13].

In this study, we wish to exploit non-visual auxiliary information to complement the unilateral perspective given by visual observations. Examples of non-visual sources include weather reports, GPS-based traffic data, geo-location data, textual data from social networks, and on-line event schedules. These auxiliary data sources are beneficial to visual data modelling because, despite the fact that visual and non-visual data may have very different characteristics and natures, they depict common physical phenomena in a scene. They are intrinsically correlated, although the correlation may be mostly indirect, residing in some latent space. Effectively discovering and exploiting such a latent correlation space can facilitate the discovery of the underlying data structure and bridge the semantic gap between low-level visual features and high-level semantic interpretation.

Challenges - Nevertheless, it is non-trivial to formulate a framework that exploits both visual and non-visual data for video content analysis and summarisation, both algorithmically and in practice.

Algorithmically, unsupervised mining of latent correlations and interactions between heterogeneous data sources faces a number of challenges: (1) Disparate sources significantly differ in representation (continuous or categorical), and vary largely in scale and covariance². In addition, the dimension of visual sources often exceeds that of the non-visual information to a great extent (>2000 visual dimensions vs. <10 non-visual dimensions). Owing to this dimensionality discrepancy problem, a straightforward concatenation of features will result in a representation unfavourably inclined towards the imagery data. (2) Both visual and non-visual data in isolation can be inaccurate and incomplete.

• Xiatian Zhu and Shaogang Gong are with the School of Electronic Engineering and Computer Science, Queen Mary University of London. E-mail: xiatian.zhu@qmul.ac.uk, s.gong@qmul.ac.uk
• Chen Change Loy is with the Department of Information Engineering, The Chinese University of Hong Kong. E-mail: ccloy@ie.cuhk.edu.hk

1. Spatiotemporal combinations of human activity or interaction patterns, e.g. gathering, or environmental state changes, e.g. raining.
2. Also known as the heteroscedasticity problem [14].
Fig. 1. The overview of the proposed multi-source driven video summarisation framework. We consider a novel setting where multiple heterogeneous sources are present during the model training stage. The proposed Multi-Source Clustering Forest discovers and exploits latent correlations among heterogeneous visual and non-visual data sources, both of which can be inaccurate and not trustworthy. In deployment, our model uncovers visual content structures and infers semantic tags on previously-unseen video data for video summarisation.
In practice, auxiliary data sources, e.g. weather reports, traffic reports, and event time tables, may be rather unreliable in availability. Specifically, the reports may not be released on-the-fly at a time stamp synchronised with the surveillance video stream. In addition, existing video control rooms may not necessarily have direct access to these sources. This renders models that expect complete visual and non-visual information during deployment impractical.

Our solution - In this study, we address this multi-source learning problem in the context of video summarisation, which is conventionally based on visual feature analysis and object detection/segmentation. In particular, we formulate a novel framework that is capable of performing joint learning given heterogeneous multi-sources (Fig. 1). We consider visual data as the main source and non-visual data as the auxiliary sources, since we believe visual information still plays the main role in video content analysis. During training, we assume access to both visual and non-visual data. The model performs multi-source data clustering and discovers a set of visual clusters, each tagged with non-visual data distributions, e.g. over different weathers and traffic speeds. We term this model the multi-source model. During the deployment stage, we only assume the availability of previously-unseen video data, since non-visual data may not be accessible due to the aforementioned limitations. Since the learned model has already captured the latent structure of the heterogeneous types of data sources, the model can be used for semantic video clustering and non-visual tag inference on previously-unseen video sequences, even without the non-visual data. Subsequently, key clips are automatically selected from the discovered clusters. The final summary video can be produced by chronologically compositing these key clips, enriched by the inferred tags.

Contributions - The main contributions of this work are:
1) We propose a unified multi-source learning framework capable of discovering semantic structures of video content collectively from heterogeneous visual and non-visual data. This is made possible by formulating a novel Multi-Source Clustering Forest (MSC-Forest) that seamlessly handles multiple heterogeneous data sources dissimilar in representation, distribution, and dimension. Although both visual and non-visual data in isolation can be inaccurate and incomplete, our model is capable of uncovering and subsequently exploiting their shared latent correlation for better data structure discovery.
2) The model is novel in its ability to accommodate partially or completely missing non-visual sources. In particular, we introduce a joint information gain function that is capable of dynamically adapting to an arbitrary amount of missing non-visual information during model learning. In model deployment, only visual input is required for inferring the missing non-visual semantics.

Extensive comparative evaluations are conducted on two public surveillance videos captured from indoor and outdoor environments respectively. Comparative results show that the proposed model not only outperforms the state-of-the-art methods [15], [16] for video content clustering and structure discovery, but is also superior in predicting non-visual tags for previously-unseen videos. The robustness of the proposed model is further validated by a user study on video summary quality.
1 RELATED WORK
Multi-modality learning - There exist studies that exploit different sensory or information modalities from a single source for data structure mining. For example, Cai et al. [17] propose to perform multi-modal image clustering by learning a commonly shared graph-Laplacian matrix from different visual feature modalities. Heer and Chi [18] linearly combine individual similarity matrices derived from multi-modal webpages for web user grouping. Karydis et al. [19] present a tensor-based model to cluster music items with additional tags. In terms of video analysis, the auditory channel and/or transcripts have been widely explored for detecting semantic concepts in multimedia videos [20], [21], summarising highlights in news and broadcast programmes [22], [23], or locating speakers [24]. User tags associated with web videos (e.g. YouTube) have also been utilised [25], [26], [27]. In contrast, surveillance videos captured from public spaces typically come without auditory signals, synchronised transcripts, or user tags. Instead, we wish to explore alternative non-visual data drawn independently elsewhere from multiple sources, with the inherent challenges of being inaccurate and incomplete, unsynchronised with, and possibly in conflict with the observed visual data.

Multi-source learning - An alternative multi-source learning mechanism is the clustering ensemble [28], [29], where a collection of clustering instances is generated and then aggregated into the final clustering solution. Typically only a single data source is considered, but the approach can easily be extended to handle multi-source data, e.g. by creating a separate clustering instance for each source. Nonetheless, cross-source correlation is ignored, since the clustering instances are formed separately and no interaction between them is involved. A closer approach to ours is Affinity Aggregation Spectral Clustering (AASC) [15], which learns data structure from multiple types of homogeneous information (visual features only). Their method independently generates multiple affinity matrices by exhaustive pairwise distance computation for every pair of samples in every data source. It suffers from an unwieldy representation given high-dimensional data inputs. Importantly, although it seeks an optimal weighted combination of the distinct affinity matrices, it does not consider correlations between different sources in model learning, similar to the clustering ensemble [28], [29]. Differing from the above models, our Multi-Source Clustering Forest overcomes these problems by generating a unified single affinity matrix that captures latent correlations among heterogeneous types of data sources. Furthermore, our model has a unique advantage over [28], [29], [15] in handling missing non-visual data.

Video summarisation - Contemporary video summarisation methods can be broadly classified into two paradigms: key-frame-based [11], [30], [31], [32], [33] and object-based [9], [10], [34] methods. The key-frame-based approaches select representative key-frames by analysing low-level imagery properties, e.g. optical flow [30], image differences [31], or an object's appearance and motion [11], to form a storyboard of still images. Object-based techniques [9], [10], on the other hand, rely on object segmentation and tracking to extract object-centric trajectories/tubes, and compress those tubes to reduce spatiotemporal redundancy. Both of the above schemes utilise solely visual information and make implicit assumptions about the completeness and accuracy of the visual data available for extracting features or object-centred representations. They are neither suitable nor scalable for complex scenes where visual data are inherently incomplete and inaccurate, which is mostly the case in surveillance videos. Our work differs significantly from these studies in that we exploit not only visual data without object tracking, but also non-visual sources as complementary information. The summary generated by our approach is semantically enriched – it is labelled automatically with semantic tags, e.g. traffic condition, weather, or event. All these tags are learned from heterogeneous non-visual sources in an unsupervised manner during model training, without any manual labels.

Random forests - Random forests [35], [16] have proven to be powerful models in the literature. Different variants of random forests have been devised, either supervised [36], [37], [38], [39], [40] or unsupervised [41], [42], [43], [44], [45]. Supervised models are not suitable for our problem since we do not assume the availability of ground truth labels during model training. Existing clustering forest models, on the other hand, assume only homogeneous data sources such as pure imagery-based features. No principled way of combining multiple heterogeneous and independent data sources in forest models is available.

2 MULTI-SOURCE CLUSTERING
Video summarisation by content abstraction aims to generate a compact summary composed of key/interesting content from a long previously-unseen video, for achieving efficient holistic understanding [32]. A common way to establish a video summary is by extracting and then combining a set of key frames or shots. These key contents are usually discovered and selected from clusters of video frames or clips [32].

In this study, we follow the aforementioned approach, but consider not only the visual content of the video, but also a large corpus of non-visual data collected from heterogeneous independent sources (Fig. 2(a)). Specifically, through learning the latent structure of multi-source data (Fig. 2(b-c)), we wish to make reference to and/or impose non-visual semantics directly onto video clustering without any human manual annotation of video data (Fig. 2(d)). Formally, we consider the following different data sources that form a multi-source input feature space:

Visual features - We segment a training video into N either overlapping or non-overlapping clips, each of which has a duration of T_clip seconds. We then extract a d-dimensional visual descriptor from the ith video clip, denoted by x_i = (x_{i,1}, ..., x_{i,d}) ∈ R^d, i = 1, ..., N.

Non-visual data - Non-visual data are collected from heterogeneous independent sources. We collectively represent the m types of non-visual data associated with the ith clip as y_i = (y_{i,1}, ..., y_{i,m}) ∈ R^m, i = 1, ..., N. Note that any (or all) dimensions of y_i may be missing.
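As a minimal illustration of this multi-source input space (a Python/NumPy sketch; the variable names and the NaN convention for missing entries are our own, not the paper's), each clip carries a dense visual descriptor together with a short non-visual vector in which any entry may be absent:

```python
import numpy as np

# N clips, d-dimensional visual descriptors (d is typically > 2000),
# m types of non-visual data (m is typically < 10).
N, d, m = 5819, 2048, 10
X = np.random.rand(N, d)   # visual descriptors x_i
Y = np.random.rand(N, m)   # non-visual data y_i

# Any (or all) non-visual dimensions may be missing for a clip;
# marking them with NaN is one simple convention.
Y[0, 3] = np.nan                          # e.g. traffic speed missing for clip 0
missing_rate = np.isnan(Y).mean(axis=0)   # per-source missing proportion delta_i
```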
Fig. 2. Multi-source model training stage: the pipeline of performing multi-source clustering on visual and non-visual data with the proposed Multi-Source Clustering Forest (MSC-Forest). (Panels: (a) visual data {x_i} and non-visual data {y_i}; (b) the MSC-Forest of trees 1..T_clust yielding an affinity matrix A; (c) graph partitioning; (d) clusters 1..k with non-visual distributions p(y_i|c).)

We aim at formulating a unified clustering model capable of coping with the challenges highlighted in the introduction. The model needs to be unsupervised, since no ground truth is assumed. To mitigate the heteroscedasticity and dimension discrepancy problems, we require a model that can isolate the very different characteristics of visual and non-visual data, yet can still exploit their latent correlation in the clustering process. To handle noisy data, feature selection is also necessary.

In light of the above demands, we choose to start with the clustering random forest [35], [41], [42] due to: (1) its unsupervised information gain optimisation, which requires no ground truth labels; (2) its flexible objective function, which facilitates the modelling of multi-source data as well as the processing of missing data; and (3) its implicit feature selection mechanism for handling noisy features. Nevertheless, the conventional clustering forest is not well suited to solving these challenges, since it expects a full concatenated representation as input during both model training and deployment. This does not conform to the assumption that only visual data are available during model deployment on previously-unseen videos. Moreover, due to its uniform variable selection mechanism [35] (i.e. each feature dimension has the same probability of being selected as a candidate optimal splitting variable), there is no principled way to ensure balanced contributions from the individual visual and non-visual sources in the node splitting process. To overcome these limitations, we propose a new Multi-Source Clustering Forest (MSC-Forest) by introducing a new objective function allowing joint optimisation of the individual information gains of different sources. We first describe conventional forests prior to detailing the proposed MSC-Forest.

2.1 Conventional Random Forests
Classification forests - A general form of random forests is the classification forest. A classification forest [35], [46] is an ensemble of T_class binary decision trees $\mathcal{T}(\mathbf{x}): \mathcal{X} \to \mathbb{R}^K$, with $\mathcal{X}$ the d-dimensional feature space, and $\mathbb{R}^K = [0,1]^K$ denoting the space of class probability distributions over the label space $\mathcal{L} = \{1, \dots, K\}$.

Decision trees are learned independently of each other, each with a random subset X_t of the training samples X = {x_i}, i.e. bagging [35]. Growing a decision tree involves a recursive node splitting procedure until some stopping criterion is satisfied, e.g. leaf nodes are formed when no further split can be achieved given the objective function, or when the number of training samples arriving at a node is smaller than a predefined node size φ. A small φ leads to deep trees. We set φ = 2 in our experiments to capture sufficiently fine-grained data structure. At each leaf node, the class probability distribution is then estimated based on the labels of the arriving samples.

The training of each internal/split node is a process of binary split function optimisation, defined as

$h(\mathbf{x}, \vartheta) = \begin{cases} 0 & \text{if } x_{\vartheta_1} < \vartheta_2, \\ 1 & \text{otherwise.} \end{cases}$  (1)

This split function is parameterised by two parameters $\vartheta = [\vartheta_1, \vartheta_2]$: (i) a feature dimension $x_{\vartheta_1}$, with $\vartheta_1 \in \{1, \dots, d\}$, and (ii) a feature threshold $\vartheta_2 \in \mathbb{R}$. All samples of a split node s are channelled to either the left l or right r child node, according to the output of Eqn. (1).

The optimal split parameter $\vartheta^*$ is chosen via

$\vartheta^* = \arg\max_{\Theta} \Delta\mathcal{I}_{class},$  (2)

where $\Theta = \{\vartheta^i\}_{i=1}^{m_{try}(|S|-1)}$ represents a parameter set over $m_{try}$ randomly selected features, with S the sample set reaching the node s. The cardinality of a set is given by $|\cdot|$. Typically, a greedy search strategy is exploited to identify $\vartheta^*$. The information gain $\Delta\mathcal{I}_{class}$ is formulated as

$\Delta\mathcal{I}_{class} = \mathcal{I}_s - \frac{|L|}{|S|}\mathcal{I}_l - \frac{|R|}{|S|}\mathcal{I}_r,$  (3)

where L and R denote the sets of data routed into l and r, with $L \cup R = S$. The impurity $\mathcal{I}$ can be computed as either the entropy or the Gini impurity [47].

Clustering forests - In contrast to classification forests, clustering forests require no ground truth label information during the training phase. A clustering forest consists of T_clust binary decision trees, and the leaf nodes in each tree define a spatial partitioning of the training data. Interestingly, the training of a clustering forest can be performed using the classification forest optimisation approach by adopting the pseudo two-class algorithm [35], [41], [42]. Specifically, we add N pseudo samples $\bar{\mathbf{x}} = \{\bar{x}_1, \dots, \bar{x}_d\}$ (Fig. 3(b)) to the original data space X (Fig. 3(a)), with $\bar{x}_i \sim Dist(x_i)$ sampled from certain distributions $Dist(x_i)$. In the proposed model, we adopt the empirical marginal distributions of the feature variables owing to their favourable performance [42]. With this data augmentation strategy, the clustering problem becomes a canonical classification problem that can be solved by the classification forest training method discussed above. The key idea behind this algorithm is to partition the augmented data space into dense and sparse regions (Fig. 3(c-d)) [41].

Fig. 3. An illustration of clustering toy data with a clustering forest. (a) Original toy data are labelled as class 1, whilst (b) the pseudo-points (red +) are labelled as class 2. (c) A clustering forest performs two-class classification in the augmented space. (d) The final data partitions on the original data.
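To make the pseudo two-class construction concrete, the following minimal Python sketch draws pseudo samples from the empirical marginal distributions of the features and trains a binary classification forest to separate real from pseudo data. This is a plain clustering forest in the pseudo two-class sense, not the full MSC-Forest (which additionally needs the joint gain of Eqn. (4) below); scikit-learn is used for brevity and all names are illustrative, not the authors' implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_pseudo_two_class_forest(X, n_trees=1000, seed=0):
    """Clustering-forest sketch: real samples are one class, pseudo
    samples drawn from the empirical marginal feature distributions
    are the other."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Sample each pseudo feature independently from the corresponding
    # column of X, i.e. from the empirical marginal distribution.
    X_pseudo = np.column_stack(
        [rng.choice(X[:, j], size=n, replace=True) for j in range(d)])
    X_aug = np.vstack([X, X_pseudo])
    y_aug = np.concatenate([np.ones(n), np.zeros(n)])  # real vs. pseudo
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        min_samples_split=2,     # node size phi = 2 (sklearn's closest analogue)
        max_features="sqrt",     # m_try = sqrt(d), as in Section 4
        random_state=seed)
    return forest.fit(X_aug, y_aug)
```

The trained forest's leaf assignments (e.g. `forest.apply(X)`) are exactly what the affinity construction of Section 2.3 consumes.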
2.2 Multi-Source Clustering Forest
Conventional clustering forests assume only homogeneous data sources, such as pure imagery-based features. In contrast, the proposed Multi-Source Clustering Forest can take heterogeneous sources as input. In particular, the proposed model uses visual features as splitting variables to grow Multi-Source Clustering trees (MSC-trees) as in Eqn. (1), and exploits the non-visual information as additional data to help determine $\vartheta = [\vartheta_1, \vartheta_2]$. In this way, auxiliary non-visual information is used, in addition to the visual data, to guide the tree formation.

Formally, we define a new joint information gain function for node splitting during the training of MSC-trees as:

$\Delta\mathcal{I} = \alpha_v \underbrace{\frac{\Delta\mathcal{I}_v}{\mathcal{I}_{v0}}}_{\text{visual}} + \underbrace{\sum_{j=1}^{m} \alpha_j \frac{\Delta\mathcal{I}_j}{\mathcal{I}_{j0}}}_{\text{non-visual}} + \alpha_t \underbrace{\frac{\Delta\mathcal{I}_t}{\mathcal{I}_{t0}}}_{\text{temporal}}.$  (4)

Similar to Eqn. (3), the optimal parameter corresponds to the split with the maximal $\Delta\mathcal{I}$. This formulation defines the best data split across the joint space of multi-source data, beyond the visual domain alone. All the terms in Eqn. (4) are interpreted as below.

Visual term: $\Delta\mathcal{I}_v = \Delta\mathcal{I}_{class}$ (Eqn. (3)) denotes the information gain in the visual domain. Precisely, this measure is computed from the pseudo class labels. Therefore, it reflects the visual data structure characteristics, given that the pseudo data samples are drawn from the marginal feature distributions (Section 2.1). In this study we utilise the Gini impurity $\mathcal{G}$ [47] to estimate $\Delta\mathcal{I}_{class}$, setting $\mathcal{I} = \mathcal{G}$ in Eqn. (3), due to its simplicity and efficiency. The Gini impurity is computed as $\mathcal{G} = \sum_{i \neq j} p_i p_j$, with $p_i$ and $p_j$ the proportions of samples belonging to the ith and jth category in a split node s. A low value of $\mathcal{G}$ indicates a pure category distribution.

Non-visual term: This is a new term we introduce to bring auxiliary information alongside the visual term. More specifically, $\Delta\mathcal{I}_j$ denotes the information gain in the jth non-visual data. A non-visual source can be either categorical or continuous. For a categorical non-visual source, similar to the visual term, we use the Gini impurity $\mathcal{G}$ as its data split criterion. In the case of a non-visual source with continuous values, we adopt least squares regression [47] to enforce continuity in the clustering space:

$\mathcal{R} = \frac{1}{|S|}\sum_{i=1}^{|S|}\left(y_{i,j} - \frac{1}{|S|}\sum_{i=1}^{|S|} y_{i,j}\right)^2,$  (5)

where $y_{i,j}$ represents the value in the jth non-visual space associated with the ith sample $\mathbf{x}_i \in S$, and S is the set of samples reaching node s. That is, $\mathcal{I}_j = \mathcal{R}$.

Temporal term: We also add a temporal smoothness gain $\Delta\mathcal{I}_t$ to encourage temporally adjacent video clips to be grouped together. This temporal information helps in mining the visual data structure.

The information gains of different sources may live in very disparate ranges due to the different natures of the sources; each term of Eqn. (4) is therefore normalised by its initial data impurity, denoted by $\mathcal{I}_{v0}$, $\mathcal{I}_{j0}$, and $\mathcal{I}_{t0}$. These impurities are measured at the root node of every MSC-tree. The source weights are denoted by $\alpha_v$, $\alpha_i$, and $\alpha_t$ accordingly, with $\alpha_v + \sum_{i=1}^{m}\alpha_i + \alpha_t = 1$. We set $\alpha_v = 0.5$, obtained by cross-validation; a detailed analysis of $\alpha_v$ is given in Section 5.2. For the non-visual and temporal information, we uniformly assign $\alpha_i = \alpha_t = \frac{1-\alpha_v}{m+1}$, since their importance is not known a priori, with m the number of non-visual sources.

The role of different source data - Given the main role and the much more stable provision of the visual source in video understanding, the non-visual data are regarded as auxiliary information over the visual source. During the training of the MSC-Forest, the split functions (Eqn. (1)) are defined on visual features, but $\vartheta = [\vartheta_1, \vartheta_2]$ is collectively determined by the visual features and the associated non-visual as well as temporal information (i.e. the non-visual and temporal terms in Eqn. (4)). Alternatively, one can think of the main visual data source as 'completely-visible' to the MSC-Forest, since it is needed during both forest training and evaluation, whilst the auxiliary non-visual data are 'half-visible', in that they are exploited as side information for embedding their knowledge into the MSC-tree growing during model training, but are no longer required during MSC-Forest evaluation (due to their restricted availability, as explained in the introduction).
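The following Python sketch renders Eqn. (4) concretely (function names, the data layout, and the NaN convention for missing values are our own assumptions): each candidate split is scored by the weighted sum of per-source gains, each normalised by the impurity measured at the tree's root node.

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_i p_i^2 of a categorical label array."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def variance(values):
    """Least-squares impurity for a continuous non-visual source
    (Eqn. (5)), ignoring missing (NaN) entries."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    return float(v.var()) if v.size else 0.0

def source_gain(impurity, node, left, right):
    """Per-source information gain in the form of Eqn. (3)."""
    n = len(node)
    return (impurity(node)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

def joint_gain(sources, alphas, root_impurities):
    """Eqn. (4): weighted sum of per-source gains, each normalised by
    the root-node impurity.  `sources` is a list of tuples
    (impurity_fn, node_values, left_values, right_values), ordered as:
    visual pseudo-labels, the m non-visual sources, then clip times."""
    total = 0.0
    for (imp, node, l, r), alpha, i0 in zip(sources, alphas, root_impurities):
        if i0 > 0:
            total += alpha * source_gain(imp, node, l, r) / i0
    return total
```

A categorical source (weather type, pseudo class labels) would pass `gini` as its impurity function, a continuous source (traffic speed, time) would pass `variance`.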
Joint information gain - We interpret the intrinsic advantage of the joint information gain defined by Eqn. (4) through a comparison against the naïve feature concatenation strategy. With the latter scheme, the information gain (Eqn. (3)) is directly estimated in a heterogeneous joint space where visual, non-visual and temporal data are mixed together. This would suffer from the heteroscedasticity problem, as discussed in the introduction. Instead, Eqn. (4) overcomes this challenge by modelling the different sources via separate information gain terms, resulting in a more balanced exploitation of multi-source data. In this way, the proposed joint information gain of multi-source data encourages more appropriate visual data separation, both visually and semantically. This formulation is the essential contribution of our proposed MSC-Forest model.

The merits of MSC-Forest - The formulation in Eqn. (4) brings two unique benefits: (A) Thanks to the information gain optimisation, the influences of the visual and non-visual domains on data partitioning can be better balanced compared with naïve feature concatenation. (B) Eqn. (2) and Eqn. (4) together provide a mechanism to discover strongly correlated heterogeneous source pairs and to exploit the joint information gain of such correlated pairs for data partitioning. In other words, only selective visual features (Eqn. (2)) that yield high information gain collectively with the non-visual information (Eqn. (4)) will contribute to the MSC-tree growing. Such a mechanism cannot be realised using conventional clustering forests [35], [41]. We shall demonstrate the multi-source correlation discovered by our proposed MSC-Forest in the experiments (Section 5.4).

2.2.1 Coping with Partial/Missing Non-Visual Data
We introduce a new adaptive weighting mechanism to dynamically deal with the inevitable partial/missing non-visual data³. Specifically, when some non-visual data are missing, supposing the missing proportion of the ith non-visual type in the training set X_t for MSC-tree t is $\delta_i$, we reduce its weight from $\alpha_i$ to $\alpha_i - \delta_i\alpha_i$. The total reduced weight $\sum_i \delta_i\alpha_i$ is then distributed evenly over the weights of all sources to ensure $\alpha_v + \sum_{i=1}^{m}\alpha_i + \alpha_t = 1$. This linear adaptive weighting method produces satisfactory results in our experiments.

3. There exist missing-data filling algorithms utilised in conventional random forests, e.g. for the missing value of one feature in one class, the median value (continuous) or the most frequent category (discrete) of this feature over the current class can be used as the estimate [48]. Whilst a similar strategy could be applied to our MSC-Forest, we consider an alternative by proposing an effective adaptive weighting algorithm, in order not to introduce further noisy training data.

2.2.2 Model Complexity
The upper-bound learning complexity of a whole MSC-Forest can be examined from its constituent parts, i.e. at the tree and node levels. Formally, given an MSC-tree t, we denote the set of all its split nodes as $\Pi_t$ and the sample subset used for training a split node $j \in \Pi_t$ as $S_j$. The training complexity of the jth node is given by $m_{try}(|S_j|-1)u$ when a greedy search algorithm is adopted, with $m_{try}$ the number of features attempted in partitioning $S_j$, and u the running time of conducting one data splitting operation. Consequently, the overall computational cost of learning an MSC-Forest can be computed as

$\Omega = \sum_{t=1}^{T_{clust}} \sum_{j \in \Pi_t} m_{try}(|S_j|-1)\,u = m_{try}\,u \sum_{t=1}^{T_{clust}} \sum_{j \in \Pi_t} (|S_j|-1).$  (6)

The value of the parameter $m_{try}$ is identical across all MSC-trees. The learning time is thus determined by (1) the value of u, and (2) a factor that we name the tree fan-in:

$\Phi(t) = \sum_{j \in \Pi_t} (|S_j| - 1).$  (7)

Clearly, the u of an MSC-Forest is larger than that of conventional forests, since we need to compute the additional information gains of the non-visual and temporal information (Eqn. (4)). On the other hand, the value of $\Phi(t)$ primarily relies on the tree structure/topological characteristics [49]: a balanced and shallower tree has a smaller $\Phi(t)$, and thus the tree shall be more efficient in training and in inference on previously-unseen samples, in that the paths from the root to the leaf nodes are relatively shorter. In Section 5.5, we will show that the additional non-visual information encourages more balanced and shallower decision trees than learning from the single visual source alone.
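As a concrete rendering of the adaptive weighting scheme of Section 2.2.1 (a sketch under our own naming; the paper specifies only the update rule itself, and whether the redistribution includes the reduced source is our reading), the weight removed from each partially missing source is redistributed evenly so that the weights still sum to one:

```python
import numpy as np

def adapt_weights(alpha, delta):
    """alpha: weights [alpha_v, alpha_1..alpha_m, alpha_t] summing to 1.
    delta: missing proportion of each source (0 for visual and time).
    Each alpha_i is reduced to alpha_i - delta_i * alpha_i, and the total
    reduction is spread evenly over all sources (Section 2.2.1)."""
    alpha = np.asarray(alpha, dtype=float)
    delta = np.asarray(delta, dtype=float)
    reduced = delta * alpha                 # weight removed per source
    alpha = alpha - reduced
    alpha += reduced.sum() / alpha.size     # redistribute evenly
    assert np.isclose(alpha.sum(), 1.0)
    return alpha

# Example: alpha_v = 0.5, m = 2 non-visual sources plus time, so
# alpha_i = alpha_t = (1 - 0.5) / (m + 1) = 1/6; one source 30% missing.
print(adapt_weights([0.5, 1/6, 1/6, 1/6], [0.0, 0.3, 0.0, 0.0]))
```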
2.3 Latent Multi-Source Data Structure Discovery
The multi-source feature space has a high dimension (over 2000 dimensions). This makes learning the data structure by clustering computationally difficult. To this end, we consider spectral clustering on a manifold to discover latent clusters in a lower-dimensional space. Fig. 2 depicts the pipeline of our video data clustering approach based on the learned MSC-Forest.

Spectral clustering [50] groups data using the eigenvectors of an affinity matrix derived from the data. The goodness of the resulting cluster formation primarily relies on the quality of the input affinity matrix, which should reflect and embed the essential data structures [45]. Below we describe the details of constructing a multi-source referenced affinity matrix from the MSC-Forest. Intuitively, the multi-source learning nature of the MSC-Forest renders its data similarity measure sensitive to the joint knowledge from diverse source data.

The learned MSC-Forest offers an effective way to derive the required affinity matrix. Specifically, each individual tree within the MSC-Forest partitions the training samples at its leaves, $\ell(\mathbf{x}): \mathbb{R}^d \to L \subset \mathbb{N}$, where $\ell$ represents a leaf index and L refers to the set of all leaves in a given tree. For each MSC-tree, we first compute a tree-level N × N affinity matrix $A^t$ with elements defined as $A^t_{i,j} = \exp^{-dist(\mathbf{x}_i, \mathbf{x}_j)}$, where

$dist(\mathbf{x}_i, \mathbf{x}_j) = \begin{cases} 0 & \text{if } \ell(\mathbf{x}_i) = \ell(\mathbf{x}_j), \\ +\infty & \text{otherwise.} \end{cases}$  (8)

We assign the maximum affinity (affinity = 1, distance = 0) between points $\mathbf{x}_i$ and $\mathbf{x}_j$ if they fall into the same leaf, and the minimum affinity (affinity = 0, distance = +∞) otherwise. A smooth affinity matrix can be obtained through averaging all the tree-level affinity matrices:

$A = \frac{1}{T_{clust}} \sum_{t=1}^{T_{clust}} A^t.$  (9)

Eqn. (9) is adopted as the ensemble model of the MSC-Forest due to its advantage of suppressing noisy tree predictions, though other alternatives, such as the product of tree-level predictions, are possible [16]. We then construct a sparse k-NN graph whose edge weights are defined by A (Fig. 2(c)).

Subsequently, we symmetrically normalise A to obtain $\mathcal{S} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, where D denotes a diagonal degree matrix with elements $D_{i,i} = \sum_{j}^{N} A_{i,j}$. Given $\mathcal{S}$, we perform spectral clustering to discover the latent clusters of the training clips, with the number of clusters automatically determined through analysing the eigenvector structure [50]. Each training clip $\mathbf{x}_i$ is then assigned to a cluster $c_i \in \mathcal{C}$, with $\mathcal{C}$ the set of all clusters.

The learned clusters group similar clips both visually and semantically, with each of the clusters associated with a unique distribution for each non-visual data type (Fig. 2(d)). We denote the distribution of the ith non-visual data type of the cluster c as

$p(y_i|c) \propto \sum_{\mathbf{x}_j \in X_c} p(y_i|\mathbf{x}_j),$  (10)

where $X_c$ represents the set of training samples in c. These multi-source data clusters form a component of our multi-source model (Fig. 1).

3 SEMANTIC VIDEO SUMMARISATION
In Section 2 we presented multi-source data clustering by learning a Multi-Source Clustering Forest (MSC-Forest), resulting in a semantic cluster formation. Once this multi-source model is learned, it can be deployed for semantic video summarisation. Specifically, we follow the established approach of summarising videos by clustering [32], but with the introduction of two noticeable differences in our method.

First, our video summary is multi-source referenced. Specifically, since the MSC-Forest is trained on heterogeneous sources, its optimised split functions {h} (Eqn. (1)) implicitly capture the complex multi-source structures. When one deploys the trained model for content summarisation of previously-unseen video data, the model only needs to take visual inputs, without any non-visual data sources. And yet it is able to induce video content partitions that not only correspond to visual feature similarities, but are also consistent with meaningful non-visual semantic interpretations. Second, our video summary is automatically tagged as the result of model inference. This is made possible through exploiting the non-visual data distributions associated with the clusters discovered on the training data (see Eqn. (10) and Fig. 2(d)). Below we discuss the details of generating a semantic video summary.

3.1 Key-Clip Extraction and Composition
Suppose we are given previously-unseen surveillance video footage without meta-data tagging/script. The video is pre-processed by segmenting it into a set of M either overlapping or non-overlapping short clips $\{\mathbf{x}^*_i\}_{i=1}^{M}$ of equal duration. Our aim is to first assign a cluster membership to each previously-unseen clip using the trained multi-source model, and then to select key-clips from the resulting clusters⁴. The chosen key-clips are then chronologically ordered to construct a video summary.

Clustering previously-unseen video clips - Inferring the cluster memberships of previously-unseen clips is an intricate task. A straightforward method is to assign cluster membership by identifying the nearest cluster $c^* \in \mathcal{C}$ to a sample $\mathbf{x}^*$, where $\mathcal{C}$ represents the set of clusters we discovered in Section 2.3. However, we found this hard cluster assignment strategy susceptible to outliers in $\mathcal{C}$ and to source noise. To mitigate this problem, we consider an alternative approach utilising the MSC-Forest tree structures for soft cluster assignment, which is more robust to both source noise and outliers.

Fig. 4 depicts the soft cluster assignment pipeline. First, we trace the leaf $\ell_t(\mathbf{x}^*)$ of each tree t where $\mathbf{x}^*$ falls, by channelling $\mathbf{x}^*$ into the tree (Fig. 4(a)). This step is critical as it establishes a connection between $\mathbf{x}^*$ and an appropriate training subset $X_{\ell_t(\mathbf{x}^*)}$ using the split functions $\{h\}_t$ optimised by the multi-source data. Here, $X_{\ell_t(\mathbf{x}^*)}$ represents the set of training samples associated with $\ell_t(\mathbf{x}^*)$. This set is consistent with $\mathbf{x}^*$ both visually and semantically, since its samples produce identical responses w.r.t. $\{h\}_t$.

Second, we retrieve the cluster memberships $C_t = \{c_i\} \subset \mathcal{C}$ of $X_{\ell_t(\mathbf{x}^*)}$, within which we search for the tree-level nearest cluster $c^*_t$ for $\mathbf{x}^*$ (Fig. 4(b)) via

$c^*_t = \arg\min_{c \in C_t} \|\mathbf{x}^* - \mu_c\|,$  (11)

with t the tree index and $\mu_c$ the centroid of cluster c, estimated as

$\mu_c = \frac{1}{|X_c|}\sum_{\mathbf{x}_i \in X_c} \mathbf{x}_i,$  (12)

where $X_c$ represents the set of training samples in c. Performing the nearest cluster search within $C_t$, rather than in the whole cluster space $\mathcal{C}$, brings a key benefit: since the search space is constrained by the MSC-tree, it is more meaningful and also less noisy than the entire space $\mathcal{C}$, leading to a more accurate $c^*_t$ estimation.

4. It is worth noticing that the purpose of this clustering step is completely different from that of the multi-source data clustering during model training, as presented in Section 2.3. The latter is a component of our multi-source model training pipeline (Fig. 2), whilst the former aims at revealing the latent structure of testing data for video summarisation.
Fig. 4. The pipeline of our multi-source referenced key-clip detection algorithm. (a) Channel a clip x* into the MSC-trees. (b) Search the tree-level nearest clusters of x*; a hollow circle denotes a cluster. (c) Predict the final nearest cluster by maximal vote. A red star depicts a representative previously-unseen clip. (Per the remaining panels: (d) the multi-source referenced similarity graph/affinity matrix over previously-unseen clips; (e) extraction of a representative clip for each cluster; (f) shortest path search between adjacent representatives; (g) the resulting key-clips.)
Once we obtain all the tree-level nearest clusters from all the trees in the forest, $\{c^*_t\}_{t=1}^{T_{clust}}$, the final nearest cluster $c^*$ is obtained as the one with the maximal votes from all the trees (Fig. 4(c)):

$c^* = \max\,\{c^*_t\}_{t=1}^{T_{clust}},$  (13)

where max denotes the majority vote, i.e. the cluster receiving the most tree-level votes. By repeating the above steps for all previously-unseen clips $\{\mathbf{x}^*_i\}_{i=1}^{M}$, we obtain their cluster labels $\mathcal{C}_L = \{c^*_i\}_{i=1}^{M}$ (Fig. 4(e)).

Extracting key-clips - With the cluster memberships $\mathcal{C}_L$ assigned to all previously-unseen clips, the key-clip of a previously-unseen video data cluster $c^*$ can be represented by the representative previously-unseen clip $r^*$ that is closest to the cluster centroid $\mu_{c^*}$ (Fig. 4(e)). Concatenating these key-clips chronologically establishes a visual summary. Such a summary, however, is likely to be discontinuous in preserving visual context, and therefore visually non-smooth, due to abrupt changes between adjacent key-clips. To enforce some degree of smoothness in the visualisation of the video summary whilst minimising redundancy, we adopt a shortest path strategy [51] to induce an optimal path between two temporally-adjacent representatives $r^*$ on a graph G. This approach produces a visually more coherent video summary whilst discarding as much redundancy as possible.

More precisely, we construct a graph G = (V, E), where V and E indicate the sets of previously-unseen video clip vertices and edges (Fig. 4(d)). The weights of the edges can be efficiently estimated using Eqns. (8) and (9). Note that the graph G is also multi-source referenced, since it is derived from our multi-source MSC-Forest model. We then perform shortest path search between temporally-adjacent representatives $r^*$ on G (Fig. 4(f)), and all the samples that lie on the shortest paths compose the final key-clip set $\mathcal{K}$ (Fig. 4(g)).

Algorithm 1: Infer non-visual tags of previously-unseen clips.
Input: A previously-unseen clip x*, a trained MSC-Forest, training data clusters C;
Output: Predicted tag yˆ_i;
1 Initialisation:
2 Compute p(y_i|c) for each training data cluster (Eqn. (10));
3 Compute the cluster centroids μ_c (Eqn. (12));
4 Non-Visual Tag Inference:
5 for t ← 1 to T_clust do
6   Trace the leaf ℓ_t(x*) where x* falls (Fig. 4(a));
7   Retrieve the training samples X_{ℓ_t(x*)} associated with ℓ_t(x*);
8   Obtain the cluster memberships C_t = {c_i} ⊂ C of X_{ℓ_t(x*)};
9   Search the tree-level nearest cluster c*_t of x* within C_t (Eqn. (11));
10 end
11 Estimate the tag distribution p(y_i|x*) (Eqn. (14));
12 Compute the final tag yˆ_i (Eqn. (15)).

3.2 Video Tagging
Summarising video with high-level interpretation requires plausible semantic content inference from the video data x*. We derive a tree-structure-aware tag inference algorithm, capable of predicting the same tag types as the training non-visual data, based on the learned MSC-Forest and the discovered training data clusters. Specifically, we first obtain the tree-level nearest cluster $c^*_t$ of a previously-unseen sample $\mathbf{x}^*$ using Eqn. (11). Second, the $p(y_i|c^*_t)$ associated with $c^*_t$ is utilised as the tree-level non-visual tag estimate for the ith non-visual data type. To achieve a smooth prediction, we average the $p(y_i|c = c^*_t)$ obtained from the individual trees as

$p(y_i|\mathbf{x}^*) = \frac{1}{T_{clust}}\sum_{t=1}^{T_{clust}} p(y_i|c^*_t).$  (14)

The final tag $\hat{y}_i$ for the ith non-visual type is obtained as

$\hat{y}_i = \arg\max_{y_i} p(y_i|\mathbf{x}^*).$  (15)
With the above steps, we can estimate all m non-visual tags $\hat{y}_i$, with $i \in \{1, \dots, m\}$. The procedure of our tagging algorithm is summarised in Algorithm 1.

Given the extracted key-clips $\mathcal{K}$ and the automatic assignment of non-visual semantic tags (Eqn. (15)), we can now construct a video summary by chronologically concatenating each clip $\mathbf{x}^* \in \mathcal{K}$ with smooth inter-clip transitions, e.g. crossfading, and labelling each clip with its inferred semantic tags.

4 EXPERIMENTAL SETTINGS
Datasets - We conducted experiments on two datasets collected from publicly accessible webcams that feature an outdoor and an indoor scene respectively: (1) the TImes Square Intersection (TISI) dataset, and (2) the Educational Resource Centre (ERCe) dataset⁵. There are a total of 7324 video clips spanning over 14 days in the TISI dataset, whilst a total of 13817 clips were collected across a period of two months for the ERCe dataset. Each clip has a duration of 20 seconds. The details of the datasets and the training/deployment partitions are given in Table 1. Example frames are shown in Fig. 5.

The TISI dataset is challenging due to severe inter-object occlusion, complex behaviour patterns, and large illumination variations caused by both natural and artificial light sources at different times of day. The ERCe dataset is non-trivial due to the wide range of physical events involved, which are characterised by large changes in environmental setup, participants, crowdedness, and intricate activity patterns.

TABLE 1
Details of datasets. FPS = frames per second.

Dataset   Resolution   FPS   #Training Clips   #Deployment Clips
TISI      550×960      10    5819              1505
ERCe      480×640      5     9387              4430

Visual and non-visual sources - We extracted the following set of visual features for representing the visual content of each clip: (a) colour features, including RGB and HSV; (b) local texture features based on Local Binary Patterns (LBP) [52]; (c) optical flow; (d) holistic features of the scene based on GIST [53]; and (e) person and vehicle⁶ detections [54].

We collected 10 types of non-visual sources for the TISI dataset: (a) weather data extracted from WorldWeatherOnline, with 9 elements: temperature, weather type, wind speed, wind direction, precipitation, humidity, visibility, pressure, and cloud cover; (b) traffic speed data from Google Maps, with 4 levels of traffic speed: very slow, slow, moderate, and fast. For the ERCe dataset, we collected data from multiple independent on-line sources about the time table of campus events, including: No Scheduled Event (No Schd. Event), Cleaning, Career Fair, Gun Forum Control and Gun Violence (Gun Forum), Group Studying, Scholarship Competition (Schlr. Comp.), Accommodative Service (Accom. Service), and Student Orientation (Stud. Orient.).

Note that other visual features and non-visual data types can be considered without altering the training and inference methods of our model, in that the MSC-Forest model is capable of coping with different families of visual features as well as distinct types of non-visual sources.

Baselines - To evaluate the proposed method for multi-source video clustering and tag inference, we compared the Visual + Non-Visual + MSC-Forest (VNV-MSC-Forest) model against the following baseline models:
1) VO-Forest: a conventional forest [35] trained with visual feature vectors alone, to demonstrate the benefits of using non-visual sources⁷.
2) VNV-Kmeans: k-means using concatenated vectors of visual and non-visual features, to highlight the heteroscedasticity and dimensionality discrepancy problems caused by heterogeneous visual and non-visual data.
3) VNV-Forest: a conventional forest [35] trained with concatenated visual and non-visual feature vectors, to compare against the effectiveness of the MSC-Forest, which exploits the non-visual data during forest formation.
4) VNV-AASC: a state-of-the-art multi-source spectral clustering method [15] learned by treating each type of visual or non-visual feature as an individual source, to demonstrate the superiority of the MSC-Forest in handling diverse data representations and correlating multiple sources.
5) VNV-MSC-Forest-hard: a variant of our model using the hard cluster assignment strategy for inferring the semantic tags of previously-unseen samples (Section 3.2), to highlight the effectiveness of the proposed tree-structure-based tag inference algorithm.
6) VT-MSC-Forest: a variant of our model using only temporal information and visual data. In order to show the exact effectiveness of exploiting non-visual data, the weight ratio between visual data and time remains the same as in VNV-MSC-Forest, with the only difference being the discarding of non-visual data during model training.
7) VPNVρ-MSC-Forest: a variant of our model, but with ρ% of the training samples having an arbitrary number of missing non-visual types, to evaluate the robustness of the MSC-Forest in coping with partial/missing non-visual data.

Implementation details - The clustering forest size T_clust was set to 1000 for both the conventional forests and the proposed MSC-Forest. We observed a slight increase in performance given a larger forest size, which agrees with [16]. The training set X_t of the tth MSC-tree was obtained by performing random selection with replacement from the augmented data space (Fig. 3(b)). We set $m_{try} = \sqrt{d}$, with d the data feature dimension (Eqn. (2)), as is typically practiced [35].

5. Datasets available: www.eecs.qmul.ac.uk/%7Exz303/download.html
6. No vehicle detection on the ERCe dataset.
7. Evaluating a forest that takes only non-visual inputs is not possible, since non-visual data is not available for previously-unseen video footage.
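For reference, the settings reported above and in Section 2 can be gathered into a single configuration sketch (the dataclass packaging is our own, hypothetical; the values are those stated in the paper):

```python
import math
from dataclasses import dataclass

@dataclass
class MSCForestConfig:
    """Settings reported in the paper; packaging is illustrative only."""
    n_trees: int = 1000          # forest size T_clust
    node_size: int = 2           # phi, for fine-grained structure (Sec. 2.1)
    alpha_v: float = 0.5         # visual source weight, via cross-validation
    clip_duration_s: int = 20    # clip length used for both datasets
    train_fraction: float = 0.75 # ~75% of data used for model training

    def m_try(self, d: int) -> int:
        """Number of randomly selected features per split, m_try = sqrt(d)."""
        return int(math.sqrt(d))

cfg = MSCForestConfig()
print(cfg.m_try(2048))  # e.g. 45 candidate feature dimensions per split
```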
We employed linear data separation [16] as the test function for node splitting. We set the same number of clusters across all methods; this cluster number was discovered automatically using the method presented in [50]. For each dataset, ∼75% of the total data was utilised for model training, and the remainder was reserved for testing. Additional previously-unseen video data was collected from the Times Square Intersection scene on a separate day for video summarisation.

Fig. 5. Examples of the (a) TISI and (b) ERCe datasets.

TABLE 2
Comparison of cluster purity in mean entropy. Lower is better.

                     TISI                      ERCe
p(y|c)               traffic speed   weather   event
VO-Forest            0.8675          1.0676    0.0616
VNV-Kmeans           0.9197          1.4994    1.2519
VNV-Forest           0.8611          1.0889    0.0811
VNV-AASC             0.7217          0.7039    0.0691
VT-MSC-Forest        0.7275          0.9577    0.0580
VNV-MSC-Forest       0.7262          0.6071    0.0024
VPNV10-MSC-Forest    0.7190          0.6261    0.0024
VPNV20-MSC-Forest    0.7283          0.6497    0.0090

5 EVALUATIONS
5.1 Multi-Source Clustering
To evaluate the effectiveness of the different clustering models for multi-source video clustering, we compared the quality of the clusters they formed on the training dataset. For determining the clustering quality, we quantitatively measured the mean entropy [55] of the non-visual distributions p(y_i|c) (Eqn. (10)) associated with the training data clusters, to evaluate how coherently the video content is partitioned, assuming all methods have access to the non-visual data during the entropy computation.

It is evident from Table 2 that our VNV-MSC-Forest achieves the best cluster purity on both datasets⁸. Despite gradual degradations in clustering quality as we increase the non-visual data missing proportion, overall the VNV-MSC-Forest model copes well with partial/missing non-visual data. With no aid from non-visual tag information, VT-MSC-Forest forms much worse clusters, whilst the superiority of VT-MSC-Forest over VO-Forest suggests the effectiveness of the temporal information within the MSC-Forest. The inferior performance of VO-Forest relative to VNV-MSC-Forest suggests the importance of learning from the auxiliary non-visual sources. Nevertheless, not all methods perform equally well when learning from the same visual and non-visual sources: k-means and AASC perform much more poorly in comparison to the MSC-Forest. The results suggest that the proposed joint information gain criterion (Eqn. (4)) is more effective in handling heterogeneous data than the conventional clustering models. For qualitative comparison, we show examples in Fig. 6

8. VNV-MSC-Forest-hard shares the same clusters as VNV-MSC-Forest.
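The purity measure used in Table 2 is the mean entropy, over clusters, of the per-cluster non-visual distributions p(y_i|c) from Eqn. (10); lower values indicate more semantically coherent clusters. A sketch of how such a score can be computed (Python/NumPy; our own naming, not the authors' evaluation code):

```python
import numpy as np

def mean_cluster_entropy(cluster_labels, tags):
    """Mean entropy, over clusters, of the empirical distribution of one
    non-visual tag (e.g. weather) within each cluster; lower is better."""
    entropies = []
    for c in np.unique(cluster_labels):
        members = tags[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        p = counts / counts.sum()
        entropies.append(-np.sum(p * np.log2(p)))
    return float(np.mean(entropies))
```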