Joint Key-frame Extraction and Object Segmentation for Content-based Video Analysis*

Xiaomu Song, Member, IEEE, and Guoliang Fan, Senior Member, IEEE†

Abstract

Key-frame extraction and object segmentation are usually implemented independently and separately, because they operate on different semantic levels and involve different features. In this work, we propose a joint key-frame extraction and object segmentation method by constructing a unified feature space for both processes, where key-frame extraction is formulated as a feature selection process for object segmentation in the context of Gaussian mixture model (GMM)-based video modeling. Specifically, two divergence-based criteria are introduced for key-frame extraction. One recommends key-frame extraction that leads to the maximum pairwise interclass divergence between GMM components. The other aims at maximizing the marginal diversity, which shows the intra-frame variation of the mean density. The proposed methods can extract representative key-frames for object segmentation, and some interesting characteristics of key-frames are also discussed. This work provides a unique paradigm for content-based video analysis.

Index Terms: Key-frame extraction, object segmentation, Gaussian mixture model, feature selection, cluster divergence.

* This work is supported in part by the National Science Foundation (NSF) under Grant IIS-0347613 (CAREER) and the Department of Defense EPSCoR (DEPSCoR) under Grant W911NF-04-1-0221. This work is partially published in the IEEE Workshop on Motion and Video Computing, Breckenridge, Colorado, Jan. 5-6, 2005, and the IEEE International Conference on Acoustics, Speech and Signal Processing, Philadelphia, PA, March 18-23, 2005.

† X. Song was with the School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK 74078, USA. He is now with Northwestern University, Evanston, IL 60208, USA, email: [email protected]. G. Fan is with the School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK 74078, USA, email: [email protected]. Guoliang Fan is the contact author.

1 Introduction

How to bridge the semantic gap between low-level features and high-level concepts has been a long-standing problem in content-based video analysis [1, 2, 3]. In this paper, we address this issue by jointly studying two video analysis tasks, i.e., key-frame extraction and object segmentation. Key-frames are those frames that are important for understanding video content, and the definition of a key-frame is quite subjective: it could be related to motion, objects, or events. Objects usually refer to either regions with homogeneous features (e.g., color, motion) or meaningful real-world entities, which may be composed of one or multiple regions [4]. In this paper, we use the term object in the former sense. Usually, key-frame extraction and object segmentation are implemented independently and separately using different features. Low-level color and motion features, which are computationally efficient for time-critical applications, are often used for key-frame extraction [5]. The extracted key-frames usually indicate significant changes in the feature space that have limited semantic meaning. We call key-frames semantically meaningful if they imply some object-related behaviors/events. Supervised methods are often used to enrich the semantic meaning of key-frames by incorporating certain templates or domain knowledge [2, 3], e.g., for news, sports, etc.
The unsupervised process could also extract semantically meaningful key-frames if object information is involved during key-frame extraction [1, 6]. Object segmentation provides better interpretability and manipulability of video data than key-frame extraction, but it is more challenging. In [7], most object segmentation methods are classified into three categories: segmentation with spatial priority, segmentation with temporal priority, and joint spatio-temporal segmentation; the last attracts more and more attention [8, 9, 10]. The idea of joint spatio-temporal video segmentation is consistent with the nature of human vision, which recognizes salient video structures in space and time simultaneously [11]. A mean shift clustering approach was proposed in [8] to segment objects in space and time. The Gaussian mixture model (GMM) was applied to joint spatio-temporal video characterization in [9]. A method using graph partition theory was also suggested in [10] for joint spatio-temporal video modeling. Interesting issues arise if the two processes can be considered jointly. For example, when objects are represented as clusters in a feature space, the spatio-temporal relationship of the clusters implies certain object behaviors or events, such as running away/toward, appearing/disappearing, enlarging/shrinking, etc., and the extracted key-frames may imply such object behaviors. In [1], the positions of segmented regions are used to extract key-frames where objects merge. In [6], shape features are used to extract key-frames implying changes of human gestures. Moreover, it was noted that key-frames may facilitate object segmentation in the context of GMM-based video modeling [12]. An initial set of key-frames is first selected based on the color histogram and used to estimate a GMM for object segmentation [5]; the segmentation results and the trained GMM are then used to refine the initial key-frames. This method significantly reduces the computational load and improves the robustness of video segmentation. Since key-frame extraction and object segmentation are implemented sequentially, in different feature spaces and under different criteria, in [12], this method is called a "combined" approach.

In this work, we propose a joint key-frame extraction and object segmentation method that extends our previous work in [12]. The idea is to formulate key-frame extraction as a feature selection process for object segmentation in a unified feature space. Specifically, in the context of GMM-based modeling [13, 9], a video sequence is represented by spatio-temporal feature clusters, which are characterized by the multivariate Gaussian components (MGCs) of a GMM. The separability between clusters is estimated by MGC-based divergence measurements, and the frames leading to the largest cluster separability are extracted as key-frames. Two divergence criteria are used here: the maximum average interclass Kullback-Leibler distance (MAIKLD), and the maximum marginal diversity (MMD), which is defined as the average distance between each of the marginal class-conditional densities and their mean [14]. Compared with previous GMM-based video segmentation methods [9, 13, 12], key-frames with large cluster divergence can facilitate GMM-based video modeling and support robust object segmentation, leading to more homogeneous segmentation results.
More interestingly, since key-frame extraction is guided by the cluster divergence-based criteria, the extracted key-frames, as a by-product of this process, may contain some object behavior/event information characterized by these spatio-temporal clusters. The key-frames extracted by the proposed method can carry object-level semantic information that is not available in key-frames extracted using low-level features only. However, the major limitation of the proposed method, and of those under the same framework [12, 9, 13], is the inefficiency of the GMM in handling complex objects composed of different low-level features (e.g., colors and motions). Nevertheless, it is an efficient step of early vision. By involving region-based features, it can be further integrated with complementary techniques to produce higher-level, semantically meaningful results [4, 15]. It is worth mentioning that the major purpose of this work is to improve the performance of object segmentation by finding an optimal/sub-optimal key-frame set for GMM estimation; key-frame extraction is a by-product of this process. The proposed method is a preliminary study towards new tools for flexible and informative content-based video analysis that may introduce new frame/object descriptors and functionalities for the MPEG-4/-7 industrial standards.

2 Joint Key-frame Extraction and Object Segmentation

Key-frame extraction and object segmentation have been intensively studied in the past. We first briefly review several relevant methods. In [9], a probabilistic framework for spatio-temporal video modeling was proposed, where an object (homogeneous region) is characterized by a Gaussian "blob" in a spatio-temporal feature space containing color (L, a, b), time (t), and spatial coordinates (x and y). A video of M objects is modeled by an M-order GMM. The EM algorithm is applied to estimate the model parameters, and the MDL criterion is used to find a proper M [16]. After GMM estimation, the video is segmented into M spatio-temporal blobs via maximum a posteriori (MAP) classification. A piece-wise implementation is proposed in [9] to handle nonlinear and non-convex motion patterns. One major bottleneck of this method is the high computational load, because all video frames are involved in GMM estimation.

In [12], a combined key-frame extraction and object segmentation approach was proposed to improve the efficiency and robustness of GMM estimation. An initial set of key-frames is first extracted using the frame-wise 16x8 2-D Hue and Saturation color histogram [5], and the GMM is estimated from the key-frames. After object segmentation, each initial key-frame is modeled by a GMM that is used for key-frame refinement. This method considerably mitigates the computational load and improves the robustness of model estimation by involving a compact and discriminative feature set from the key-frames. In addition, the GMM-based key-frame refinement is able to provide more compact key-frames. This combined approach raises three further interesting questions: (1) How can the optimality of the extracted key-frames be warranted in terms of GMM estimation or object segmentation? (2) Can we jointly optimize key-frame extraction and object segmentation? (3) If the answer to question (2) is yes, is there any semantically useful information (such as an object's behavior) carried by the extracted key-frames?
In this work, we address these issues by proposing a joint key-frame extraction and object segmentation method to explore the possible relationship and synergy between them.

2.1 Problem Formulation

In contrast to performing key-frame extraction and object segmentation with different features, we propose to implement them within a unified feature space, as shown in Fig. 1.

Figure 1: An example of the unified feature space: an input video shot has three major objects.

In this figure, a video shot of N frames contains three major objects, represented as three clusters in the feature space. Usually, the frames within a shot represent a spatially and temporally continuous action, sharing common visual and often semantic-related characteristics. Consequently, there is tremendous redundancy. In addition, irrelevant outliers, i.e., noise and insignificant objects that may randomly appear around frame boundaries, increase the cluster overlap in the feature space. Both redundancy and irrelevance decrease the efficiency of statistical modeling. Therefore, modeling performance can be improved by removing redundant and irrelevant data/features, or in other words, by selecting the most relevant and compact data/features to boost the learning process [17]. In the case of GMM-based object segmentation, this can be done by selecting the most relevant key-frames for video modeling; that is, key-frame extraction is formulated as a feature selection process for object segmentation.

Feature selection methods have been intensively studied, as reviewed in [18]. Given an initial candidate feature set $X_0 = \{x_i \mid i = 1, 2, \cdots, n\}$, feature selection aims at selecting a subset $\tilde{X}$ of $X_0$ such that a criterion function $F(\tilde{X})$ related to classification performance is optimized:

$$\tilde{X} = \arg\max_{X \subseteq X_0} F(X). \quad (1)$$

It is important to choose an appropriate $F(X)$. One commonly used criterion is to select features that approximate the true density, rather than to extract the most discriminative features. Although it is often hoped that this criterion also leads to good discrimination among classes, the assumption is not always valid, and for robust classification, divergence-based feature selection criteria are preferred [19]. In the following, we introduce two divergence-based criteria for feature selection, based on which the new joint key-frame extraction and object segmentation method is developed.

2.2 Maximum Average Interclass Kullback-Leibler Distance (MAIKLD)

The Kullback-Leibler distance (KLD) measures the distance, or dissimilarity, between two MGCs that model two clusters. Given M clusters characterized by M MGCs, the average interclass KLD (AIKLD) is defined as:

$$\mathrm{AIKLD}(X) = C \sum_{i=1}^{M} \sum_{j=1, j \neq i}^{M} \frac{\mathrm{KLD}(p_i, p_j) + \mathrm{KLD}(p_j, p_i)}{2}, \quad (2)$$

where $\mathrm{KLD}(p_i, p_j) = \int p_i(x) \ln \frac{p_i(x)}{p_j(x)} dx$ is the KLD between MGCs $p_i(x)$ and $p_j(x)$, and $C = \frac{2}{M(M-1)}$. Ideally, the larger the AIKLD, the greater the separability between clusters.
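For concreteness, the following sketch illustrates this formulation: it builds a unified per-pixel feature space, fits a GMM to it, and evaluates Eq. (2) with the closed-form KLD between Gaussian components. This is our illustration rather than the authors' implementation: scikit-learn's GaussianMixture is assumed as a stand-in for the EM estimation (with BIC, a criterion closely related to MDL, selecting the order M), OpenCV supplies the (L, a, b) conversion, and all function names are ours. (Section 2.4 later uses (Y, u, v) color; any luminance-chrominance space serves the sketch equally well.)

```python
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

def build_feature_space(frames):
    """Stack every pixel of every frame as a 6-D vector (L, a, b, x, y, t)."""
    h, w = frames[0].shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    feats = []
    for t, frame in enumerate(frames):            # frames: equal-size BGR images
        lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB).reshape(-1, 3)
        feats.append(np.column_stack([lab.astype(np.float64),
                                      xs.ravel().astype(np.float64),
                                      ys.ravel().astype(np.float64),
                                      np.full(h * w, float(t))]))
    return np.vstack(feats)

def fit_gmm(features, orders=range(2, 8)):
    """Fit GMMs of several orders M and keep the best one.
    BIC stands in for the paper's MDL criterion; both penalize model order."""
    return min((GaussianMixture(n_components=m, covariance_type='full',
                                random_state=0).fit(features) for m in orders),
               key=lambda g: g.bic(features))

def gauss_kld(mu0, cov0, mu1, cov1):
    """Closed-form KLD(N0 || N1) between two multivariate Gaussians."""
    d, inv1, diff = len(mu0), np.linalg.inv(cov1), mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def aikld_closed_form(gmm):
    """AIKLD of Eq. (2): symmetrized KLD averaged over all MGC pairs."""
    mus, covs = gmm.means_, gmm.covariances_
    m = len(mus)
    total = sum(gauss_kld(mus[i], covs[i], mus[j], covs[j]) +
                gauss_kld(mus[j], covs[j], mus[i], covs[i])
                for i in range(m) for j in range(m) if i != j)
    return total / (m * (m - 1))   # C and the 1/2 of Eq. (2) folded together
```

Fitting on every pixel of every frame is precisely the computational bottleneck noted for [9] above; the point of the key-frame selection that follows is to feed the estimator far fewer, more discriminative pixels.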
Since key-frame extraction is formulated as a feature selection process, we want to extract key-frames where the average pairwise cluster divergence is maximized. Let $X_0 = \{x_i \mid 1 \leq i \leq N\}$ be the original video shot with N frames, represented as a set with cardinality $|X_0| = N$, and let $X = \{x_i^* \mid 1 \leq i \leq N^*\}$ be any subset of $X_0$ with cardinality $|X| = N^* \leq N$. If there are M objects in the shot, the objective function is defined as:

$$\tilde{X} = \arg\max_{X \subseteq X_0, |X| \leq N} \mathrm{AIKLD}(X), \quad (3)$$

where $\tilde{X}$ is a subset of $X_0$ that is optimal in terms of MAIKLD. According to [20], MAIKLD is optimal in the sense of minimum Bayes error. If we use the zero-one classification cost function, this leads to MAP estimation. Therefore, an optimal solution to (3) results in an optimal key-frame set that minimizes the error probability of object segmentation. An exhaustive search can guarantee the optimality of $\tilde{X}$; nevertheless, it is computationally expensive and impractical for a large $X_0$, since $2^N$ frame subsets would have to be tried. Therefore, a suboptimal but computationally efficient solution is more practical. A deterministic feature selection method called Sequential Forward Floating Selection (SFFS) is used here [21], and the sequential forward selection (SFS) method is applied to initialize SFFS [18]. When N is not very large, SFFS may even find the optimal $\tilde{X}$. In this work, key-frames are extracted from $N_0 \leq N$ so-called key-frame candidates. After the GMM estimation reaches its optimal estimate in terms of MDL, the KLD between MGCs $p_i$ and $p_j$ is approximately computed as

$$\mathrm{KLD}(p_i, p_j) \approx \frac{1}{K} \sum_{m=1}^{K} \log \frac{p(o_m \mid \theta_i)}{p(o_m \mid \theta_j)},$$

where $\{o_m \mid m = 1, \cdots, K\}$ are all pixel feature vectors of the key-frame candidates under test, and $\theta_i$ denotes the parameters of the $i$th MGC. Then AIKLD is calculated using (2). The search process is as follows:

(1) Start with an empty set $\tilde{X}$; let $n$ be the cardinality of $\tilde{X}$, i.e., $n = |\tilde{X}|$, with $n = 0$ initially.
(2) Use SFS to generate a combination of two key-frame candidates that maximizes AIKLD, so that $|\tilde{X}| = 2$.
(3) Search for the single key-frame candidate that maximizes AIKLD when $|\tilde{X}| = n + 1$, add it to $\tilde{X}$, and let $n = n + 1$.
(4) If $n > 2$, remove one key-frame candidate from $\tilde{X}$, compute AIKLD based on the key-frame candidates remaining in $\tilde{X}$, and go to (5); otherwise go to (3).
(5) Determine whether AIKLD increases after removing the selected key-frame candidate. If it does, let $n = n - 1$ and go to (4); otherwise go to (3).

The search stops when $n$ reaches a predefined number or after a given number of iterations. Besides its efficiency compared with methods using all frames [9], this method has two major advantages: (1) optimal or near-optimal key-frame sets in terms of MAIKLD are extracted for model estimation, and these key-frames provide better discriminability for GMM-based object segmentation than those extracted by the color histogram [12]; (2) the algorithm is flexible and almost threshold-free. However, some issues need further consideration. First, SFFS is not very efficient when $N_0$ is very large. Second, the MDL-based GMM estimation that precedes key-frame extraction is time consuming. An alternative approach is to perform SFFS based on a high-order GMM and carry out the MDL-based GMM estimation on the key-frames only. However, with a high-order GMM, the video is over-segmented, and more clusters originate from one semantic object.
To increase the divergence between the clusters from the same object, MAIKLD then favors frames with more outliers, leading to "noisy" key-frames. Another possible approach is to use SFS for key-frame extraction. However, SFS is unable to remove redundant key-frame candidates from the candidate sets under testing. In order to reduce the computational load, we suggest another divergence-based criterion.

2.3 Maximum Marginal Diversity

In [14], a maximum marginal diversity (MMD) criterion is proposed for efficient feature selection based on the infomax principle [22], which recommends preserving maximum information about input behavior while minimizing information redundancy. In the context of classification, it tends to select features that maximize the mutual information (MI) between the features and the class labels [14]. When the infomax principle is applied to this work, the objective function is written as:

$$\tilde{X} = \arg\max_{X \subseteq X_0, |X| \leq N} I(X; Y), \quad (4)$$

where $I(X;Y)$ is the MI between the key-frame subset $X$ and the class label $Y = \{1, 2, \cdots, M\}$: $I(X;Y) = \sum_{x_i \in X} \sum_{y_j \in Y} p(x_i, y_j) \ln \frac{p(x_i, y_j)}{p(x_i) p(y_j)}$.

Consider $I(X;Y) = H(Y) - H(Y|X)$, where $H(Y)$ is the entropy of the class label and $H(Y|X)$ is the conditional entropy. A relation between the tightest lower bound on the Bayes error and $H(Y|X)$ is derived in [14]; it indicates that minimizing $H(Y|X)$ (the infomax principle) is equivalent to minimizing a lower bound on the Bayes error. $I(X;Y)$ can be rewritten as [14]:

$$I(X;Y) = \sum_{y_j \in Y} p(y_j)\,\mathrm{KLD}(p(X \mid y_j), p(X)) = E_Y[\mathrm{KLD}(p(X \mid Y = y_j), p(X))] = \sum_{i=1}^{N^*} \mathrm{MD}(x_i^*) + \sum_{i=2}^{N^*} I(x_i^*; x_{1,i-1}^* \mid Y) - \sum_{i=2}^{N^*} I(x_i^*; x_{1,i-1}^*), \quad (5)$$

where $\mathrm{MD}(x_i^*) = E_Y[\mathrm{KLD}(p(x_i^* \mid Y = y_j), p(x_i^*))]$ and $x_{1,i-1}^* = \{x_1^*, x_2^*, \cdots, x_{i-1}^*\}$. $\mathrm{MD}(x_i^*)$ is called the marginal diversity (MD), and it indicates the variance of the mean density. The analysis in [14] shows that $I(X;Y)$ can be approximated by a summation of MD values if the mutual information between features is not affected by the class labels, i.e., $\sum_{i=2}^{N^*} I(x_i^*; x_{1,i-1}^* \mid Y) = \sum_{i=2}^{N^*} I(x_i^*; x_{1,i-1}^*)$. Then maximum MI becomes MMD. As noted in [14], this condition originates from recent studies of image statistics, which suggest that a rough structure of pattern dependencies between some image features follows general statistical laws that are independent of their class labels; these features are extracted via various biologically plausible image transforms, such as the wavelet transform. Although this condition does not always hold strictly, it at least shows that MMD is near-optimal in the sense of minimum Bayes error. When applying MMD to key-frame extraction, the frames with the largest MD values are extracted as key-frames. Similar to MAIKLD, MMD key-frame extraction is performed after the initial GMM estimation. However, MAIKLD needs to test different combinations of key-frame candidates, while MMD only considers the divergence within each frame and ignores inter-frame dependence. The MD value of key-frame candidate $x_i^*$ is approximately calculated as

$$\mathrm{MD}(x_i^*) \approx \sum_{l=1}^{M} \frac{\alpha_l}{K} \sum_{m=1}^{K} \log \frac{p(o_m \mid \theta_l)}{p(o_m)},$$

where $\alpha_l$ is the weight of the $l$th MGC and $\{o_m \mid m = 1, \cdots, K\}$ are all pixel feature vectors of $x_i^*$; the $N^*$ frames that have the largest MD values are selected as key-frames, as sketched below.
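A minimal sketch of this MD computation, under the same assumptions as the earlier sketch (a fitted scikit-learn GaussianMixture stands in for the estimated model, and `candidate_feats` is a hypothetical list of per-frame pixel feature arrays):

```python
import numpy as np
from scipy.stats import multivariate_normal

def marginal_diversity(gmm, frame_feats):
    """Approximate MD of one candidate frame:
    MD ~ sum_l (alpha_l / K) * sum_m log p(o_m | theta_l) / p(o_m)."""
    log_mix = gmm.score_samples(frame_feats)          # log p(o_m) under the GMM
    log_comp = np.column_stack([                      # log p(o_m | theta_l)
        multivariate_normal(mean=mu, cov=cov).logpdf(frame_feats)
        for mu, cov in zip(gmm.means_, gmm.covariances_)])
    per_comp = (log_comp - log_mix[:, None]).mean(axis=0)
    return float(gmm.weights_ @ per_comp)             # weight by alpha_l

def mmd_keyframes(gmm, candidate_feats, n_star):
    """Indices of the n_star candidate frames with the largest MD values."""
    md = np.array([marginal_diversity(gmm, f) for f in candidate_feats])
    return np.argsort(md)[-n_star:][::-1]
```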
$N^*$ can be predetermined, or adaptively determined given a threshold on the MD value. We use the average MD of all key-frame candidates as the threshold, and any candidate whose MD value is greater than the threshold is selected as a key-frame.

2.4 Proposed Algorithm

An overview of the joint method is presented in Fig. 2. The input key-frame candidates are either all frames in a shot, or key-frames initially selected by the color histogram [5, 12]. (Y, u, v) color features, the x-y spatial location, and time t are used to construct the unified feature space. The input video is modeled by a GMM that is estimated via the EM algorithm and the MDL criterion. After the initial modeling, MAIKLD or MMD is used to guide the key-frame extraction. The extracted key-frames are applied to re-estimate the GMM, which is used to segment all frames via MAP classification.

Figure 2: The flowchart of the proposed algorithm: GMM-based object modeling (EM+MDL) on the input key-frame candidates, divergence computation and key-frame extraction, GMM re-estimation using the key-frames, and segmentation of all frames with the learned models.

Even though the initial GMM estimation, key-frame extraction, and model re-estimation are implemented sequentially and separately, the whole process is well integrated by taking into account their mutual influence in the unified feature space. Compared with the method using all frames [9] or the combined approach using key-frames extracted from the color histogram [12], it is expected that the proposed method not only improves computational efficiency by minimizing feature redundancy, but also enhances the robustness of video modeling by reducing feature irrelevance. As mentioned before, MAIKLD extracts key-frames that maximize the pairwise cluster divergence, and it considers the statistical characteristics of clusters within and across frames by computing AIKLD over a group of key-frame candidates; MD values, in contrast, are estimated in each individual frame under a frame-independence assumption, and MMD selects the frames with the largest MD values as key-frames. Accordingly, the two criteria can extract different key-frames, although both are lower-bounded by the Bayes error. In terms of GMM-based video modeling, MAIKLD could extract more discriminative key-frames than MMD, because maximizing the variance of the mean density does not necessarily maximize the pairwise cluster divergence or reduce the overlap between clusters. Moreover, MMD only considers the cluster divergence within each frame, at the risk of overlooking inter-frame dependency. Nevertheless, MMD is computationally more efficient than MAIKLD because no combinatorial search is needed (a code sketch of the whole pipeline follows the figures below).

Figure 3: Two clusters in a joint spatio-temporal feature space, with two temporal slices at t = a and t = b splitting the space into Regions I, II, and III.

Figure 4: Clusters A and B in the feature space correspond to two objects: (a) running toward/away, (b) enlarging/shrinking, (c) appearing/disappearing. Axes X and Y are the spatial coordinates, and axis t is time.
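To make the flowchart of Fig. 2 concrete, the sketch below strings the pieces together for the MAIKLD path: the sample-based KLD approximation of Section 2.2 evaluated on a candidate subset, a plain greedy forward selection standing in for SFS/SFFS (the floating removal steps are omitted for brevity), and GMM re-estimation followed by MAP segmentation. As before, this is our hypothetical rendering with scikit-learn's GaussianMixture, and `fit_gmm` refers to the earlier sketch.

```python
import numpy as np
from scipy.stats import multivariate_normal

def aikld_sample(gmm, feats):
    """Sample-based AIKLD of Eq. (2) over the pooled pixels of a candidate
    subset. The expectation over p_i is approximated by weighting each pixel
    with its posterior responsibility for component i (a standard Monte Carlo
    device; the paper states its approximation slightly differently)."""
    log_comp = np.column_stack([
        multivariate_normal(mean=mu, cov=cov).logpdf(feats)
        for mu, cov in zip(gmm.means_, gmm.covariances_)])
    resp = gmm.predict_proba(feats)                   # p(i | o_m)
    m = log_comp.shape[1]

    def kld(i, j):                                    # E_{p_i}[log p_i / p_j]
        w = resp[:, i] / max(resp[:, i].sum(), 1e-12)
        return float(w @ (log_comp[:, i] - log_comp[:, j]))

    pairs = [(kld(i, j) + kld(j, i)) / 2.0
             for i in range(m) for j in range(m) if i != j]
    return 2.0 * sum(pairs) / (m * (m - 1))           # the constant C of Eq. (2)

def greedy_maikld(gmm, candidate_feats, n_star):
    """Greedy SFS-style forward selection: repeatedly add the candidate whose
    pixels raise AIKLD the most. Full SFFS would also try floating removals."""
    selected, remaining = [], list(range(len(candidate_feats)))
    while len(selected) < n_star and remaining:
        best = max(remaining, key=lambda i: aikld_sample(
            gmm, np.vstack([candidate_feats[j] for j in selected + [i]])))
        selected.append(best)
        remaining.remove(best)
    return selected

# Usage sketch: select key-frames, re-estimate, then MAP-segment every frame.
# keys = greedy_maikld(gmm0, candidate_feats, n_star=5)
# gmm1 = fit_gmm(np.vstack([candidate_feats[i] for i in keys]))  # earlier sketch
# labels = [gmm1.predict(f) for f in candidate_feats]            # MAP labels
```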
2.5 Key-frame Characteristics

So far, we have addressed the first two issues raised in Section 2, and now we study the third issue, regarding the characteristics of key-frames extracted under the new formulation. Fig. 3 shows two clusters in a feature space defined in space and time, and two temporal slices (frames A and B), which capture the spatial locations of the two clusters at times t = a and t = b, splitting the feature space into three regions. The two clusters are spatially closest where they overlap in the x-y plane, in Region II (the shaded area). If the clusters are associated with two real objects, then the objects are spatially adjacent in the frames of Region II, and far apart in the frames of Regions I and III. By understanding the mechanism of MAIKLD or MMD for key-frame extraction, a link between low-level features and high-level concepts can be established. MAIKLD is equivalent to minimizing the Bayes error, which is caused by cluster overlap in the feature space. To minimize the Bayes error, the cluster divergence should be maximized. Therefore, MAIKLD tends to extract key-frames where the clusters have the least overlap, i.e., in Regions I and III of Fig. 3. When applying MMD, the MD value is calculated in each individual frame. Any frame whose mean density has a sufficiently large variance, or in other words, whose clusters are widely dispersed in the x-y plane, will be extracted as a key-frame. Hence the location of such frames mainly depends on the cluster dispersion in space. Below we further discuss three specific cases of object behavior.

• Running away/toward: When multiple objects are moving away from each other, the average cluster divergence increases as long as their colors and sizes do not change significantly. MAIKLD tends to extract key-frames where the objects are spatially far apart. Also, the cluster dispersion within a frame, as measured by MD, is large when the objects are far apart. Therefore, MMD also extracts key-frames where the objects are well separated in space (a numerical illustration follows this list). Fig. 4 (a) and (b) show the cluster distributions when the objects are spatially close and far apart, respectively.

• Enlarging/shrinking: When an object's size is increasing, AIKLD usually decreases, while the variation of the mean density increases, because the cluster size of that object also increases in the feature space. Therefore, MAIKLD tends to extract key-frames where the object is relatively small, and MMD selects those where the object is relatively large, as shown in Fig. 4 (c) and (d), where the clusters are defined in the x-y plane.

• Appearing/disappearing: Fig. 4 (e) and (f), where the clusters are defined in the x-t plane of Fig. 3, illustrate the cases of an object appearing and disappearing. When an object appears in a scene, AIKLD usually decreases due to the appearance of new clusters, while MD increases. Consequently, MAIKLD tends to extract key-frames where the object has disappeared (or has not yet appeared), while MMD extracts key-frames where a new object appears (or before an object disappears).

In addition to the above cases, other behaviors/events may be implied by the extracted key-frames. For instance, if motion features are used, the objects' motion patterns could be indicated by key-frames. If shape features are involved, the appearance of specific objects can also be implied.
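The running-away case above is easy to check numerically. The toy sketch below is entirely our illustration: two "objects" are unit-variance 2-D Gaussian clusters whose means drift apart over time, and the symmetric KLD between them (computed with gauss_kld from the earlier sketch, which is exact for Gaussians) grows with the separation, so both divergence criteria favor frames where the objects are far apart.

```python
import numpy as np

# gauss_kld as defined in the earlier sketch (closed-form Gaussian KLD)
def gauss_kld(mu0, cov0, mu1, cov1):
    d, inv1, diff = len(mu0), np.linalg.inv(cov1), mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

cov = np.eye(2)                          # two unit-variance "object" blobs
for t in range(5):                       # object B moves away from object A
    mu_a = np.zeros(2)
    mu_b = np.array([1.0 + 2.0 * t, 0.0])
    sym = 0.5 * (gauss_kld(mu_a, cov, mu_b, cov) +
                 gauss_kld(mu_b, cov, mu_a, cov))
    print(f"t={t}: separation={mu_b[0]:.0f}, symmetric KLD={sym:.1f}")
# The divergence grows quadratically with the separation (0.5, 4.5, 12.5,
# 24.5, 40.5), matching the claim that later frames, where the objects are
# spatially far apart, score highest under the divergence criteria.
```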
According to the above analysis, it is possible that MAIKLD and MMD extract similar or even the same key-frames if the cluster distribution in each frame does not change over t, i.e., if the projections of the spatio-temporal clusters onto any temporal slices (frames) are similar. In this situation, MMD is equivalent to MAIKLD, because the frames with the largest MD values also form the spatio-temporal clusters with the maximum AIKLD. It is worth mentioning that the above discussion rests on a fundamental assumption: the GMM-based spatio-temporal grouping can provide reasonable clusters in space and time that correspond to objects and their spatio-temporal behaviors. With this assumption, the proposed method can produce interesting key-frame extraction results. In particular, when the color distribution does not change significantly over time but the object is still moving (object motion does not necessarily affect the color histogram), a key-frame extraction method using the color histogram cannot provide representative key-frames for object segmentation. As we mentioned earlier, key-frame definition is quite subjective. For instance, we may want frames where objects are spatially close in order to study the objects' interactions, or frames where objects appear large in order to study their details, etc. The proposed method can still provide such key-frames of interest, which may not be the same ones used for object segmentation. Basically, during the process, all frames are evaluated in terms of their contribution to object segmentation, so we have the flexibility to choose the relevant frames that best summarize the video content based on our preferences. Moreover, in terms of video content organization [23], which groups video shots of similar content together by content-based matching or via key-frame similarities, the extracted key-frames can indicate salient points of video content related to object behaviors/events and support more semantically effective video grouping.
