
JOURNAL OF LATEX CLASS FILES, VOL. 11, NO. 4, DECEMBER 2012

Efficient Background Modeling Based on Sparse Representation and Outlier Iterative Removal

Linhao Li, Ping Wang, Qinghua Hu, Senior Member, IEEE, Sijia Cai

This work was partially supported by the National Natural Science Foundation of China (No. 51275348, No. 61379014, No. 6122210, and No. 61432011) and the New Century Excellent Talents in University program (Grant No. NCET-12-0399). L. Li is with the School of Computer Science and Technology, Tianjin University, Tianjin 300072, P.R. China (e-mail: [email protected]). P. Wang is with the School of Science and the Center for Applied Mathematics of Tianjin University, Tianjin University, Tianjin 300072, P.R. China (e-mail: [email protected]). Q. Hu is with the School of Computer Science and Technology and the Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin 300072, P.R. China (e-mail: [email protected]). S. Cai is with the School of Science, Tianjin University, Tianjin 300072, P.R. China (e-mail: [email protected]).

Abstract—Background modeling is a critical component of various vision-based applications. Most traditional methods tend to be inefficient when solving large-scale problems. In this paper, we introduce sparse representation into the task of large-scale stable-background modeling and reduce the video size by exploring its "discriminative" frames. A cyclic iteration process is then proposed to extract the background from the discriminative frame set. The two parts combine to form our Sparse Outlier Iterative Removal (SOIR) algorithm. The algorithm operates in a tensor space to obey the natural data structure of videos. Experimental results show that a few discriminative frames determine the performance of the background extraction. Further, SOIR achieves high accuracy and high speed simultaneously when dealing with real video sequences. Thus, SOIR has an advantage in solving large-scale tasks.

Index Terms—Background Modeling; Sparse Representation; Tensor Analysis; Principal Component Pursuit; Alternating Direction Multipliers Method; Markov Random Field.

I. INTRODUCTION

The background modeling of a video sequence is a key part of many vision-based applications such as real-time tracking [1], [2], information retrieval, and video surveillance [3], [4]. In a video sequence, some scenes remain nearly constant, even though they may be polluted by noise [5]. This invariable aspect is the background. A model for extracting the background is an important tool that can help us handle a video sequence, especially one taken in a public area [6]. Background modeling is also an essential step in many foreground detection tasks [7], [8], [9]. Once the background is extracted, we can detect or even track the foreground by comparing a new frame with the learned background model [4].

There are two challenges for a background modeling algorithm. First, although we consider the background to be stationary, it is often disturbed by factors such as fluttering flags, waving leaves, or rippling water [10]. In addition, other issues such as signal noise, sudden lighting variations, and shadows [11], [12] may prevent us from distinguishing the background in a video sequence. Second, the data in practical problems keeps growing with the development of new technologies and improvements in equipment, so there is an increasing demand for efficient background modeling techniques; fast processing of massive video sequences is required for practical tasks such as crime detection and recognition. As a result, developing an efficient and robust algorithm for practical background modeling has become an urgent task.
A large number of background modeling methods have been reported in the literature over the past decades. Most researchers have regarded a series of pixel values as features and set up pixel-wise models. Initially, each pixel-value series was modeled with a Gaussian distribution, e.g., the Single Gaussian (SG) model developed in 1997 [13] and the Mixture of Gaussians (MOG) model developed in 1999 [14]. Improved Gaussian-based algorithms [15], [16], [17] also achieved a high level of performance in the years following the release of these models. In addition, clustering methods have been used to model a background, e.g., codebook [18], [19] and time-series clustering [20]. Furthermore, a non-parametric method was proposed in 2000 [21] and improved in 2012 [22], and has shown competitive performance. The Visual Background Extractor (ViBe) was proposed in 2012 and later improved, and performs better than most popular pixel-wise techniques [8], [23]. These methods solve the background problem by building a model for each pixel and initializing the models during a training process. High accuracy is obtained if sufficient training data are provided, but more training data means additional training time.

Another type of modeling technique sets up the model at the region level. Some works have focused on local regions, and different local features have been proposed [11], [24], [25], [26]. Global-region based algorithms have also been proposed. In 2000, Oliver et al. [27] first modeled a background using Principal Component Analysis (PCA), i.e., they modeled the background by projecting high-dimensional data into a lower-dimensional subspace. Robust PCA, developed in 2010 [28], and Principal Component Pursuit (PCP), developed in 2011 [29], have shown their superiority over the original PCA. Based on these models, heuristic background methods have also been introduced [30], [31], [33]. These PCA-based models omit the training process and use the data to extract the background directly. However, Singular Value Decomposition (SVD) is an unavoidable, time-consuming step in a PCA-based model, and these models are thus limited in large-scale tasks because both their speed and their memory requirements are sensitive to the scale of the data.

Sparse representation and dictionary learning is another important region-based approach, widely employed in computer vision tasks such as face recognition [32], [34], [35], classification [34], [36], and denoising [37]. Some researchers have introduced sparse representation into background modeling [38], [39]. They modeled the background with a dictionary and regarded the foreground as noise.
Moreover, they assumed independence among different pixels. These assumptions fail in many practical tasks, where the foreground region is usually not sparse and some pixels are highly correlated.

In this paper, we first use sparse representation to reduce the video size by exploring the "discriminative" frames of the video, instead of modeling the background directly. No independence assumption is needed in this process. We then extract the background from these discriminative frames using a PCP-based cyclic iteration. The two steps combine to form our algorithm, the Sparse Outlier Iterative Removal (SOIR) algorithm. The framework of the algorithm is shown in Fig. 1.

[Fig. 1. The framework of the Sparse Outlier Iterative Removal (SOIR) algorithm.]

SOIR meets the demand of global-region based background models for solving large-scale problems. We rebuild the PCP model based on a rank-1 hypothesis, and our algorithm operates in a tensor space in order to obey the natural data structure of videos. Once the background is extracted, we detect foreground objects using a Markov Random Field (MRF). Experimental results show that our algorithm achieves high accuracy and high speed simultaneously when dealing with real-life video sequences.

The main contributions of this work are summarized below:

• We utilize sparse representation to reduce the size of the video by exploring its discriminative frames. Instead of using all frames to model the background, we use only the discriminative frames. In this way, our model can meet the demands of many practical background modeling problems in terms of speed and memory.

• The cyclic iteration process is composed of a tensor-wise model and a pixel-wise strategy. In general, a tensor-wise process considers the overall information, whereas a pixel-wise process pays more attention to particular information. Our algorithm achieves high accuracy by taking full advantage of both processes.

• The tensor-wise model in the cyclic iteration is a PCP model and is robust to general noise [29]. Differing from previous works, the vectorized static background in our algorithm is explicitly rank-1, instead of just being low-rank. To enforce this constraint, we propose a new space R(4) in which the background actually lies. Owing to the rank-1 hypothesis, SVD becomes unnecessary.

The remainder of this paper is organized as follows. Section II introduces some preliminary work. Section III provides the formulation and convergence of the SOIR algorithm. Section IV presents the foreground detection method. Section V describes the experimental results. Finally, Section VI provides some concluding remarks.

II. PRELIMINARY WORK

The Principal Component Pursuit (PCP) model and tensor theory play key roles in our algorithm. Here, we introduce the basics of both.

A. Principal component pursuit

Low-rank matrix recovery is the key problem in many practical tasks, including background modeling. A given data matrix M is the superposition of a low-rank matrix L and a sparse matrix S, i.e., M = L + S. Principal Component Analysis (PCA) is an effective way to solve this problem, but the brittleness of the original PCA model with respect to grossly corrupted observations jeopardizes its validity [29].

Candès et al. proved that one can recover the matrix L and the sparse matrix S precisely under mild conditions [29]. This model, known as Principal Component Pursuit (PCP), can be formulated as

    min_{L,S} ||L||_* + λ||S||_1,  s.t.  L + S = M,        (1)

where λ is a regularization parameter, and ||·||_* and ||·||_1 denote the nuclear norm (the sum of singular values) and the l1-norm (the sum of the absolute values of the matrix elements), respectively.

Model (1) has been modified to solve the background modeling problem [31], [33]. All of the original works modeled the background with a low-rank matrix. In contrast, we consider the background to be a rank-1 matrix.
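To make the PCP formulation concrete, the following is a minimal sketch of the standard inexact augmented Lagrangian iteration for Model (1), alternating singular value thresholding of L with entrywise soft thresholding of S. It is a generic illustration rather than the authors' implementation; the default λ, the step size mu, and the tolerance are assumed values taken from common practice.

```python
import numpy as np

def shrink(X, tau):
    # Entrywise soft-thresholding: proximal operator of tau * ||.||_1.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def pcp(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Recover (L, S) with M = L + S via an inexact augmented Lagrangian."""
    M = M.astype(float)
    m, n = M.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))      # default weight from [29]
    mu = mu or 0.25 * m * n / np.abs(M).sum()  # common heuristic step size
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(max_iter):
        # Singular value thresholding: proximal step for the nuclear norm.
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(shrink(sig, 1.0 / mu)) @ Vt
        # Entrywise shrinkage updates the sparse part.
        S = shrink(M - L + Y / mu, lam / mu)
        resid = M - L - S
        Y += mu * resid                        # dual ascent on the multiplier
        if np.linalg.norm(resid) <= tol * np.linalg.norm(M):
            break
    return L, S
```

Note the SVD inside the loop: this is exactly the per-iteration cost that Section III avoids by replacing the low-rank constraint with the rank-1 hypothesis.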
B. Tensor theory

A tensor is a multidimensional array. More formally, an N-way or N-order tensor is an element of the tensor product of N vector spaces, each of which has its own coordinate system [40], [41]. Intuitively, a vector is a 1-order tensor, while a matrix is a 2-order tensor. In this paper, unless otherwise specified, we denote a vector by a lowercase letter, e.g., r, and a matrix by an uppercase letter, e.g., R. A higher-order tensor is denoted in boldface type, e.g., R. The space of all tensors is denoted by a swash font, e.g., K. We denote the space of all N-order tensors by R_N, R_N = R^{K1×···×KN}, N = 1, 2, 3, 4, ···.

A tensor can be multiplied by a matrix, an operation known as the n-mode (matrix) product [41]. The i-mode product of a tensor X ∈ R^{K1×···×KN} with a matrix U ∈ R^{J×Ki} is denoted by X ×_i U and is of size K1×···×K_{i−1}×J×K_{i+1}×···×KN. Element-wise, we have

    (X ×_i U)_{k1···k_{i−1} j k_{i+1}···kN} = Σ_{ki=1}^{Ki} X_{k1···kN} U_{j ki},        (2)

where ki ∈ [1, ···, Ki] (i = 1, ···, N) and j ∈ [1, ···, J].
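As an illustration of Eq. (2), the i-mode product can be computed by contracting the i-th axis of the tensor with the columns of the matrix and moving the resulting axis back into place. A small sketch follows; it assumes 0-based axis indexing, so the paper's ×_4 corresponds to mode=3.

```python
import numpy as np

def mode_product(X, U, mode):
    """i-mode product of Eq. (2): contracts axis `mode` of tensor X
    (0-based) with the second axis of matrix U (shape J x K_mode)."""
    # tensordot sums over X's `mode` axis and U's axis 1, appending
    # the remaining axis of U (length J) at the end ...
    Y = np.tensordot(X, U, axes=([mode], [1]))
    # ... so move the new axis back to position `mode`.
    return np.moveaxis(Y, -1, mode)

# Example: a 4-order video tensor D (m x n x 3 x N) recombined along
# its frame mode by a coefficient matrix C (N x N), as in Model (5).
D = np.random.rand(4, 5, 3, 10)
C = np.random.rand(10, 10)
print(mode_product(D, C, mode=3).shape)   # (4, 5, 3, 10)
```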
III. SPARSE OUTLIER ITERATIVE REMOVAL ALGORITHM

In this section, we focus on modeling the background of a video sequence. We use D to denote a video and assume that there are N frames in D. Each color frame is by nature a 3-order tensor, and the j-th frame is denoted by I_j ∈ R^{m×n×3}. Then D = [I_1, ···, I_N] ∈ R^{m×n×3×N}. In addition, we use B = [B_{I_1}, ..., B_{I_N}] and A = [A_{I_1}, ..., A_{I_N}] to denote the background and foreground of the video D.

We first analyze the components of a video. In a video, the background is covered by the foreground objects. We denote the foreground region by Ω and the outside region by Ω̄. Let P_Ω be the orthogonal projector onto the span of tensors vanishing outside of Ω; P_Ω̄ is defined analogously on Ω̄. Then, for an arbitrary frame I_j, the (x,y,z)-th component of P_Ω(I_j) is equal to (I_j)_{xyz} if (x,y,z) ∈ Ω, and is zero otherwise. Thus, the video can be expressed as

    D = P_Ω̄(B) + P_Ω(A),        (3)

where P_Ω̄(B) = [P_Ω̄(B_{I_1}), ..., P_Ω̄(B_{I_N})] and P_Ω(A) = [P_Ω(A_{I_1}), ..., P_Ω(A_{I_N})]. Actually, P_Ω(A_{I_i}) = A_{I_i} because Ω is simply the foreground region. Noise is also a component, i.e.,

    D = P_Ω̄(B) + A + E,        (4)

where E is the noise. Equation (4) gives the actual components of a video and serves as a strict constraint in our model.
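The projector P_Ω of Eq. (3) is simply a mask that keeps entries whose indices lie in the foreground support and zeroes the rest. A minimal sketch, with a boolean array standing in for the region Ω:

```python
import numpy as np

def project(X, omega):
    """P_Omega of Eq. (3): keep entries whose indices lie in the
    region omega (a boolean array of X's shape), zero elsewhere."""
    return np.where(omega, X, 0.0)

# A frame decomposes as in Eq. (4): foreground plus noise inside
# omega, background outside (~omega plays the role of omega-bar).
frame = np.random.rand(4, 5, 3)
omega = np.zeros((4, 5, 3), dtype=bool)
omega[1:3, 2:4, :] = True                  # toy foreground support
assert np.allclose(project(frame, omega) + project(frame, ~omega), frame)
```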
A. Discriminative exploration using sparse representation

In most large-scale background modeling problems, the frames are highly redundant. Some frames already carry sufficient background information and are more discriminative than the others. In this section, we refine the frame sequence D and obtain a new informative set D̃, composed of the selected discriminative frames.

We use sparse representation to explore the discriminative frames by solving for a maximum linearly independent group of video frames. The sparse representation process, which is robust to noise [34], is based on the video content. Once a frame can be represented by other frames, its content is no longer discriminative. In a real-life video, the discriminative frames are those whose foreground objects differ in both position and appearance. Thus, we gain different background information from different discriminative frames. Once a sufficient amount of information is obtained, we can model the background.

We now introduce our sparse representation model. Foreground objects move continually in a video sequence, and two adjacent frames are usually approximately the same. Some frames can be represented through a linear combination of the remaining frames, and the other frames are largely repeated. In other words, a subset of frames can represent all frames:

    min_C ||D − D ×_4 C||_F^2 + λ||C||_{1,2},        (5)

where ||·||_F is the Frobenius norm, which equals the square root of the sum of squares of the entries of the tensor, C ∈ R^{N×N} is a coefficient matrix, and λ balances the two terms. Here, ||·||_{1,2} is the l_{1,2}-norm, the sum of the l2-norms of all rows of C [42]. We solve this model by converting it into an equivalent problem:

    min_{W,C} ||D − D ×_4 W||_F^2 + λ||C||_{1,2},  s.t.  W = C.        (6)

This is the standard augmented Lagrange formulation and can be solved using the Alternating Direction Multipliers (ADM) method [43], [44].

The j-th row of C records the coefficients with which the j-th frame represents the other frames, and the j-th column of C records the coefficients with which the other frames represent the j-th frame. We can therefore deduce the role of each frame by observing the corresponding row of C. Frames whose coefficients are all zero are regarded as redundant, and the non-zero rows of C correspond to discriminative frames. A new set D̃ is formed to contain all discriminative frames.

[Fig. 2. The sparse representation process of a discriminative exploration. The original frame set is composed of 50 frames (airport1000-airport1049 in the "hall" sequence from the I2R dataset), where the grid group indicates the coefficient matrix C.]

Fig. 2 illustrates our sparse representation process. A video sequence equals a sparse linear combination of itself plus errors. The colored rows in the grid group indicate the non-zero rows of C. The frames corresponding to the nonzero rows of C are selected, i.e., the ten frames in Fig. 2, which contain almost all the background information of the 50 frames.

For most practical problems, the frames selected by Model (5) are suitable for background modeling. However, abnormal behaviors and serious noise pollution in the video may lead to more complicated relations among the frames. For example, in Fig. 2, if the woman in yellow jumps from left to right, more frames will be considered discriminative, although ten frames are sufficient for a background model. In this case, updating D̃ is essential. We do this based on the coefficient matrix C, from which we can measure the similarity between two frames. If frame I_a resembles frame I_b, the coefficient of frame I_a in representing frame I_b is close to 1; otherwise, it is far from 1. First, we find the frame that most resembles all other frames in D̃; this is the first re-selected frame. Next, we choose the frame that is least similar to the first re-selected frame. Then, each time we choose a new frame, it is the frame that all of the previously selected frames resemble the least. Eventually, we obtain a new frame set whose members have low mutual similarity. This set is the updated D̃. The appropriate size of D̃ will be explored experimentally.

After the sparse representation of the video, we thus refine the original video D and form a new selected discriminative frame set D̃. Assume that there are Ñ frames in D̃: D̃ ∈ R^{m×n×3×Ñ}, where Ñ ≪ N.
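The following is a compact sketch of how Model (6) can be solved with ADM on the mode-4 unfolding of the video (one column per frame), followed by selecting the nonzero rows of C. The step size mu, the iteration count, the thresholds, and the function name select_frames are all assumptions for illustration; no claim is made that this matches the authors' exact update order.

```python
import numpy as np

def select_frames(X, lam=1.0, mu=1.0, n_iter=100):
    """ADM sketch for Model (6) on the mode-4 unfolding X (p x N):
    min ||X - X W^T||_F^2 + lam * ||C||_{1,2},  s.t. W = C.
    Returns C and the indices of its nonzero rows (the set D-tilde)."""
    N = X.shape[1]
    G = 2.0 * (X.T @ X)                      # Gram matrix, reused each sweep
    C = np.zeros((N, N)); W = np.zeros((N, N)); Y = np.zeros((N, N))
    for _ in range(n_iter):
        # W-step: ridge-type least squares including the coupling term.
        Z = (C - Y / mu).T
        W = np.linalg.solve(G + mu * np.eye(N), G + mu * Z).T
        # C-step: row-wise shrinkage, the proximal map of the l_{1,2}-norm.
        V = W + Y / mu
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        C = np.maximum(1.0 - (lam / mu) / np.maximum(norms, 1e-12), 0.0) * V
        # Dual update on the multiplier enforcing W = C.
        Y += mu * (W - C)
    keep = np.where(np.linalg.norm(C, axis=1) > 1e-6)[0]
    return C, keep
```

The row-wise shrinkage is what drives whole rows of C to zero, so `keep` directly indexes the discriminative frames described above.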
B. Background extraction using a cyclic iteration process

In this section, we describe the design of the background extraction using a cyclic iteration (the outer loop of SOIR). The process is shown in Fig. 3. In each iteration, a PCP model solves for the purified-mean frame of the selected discriminative frame set D̃, and a pixel-wise outlier removal strategy then uses the purified-mean frame to ameliorate D̃ in turn. The iteration continues until the purified-mean frames converge to a fixed frame, which is the background.

[Fig. 3. The cyclic iteration process within one iteration.]

1) Tensor-wise PCP model: The tensor model is used to calculate the purified mean of the selected discriminative frames. It is initially the mean of the frames, but it moves slightly away from the mean during the denoising process. Following the idea of ADM, we solve this tensor model through an iterative approximation (the inner loop of SOIR). The solution is restricted to the space R(4).

The purified-mean frame, denoted by B̃∗ ∈ R^{m×n×3}, is the optimal background of the current frame set D̃. The real backgrounds of different frames are the same, i.e., B̃_{I_i} = B̃_{I_j} = B̃∗, ∀i ≠ j, or equivalently

    B̃ = B̃∗ ×_4 Z,        (7)

where Z = (1, 1, ..., 1)^T ∈ R^{Ñ×1} is a 1-order matrix and B̃∗ is regarded as a 4-order tensor, i.e., B̃∗ ∈ R^{m×n×3×1}.

Constraint (7) in fact insists that the background is rank-1. As in the vectorizing process of previous works [31], [35], [39], we transform the frames into gray scale and combine mode-1 and mode-2 of each frame into a single mode; the operator vectorize(·) thus reduces the dimension of the high-dimensional data. After vectorization, a video tensor becomes a matrix and a frame tensor becomes a vector. The vectorized formulation of Equation (7) is then

    vectorize(B̃) = vectorize(B̃∗) ×_2 Z,        (8)

where vectorize(B̃) ∈ R^{p×Ñ} is a matrix, vectorize(B̃∗) ∈ R^p is a vector, and the 2-mode product (×_2) is the outer product of the vectors. Thus, the equation reduces to the standard definition of rank-1.

To handle Constraint (7), we consider a subspace of R_4, which we denote by R(4). All tensors in R(4) are 4-order tensors, and each tensor X in this space satisfies, element-wise, X_{ijka} = X_{ijkb}, ∀a ≠ b. Thus, B̃ must lie in this space. R(4) is convex, and it is therefore easy to solve (7).

LEMMA 1. Given a tensor B̃, the solution to the problem

    min_{B̃∗} ||B̃ − B̃∗ ×_4 Z||_F^2        (9)

is B̃∗ = (Σ_{l=1}^{Ñ} B̃_{I_l}) / Ñ.

Proof: ||B̃ − B̃∗ ×_4 Z||_F^2 = Σ_{l=1}^{Ñ} ||B̃_{I_l} − B̃∗||_F^2 = Ñ ||B̃∗ − (Σ_{l=1}^{Ñ} B̃_{I_l})/Ñ||_F^2 + const, where const and Ñ are constants. This completes the proof.

To model the background, we want the static video content. We therefore minimize the changing part so as to group more information into the background. Taking the strict Constraints (4) and (7) into account, our model is

    min_{B̃,Ã,Ẽ} ||Ã + Ẽ − P_Ω(B̃)||_1
    s.t.  D̃ = P_Ω̄(B̃) + Ã + Ẽ,        (10)
          B̃ = B̃∗ ×_4 Z.

In the foreground region, we minimize the number of nonzero elements of Ã + Ẽ − B̃; such pixels can be considered background if Ã + Ẽ − B̃ = 0. Outside the foreground region, we minimize the noise Ẽ. Benefiting from the definition of the l1-norm, we arrange the two regions into a single formula, i.e., the objective function of Model (10).

Model (10) is a PCP model. To solve it, we first rearrange it. The constraint B̃ = B̃∗ ×_4 Z cannot be transformed into a single-variable linear equation, so we use it as a correction term. In addition, we denote all non-background parts by S̃: S̃ = Ã + Ẽ − P_Ω(B̃). We then obtain the following:

    min_{B̃,S̃} ||S̃||_1,  s.t.  D̃ = B̃ + S̃.        (11)

Model (11) can be solved by the iteration [43] (the inner loop):

    S̃^{k+1} = T_{1/μ}(D̃ + Λ^k/μ − B̃^k),
    B̃^{k+1} = D̃ − S̃^{k+1},        (12)
    Λ^{k+1} = Λ^k − μ(D̃ − B̃^{k+1} − S̃^{k+1}),

where μ > 0 is a step-length parameter and Λ is the Lagrange multiplier. In addition, T_{1/μ}(·) is a soft-threshold operator [43], [44]. For an arbitrary tensor X ∈ R^{K1×···×KN}, element-wise, the operator is given by

    T_{1/μ}(X)_{k1···kN} = max(X_{k1···kN} − 1/μ, 0) + min(X_{k1···kN} + 1/μ, 0).        (13)

Here, we consider the correction term. The tensor B̃ must lie in the R(4) space. Once we obtain a new B̃ in (12), we project it into the R(4) space and use the vertical projection to replace it. Thus, the update formula for B̃^{k+1} in (12) is replaced by

    B̃^{k+1} = B̃∗^{k+1} ×_4 Z,  where  B̃∗^{k+1} = (Σ_{l=1}^{Ñ} (D̃_{I_l} − S̃^{k+1}_{I_l})) / Ñ,        (14)

i.e., the projection of D̃ − S̃^{k+1} onto R(4) given by Lemma 1. The result of the inner loop is the optimal background B̃∗ of the current frame set D̃.
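The inner loop of (12)-(14) is short enough to sketch directly: soft-threshold the residual to obtain S̃, average D̃ − S̃ across frames to obtain the rank-1 background (Lemma 1), and update the multiplier. The sketch below works on a 4-order numpy array; the names soft_threshold and inner_loop, the value of mu, and the stopping rule are assumptions for illustration.

```python
import numpy as np

def soft_threshold(X, tau):
    # Eq. (13): elementwise shrinkage toward zero by tau.
    return np.maximum(X - tau, 0.0) + np.minimum(X + tau, 0.0)

def inner_loop(D, mu=0.5, tol=1e-3, max_iter=200):
    """Solve Model (11) with the R(4) correction of Eq. (14).
    D is the selected frame set, shape (m, n, 3, N_tilde)."""
    # Start from the frame mean, replicated across the frame mode.
    B = np.repeat(D.mean(axis=3, keepdims=True), D.shape[3], axis=3)
    Lam = np.zeros_like(D)
    B_star = D.mean(axis=3)
    for _ in range(max_iter):
        S = soft_threshold(D + Lam / mu - B, 1.0 / mu)       # Eq. (12), S-step
        B_star = (D - S).mean(axis=3)                        # Lemma 1 / Eq. (14)
        B_new = np.repeat(B_star[..., None], D.shape[3], axis=3)
        Lam = Lam - mu * (D - B_new - S)                     # multiplier update
        if np.linalg.norm(B_new - B) <= tol * max(np.linalg.norm(B), 1e-12):
            return B_star
        B = B_new
    return B_star
```

Note that no SVD appears anywhere: the rank-1 hypothesis replaces singular value thresholding with a simple mean over the frame mode.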
2) Pixel-wise strategy: Once we obtain the purified-mean frame B̃∗ of the current frame set D̃, we use B̃∗ to renew the current frame set D̃ in turn. This pixel-wise method is called the outlier removal strategy, and we repeat it for every pixel in each frame. As an example, consider the pixel (x,y,z) (x = 1,···,m; y = 1,···,n; z = 1,2,3).

In Fig. 3, the pixel (x,y,z) is labeled in all frames with solid color points, different colors indicating different pixel values. When we place these values on an axis, we find that most of them gather into a cluster while the others do not. The outliers are the pixel values of the non-background pixels. The purified-mean value and the worst outlier are also marked in the figure; the worst outlier is the value farthest away from the purified-mean value.

The ground truth of the background lies inside the cluster, and the purified-mean value is much closer to the ground truth than the worst outlier. Thus, the extraction performance improves if we use the purified-mean value to replace the worst outlier. As shown in Fig. 3, once we extract a purified-mean frame B̃∗, we replace the worst outlier with the purified-mean value for each pixel. Here, replacing the worst outlier means deleting the worst value and writing the purified-mean value back into that frame. The frame set D̃ is renewed once the replacement has been conducted for all pixels.

C. Algorithm formulation

The methods of Sections III-A and III-B combine to form our Sparse Outlier Iterative Removal (SOIR) algorithm. In this algorithm, the sparse representation model is an important component: it returns the selected discriminative frames, which carry sufficient background information. The cyclic iteration process is the main component, which extracts the background from the video. The convergence condition of the algorithm is ||B̃^{k+1} − B̃^k|| / ||B̃^k|| ≤ 1e-3.

Algorithm 1: SOIR Algorithm.
Input: D ∈ R^{m×n×3×N}. Output: B∗.
1: Sparse representation:
   get C by solving min_C ||D − D ×_4 C||_F^2 + λ||C||_{1,2};
   get D̃ from D.
2: Cyclic iteration:
   while not converged do (outer loop):
     (1): update B̃∗ by:
       while not converged do (inner loop):
         S̃^{k+1} = T_{1/μ}(D̃ + Λ^k/μ − B̃^k);
         B̃^{k+1} = B̃∗^{k+1} ×_4 Z, where B̃∗^{k+1} = (Σ_{l=1}^{Ñ} (D̃_{I_l} − S̃^{k+1}_{I_l})) / Ñ;
         Λ^{k+1} = Λ^k − μ(D̃ − B̃^{k+1} − S̃^{k+1}).
       end while
     (2): update D̃ (D̃ = [Ĩ_1, ···, Ĩ_Ñ]) by: for each pixel (x,y,z):
         N∗ = argmax_i |(Ĩ_i)_{xyz} − B̃∗_{xyz}|, then (Ĩ_{N∗})_{xyz} = B̃∗_{xyz}.
       end
   end while.
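Step (2) of Algorithm 1 and the outer loop wrap around the inner loop sketched above. A sketch of that wrapper, reusing the hypothetical inner_loop from the previous snippet, with an assumed iteration cap and tolerance:

```python
import numpy as np

def soir_outer(D_sel, tol=1e-3, max_outer=50):
    """Cyclic iteration of Algorithm 1 on the selected frame set
    D_sel (m, n, 3, N_tilde). Returns the background frame B*."""
    D = D_sel.astype(float).copy()
    B_prev = None
    for _ in range(max_outer):
        B_star = inner_loop(D)                     # tensor-wise PCP step
        # Outlier removal strategy: per pixel, overwrite the frame
        # whose value lies farthest from the purified mean.
        dev = np.abs(D - B_star[..., None])        # |(I_i)_xyz - B*_xyz|
        worst = dev.argmax(axis=3)                 # worst frame index per pixel
        m, n, c = worst.shape
        ii, jj, kk = np.ogrid[:m, :n, :c]
        D[ii, jj, kk, worst] = B_star
        # Stop when the purified-mean frames converge (Section III-C).
        if B_prev is not None and \
           np.linalg.norm(B_star - B_prev) <= tol * np.linalg.norm(B_prev):
            break
        B_prev = B_star
    return B_star
```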
D. Convergence analysis

We now describe the convergence of SOIR. The convergence of the sparse representation model has been well studied [45], so we focus on the convergence of the cyclic iteration process.

For an arbitrary pixel (x,y,z), there are Ñ pixel values in the selected set D̃. In the i-th outer-loop iteration, the minimum and maximum of the Ñ values are recorded as a_i and b_i. B̃∗^i_{xyz} is the purified mean of the Ñ values of the previous iteration, and thus B̃∗^i_{xyz} ∈ [a_{i−1}, b_{i−1}]. If lim_{i→∞}(b_i − a_i) = 0, the purified mean B̃∗^i_{xyz} converges; this follows from the Nested Intervals Theorem [46]. To complete the convergence argument, we prove the following lemma.

LEMMA 2. In the SOIR algorithm, for an arbitrary pixel (x,y,z), let a_i and b_i be the minimum and maximum of the Ñ pixel values in the i-th iteration. Then lim_{i→∞}(b_i − a_i) = 0.

Proof: First, the purified-mean value B̃∗^{i+1}_{xyz} lies inside a subinterval of [a_i, b_i], because this value is close to the mean of all Ñ values of the previous iteration. Otherwise, if the mean of the Ñ values were close to either their minimum or maximum, we could conclude that the Ñ values are already close to each other [46].

Second, after Ñ iterations, we record the minimum and maximum of all Ñ purified-mean values (B̃∗^i_{xyz}, i = 1, 2, ..., Ñ) as c and d. Then we have [a_{i+Ñ}, b_{i+Ñ}] ⊆ [c, d], because the worst outlier is replaced by the purified-mean value in each iteration, and all Ñ purified-mean values lie in the interval [c, d]. The subinterval [c, d] is shorter than the interval [a_i, b_i]; in other words, there is a constant ratio γ < 1 such that (b_{i+Ñ} − a_{i+Ñ}) ≤ γ(b_i − a_i).

Finally, assume that the original minimum and maximum of the Ñ values are a and b, respectively. Then (b_{1+Ñ} − a_{1+Ñ}) ≤ γ(b − a), (b_{1+2Ñ} − a_{1+2Ñ}) ≤ γ²(b − a), ···, (b_{1+nÑ} − a_{1+nÑ}) ≤ γⁿ(b − a), ···. Since (b − a) is a constant and γⁿ approaches zero as n grows, we have lim_{i→∞}(b_i − a_i) = 0, which completes the proof.

We must point out that the derived solution may not be the ground truth; it is influenced by the properties of the video. The experiments in Section V show that the solution is quite close to the ground truth as long as the video quality is not too poor.

IV. FOREGROUND REGION DETECTION

Having computed the background tensor as described in Section III, we next detect the foreground region of the video. Background subtraction is a common method of detecting the foreground region. We denote the result of the subtraction by F_{I_k}: F_{I_k} = I_k − B_{I_k}, where I_k = P_Ω̄(B_{I_k}) + A_{I_k} + E_{I_k} and B_{I_k} = B∗. The residual background only exists in the foreground region, and outside this region nothing but noise exists, i.e.,

    F_{I_k} = A_{I_k} + E_{I_k} − B_{I_k}   inside the region Ω;
    F_{I_k} = E_{I_k}                       outside the region Ω.        (15)

From Expression (15), we can conclude that the effectiveness of background subtraction depends on the properties of A_{I_k} − B_{I_k} and E_{I_k}, and on the relationship between them. If the distribution of A_{I_k} − B_{I_k} differs from that of E_{I_k}, background subtraction is an effective way to detect the foreground.

We next explore the foreground region Ω for an arbitrary image I_k from the original frame set D. To simplify the problem, we transform the color frame into a gray one (I_k). We model the region using a Markov Random Field, following previous works [33], [47], [48].

First, we set up a matrix O to represent the foreground region Ω:

    O_{ij} = 1 if (A_{I_k})_{ij} ≠ 0;  O_{ij} = 0 if (A_{I_k})_{ij} = 0.        (16)

The energy of Ω can then be obtained using the Ising model [47]:

    Σ_{i,j} λ_a·O_{ij} + Σ_{i,j,x,y: |i−x|+|j−y|≤1} λ_b·|O_{ij} − O_{xy}|,        (17)

where λ_a and λ_b are two positive parameters that penalize O_{ij} = 1 and |O_{ij} − O_{xy}| = 1, respectively.

Clearly, if we simply minimized the energy of the foreground region Ω, it would converge to an empty set, i.e., O = 0. To avoid this, an important component of the objective function is P_Ω(F_{I_k}). In addition, the non-zero elements outside the foreground region should also be minimized. Thus, we have the following foreground detection model:

    min_{O_{ij}} Σ_{O_{ij}=0} (1/2)((I_k)_{ij} − (B_{I_k})_{ij})² + Σ_{i,j} λ_a·O_{ij} + Σ_{i,j,x,y: |i−x|+|j−y|≤1} λ_b·|O_{ij} − O_{xy}|.        (18)

Model (18) can be rearranged as

    min_{O_{ij}} Σ_{i,j} (λ_a − (1/2)((I_k)_{ij} − (B_{I_k})_{ij})²)·O_{ij} + λ_b·||A ×_3 O||_1,        (19)

where A is a projection tensor. The constant part of Model (18) is omitted because it is insignificant in an optimization problem. Model (19) is the standard form of a first-order Markov Random Field and can be solved exactly using graph cuts [49].
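Below is a small sketch of the detection step. The paper solves Model (19) exactly with graph cuts [49]; for a dependency-free illustration we substitute iterated conditional modes (ICM), a greedy minimizer of the same energy, named plainly as a stand-in. The values of lambda_a, lambda_b, and the sweep count are assumptions.

```python
import numpy as np

def detect_foreground(I, B, lam_a=0.05, lam_b=0.3, sweeps=10):
    """Greedy (ICM) minimization of the MRF energy of Model (18)
    on gray frame I and background B scaled to [0, 1]. Returns the
    0/1 foreground mask O. Graph cuts [49] would solve this exactly."""
    data = 0.5 * (I - B) ** 2              # cost of labeling a pixel 0
    O = (data > lam_a).astype(np.uint8)    # initialize from the residual
    m, n = O.shape
    for _ in range(sweeps):
        changed = False
        for i in range(m):
            for j in range(n):
                # Smoothness term of Eq. (17) over the 4-neighborhood.
                nb = [O[x, y] for x, y in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                      if 0 <= x < m and 0 <= y < n]
                e0 = data[i, j] + lam_b * sum(1 for v in nb if v != 0)
                e1 = lam_a + lam_b * sum(1 for v in nb if v != 1)
                new = np.uint8(e1 < e0)
                changed |= bool(new != O[i, j])
                O[i, j] = new
        if not changed:
            break
    return O
```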
V. EXPERIMENTAL ANALYSIS

In this section, we evaluate the performance of the Sparse Outlier Iterative Removal (SOIR) algorithm. We explore the appropriate number of discriminative frames and test the accuracy and time consumption of the algorithm. The experiments are conducted on real sequences from public datasets such as the I2R [24], flowerwall [50], SABS [51], and BMC [52] datasets. In addition, some public video sequences from the Internet are also included. All experiments are conducted and timed in Matlab R2010a on a PC with a 3.20 GHz Intel(R) Core(TM) CPU and 4 GB of RAM.

A. Number of discriminative frames

A major aspect of this paper is utilizing sparse representation to reduce the size of the video. In this section, we explore the appropriate number of discriminative frames.

First, we provide the details of our experiment on the "Bootstrap" sequence of the I2R dataset. The scene in this video takes place in front of a buffet. We use the first 300 frames of the sequence as our original frame set D, and we measure the performance of the SOIR algorithm as the number of frames in the selected discriminative frame set D̃ varies from 1 to 30. For comparison, we need a standard background, which we extract using all 300 frames.

[Fig. 4. The background extraction results with the size of the selected set varying from 1 to 30. The standard background is extracted using all 300 frames as the selected set.]

The results are shown in Fig. 4. In the figure, we can see that most of the extracted backgrounds are quite similar to the standard, even when the number of frames is small. However, it is a little disappointing that the counter is not recovered exactly, even in our standard background. The two small fuzzy areas are the spaces just in front of the buffet, where people are continuously standing and taking meals in nearly every frame. We measure the relationship between the rate of convergence and the number of discriminative frames; the results are shown in Fig. 6(a).

Next, we repeat these operations on some additional video clips, i.e., the "Escalator", "hall", and "ShoppingMall" sequences of the I2R dataset, and a real video sequence ("Video001") of the BMC dataset. The results for each video sequence (including "Bootstrap") are shown in Fig. 6(b).

[Fig. 6. The relationship between the number of frames and the performance (1): (a) the "Bootstrap" sequence; (b) the "Bootstrap", "Escalator", "hall", "ShoppingMall", and "Video001" sequences. The distance ratio divides the distance between the result and the standard by the distance between the standard and the origin of coordinates.]

We can see from Fig. 6 that the performance is not very high when the number of frames is small. However, it improves as the number of frames increases. When the number of frames exceeds 20, the ratio tends to become stable. The small fluctuation of each curve is caused by the non-background information of each new frame; this influence, however, weakens after the frame is processed by our algorithm. In addition, we find that the content of the video affects the results. In the "Video001" sequence, the foreground region is small; a small number of frames already carries sufficient background information, and the curve is smooth. In the "Bootstrap" sequence, by contrast, many more frames are needed to deal with the changes in illumination, although the 30 results shown in Fig. 4 all look the same.

Finally, we also explore the appropriate number of discriminative frames when the original frame number is larger; the results are shown in Fig. 5. The curve varies as the original number of frames increases because the discriminative frames differ. However, the distance ratio tends to become stable in all experiments once the number of discriminative frames reaches around 35. The ratio improves if we use more frames, but the effect is unremarkable. This means that roughly 35 frames already carry sufficient background information in most cases. If the content of the video sequence is fairly simple, fewer frames are required; conversely, more frames are needed if the content is complex.

[Fig. 5. The relationship between the number of frames and the performance (2): (a) the original frame number is 450; (b) the original frame number is 600; (c) the original frame number is 750; (d) the original frame number is 900.]
B. Experiments on time consumption

In this section, we describe the time consumption of our model when solving tasks of different sizes. The majority of traditional methods are applied to sequences with resolutions of around 150×150 and with around 50 frames. When the scale of the data increases, these methods tend to become inefficient.

First, we focus on the number of frames in the sequence. We extract the background at four levels, i.e., from the first 300 frames, the first 600 frames, the first 900 frames, and the first 1200 frames. The results are shown in Fig. 7.

[Fig. 7. The experiments on different video sequences, with the time consumption shown alongside (in seconds). The sequences are from the I2R dataset and the BMC dataset.]

When dealing with sequences with larger numbers of frames than usual, our model solves the background efficiently: hundreds of frames cost only dozens of seconds. As the number of frames increases, the precision of the extracted background improves. Temporary static motion is a problem for most traditional background modeling methods: once a person remains at a particular spot for a while, he may be considered part of the background in a short video sequence. In our experiments, as the number of frames increases, the problem of temporary stays is fully resolved, as illustrated by the results on the "hall" sequence. We can also see that the time consumption is non-linear in the number of frames. On the one hand, the uncertainty of a background created by a temporary stay may cost some additional time; on the other hand, the foreground content and noise also influence the time consumption.

Next, we test our model on video sequences of both low and high resolutions. We use four video sequences: the first is from the BMC dataset, and the other three are intersection-monitoring video sequences from a public resource. We test our model on the first 50 frames and the first 150 frames of each sequence. The results are shown in Fig. 8.

[Fig. 8. The experiments on videos with much higher resolutions, with the time consumption shown alongside (in seconds). The resolutions are given at the top of each column. The leftmost video is from the BMC dataset, while the others are from the Internet.]

The synthetic video from the BMC dataset consumes the least amount of time. In the other three real-life videos, the time consumption increases with the resolution of the video sequence; our model spends dozens of seconds solving the high-resolution sequences. We can also conclude from Fig. 8 that the content of the video sequence influences the performance. In the third video, the distant cars move slowly in the fixed lens owing to perspective, which is in effect an approximation of a temporary stay. We can see that 150 frames are still insufficient to resolve this phenomenon, and more frames are needed.

For comparison, we also examine the time consumption of some additional methods. Because our model is a PCP model, two PCP-based methods are included, i.e., Principal Component Pursuit (PCP) [29] and Detecting Contiguous Outliers in Low-Rank Representation (DECOLOR) [33], both of which perform well on small-scale problems. In addition, we use the Single Gaussian (SG) model [13], which is currently the fastest method [3].

In these experiments, we use the "MovedObject" and "Bootstrap" sequences from the flowerwall dataset, and the "hall" and "Campus" sequences from the I2R dataset. For each video sequence, we use 150, 300, and 450 frames as the original frame set from which to extract the background. As shown in Table I, SG is the fastest, but its speed is obtained at the expense of accuracy, which is lower than that of most other popular methods [3].

TABLE I
THE TIME CONSUMPTION (IN SECONDS) OF PCP, DECOLOR, SG, AND SOIR.

Sequence     | Frames | PCP    | DECOLOR | SG   | SOIR
MovedObject  | 150    | 24.66  | 43.76   | 1.37 | 2.65
             | 300    | 53.97  | 74.66   | 2.81 | 6.19
             | 450    | 85.52  | 124.12  | 4.30 | 11.90
Bootstrap    | 150    | 61.27  | 186.18  | 1.58 | 6.25
             | 300    | 124.61 | 491.57  | 3.43 | 7.5
             | 450    | 207.56 | 902.26  | 5.43 | 10.59
hall         | 150    | 64.57  | 144.28  | 2.00 | 3.75
             | 300    | 154.71 | 303.98  | 4.47 | 5.94
             | 450    | 246.26 | 599.91  | 7.26 | 9.61
Campus       | 150    | 39.26  | 27.18   | 1.72 | 5.66
             | 300    | 89.01  | 47.13   | 3.63 | 6.94
             | 450    | 150.67 | 80.14   | 6.77 | 12.09

SOIR is on average ten times faster than the other PCP-based models and almost reaches the speed of SG. This speed comes from SOIR extracting the background from the discriminative frames instead of the original frame set. When the scale of the data is large, the major time cost of SOIR lies in exploring these discriminative frames; once they are obtained, the background can be modeled quickly and precisely. In the next section, we illustrate the high accuracy of our method on real-life video sequences.
C. Detecting the foreground

1) Evaluation on an artificial dataset: In this section, we report the evaluation results obtained on the SABS dataset. Several approaches in the literature [52], [53] have been evaluated on the SABS dataset, and their recall-precision curves have been reported. For comparison with these curves, we also evaluate our algorithm on the SABS dataset and give the corresponding recall-precision curves.

[Fig. 9. Precision and recall of our method for the videos in the SABS dataset ("Basic", "Bootstrap", "LightSwitch", "Darkening", and "Camouflage" scenes).]

In Fig. 9, the curves are evaluated on different scenes of the SABS dataset. Our method performs well for most scenes, especially the "Basic" and "Camouflage" sequences. The performance on the "LightSwitch" sequence is not very good; as shown in the literature [52], [53], the light switch in that video is a huge challenge for most foreground detection methods.

2) Evaluation on real scenes: We compare our performance with that of some other approaches, i.e., Mixture of Gaussians (MOG) [14], PCP [29], DECOLOR [33], and Robust Dictionary Learning (RDL) [39]. The fastest method, SG, is not included here, because its accuracy is lower than those of most popular methods [3]; we therefore use the more complex Gaussian model, MOG. We use video sequences from the I2R and flowerwall datasets and compare the detected foreground region with the given hand-segmented foreground region. The test frame is chosen randomly from all hand-segmented frames. To avoid the influence of a temporary stay, we use 250 frames, the last of which is the test frame.

The sequences and results are shown in Fig. 10. In these experiments, SOIR extracts the background exactly for almost all of the sequences and is robust to noise. In video sequence (g), the man remains at the same spot in all 250 frames. We can see that our algorithm is robust to noise and performs well for foreground detection, which benefits from the accurate background results and the MRF model. DECOLOR also performs well because it, too, models the foreground using an MRF. In most sequences, the results of SOIR are better than those of DECOLOR, because the backgrounds extracted by our algorithm are more exact.

To quantitatively evaluate the performance of the different algorithms, we compute the F-measure, which is derived from the precision and recall:

    F-measure = 2 × precision × recall / (precision + recall).        (20)

TABLE II
F-MEASURES OF THE SEQUENCES SHOWN IN FIG. 10.

Sequence | SOIR   | PCP    | MOG    | DECOLOR | RDL
(a)      | 0.9737 | 0.6110 | 0.2047 | 0.5669  | 0.1170
(b)      | 0.9020 | 0.7129 | 0.3841 | 0.8244  | 0.7326
(c)      | 0.8452 | 0.6986 | 0.5406 | 0.7225  | 0.6160
(d)      | 0.8314 | 0.5248 | 0.2498 | 0.6439  | 0.5367
(e)      | 0.8170 | 0.6046 | 0.4014 | 0.8966  | 0.3476
(f)      | 0.7972 | 0.5902 | 0.2455 | 0.6487  | 0.2399
(g)      | 0.6382 | 0.5104 | 0.1962 | 0.3941  | 0.5409

Table II shows the F-measures of all the detected foreground regions in Fig. 10. The results of SOIR are better than those of the other four methods for six sequences, i.e., (a), (b), (c), (d), (f), and (g), while SOIR is slightly worse than DECOLOR on sequence (e). We can also see that the performance of SOIR varies among the different video sequences. On the one hand, this is due to the result of the background extraction, as is the case for sequence (g); on the other hand, the instability of the background also affects the performance of SOIR.

VI. CONCLUSION AND FUTURE WORK

In this paper, we propose a Sparse Outlier Iterative Removal (SOIR) algorithm to model the background of a video sequence. We find that a few discriminative frames are already sufficient to model the background, and we use a sparse representation process to reduce the size of the video. Although exploring the "discriminative" frames takes some time, it saves much more time in modeling the background. A cyclic iteration process is proposed for the background extraction. SOIR achieves both high accuracy and high speed when dealing with real-life video sequences; in particular, SOIR has an advantage in solving large-scale tasks.

As future work, we will address more complex problems in which the background is no longer stable among different frames.
REFERENCES

[1] H. Li, C. Shen, Q. Shi, Real-time visual tracking using compressive sensing, Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Providence, RI, USA, Jun. 2011, pp. 1305-1312.
[2] K. Zhang, L. Zhang, M. Yang, Real-time compressive tracking, European Conf. on Comput. Vision, Oct. 2012, pp. 864-877.
[3] T. Bouwmans, Recent advanced statistical background modeling for foreground detection - A systematic survey, Recent Patents on Comput. Sci., vol. 6, no. 3, pp. 147-176, Nov. 2011.
[4] C. Kim, J. Hwang, Object-based video abstraction for video surveillance systems, IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 12, pp. 1128-1138, Apr. 2010.
[5] S. Li, H. Lu, L. Zhang, Arbitrary body segmentation in static images, Pattern Recognit., vol. 45, no. 9, pp. 3402-3413, Sep. 2012.
[6] D. Park, H. Byun, A unified approach to background adaptation and initialization in public scenes, Pattern Recognit., vol. 46, no. 7, pp. 1985-1997, Jul. 2013.
[7] W. Wang, J. Yang, W. Gao, Modeling background and segmenting moving objects from compressed video, IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 5, pp. 670-681, May 2008.
[8] M. Droogenbroeck, O. Paquot, Background subtraction: Experiments and improvements for ViBe, IEEE Int. Conf. Comput. Vision Pattern Recognit. Workshop, Providence, RI, USA, Jun. 2012, pp. 32-37.
[9] V. Mahadevan, N. Vasconcelos, Background subtraction in highly dynamic scenes, Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Anchorage, AK, USA, Jun. 2008, pp. 1-6.
[10] Y. Chen, C. Chen, C. Huang, Y. Hung, Efficient hierarchical method for background subtraction, Pattern Recognit., vol. 40, no. 10, pp. 2706-2715, Oct. 2007.
[11] S. Wang, T. Su, S. Lai, Detecting moving objects from dynamic background with shadow removal, IEEE Int. Conf. Acoust. Speech and Signal Process., Prague, CZ, May 2011, pp. 925-928.
[12] A. Ulges, T. Breuel, A local discriminative model for background subtraction, Pattern Recognit., Springer Berlin Heidelberg, Jun. 2008, pp. 507-516.
[13] C. Wren, A. Azarbayejani, T. Darrell, A. Pentland, Pfinder: Real-time tracking of the human body, IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 780-785, Jul. 1997.
[14] C. Stauffer, W. Grimson, Adaptive background mixture models for real-time tracking, Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Fort Collins, CO, USA, 1999.
[15] D. Lee, Effective Gaussian mixture learning for video background subtraction, IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 5, pp. 827-832, May 2005.
[16] J. K. Suhr, H. G. Jung, G. Li, J. Kim, Mixture of Gaussians-based background subtraction for Bayer-pattern image sequences, IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 3, pp. 365-370, Mar. 2011.
