
Outlier Detection for Text Data: An Extended Version

Ramakrishnan Kannan, Oak Ridge National Laboratory, Oak Ridge, TN, USA ([email protected])
Hyenkyun Woo, Korea University of Technology and Education, Republic of Korea ([email protected])
Charu C. Aggarwal, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA ([email protected])
Haesun Park, Georgia Institute of Technology, Atlanta, GA, USA ([email protected])

arXiv:1701.01325v1 [cs.IR] 5 Jan 2017

ABSTRACT

The problem of outlier detection is extremely challenging in many domains such as text, in which the attribute values are typically non-negative, and most values are zero. In such cases, it often becomes difficult to separate the outliers from the natural variations in the patterns in the underlying data. In this paper, we present a matrix factorization method which is naturally able to distinguish the anomalies with the use of low rank approximations of the underlying data. Our iterative algorithm TONMF is based on the block coordinate descent (BCD) framework. We define blocks over the term-document matrix such that the function becomes solvable. Given the most recently updated values of the other matrix blocks, we always update one block at a time to its optimum. Our approach has significant advantages over traditional methods for text outlier detection. Finally, we present experimental results illustrating the effectiveness of our method over competing methods.

1 Introduction

The problem of outlier detection is that of finding data points which are unusually different from the rest of the data set. Such outliers are also variously referred to as anomalies, deviants, discordants or abnormalities in the data. Since outliers correspond to unusual observations, they are often of interest to the analyst in finding interesting anomalies in the underlying generating process. The problem of outlier analysis is applicable to a wide variety of domains such as machine monitoring, financial markets, environmental modeling and social network analysis. Correspondingly, the problem has been studied in the context of the different data types which arise in these domains, such as multidimensional data, spatial data, and discrete sequences. Numerous books and surveys have been written on the problem of outlier detection [1, 7, 8, 16].

In this paper, we will study the problem of text outlier analysis. The problem of text outlier analysis has become increasingly important because of the greater prevalence of web-centric and social media applications, which are rich in text data. Some important applications of text outlier analysis are as follows:

- Web Site Management: An unusual page from a set of articles in a web site may be flagged as an outlier. The knowledge of such outliers may be used for web site management.

- Sparse High Dimensional Data: While the methods discussed in this paper have text applications in mind, they can be used for other sparse high dimensional domains. For example, such methods can be used for market basket data sets. Unusual transactions may sometimes provide an idea of fraudulent behaviour.

- News Article Management: It is often desirable to determine unusual news articles from a collection of news documents. An unusual news item from a group of articles may be flagged as an interesting outlier.

While text is an extremely important domain from the perspective of outlier analysis, there are surprisingly few methods which are specifically focused on this domain, even though many generic methods, such as distance-based methods, can be easily adapted to this domain [21, 29], and are often used for text outlier analysis. Domains such as text are particularly challenging for the problem of outlier analysis because of their sparse high dimensional nature, in which only a small fraction of the words take on non-zero values. Furthermore, many words in a document may be topically irrelevant to the context of the document and add to the noise in the distance computations. For example, the word "Jaguar" may correspond to a car or a cat, depending on the context of the document. In particular, the significance of a word can be interpreted only in terms of the structure of the data within the context of a particular data locality. As a result, document-to-document similarity measures often lose their robustness. Thus, commonly used outlier analysis methods for multidimensional data, such as distance-based methods, are not particularly effective for text data. Our experiments also validate this observation.

In this paper, we will use non-negative matrix factorization (NMF) methods to address the aforementioned challenges in text anomaly detection. One advantage of matrix factorization methods is that they decompose the term-document structure of the underlying corpus into a set of semantic term clusters and document clusters. The semantic nature of this decomposition provides the context in which a document may be interpreted for outlier analysis. Thus, documents can be decomposed into word clusters, and words are decomposed into document clusters, with a low-rank approximation. (In this paper, we use the terms "low rank approximation" and "matrix factorization" interchangeably. Similarly, we use the terms "anomalies" and "outliers" interchangeably.) Outliers are therefore defined as data points which cannot be naturally expressed in terms of this decomposition. By using carefully chosen model formulations, one can further sharpen the matrix-factorization method to reveal document-centric outliers. One challenge in this case is that the design of a matrix factorization approach which is optimized for anomaly detection results in a non-standard formulation. Therefore, we will design an optimization solution for this model. The NMF model also has the advantage of providing better interpretability, and it can provide insights into why a document should be considered an outlier. We present extensive experimental results on many data sets, and compare against a variety of baseline methods. We show significant improvements achieved by the approach over a variety of other methods.

This paper is organized as follows. The remainder of this section discusses the related work. Section 2 introduces the model for outlier analysis. The algorithm to solve this model is provided in Section 3. Section 4 provides the experimental results. The conclusions and summary are contained in Section 5. Our code can be downloaded from https://github.com/ramkikannan/outliernmf and tried with any text dataset.

1.1 Related Work

The outlier analysis problem has been studied extensively in the literature [1, 7, 16]. Numerous algorithms have been proposed in the literature for outlier detection of conventional multidimensional data [2, 5, 21, 29]. The key methods which are used frequently for outlier analysis include distance-based methods [21, 29], density-based methods [5], and subspace methods [2, 18, 24, 28, 23]. In distance-based methods, data points are declared outliers when they are situated far away from the dense regions in the underlying data. Typically, indexing or other summarization schemes may be used in order to improve the efficiency of the approach. In density-based methods [5], data points with low local density with respect to the remaining points are declared outliers. In addition, a number of subspace methods [2, 18, 24, 28, 23] have been proposed recently, in which outliers are defined on the basis of the subspace behavior of the underlying data. Most of the traditional multidimensional methods [7, 1] can also be extended to text data, though they are not particularly suited to the latter. Some methods have been designed for outlier detection with matrix factorization in network data sets [31], but these are not applicable to text data. Text data is uniquely difficult because of its sparse and high dimensional nature. As a result, many of the outliers detected using conventional methods may simply correspond to noisy text segments. Therefore, careful modeling is required with the use of matrix factorization methods.

Over the last decade, Non-negative Matrix Factorization (NMF) has emerged as another important low rank approximation technique, in which the low-rank factor matrices are constrained to have only non-negative elements. Lee and Seung [25] introduced a multiplicative-update based low rank approximation with non-negative factors to overcome the challenges of the truncated SVD. Subsequent to this work, NMF has received enormous attention and has been successfully applied to a broad range of important problems in areas including computer vision, community detection in social networks, visualization, recommender systems, bioinformatics, etc. In spite of this broad range of applications, the NMF literature in the text domain is scarce. Xu et al. [34] experimented with NMF for document clustering instead of SVD-based Latent Semantic Indexing (LSI). Beyond applications of NMF in the text domain, Gaussier and Goutte [14] established the equivalence between NMF and pLSA. Similarly, Ding et al. [11] explained the equivalence between NMF and pLSI.

In this paper, we use an NMF approach for concise modelling of the patterns, the background, and the anomalies in the underlying data. It should be pointed out that NMF is similar to generative models of text such as pLSI and LDA [14][11][30], though NMF often provides better interpretability. Our important challenge is to model the outliers along with the low rank space of the input matrix. We identified the ℓ1,2-norm as an appropriate approach for factorization in outlier analysis. Recently, researchers have used the ℓ2,1-norm in their models to solve various problems, though the corresponding solution techniques are not easily generalizable to the ℓ1,2-norm. Yang et al. [35], under the assumption that the class label of the input data can be predicted by a linear classifier, incorporate discriminative analysis and ℓ2,1-norm minimization into a joint framework for the unsupervised feature selection problem. Similarly, Liu et al. [27] solve an ℓ2,1-norm regularized regression model for joint feature selection from multiple tasks. They also propose to use Nesterov's method to solve the optimization problem with the non-smooth ℓ2,1-norm regularization. Also, Kong et al. [22] propose a robust formulation of NMF using an ℓ2,1-norm loss function for data with noise.

1.2 Our Contributions

Text data is uniquely challenging for outlier detection because of both its sparsity and its high dimensional nature. Given the relevant literature on NMF and text outliers, we propose the first approach to detect outliers in text data using non-negative matrix factorization. We build on the fact that NMF is similar to the pLSI and LDA generative models, and model the outliers using the ℓ1,2-norm. This particular formulation of NMF is non-standard, and requires careful design of optimization methods to solve the problem. We solve the resulting optimization problem using a block coordinate descent technique. We also present extensive experimental results both on text and on other kinds of market basket data sets. We show significant improvements achieved by the approach over other baseline methods.

2 Matrix Factorization Model

This section will present the matrix factorization model which is used for outlier detection. Before discussing the model in detail, we present the notations and definitions. We represent the corpus of text documents as a bag-of-words matrix. A lowercase or uppercase letter such as x or X is used to denote a scalar. A boldface lowercase letter, such as x, is used to denote a vector, and a boldface uppercase letter, such as X, is used to denote a matrix. This is consistent with what is commonly used in much of the data mining literature. Indices typically start from 1, unless otherwise mentioned. For a matrix X, x_i denotes its i-th column, x_j^T denotes its j-th row, and x_ij, X(i,j) or (X)_ij denotes its (i,j)-th element. For greater expressibility, we have also borrowed certain notations from matrix manipulation scripts such as Matlab and Octave. For example, the notation max(x) returns the maximal element x ∈ x, and max(X) returns a vector of the maximal elements from each column x ∈ X. Similarly, X(i,:) denotes the i-th row of the matrix, and X(:,i) its i-th column. For the reader's convenience, the notations used in the paper are summarized in Table 1.

Table 1: Notations used in the paper

  Notation                                   Explanation
  A = [a_1 ... a_n] ∈ R_+^{m x n}            Document-word matrix
  m                                          Vocabulary size
  n                                          Number of documents
  Z ∈ R^{m x n}                              Outlier matrix
  r < rank(A)                                Rank
  W ∈ R_+^{m x r}                            Term-topic matrix
  H ∈ R_+^{r x n}                            Topic-document matrix
  A^{(i)}                                    Matrix A from the i-th iteration
  \|A\|_{1,2} = \sum_{i=1}^{n} \|a_i\|_2     ℓ1,2-norm, where a_i is the i-th column of A

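As a concrete illustration of the ℓ1,2-norm in Table 1, the following minimal sketch (ours, in Python with NumPy; the paper's released code is in Matlab) computes \|A\|_{1,2} as the sum of the ℓ2 norms of the columns of A:

import numpy as np

def l12_norm(A):
    # l1,2-norm of Table 1: sum over documents (columns) of the l2 norm
    # of each column, i.e. ||A||_{1,2} = sum_i ||a_i||_2.
    return float(np.sum(np.linalg.norm(A, axis=0)))

# Example with a 3-term, 2-document matrix:
A = np.array([[1.0, 0.0],
              [2.0, 3.0],
              [2.0, 4.0]])
print(l12_norm(A))  # ||(1,2,2)||_2 + ||(0,3,4)||_2 = 3 + 5 = 8
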
Let A be the matrix representing the underlying data. In the context of a text collection, this corresponds to a term-document matrix, where terms correspond to rows and documents correspond to columns. In other words, a_ij denotes the number of times term i appears in document j. Generally, we can write A as follows:

  A = L_0 + Z_0.    (1)

Here, L_0 is a low rank matrix and Z_0 represents the matrix of outlier entries. Typically, the matrix L_0 represents the documents created by a lower rank generative process (such as that modeled by pLSI), and the parts of the documents that do not correspond to the generative process are represented as part of the matrix Z_0. In real world scenarios, the outlier matrix Z_0 contains entries which are very close to zero, and only a small number of entries have significantly non-zero values. These significantly non-zero entries are often present in only a small fraction of the columns. Columns which are fully representable in terms of the factors are consistent with the low rank behavior of the data, and are therefore not outliers. The rank of L_0 is not known in advance, and it can be expressed in terms of its underlying factors:

  L_0 ≈ W_0 H_0

Here, the two matrices have dimensions W_0 ∈ R_+^{m x r} and H_0 ∈ R_+^{r x n}, with r ≤ rank(L_0). The matrices W_0 and H_0 are non-negative, and this provides interpretability in terms of being able to express a document as a non-negative linear combination of the relevant basis vectors, each of which can itself be considered a frequency-annotated bag of words (topics) because of its non-negativity. Specifically, H_0 corresponds to the coefficients for the basis matrix W_0. Intuitively, this corresponds to the case in which every document a_i is represented as a linear combination of the r topics. In cases where this is not true, the document is an outlier, and those unrepresentable sections of the matrix are captured by the non-zero entries of the Z_0 matrix. In real scenarios, the entries in this matrix are often extremely skewed, and the small number of non-zero entries very obviously expose the outliers. The decomposition of the matrix into its different components is pictorially illustrated in Figure 1. [Figure 1: Text Outliers Using NMF.]

In order to determine the best low rank factorization, one must try to optimize the aggregate values of the residuals in the matrix. This can of course be done in a variety of ways, depending upon the goals of the underlying factorization process.

We model the determination of the matrices W, H and Z as the following optimization problem:

  (W_0, H_0; Z_0) = \argmin_{W \ge 0, H \ge 0; Z} \frac{1}{2} \|A - WH - Z\|_F^2 + \alpha \|Z\|_{1,2}    (2)

This problem does not have a closed form solution, since the ℓ1,2-norm penalty is applied to Z. The logic for applying the ℓ1,2-norm penalty in the context of the outlier detection problem is as follows. Each entry in Z corresponds to a term in a document, whereas we are interested in the outlier behavior of an entire document. This aggregate outlier behavior of a document x can be modeled with the ℓ2-norm score of the particular column z_x. In a real scenario, if a large segment of a document x is not representable as a linear combination of the r topics through L_0, the corresponding column z_x of the matrix Z compensates by having more non-zero entries. In other words, we will have a higher ℓ2 value for the corresponding column z_x, and this corresponds to a higher outlier score. Furthermore, the ℓ1,2-norm penalty on Z defines the sum of the ℓ2-norm outlier scores over all the documents. Therefore, the optimization problem essentially tries to find the best model, an important component of which is to minimize the sum of the outlier scores over all documents. While a variety of different (and more commonly used) penalties, such as the Frobenius norm, are available for matrix factorization models, we have chosen the ℓ1,2-norm penalty because of its intuitive significance in the context of the outlier detection problem, and its tendency to create skewed outlier scores across the columns of the matrix. As we will see in the next section, this comes at the expense of a formulation which is more difficult to solve algorithmically.

For high dimensional data, sparse coefficients are desirable for obtaining an interpretable low rank matrix WH. For this purpose, we add an ℓ1 penalty on H:

  \min_{W \ge 0, H \ge 0; Z} \frac{1}{2} \|A - WH - Z\|_F^2 + \alpha \|Z\|_{1,2} + \beta \|H\|_1    (3)

The constant α defines the weight for the outlier matrix Z relative to the recovery of the low rank space L and the sparsity term. In the case of outlier detection in text documents, we give more weight to the outlier matrix than to the low rank representation L. This problem does not have a closed form solution, and therefore we cannot directly recover the low rank matrix WH in closed form. However, we can recover the column space. Without non-negativity constraints, this property is also known as the rotational invariance property [12, 33]. This particular formulation of the matrix factorization model is somewhat different from the commonly used formulations, and off-the-shelf solutions do not directly exist for this scenario. Therefore, in a later section, we will carefully design an algorithm with the use of block coordinate descent for this problem.

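To make the role of the two penalties concrete, the following minimal sketch (our illustration in Python/NumPy, not the released implementation) evaluates objective (3) for given factors W, H and outlier matrix Z, and reads off the per-document outlier scores as the ℓ2 norms of the columns of Z:

import numpy as np

def tonmf_objective(A, W, H, Z, alpha, beta):
    # Objective (3): (1/2)||A - WH - Z||_F^2 + alpha*||Z||_{1,2} + beta*||H||_1.
    fit = 0.5 * np.linalg.norm(A - W @ H - Z, 'fro') ** 2
    outlier_penalty = alpha * np.sum(np.linalg.norm(Z, axis=0))
    sparsity_penalty = beta * np.sum(np.abs(H))
    return fit + outlier_penalty + sparsity_penalty

def outlier_scores(Z):
    # The l2 norm of each column of Z is the outlier score of that document.
    return np.linalg.norm(Z, axis=0)
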
In order to better understand the modeling of the outliers, we present the reader with a toy example from a real world dataset, to show how skewed the typical values of the corresponding column z(x) may be in real scenarios. In this case, we used the BBC dataset (http://mlg.ucd.ie/datasets/bbc.html). This dataset consists of documents from the BBC news website corresponding to stories in the areas business, entertainment, politics, sport and tech from 2004-2005. We took all the documents from business and politics, and 50 documents from tech, labeled as outliers. We randomly permuted the columns to shuffle the outliers in the matrix, in order to avoid any spatial bias. We computed the Z matrix and generated the ℓ2 scores of the columns of the outlier matrix Z. Figure 2 shows the outlier (ℓ2) scores of the documents. [Figure 2: ℓ2 norm of the columns of the outlier matrix Z; the X-axis illustrates the index of the document, and the Y-axis illustrates the outlier score.] It is evident that the scores for some columns are so close to zero that they cannot even be seen on a diagram drawn to scale. These columns also happen to be the non-outlier/regular documents of the collection. Such documents a_x ∈ R^m correspond to the low rank space, and are approximately representable as a product of the basis matrix W with the corresponding column vector of coefficients h_x ∈ R^r drawn from H. However, the documents that are not representable in such a low rank space have a large outlier score. From the distribution of the outlier scores, we can also observe that the scores of outlier documents are clearly separable from those of non-outliers using a simple statistical mean and standard deviation analysis. Therefore, while we use the scores to rank the documents in terms of their outlier behavior, the skew in the entries ensures that it is often easy to choose a cut-off in order to distinguish the outliers from the non-outliers.

In the following sections, we will analyze the properties and performance of this model (3) for outlier detection problems.

3 Algorithmic Solution

As discussed earlier, our technique is based on NMF, and this particular formulation (3), which is suited to outlier analysis, is relatively uncommon and does not have a closed form solution. In order to address this issue, we use a Block Coordinate Descent (BCD) framework and its application to solve the optimization problem (3). The BCD framework is a popular choice not only because of the ease of implementation, but also because it is scalable. First, we will lay the foundation for the basic BCD technique, as it generally applies to non-linear optimization problems. We will then relate it to our non-negative matrix factorization problem, and explain our algorithm Text Outliers using Nonnegative Matrix Factorization (TONMF) in detail.

3.1 Block Coordinate Descent

In this section, we present the relevant foundation for using this framework. Consider a constrained non-linear optimization problem:

  \min f(x) subject to x \in \mathcal{X},    (4)

where \mathcal{X} is a closed convex subset of R^N. An important assumption to be exploited in the BCD method is that the set \mathcal{X} is represented by a Cartesian product:

  \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m,    (5)

where \mathcal{X}_j, j = 1, ..., m, is a closed convex subset of R^{N_j} satisfying N = \sum_{j=1}^{m} N_j. Accordingly, the vector x is partitioned as x = (x_1, ..., x_m), so that x_j \in \mathcal{X}_j for j = 1, ..., m. The BCD method solves for x_j by fixing all the other subvectors of x in a cyclic manner. That is, if x^{(k)} = (x_1^{(k)}, ..., x_m^{(k)}) is given as the current iterate at the k-th step, the algorithm generates the next iterate x^{(k+1)} = (x_1^{(k+1)}, ..., x_m^{(k+1)}) block by block, according to the solution of the following subproblem:

  x_j^{(k+1)} \leftarrow \argmin_{\xi \in \mathcal{X}_j} f(x_1^{(k+1)}, \ldots, x_{j-1}^{(k+1)}, \xi, x_{j+1}^{(k)}, \ldots, x_m^{(k)}).    (6)

Also known as a non-linear Gauss-Seidel method [3], this algorithm updates one block each time, always using the most recently updated values of the other blocks. This is important since it ensures that, after each update, the objective function value does not increase.

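Schematically, the cyclic update of Eq. (6) can be sketched as the following loop, where solve_block is a hypothetical placeholder for the exact minimizer of one block with all other blocks held fixed (this is an illustration of the framework, not of any particular NMF algorithm):

def bcd(blocks, solve_block, n_iters):
    # blocks: list of current block values x_1, ..., x_m.
    # solve_block(j, blocks): returns argmin over block j of f, with all
    # other blocks held at their most recently updated values.
    for _ in range(n_iters):
        for j in range(len(blocks)):
            # Using the most recent values of the other blocks ensures that
            # the objective function value does not increase after an update.
            blocks[j] = solve_block(j, blocks)
    return blocks
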
For a sequence {x^{(k)}} where each x^{(k)} is generated by the BCD method, the following property holds.

Theorem 1: Suppose f is continuously differentiable in \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m, where \mathcal{X}_j, j = 1, ..., m, are closed convex sets. Furthermore, suppose that for all j and k, the minimum of

  \min_{\xi \in \mathcal{X}_j} f(x_1^{(k+1)}, \ldots, x_{j-1}^{(k+1)}, \xi, x_{j+1}^{(k)}, \ldots, x_m^{(k)})

is uniquely attained. Let {x^{(k)}} be the sequence generated by the block coordinate descent method as in Eq. (6). Then, every limit point of {x^{(k)}} is a stationary point. The uniqueness of the minimum is not required for the case m = 2 [15].

The proof of this theorem for an arbitrary number of blocks is shown in Bertsekas [3]. For a non-convex optimization problem, most algorithms only guarantee stationarity of a limit point [26].

When applying the BCD method to a constrained non-linear programming problem, it is critical to wisely choose a partition of \mathcal{X} whose Cartesian product constitutes \mathcal{X}. An important criterion is whether the sub-problems in Eq. (6) are efficiently solvable. For example, if the solutions of the sub-problems appear in a closed form, each update can be computed fast. In addition, it is worth checking how the solutions of the sub-problems depend on each other. The BCD method requires that the most recent values be used for each sub-problem in Eq. (6). When the solutions of sub-problems depend on each other, they have to be computed sequentially to make use of the most recent values. If the solutions for some blocks are independent of each other, they can be computed simultaneously. We now discuss how different choices of partitions lead to different NMF algorithms. The partitioning can be achieved in several ways, by using either matrix blocks, vector blocks or scalar blocks.

3.1.1 BCD with Two Matrix Blocks - ANLS Method

The most natural partitioning of the variables is into two big blocks, W and H. In this case, following the BCD method in Eq. (6), we take turns solving:

  W^{(k+1)} \leftarrow \argmin_{W \ge 0} f(W, H^{(k)}),
  H^{(k+1)} \leftarrow \argmin_{H \ge 0} f(W^{(k+1)}, H).    (7)

Since the sub-problems are non-negativity constrained least squares (NLS) problems, the two-block BCD method has been called the alternating non-negative least squares (ANLS) framework [26, 19, 20].

3.1.2 BCD with 2k Vector Blocks - HALS/RRI Method

We can instead partition the unknowns into 2k blocks, in which each block is a column of W or a row of H. In this case, it is easier to consider the objective function in the following form:

  f(w_1, \ldots, w_r, h_1^T, \ldots, h_r^T) = \|A - \sum_{j=1}^{r} w_j h_j^T\|_F^2,    (8)

where W = [w_1, ..., w_r] ∈ R_+^{m x r} and H = [h_1, ..., h_r]^T ∈ R_+^{r x n}. The form in Eq. (8) represents the fact that A can be approximated by the sum of r rank-one matrices. Following the BCD scheme, we can minimize f by iteratively solving

  w_i \leftarrow \argmin_{w_i \ge 0} f(w_1, \ldots, w_r, h_1^T, \ldots, h_r^T) for i = 1, ..., r,

and

  h_i^T \leftarrow \argmin_{h_i^T \ge 0} f(w_1, \ldots, w_r, h_1^T, \ldots, h_r^T) for i = 1, ..., r.

The 2k-block BCD algorithm has been studied as Hierarchical Alternating Least Squares (HALS), proposed by Cichocki et al. [10, 9], and independently by Ho et al. [17] as rank-one residue iteration (RRI).

3.1.3 BCD with k(n+m) Scalar Blocks

We can also partition the variables into the smallest k(n+m) element blocks of scalars, in which every element of W and H is considered as a block in the context of Theorem 1. To this end, it helps to write the objective function as a quadratic function of the scalar w_ij or h_ij, assuming all other elements in W and H are fixed:

  f(w_ij) = \|(a_i^T - \sum_{\tilde{k} \ne j} w_{i\tilde{k}} h_{\tilde{k}}^T) - w_{ij} h_j^T\|_2^2 + const,    (9a)

  f(h_ij) = \|(a_j - \sum_{\tilde{k} \ne i} w_{\tilde{k}} h_{\tilde{k}j}) - w_i h_{ij}\|_2^2 + const,    (9b)

where a_i^T and a_j denote the i-th row and the j-th column of A, respectively.

In this paper, for solving the optimization problem (3), we partition the matrices Z, W, H into vector blocks z_1, ..., z_n, w_1, ..., w_r, h_1, ..., h_r. The reasoning behind this partitioning is explained in the next section.

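For reference, a minimal NumPy sketch of the 2k-vector-block (HALS/RRI) scheme for the plain NMF objective of Eq. (8) is given below. This is an unregularized illustration rather than the TONMF solver itself, and the small eps guard against division by zero is our addition:

import numpy as np

def hals_nmf(A, r, n_iters=100, eps=1e-12):
    # HALS/RRI: cycle through the r rank-one terms, updating one row of H
    # and one column of W at a time with their closed-form non-negative
    # least squares solutions.
    m, n = A.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((m, r)), rng.random((r, n))
    for _ in range(n_iters):
        for j in range(r):
            # Residual with the j-th rank-one term excluded.
            R = A - W @ H + np.outer(W[:, j], H[j, :])
            H[j, :] = np.maximum(R.T @ W[:, j] / (W[:, j] @ W[:, j] + eps), 0.0)
            W[:, j] = np.maximum(R @ H[j, :] / (H[j, :] @ H[j, :] + eps), 0.0)
    return W, H
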
3.2 Text Outliers using Nonnegative Matrix Factorization (TONMF)

In this section, we propose an efficient algorithm for the outlier detection model (3). To determine Z, W, H for the aforementioned optimization problem (3), we use the block coordinate descent method. In other words, by fixing W, H we determine the optimal Z as vector blocks z_1, ..., z_n, and vice versa. Due to the ℓ1,2-norm, this optimization corresponds to a two-block non-smooth BCD framework:

  Z^{(k+1)} \leftarrow \argmin_{Z} \frac{1}{2} \|A - Z - W^{(k)} H^{(k)}\|_F^2 + \alpha \|Z\|_{1,2}

  (W^{(k+1)}, H^{(k+1)}) \leftarrow \argmin_{W \ge 0, H \ge 0} \frac{1}{2} \|A - WH - Z^{(k+1)}\|_F^2 + \beta \|H\|_1    (10)

Regarding Z = [z_1, ..., z_n], the minimization problem in (10) has a separable structure:

  Z^{(k+1)} = \argmin_{Z} \sum_{i} \frac{1}{2} \|\bar{a}_i - z_i\|_2^2 + \alpha \|z_i\|_2,

where \bar{a}_i = a_i - (W^{(k)} H^{(k)})_i. Therefore, we only need to define a solution with respect to one variable z_i. Thus, we partition the matrix Z into vector blocks z_i and construct Z as a set of vectors z_i. Also, the block z_i is independent of z_j for all i ≠ j. That is, the closed form solution of z_i depends only on \bar{a}_i. When all other blocks are fixed, the one-variable optimization problem for z_i is solved by the generalized shrinkage operator.

Theorem 2: The problem

  z_i^* = \argmin_{z_i} f(z_i) = \frac{\gamma}{2} \|z_i - a_i\|_2^2 + \alpha \|z_i\|_2

is solved by

  z_i^* = shrink(a_i, \alpha/\gamma),

where the generalized shrinkage operator is defined as

  shrink(a_i, C) = \max(\|a_i\|_2 - C, 0) \frac{a_i}{\|a_i\|_2}.

Proof. The subgradient of f is

  \frac{\partial f(z_i)}{\partial z_i} = \gamma (z_i - a_i) + \alpha \frac{z_i}{\|z_i\|_2}.

When \|a_i\|_2 \le \alpha/\gamma, we have

  f(z_i) \ge \frac{\gamma}{2} (\|z_i\|_2^2 + \|a_i\|_2^2) + (\alpha - \gamma \|a_i\|_2) \|z_i\|_2,

and therefore \argmin_{z_i} f(z_i) = 0. When \|a_i\|_2 \ge \alpha/\gamma, let z_i = c a_i; then

  \frac{\partial f(z_i)}{\partial z_i} = \left[\gamma (c - 1) + \frac{\alpha}{\|a_i\|_2}\right] a_i = 0,

where c = 1 - \frac{\alpha}{\gamma} \frac{1}{\|a_i\|_2}. Therefore, we get

  z_i = \left(\|a_i\|_2 - \frac{\alpha}{\gamma}\right) \frac{a_i}{\|a_i\|_2}.

Combining the two cases and utilizing the generalized shrinkage operator as defined in [13][32],

  z_i^* = shrink(a_i, C) = \max(\|a_i\|_2 - C, 0) \frac{a_i}{\|a_i\|_2},

where C = \alpha/\gamma.

Now, we need to solve the following NMF model with sparsity constraints on H:

  (W^{(k+1)}, H^{(k+1)}) = \argmin_{W \ge 0, H \ge 0} \|\bar{A} - WH\|_F^2 + \beta \|H\|_1,

where \bar{A} = A - Z^{(k+1)}. Let

  F(w_1, \ldots, w_r; h_1, \ldots, h_r) = \|\bar{A} - \sum_{i=1}^{r} w_i h_i\|_F^2 + g(h_1, \ldots, h_r),    (11)

where W = [w_1, w_2, ..., w_r] and H = [h_1, h_2, ..., h_r]^T. For any j ∈ {1, ..., r}, (11) can be rewritten as

  \|\bar{A} - \sum_{i=1}^{r} w_i h_i^T\|_F^2 = \|\bar{A} - \sum_{i=1, i \ne j}^{r} w_i h_i^T - w_j h_j^T\|_F^2.    (12)

The following is the framework of the block coordinate descent method with a separable regularizer such as the Frobenius norm. We iteratively minimize F(W, H) with respect to each column of W and each row of H:

  for j = 1, ..., r:
    h_j^{(k+1)} = \argmin_{h_j \ge 0} \frac{1}{2} \|w_j^{(k)} h_j^T - (\bar{A} - \tilde{W}_j^{(k)})\|_F^2 + g(h_1^{(k+1)}, \ldots, h_j, \ldots, h_r^{(k)})

  for j = 1, ..., r:
    w_j^{(k+1)} = \argmin_{w_j \ge 0} \|w_j (h_j^{(k+1)})^T - (\bar{A} - \tilde{H}_j^{(k+1)})\|_F^2    (13)

where

  \tilde{W}_j^{(k)} = \sum_{i=1}^{j-1} w_i^{(k)} (h_i^{(k+1)})^T + \sum_{i=j+1}^{r} w_i^{(k)} (h_i^{(k)})^T

and

  \tilde{H}_j^{(k+1)} = \sum_{i=1}^{j-1} w_i^{(k+1)} (h_i^{(k+1)})^T + \sum_{i=j+1}^{r} w_i^{(k)} (h_i^{(k+1)})^T.

According to Theorem 2, the solution of z_i is independent of z_j for all i ≠ j, and this enables us to compute the solution in parallel. This is very useful when computing with very large input matrices. Similarly, the vector blocks of W, H can also be updated in parallel.

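A minimal sketch of the generalized shrinkage operator of Theorem 2 and the resulting column-wise Z update (with γ = 1, which matches the Z subproblem in (10)) is given below; the function names are ours:

import numpy as np

def shrink(a, C):
    # Generalized shrinkage operator of Theorem 2:
    # shrink(a, C) = max(||a||_2 - C, 0) * a / ||a||_2.
    norm_a = np.linalg.norm(a)
    if norm_a <= C:
        return np.zeros_like(a)
    return (norm_a - C) * a / norm_a

def update_Z(A, W, H, alpha, gamma=1.0):
    # Column-wise Z update: each column z_i depends only on
    # abar_i = a_i - (WH)_i, so all n columns are independent and can
    # be updated in parallel.
    Abar = A - W @ H
    Z = np.empty_like(Abar)
    for i in range(Abar.shape[1]):
        Z[:, i] = shrink(Abar[:, i], alpha / gamma)
    return Z
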
Now, we have all the building blocks for the Text Outliers using Nonnegative Matrix Factorization algorithm: we use Theorem 2 for the update of Z, and the updates of W, H from (13). Algorithm 1 gives the outline of TONMF, and its complete implementation can be obtained from https://github.com/ramkikannan/outliernmf to try with any real world text dataset.

Algorithm 1: Text Outliers using Nonnegative Matrix Factorization (TONMF)

  input : Matrix A ∈ R^{m x n}, reduced rank r, α, β
  output: Matrices W ∈ R_+^{m x r}, H ∈ R_+^{r x n}, Z ∈ R^{m x n}

  1  Initialize W, H, Z as non-negative random matrices
  2  while stopping criterion C1 is not met do
       // Compute Z for the given A, W, H, α, β based on Theorem 2
  3    for i <- 1 to n do
  4      z_i <- \max(\|\bar{a}_i\|_2 - \alpha/\gamma, 0) \bar{a}_i / \|\bar{a}_i\|_2, where \bar{a}_i = a_i - (WH)_i
  5    while stopping criterion C2 is not met do
  6      for j <- 1 to r do
  7        h_j^{(k+1)} = \argmin_{h_j \ge 0} \frac{1}{2} \|w_j^{(k)} h_j^T - (\bar{A} - \tilde{W}_j^{(k)})\|_F^2 + g(h_1^{(k+1)}, \ldots, h_j, \ldots, h_r^{(k)})
  8      for j <- 1 to r do
  9        w_j^{(k+1)} = \argmin_{w_j \ge 0} \|w_j (h_j^{(k+1)})^T - (\bar{A} - \tilde{H}_j^{(k+1)})\|_F^2

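The condensed NumPy sketch below mirrors the structure of Algorithm 1 under simplifying assumptions of our own: γ = 1, fixed iteration counts in place of the stopping criteria C1 and C2, and the ℓ1 penalty β on H folded into the HALS-style row update as a soft threshold (consistent with the 1/2-scaled fit term of objective (3)). The released Matlab code at the URL above is the authoritative implementation:

import numpy as np

def tonmf(A, r, alpha, beta, outer_iters=20, inner_iters=10, eps=1e-12):
    m, n = A.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((m, r)), rng.random((r, n))
    Z = np.zeros((m, n))
    for _ in range(outer_iters):               # stopping criterion C1
        # Z update (Theorem 2, gamma = 1), vectorized over all columns.
        R0 = A - W @ H
        norms = np.linalg.norm(R0, axis=0)
        Z = R0 * (np.maximum(norms - alpha, 0.0) / (norms + eps))
        Abar = A - Z
        for _ in range(inner_iters):           # stopping criterion C2
            for j in range(r):
                # Residual with the j-th rank-one term excluded.
                R = Abar - W @ H + np.outer(W[:, j], H[j, :])
                # l1-penalized row update of H, then column update of W.
                H[j, :] = np.maximum((R.T @ W[:, j] - beta) / (W[:, j] @ W[:, j] + eps), 0.0)
                W[:, j] = np.maximum(R @ H[j, :] / (H[j, :] @ H[j, :] + eps), 0.0)
    scores = np.linalg.norm(Z, axis=0)         # per-document outlier scores
    return W, H, Z, scores
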
4 Experimental Results

In this section, we present the experiments on text outlier analysis using matrix factorization. We used both real and synthetic data sets to test our algorithm. The real data sets correspond to the well known RCV20, Reuters and Wiki People data, whereas the synthetic dataset was created using a well known market basket generator described later. It should be pointed out that these data sets were not originally designed for outlier analysis, and they have no ground truth information available. Therefore, some additional preprocessing needed to be applied to the real data sets, in order to isolate ground truth classes and use them effectively for the outlier analysis problem. In this section, we will describe the data sets, their preparation, the performance criteria and the results obtained by our algorithm. At the end of this section, we will also present a discussion that provides interesting insights into the effectiveness of algorithm TONMF.

4.1 Data Sets

The experiments were conducted with both labelled real and synthetic data sets. These are described below:

RCV20 Data Set: The RCV20 data set (http://qwone.com/~jason/20Newsgroups/) is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. We took all data points from two randomly chosen classes, which in this case corresponded to the IBM and Mac Hardware classes. In addition, 50 data points were chosen from one randomly chosen class, which corresponds to the Windows Operating System (OS) class. As it turns out, this is a rather hard problem for our algorithm because of some level of relationship between one of the rare classes and the base data. Specifically, Windows Operating System and IBM Hardware are both computer-related subjects, and the former is often used with the latter. Therefore, some vocabulary is shared between the regular class and the rare class, and this makes the detection of outliers harder. We randomly permuted the positions of the outliers and regular data points.

Reuters-21578 Data Set: The documents in the Reuters-21578 collection (http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection) appeared on the Reuters newswire in 1987. It contains 21578 documents in 135 categories, and every document belongs to one or more categories. We selected those documents that belong to only one category. We chose in total 5768 documents that belong to the categories earn and acq. The outliers were 100 documents from the category interest. The vocabulary size of all the documents from these categories put together was 18933. We randomly permuted the positions of the outliers and regular data points.

Wiki People Dataset: This is a subset of the dataset collected by Blasiak et al. [4]. The dataset is constructed by crawling Wikipedia starting from http://en.wikipedia.org/wiki/Category:Lists_of_politicians to a depth of four. Pages describing people were extracted from the list of all crawled pages. Text from the body paragraphs of the pages was extracted, and section headings were used as labels for blocks of text. Text blocks were assumed to begin with <p> and end with </p>. Only text in section headings that occurred 10 times or more was retained. Words were stemmed, stop words were removed, and words of length at least 3 and at most 15 were considered. The words needed to occur at least 4 times in at least 2 documents to be considered important enough to be retained. From the collected data, the sections Career and Life were chosen as non-outlier, whereas the small section Death was chosen as outlier. The constructed dataset has a vocabulary size of 18834 and a total of 9593 documents. A total of 100 documents that belong to the section Death were labeled as outliers.

Market Basket Data Generator: We also wanted to understand the performance of our algorithm on large sparse matrices that are similar to the bag-of-words matrix. Towards this end, we used the standard IBM Synthetic Data Generation Code for Associations and Sequential Patterns, a market-basket data generator that is packaged as part of the Illimine software (http://illimine.cs.uiuc.edu/). We set the average length of a transaction to 300 and the number of different items to 50,000. Note that this generator uses a random seed, and by changing the seed it is possible to completely change the transaction distribution, even if all other parameters remain the same. We generated 10,000 data points as a group of four different sets of 2500 data points with randomly chosen seed values. In addition, the rare class contained 250 data points from a single seed value. Finally, we randomly permuted the positions of the outliers and regular data points in the matrix representation, to avoid any unforeseen bias in the algorithm.

4.2 Performance Metrics

The effectiveness was measured in terms of the ROC curve drawn on the outlier scores. We use the area under the Receiver Operating Characteristic (ROC) curve, the de facto metric for evaluation in outlier analysis. The idea of this curve is to evaluate a ranking of outlier scores, by examining the tradeoff between the true positives and the false positives as the threshold on the outlier score is varied over a range. By using different thresholds, it is possible to obtain a relatively larger or smaller number of true positives with respect to the false positives.

Let S(t) be the set of outliers determined by using a threshold t on the outlier scores. The True Positive Rate is graphed against the False Positive Rate. The true positive rate TPR(t) is defined in the same way as the metric of recall is defined in the IR literature. The false positive rate FPR(t) is the percentage of falsely reported positives out of the ground-truth negatives. Therefore, for a data set D with ground truth positives G, these definitions are as follows:

  TPR(t) = Recall(t) = 100 \cdot \frac{|S(t) \cap G|}{|G|}

  FPR(t) = 100 \cdot \frac{|S(t) - G|}{|D - G|}

Note that the end points of the ROC curve are always at (0,0) and (100,100), and a random method is expected to exhibit performance along the diagonal line connecting these points. The lift obtained above this diagonal line provides an idea of the accuracy of the approach. The area under the ROC curve provides a measure of this accuracy; a random algorithm would have an area of 0.5 under the ROC curve. The ROC curve was used to provide detailed insights into the tradeoffs associated with the method, whereas the area under the ROC curve was used in order to provide a summary of the performance of the method.

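For completeness, the area under the ROC curve for a ranking of outlier scores can be computed without explicitly sweeping thresholds, via the rank-based (Mann-Whitney) identity. The sketch below is our illustration of the metric, not the paper's evaluation script:

import numpy as np

def roc_auc(scores, labels):
    # AUC = probability that a randomly chosen outlier (label 1) receives
    # a higher score than a randomly chosen inlier (label 0); ties count 0.5.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):            # average the ranks of tied scores
        tied = scores == s
        ranks[tied] = ranks[tied].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
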
4.3 Baseline Algorithms

The baselines used in our comparison were as follows:

Distance-based Algorithm: The first algorithm used was the k-nearest neighbour algorithm, which is a classical distance-based algorithm frequently used for outlier detection [21, 29]. The outliers were ranked based on distances in order to create an ROC curve, rather than using a specific threshold as in [21]. In addition, we gave the k-nearest neighbour algorithm an advantage by picking the value of k optimally based on the area under the ROC curve, sweeping k from 1 to 50. Note that such an advantage would not be available to the baseline in real scenarios, since the ground-truth outliers in the data are unknown, and therefore the ROC curve cannot be optimized.

Simplified Low Rank Approximation: We used a low rank approximation based on the Singular Value Decomposition (SVD). For a given matrix A, a best r-rank approximation \hat{A}_r is given by \hat{A}_r = U S_r V^T, where S_r = diag(\sigma_1, \ldots, \sigma_r, 0, \ldots, 0). That is, the trailing rank(A) - r singular values in the descending order are set to 0. It is natural to expect that outlier documents require a linear combination of many basis vectors. Thus, the ℓ2 norm of the columns of \sqrt{S_r} V^T can be used as a score to determine the outliers. In the graphs, we use SVD as the legend to represent this baseline. For the SVD approach, we used the same low rank as our algorithm.

Robust Principal Component Analysis (RPCA): Recently, Candès et al. [6] proposed a new technique called Robust PCA that is insensitive to noise and outliers. It is important to note that both PCA and NMF are different forms of low rank approximation. Hence, we wanted to leverage the output of RPCA to recover the outliers. RPCA yields two matrices: (1) a low rank matrix L, and (2) a sparse matrix S, such that A ≈ L + S, where A is the given input matrix. The main disadvantage of RPCA is its larger memory requirement: retaining L and S for large matrices requires significant memory. We used the ℓ2 norm of the columns of S as the outlier score for every document. In the graphs, we use RPCA as the legend to represent this baseline.

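As an illustration of the SVD baseline described above, the following sketch (ours) scores each document by the ℓ2 norm of its column in \sqrt{S_r} V^T:

import numpy as np

def svd_outlier_scores(A, r):
    # Best rank-r SVD approximation: A_r = U S_r V^T. The baseline scores
    # each document by the l2 norm of its column in sqrt(S_r) V^T.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    proj = np.diag(np.sqrt(s[:r])) @ Vt[:r, :]   # sqrt(S_r) V^T, shape r x n
    return np.linalg.norm(proj, axis=0)
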
4.4 Effectiveness Results

We first present the ROC curves for the different data sets. The ROC curve for the Reuters data set is illustrated in Figure 3. In this case, our algorithm shows a drastic improvement over all the baseline algorithms, which is evident from the rather large lift in the chart. Our algorithm TONMF had an area of 0.9340 under the ROC. The k-NN approach performed quite poorly, with an area under the ROC curve of 0.5370, which is only slightly better than random performance. The area under the ROC for the SVD method was 0.5816 and for RPCA 0.6120, which is better than the k-NN method, but still significantly less than the proposed algorithm.

The comparison of our algorithm with the baselines for the RCV20 data set is shown in Figure 5. As discussed in the data generation section, this is a particularly challenging data set, because of the similarity in the vocabulary distribution between the rare class and the regular class. It is evident that our algorithm TONMF performed better than the SVD, RPCA and k-NN methods. However, the lift in the ROC curve for all the methods is not particularly significant, because of the inherently challenging nature of the data set. The k-NN method performed particularly poorly in this case. In a later section, we will provide some insights suggesting that part of this "poor" performance is caused by noise in the data set itself, where some of the points in the regular class should really be considered outliers. We also generated a dataset from RCV20 in which we simply changed the outlier class to christian religion. We obtained a best ROC area of 0.9732; this result is not shown in Figure 5.

Figure 9 shows the comparison of our algorithm TONMF against the baselines for the Wiki People dataset. The area under the ROC for k-NN was 0.5395, which is rather poor. All the other methods performed better than k-NN, with an area under the ROC for SVD of 0.5670 and for RPCA of 0.5471. Our algorithm TONMF performed significantly better than all the other methods, with an AUC of 0.8552. Clearly, this is a significant qualitative difference between the methods. The above three were experiments on real life datasets; for the synthetic case, we chose the market basket data.

The ROC comparison for the synthetic market basket data is illustrated in Figure 7. In this case, the improvement of the algorithm TONMF over the baseline methods was quite significant. Specifically, the algorithm TONMF had an area under the ROC curve of 0.7598, which is a significant lift. This significantly outperformed the SVD and RPCA methods, which had areas under the ROC curve of 0.5731 and 0.5758, respectively. As in the case of the other data sets, the k-NN algorithm performed very poorly, with an area under the ROC curve of 0.5431. The consistently poor performance of the k-NN approach across all data sets is quite striking, and suggests that straightforward generalizations of outlier analysis techniques from other data domains are often not well suited to the text domain.

Based on our experiments on real world and synthetic datasets, we observed that TONMF outperformed every other baseline. Furthermore, the ranking of the methods from best to worst is TONMF, RPCA, SVD and NN. Clearly, conventional distance-based methods do not seem to work very well for text data.

[Figures 3, 5, 7, 9: ROC curves (true positive rate vs. false positive rate) comparing NN, SVD, RPCA and TONMF on the Reuters, RCV20, Market Basket and Wiki People data sets, respectively. Figures 4, 6, 8, 10: parameter sensitivity curves (area under the ROC vs. α, for K = 10, 20, 30, 40, 50) on the same data sets.]

4.5 Parameter Sensitivity

From (3) in Section 2, we can see that the parameters for our algorithm are α, β and the low rank r. We tested the algorithm for different variations of the parameters, and found that our algorithm was insensitive to changes in β. In other words, for a given low rank r and α, changes in the value of β did not result in a significant change in the area under the ROC. Hence, in this paper, we provide the charts of the variation of the ROC area with the parameters α and r on the data sets.

The sensitivity results for the Reuters data set are illustrated in Figure 4. The value of α is illustrated on the X-axis, and different values of the low rank r are graphed as different curves in the plot. It is evident in this case that the area under the ROC increased with increasing low rank r and α. However, the improvement started diminishing, and changed only marginally at higher ranks r.

The results for the RCV20 and Wiki People datasets are illustrated in Figure 6 and Figure 10, respectively. As in the previous case, the value of α is illustrated on the X-axis, and different values of the low rank r are represented by different curves. In this case, the area under the ROC curve was relatively insensitive to the parameters. This implies that the algorithm can be used over a wide range of parameters without affecting the performance too much. Finally, the results for the market basket data set are illustrated in Figure 8. In this case, the area under the ROC curve decreases with increasing low rank r and α. This is because the market-basket data has inherently very low (implicit) dimensionality, and therefore it is best to use a relatively low rank in order to mine the outliers.

From the parameter sensitivity graphs for the real world datasets, we observe that, for a given α, the approach is relatively insensitive to the rank of the approximation. It needs to be kept in mind that it is generally faster to determine approximations of lower rank. This implies that, for very large matrices, the algorithm can be made computationally faster by choosing approximations of lower rank without compromising on the performance. According to the model explained in equation (3), the parameters α and β balance the importance given to the outliers against the matrix sparsity criterion during regularization. By picking α >> β, the importance of the outlier portion of the regularization increases. From the parameter sensitivity graphs, it is evident that for most low ranks K, an increase in the value of α does not improve the performance of the outlier detection. This is because, beyond a particular limit, the weight given to the outlier criterion does not supersede the optimization problem's main objective of extracting the low-rank patterns from the underlying data.

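The sensitivity experiments can be sketched as a simple grid sweep. The snippet below reuses the tonmf and roc_auc sketches given earlier; the specific grids and the fixed β value are illustrative assumptions, not the exact values used in the paper:

def sensitivity_sweep(A, labels, ranks=(10, 20, 30, 40, 50),
                      alphas=(1, 5, 10, 15, 20, 25, 30), beta=0.1):
    # One AUC value per (r, alpha) pair, with beta held fixed, since the
    # method was observed to be insensitive to beta.
    results = {}
    for r in ranks:
        for alpha in alphas:
            _, _, _, scores = tonmf(A, r, alpha, beta)
            results[(r, alpha)] = roc_auc(scores, labels)
    return results
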
4.6 Further Insights

In order to illustrate the inner workings of the matrix factorization approach, we provide some further insights into the statistics buried deep in the algorithm. We also present some interesting observations about the case in which outliers share the same vocabulary distribution as regular data points, as happens for the RCV20 dataset. One observation is that the method of data generation implicitly assumes that all the documents within a "regular" class in a real data set are not outliers. This is of course not true in practice, since some of the documents within these classes will also be outliers, for reasons other than topical affinity. Our algorithm TONMF was also able to detect such distinct documents, much better than the other baseline algorithms. We isolated those false positives of our algorithm TONMF that were not detected by the baselines in the case of the RCV20 data set. It was observed that while these outliers officially belonged to one of the regular classes, they did show various kinds of distinctive characteristics. For example, while the average number of words in regular documents was 195, the "false positive" outliers chosen by our algorithm were typically either very lengthy, with over 400 words, or unusually short, with less than 150 words. This behaviour was also generally reflected in the number of distinct words per document. Another observation is that these outlier documents typically had significant vocabulary repetition over a small number of distinct words. Thus, the algorithm was also able to identify those natural outliers which ought to have been considered outliers for reasons of statistical word distribution, as opposed to their topical behaviour.

5 Conclusion

This paper presents a matrix factorization based approach to text outlier analysis. The approach is designed to adjust well to the widely varying structures in different localities of the data, and therefore provides more robust methods than competing models. The approach has the potential to be applied to other domains with similar structure, and as a specific example, we provide experiments on market basket data. We also presented extensive experimental results, which illustrate the superiority of the approach. Our code can be downloaded from https://github.com/ramkikannan/outliernmf and tried with any text dataset.

In this paper, we used a parallel implementation based on Matlab's parallel computing toolbox to run in multicore environments. In the future, we would like to explore a scalable implementation of our algorithm. The solution is embarrassingly parallelizable, and we would like to experiment with web scale data. One potential extension is incorporating temporal and spatial aspects into the model. Such an extension would make the solution applicable to emerging applications such as topic detection and streaming data. We experimented with the solution primarily on text data and market basket data. In future work, we will extend this broader approach to other domains such as video data.

6 Acknowledgements

This manuscript has been co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. This project was partially funded by the Laboratory Director's Research and Development fund, and was also sponsored by the Army Research Laboratory (ARL) and accomplished under Cooperative Agreement Number W911NF-09-2-0053. H. Woo is supported by NRF-2015R101A1A01061261. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

The United States Government retains, and the publisher, by accepting the article for publication, acknowledges that the United States Government retains, a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the USDOE, NSF or ARL.

7 References

[1] C. Aggarwal. Outlier Analysis. Springer, 2013.
[2] C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In SIGMOD Conference, pages 37–46, 2001.
[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.
[4] S. J. Blasiak, H. Rangwala, and S. Sudarsan. Joint segmentation and clustering in text corpuses. In SIAM Data Mining (SDM), pages 485–493, 2013.
[5] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In SIGMOD Conference, pages 93–104, 2000.
[6] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis. Journal of the ACM (JACM), 58(3):11, 2011.
[7] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3), 2009.
