SIGKDD Explorations, Volume 15, Number 2, December 2013

Clustering high dimensional data: Examining differences and commonalities between subspace clustering and text clustering – A position paper

Hans-Peter Kriegel, Eirini Ntoutsi
Ludwig-Maximilians-University of Munich, Germany

ABSTRACT

The goal of this position paper is to contribute to a clear understanding of the commonalities and differences between subspace clustering and text clustering. Text data is often put forward as an ideal fit for subspace clustering due to its high dimensional nature and the sparsity of the data. Indeed, the areas of subspace clustering and text clustering share similar challenges and the same goal, the simultaneous extraction of both clusters and the dimensions where these clusters are defined. However, there are fundamental differences between the two areas w.r.t. object-feature representation, dimension weighting and the incorporation of these weights in the dissimilarity computation. We make an attempt to bridge these two domains in order to facilitate the exchange of ideas and best practices between them.

Keywords

high dimensional data, subspace clustering, text clustering

1. INTRODUCTION

Subspace clustering is a new research area receiving a lot of attention from the community [23; 18; 19; 8]. In contrast to traditional clustering, which partitions objects in the full dimensional feature space, subspace clustering methods simultaneously partition both objects and dimensions. A subspace cluster is defined in terms of both its members and the dimensions where these members are grouped together.

As with traditional full dimensional clustering, subspace clustering is not confined to specific data types or application domains. Related work often mentions text data as a candidate application for subspace clustering [23; 8] due to the high dimensionality and sparsity of text data [26]. Typically, documents are represented as vectors in a very high dimensional space where the entries of this vector represent words appearing in the document collection.

Text data have been studied extensively in the fields of Information Retrieval and Text Mining. The text clustering domain [4] is among the most important ones, with many applications such as the organization, summarization and indexing of document content. Nowadays, it is a very active research field due to the abundance of textual data.

Both the subspace clustering and text clustering domains face the challenge of high dimensionality and aim at extracting both clusters and the dimensions upon which these clusters are defined. To this end, both domains evaluate the importance of the dimensions for the clustering task and appropriately incorporate this importance in the distance function and, consequently, in the clustering algorithm. Despite the common goal and the similar approach, however, there are fundamental differences between the two domains. To the best of our knowledge, no work that elaborates on how the two areas are related has been proposed so far. The goal of this work is to serve as a point for a fruitful exchange of ideas and techniques between the two domains.

The rest of the paper is organized as follows: A short overview of the subspace clustering and text clustering domains is presented in Sections 2 and 3, respectively. The differences and commonalities between the two domains are discussed in Section 4 and refer to the following aspects: object-feature representation (Section 4.1), dimension weighting (Section 4.2) and incorporating dimension weights in the similarity/distance function (Section 4.3). In Section 5, we overview a few existing approaches that make use of concepts from both domains. Section 6 concludes our work.
2. SUBSPACE CLUSTERING IN A NUTSHELL

The area of subspace clustering has lately emerged as a solution to the problem of high dimensional data, as it is difficult to find meaningful clusters in hundreds or even thousands of dimensions. Different features might be relevant for different clusters and therefore, the goal of subspace clustering is to find both the cluster members and the dimensions upon which these members form a cluster.

This is in contrast to traditional clustering, which searches for clusters in the full dimensional feature space [15; 13]. It also differs from global dimensionality reduction techniques like PCA [16] that reduce the dimensionality of the feature space and search for clusters in the reduced (though still full) dimensional space. Several subspace clustering methods have been proposed in the literature so far (for a comprehensive overview of the area see, e.g., [23], [18], [21]).

A crucial point in subspace clustering is the decision about the relevant dimensions that should be considered for clustering. Given a d-dimensional dataset, the number of possible subspaces is 2^d − 1, therefore it is infeasible to examine all possible subspaces for clusters. The solution is to efficiently navigate through the search space of all possible subspaces; to this end, two different approaches have been proposed in the literature [18]:

• Bottom-up approaches: They start with 1-D subspaces (i.e., single dimensions) and iteratively merge lower dimensional subspaces in order to compute higher dimensional subspaces. The downward-closure property of Apriori [6] is usually employed to prune non-appropriate subspaces. The key in this category is the merging procedure, i.e., how low dimensional subspaces are merged to yield higher dimensional subspaces (a sketch of this level-wise candidate generation is given at the end of this section). Once the appropriate subspaces are detected, the clustering algorithm is applied over each subspace similarly to full dimensional clustering. To this category belong methods like CLIQUE [5], MAFIA [22] and SUBCLU [17]. These methods result in overlapping clusters; an object might be assigned to more than one cluster defined in different subspaces of the original feature space.

• Top-down approaches: They start searching in the full dimensional space and iteratively learn the "correct" subspace for each point or cluster. The key in this category is how to learn the "correct" subspace for a point or cluster. Representatives of this category are methods like PROCLUS [3], DOC [25], COSA [11] and PreDeCon [9].

Top-down subspace clustering methods are closer to the text clustering methods since they simultaneously search for the best non-overlapping partitioning of the data and the best subspace for each partition. Therefore, we refer to the top-down subspace clustering methods from now on.
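To illustrate the downward-closure pruning used by the bottom-up category, the following minimal Python sketch enumerates candidate subspaces level by level. It is not taken from any of the cited algorithms; the predicate `has_dense_cluster` is a hypothetical placeholder for the (anti-monotone) criterion an actual method such as CLIQUE or SUBCLU would apply.

```python
from itertools import combinations

def bottom_up_subspaces(dims, has_dense_cluster):
    """Level-wise (Apriori-style) enumeration of interesting subspaces.

    `has_dense_cluster(subspace)` must be anti-monotone: if a subspace
    hosts a cluster, so does every subset of its dimensions.
    """
    level = [frozenset([a]) for a in dims if has_dense_cluster(frozenset([a]))]
    found = list(level)
    while level:
        prev = set(level)
        # join: merge two k-dim subspaces that differ in exactly one dimension
        candidates = {s | t for s, t in combinations(level, 2) if len(s | t) == len(s) + 1}
        # prune: every k-dim sub-subspace of a candidate must itself qualify
        candidates = [c for c in candidates if all(c - {a} in prev for a in c)]
        level = [c for c in candidates if has_dense_cluster(c)]
        found.extend(level)
    return found

# toy run: subspaces {0}, {1}, {2} and {0,1} are assumed to host clusters
qualifying = {frozenset([0]), frozenset([1]), frozenset([2]), frozenset([0, 1])}
print(bottom_up_subspaces(range(3), lambda s: s in qualifying))
```

The join and prune steps mirror Apriori's candidate generation: a (k+1)-dimensional subspace is only examined if all of its k-dimensional projections already qualified.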
3. TEXT CLUSTERING IN A NUTSHELL

The goal of text clustering is to organize documents into clusters of similar content, the so-called topics. Documents are typically represented in terms of their component words, through the vector space model [27]. Usually, no information about the order of the words is considered in this model, and thus it is also known as the bag of words representation model. Although more elaborate representations have been proposed, such as multi-word terms and N-grams [28], the vector space model remains the most popular one. This representation, though, results in a high dimensional feature space, where features correspond to words appearing in the document collection. Moreover, only a small subset of all words appearing in the collection appears in each document. As a result, the document representation is very sparse.

Typically, before applying any mining technique to the text collection, preprocessing takes place in order to remove non-informative words. Preprocessing consists of several steps [7] (a small sketch combining them is given at the end of this section):

• Filtering: Special characters and punctuation are removed since they are considered not to hold any discriminative power.

• Stemming: The words are reduced to their base form or stem or root with the goal of eliminating multiple occurrences of a word (e.g., the words "fish", "fishing", "fisher", "fished" are all derived from the same root "fish"). Stemming is language dependent. Porter's algorithm [24] is the best known stemmer for the English language.

• Stopword removal: A stopword is a word which is not thought to convey any meaning as a dimension in the vector space (i.e., without context), e.g., the words "the", "and". Usually the document words are compared against a known list of stopwords and the detected stopwords are removed from the document.

• Rare words removal: Rare words, i.e., words that appear in very few documents, are usually removed since they are considered not to capture much information about some category of documents [30]. A threshold on the number of documents determines whether a word is rare and should be removed from the feature space as an outlier.

Although preprocessing results in some sort of dimensionality reduction, the number of dimensions remains very high (hundreds or thousands of keywords¹). To deal with this problem, several dimensionality reduction techniques have been proposed, which can be distinguished into feature transformation and feature selection techniques. Feature transformation techniques try to reduce the dimensionality to a few new dimensions which are combinations of the original dimensions. To this category belong methods like Principal Component Analysis (PCA) [16] and Latent Semantic Indexing (LSI) [20]. Because they are global methods, though, they might be problematic for topics defined upon different keywords. Feature selection methods aim at removing dimensions which seem irrelevant for modeling, e.g., words appearing very often in a collection or very rarely, a.k.a. outliers. Nevertheless, the resulting feature space is still high dimensional and therefore dimension weighting takes place, as we explain in the following sections.

¹The terms "word", "keyword" and "term" are used interchangeably.

Typical examples of distance-based text clustering algorithms are hierarchical clustering and k-Means [4], while cosine similarity is the commonly used similarity function.
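The following minimal Python sketch chains the four preprocessing steps listed above. The tiny stopword list and the `naive_stem` suffix stripper are illustrative stand-ins (not a real stopword list and not Porter's algorithm), and the rare-word threshold `min_doc_freq` is an assumed parameter.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "a", "of", "in", "to", "is"}   # tiny illustrative list

def naive_stem(word):
    # crude stand-in for Porter's algorithm: strip a few common suffixes
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(documents, min_doc_freq=2):
    """Filter punctuation, stem, drop stopwords, then drop rare words that
    appear in fewer than `min_doc_freq` documents."""
    tokenized = []
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())                      # filtering
        tokens = [naive_stem(t) for t in tokens if t not in STOPWORDS]   # stemming + stopwords
        tokenized.append(tokens)
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))   # rare-word removal
    return [[w for w in tokens if doc_freq[w] >= min_doc_freq] for tokens in tokenized]

docs = ["The fisher was fishing in the river.",
        "Fishing boats and fish markets.",
        "A paper about clustering text."]
print(preprocess(docs))
```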
4. DIFFERENCES BETWEEN SUBSPACE CLUSTERING AND TEXT CLUSTERING

In the previous sections, we presented a short introduction to the areas of subspace clustering and text clustering. Hereafter, we show how the two areas are related by pointing out their commonalities and differences.

As already mentioned, both areas deal with high dimensional data: Subspace clustering targets high dimensional data, though without focusing on specific application domains or data types. The data instances are described in terms of all dimensions, in the full (high) dimensional feature space. The vast majority of the algorithms does not deal with missing values, though recently fault tolerant subspace clustering [12] has been proposed to accommodate data with missing values. From a text clustering perspective, the keywords of a document comprise the dimensions of the feature space and documents are points or vectors in this space. There are hundreds or thousands of keywords (thus, a high dimensional space) and the vectors describing each document are very sparse (most of the entries are null). A null entry in this case might mean that the corresponding keyword is irrelevant (e.g., consider the word "Ukraine" in a document about the importance of vegetables in our daily diet) or that the corresponding word is missing because it is implied by the context (e.g., in a document about "Greece" the word "country" might not be reported since it is known that Greece is a country).

Note also that both areas share the same goal, namely the simultaneous partitioning of the data points and the dimensions (points/instances and dimensions/features, respectively, in the subspace clustering terminology; documents and keywords, respectively, in the text clustering terminology). In both cases, the notion of a cluster includes, besides the cluster members, also the dimensions where these members are similar enough to form the cluster. In subspace clustering, these clusters are known as subspace clusters, whereas in text clustering the clusters are topics.

The problem of simultaneously partitioning both the data and the dimensions is tackled in a similar way by both the subspace clustering and text clustering areas. In particular, the solution involves an appropriate weighting of the dimensions according to their relevance for the clustering task and the incorporation of these weights in the dissimilarity function² and, consequently, in the clustering algorithm.

²We use the term "dissimilarity" to refer to either a distance or a similarity function.

We present the commonalities and differences between the two domains with respect to the following aspects: (i) the object-feature representation (Section 4.1), (ii) the weighting of the dimensions (Section 4.2) and (iii) the incorporation of dimension weights in the dissimilarity functions and, consequently, in the clustering algorithms (Section 4.3).

4.1 Object–feature representation differences

Let D = {p_1, p_2, ..., p_n} be a dataset of n objects and let A = (A_1, A_2, ..., A_d) be the d-dimensional feature space. Let V be a |D| × |A| matrix, referred to hereafter as the object–feature value matrix, where the rows are the objects and the columns are the dimensions. Each entry v_{i,j} in this matrix corresponds to the value of the object p_i ∈ D in the dimension A_j ∈ A. We explain hereafter what these values are for each domain.

4.1.1 Object–feature representation in subspace clustering

In subspace clustering, each object is considered to have values for all dimensions³. So, the construction of the object–feature value matrix V is rather straightforward. The rows correspond to the objects in the database and the columns correspond to the dimensions. Each entry v_{i,j} in the matrix contains the value of the object p_i ∈ D in dimension A_j ∈ A. The original values of the dimensions may refer to different scales of reference. To prevent attributes with initially large ranges (e.g., salary) from outweighing attributes with initially smaller ranges (e.g., age), a normalization process takes place prior to clustering. In this way, different dimensions can be compared meaningfully.

³Recently, subspace clustering over data with missing values has been considered [12].

4.1.2 Object–feature representation in text clustering

In text clustering, the database D corresponds to a collection of documents and the dimensions A correspond to the distinct keywords in this collection. We assume that the preprocessing step (cf. Section 3) has already been applied. The matrix V is the result of the vector space model representation, called the document–term matrix in this setting. Each entry v_{i,j} in the matrix corresponds to the value of keyword A_j ∈ A in document p_i ∈ D. Usually v_{i,j} equals the number of times that keyword A_j appears in document p_i. To prevent biasing towards longer documents, the number of occurrences of a keyword in a document is normalized with respect to the total number of keyword occurrences in the document, resulting in the so-called term frequency:

Definition 1. Term Frequency (TF)
The term frequency of a term/keyword A_j ∈ A in a document p_i ∈ D is defined as follows:

TF_{j,i} = n_{j,i} / Σ_k n_{k,i}

where the numerator represents the number of occurrences of keyword A_j in document p_i and the denominator represents the occurrences of all keywords A_k ∈ A in p_i.

The TF score expresses how important a keyword is within a document. Its values lie in the [0,1] range, with larger values indicating more important keywords. Hereafter, we consider the entries of matrix V to be the TF values of the keywords in the documents of the collection. Note that since not all keywords in the collection appear in each single document, there are a lot of null entries in this matrix, corresponding to non-appearing words in a document.
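As an illustration of Definition 1, the following minimal Python sketch builds the object–feature value matrix V with TF entries from already preprocessed (tokenized) documents; the function name and toy documents are ours, not from the cited literature.

```python
from collections import Counter

def term_frequencies(tokenized_docs):
    """Build the document-term matrix V of Definition 1: each entry is the
    count of a keyword in a document divided by the total number of keyword
    occurrences in that document (0 for non-appearing keywords)."""
    vocabulary = sorted({w for doc in tokenized_docs for w in doc})
    V = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = sum(counts.values()) or 1   # guard against empty documents
        V.append([counts[w] / total for w in vocabulary])
    return vocabulary, V

vocab, V = term_frequencies([["fish", "fish", "river"], ["fish", "market"]])
print(vocab)   # ['fish', 'market', 'river']
print(V)       # [[0.666..., 0.0, 0.333...], [0.5, 0.5, 0.0]]
```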
4.1.3 Discussion on object-feature representation differences

There is a crucial difference in the object-feature representation of the two domains. In subspace clustering, there are no semantics attached to the values of the dimensions and all values are considered to be of the same importance. On the contrary, in text clustering greater values indicate more important dimensions/keywords. For example, consider a keyword A_1 appearing in two documents p_1, p_2 with term frequencies 0.8 and 0.2, respectively. Since 0.8 > 0.2, A_1 is considered to be more important for document p_1 compared to document p_2. Such a value differentiation does not take place in the subspace clustering domain.

Although, as already mentioned, subspace clustering over missing data has recently been proposed [12], missing data are conceptually different from non-appearing words in a document. A missing value in subspace clustering indicates a value that is unknown, rather than a value that is not important for the description of a document or that is redundant w.r.t. already existing words in the document (as is usually the case for non-appearing words in a document).

4.2 Dimension weighting differences

Both the subspace clustering and text clustering domains rely on some notion of weighting of the different dimensions based on their importance for the clustering task.

In subspace clustering, the important dimensions are learned either for an instance/object (instance-based approaches) or for a cluster (cluster-based approaches). Representative weighting schemes for both approaches are presented in Section 4.2.1. In text clustering, the most well known weighting schema is the inverse document frequency (IDF), described in Section 4.2.2.

We can model the dimension weighting in terms of a matrix W, referred to hereafter as the object–feature weight matrix. In the general case, W is a |D| × |A| matrix where the rows correspond to objects and the columns correspond to dimensions. Each entry w_{i,j} in this matrix represents the weight (or importance) of the dimension A_j ∈ A for the object p_i ∈ D. We explain hereafter how these entries are filled for each domain.
4.2.1 Dimension weighting in subspace clustering

The importance of a dimension is evaluated either with respect to some instances (instance-based approaches) or with respect to some cluster (cluster-based approaches). Both approaches rely on the so-called "locality assumption", according to which the subspace preference for an object or a cluster can be learned from its local neighborhood in the full dimensional space. We describe each case below.

(i) Dimension weighting in instance-based approaches

The subspace of preferred dimensions is defined per instance and is learned by evaluating the local neighborhood of the instance/object in the full dimensional feature space. The definitions and details below are from the algorithm PreDeCon [9].

Locality of an instance: For an object p ∈ D, its locality or neighborhood N_ε(p) consists of all objects that fall within distance ε from p in the full dimensional space.

Preferred dimensions of an instance: Roughly speaking, an object prefers a dimension if it builds "compact" neighborhoods, i.e., neighborhoods of small variance, along this dimension. The variance in the neighborhood of p along some dimension A_j is defined as follows:

Definition 2. Variance along a dimension
Let p ∈ D and ε ∈ R. The variance in the neighborhood N_ε(p) of p along a dimension A_j ∈ A is given by:

Var_{A_j}(N_ε(p)) = Σ_{q ∈ N_ε(p)} (dist_{A_j}(p,q))² / |N_ε(p)|

where dist_{A_j}(p,q) is the distance of p and q in dimension A_j.

A small variance, with respect to a given variance threshold δ, indicates a preferable dimension. For each p ∈ D, its subspace preference vector w̄_p is built [9], which distinguishes between preferable and non-preferable dimensions.

Definition 3. Subspace preference vector
Let p ∈ D, δ ∈ R and let κ ∈ R be a constant with κ ≫ 1. The subspace preference vector of p is defined as:

w̄_p = (w_1, w_2, ..., w_d)

where:

w_i = 1 if Var_{A_i}(N_ε(p)) > δ
w_i = κ if Var_{A_i}(N_ε(p)) ≤ δ

The values of the subspace preference vector are the entries of the object–feature weight matrix W: κ-value entries indicate preferable dimensions while 1-value entries indicate non-preferable ones.
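A minimal Python sketch of Definitions 2 and 3 follows, assuming Euclidean distance in the full dimensional space for the ε-neighborhood; the function name and the parameter values in the toy example are ours.

```python
import math

def preference_vector(points, p_idx, eps, delta, kappa=100.0):
    """Subspace preference vector of point p (Definitions 2 and 3):
    weight kappa for dimensions whose variance in the epsilon-neighborhood
    of p is at most delta, weight 1 otherwise."""
    p = points[p_idx]
    d = len(p)
    # epsilon-neighborhood in the full dimensional space (Euclidean distance)
    neigh = [q for q in points if math.dist(p, q) <= eps]
    weights = []
    for j in range(d):
        var_j = sum((p[j] - q[j]) ** 2 for q in neigh) / len(neigh)
        weights.append(kappa if var_j <= delta else 1.0)
    return weights

points = [(1.0, 0.0), (1.1, 5.0), (0.9, -4.0), (9.0, 9.0)]
# point 0 has close neighbors along dimension 0 but is spread along dimension 1
print(preference_vector(points, 0, eps=6.0, delta=0.05))   # [100.0, 1.0]
```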
(ii) Dimension weighting in cluster-based approaches

The subspace of preferred dimensions is defined per cluster and is learned by evaluating the local neighborhood of the cluster in the full dimensional space. The number of clusters k is given as input to the algorithms of this category. The definitions and details below are from the algorithm PROCLUS [3]. Each cluster (called a projected cluster) is represented by its medoid and is assigned a subspace of preferred⁴ dimensions. Let C = {c_1, c_2, ..., c_k} be a set of k medoids (for more details on the initial selection and refinement of this set, please refer to [3]).

⁴We use both terms, "preferred" and "projected", to describe a dimension that is important.

Locality of a cluster: The locality L_i of a cluster c_i ∈ C is defined as the set of objects in D that are within distance δ_i from its medoid c_i. The distance is computed in the full dimensional space, whereas the distance threshold δ_i is the minimum distance of c_i from any other medoid c_j ∈ C, i.e., δ_i = min{dist(c_i, c_j)}, i ≠ j.

Preferred dimensions of a cluster: For each medoid c_i, the average distance between c_i and the objects in its locality L_i is computed along each dimension A_j ∈ A, denoted by X_{i,j}. If this value is as small as possible w.r.t. the statistical expectation, then A_j is a preferred dimension for cluster c_i. The statistical expectation is evaluated in terms of the mean Y_i and the standard deviation σ_i of these values. The value Z_{i,j} = (X_{i,j} − Y_i) / σ_i is computed, which indicates how the average distance between cluster c_i and its locality L_i in dimension A_j relates to the average distance in all dimensions. The lowest Z_{i,j} values are picked, leading to a total of k·l dimensions, where l is the average dimensionality per cluster given as input to the algorithm.

After the weighting, each cluster in C is assigned a subspace of preferred dimensions. Usually a bitvector over all dimensions is used to model the preferences and to distinguish between preferred (value 1) and non-preferred (value 0) dimensions for a cluster [2].

The objects assigned to a cluster "inherit" the subspace preferences of the cluster, therefore each object can be assumed to have a bitvector of dimension preferences. This way, we can fill the entries of the object–feature weight matrix W with values 1 and 0, indicating preferred and non-preferred dimensions for an object, respectively. Note that this way the objects belonging to the same cluster have the same bitvectors of subspace preferences.
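The per-cluster dimension scoring just described can be sketched in a few lines of Python. Note one simplification: PROCLUS picks the k·l lowest Z_{i,j} values across all medoids, whereas this illustrative sketch simply keeps the l lowest-scoring dimensions per medoid; the function name and toy data are ours.

```python
import statistics

def preferred_dimensions(points, medoids, localities, l):
    """For each medoid c_i, score every dimension by
    Z_ij = (X_ij - Y_i) / sigma_i, where X_ij is the average distance
    between c_i and the points of its locality along dimension j, and
    keep the l dimensions with the lowest Z values."""
    d = len(medoids[0])
    preferred = []
    for c, loc in zip(medoids, localities):
        X = [sum(abs(points[q][j] - c[j]) for q in loc) / len(loc) for j in range(d)]
        Y = statistics.mean(X)
        sigma = statistics.stdev(X) if len(X) > 1 else 0.0
        sigma = sigma or 1.0   # guard against zero spread
        Z = [(X[j] - Y) / sigma for j in range(d)]
        preferred.append(sorted(range(d), key=lambda j: Z[j])[:l])
    return preferred

points = [(0.0, 0.0, 0.0), (0.1, 4.0, 0.2), (0.2, 8.0, 0.1), (5.0, 5.0, 5.0)]
medoids = [points[0]]
localities = [[1, 2]]          # indices of the points in the medoid's locality
print(preferred_dimensions(points, medoids, localities, l=2))   # [[0, 2]]
```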
4.2.2 Dimension weighting in text clustering

In text clustering, the importance of a keyword is evaluated in terms of the whole collection of documents, rather than per document or per topic/cluster. Usually, the Inverse Document Frequency (IDF) is employed towards this end.

Definition 4. Inverse Document Frequency (IDF)
The inverse document frequency of a keyword A_j in a collection of documents D is given by:

IDF_{A_j} = |D| / |{p_i ∈ D : A_j ∈ p_i}|

where the denominator represents the number of documents in D which contain the specific keyword/dimension A_j and |D| is the cardinality of the collection.

The basic intuition behind IDF is that common keywords (i.e., keywords appearing very often in the collection) have no discriminative power and thus they are assigned a small IDF score. Larger IDF scores indicate more discriminative keywords.

Using the IDF scores of the keywords we can fill the entries of the object–feature weight matrix W. Note that since IDF is defined per keyword, each column of the matrix corresponding to a single keyword is filled with the same IDF value.

4.2.3 Discussion on dimension weighting differences

Although the weighting of the dimensions is performed in both the subspace clustering and text clustering domains, there are core differences in the weighting schemes.

In subspace clustering, the dimension weighting is local, relying on the neighborhood of each object or cluster. In text clustering, the weighting of the keywords is global and is based upon the whole collection of documents.

Also, the dimension weights in subspace clustering act mainly as a filter in order to distinguish between preferred and non-preferred dimensions. For example, PreDeCon [9] uses the dimension weights κ and 1 to model preferred and non-preferred dimensions, respectively. Similarly, PROCLUS [3] employs two dimension weights, 0 and 1, with 1 indicating a preferred dimension. That is, the actual weights of the dimensions are not considered in subspace clustering, but rather the information about which dimension is preferred or not. This is in contrast to IDF weighting in text.

4.3 Incorporating dimension weighting in similarity/distance computation differences

The weighting of the dimensions is incorporated in the dissimilarity function in both the subspace clustering and text clustering domains. We model this contribution in terms of a generic dimension-weighted dissimilarity function.

Definition 5. Dimension-weighted dissimilarity
Let p, q ∈ D. We denote by V[p,A_j] (V[q,A_j]) the value of object p (q, respectively) with respect to dimension A_j and by W[p,A_j] (W[q,A_j]) the importance of A_j for the object p (q, respectively). The dissimilarity between p and q is a combination of their values and dimension weights:

diss(p,q) = Aggr_{A_j ∈ A} f(V[p,A_j], V[q,A_j], W[p,A_j], W[q,A_j])

The function f() combines the values and weights in each dimension. The total score is an aggregation of the corresponding per-dimension scores, through some aggregation function Aggr().

We already discussed what the object values and dimension weights stand for in each domain and how the object–feature value matrix V and the object–feature weight matrix W are filled. We explain hereafter how the diss() function is instantiated per domain.

4.3.1 Dissimilarity computation in subspace clustering

We distinguish between instance-based and cluster-based approaches, as with the weighting of dimensions (cf. Section 4.2).

(i) Dissimilarity in instance-based approaches

We adapt the distance function of PreDeCon [9] to the generic diss() function notation.

Definition 6. Preference weighted distance
Let p, q ∈ D; their preference weighted distance is given by:

dist_p(p,q) = sqrt( Σ_{A_j ∈ A} (1 / W[p,A_j]) · (V[p,A_j] − V[q,A_j])² )

The above formula considers only the preferences of p. The symmetric version is:

dist(p,q) = max(dist_p(p,q), dist_q(p,q))

where V[p,A_j] (V[q,A_j]) is the value of object p (q, respectively) in dimension A_j and W[p,A_j] (W[q,A_j]) is the weight of A_j for point p (q, respectively).

Note that the dimension weights are either 1 or κ (κ ≫ 1). Therefore, the distance function weights attributes with low variance (dimension weight κ) considerably lower (by a factor 1/κ) than attributes with a high variance (dimension weight 1).
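A minimal Python sketch of Definition 6 and its symmetric version, reusing a {1, κ}-valued preference vector such as the one computed in the earlier PreDeCon sketch; the values in the example are arbitrary.

```python
import math

def pref_weighted_dist(p, q, w_p):
    """Asymmetric preference weighted distance of Definition 6: squared
    differences are down-weighted by 1/kappa on p's preferred dimensions."""
    return math.sqrt(sum((p[j] - q[j]) ** 2 / w_p[j] for j in range(len(p))))

def pref_dist(p, q, w_p, w_q):
    """Symmetric version: the maximum of the two asymmetric distances."""
    return max(pref_weighted_dist(p, q, w_p), pref_weighted_dist(q, p, w_q))

p, q = (1.0, 0.0), (1.1, 5.0)
w_p, w_q = [100.0, 1.0], [1.0, 1.0]    # p prefers dimension 0, q prefers none
print(round(pref_dist(p, q, w_p, w_q), 3))   # 5.001
```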
(ii) Dissimilarity in cluster-based approaches

We adapt the distance function of PROCLUS to the generic diss() function notation.

Definition 7. Projected distance
Let p, q ∈ D with q being the medoid of a cluster. The projected distance of p with respect to the medoid q is given by:

dist(p,q) = Σ_{A_j ∈ A} W[q,A_j] · |V[p,A_j] − V[q,A_j]| / |{A_j : W[q,A_j] = 1}|

W[q,A_j] is the importance of dimension A_j for the cluster represented by the medoid q. Note that in this case weights take values in {0,1}. That is, the non-important dimensions are not considered at all during distance computation and the important ones are taken into account with equal weight.

4.3.2 Dissimilarity computation in text clustering

TF × IDF is the most common schema for combining keyword weights and their values across different documents. According to this schema, the value of a keyword in a document (TF) is multiplied by the importance of the keyword in the whole collection (IDF).

Definition 8. TF×IDF score
Let p ∈ D be a document and let A_j ∈ A be one of its keywords. The TF×IDF score of A_j in p is given by:

(TF × IDF)_{p,A_j} = TF_{p,A_j} × IDF_{A_j}

Typically, the dissimilarity function is applied upon the TF×IDF values of the documents. A commonly used dissimilarity function is the cosine similarity, which computes the cosine of the angle between the two documents. It becomes 1 if the documents are identical and 0 if there is nothing in common between them (i.e., the vectors are orthogonal to each other).

The cosine similarity can be re-written in terms of the generic formula diss() as follows:

Definition 9. Cosine similarity
Let p, q ∈ D be two documents. Their similarity is defined as:

cos(p,q) = Σ_{A_j ∈ A} (V[p,A_j]·W[p,A_j]) · (V[q,A_j]·W[q,A_j]) / ( sqrt(Σ_{A_j ∈ A} (V[p,A_j]·W[p,A_j])²) · sqrt(Σ_{A_j ∈ A} (V[q,A_j]·W[q,A_j])²) )

Note that the weight of a keyword A_j is defined over the whole collection D; thus, documents p and q share the same weight for this keyword, i.e., W[p,A_j] = W[q,A_j] = IDF_{A_j}.
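A minimal Python sketch that puts Definitions 4, 8 and 9 together: IDF computed over the collection exactly as in Definition 4 (i.e., without the logarithmic damping that is common in practice), TF×IDF vectors per document, and the cosine similarity between them. The toy documents and function names are ours.

```python
import math
from collections import Counter

def tf_idf_vectors(tokenized_docs):
    """TF (Definition 1), IDF (Definition 4) and their product (Definition 8)
    for every document in the collection."""
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    n_docs = len(tokenized_docs)
    df = Counter(w for doc in tokenized_docs for w in set(doc))
    idf = {w: n_docs / df[w] for w in vocab}
    vectors = []
    for doc in tokenized_docs:
        counts, total = Counter(doc), max(len(doc), 1)
        vectors.append([counts[w] / total * idf[w] for w in vocab])
    return vocab, vectors

def cosine(x, y):
    """Cosine similarity of Definition 9."""
    dot = sum(a * b for a, b in zip(x, y))
    nx, ny = math.sqrt(sum(a * a for a in x)), math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

docs = [["fish", "fish", "river"], ["fish", "market"], ["market", "price"]]
_, vecs = tf_idf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3))   # documents 0 and 1 share "fish" -> 0.5
print(round(cosine(vecs[0], vecs[2]), 3))   # no shared keywords -> 0.0
```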
4.3.3 Discussion on incorporation of dimension weighting in the distance function differences

Although both domains consider object values and dimension weights for dissimilarity computation, there is a clear difference between the two domains. The difference lies in the fact that in text clustering the value of a dimension/keyword in a document, i.e., TF, also expresses some notion of importance of the keyword in the document. This is in contrast to subspace clustering, where all values are treated similarly and there is no importance discrimination.

This fact is also reflected in the chosen dissimilarity functions. In subspace clustering, where the values carry no semantics on their importance, the value difference in each dimension (e.g., absolute as in PROCLUS [3] or squared as in PreDeCon [9]) is considered. In text clustering, though, the widely used cosine similarity function multiplies the actual object values at each dimension, so that higher values result in higher scores. For example, if we consider two objects with values 0.8 and 0.6 for a specific dimension (case 1), the contribution of this dimension to the dissimilarity score will be 0.8 − 0.6 = 0.2 for PROCLUS and (0.8 − 0.6)² = 0.04 for PreDeCon, whereas for the cosine similarity it will be 0.8 · 0.6 = 0.48. For two other objects, with values 0.4 and 0.2 in the same dimension (case 2), the result would be the same for PROCLUS and PreDeCon, since they rely on the value difference, which is again 0.2; however, it will be different for the cosine similarity: the contribution of this dimension would now be 0.4 · 0.2 = 0.08, accounting for the fact that the actual object values are lower in case 2 than in case 1.

Also, in subspace clustering the dimension weighting predominantly indicates whether the corresponding object values should be considered or not. For example, in PROCLUS [3] only projected dimensions (having weight 1) are considered during dissimilarity assessment and there is no special weighting per preferred dimension. In PreDeCon [9], too, the value differences in the preferred dimensions (those having weight κ) are down-weighted by 1/κ, but again this holds uniformly for all preferred dimensions. On the contrary, in text clustering the actual weights of the dimensions, as expressed by their IDF values over the whole collection of documents, are considered and contribute proportionally to the dissimilarity score.
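The per-dimension contributions of the two cases above can be checked with a few lines of Python (illustrative only):

```python
# Per-dimension contributions for the two cases discussed above.
for v_p, v_q in [(0.8, 0.6), (0.4, 0.2)]:
    proclus = abs(v_p - v_q)          # absolute difference (PROCLUS-style)
    predecon = (v_p - v_q) ** 2       # squared difference (PreDeCon-style)
    cosine_term = v_p * v_q           # product term in the cosine numerator
    print(v_p, v_q, round(proclus, 2), round(predecon, 2), round(cosine_term, 2))
# 0.8 0.6 -> 0.2, 0.04, 0.48   |   0.4 0.2 -> 0.2, 0.04, 0.08
```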
5. TOWARDS COMBINING THE BEST OF BOTH WORLDS

In the previous section, we elaborated on the differences and commonalities between the subspace clustering and text clustering domains with respect to data representation, dimension weighting and the incorporation of these weights in the dissimilarity functions. In this section, we overview approaches that make use of concepts from both domains.

In [1], the authors study the problem of semi-supervised text classification and use clustering to create the set of categories for the classification of documents. To improve clustering quality, they use a modified version of k-Means where they iteratively refine both the clusters and the dimensions inside each cluster. In particular, at each iteration they filter out some of the words of the feature space, ensuring that only words that frequently occur within the cluster are used for the assignment process. The number of words, i.e., projected dimensions for each cluster, is reduced at each iteration by a geometric factor. In their experiments, they started with a projected dimensionality of 500 words, which was finally reduced to 200 words. This work introduces subspace clustering concepts to text clustering, in particular the local (within each cluster) weighting of the words and the selection of the most prominent ones for the cluster centroid representation. The proposed algorithm resembles PROCLUS [3] (cf. Section 4.2.1); however, the distance function and the selection of projected dimensions at each iteration are adapted to the text domain. In particular, the cosine function is used for dissimilarity assessment and the selection of projected dimensions is done on the basis of their occurrences in each cluster.

In [14], the authors propose a modification of W-k-Means [10] that calculates cluster-specific weights for each feature during the clustering process, for text documents. In total, m × k weights are produced by the algorithm, where m is the number of features and k the number of clusters. The initial weights are randomly generated and refined during the iteration phase, similarly to k-Means. Based on these weights, the subset of keywords that can serve as a cluster label can be identified. In their experiments, the proposed subspace method performed better than standard k-Means and Bisecting k-Means [29].

There are also methods that extract an initial clustering with respect to the whole feature space of the documents (usually through TF × IDF weighting) and refine the clustering result by applying clustering again inside each extracted cluster, but using a refined feature space. For example, in [31] the authors propose the extraction and maintenance of a two-level hierarchy of global and local topics: the global topics are extracted by applying clustering over the whole collection of documents in the feature space derived from the whole collection through TF × IDF. To extract the local topics, clustering is applied again inside each global cluster, using the cluster population to generate the new cluster-specific feature space (the IDF scores are based on the cluster population rather than on the whole collection), instead of using the generic feature space extracted from the whole dataset. This way, the feature space is refined per cluster.

Discussion: Although there are some methods for text clustering that make use of subspace clustering concepts, like [1] and [14], and methods that refine the feature space through global and local weighting, like [31], there are still many things that the two domains could exchange and benefit from each other, like for example a concurrent incorporation of both global and local weighting of features in the clustering process. There is a wealth of subspace clustering algorithms in the literature; a thorough report on their performance over text data would be very useful as a starting point. Of course, the weighting of the dimensions and the distance function should respect the peculiarities of text data, as discussed in Section 4. There are also other fields which resemble the semantics of the data values in text; for example, in recommendation applications higher ratings indicate more desired items, e.g., movies or restaurants. Subspace clustering is relevant to this kind of data too, since groups of users with different item preferences might exist; however, as with text, the data semantics should also be taken into account.

6. CONCLUSIONS

We outlined the differences and commonalities between subspace clustering and text clustering and we argued that text data is not as straightforward an application domain for subspace clustering as is often suggested in the literature. Although both domains share the same goal, the concurrent extraction of clusters and the dimensions where these clusters are formed, and deal with similar challenges, namely high dimensional and sparse data, there are key differences between the two domains, which we summarize below:

• The object values in text clustering also reflect the importance of a keyword within a document (TF scores).

• Missing data in subspace clustering imply unknown values, whereas in text clustering a non-appearing keyword indicates a word that is not important or is probably redundant for the description of the document.

• Both domains use some notion of dimension weighting to account for the different importance of the dimensions for the clustering task.

• In subspace clustering, dimension weighting is local and is derived with respect to the local neighborhood of a point or a cluster.

• In text clustering, dimension weighting is global and is defined with respect to the whole collection of documents.

• Dimension weighting in subspace clustering mainly separates important from non-important dimensions by assigning the same weight to all important/non-important dimensions.

• In text clustering, the actual weights of the dimensions are used in the dissimilarity function.

• The dissimilarity function in text clustering also considers the importance of a keyword within a document (TF scores), so that more important keywords are given higher scores compared to less important ones.

The goal of this work is to point out the differences between the two domains and to show their commonalities. Text clustering is an established area of research with a bulk of applications and an increased interest nowadays due to the web and the digitization of information. On the other hand, the subspace clustering area grows fast, as high dimensionality comprises one of the basic features of nowadays' data. A few works combine concepts from both domains and could serve as a starting point for a further exchange of ideas and best practices between these domains that would be fruitful for their further development.

7. ACKNOWLEDGMENTS

The authors would like to thank Arthur Zimek for useful comments and suggestions to improve the quality of the paper.

8. REFERENCES

[1] C. Aggarwal, S. Gates, and P. Yu. On using partial supervision for text categorization. IEEE Transactions on Knowledge and Data Engineering, 16(2):245–255, Feb. 2004.

[2] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for projected clustering of high dimensional data streams. In VLDB, pages 852–863, 2004.

[3] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Philadelphia, PA, 1999.

[4] C. C. Aggarwal and C. Zhai. A survey of text clustering algorithms. In C. C. Aggarwal and C. Zhai, editors, Mining Text Data, pages 77–128. Springer, 2012.

[5] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Seattle, WA, 1998.
[6] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pages 487–499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.

[7] N. O. Andrews and E. A. Fox. Recent developments in document clustering. Technical report, Computer Science, Virginia Tech, 2007.

[8] I. Assent. Clustering high dimensional data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(4):340–350, 2012.

[9] C. Böhm, K. Kailing, H.-P. Kriegel, and P. Kröger. Density connected clustering with local subspace preferences. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), Brighton, UK, pages 27–34, 2004.

[10] E. Y. Chan, W. K. Ching, M. K. Ng, and J. Z. Huang. An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recognition, 37(5):943–952, May 2004.

[11] J. H. Friedman and J. J. Meulman. Clustering objects on subsets of attributes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(4):825–849, 2004.

[12] S. Günnemann, E. Müller, S. Raubach, and T. Seidl. Flexible fault tolerant subspace clustering for data with missing values. In D. J. Cook, J. Pei, W. Wang, O. R. Zaïane, and X. Wu, editors, ICDM, pages 231–240. IEEE, 2011.

[13] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Academic Press, 2nd edition, 2006.

[14] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li. Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):657–668, 2005.

[15] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31:264–323, 1999.

[16] I. T. Jolliffe. Principal Component Analysis. Springer, 2nd edition, 2002.
[17] K. Kailing, H.-P. Kriegel, and P. Kröger. Density-connected subspace clustering for high-dimensional data. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM), Lake Buena Vista, FL, 2004.

[18] H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1):1–58, 2009.

[19] H.-P. Kriegel, E. Ntoutsi, M. Spiliopoulou, G. Tsoumakas, and A. Zimek. Mining complex dynamic data. Tutorial at the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Athens, Greece, 2011.

[20] T. K. Landauer, D. S. McNamara, S. Dennis, and W. Kintsch. Handbook of Latent Semantic Analysis. Lawrence Erlbaum Associates, 2007.

[21] E. Müller, S. Günnemann, I. Assent, and T. Seidl. Evaluating clustering in subspace projections of high dimensional data. In Proceedings of the 35th International Conference on Very Large Data Bases (VLDB), Lyon, France, 2009.

[22] H. Nagesh, S. Goil, and A. Choudhary. Adaptive grids for clustering massive data sets. In Proceedings of the 1st SIAM International Conference on Data Mining (SDM), Chicago, IL, 2001.

[23] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 6(1):90–105, 2004.

[24] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[25] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A Monte Carlo algorithm for fast projective clustering. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Madison, WI, 2002.

[26] M. Radovanovic, A. Nanopoulos, and M. Ivanovic. On the existence of obstinate results in vector space models. In SIGIR, pages 186–193, 2010.

[27] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.

[28] M. Shafiei, S. Wang, R. Zhang, E. Milios, B. Tang, J. Tougas, and R. Spiteri. A systematic study of document representation and dimension reduction for text clustering. Technical Report CS-2006-05, Faculty of Computer Science, Dalhousie University, Canada, July 2006.

[29] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.

[30] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 412–420, 1997.

[31] M. Zimmermann, E. Ntoutsi, and M. Spiliopoulou. Extracting opinionated (sub)features from a stream of product reviews. In Discovery Science - 16th International Conference (DS), Singapore, 2013.