IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 19, NO. 1, JANUARY 1997

Pairwise Data Clustering by Deterministic Annealing

Thomas Hofmann, Student Member, IEEE, and Joachim M. Buhmann, Member, IEEE

The authors are with Rheinische Friedrich-Wilhelms-Universität, Institut für Informatik III, Römerstraße 164, D-53117 Bonn, Germany. E-mail: {th, jb}@informatik.uni-bonn.de. Manuscript received June 3, 1995; revised Sept. 26, 1996. Recommended for acceptance by D.M. Titterington.

Abstract—Partitioning a data set and extracting hidden structure from the data arise in different application areas of pattern recognition, speech and image processing. Pairwise data clustering is a combinatorial optimization method for data grouping which extracts hidden structure from proximity data. We describe a deterministic annealing approach to pairwise clustering which shares the robustness properties of maximum entropy inference. The resulting Gibbs probability distributions are estimated by mean-field approximation. A new structure-preserving algorithm to cluster dissimilarity data and to simultaneously embed these data in a Euclidian vector space is discussed which can be used for dimensionality reduction and data visualization. The suggested embedding algorithm, which outperforms conventional approaches, has been implemented to analyze dissimilarity data from protein analysis and from linguistics. The algorithm for pairwise data clustering is used to segment textured images.

Index Terms—Pairwise data clustering, maximum entropy method, multidimensional scaling, exploratory data analysis.

1 INTRODUCTION

Modern information and communication technology confronts us with massive amounts of data. The primary goal of pattern recognition is to extract hidden structure from data in order to generate a compact data representation and to enable symbolic data processing concepts. One of the basic problems in pattern recognition is concerned with the detection of clusters in data sets. The potential applications of clustering algorithms cover a wide range from data compression of video and audio signals to structure detection and automatic inference engines in machine learning and artificial intelligence. We will describe a stochastic optimization approach to data clustering which relies on the well-known robustness of maximum entropy inference [1], [2], [3].¹

The problem of optimally partitioning a data set arises in two different forms, depending on whether the data are represented as vectorial or as proximity data. A partitioning approach known as central clustering derives a set of reference or prototype vectors which quantize a set of vectorial data with minimal quantization error [6], [7]. Data compression is achieved by transmission and storage of the indices of reference vectors rather than the original data vectors. The second approach to data clustering, referred to as pairwise data clustering [8], partitions a set of data into clusters in which the data are indirectly characterized by pairwise comparisons instead of explicit coordinates. The characteristics of the data set are hidden in these pairwise relations or proximity values, which frequently violate the requirements of a distance measure: the triangular inequality does not necessarily hold, the self-dissimilarity may not vanish, and the proximity values might be negative. The grouping of proximity data is mathematically formulated as a combinatorial optimization problem which we solve with a minimization heuristic called deterministic annealing. Sets of relational data are abundant in many applications, e.g., in molecular biology, psychology, linguistics, economics, and image processing.

¹ Data clustering is viewed as a partitioning problem throughout this paper and not as a density estimation problem of a mixture model [4], [5] in the sense of parametric statistics.
Data clustering as a problem in pattern recognition and statistics belongs to the class of unsupervised learning problems. There is a large body of literature available on this topic, and the reader is referred to the textbooks of Duda and Hart [8] and of Jain and Dubes [4] for an overview. The method of deterministic annealing is described in various papers, mostly in the literature on neural networks [9], [10], [11], [12] and on computer vision [13], [14], [15]. Deterministic annealing applied to central clustering has been discussed by Rose et al. in a series of papers [16], [17], [18], [19]. A solution of the deterministic annealing procedure for vector quantization with different rate constraints was suggested in [20], [21]. This work, as well as Chou et al. [22], emphasized the design question of how large the code book should be chosen, depending on prespecified costs per code vector.

The remainder of this paper is structured in the following way: We discuss the advantage of a maximum entropy based search heuristic in Section 2. A discussion of cost functions for central and pairwise data clustering is presented in Section 3. Approximation techniques to calculate expectation values for the data assignments are discussed in Section 4. The widely used estimation technique called mean-field approximation is derived by variational techniques and, alternatively, by an expansion for small fluctuations. An extension of pairwise data clustering to data visualization is described in Section 5. The results of central clustering (Section 3) are employed to simultaneously group and embed proximity data in a low-dimensional Euclidian space. Simulation results of clustering problems in molecular biology and linguistics, a performance comparison between deterministic annealing and a conventional gradient descent technique for clustering, as well as an application of pairwise data clustering to image segmentation are summarized in Section 6.

2 STOCHASTIC OPTIMIZATION BY MAXIMUM ENTROPY INFERENCE

2.1 Simulated Annealing

In seminal papers, Kirkpatrick et al. [23] and, independently, Cerny [24] have proposed the stochastic optimization strategy simulated annealing. By analogy to an experimental annealing procedure, where the stability of metal or glass is improved by heating and cooling, solutions of an optimization problem are heated and cooled in simulations to find one with very low costs. The search for good solutions is implemented by a Markov process which stochastically samples the solution space $\Omega$ of an optimization problem. The optimization problem is characterized by a cost function $\mathcal{H}:\Omega\to\mathbb{R}$, $\omega\in\Omega$ denoting an admissible solution. A new solution is accepted or rejected according to the Metropolis algorithm, i.e., new solutions with decreased costs are always accepted and solutions with increased costs are accepted with an exponentially weighted probability,

$$ P(\omega^{\mathrm{old}} \to \omega^{\mathrm{new}}) = \begin{cases} 1 & \text{if } \Delta\mathcal{H} \le 0, \\ \exp(-\Delta\mathcal{H}/T) & \text{else,} \end{cases} \qquad (1) $$

with $\Delta\mathcal{H} \equiv \mathcal{H}(\omega^{\mathrm{new}}) - \mathcal{H}(\omega^{\mathrm{old}})$. The parameter $T$ is called the computational temperature. The philosophy of simulated annealing is to gradually reduce the temperature during the search process, thereby forcing the system into solutions with low costs. Mathematically, the stochastic search process for the optimal solution is a random walk in the solution space. Cost differences between neighboring states act as a force field. The effect of the temperature can be interpreted as a random force with an amplitude proportional to $T$: valleys and peaks with a cost difference less than $T$ are smeared out and vanish in the stochastic search. A Markov process with a transition matrix (1) converges to an equilibrium probability distribution [25]

$$ P^{\mathrm{Gb}}(\omega) = \frac{\exp(-\mathcal{H}(\omega)/T)}{\sum_{\omega'\in\Omega}\exp(-\mathcal{H}(\omega')/T)} \equiv \exp\bigl(-(\mathcal{H}(\omega)-\mathcal{F}(\mathcal{H}))/T\bigr), \qquad (2) $$

which is known as the Gibbs distribution. The quantity $\mathcal{F}(\mathcal{H}) \equiv -T\log\sum_{\omega'\in\Omega}\exp(-\mathcal{H}(\omega')/T)$ denotes the Gibbs free energy. The temperature $T$ formally plays the role of a Lagrange parameter to enforce a constraint on the expected costs

$$ \langle\mathcal{H}\rangle \equiv \sum_{\omega\in\Omega} P^{\mathrm{Gb}}(\omega)\,\mathcal{H}(\omega). \qquad (3) $$

The Gibbs free energy $\mathcal{F}(\mathcal{H})$ is related to the expected costs $\langle\mathcal{H}\rangle$ and to the entropy $\mathcal{S}$ by

$$ \mathcal{S}(P^{\mathrm{Gb}}) = -\sum_{\omega\in\Omega} P^{\mathrm{Gb}}(\omega)\log P^{\mathrm{Gb}}(\omega) = \frac{1}{T}\bigl(\langle\mathcal{H}\rangle - \mathcal{F}(\mathcal{H})\bigr). \qquad (4) $$
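The acceptance rule (1) can be made concrete with a short sketch. The following code is not from the paper; it illustrates a Metropolis update inside an exponential annealing loop for a generic cost function, with the proposal mechanism, function names, and schedule parameters chosen purely for illustration.

```python
import math
import random

def metropolis_step(state, propose, cost, T):
    """One Metropolis update, cf. eq. (1): always accept cost-decreasing moves,
    accept cost-increasing moves with probability exp(-dH/T)."""
    candidate = propose(state)
    dH = cost(candidate) - cost(state)
    if dH <= 0 or random.random() < math.exp(-dH / T):
        return candidate          # move accepted
    return state                  # move rejected

def simulated_annealing(state, propose, cost, T0=1.0, T_final=1e-3, eta=0.95, sweeps=100):
    """Gradually lower T on an exponential schedule while sampling with Metropolis."""
    T = T0
    while T > T_final:
        for _ in range(sweeps):
            state = metropolis_step(state, propose, cost, T)
        T *= eta
    return state
```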
2.2 Deterministic Annealing

A stochastic search according to a Markov process with transition matrix (1) allows us to estimate expectation values of system parameters by computing time averages in a Monte Carlo simulation, e.g., the variables of the optimization problem are drawn according to $P^{\mathrm{Gb}}(\omega)$. This random, sequential sampling of the solution space, however, is slow compared to deterministic optimization techniques due to the diffusive nature of the search process. A deterministic variant of simulated annealing, "deterministic annealing," analytically estimates relevant expectation values of system parameters, e.g., the variables of the optimization problem. We introduce the generalized free energy

$$ \tilde{\mathcal{F}}(P) \equiv \langle\mathcal{H}\rangle_P - T\,\mathcal{S}(P) = \sum_{\omega\in\Omega} P(\omega)\,\mathcal{H}(\omega) + T\sum_{\omega\in\Omega} P(\omega)\log P(\omega), \qquad (5) $$

which has to be minimized over a (tractable) subspace of probability distributions. The inequality $\tilde{\mathcal{F}}(P) \ge \tilde{\mathcal{F}}(P^{\mathrm{Gb}}) \equiv \mathcal{F}(\mathcal{H})$ holds since the Gibbs distribution maximizes the entropy [1], [2] for $\langle\mathcal{H}\rangle$ kept fixed. The search space of probability densities is chosen such that expectation values of the optimization parameters can be approximated analytically. The temperature parametrizes a family of generalized free energies whose complexity increases as $T\to 0$: high temperature smoothes the cost function, whereas low temperature reveals the full complexity of the original optimization problem, which is recovered for $T\to 0$. Deterministic annealing algorithms track good solutions from high to low temperature in a similar way to cooling in simulated annealing. We will discuss this technique in detail in Section 4.

Why should we consider a stochastic or deterministic search strategy based on principles from statistical physics? The fundamental relationship between statistical physics and robust statistics has been established by Jaynes [1], [2], [3], who postulated the principle of maximum entropy inference. Maximizing the entropy yields the least biased inference method, being maximally noncommittal with respect to missing data. In the context of data clustering, the missing information are the assignments of data to clusters. Another important argument in favor of the maximum entropy method stresses the robustness of this inference technique. Tikochinsky et al. [26] have proven that the maximum entropy probability distribution is maximally stable in terms of the $L_2$ norm if the expected cost $\langle\mathcal{H}\rangle$ is lowered or raised by changes of the temperature. The family of Gibbs distributions for a given cost function possesses the optimality property of inducing the least variations if $\langle\mathcal{H}\rangle$ is reduced. Using concepts from differential geometry, the family of Gibbs distributions parameterized by the temperature forms a trajectory of minimal length in the space of probability distributions [27]. We conclude from these facts that a stochastic search heuristic which starts with a large noise level and which gradually reduces the stochasticity to zero should be based on the family of Gibbs distributions with decreasing temperature. This strategy guarantees maximal robustness with respect to noise.
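As a quick numerical illustration (not part of the original text), the bound $\tilde{\mathcal{F}}(P) \ge \mathcal{F}(\mathcal{H})$ implied by (5) can be checked on a toy problem whose solution space is small enough to enumerate; the cost values below are arbitrary.

```python
import numpy as np

# Toy solution space: 8 configurations with arbitrary illustrative costs.
H = np.array([0.3, 1.2, 0.7, 2.5, 0.1, 1.9, 0.8, 1.4])
T = 0.5

# Gibbs distribution (2) and Gibbs free energy F = -T log sum exp(-H/T).
w = np.exp(-H / T)
p_gibbs = w / w.sum()
F_gibbs = -T * np.log(w.sum())

def generalized_free_energy(P, H, T):
    """F~(P) = <H>_P - T * S(P), cf. eq. (5); assumes strictly positive P."""
    P = np.asarray(P, dtype=float)
    return float(P @ H + T * (P @ np.log(P)))

# The Gibbs distribution attains the minimum; any other distribution lies above it.
uniform = np.full_like(H, 1.0 / len(H))
assert abs(generalized_free_energy(p_gibbs, H, T) - F_gibbs) < 1e-9
assert generalized_free_energy(uniform, H, T) > F_gibbs
```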
3 COST FUNCTIONS FOR DATA CLUSTERING

3.1 Central Clustering

The most widely used nonparametric technique for finding data prototypes is central clustering or vector quantization. Given a set of $d$-dimensional data vectors $\mathcal{X} = \{\mathbf{x}_i \in \mathbb{R}^d : i \in \{1,\dots,N\}\}$, central clustering poses the problem of determining an optimal set of $d$-dimensional reference vectors or prototypes $\mathcal{Y} = \{\mathbf{y}_\nu \in \mathbb{R}^d : \nu \in \{1,\dots,K\}\}$. To specify a data partition we introduce Boolean assignment variables $M$ and a configuration space $\mathcal{M}$,

$$ M = (M_{i\nu}) \in \{0,1\}^{N\times K}, \quad i=1,\dots,N,\ \nu=1,\dots,K, \qquad (6) $$

$$ \mathcal{M} = \Bigl\{ M \in \{0,1\}^{N\times K} : \sum_{\nu=1}^{K} M_{i\nu} = 1,\ \forall i \Bigr\}. \qquad (7) $$

$M_{i\nu} = 1$ if the data point $\mathbf{x}_i$ is assigned to reference vector $\mathbf{y}_\nu$, while $M_{i\nu} = 0$ otherwise. The solution space is defined as the set of admissible configurations in (7); the constraints $\sum_{\nu=1}^{K} M_{i\nu} = 1$, $\forall i$, assure that each data point is represented by a unique reference vector $\mathbf{y}_\nu \in \mathcal{Y}$. The quality of a set of reference vectors is assessed by the objective function for central clustering, which sums up the distortion errors $\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\nu)$ between a data vector $\mathbf{x}_i$ and the corresponding reference vector $\mathbf{y}_\nu$, i.e.,

$$ \mathcal{H}^{\mathrm{cc}}(M) = \sum_{i=1}^{N}\sum_{\nu=1}^{K} M_{i\nu}\,\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\nu). \qquad (8) $$

The appropriate distortion measure $\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\nu)$ depends on the application domain, the most common choice being the squared Euclidian distance $\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\nu) = \|\mathbf{x}_i - \mathbf{y}_\nu\|^2$ between the data vector and its reference vector. Applications with a topological ordering of the reference vectors, as for source-channel coding, demand a distortion measure which considers the topological organization of the reference vectors, e.g., $\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\alpha) \equiv \sum_{\nu=1}^{K} T_{\alpha\nu}\,\|\mathbf{x}_i - \mathbf{y}_\nu\|^2$, where $T_{\alpha\nu}$ specifies the probability that index $\alpha$ is confused with index $\nu$ due to transmission noise. Distortions with a low-dimensional, topological arrangement defining a chain or a two-dimensional grid are very popular, e.g., self-organizing topological maps in the area of neural computing [28], [29]. The number of clusters can be limited by additional rate distortion constraints, e.g., the Shannon entropy or penalties for small clusters, rather than by postulating a fixed number $K$ [21].

Stochastic optimization of the cost function (8) requires us to determine the probability distribution of the assignments $M$. The maximum entropy principle, originally suggested by Rose et al. [16] for central clustering, states that the assignments are distributed according to the Gibbs distribution

$$ P^{\mathrm{Gb}}\bigl(\mathcal{H}^{\mathrm{cc}}(M)\bigr) = \exp\Bigl(-\bigl(\mathcal{H}^{\mathrm{cc}}(M) - \mathcal{F}(\mathcal{H}^{\mathrm{cc}})\bigr)/T\Bigr), \qquad (9) $$

$$ \mathcal{F}(\mathcal{H}^{\mathrm{cc}}) = -T\log\sum_{M\in\mathcal{M}}\exp\bigl(-\mathcal{H}^{\mathrm{cc}}(M)/T\bigr). \qquad (10) $$

As pointed out in Section 2, the free energy $\mathcal{F}(\mathcal{H}^{\mathrm{cc}})$ in (10) can be interpreted as a smoothed version of the original cost function $\mathcal{H}^{\mathrm{cc}}$. The factor $\exp(\mathcal{F}(\mathcal{H}^{\mathrm{cc}})/T)$ normalizes the exponential weights $\exp(-\mathcal{H}^{\mathrm{cc}}(M)/T)$. Its inverse can be rewritten as

$$ \sum_{M\in\mathcal{M}} \exp\Bigl(-\sum_{i=1}^{N}\sum_{\nu=1}^{K} M_{i\nu}\,\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\nu)/T\Bigr) = \prod_{i=1}^{N}\sum_{\nu=1}^{K}\exp\bigl(-\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\nu)/T\bigr), \qquad (11) $$

since the sum over assignments is constrained by $\sum_{\nu=1}^{K} M_{i\nu} = 1$. The cost function $\mathcal{H}^{\mathrm{cc}}$, which is linear in $M_{i\nu}$, yields a factorized Gibbs distribution

$$ P^{\mathrm{Gb}}\bigl(\mathcal{H}^{\mathrm{cc}}(M)\bigr) = \prod_{i=1}^{N} \frac{\exp\bigl(-\sum_{\nu=1}^{K} M_{i\nu}\,\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\nu)/T\bigr)}{\sum_{\mu=1}^{K}\exp\bigl(-\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\mu)/T\bigr)} \qquad (12) $$

for predefined reference vectors $\mathcal{Y} = \{\mathbf{y}_\nu\}$. This Gibbs distribution can also be interpreted as the complete data likelihood for mixture models with parameters $\mathcal{Y}$.
Following Rose et al. [16], the optimal reference vectors $\{\mathbf{y}^*_\nu\}$ are derived by maximizing the entropy of the Gibbs distribution, keeping the average costs $\langle\mathcal{H}^{\mathrm{cc}}\rangle$ fixed, i.e.,

$$ \mathcal{Y}^* = \arg\max_{\mathcal{Y}}\Bigl(-\sum_{M\in\mathcal{M}} P^{\mathrm{Gb}}\bigl(\mathcal{H}^{\mathrm{cc}}(M)\bigr)\log P^{\mathrm{Gb}}\bigl(\mathcal{H}^{\mathrm{cc}}(M)\bigr)\Bigr) = \arg\max_{\mathcal{Y}}\Bigl(\sum_{M\in\mathcal{M}}\mathcal{H}^{\mathrm{cc}}(M)\,P^{\mathrm{Gb}}\bigl(\mathcal{H}^{\mathrm{cc}}(M)\bigr)/T + \sum_{i=1}^{N}\log\sum_{\mu=1}^{K}\exp\bigl(-\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\mu)/T\bigr)\Bigr). \qquad (13) $$

$P^{\mathrm{Gb}}(M)$ is the Gibbs distribution of the assignments for a set $\mathcal{Y}$ of fixed reference vectors. To determine closed equations for the optimal reference vectors $\mathbf{y}^*_\nu$ we differentiate the argument in (13) with the expected costs kept constant. The resulting equations,

$$ 0 = \sum_{i=1}^{N}\langle M_{i\nu}\rangle\,\frac{\partial}{\partial\mathbf{y}_\nu}\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\nu), \qquad (14) $$

$$ \langle M_{i\nu}\rangle \equiv \frac{\exp\bigl(-\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\nu)/T\bigr)}{\sum_{\mu}\exp\bigl(-\mathcal{D}(\mathbf{x}_i,\mathbf{y}_\mu)/T\bigr)}, \quad \forall\nu\in\{1,\dots,K\}, \qquad (15) $$

are known as the centroid equations in signal processing. The angular brackets denote Gibbs expectation values, i.e., $\langle f(M)\rangle \equiv \sum_{M\in\mathcal{M}} f(M)\,P^{\mathrm{Gb}}(M)$. The reader should realize that entropy maximization implies $\mathbf{y}_\nu$ to be a centroid, which is also optimal in the sense of rate distortion theory [30].

Equations (14) and (15) are efficiently solved in an iterative fashion using the expectation maximization (EM) algorithm [30]. The EM algorithm alternates an expectation step to determine the expected assignments $\langle M_{i\nu}\rangle$ with a maximization step to estimate maximum likelihood values for the cluster centers $\mathbf{y}_\nu$. Dempster et al. [31] have proven that the likelihood increases monotonically under this alternation scheme, which demonstrates convergence of the algorithm toward a local maximum of the likelihood function. The log-likelihood is, up to a factor $(-T)$, equivalent to the free energy for central clustering. In Section 5 we will use the solutions of central clustering with squared Euclidian distances to simultaneously group a data set and embed it in the Euclidian space $\mathbb{R}^d$ while preserving the cluster structure.
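A minimal sketch of the resulting fixed-point iteration for squared Euclidian distortions — soft assignments (15) alternated with the centroid update (14) inside an exponential annealing loop — might look as follows. The initialization, starting temperature, and schedule are illustrative choices and not prescriptions from the paper.

```python
import numpy as np

def soft_assignments(X, Y, T):
    """E-step, eq. (15): <M_inu> proportional to exp(-||x_i - y_nu||^2 / T)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    logits = -(d2 - d2.min(axis=1, keepdims=True)) / T        # shift for numerical stability
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)

def da_central_clustering(X, K, T0=None, T_final=1e-3, eta=0.9, inner=50, seed=0):
    """Deterministic annealing for central clustering with squared Euclidian distortion:
    alternate the expected assignments (15) with the centroid equation (14)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    T = T0 if T0 is not None else 2.0 * X.var()               # heuristic start temperature
    Y = X[rng.choice(N, K, replace=False)] + 1e-3 * rng.standard_normal((K, d))
    while T > T_final:
        for _ in range(inner):
            M = soft_assignments(X, Y, T)                     # E-like step
            Y = (M.T @ X) / (M.sum(axis=0)[:, None] + 1e-12)  # M-like step: centroids
        T *= eta
    return Y, soft_assignments(X, Y, T)
```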
3.2 Pairwise Clustering

Central clustering requires that the data can be characterized by feature values $\mathbf{x}_i \in \mathbb{R}^d$ in a $d$-dimensional Euclidian space. Frequently in the empirical sciences, however, the only available information about a data set are comparisons between data pairs; these dissimilarity values are denoted in the following by $D = (\mathcal{D}_{ik}) \in \mathbb{R}^{N\times N}$, $i,k = 1,\dots,N$. Clustering on the basis of this data description is achievable by grouping the data into clusters such that the sum of dissimilarities between data of the same cluster is minimized [4], [8]. This criterion favors compact and coherent groups over heterogeneous data collections.

We again use the set of assignment variables $M$ as defined in (6) and denote by $M_{i\nu}$ the indicator of an assignment of datum $i$ to cluster $\nu$. To compensate for different numbers of data per cluster, the costs of a particular cluster $\nu$ are normalized by the percentage $p_\nu = \sum_{i=1}^{N} M_{i\nu}/N$ of data in that cluster. Without this normalization, the undesirable and often detrimental tendency can be observed that clusters with few data grow at the expense of equally coherent clusters with many data. The cost function for pairwise clustering with $K$ clusters,

$$ \mathcal{H}^{\mathrm{pc}}(M) = \frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}\frac{\mathcal{D}_{ik}}{N}\Bigl(\sum_{\nu=1}^{K}\frac{M_{i\nu}M_{k\nu}}{p_\nu} - 1\Bigr), \qquad (16) $$

stresses cluster coherency. Alternative clustering costs have been proposed [8], but have not found widespread acceptance in pattern recognition applications. The constant term $\sum_{i=1}^{N}\sum_{k=1}^{N}\mathcal{D}_{ik}/(2N)$ has been subtracted in (16) to emphasize the independence of the clustering cost function of the absolute dissimilarity scale, i.e.,

$$ \mathcal{H}^{\mathrm{pc}}\bigl(M \mid (\mathcal{D}_{ik}-\mathcal{D}_0)\bigr) = \mathcal{H}^{\mathrm{pc}}\bigl(M \mid (\mathcal{D}_{ik})\bigr), \qquad (17) $$

since $\mathcal{D}_0\sum_{i=1}^{N}\sum_{k=1}^{N}\bigl(\sum_{\nu=1}^{K}M_{i\nu}M_{k\nu}/p_\nu - 1\bigr)/N = 0$. A uniform shift of the dissimilarity values by an offset $\mathcal{D}_0$ does not change the clustering costs and, consequently, has no influence on the statistics of the assignments. Another cost function with this offset invariance is

$$ \mathcal{H}'(M) = \frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}\frac{\mathcal{D}_{ik}}{N}\Bigl(\sum_{\nu=1}^{K}\frac{M_{i\nu}M_{k\nu}}{p_\nu^2} - K\Bigr), $$

which, however, displays a tendency toward extremely heterogeneous partitionings with very large and very small clusters [32]. A second important property of the proposed cost function concerns non-symmetric dissimilarities, $\mathcal{D}_{ik} \ne \mathcal{D}_{ki}$: $\mathcal{H}^{\mathrm{pc}}$ is not changed if all dissimilarities are replaced by the arithmetic mean, $\mathcal{D}_{ik} \leftarrow (\mathcal{D}_{ik}+\mathcal{D}_{ki})/2$. For reasons of simplicity, we henceforth assume symmetric $\mathcal{D}_{ik}$. Furthermore, $\mathcal{H}^{\mathrm{pc}}$ is invariant under an arbitrary permutation of the cluster indices $\nu \to \pi(\nu)$.²

An important, although often ignored, consideration for stochastic optimization problems is the scaling of the $\mathcal{D}_{ik}$ values with the number $N$ of data. The correct scaling should yield constant costs per data point to achieve independence of the annealing schedules and stochastic search heuristics from the instance size $N$. In the case of completely consistent dissimilarities, i.e., data $i,k$ in different clusters have large $\mathcal{D}_{ik}$ and data $i,k$ in the same cluster have small $\mathcal{D}_{ik}$, constant costs per datum require a scaling $\mathcal{D}_{ik} \sim \mathcal{O}(1)$. In the opposite case of random dissimilarity values, averaging effects necessitate a scaling $\mathcal{D}_{ik} \sim \mathcal{O}(\sqrt{N})$. A thorough discussion of this point can be found in the statistical physics literature on optimization problems [33].

² The permutation symmetry can be removed by adding a small perturbation $\delta\mathcal{H} := \sum_{\nu=1}^{K} R_\nu(N)\sum_{i=1}^{N} M_{i\nu}$ to the cost function (16), with $0 < R_1(N) < \dots < R_K(N)$, $\lim_{N\to\infty} R_\nu(N) = 0$, $\lim_{N\to\infty} N R_\nu(N) = \infty$. The perturbations $R_\nu(N)$ favor an indexing of the clusters according to their size ($p_1 > p_2 > \dots > p_K$).
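For concreteness, a direct transcription of (16) for a hard partition might look as follows; the guard for empty clusters and the example dissimilarity matrix are illustrative choices.

```python
import numpy as np

def pairwise_clustering_cost(D, labels, K):
    """H^pc of eq. (16) for a hard partition: intra-cluster dissimilarities are
    normalized by the cluster probabilities p_nu, and the constant offset
    sum_ik D_ik / (2N) is subtracted."""
    N = D.shape[0]
    M = np.zeros((N, K))
    M[np.arange(N), labels] = 1.0                    # Boolean assignment matrix, eq. (6)
    p = M.sum(axis=0) / N                            # cluster probabilities p_nu
    p = np.where(p > 0, p, 1.0)                      # guard against empty clusters
    intra = sum((M[:, nu][:, None] * M[:, nu][None, :] * D).sum() / p[nu] for nu in range(K))
    return 0.5 * (intra - D.sum()) / N

# Example with an arbitrary symmetric dissimilarity matrix:
rng = np.random.default_rng(1)
A = rng.random((6, 6)); D = (A + A.T) / 2; np.fill_diagonal(D, 0.0)
print(pairwise_clustering_cost(D, labels=np.array([0, 0, 0, 1, 1, 1]), K=2))
```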
4 MEAN-FIELD APPROXIMATION OF PAIRWISE CLUSTERING

Following the strategy of stochastic optimization discussed in Section 3.1 for central clustering, we estimate the expectation values for the assignment of data to clusters at a specified uncertainty level parametrized by the computational temperature $T$. Assignments $M$ of data to clusters are randomly drawn from the set of admissible configurations (7) according to the Gibbs distribution

$$ P^{\mathrm{Gb}}\bigl(\mathcal{H}^{\mathrm{pc}}(M)\bigr) = \frac{\exp\bigl(-\mathcal{H}^{\mathrm{pc}}(M)/T\bigr)}{\sum_{M'\in\mathcal{M}}\exp\bigl(-\mathcal{H}^{\mathrm{pc}}(M')/T\bigr)} \equiv \exp\Bigl(-\bigl(\mathcal{H}^{\mathrm{pc}}(M) - \mathcal{F}(\mathcal{H}^{\mathrm{pc}})\bigr)/T\Bigr), \qquad (18) $$

where $\mathcal{H}^{\mathrm{pc}}$ are the costs of a pairwise clustering solution (16). Contrary to the Gibbs distribution for central clustering, the data assignments $M$ in pairwise clustering are statistically dependent, and the Gibbs distribution (18) cannot be rewritten exactly in factorized form. Each assignment variable $M_{i\nu}$ interacts with all other assignment variables. These cost contributions, however, converge to averages in the limit of large data sets and reduce the influence of correlations on individual data assignments. We, therefore, approximate the average interaction of $M_{i\nu}$ with the other assignment variables by a mean field $\mathcal{E}_{i\nu}$. The following two sections present a variational technique and a perturbation expansion to derive the mean-field approximation and corrections due to the assignment correlations. The method is, however, not restricted to clustering problems and can be applied to many combinatorial optimization problems.
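For very small $N$, the Gibbs expectations under (18) can be computed exactly by enumerating all $K^N$ admissible configurations, which provides ground truth against which the approximations developed below can be checked. The following brute-force sketch is an illustration added here, not part of the paper, and is only feasible for toy problem sizes.

```python
import itertools
import numpy as np

def exact_gibbs_expectations(D, K, T):
    """Enumerate all K^N assignments, weight each by exp(-H^pc/T) (eq. (18)),
    and return the exact expectations <M_inu>.  Only feasible for tiny N."""
    N = D.shape[0]
    cost_offset = D.sum() / (2 * N)
    costs, configs = [], []
    for labels in itertools.product(range(K), repeat=N):
        labels = np.array(labels)
        p = np.array([(labels == nu).mean() for nu in range(K)])
        intra = sum(D[np.ix_(labels == nu, labels == nu)].sum() / p[nu]
                    for nu in range(K) if p[nu] > 0)
        costs.append(intra / (2 * N) - cost_offset)
        configs.append(labels)
    costs = np.array(costs)
    w = np.exp(-(costs - costs.min()) / T)            # stabilized Gibbs weights
    w /= w.sum()
    M_mean = np.zeros((N, K))
    for weight, labels in zip(w, configs):
        M_mean[np.arange(N), labels] += weight
    return M_mean

rng = np.random.default_rng(0)
A = rng.random((6, 6)); D = (A + A.T) / 2; np.fill_diagonal(D, 0.0)
print(exact_gibbs_expectations(D, K=2, T=0.1))
```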
4.1 Mean-Field Approximation as Minimization of the KL-Divergence

A mean-field approximation of the Gibbs distribution $P^{\mathrm{Gb}}(\mathcal{H}^{\mathrm{pc}})$ neglects the correlations between the stochastic variables in the pairwise clustering cost function $\mathcal{H}^{\mathrm{pc}}$ and determines the "most similar" factorized distribution within an $\mathcal{E}$-parametrized family of distributions $P^0(\mathcal{E})$. The statistics of the original problem is specified most accurately by the distribution $P^0(\mathcal{E}^*)$ which minimizes the Kullback–Leibler divergence to the original Gibbs distribution, i.e.,

$$ \mathcal{E}^* = \arg\min_{\mathcal{E}}\,\mathcal{D}_{\mathrm{KL}}\bigl(P^0(\mathcal{E})\,\|\,P^{\mathrm{Gb}}(\mathcal{H}^{\mathrm{pc}})\bigr). \qquad (19) $$

In the pairwise clustering case we define an approximating family of distributions by introducing potentials $\mathcal{E} = (\mathcal{E}_{i\nu}) \in \mathbb{R}^{N\times K}$ for the effective interactions, where $\mathcal{E}_{i\nu}$ represents the partial costs for assigning datum $i$ to cluster $\nu$. Summing up the partial costs, we arrive at a family of cost functions without correlations between the assignments, i.e.,

$$ \mathcal{H}^{0}(M,\mathcal{E}) = \sum_{i=1}^{N}\sum_{\nu=1}^{K} M_{i\nu}\,\mathcal{E}_{i\nu}. \qquad (20) $$

The linearity of (20) in the assignments reflects the fact that we assume statistical independence between the assignments, i.e., they are distributed according to a factorized Gibbs distribution $P^{\mathrm{Gb}}(\mathcal{H}^0) \equiv P^0(\mathcal{E})$.

An equivalent minimization condition for the free energy can be derived from (19) by the following algebraic transformations:

$$ \mathcal{D}_{\mathrm{KL}}\bigl(P^{\mathrm{Gb}}(\mathcal{H}^0)\,\|\,P^{\mathrm{Gb}}(\mathcal{H}^{\mathrm{pc}})\bigr) = \sum_{M\in\mathcal{M}} P^{\mathrm{Gb}}\bigl(\mathcal{H}^0(M)\bigr)\,\log\frac{\exp\bigl(-\mathcal{H}^0(M)/T\bigr)\,\sum_{M'\in\mathcal{M}}\exp\bigl(-\mathcal{H}^{\mathrm{pc}}(M')/T\bigr)}{\exp\bigl(-\mathcal{H}^{\mathrm{pc}}(M)/T\bigr)\,\sum_{M'\in\mathcal{M}}\exp\bigl(-\mathcal{H}^{0}(M')/T\bigr)} = \frac{1}{T}\Bigl(\mathcal{F}(\mathcal{H}^0) - \mathcal{F}(\mathcal{H}^{\mathrm{pc}}) + \bigl\langle\mathcal{H}^{\mathrm{pc}} - \mathcal{H}^0\bigr\rangle\Bigr). \qquad (21) $$

The averaging brackets $\langle\cdot\rangle$ denote the average with respect to $P^{\mathrm{Gb}}(\mathcal{H}^0)$. Since the KL-divergence is always positive and vanishes only for $P^{\mathrm{Gb}}(\mathcal{H}^0) \equiv P^{\mathrm{Gb}}(\mathcal{H}^{\mathrm{pc}})$, we obtain the well-known upper bound first derived by Peierls [34]:

$$ \mathcal{F}(\mathcal{H}^{\mathrm{pc}}) \le \mathcal{F}(\mathcal{H}^0) + \bigl\langle\mathcal{H}^{\mathrm{pc}} - \mathcal{H}^0\bigr\rangle. \qquad (22) $$

In summary, the optimal mean fields $\mathcal{E}^*$ result from a variational approach which minimizes the upper bound (22) on the free energy and thus the KL-divergence (19). The upper bound can be interpreted as the generalized free energy (5) restricted to the space of factorized probability distributions for $M$. The minimization of the upper bound on the free energy yields the "optimal" potentials $\mathcal{E}^*_{i\nu}$ for assigning datum $i$ to cluster $\nu$:

$$ \frac{\partial}{\partial\mathcal{E}_{i\nu}}\Bigl(\mathcal{F}(\mathcal{H}^0) + \bigl\langle\mathcal{H}^{\mathrm{pc}} - \mathcal{H}^0\bigr\rangle\Bigr)\Big|_{\mathcal{E}_{i\nu}=\mathcal{E}^*_{i\nu}} = 0 \;\Rightarrow\; \mathcal{E}^*_{i\nu} \equiv \bigl\langle\tilde{\mathcal{E}}_{i\nu}\bigr\rangle, \quad \forall\nu\in\{1,\dots,K\}, \qquad (23) $$

with

$$ \tilde{\mathcal{E}}_{i\nu} \equiv \frac{1}{\sum_{j\ne i} M_{j\nu} + 1}\Biggl[\frac{1}{2}\mathcal{D}_{ii} + \sum_{k\ne i} M_{k\nu}\Bigl(\mathcal{D}_{ik} - \frac{1}{2\sum_{j\ne i} M_{j\nu}}\sum_{l\ne i} M_{l\nu}\,\mathcal{D}_{lk}\Bigr)\Biggr]. \qquad (24) $$

The resulting optimal (with respect to (21)) assignments are given by

$$ \langle M_{i\nu}\rangle = \frac{\exp\bigl(-\mathcal{E}^*_{i\nu}/T\bigr)}{\sum_{\mu=1}^{K}\exp\bigl(-\mathcal{E}^*_{i\mu}/T\bigr)}. \qquad (25) $$

The technical details can be found in Appendix A. The reader should note that the potentials $\mathcal{E}^*_{i\nu}$ do not depend on the variables $(\langle M_{i1}\rangle,\dots,\langle M_{iK}\rangle)$.

We introduce an approximation which neglects terms of order $\mathcal{O}(1/N)$ to simplify the potentials $\mathcal{E}^*_{i\nu}$. The approximations $1/\bigl(\sum_{j\ne i} M_{j\nu}+1\bigr) \approx 1/\bigl(\sum_{j\ne i}\langle M_{j\nu}\rangle+1\bigr)$ and $1/\sum_{j\ne i} M_{j\nu} \approx 1/\sum_{j\ne i}\langle M_{j\nu}\rangle$ are correct in the limit of large $N$. To simplify the presentation further, we assume zero self-dissimilarities, $\mathcal{D}_{ii} = 0$, $\forall i$. The simplified "optimal" potentials

$$ \mathcal{E}^*_{i\nu} = \frac{1}{\sum_{j\ne i}\langle M_{j\nu}\rangle + 1}\sum_{k=1}^{N}\langle M_{k\nu}\rangle\Bigl(\mathcal{D}_{ik} - \frac{1}{2\sum_{j\ne i}\langle M_{j\nu}\rangle}\sum_{j=1}^{N}\langle M_{j\nu}\rangle\,\mathcal{D}_{jk}\Bigr) \qquad (26) $$

depend on the given distance matrix, the averaged assignment variables, and the cluster probabilities. The following algorithm estimates the assignment probabilities $\langle M_{i\nu}\rangle$ and the optimal potentials $\mathcal{E}^*_{i\nu}$ (defined in (26)) iteratively.

Algorithm I

    INITIALIZE  $\mathcal{E}^*_{i\nu}(0)$ and $\langle M_{i\nu}\rangle(0)$ randomly;
                temperature $T \leftarrow T_0$;
    WHILE $T > T_{\mathrm{FINAL}}$
        $t \leftarrow 0$;
        REPEAT
            E-like step:  estimate $\langle M_{i\nu}\rangle(t+1)$ as a function of $\mathcal{E}^*_{i\nu}(t)$;
            M-like step:  calculate $\mathcal{E}^*_{i\nu}(t+1)$ for given $\langle M_{i\nu}\rangle(t+1)$;
            $t \leftarrow t+1$;
        UNTIL all $\{\langle M_{i\nu}\rangle(t), \mathcal{E}^*_{i\nu}(t)\}$ satisfy (26);
        $T \leftarrow \eta T$;  $\langle M_{i\nu}\rangle(0) \leftarrow \langle M_{i\nu}\rangle(t)$;  $\mathcal{E}^*_{i\nu}(0) \leftarrow \mathcal{E}^*_{i\nu}(t)$;
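A compact sketch of Algorithm I with the naive mean-field potentials is given below. It drops the $j \ne i$ and $+1$ refinements of (26) as $\mathcal{O}(1/N)$ effects and uses an illustrative starting temperature and schedule, so it should be read as a simplified illustration of the update structure rather than the authors' implementation.

```python
import numpy as np

def mean_field_potentials(D, M):
    """Simplified optimal potentials of eq. (26), assuming D_ii = 0."""
    s = M.sum(axis=0)                                   # sum_j <M_jnu>; the j != i and +1
    s = np.where(s > 0, s, 1e-12)                       #   refinements are dropped as O(1/N)
    avg_to_cluster = (D @ M) / s                        # mean dissimilarity of i to cluster nu
    intra = 0.5 * (M * avg_to_cluster).sum(axis=0) / s  # half the mean intra-cluster dissimilarity
    return avg_to_cluster - intra[None, :]              # E*_inu

def softmax_assignments(E, T):
    """Eq. (25): <M_inu> = exp(-E*_inu/T) / sum_mu exp(-E*_imu/T)."""
    logits = -(E - E.min(axis=1, keepdims=True)) / T
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)

def algorithm_I(D, K, T0=None, T_final=1e-3, eta=0.9, inner=50, seed=0):
    """Deterministic annealing for pairwise clustering (Algorithm I), naive mean field."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    M = rng.dirichlet(np.ones(K), size=N)               # random admissible soft assignments
    T = T0 if T0 is not None else D.mean()              # heuristic start temperature
    while T > T_final:
        for _ in range(inner):
            E = mean_field_potentials(D, M)              # M-like step
            M = softmax_assignments(E, T)                # E-like step
        T *= eta
    return M

# Example: squared distances between two well-separated point clouds.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(3, 0.3, (15, 2))])
D = ((X[:, None] - X[None, :]) ** 2).sum(-1)
print(algorithm_I(D, K=2).argmax(axis=1))
```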
The algorithm decreases the temperature exponentially ($0 < \eta < 1$) and alternates the estimation of the data assignments for given potentials (E-like step) with an estimate of the potentials for given assignments (M-like step). This estimation procedure can be carried out sequentially or in parallel.³ A sequential version, in which the E-like step and the M-like step are performed for a randomly selected datum $i$, converges to a local minimum (with respect to $(\mathcal{E}_{i1},\dots,\mathcal{E}_{iK})$) of the upper bound on the free energy, since $\mathcal{E}^*_{i\nu}(t+1)$ is uniquely determined by $\{\langle M_{k\nu}\rangle(t+1)\}_{k\ne i}$, which have no explicit dependency on $\mathcal{E}^*_{i\nu}(t+1)$. The upper bound in (22) plays the role of a Lyapunov function for the update dynamics of the potentials $\mathcal{E}^*_{i\nu}(t+1)$ [32]. The sequential update scheme has been implemented in the clustering experiments (see Section 6). The outer loop of the algorithm reduces the temperature in an exponential fashion, i.e., we choose $T \leftarrow \eta T$. Other choices, such as linear annealing schedules, could be used as well and might yield superior optimization results since the search process is extended by a slower temperature reduction.

³ Experimentally, we observed oscillations for the parallel update of $\mathcal{E}^*_{i\nu}(t)$, as known from the parallel update of neural networks.

4.2 Equations for the Expected Data Assignments

The variational approach with a family of factorized distributions implicitly assumes that correlations between assignments can be neglected. A direct estimate of the average assignments allows us to check how well this assumption holds and what estimation errors are introduced by the underlying independence hypothesis. The detailed derivations of equations (27), (28), and (29) are summarized in Appendix B. The true expected assignments are given by

$$ \langle M_{i\alpha}\rangle = \Biggl\langle\frac{\exp\bigl(-\tilde{\mathcal{E}}_{i\alpha}/T\bigr)}{\sum_{\nu}\exp\bigl(-\tilde{\mathcal{E}}_{i\nu}/T\bigr)}\Biggr\rangle, \qquad (27) $$

$\tilde{\mathcal{E}}_{i\alpha}$ being defined in (24). The fraction $\exp(-\tilde{\mathcal{E}}_{i\alpha}/T)/\sum_{\nu}\exp(-\tilde{\mathcal{E}}_{i\nu}/T)$ in (27) implements a partition of unity, the smoothness of a transition from one cell of the partition to a neighboring cell being controlled by the inverse temperature $1/T$. The system of $N\times K$ equations (27) is computationally intractable since we would have to carry out the averaging of the partition of unity over an exponential number of assignment configurations.

Naively interchanging the averaging brackets with the nonlinear function in (27) yields the equations (25), (26); a refined mean-field approach, known as the TAP approach [35], models the feedback effects in strongly disordered clustering instances more faithfully than the naive approach. The refined expected assignments are

$$ \langle M_{i\alpha}\rangle = \frac{\exp\bigl(-(\langle\tilde{\mathcal{E}}_{i\alpha}\rangle - \tilde{h}_{i\alpha})/T\bigr)}{\sum_{\nu}\exp\bigl(-(\langle\tilde{\mathcal{E}}_{i\nu}\rangle - \tilde{h}_{i\nu})/T\bigr)}, \qquad (28) $$

where the corrections $\tilde{h}_{i\alpha}$, also called cavity fields, are given in (29) (Appendix B): they are proportional to $1/(2T)$, quadratic in the dissimilarities $\mathcal{D}_{ik}$, and couple the averaged assignments $\langle M_{k\mu}\rangle$ of all data $k$ and clusters $\mu$ through Kronecker deltas $\delta_{\alpha\mu}$ and the cluster probabilities $p_\mu$.

The question about the range of validity of (28) is subtle; it has been studied extensively in the statistical physics of disordered systems [33]. Empirically, we can measure average values of the assignments by Monte Carlo simulation. These estimates are inserted into (28) or into (25), which yields residual errors, i.e., the differences between the right and the left sides of both equations. The residual errors determine the quality of the TAP approximation compared to the naive mean-field approximation. According to our Monte Carlo experiments with matrices of Gaussian distributed random dissimilarity values ($N$ = 1,200), the TAP equations (28) estimate the average assignments $\langle M_{i\nu}\rangle$ with a residual error up to 50% smaller than that of the naive mean-field approximation.
The difference reaches a maximum for temperatures near the phase transition point, i.e., when degenerate clusters split into separate clusters. The naive mean-field equation is superior in the low temperature range. Furthermore, we observed that the improvements achieved by the TAP equations can be neglected for small problems ($N$ < 100) because of the $N \to \infty$ asymptotics.

5 PAIRWISE CLUSTERING AND EMBEDDING

Grouping data into clusters is an important concept in discovering structure. Apart from partitioning data into classes, the data analyst often relies on visual inspection of data. The task of embedding a dissimilarity data set $D$ into a $d$-dimensional Euclidian space, and thereby making it accessible to visual inspection, is known as multidimensional scaling [8], [36]. Usually, multidimensional scaling is formulated as an optimization problem for the coordinates $\{\mathbf{x}_i\}$ with costs

$$ \mathcal{H}^{\mathrm{mds}}\bigl(\{\mathbf{x}_i\}\bigr) = \frac{1}{2N}\sum_{i,k=1}^{N}\frac{\bigl(\|\mathbf{x}_i-\mathbf{x}_k\|^2 - \mathcal{D}_{ik}\bigr)^2}{\hat{\mathcal{D}}_{ik}}. \qquad (30) $$

The so-called stress function $\mathcal{H}^{\mathrm{mds}}$ was introduced by Kruskal [37]. With a constant normalization $\hat{\mathcal{D}}_{ik} = 1$, $\mathcal{H}^{\mathrm{mds}}$ measures the absolute stress; $\hat{\mathcal{D}}_{ik} = \mathcal{D}_{ik}$ penalizes relative stress.
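A direct transcription of the stress (30) is straightforward; the handling of zero dissimilarities in the relative-stress case is an implementation choice, not something prescribed by the paper.

```python
import numpy as np

def mds_stress(X, D, relative=False):
    """Kruskal-type stress of eq. (30): squared mismatch between embedded squared
    distances ||x_i - x_k||^2 and the target dissimilarities D_ik, normalized by
    D_hat = 1 (absolute stress) or D_hat = D_ik (relative stress)."""
    N = D.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    D_hat = np.where(D > 0, D, 1.0) if relative else np.ones_like(D)
    return float(((d2 - D) ** 2 / D_hat).sum() / (2 * N))
```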
5.1 Mean-Field Approximation of Pairwise Clustering by Central Clustering

In this section, we establish a connection between the clustering and the multidimensional scaling problem. The strategy of combining data clustering and data embedding in a Euclidian space is based on a variational approach to maximum entropy estimation as discussed in Section 4.1. The coordinates of the data points in the embedding space are estimated in such a way that the statistics of the resulting cluster structure matches the statistics of the original pairwise clustering solution. The relation of this new principle for structure preserving data embedding to standard multidimensional scaling is summarized in the following diagram:

    {D_ik}              --->  H^pc(M | {D_ik})   --->  P^Gb( H^pc(M | {D_ik}) )
      |  H^mds                                           |  D_KL( P^Gb(H^cc) || P^Gb(H^pc) )
      v                                                  v
    {||x_i - x_k||^2}   --->  H^cc(M | {x_i})    --->  P^Gb( H^cc(M | {x_i}) )

Multidimensional scaling offers the left path from dissimilarities to coordinates, whereas we advocate the right path. The variational approach to the mean-field approximation involved in the right path requires us to specify a parametrized family of factorized Gibbs distributions. We choose the factorized Gibbs distributions (12) based on the cost function for central clustering $\mathcal{H}^{\mathrm{cc}}(M \mid \{\mathbf{x}_i\})$ and use the embedding coordinates $\{\mathbf{x}_i\}$ as the variational parameters. This approach is motivated by the identity

$$ \sum_{i=1}^{N} M_{i\nu}\,\|\mathbf{x}_i-\mathbf{y}_\nu\|^2 = \frac{1}{2Np_\nu}\sum_{i=1}^{N}\sum_{k=1}^{N} M_{i\nu}M_{k\nu}\,\|\mathbf{x}_i-\mathbf{x}_k\|^2 \qquad (31) $$

with

$$ \mathbf{y}_\nu = \frac{\sum_{i=1}^{N} M_{i\nu}\,\mathbf{x}_i}{\sum_{i=1}^{N} M_{i\nu}}, \qquad (32) $$

which yields a correct approximation for pairwise clustering instances with $\mathcal{D}_{ik} = \|\mathbf{x}_i-\mathbf{x}_k\|^2$.

Suppose we have found a stationary solution of the mean-field equations (25), (26). For the clustering problem it suffices to consider the mean assignments $\langle M_{i\nu}\rangle$, with the parameters $\mathcal{E}^*_{i\nu}$ being auxiliary variables. The identity (31) allows us to interpret these variables as the squared distances to the cluster centroids under the assumption of Euclidian data. In the multidimensional scaling problem the coordinates $\mathbf{x}_i$ are the unknown quantities. If we restrict the potentials $\mathcal{E}_{i\nu}$ to be of the form $\|\mathbf{x}_i-\mathbf{y}_\nu\|^2$ with the centroid definition (32), we have specified a new family of approximating distributions, defined in (12), with parameters $\{\mathbf{x}_i \in \mathbb{R}^d : 1 \le i \le N\}$. The effective dimensionality of the parameter space is $\min\{d, K-1\}\times N$ instead of $(K-1)\times N$, which is a significant reduction, especially in the case of low-dimensional embeddings ($d \ll K$). The criterion for determining the embedding coordinates is

$$ \frac{\partial}{\partial\mathbf{x}_i}\Bigl(\mathcal{F}(\mathcal{H}^{\mathrm{cc}}) + \bigl\langle\mathcal{H}^{\mathrm{pc}} - \mathcal{H}^{\mathrm{cc}}\bigr\rangle\Bigr) = 0, \qquad (33) $$

which approximately yields the coordinates

$$ \mathcal{K}_i\,\mathbf{x}_i \approx \frac{1}{2}\sum_{\nu=1}^{K}\langle M_{i\nu}\rangle\bigl(\|\mathbf{y}_\nu\|^2 - \mathcal{E}^*_{i\nu}\bigr)\Bigl(\mathbf{y}_\nu - \sum_{\mu=1}^{K}\langle M_{i\mu}\rangle\,\mathbf{y}_\mu\Bigr), \qquad (34) $$

$$ \mathcal{K}_i = \bigl\langle\mathbf{y}\mathbf{y}^{\mathrm{T}}\bigr\rangle_i - \langle\mathbf{y}\rangle_i\langle\mathbf{y}\rangle_i^{\mathrm{T}}, \qquad \langle\mathbf{y}\rangle_i = \sum_{\nu=1}^{K}\langle M_{i\nu}\rangle\,\mathbf{y}_\nu. \qquad (35) $$

Details of the derivation are summarized in Appendix C. The coordinates $\{\mathbf{x}_i\}$ and $\{\mathbf{y}_\nu\}$ are determined by iteratively solving (34) according to the following algorithm:

Algorithm II: Structure Preserving MDS

    INITIALIZE  $\mathbf{x}_i(0)$, $\mathbf{y}_\nu(0)$ and $\langle M_{i\nu}\rangle(0) \in (0,1)$ randomly;
                temperature $T \leftarrow T_0$;
    WHILE $T > T_{\mathrm{FINAL}}$
        $t = 0$;
        REPEAT
            E-like step:  estimate $\langle M_{i\nu}\rangle(t+1)$ as a function of $\{\mathbf{x}_i, \mathbf{y}_\nu\}$;
            M-like step:
                REPEAT
                    calculate $\mathbf{x}_i(t+1)$ given $\langle M_{i\nu}\rangle(t+1)$, $\mathbf{y}_\nu(t+1)$;
                    update $\mathbf{y}_\nu(t+1)$ to fulfill the centroid condition;
                UNTIL convergence;
            $t \leftarrow t+1$;
        UNTIL convergence;
        $T \leftarrow \eta T$;  $\langle M_{i\nu}\rangle(0) \leftarrow \langle M_{i\nu}\rangle(t)$;  $\mathbf{x}_i(0) \leftarrow \mathbf{x}_i(t)$;  $\mathbf{y}_\nu(0) \leftarrow \mathbf{y}_\nu(t)$;
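A sketch of the M-like inner step is given below: it solves (34) with the centroid covariance (35) for each datum and then re-imposes the centroid condition (32). A least-squares solve is used to cope with (near-)singular $\mathcal{K}_i$ at low temperature; the potentials `E` are assumed to be the mean-field potentials of (26) (e.g., from the `mean_field_potentials` sketch above), and the small regularization constant is an illustrative choice.

```python
import numpy as np

def update_coordinates(M, Y, E):
    """Solve K_i x_i = 0.5 * sum_nu <M_inu> (||y_nu||^2 - E*_inu)(y_nu - <y>_i),
    cf. eqs. (34) and (35), as one least-squares problem per datum."""
    N, K = M.shape
    d = Y.shape[1]
    X = np.zeros((N, d))
    y_norm2 = (Y ** 2).sum(axis=1)                                    # ||y_nu||^2
    for i in range(N):
        y_bar = M[i] @ Y                                              # <y>_i, eq. (35)
        K_i = (M[i][:, None] * Y).T @ Y - np.outer(y_bar, y_bar)      # centroid covariance
        rhs = 0.5 * ((M[i] * (y_norm2 - E[i]))[None, :] @ (Y - y_bar)).ravel()
        X[i] = np.linalg.lstsq(K_i, rhs, rcond=None)[0]               # robust to singular K_i
    return X

def update_centroids(X, M):
    """Centroid condition, eq. (32)."""
    return (M.T @ X) / (M.sum(axis=0)[:, None] + 1e-12)
```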
To understand the properties of the algorithm, we have to recollect the key idea for deriving the mean-field approximation: the statistics of the approximating system with the cost function $\mathcal{H}^{\mathrm{cc}}$ has to be optimally adjusted to the statistics of the original system. This fact implies that we are not able to determine the variational parameters in the limit of fixed statistics, e.g., in the limit of zero temperature. As can easily be seen, equation (34) is singular for $T = 0$, and asymptotic results require us to apply l'Hospital's rule. The derived system of transcendental equations, given by (15) with quadratic distortions, by (34), and by the centroid condition, explicitly reflects the dependencies between the clustering procedure and the Euclidian representation. Simultaneous solution of these equations leads to an efficient algorithm which interleaves the multidimensional scaling process and the clustering process and which avoids an artificial separation into two uncorrelated data processing steps.

6 RESULTS

We demonstrate the properties of the proposed clustering Algorithms I and II by three classes of experiments:

i) benchmark optimization experiments compare deterministic annealing with a greedy gradient descent method and a linkage algorithm for pairwise clustering in Section 6.1;
ii) simultaneous pairwise clustering and embedding is performed on artificial and real-world data in Section 6.2;
iii) pairwise clustering as a segmentation technique for textured images is discussed in Section 6.3.

Fig. 1. Histograms of clustering costs for a protein data set (a) and a random clustering instance (b). The gray and white bins denote the results of optimization with deterministic annealing and gradient descent, respectively. The Mean Average Dissimilarity Linkage solution has costs of $\mathcal{H}^{\mathrm{pc}}$ = 759.1.

Fig. 2. Embedding of 20-dimensional data into two dimensions: (a) projection of the data onto the first two principal components; (b) cluster preserving embedding with Algorithm II. Only 10% of the data are shown.

6.1 Benchmark Experiments for Deterministic Annealing

The theoretical derivations of the deterministic annealing Algorithms I and II are motivated by the known robustness properties of maximum entropy inference. To test this claim, a large number of randomly initialized clustering experiments has been performed on (i) dissimilarities taken from protein sequences and on (ii) dissimilarities randomly drawn from a uniform distribution on [0, 1.0]. The dissimilarity values between pairs of protein sequences are determined by a sequence alignment program which takes biochemical and structural information into account. In essence, the alignment program measures the number of amino acids which have to be exchanged to transform the first sequence into the second. The sequences belong to different protein families like hemoglobin, myoglobin, and other globins. The protein dissimilarities, sorted according to a clustering solution with $N$ = 226, $K$ = 9 clusters, are displayed in Fig. 3. The two cases, dissimilarities from protein sequence comparisons and random dissimilarities, span the spectrum between ordered and random clustering instances. The benchmark clustering experiments are designed to validate the claim that superior clustering results are achieved by deterministic annealing compared to standard clustering techniques based on gradient descent.
The histograms of 1,000 clustering runs with different initializations are summarized in Fig. 1 for (a) the protein dissimilarity data ($K$ = 9) and for (b) the random data ($N$ = 100, $K$ = 10). Deterministic annealing clearly outperformed the conventional gradient descent method in the random case, with even the worst deterministic annealing solution being better than the best gradient descent solution. In the case of the protein data, the average cost of a deterministic annealing solution lies within the best one percent of the gradient descent solutions, i.e., an average deterministic annealing solution is better than the best out of 100 gradient descent solutions. The standard Mean Average Dissimilarity Linkage (MADL) algorithm, also known as Ward's method (see [4], Sec. 3.2.7), yields a clustering result with costs $\mathcal{H}^{\mathrm{pc}}$ = 759.1, compared to the best (experimentally achieved) result with $\mathcal{H}^{\mathrm{pc}}$ = 730.9. All experiments support our claim that deterministic annealing yields substantially better solutions for comparable computing time.

Fig. 3. Similarity matrix of 226 protein sequences of the globin family (a): dark gray levels correspond to high similarity values. Clustering with embedding in two dimensions (b); clustering of MDS embeddings found by global (c) or local (d) stress minimization.

6.2 Clustering and Embedding Results

The properties of the described algorithm for simultaneous Euclidian embedding and data clustering are illustrated by two different experiments:

1) clustering and dimension reduction of inhomogeneously distributed data;
2) clustering of real-world proximity data from protein sequences.

The capacity for finding low-dimensional representations of high-dimensional data is demonstrated with a data set drawn from a mixture of 20 Gaussians in 20 dimensions. The centers of the Gaussians are randomly distributed on the unit sphere. The covariance matrices are diagonal, with values randomly drawn from the set {0.1, 0.2, 0.4}. The best linear projection according to principal component analysis is shown in Fig. 2a. The positions of the data points are denoted by the letters which name the respective mixture component. Other linear projection methods like projection pursuit [38] yield comparable results since no direction is distinguished in the data generation procedure. Simultaneous clustering and embedding by Algorithm II distributes the data in two dimensions with approximately the same group structure as in the high-dimensional space. All clusters but cluster (A) are preserved and well separated. A few data points are assigned to the wrong cluster. The algorithm selects a representation in a completely unsupervised fashion and preserves the "essential" structure present in the grouping formation and in the topology of the data set. This procedure for dimension reduction is weakly related to the idea of principal curves [39] or principal surfaces.
Fig. 3 summarizes the clustering result ($K$ = 9) for a real-world data set of 226 protein sequences. Families of protein sequences are abbreviated by capital letters. The gray level visualization of the dissimilarity matrix, with dark values for similar protein sequences, shows the formation of distinct "squares" along the main diagonal. These squares correspond to the partition discovered by clustering, the resulting clustering costs being $\mathcal{H}^{\mathrm{pc}}$ = 735.2. The embedding in two dimensions (Fig. 3b) shows intercluster distances which are in good agreement with the similarity values of the data. The best experimentally determined solution ($\mathcal{H}^{\mathrm{pc}}$ = 730.9) without the embedding constraint exceeded the quality of the solution in Fig. 3b by only 0.83 percent. The results are consistent with the biological classification. The labels HALL and MYTUY in Fig. 3b are individual globin sequences which are known to play an "outlier role" in the globin family. The corrections by the cavity fields (29) are in the range of 10 to 20 percent of the assignment costs $\mathcal{E}_{i\nu}$ (18.0% for a Monte Carlo simulation at $1/T$ = 2.5). We have compared the clustering solutions of Algorithm II with the results of a two-step procedure, i.e., first embedding the data using Kruskal's multidimensional scaling criterion (30) and then clustering the embedded data by the EM procedure of Algorithm I. Depending on the embedding criterion, absolute (Fig. 3c) or relative (Fig. 3d) stress, the visualizations of the protein dissimilarities reveal little to almost no cluster structure. This fact is reflected in high clustering costs, $\mathcal{H}^{\mathrm{pc}}$ = 782.7 (833.7) for the embedding guided by absolute and relative stress, respectively. It is obvious from Figs. 3b-3d that simultaneous clustering and embedding by the structure preserving MDS algorithm preserves the characteristics of the original cluster structure much better than the classical MDS techniques with subsequent central clustering.

An application of pairwise clustering to a linguistic data set is shown in Fig. 4, in which 825 word fragments have been compared by a dynamic programming algorithm [40]. The dissimilarity matrix is visualized on the left side; dark gray values denote high similarity values. The matrix is ordered according to the determined clustering solution with eleven clusters ($K$ = 11). Word fragments with similar beginnings or endings have a high likelihood of being grouped together, as can be seen from the labels in Fig. 4b. The corrections by the cavity fields are again in the ten percent range (7.7% for $1/T$ = 3.0).

Fig. 4. Similarity matrix for a data set with 825 word fragments (a). The calculated clustering solution with embedding in two dimensions (b). The labels denote the groups by common word beginnings.

6.3 Unsupervised Texture Segmentation by Pairwise Clustering

Segmenting a digital image into homogeneous regions, e.g., regions of constant or slowly varying intensity, constant color, or uniform texture, arises as a fundamental problem in image processing. Following Geman et al. [41], we formulate texture segmentation as a grouping problem with constraints on valid region shapes. The grouping problem is based on pairwise dissimilarities between texture patches which correspond to pixel blocks of the image. Three major modifications compared to [41] have been introduced:

1) Dissimilarities are calculated based on a Gabor wavelet scale-space representation.
2) The normalized pairwise clustering cost function (16) is used as the objective function for image segmentation.
3) The presented deterministic annealing algorithm replaces the Monte Carlo method proposed in [41].
The calculation of dissimilarity matrices from textured images can be separated into three stages. In the first stage, the image $I$ is transformed into a Gabor wavelet representation. The Gabor transformation possesses a bandpass characteristic and is known to display good texture discrimination properties [42], [43]. We have used four orientations at three different scales, separated by a full octave, resulting in $L$ = 12 feature images $I^{(l)}$, $1 \le l \le L$, plus the raw gray scale image $I^{(0)} = I$. In the second stage, the empirical feature distribution function $F_i^{(l)}$ is calculated separately for every feature image $I^{(l)}$ and every image block $B_i$, $1 \le i \le N$. The blocks $B_i$ are centered on a regular grid and may overlap with each other. In the third stage, pairs of empirical distribution functions belonging to the same feature image are compared using the Kolmogorov–Smirnov distance. For a pair of blocks $(B_i, B_k)$ the latter is defined by

$$ \mathcal{D}^{(l)}_{ik} \equiv \mathcal{D}^{(l)}\bigl(F_i^{(l)}, F_k^{(l)}\bigr) := \max_x\bigl|F_i^{(l)}(x) - F_k^{(l)}(x)\bigr| \in [0; 1]. \qquad (36) $$

Following this three stage procedure, a set of $L+1$ independently calculated dissimilarity matrices is generated, which are combined with a simple maximum rule, $\mathcal{D}_{ik} = \max_{0\le l\le L}\mathcal{D}^{(l)}_{ik}$. This is reminiscent of Julesz' theory of texture perception [44], which conjectures that a dissimilarity in a single feature channel is sufficient to discriminate textures. The procedure for generating dissimilarity data from images is schematically summarized in Fig. 5.

Fig. 5. Texture segmentation by pairwise clustering: local properties of image patches, e.g., intensity differences and local frequencies, are extracted. The respective empirical feature distribution functions are compared with the Kolmogorov-Smirnov statistics to yield dissimilarity values between image blocks.
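A sketch of the block comparison is shown below: the Kolmogorov–Smirnov distance (36) between empirical distribution functions and the maximum rule over feature channels. Evaluating the supremum on a fixed grid and representing blocks as index windows are simplifications chosen here for illustration and are not specified in the paper.

```python
import numpy as np

def ks_distance(a, b, grid):
    """Eq. (36): sup-norm distance between the empirical distribution functions
    of the feature values in two image blocks, evaluated on a common grid."""
    Fa = np.searchsorted(np.sort(a), grid, side="right") / a.size
    Fb = np.searchsorted(np.sort(b), grid, side="right") / b.size
    return float(np.abs(Fa - Fb).max())

def block_dissimilarities(channels, blocks, grid):
    """Combine per-channel KS distances with the maximum rule D_ik = max_l D^(l)_ik.
    `channels` is a list of feature images, `blocks` a list of (slice, slice) windows."""
    N = len(blocks)
    D = np.zeros((N, N))
    for i in range(N):
        for k in range(i + 1, N):
            D[i, k] = D[k, i] = max(
                ks_distance(ch[blocks[i]].ravel(), ch[blocks[k]].ravel(), grid)
                for ch in channels)
    return D
```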
Fig. 6. An image of size 512 × 512 with five different textures (a) is segmented by pairwise data clustering. The segmentation result by deterministic pairwise clustering (25) is shown in (b). Segmentation errors, displayed by black pixels in (c), are located at segment boundaries.

We have applied the algorithm to over a hundred randomly composed texture images, one being depicted in Fig. 6a. The resulting segmentation based on the deterministic annealing algorithm for pairwise clustering (Fig. 6b) shows that the five different textures are well discriminated. The difference image (Fig. 6c) between the resulting segmentation and the ground truth demonstrates that incorrect assignments are only observed in the border regions, where statistics belonging to different textures are mixed together. Corrections by the cavity field terms amount to changes of around six to eight percent in the assignments. The segmentation was postprocessed with additional penalties for thin regions, as suggested in [41], to enforce local texture consistency and to prevent too large a fragmentation of the texture regions. Moreover, only a small fraction (<1%) of the dissimilarities calculated from 64 × 64 blocks was actually processed, including all pairs of adjoining blocks and a small random neighborhood. More details on the neighborhood selection, the adapted mean-field approach to sparse clustering, and performance statistics for a large number of textured images can be found in [45] and [32].

7 DISCUSSION

The problem of grouping data can be regarded as one of the initial, although fundamental, steps of information processing and data analysis. Concepts in artificial intelligence, as well as in pattern recognition and signal processing, depend on robust and reliable data clustering principles, with robustness being mandatory with respect to unobservable as well as to noisy events. In this paper, we have developed a maximum entropy framework for pairwise data clustering. A well-known approximation scheme from statistical physics, the mean-field approximation, has been derived in two different ways:

1) a variational method minimizes the Kullback–Leibler divergence between the original Gibbs distribution for the data assignments and a parametrized family of factorized distributions;
2) the expectation values of the data assignments are calculated in a direct fashion.

The second technique allows us to correct for the influence of small fluctuations in the data assignments. The variational approximation of pairwise clustering by central clustering yields a structure preserving multidimensional scaling algorithm which simultaneously clusters data and embeds them in a Euclidian space. This algorithm can be used for non-linear dimension reduction and for visualization purposes. Results of the pairwise data clustering algorithms in analyzing protein and linguistic data and in segmenting textured images have been reported. Benchmark clustering experiments support our claim that deterministic annealing yields substantially better results than conventional clustering concepts based on gradient descent minimization. The outlined strategy for analyzing stochastic algorithms for pairwise clustering should be considered as a general program for deriving robust optimization algorithms based on the maximum entropy principle; analogous results for the metric multidimensional scaling problem and for the hierarchical data clustering problem [46] will be reported elsewhere.

ACKNOWLEDGMENTS

It is a pleasure to thank M. Vingron and D. Bavelier for providing the protein data and the linguistic data, respectively. We thank J. Puzicha for the segmentation experiments and H.-J. Klock for the MDS experiments. This work was supported by the Federal Ministry of Education and Science BMBF.

APPENDIX A

In this appendix, we derive the mean-field equations for the pairwise data clustering problem by minimizing the upper bound on the free energy given in (22) with respect to the variational parameters $\mathcal{E}_{i\nu}$. Taking derivatives of the upper bound on the free energy yields

$$ \frac{\partial}{\partial\mathcal{E}_{i\alpha}}\Bigl(\mathcal{F}(\mathcal{H}^0) + \bigl\langle\mathcal{H}^{\mathrm{pc}} - \mathcal{H}^0\bigr\rangle\Bigr) = \frac{\partial\langle\mathcal{H}^{\mathrm{pc}}\rangle}{\partial\mathcal{E}_{i\alpha}} - \sum_{\nu=1}^{K}\frac{\partial\langle M_{i\nu}\rangle}{\partial\mathcal{E}_{i\alpha}}\,\mathcal{E}_{i\nu}. \qquad (37) $$