A Bayesian alternative to mutual information for the hierarchical clustering of dependent random variables

Guillaume Marrelec1,∗, Arnaud Messé2, Pierre Bellec3

August 8, 2016

1 Sorbonne Universités, UPMC Univ Paris 06, CNRS, INSERM, Laboratoire d'imagerie biomédicale (LIB), F-75013, Paris, France
2 Department of Computational Neuroscience, University Medical Center Hamburg-Eppendorf, Hamburg University, Hamburg, Germany
3 Département d'informatique et recherche opérationnelle, Centre de recherche de l'institut universitaire de gériatrie de Montréal, Université de Montréal, Montréal, Qc, Canada
∗ E-mail: marrelec@lib.upmc.fr
Abstract

The use of mutual information as a similarity measure in agglomerative hierarchical clustering (AHC) raises an important issue: some correction needs to be applied for the dimensionality of variables. In this work, we formulate the decision of merging dependent multivariate normal variables in an AHC procedure as a Bayesian model comparison. We found that the Bayesian formulation naturally shrinks the empirical covariance matrix towards a matrix set a priori (e.g., the identity), provides an automated stopping rule, and corrects for dimensionality using a term that scales up the measure as a function of the dimensionality of the variables. Also, the resulting log Bayes factor is asymptotically proportional to the plug-in estimate of mutual information, with an additive correction for dimensionality in agreement with the Bayesian information criterion. We investigated the behavior of these Bayesian alternatives (in exact and asymptotic forms) to mutual information on simulated and real data. An encouraging result was first derived on simulations: the hierarchical clustering based on the log Bayes factor outperformed off-the-shelf clustering techniques as well as raw and normalized mutual information in terms of classification accuracy. On a toy example, we found that the Bayesian approaches led to results that were similar to those of mutual information clustering techniques, with the advantage of an automated thresholding. On real functional magnetic resonance imaging (fMRI) datasets measuring brain activity, it identified clusters consistent with the established outcome of standard procedures. On this application, normalized mutual information had a highly atypical behavior, in the sense that it systematically favored very large clusters. These initial experiments suggest that the proposed Bayesian alternatives to mutual information are a useful new tool for hierarchical clustering.

Keywords: agglomerative hierarchical clustering; Bayesian model comparison; BIC; mutual information; multivariate normal distributions; normalized mutual information.
1 Introduction
Cluster analysis aims at uncovering natural groups of objects in a multivari-
ate dataset [see Jain (2010) for a review]. In the vast variety of methods
used in cluster analysis, an agglomerative hierarchical clustering (AHC) is
a generic procedure that sequentially merges pairs of clusters that are most
similar according to an arbitrary function called a similarity measure, thereby
generating a nested set of partitions, also called hierarchy (Duda et al.,
2000). The choice of the similarity measure indirectly defines the shape of
the clusters, and thus plays a critical role in the clustering process. While
this choice is guided by the features of the problem at hand, it is also of-
ten restricted to a limited number of commonly used measures, such as the
Euclidean distance or Pearson correlation coefficient (D’haeseleer, 2005). In
the present work, we focus on the clustering of random variables based on
their mutual information, which has recently gained in popularity in clus-
ter analysis, notably in the field of genomics (Butte and Kohane, 2000;
Zhou et al., 2004; Dawy et al., 2006; Priness et al., 2007) and in functional
magnetic resonance imaging (fMRI) data analysis (Stausberg et al., 2009;
Benjaminsson et al., 2010; Kolchinsky et al., 2014). Mutual information is a
general measure of statistical dependency derived from information theory
(Shannon, 1948; Kullback, 1968; Cover and Thomas, 1991). A key feature
of mutual information is its ability to capture nonlinear interactions for any
type of random variables (Steuer et al., 2002); also of interest, it indifferently
applies to univariate or multivariate variables and can thus be applied to
clusters of arbitrary size. Yet, mutual information is an extensive measure
that increases with variable dimensionality. In addition, Î, the finite-sample
estimator of mutual information, suffers from a dimensionality-dependent
bias (see Appendix A). Several authors have proposed to correct mutual
information for dimensionality by using a “normalized” version of mutual
information (Li et al., 2001; Kraskov et al., 2005; Kraskov and Grassberger,
2009). In the clustering literature, normalized mutual information is rou-
tinely used. However, the impact of such a correction procedure has not been
extensively evaluated so far.
In the present paper, we consider Bayesian model-based clustering (Scott and Symons, 1971; Binder, 1981; Heller and Ghahramani, 2005; Jain, 2010) as an alternative to mutual information for the hierarchical clustering of dependent multivariate normal variables. Specifically, we derive a similarity measure by comparing two models: M_I, where X_i and X_j are independent (i.e., the covariance between any element of X_i and any element of X_j is equal to zero), against M_D, where the covariance between X_i and X_j can be set to any admissible value. The proposed similarity measure is then the log Bayes factor in favor of M_D against M_I (Kass and Raftery, 1995). With appropriate priors on the model parameters, we show that the similarity measure s(X_i, X_j) between X_i and X_j can be expressed in closed form. As will be shown below, the Bayesian formulation naturally (1) allows for clustering even when the sample covariance matrix is ill-defined; (2) provides an automated stopping rule, namely to stop as soon as s(X_i, X_j) ≤ 0 for every pair of remaining clusters; (3) corrects for dimensionality using a term that scales up the measure as a function of the dimensionality of the variables; and (4) provides both a local and a global measure of similarity, in that it can be used to decide which pair of variables to cluster at each step (local level) as well as to compare different levels of the resulting hierarchy (global level). Asymptotically (i.e., when the number of samples N → ∞), the similarity measure is a linear function of mutual information, with a penalization factor that is in agreement with the Bayesian information criterion (BIC) (Schwarz, 1978). In this sense, the present paper makes an explicit connection between Bayesian model comparison for the clustering of dependent random variables and mutual information. The code corresponding to the Bayesian approach is freely available online¹.

¹ https://github.com/SIMEXP/arXiv-1501.05194/releases/tag/1.0
We evaluated an AHC procedure based on this approach on synthetic datasets. The experiment assessed how the procedure behaved, in both its exact and asymptotic forms, compared to other approaches, including raw and normalized mutual information. We finally tested the new measures on
two real datasets: a toy dataset and functional magnetic resonance imaging
(fMRI) data.
2 Analysis
In the following, we develop a Bayesian solution to the problem of cluster-
ing detailed above. We first introduce the model together with the Bayesian
framework and a general expression for the similarity measure. In sub-
sequent subsections, we derive a closed-form expression for the marginal
model likelihoods under both assumptions of dependence and independence
as well as exact and asymptotic expressions for the similarity measure. We
then provide a description of the hierarchical agglomerative clustering algo-
rithm resulting from the present development. We examine how the same
framework can be conveniently used to compare nested partitions, that is,
different levels of a hierarchy. We also deal with the issue of setting the
hyperparameters. Finally, we show how the Bayesian solution can naturally
provide for an automatic stopping rule.
2.1 Bayesian model comparison

Let X be a D-dimensional multivariate normal variable with known mean² µ and unknown covariance matrix Σ. Define X_i and X_j as two disjoint subvectors of X (of size D_i and D_j, respectively), and X_{i∪j} as their union (of size D_{i∪j} = D_i + D_j). Assume that we have to decide whether we should cluster X_i and X_j based on their level of dependence. To this end, consider two competing models M_I and M_D. In M_I, X_i and X_j are independent and the distribution of X_{i∪j} can be decomposed as the product of the marginal distributions of X_i and X_j. Under such a condition, Σ_{i∪j}, the restriction of Σ to X_{i∪j}, is block diagonal with blocks Σ_i and Σ_j, the restrictions of Σ to X_i and X_j, respectively. In M_D, by contrast, X_i and X_j are dependent. Given a dataset {x_1, ..., x_N} of N independent and identically distributed (i.i.d.) realizations of X and S the corresponding sample sum-of-square matrix
\[
S = \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^t,
\]
we propose to quantify the similarity between X_i and X_j as the log Bayes factor, that is, the log ratio of the marginal model likelihoods of M_D versus M_I:
\[
s(X_i, X_j) = \ln \frac{p(S_{i\cup j} \mid M_D)}{p(S_{i\cup j} \mid M_I)}. \tag{1}
\]
Each marginal model likelihood can be expressed as an integral over the model parameters as described below.

² For the sake of simplicity, we assumed a known mean in the following theoretical development. If the mean is unknown (as will be the case in the simulation and real data sections), this development is still valid, with N replaced by N − 1 and S by
\[
\sum_{n=1}^{N} (x_n - m)(x_n - m)^t, \quad \text{where } m = \frac{1}{N}\sum_{n=1}^{N} x_n \text{ is the sample mean.}
\]
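To make the data summary concrete, here is a minimal sketch (Python/NumPy; not the authors' released code, and the function name and array layout are our own assumptions) of the sum-of-square matrix S defined above, together with the unknown-mean variant of footnote 2.

```python
# Minimal sketch (not the authors' released code): the sum-of-square matrix S
# of Section 2.1, with the unknown-mean variant described in footnote 2.
import numpy as np

def sum_of_squares(x, mu=None):
    """x: (N, D) array of N realizations of X; mu: known mean vector, or None
    to use the sample mean m (then N should be replaced by N - 1 downstream)."""
    x = np.asarray(x, dtype=float)
    if mu is None:
        mu = x.mean(axis=0)          # sample mean m of footnote 2
    centered = x - mu
    return centered.T @ centered     # S = sum_n (x_n - mu)(x_n - mu)^t

# Example: N = 200 samples of a D = 5 dimensional variable
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 5))
S = sum_of_squares(x)
```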
2.2 Marginal model likelihood under the hypothesis of dependence

Let us first calculate p(S_{i∪j} | M_D), the marginal model likelihood under the hypothesis of dependence. Expressing this quantity as a function of the model parameters yields
\[
p(S_{i\cup j} \mid M_D) = \int p(S_{i\cup j} \mid M_D, \Sigma_{i\cup j})\, p(\Sigma_{i\cup j} \mid M_D)\, d\Sigma_{i\cup j}. \tag{2}
\]
Calculation of the integral requires knowing the likelihood p(S_{i∪j} | M_D, Σ_{i∪j}) and the prior distribution p(Σ_{i∪j} | M_D) of the covariance matrix under M_D. With multivariate normal data, S_{i∪j} given Σ_{i∪j} is Wishart distributed with N degrees of freedom and scale matrix Σ_{i∪j} (Anderson, 2003, Corollary 7.2.2), leading to the following likelihood
\[
p(S_{i\cup j} \mid M_D, \Sigma_{i\cup j}) = \frac{|S_{i\cup j}|^{\frac{N-D_{i\cup j}-1}{2}}}{Z(D_{i\cup j}, N)}\, |\Sigma_{i\cup j}|^{-\frac{N}{2}} \exp\left[-\frac{1}{2}\mathrm{tr}\left(\Sigma_{i\cup j}^{-1} S_{i\cup j}\right)\right], \tag{3}
\]
where Z(d, n) is the inverse of the normalization constant
\[
Z(d, n) = 2^{\frac{nd}{2}}\, \pi^{\frac{d(d-1)}{4}} \prod_{d'=1}^{d} \Gamma\left(\frac{n+1-d'}{2}\right).
\]
As to the prior distribution, this quantity is here set as a conjugate prior, namely an inverse-Wishart distribution with ν_{i∪j} degrees of freedom and inverse scale matrix Λ_{i∪j} (Gelman et al., 2004, §3.6)
\[
p(\Sigma_{i\cup j} \mid M_D) = \frac{|\Lambda_{i\cup j}|^{\frac{\nu_{i\cup j}}{2}}}{Z(D_{i\cup j}, \nu_{i\cup j})}\, |\Sigma_{i\cup j}|^{-\frac{\nu_{i\cup j}+D_{i\cup j}+1}{2}} \exp\left[-\frac{1}{2}\mathrm{tr}\left(\Sigma_{i\cup j}^{-1} \Lambda_{i\cup j}\right)\right]. \tag{4}
\]
Bringing Equations (3) and (4) together into Equation (2) yields for the marginal model likelihood
\[
p(S_{i\cup j} \mid M_D) = \frac{|\Lambda_{i\cup j}|^{\frac{\nu_{i\cup j}}{2}}\, |S_{i\cup j}|^{\frac{N-D_{i\cup j}-1}{2}}}{Z(D_{i\cup j}, N)\, Z(D_{i\cup j}, \nu_{i\cup j})} \int |\Sigma_{i\cup j}|^{-\frac{N+\nu_{i\cup j}+D_{i\cup j}+1}{2}} \exp\left\{-\frac{1}{2}\mathrm{tr}\left[(\Lambda_{i\cup j}+S_{i\cup j})\,\Sigma_{i\cup j}^{-1}\right]\right\} d\Sigma_{i\cup j}.
\]
The integrand is proportional to an inverse-Wishart distribution with N + ν_{i∪j} degrees of freedom and scale matrix Λ_{i∪j} + S_{i∪j}, leading to
\[
p(S_{i\cup j} \mid M_D) = \frac{|S_{i\cup j}|^{\frac{N-D_{i\cup j}-1}{2}}}{Z(D_{i\cup j}, N)}\, \frac{Z(D_{i\cup j}, N+\nu_{i\cup j})}{Z(D_{i\cup j}, \nu_{i\cup j})}\, \frac{|\Lambda_{i\cup j}|^{\frac{\nu_{i\cup j}}{2}}}{|S_{i\cup j}+\Lambda_{i\cup j}|^{\frac{N+\nu_{i\cup j}}{2}}}. \tag{5}
\]
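As a numerical sanity check of Equation (5), the following hedged sketch (Python/SciPy; variable names are ours, not the authors') evaluates ln p(S_{i∪j} | M_D) in the log domain. It assumes S, Λ, and ν denote the sum-of-square matrix, the prior inverse scale matrix, and the prior degrees of freedom of the block X_{i∪j}.

```python
# Hedged sketch of Equation (5) in the log domain; S, Lam, nu, N all refer to
# the block X_{i u j}. Not the authors' implementation.
import numpy as np
from scipy.special import gammaln

def log_Z(d, n):
    """ln Z(d, n), with Z the constant defined after Equation (3)."""
    dp = np.arange(1, d + 1)
    return (n * d / 2) * np.log(2.0) + (d * (d - 1) / 4) * np.log(np.pi) \
        + gammaln((n + 1 - dp) / 2).sum()

def log_marginal_dep(S, Lam, nu, N):
    """ln p(S_{i u j} | M_D) of Equation (5)."""
    D = S.shape[0]
    logdet = lambda A: np.linalg.slogdet(A)[1]
    # The |S| term requires N > D; it is common to M_D and M_I and therefore
    # cancels in the log Bayes factor of Equation (1).
    return ((N - D - 1) / 2) * logdet(S) - log_Z(D, N) \
        + log_Z(D, N + nu) - log_Z(D, nu) \
        + (nu / 2) * logdet(Lam) - ((N + nu) / 2) * logdet(S + Lam)
```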
2.3 Marginal model likelihood under the hypothesis of independence

We can now calculate p(S_{i∪j} | M_I), the marginal model likelihood under the hypothesis of independence. If M_I holds, then Σ_{i∪j} is block-diagonal with two blocks Σ_i and Σ_j, the submatrix restrictions of Σ_{i∪j} to X_i and X_j, respectively. Introduction of the model parameters therefore yields for the marginal likelihood
\[
p(S_{i\cup j} \mid M_I) = \int p(S_{i\cup j} \mid M_I, \Sigma_i, \Sigma_j)\, p(\Sigma_i, \Sigma_j \mid M_I)\, d\Sigma_i\, d\Sigma_j. \tag{6}
\]
To calculate this integral, we again need to know the likelihood p(S_{i∪j} | M_I, Σ_i, Σ_j) and the prior distribution p(Σ_i, Σ_j | M_I) of the two blocks of the covariance matrix under M_I. The likelihood is the same as for M_D and has the form of Equation (3). Furthermore, since Σ_{i∪j} is here block diagonal, we have |Σ_{i∪j}| = |Σ_i| |Σ_j| and tr(Σ_{i∪j}^{-1} S_{i∪j}) = tr(Σ_i^{-1} S_i) + tr(Σ_j^{-1} S_j), where S_i and S_j are the restrictions of S to X_i and X_j, respectively. Consequently, the likelihood can be further expanded as
\[
p(S_{i\cup j} \mid M_I, \Sigma_i, \Sigma_j) = \frac{|S_{i\cup j}|^{\frac{N-D_{i\cup j}-1}{2}}}{Z(D_{i\cup j}, N)} \prod_{k\in\{i,j\}} |\Sigma_k|^{-\frac{N}{2}} \exp\left[-\frac{1}{2}\mathrm{tr}\left(\Sigma_k^{-1} S_k\right)\right]. \tag{7}
\]
As to the prior distribution, assuming no prior dependence between Σ_i and Σ_j yields
\[
p(\Sigma_i, \Sigma_j \mid M_I) = p(\Sigma_i \mid M_I)\, p(\Sigma_j \mid M_I). \tag{8}
\]
For the sake of consistency, p(Σ_i | M_I) and p(Σ_j | M_I) are set equal to p(Σ_i | M_D) and p(Σ_j | M_D), respectively, which are in turn obtained by marginalization of p(Σ_{i∪j} | M_D) as given by Equation (4). For k ∈ {i, j}, p(Σ_k | M_I) can be found to have an inverse-Wishart distribution with ν_k = ν_{i∪j} − D_{i∪j} + D_k degrees of freedom and inverse scale matrix Λ_k, the restriction of Λ_{i∪j} to X_k (Press, 2005, §5.2)
\[
p(\Sigma_k \mid M_I) = \frac{|\Lambda_k|^{\frac{\nu_k}{2}}}{Z(D_k, \nu_k)}\, |\Sigma_k|^{-\frac{\nu_k+D_k+1}{2}} \exp\left[-\frac{1}{2}\mathrm{tr}\left(\Lambda_k \Sigma_k^{-1}\right)\right]. \tag{9}
\]
Incorporating Equations (7), (8), and (9) into Equation (6) yields
\[
p(S_{i\cup j} \mid M_I) = \frac{|S_{i\cup j}|^{\frac{N-D_{i\cup j}-1}{2}}}{Z(D_{i\cup j}, N)} \prod_{k\in\{i,j\}} \frac{|\Lambda_k|^{\frac{\nu_k}{2}}}{Z(D_k, \nu_k)} \int |\Sigma_k|^{-\frac{N+\nu_k+D_k+1}{2}} \exp\left\{-\frac{1}{2}\mathrm{tr}\left[(S_k+\Lambda_k)\,\Sigma_k^{-1}\right]\right\} d\Sigma_k.
\]
Each integrand is proportional to an inverse-Wishart distribution with N + ν_k degrees of freedom and scale matrix S_k + Λ_k, leading to
\[
p(S_{i\cup j} \mid M_I) = \frac{|S_{i\cup j}|^{\frac{N-D_{i\cup j}-1}{2}}}{Z(D_{i\cup j}, N)} \prod_{k\in\{i,j\}} \frac{Z(D_k, N+\nu_k)}{Z(D_k, \nu_k)}\, \frac{|\Lambda_k|^{\frac{\nu_k}{2}}}{|S_k+\Lambda_k|^{\frac{N+\nu_k}{2}}}. \tag{10}
\]
2.4 Log Bayes factor of dependence versus independence

Let us now express the Bayesian similarity measure by incorporating Equations (5) and (10) into Equation (1), yielding
\[
s(X_i, X_j) = \sum_{k\in\{i,j\}} \frac{N+\nu_k}{2} \ln|\Lambda_k+S_k| - \frac{N+\nu_{i\cup j}}{2} \ln|\Lambda_{i\cup j}+S_{i\cup j}| + \mathrm{cst} \tag{11}
\]
with
\[
\mathrm{cst} = \frac{\nu_{i\cup j}}{2}\ln|\Lambda_{i\cup j}| + \sum_{d=1}^{D_{i\cup j}}\left[\ln\Gamma\left(\frac{N+\nu_{i\cup j}+1-d}{2}\right) - \ln\Gamma\left(\frac{\nu_{i\cup j}+1-d}{2}\right)\right] - \sum_{k\in\{i,j\}}\left\{\frac{\nu_k}{2}\ln|\Lambda_k| + \sum_{d=1}^{D_k}\left[\ln\Gamma\left(\frac{N+\nu_k+1-d}{2}\right) - \ln\Gamma\left(\frac{\nu_k+1-d}{2}\right)\right]\right\}. \tag{12}
\]
Another form for s(X_i, X_j) is
\[
s(X_i, X_j) = \Delta\phi_{i\cup j} - \Delta\phi_i - \Delta\phi_j, \tag{13}
\]
with
\[
\Delta\phi_k = \phi(N+\nu_k, \Lambda_k+S_k) - \phi(\nu_k, \Lambda_k), \qquad k \in \{i, j, i\cup j\}, \tag{14}
\]
and
\[
\phi(n, A) = -\frac{n}{2}\ln|A| + \sum_{d=1}^{\dim(A)} \ln\Gamma\left(\frac{n+1-d}{2}\right). \tag{15}
\]
∆φ_k quantifies, up to a constant that cancels out in s(X_i, X_j), the amount by which the data support a model of multivariate normal distributions with unrestricted covariance matrix for X_k.
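The φ-based form of Equations (13)–(15) is straightforward to evaluate numerically. The sketch below is our own rendering (function names, index-set arguments, and the variable nu_union are assumptions, not the authors' code); it takes the full-size S and Λ matrices, the index sets of the two clusters, and the degrees of freedom ν_{i∪j} of the union block, deriving ν_i and ν_j as in Section 2.3.

```python
# Hedged sketch of Equations (13)-(15). idx_i and idx_j are the (disjoint)
# index sets of X_i and X_j within the full matrices S and Lam; nu_union is
# nu_{i u j}. Not the authors' implementation.
import numpy as np
from scipy.special import gammaln

def phi(n, A):
    """phi(n, A) of Equation (15)."""
    d = A.shape[0]
    return -(n / 2) * np.linalg.slogdet(A)[1] \
        + gammaln((n + 1 - np.arange(1, d + 1)) / 2).sum()

def delta_phi(nu_k, Lam_k, S_k, N):
    """Delta phi_k of Equation (14) for one block k."""
    return phi(N + nu_k, Lam_k + S_k) - phi(nu_k, Lam_k)

def similarity(S, Lam, nu_union, idx_i, idx_j, N):
    """s(X_i, X_j) of Equation (13)."""
    idx_u = np.concatenate([idx_i, idx_j])
    block = lambda idx: (Lam[np.ix_(idx, idx)], S[np.ix_(idx, idx)])
    out = delta_phi(nu_union, *block(idx_u), N)
    for idx in (idx_i, idx_j):
        nu_k = nu_union - len(idx_u) + len(idx)  # nu_k = nu_{i u j} - D_{i u j} + D_k
        out -= delta_phi(nu_k, *block(idx), N)
    return out
```

Note that, unlike the plug-in estimate of mutual information, this expression remains finite even when the sample sum-of-square matrix is singular, since only |Λ_k + S_k| and |Λ_k| appear.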
2.5 Asymptotic form of the log Bayes factor

We can now provide an asymptotic expression for s(X_i, X_j). Define Ŝ_k as the standard sample covariance matrix, i.e., S_k = N Ŝ_k. When N → ∞, we can use Stirling's approximation (Abramowitz and Stegun, 1972, p. 257) to expand the expression φ of Equation (15), leading to (see Appendix B)
\[
\begin{aligned}
s(X_i, X_j) &= \frac{N}{2}\ln\frac{|\hat S_i|\,|\hat S_j|}{|\hat S_{i\cup j}|} - \frac{1}{2}\left[\frac{D_{i\cup j}(D_{i\cup j}+1)}{2} - \sum_{k\in\{i,j\}}\frac{D_k(D_k+1)}{2}\right]\ln N + O(1) \\
&= N\,\hat I(X_i, X_j) - \frac{D_i D_j}{2}\ln N + O(1),
\end{aligned} \tag{16}
\]
where
\[
\hat I(X_i, X_j) = \frac{1}{2}\ln\frac{|\hat S_i|\,|\hat S_j|}{|\hat S_{i\cup j}|}
\]
is the plug-in estimator of mutual information for a multivariate normal distribution. Alternatively, N Î(X_i, X_j) can be seen as the minimum discrimination information for the independence of X_i and X_j (Kullback, 1968, Chap. 12, §3.6).
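For reference, the asymptotic form (16) only requires the sample covariance matrix. A hedged sketch follows (our naming; S_hat, idx_i, and idx_j are illustrative assumptions, not the authors' interface).

```python
# Hedged sketch of Equation (16): plug-in Gaussian mutual information and its
# BIC-style dimensionality penalty. S_hat is the sample covariance matrix of
# the variables in the two clusters (our naming, not the authors').
import numpy as np

def plugin_mi(S_hat, idx_i, idx_j):
    """I-hat(X_i, X_j) = (1/2) ln(|S_hat_i| |S_hat_j| / |S_hat_{i u j}|)."""
    idx_u = np.concatenate([idx_i, idx_j])
    logdet = lambda idx: np.linalg.slogdet(S_hat[np.ix_(idx, idx)])[1]
    return 0.5 * (logdet(idx_i) + logdet(idx_j) - logdet(idx_u))

def similarity_bic(S_hat, idx_i, idx_j, N):
    """Asymptotic similarity N * I-hat - (D_i * D_j / 2) * ln N of Eq. (16)."""
    return N * plugin_mi(S_hat, idx_i, idx_j) \
        - 0.5 * len(idx_i) * len(idx_j) * np.log(N)
```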
2.6 Hierarchical agglomerative clustering

A hierarchy on a set of D variables is a nested set of partitions (C_l)_{l=0}^{D−1}, where C_l is a partition of {1, ..., D} into D − l clusters (Duda et al., 2000). A hierarchical agglomerative clustering (AHC) is a generic procedure to generate such a hierarchy, outlined in pseudo-code in Algorithm 1. The main steps of the algorithm are: (1) initialize the partition with singletons (line 2); (2) derive a matrix s_l where each element represents the similarity between two clusters X_i and X_j of C_l, based on an arbitrary function s(X_i, X_j) (line 4); (3) identify the two clusters that are most similar³ (line 5); (4) form a new partition identical to the previous one, except that the two most similar clusters are now merged (lines 6–7); (5) iterate Steps 2–4 until the partition has only one single element which covers the whole set of variables (line 3). In the case of the methods proposed here, the similarity measure is given by either Equation (11) for the exact formulation or Equation (29) for the asymptotic BIC approximation; a minimal code sketch of the procedure is given after Algorithm 1.
³ There may be more than one pair of clusters which maximize the similarity function. Most implementations of AHC proceed by selecting arbitrarily one such pair (e.g., the first one to be detected). In the in-house implementation we used, the pair was selected randomly amongst all these pairs. This was done to properly capture the instability of the algorithm. In such a form, AHC may not be deterministic anymore, in that two runs of the same algorithm on the same dataset may result in different hierarchies.
1: return hierarchy (C_l)_{l=0}^{D−1}
2: C_0 ← {{X_1}, ..., {X_D}}
3: for l = 0, ..., D−2 do
4:   s_l ← [s(X_i, X_j)]_{X_i, X_j ∈ C_l}
5:   (i∗, j∗) ← argmax_{i,j} s_l(i, j)
6:   C_{l+1} ← C_l \ {X_{i∗}, X_{j∗}}
7:   C_{l+1} ← C_{l+1} ∪ {X_{i∗} ∪ X_{j∗}}
8: end for
Algorithm 1: General description of the hierarchical agglomerative clustering algorithm.
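The sketch below is one possible rendering of Algorithm 1 (our cluster representation and tie-breaking, not the released code). It assumes `similarity(a, b)` is any function of two clusters given as tuples of variable indices, e.g., one of the measures sketched above with the data matrices bound in a closure.

```python
# Hedged sketch of Algorithm 1. Ties are broken by taking the first maximum,
# unlike the randomized tie-breaking described in footnote 3.
import itertools

def ahc(D, similarity):
    """Return the nested partitions C_0, ..., C_{D-1} of Algorithm 1."""
    partition = [(d,) for d in range(D)]          # line 2: singletons
    hierarchy = [list(partition)]
    while len(partition) > 1:                     # lines 3-8: D - 1 merges
        pairs = list(itertools.combinations(range(len(partition)), 2))
        scores = [similarity(partition[a], partition[b]) for a, b in pairs]
        a, b = pairs[scores.index(max(scores))]   # line 5: most similar pair
        merged = partition[a] + partition[b]      # lines 6-7: merge the pair
        partition = [c for k, c in enumerate(partition) if k not in (a, b)]
        partition.append(merged)
        hierarchy.append(list(partition))
    return hierarchy
```

For clarity the sketch recomputes all pairwise similarities at every step; in practice only the rows and columns of s_l involving the newly merged cluster change between iterations.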
2.7 Comparing distinct levels of the hierarchy

The last development aims at providing a way to compare nested partitions, i.e., different levels of the hierarchy. Once the hierarchical clustering has been performed, each step is associated with a partition of X. Assume that, at level l, the partition reads {X_1, ..., X_K} and that, at step l+1, X_i and X_j are clustered. Denote by C_l the assumption that the partition at level l is the correct partition of X. The global improvement brought by the clustering of X_i and X_j between steps l and l+1 can be quantified by the log ratio between the marginal model likelihoods
\[
\ln\frac{p(S \mid C_{l+1})}{p(S \mid C_l)},
\]
where both quantities can be computed in a manner similar to what was done for the similarity measure. For instance, if C_l is true, then Σ is block-diagonal with K blocks Σ_k, the submatrix restrictions of Σ to the X_k. Introducing the model parameters then yields
\[
p(S \mid C_l) = \int p(S \mid C_l, \Sigma_1, \ldots, \Sigma_K)\, p(\Sigma_1, \ldots, \Sigma_K \mid C_l) \prod_{k=1}^{K} d\Sigma_k. \tag{17}
\]
In a way similar to what was done previously, the likelihood p(S | C_l, Σ_1, ..., Σ_K) can be expanded as
\[
p(S \mid C_l, \Sigma_1, \ldots, \Sigma_K) = \frac{|S|^{\frac{N-D-1}{2}}}{Z(D, N)} \prod_{k=1}^{K} |\Sigma_k|^{-\frac{N}{2}} \exp\left[-\frac{1}{2}\mathrm{tr}\left(\Sigma_k^{-1} S_k\right)\right]. \tag{18}
\]
Turning to the prior distribution p(Σ_1, ..., Σ_K | C_l) and assuming no prior dependence between the Σ_k's, we can set
\[
p(\Sigma_1, \ldots, \Sigma_K \mid C_l) = \prod_{k=1}^{K} p(\Sigma_k \mid C_l). \tag{19}
\]
Each p(Σ_k | C_l) can be obtained by the marginalization of p(Σ | C_l), which is here taken as an inverse-Wishart distribution with ν degrees of freedom and inverse scale matrix the D-by-D diagonal matrix Λ. Note that such a prior on Σ is compatible with the prior used earlier for Σ_{i∪j} if one sets ν_{i∪j} = ν − D + D_{i∪j} and if Λ_{i∪j} is the restriction of Λ to X_{i∪j} (Press, 2005, §5.2). We then have, with ν_k = ν − D + D_k and Λ_k the restriction of Λ to X_k,
\[
p(\Sigma_k \mid C_l) = \frac{|\Lambda_k|^{\frac{\nu_k}{2}}}{Z(D_k, \nu_k)}\, |\Sigma_k|^{-\frac{\nu_k+D_k+1}{2}} \exp\left[-\frac{1}{2}\mathrm{tr}\left(\Lambda_k \Sigma_k^{-1}\right)\right]. \tag{20}
\]
Incorporating Equations (18), (19), and (20) into Equation (17) and integrating leads to
\[
p(S \mid C_l) = \frac{|S|^{\frac{N-D-1}{2}}}{Z(D, N)} \prod_{k=1}^{K} \frac{Z(D_k, N+\nu_k)}{Z(D_k, \nu_k)}\, \frac{|\Lambda_k|^{\frac{\nu_k}{2}}}{|\Lambda_k+S_k|^{\frac{N+\nu_k}{2}}}. \tag{21}
\]
The same calculation can be performed for p(S | C_{l+1}). The result is the same as in Equation (21), except that the product is composed of K − 1 terms. Of these terms, K − 2 correspond to clusters that are unchanged from C_l to C_{l+1} and, as a consequence, are identical to those of Equation (21). The (K−1)th term corresponds to the cluster obtained by the merging of X_i and X_j. As a consequence, the log Bayes factor reads
\[
\ln\frac{p(S \mid C_{l+1})}{p(S \mid C_l)} = \ln\left[\frac{Z(D_{i\cup j}, N+\nu_{i\cup j})}{Z(D_{i\cup j}, \nu_{i\cup j})}\, \frac{|\Lambda_{i\cup j}|^{\frac{\nu_{i\cup j}}{2}}}{|\Lambda_{i\cup j}+S_{i\cup j}|^{\frac{N+\nu_{i\cup j}}{2}}}\right] - \sum_{k\in\{i,j\}} \ln\left[\frac{Z(D_k, N+\nu_k)}{Z(D_k, \nu_k)}\, \frac{|\Lambda_k|^{\frac{\nu_k}{2}}}{|\Lambda_k+S_k|^{\frac{N+\nu_k}{2}}}\right].
\]
But this quantity is nothing other than s(X_i, X_j). In other words, we proved that
\[
\ln\frac{p(S \mid C_{l+1})}{p(S \mid C_l)} = s(X_i, X_j), \tag{22}
\]
i.e., s(X_i, X_j), the local measure of similarity between X_i and X_j, can be used to compute the global measure of relative probability between two successive levels of the hierarchy.
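One practical consequence of Equation (22) is that the log Bayes factor between any level l of the hierarchy and the initial partition of singletons is the running sum of the merge scores, since the per-merge contributions telescope. The sketch below is a hedged illustration (the convention that merge_scores[l] holds the score of the merge between levels l and l+1 is ours, not the authors').

```python
# Consequence of Equation (22): comparing any level of the hierarchy to C_0.
# merge_scores[l] is assumed to hold s(X_i, X_j) for the merge performed
# between levels l and l + 1 (our convention).
import numpy as np

def level_log_bayes_factors(merge_scores):
    """ln p(S | C_l) - ln p(S | C_0) for l = 0, ..., D - 1."""
    return np.concatenate([[0.0], np.cumsum(merge_scores)])
```

Assuming equal prior probabilities for the nested partitions, the level maximizing this cumulative score is the a posteriori most probable one among the levels of the hierarchy.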
2.8 Setting the hyperparameters
Hyperparameter selection is often a thorny issue in Bayesian analysis. We here considered two approaches. The first approach (coined BayesCov) is to set the degrees of freedom to the lowest value that still corresponds to a well-defined distribution, that is, ν = D, and a diagonal scale matrix that optimizes the marginal model likelihood of Equation (21) before any clustering (that is, with K = D clusters and D_k = 1 for all k), yielding (see Appendix C)
\[
\Lambda_{dd} = \frac{\nu - D + 1}{N} S_{dd},
\]
10