Info-Clustering: An Efficient Algorithm by Network Information Flow

Chung Chan, Ali Al-Bashabsheh and Qiaoqiao Zhou

Abstract—Motivated by the fact that entities in a social network or biological system often interact by exchanging information, we propose an efficient info-clustering algorithm that can group entities into communities using a parametric max-flow algorithm. This is a meaningful special case of the info-clustering paradigm where the dependency structure is graphical and can be learned readily from data.

I. INTRODUCTION

Info-clustering was proposed in [1] as an application of network information theory to the problem of clustering in machine learning. It regards each object as a piece of information, namely a random variable, and groups random variables with a sufficiently large amount of mutual information together. Clustering is often an important first step in studying a large biological system such as the human connectome and genome. It can also identify communities in a social network so that resources can be allocated efficiently based on the communities discovered. Since entities in a social or biological system often possess information and interact with each other by the transmission of information, clustering them by their mutual information intuitively gives meaningful results.

Using the multivariate mutual information (MMI) in [2] as the similarity measure, info-clustering defines a hierarchy of clusters of possibly different sizes at different levels of mutual information. The clustering solution is intimately related to the principal sequence of partitions (PSP) [3] of a submodular function, namely, that of the entropy function of the set of random variables to be clustered. From this, it follows that the clustering solution is unique and solvable using a polynomial number of oracle calls to evaluate the entropy function. However, in general, learning the entropy function of a finite set V of random variables from data takes exponential time in the size of V [4]. The computation of the PSP [3, 5], [1, Algorithm 3] without any approximation also takes Ω(|V|²) calls to a submodular function minimization (SFM) algorithm, which in turn makes Ω(|V|⁵) oracle calls to evaluate the submodular function. Hence, the practicality of the general info-clustering algorithm is limited by both the sample complexity and the computational complexity.

Fortunately, info-clustering reduces to faster algorithms under special statistical models that are also easier (compared to a general model) to learn from data. For instance, under the Markov tree model, info-clustering reduces to an edge-filtering procedure that runs in O(|V|²) time [6]. Furthermore, this procedure coincides with an existing functional genomic clustering method by mutual information relevance networks (MIRN) [7]. While rediscovering a simple clustering algorithm under the Markov tree simplification, the info-clustering paradigm provides a theoretical justification of the MIRN algorithm and helps discover how the algorithm may fail when the Markov tree assumption does not hold [1, Example 6].

In this work, we propose an efficient info-clustering algorithm under a different graphical model called the pairwise independent network (PIN) [8, 9]. Using the idea of the matroidal network link model in [10], the MMI has a concrete operational meaning as the maximum network broadcast throughput [11]. The info-clustering solution therefore identifies clusters with large intra-cluster communication rates, which naturally map to communities of closely related entities in a social network. Learning the PIN model simplifies to learning the weights of O(|V|²) edges in a graph on V.
In a social network, the weight of each edge can simply be the amount/rate of communication between the edge's incident nodes.

As shown in [1, Proposition 9], the info-clustering solution for the PIN model can be obtained from the PSP of the cut function of a weighted graph. It is well-known that faster SFM algorithms are possible for the cut function using min-cut or max-flow algorithms (e.g., see [12–14]). An algorithm was given in [15] that computes the PSP efficiently by reducing the problem to a parametric max-flow problem, where the capacities of the edges are certain monotonic functions of a parameter. The reduction is carefully done such that the parametric max-flow algorithm in [16] can compute the PSP in O(|V|³√|E|) time, where E is the set of edges with non-zero weight. We will adapt this algorithm to compute the info-clustering solution and modify it to improve the performance further.

C. Chan and Q. Zhou are with the Institute of Network Coding at the Chinese University of Hong Kong, the Shenzhen Key Laboratory of Network Coding Key Technology and Application, China, and the Shenzhen Research Institute of the Chinese University of Hong Kong (email: [email protected]). A. Al-Bashabsheh was with the Institute of Network Coding at the Chinese University of Hong Kong. He is now with the Big Data and Brain Computing (BDBC) center at Beihang University, Beijing, China (e-mail: [email protected]). The work described in this paper was supported by a grant from the University Grants Committee of the Hong Kong Special Administrative Region, China (Project No. AoE/E-02/08), and supported partially by a grant from the Shenzhen Science and Technology Innovation Committee (JSGG20160301170514984), the Chinese University of Hong Kong (Shenzhen), China. The work of C. Chan was supported in part by The Vice-Chancellor's One-off Discretionary Fund of The Chinese University of Hong Kong (Project Nos. VCF2014030 and VCF2015007), and a grant from the University Grants Committee of the Hong Kong Special Administrative Region, China (Project No. 14200714).

II. PRELIMINARIES ON INFO-CLUSTERING

A. Formulation

Let V be the finite index set of the objects we want to cluster. Without loss of generality, we assume

    V = [|V|] := {1, ..., |V|}  and  |V| > 1.

The idea of info-clustering is to treat each object i ∈ V as a random variable Z_i (denoted in sans serif font) taking values from a finite set Z_i (denoted in the usual math font), and cluster the objects according to their mutual information. P_{Z_V} denotes the distribution of the entire vector of random variables

    Z_V := (Z_i | i ∈ V).

We will illustrate the idea of info-clustering via a simple example shown in Fig. 1a, where V = [3] = {1,2,3} and the corresponding random variables are defined as

    Z_1 := (X_a, X_c)
    Z_2 := (X_a, X_b)        (2.1a)
    Z_3 := (X_b, X_c),

with X_a, X_b and X_c being independent uniformly random variables with entropies

    H(X_a) = H(X_b) = 1,  H(X_c) = 5.        (2.1b)

This is a PIN (Definition 2.2) with correlation represented by the weighted triangle G shown in Fig. 1a, characterized by the weight function c where

    c({1,2}) = c({2,3}) := H(X_a) = H(X_b) = 1        (2.1c)
    c({1,3}) := H(X_c) = 5.

The vertex set is V := [3] and the edge set is

    E := supp(c) = {{1,2}, {2,3}, {1,3}}.        (2.1d)

Note that, for ease of comparison, this is the same graph used as an example in [15] to illustrate the algorithm.

Fig. 1: Identifying the clusters of the PIN in (2.1a) by brute-force search over subsets satisfying the threshold constraint (2.9): (a) the weighted graph G that represents the PIN defined in (2.1a); (b) the broadcast of 2 message bits m1 and m2 with capacities defined in (2.1c); (c) the info-clustering solution in (2.11).

A formal definition of the (hyper-)graphical source model is as follows:

Definition 2.1 (Definition 2.4 of [17]) Z_V is a hypergraphical source w.r.t. a hypergraph (V, E, ξ) with edge function ξ : E → 2^V \ {∅} iff, for some independent (hyper)edge variables X_e for e ∈ E with H(X_e) > 0,

    Z_i := (X_e | e ∈ E, i ∈ ξ(e)),  for i ∈ V.        (2.2)

The weight function c : 2^V \ {∅} → ℝ of a hypergraphical source is defined as

    c(B) := H(X_e | e ∈ E, ξ(e) = B)        (2.3a)

with support

    supp(c) := {B ∈ 2^V \ {∅} | c(B) > 0}.        (2.3b)

The PIN model [9] is an example, where the corresponding hypergraph is a graph.

Definition 2.2 ([9]) Z_V is a pairwise independent network (PIN) iff it is hypergraphical w.r.t. a graph (V, E, ξ) with edge function ξ : E → V² \ {(i,i) | i ∈ V} (i.e., no self loops). ✷
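For concreteness, the weight function (2.3a) and the entropy function of the example PIN (2.1) can be evaluated mechanically: for a hypergraphical source, H(Z_B) is the total entropy of the independent edge variables incident on B. The following minimal Python sketch encodes this; the names are illustrative, not from an official implementation.

```python
# Weight function c of the PIN in (2.1c): edge -> entropy of its edge variable.
c = {frozenset({1, 2}): 1, frozenset({2, 3}): 1, frozenset({1, 3}): 5}
V = {1, 2, 3}

def h(B):
    """Entropy function (2.13): H(Z_B) is the total entropy of the
    independent edge variables X_e incident on at least one node in B."""
    return sum(w for e, w in c.items() if e & set(B))

# Sanity checks against (2.1): H(Z_1) = H(X_a) + H(X_c) = 6, H(Z_V) = 7.
assert h({1}) == 6 and h({2}) == 2 and h(V) == 7
```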
The mutual information among multiple random variables is measured by the multivariate mutual information (MMI) defined in [2] as

    I(Z_V) := min_{P ∈ Π′(V)} I_P(Z_V),  with        (2.4a)

    I_P(Z_V) := 1/(|P|−1) [ Σ_{C∈P} H(Z_C) − H(Z_V) ]        (2.4b)
              = 1/(|P|−1) D( P_{Z_V} ∥ Π_{C∈P} P_{Z_C} ),

and Π′(V) being the set of partitions of V into at least 2 non-empty disjoint subsets of V. We may also write I_P(Z_V) more explicitly as

    I_P(Z_V) = I(Z_{C_1} ∧ ··· ∧ Z_{C_k})        (2.5)

for P = {C_1, ..., C_k}. Note that Shannon's mutual information I(Z_1 ∧ Z_2) is the special case when P is a bipartition. It is sometimes convenient to expand I_P(Z_V) using Shannon's mutual information [2, (5.18)] as follows:

    I_P(Z_V) = I(Z_{C_1} ∧ ··· ∧ Z_{C_k})
             = 1/(k−1) Σ_{i=1}^{k−1} I(Z_{C_i} ∧ Z_{C_{i+1} ∪ ··· ∪ C_k}).        (2.6)

For the example with the random vector defined in (2.1a),

    I(Z_{v,w}) = I(Z_v ∧ Z_w) = c({v,w})
               = 1 for {v,w} ∈ {{1,2},{2,3}};  5 for {v,w} = {1,3},        (2.7a)

which reduces to Shannon's mutual information, and

    I(Z_[3]) = min{ I_{{2,3},{1}}(Z_[3]), I_{{1,2},{3}}(Z_[3]), I_{{1,3},{2}}(Z_[3]), I_{{1},{2},{3}}(Z_[3]) }
             = min{ I(Z_2,Z_3 ∧ Z_1), I(Z_1,Z_2 ∧ Z_3), I(Z_1,Z_3 ∧ Z_2),
                    [I(Z_1,Z_3 ∧ Z_2) + I(Z_1 ∧ Z_3)]/2 }        (2.7b)
             = min{ 6, 6, 2, (2+5)/2 } = 2,

where we have applied (2.6) to calculate I_{{1},{2},{3}}(Z_[3]) for the partition into singletons as the average of the value I(Z_1,Z_3 ∧ Z_2) of the cut that separates node 2 from nodes 1 and 3, and the value I(Z_1 ∧ Z_3) of the cut that further separates node 1 from node 3. Note that the sequence of two cuts effectively partitions the vertex set into singletons. From this expansion, it is clear that the partition into singletons cannot be optimal in this case, since the mutual information between nodes 1 and 3 is very large.
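Because the example is tiny, the MMI (2.4) can be verified by exhaustive search over partitions. The sketch below, which is for illustration only, evaluates (2.4b) with the hypergraphical entropy function of the previous sketch and recovers I(Z_[3]) = 2 together with its optimal partition.

```python
c = {frozenset({1, 2}): 1, frozenset({2, 3}): 1, frozenset({1, 3}): 5}
V = [1, 2, 3]

def h(B):
    # Entropy function (2.13) of the hypergraphical source (2.1).
    return sum(w for e, w in c.items() if e & set(B))

def partitions(items):
    # Enumerate all partitions of a list as lists of blocks.
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        yield [[first]] + p                       # first in its own block
        for i in range(len(p)):                   # first joins block i
            yield p[:i] + [[first] + p[i]] + p[i + 1:]

def I_P(p):
    # The averaged divergence (2.4b).
    return (sum(h(C) for C in p) - h(V)) / (len(p) - 1)

mmi, opt = min((I_P(p), p) for p in partitions(V) if len(p) >= 2)
print(mmi, opt)   # 2.0 [[2], [1, 3]]: the MMI (2.7b) and partition (2.12)
```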
Indeed, the optimal partition turns out to be a clustering of the random variables into correlated groups. In general, the set of optimal partitions to (2.4a), denoted as Π∗(Z_V), forms a semi-lattice w.r.t. the partial order P ⪯ P′ for partitions P and P′ when

    ∀C ∈ P, ∃C′ ∈ P′ : C ⊆ C′.        (2.8)

P ≺ P′ denotes the strict inequality. There is a unique finest/minimum partition P∗(Z_V), referred to as the fundamental partition for Z_V [2, Theorem 5.2]. For a threshold γ ∈ ℝ, the set of clusters is defined as [1, Definition 1]

    C_γ(Z_V) := maximal{ B ⊆ V | |B| > 1, I(Z_B) > γ },        (2.9)

where maximal F is used to denote the inclusion-wise maximal elements of F, i.e.,

    maximal F := { B ∈ F | ∄B′ ⊋ B with B′ ∈ F }.        (2.10)

Proposition 2.1 ([1, Theorem 5]) C_γ(Z_V) = P∗(Z_V) \ {{i} | i ∈ V} with γ = I(Z_V), i.e., the non-singleton subsets in the fundamental partition are the maximal subsets (clusters) with MMI larger than that of the entire set. ✷

For the example, applying the definition (2.9) of clusters with the MMI calculated in (2.7), the clustering solution is

    C_γ(Z_[3]) = {[3]}    for γ ∈ (−∞, 2)
               = {{1,3}}  for γ ∈ [2, 5)        (2.11)
               = ∅        for γ ∈ [5, ∞).

The info-clustering solution above consists of two clusters shown in Fig. 1c for different intervals of the threshold γ. For γ ∈ [2, 5), the subset {1,3} is the only feasible solution that satisfies the threshold constraint in (2.9). For γ < 2, the entire set [3] also satisfies the threshold constraint and is maximal. Recall from (2.7b) that there is a unique optimal partition for I(Z_[3]), which is therefore the finest optimal partition

    P∗(Z_[3]) = { {1,3}, {2} },        (2.12)

with {1,3} ∈ C_2(Z_[3]). As expected from Proposition 2.1, the non-singleton element {1,3} is the only possible subset with MMI larger than 2.

It turns out that the computation of the MMI, the fundamental partition, and the entire info-clustering solution can be done in strongly polynomial time from the principal sequence of partitions (PSP) of the entropy function

    h : B ⊆ V ↦ H(Z_B).        (2.13)

The PSP is a more general mathematical structure [3] in combinatorial optimization defined for a submodular function. More precisely, a real-valued set function g : 2^V → ℝ is said to be submodular iff

    g(B_1) + g(B_2) ≥ g(B_1 ∩ B_2) + g(B_1 ∪ B_2)        (2.14)

for all B_1, B_2 ⊆ V. The entropy function, in particular, is a submodular function [18]. The PSP of the submodular function g is the characterization of the solutions to the following for all γ ∈ ℝ:

    ĝ_γ(V) := min_{P ∈ Π(V)} g_γ[P],        (2.15a)

referred to as the Dilworth truncation [19], where Π(V) is the set of partitions of V into one or more non-empty disjoint subsets,

    g_γ[P] := Σ_{C ∈ P} g_γ(C)        (2.15b)
    g_γ(C) := g(C) − γ.        (2.15c)

(N.b., Π′(V) in (2.4a) is Π(V) but without the trivial partition {V}, i.e., Π′(V) = Π(V) \ {{V}}.) For every γ, submodularity of g implies that there exists a unique finest/minimum (w.r.t. the partial order (2.8)) optimal partition to (2.15a), denoted as P∗(γ). It can be characterized as

    P∗(γ) = P_ℓ  ∀γ ∈ [γ_ℓ, γ_{ℓ+1}), ℓ ∈ {0, ..., N}        (2.16a)

for some integer N > 0, a sequence of critical values of γ

    −∞ < γ_1 < ··· < γ_N < ∞        (2.16b)

with γ_0 := −∞ and γ_{N+1} := ∞ for convenience, and a sequence of successively finer partitions

    P_0 = {V} ≻ P_1 ≻ ··· ≻ P_N = {{i} | i ∈ V}.        (2.16c)

The sequence of partitions (together with the corresponding critical values) is referred to as the PSP of g. The PSP of the entropy function (2.13) characterizes the info-clustering solution as follows:

Proposition 2.2 ([1, Corollary 2]) For a finite set V with size |V| > 1 and a random vector Z_V,

    C_γ(Z_V) = min{ P ∈ Π(V) | h_γ[P] = ĥ_γ(V) } \ {{i} | i ∈ V},        (2.17)

namely, the non-singleton elements of the finest optimal partition to the Dilworth truncation (2.15a). ✷
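For small instances, Proposition 2.2 can be checked directly: for each γ, minimize h_γ[P] by brute force over all partitions and keep the finest minimizer. The following self-contained sketch does exactly that; it is exponential in |V| and serves only as a reference against the polynomial-time algorithms developed below.

```python
c = {frozenset({1, 2}): 1, frozenset({2, 3}): 1, frozenset({1, 3}): 5}
V = [1, 2, 3]
h = lambda B: sum(w for e, w in c.items() if e & set(B))    # entropy (2.13)

def partitions(items):
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        yield [[first]] + p
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]

def finest_optimal_partition(g, gamma):
    # Finest minimizer P*(gamma) of (2.15a): among the partitions attaining
    # the minimum of g_gamma[P], prefer the one with the most blocks.
    return min(partitions(V),
               key=lambda p: (sum(g(C) - gamma for C in p), -len(p)))

for gamma in [0, 2, 3, 5, 6]:
    p = finest_optimal_partition(h, gamma)
    print(gamma, sorted(map(sorted, p)), [C for C in p if len(C) > 1])
# gamma < 2: {{1,2,3}}; 2 <= gamma < 5: {{1,3},{2}}; gamma >= 5: singletons,
# i.e., the PSP of h, whose non-singleton blocks give the clusters (2.17).
```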
III. INFORMATION FLOW INTERPRETATION

For a PIN, the MMI can be interpreted as the maximum broadcast throughput of a network [10, 11], and hence info-clustering reduces to clustering by network information flow. When applied to clustering a social network, it can identify communities naturally based on the amount of information flow.

More precisely, treating each edge as an undirected communication link with capacity c, at most a total of 1 bit can be communicated between nodes 1 and 2, and between nodes 2 and 3; and at most a total of 5 bits can be communicated between nodes 1 and 3. It can be seen that, for every pair of distinct nodes v, w ∈ V, the broadcast throughput between v and w is given by the MMI in (2.7a). Fig. 1b illustrates how 2 bits of information can be broadcast in the entire network, achieving the MMI of the entire set of random variables in (2.7b). With the interpretation of the MMI as information flow, a cluster at threshold γ is therefore a maximal subnetwork with broadcast throughput larger than γ. For instance, the cluster {1,3} ∈ C_2(Z_{1,2,3}) is the only subset of nodes on which the induced subnetwork has a throughput exceeding 2.

Specializing to the (hyper-)graphical model, it was shown in [1, Proposition 8] that the clustering solutions can be obtained directly as the non-singleton subsets from the PSP of the incut function. More precisely, from the weighted graph G with vertex set V and capacity function c, define for every pair (v, w) of vertices v, w ∈ V

    c(v, w) := c({v, w}) if v < w;  0 otherwise.        (3.1)

This defines a weighted digraph D with vertex set V and edge set E, where each edge (v, w) ∈ E has capacity c(v, w) > 0. For example, Fig. 2a is the weighted digraph obtained by orienting the weighted graph in Fig. 1a according to (3.1), i.e., by directing each edge from the incident node with a smaller label to the other incident node with a larger one. For convenience, we also write for arbitrary subsets B_1, B_2 ⊆ V

    c(B_1, B_2) := Σ_{v ∈ B_1} c(v, B_2), where        (3.2a)
    c(v, B_2) := Σ_{w ∈ B_2} c(v, w)  for v ∈ V, and        (3.2b)
    c(B_1, w) := Σ_{v ∈ B_1} c(v, w)  for w ∈ V.        (3.2c)

The incut function of the weighted digraph is defined as

    g(B) := c(V \ B, B).        (3.3)

The incut function for the digraph in Fig. 2a is shown in Fig. 2b and calculated below:

    g({2}) = c({1,3}, {2}) = c(1,2) + c(3,2) = 1
    g({3}) = c({1,2}, {3}) = c(1,3) + c(2,3) = 6
    g({2,3}) = c({1}, {2,3}) = c(1,2) + c(1,3) = 6        (3.4)
    g({1,3}) = c({2}, {1,3}) = c(2,1) + c(2,3) = 1
    g({1}) = g({1,2}) = g({1,2,3}) = 0.

To compute the PSP of g, we first evaluate (2.15b) for different partitions as follows:

    g_γ[{{1,2,3}}] = g_γ({1,2,3}) = g([3]) − γ = −γ
    g_γ[{{1,3},{2}}] = g_γ({1,3}) + g_γ({2}) = g({1,3}) + g({2}) − 2γ = 2 − 2γ        (3.5a)
    g_γ[{{1},{2},{3}}] = g_γ({1}) + g_γ({2}) + g_γ({3}) = 7 − 3γ.

Similarly,

    g_γ[{{1,2},{3}}] = 6 − 2γ > g_γ[{{1,3},{2}}]
    g_γ[{{1},{2,3}}] = 6 − 2γ > g_γ[{{1,3},{2}}].        (3.5b)

Fig. 2c plots the Dilworth truncation ĝ_γ(V) in (2.15) against γ as the minimum of g_γ[P] over all partitions P ∈ Π(V). It can be seen that, for a given P, g_γ[P] is linear with integer slope −|P| ∈ {−|V|, ..., −1}, and so ĝ_γ(V) is piecewise linear, consisting, in this example, of |V| = 3 line segments (highlighted in blue) and |V| − 1 = 2 breakpoints (highlighted in red). (In general, the number of line segments is at most |V|.) The finest optimal partition for each value of γ is

    P∗(γ) = {{1,2,3}} (= P_0)        for γ ∈ (−∞, 2)
          = {{1,3},{2}} (= P_1)      for γ ∈ [2, 5)        (3.6)
          = {{1},{2},{3}} (= P_2)    for γ ∈ [5, ∞),

with the PSP and the corresponding critical values γ_1 = 2 and γ_2 = 5 annotated in Fig. 2c.

Fig. 2: Computing the clusters of the PIN model (2.1a) as the non-singleton elements of the PSP (2.16): (a) the weighted digraph D obtained by orienting G as in (3.1); (b) the non-zero values of the incut function g in (3.4); (c) the PSP (3.6) of g and the Dilworth truncation from (3.5).
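The orientation (3.1) and the incut values (3.4) are straightforward to reproduce in code; a short sketch with illustrative names follows.

```python
from itertools import combinations

w = {(1, 2): 1, (2, 3): 1, (1, 3): 5}          # oriented capacities (3.1): v < w
V = [1, 2, 3]

def incut(B):
    """Incut function (3.3): total capacity of arcs entering B from V\\B."""
    B = set(B)
    return sum(cap for (v, u), cap in w.items() if v not in B and u in B)

for r in range(1, len(V) + 1):
    for B in combinations(V, r):
        print(B, incut(B))
# Reproduces (3.4): g({2}) = 1, g({3}) = 6, g({2,3}) = 6, g({1,3}) = 1,
# and g({1}) = g({1,2}) = g({1,2,3}) = 0.
```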
IV. CLUSTERING USING PARAMETRIC MAX-FLOW

By [1, Proposition 9], the PSP of the incut function B ↦ c(V \ B, B) of the digraph D coincides with the PSP of the cut function (divided by 2) of the corresponding undirected graph G, which was shown in [15] to be computable by running a parametric max-flow algorithm O(|V|) times. The parametric max-flow algorithm was introduced in [16]; it runs in O(|V|²√|E|) time using the well-known push-relabel/preflow algorithm [20, 21] implemented with the highest-level selection rule [22]. Hence, the info-clustering solution for the PIN model can be obtained in O(|V|³√|E|) time.

In this section, we will adapt and improve the algorithm in [15] to compute the desired PSP for the info-clustering solution. The algorithm will be illustrated using the same example as in the last section, which is chosen to be the same example as in [15] for ease of comparison.

We first give a procedure in Algorithm 1 for computing the minimum minimizer P∗(γ) to (2.15) for all γ ∈ ℝ and any submodular function g, assuming a parametric submodular function minimizer. This procedure can be specialized further to the PIN model, where g is chosen to be the incut function (3.3), so that the parametric max-flow algorithm can be applied instead.

Algorithm 1: Computing the PSP as a parametric SFM.
  Input: Submodular function g : 2^V → ℝ defined on a finite ground set V.
  Output: The function P∗(γ) of γ ∈ ℝ defined in (2.16a).
   1: P∗(γ) ← {{1}}, B∗(γ) ← {1}, x_{γ,1} ← g_γ({1}), and µ_i ← −∞ for all i ∈ V;
   2: for j = 2 to |V| do
   3:   set B∗(γ) as the (inclusion-wise) minimum minimizer to
            min_{B ⊆ [j] : j ∈ B} g_γ(B) − x_γ(B \ {j}),        (4.1)
        where x_γ(C) := Σ_{i∈C} x_{γ,i} for convenience;
   4:   remove every C that intersects B∗(γ) from P∗(γ);
   5:   add B∗(γ) to P∗(γ);
   6:   if j < |V| then
   7:     for i = 1 to j do
   8:       µ_i ← max{µ_i, min{γ | i ∉ B∗(γ)}};
   9:       x_{γ,i} ← g_{max{γ,µ_i}}({i});
  10:     end
  11:   end
  12: end

The procedure is said to be parametric since the solution P∗ is computed for all possible values of γ ∈ ℝ rather than a particular value. To realize such a procedure, the characterization (2.16a) of P∗(γ) through the PSP in (2.16c) and the corresponding critical values of γ in (2.16b) is computed instead.

The minimization in (4.1) is a submodular function minimization (SFM) (over a lattice family). In contrast with [15], we perform (4.1) on a growing ground set [j] = {1, ..., j} instead of the entire set V in every loop. The idea follows from the algorithms for computing the Dilworth truncation, such as the one given in [19]. In contrast with [19], however, we follow [15] and consider the update rule in Lines 8–9, which guarantees x_{γ,i} to be piecewise linear with at most one breakpoint, possibly at γ = µ_i if µ_i is finite. When i = j, the step in Line 8 gives µ_j = −∞ as B∗(γ) always contains j (see (4.1)), and so x_{γ,j} = g_γ({j}), which is linear without any breakpoint. As pointed out in [15], the update rule is particularly useful for the parametric procedure since the complexity often grows with the number of breakpoints.
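To make the structure of Algorithm 1 concrete, the following simplified rendering fixes γ and solves (4.1) by exhaustive search; it is therefore the non-parametric variant in the spirit of [19] rather than the parametric procedure itself, and the names and representation are ours.

```python
from itertools import combinations

def finest_dilworth_partition(g, n, gamma):
    # Fixed-gamma skeleton of Algorithm 1: grow the ground set [j], solve
    # (4.1) by exhaustive search, and maintain the finest optimal partition.
    x, P = {}, []
    for j in range(1, n + 1):
        cands = [frozenset(rest) | {j}
                 for r in range(j)
                 for rest in combinations(range(1, j), r)]
        vals = {B: g(B) - gamma - sum(x[i] for i in B - {j}) for B in cands}
        best = min(vals.values())
        # inclusion-wise minimum minimizer (Line 3): the minimizers form a
        # lattice, so their intersection is again a minimizer
        B_star = frozenset.intersection(
            *[B for B, v in vals.items() if v == best])
        x[j] = best
        # Lines 4-5: blocks meeting B* are absorbed into the new block
        P = [C for C in P if not C & B_star] + \
            [frozenset.union(B_star, *[C for C in P if C & B_star])]
    return [set(C) for C in P]

# With the incut function from the earlier sketch (so n = 3):
# finest_dilworth_partition(incut, 3, 1) -> [{1, 2, 3}]
# finest_dilworth_partition(incut, 3, 3) -> [{2}, {1, 3}]
# finest_dilworth_partition(incut, 3, 6) -> [{1}, {2}, {3}]
```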
Consider the example with g defined in (3.4) and illustrated in Fig. 2b, and with g_γ defined in (2.15c). When j = 2, Line 1 initializes x_{γ,1} as

    x_{γ,1} = g({1}) − γ = −γ.        (4.2)

Then, (4.1) becomes

    min{ g_γ({1,2}) − x_{γ,1}, g_γ({2}) } = min{ 0, 1 − γ }
        = 1 − γ for γ < 1;  0 for γ ≥ 1,

which is a piecewise linear function plotted in Fig. 3a. The minimum minimizer is therefore given by

    B∗(γ) = {1,2} (= B_0) for γ < 1;  {2} (= B_1) for γ ≥ 1,        (4.3)

with the breakpoint γ_1′ = 1. With P∗(γ) initialized in Line 1 to {{1}}, Lines 4–5 update P∗(γ) to

    P∗(γ) = {{1,2}} for γ < 1;  {{1},{2}} for γ ≥ 1.        (4.4)

Next, with µ_1 and µ_2 initialized to −∞ in Line 1, the subsequent steps following Line 5 give

    µ_1 = max{−∞, γ_1′} = 1        (4.5a)
    x_{γ,1} = g_{max{γ,1}}({1}) = −1 for γ < 1;  −γ for γ ≥ 1        (4.5b)
    µ_2 = max{−∞, −∞} = −∞        (4.5c)
    x_{γ,2} = g_{max{γ,−∞}}({2}) = 1 − γ.        (4.5d)

Similarly, when j = 3, the function of the minimum in (4.1) is plotted in Fig. 3b. It follows that the minimum minimizer is

    B∗(γ) = {1,2,3} (= B_0)  for γ ∈ (−∞, 2)
          = {1,3} (= B_1)    for γ ∈ [2, 5)        (4.6)
          = {3} (= B_2)      for γ ∈ [5, ∞),

with the breakpoints γ_1′ = 2 and γ_2′ = 5. With P∗(γ) given by (4.4) and illustrated in Fig. 3a, Lines 4–5 update P∗(γ) to the desired solution (3.6) characterized by the PSP.

Fig. 3: Illustration of the computation of the PSP in Algorithm 1: (a) the minimization (4.1) for the loop with j = 2, with breakpoint γ_1′ = 1 and minimizers B_0 = {1,2} and B_1 = {2}; (b) the minimization (4.1) for the loop with j = 3, with breakpoints γ_1′ = 2 and γ_2′ = 5 and minimizers B_0 = {1,2,3}, B_1 = {1,3} and B_2 = {3}.

Specializing to PIN models where g is the incut function defined in (3.3), the parametric SFM in (4.1) can be solved as a parametric min-cut problem as shown in Algorithm 2.

Algorithm 2: Computing the parametric SFM in (4.1) as a parametric min-cut.
  Input: A weighted digraph D on vertex set V with capacity function c : V² → ℝ satisfying (3.1).
  Output: The minimum minimizer to (4.1) with g defined as the incut function in (3.3).
   1: define a weighted digraph D_j(γ) with vertex set U ← {s, 1, ..., j} (where s is a new node outside [j]) and capacity function c_γ : U² → ℝ initialized to 0;
   2: for v = 1 to j − 1 do
   3:   c_γ(s, v) ← max{0, −x_{γ,v}};
   4:   c_γ(v, j) ← max{0, x_{γ,v}} + c(v, j);
   5:   c_γ(v, w) ← c(v, w) for all w ∈ [j−1] \ [v];
   6: end
   7: compute the minimum minimizer B∗(γ) to
          min_{T ⊆ U\{s} : j ∈ T} c_γ(U \ T, T),        (4.7)
      which is the minimum s–j cut of D_j(γ).

For the current example, the digraph D_j(γ) with j = 2 is shown in Fig. 4. The vertex set is U = {s, 1, 2}. The for-loop sets v = 1 and initializes the capacities

    c(s,1) = |−x_{γ,1}|⁺ and c(1,2) = |x_{γ,1}|⁺ + 1,
    where |x|⁺ := max{0, x} for x ∈ ℝ.        (4.8)

Evaluating the capacities with the initial value of x_{γ,1} in (4.2) gives the non-decreasing and non-increasing piecewise linear functions c(s,1) and c(1,2), respectively, shown in the figure.

Fig. 4: Illustration of the parametric min-cut problem in Algorithm 2 for the for-loop with j = 2: (a) the weighted digraph D_2(γ); (b) the construction of D_2(γ) from D; (c) the minimum s–2 cut for D_2(1).
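Lines 1–6 of Algorithm 2 translate directly into code. Since the example is tiny, the min-cut (4.7) can be found by enumerating sink sides instead of running max-flow; the following sketch (ours) does so for D_2(γ) at γ = 1, with x_{γ,1} evaluated per (4.2).

```python
from itertools import combinations

def build_Dj(c, x, j):
    # Capacities of D_j(gamma) per Lines 1-6 of Algorithm 2; c holds the
    # oriented capacities (3.1) as {(v, w): cap} with v < w, and x holds
    # the values x_{gamma,v} already evaluated at the gamma of interest.
    cap = {}
    for v in range(1, j):
        cap[('s', v)] = max(0, -x[v])                      # Line 3
        cap[(v, j)] = max(0, x[v]) + c.get((v, j), 0)      # Line 4
        for w in range(v + 1, j):
            cap[(v, w)] = c.get((v, w), 0)                 # Line 5
    return cap

def min_sj_cut(cap, j):
    # Brute-force (4.7): cut value and inclusion-wise minimum sink side.
    sides = [set(R) | {j} for r in range(j)
             for R in combinations(range(1, j), r)]
    val = lambda T: sum(cp for (u, v), cp in cap.items()
                        if u not in T and v in T)
    best = min(map(val, sides))
    # minimum s-j cuts are closed under intersection of their sink sides
    return best, set.intersection(*[T for T in sides if val(T) == best])

c = {(1, 2): 1, (2, 3): 1, (1, 3): 5}
print(min_sj_cut(build_Dj(c, {1: -1}, 2), 2))
# (1, {2}): both {1,2} and {2} attain the cut value 1 at gamma = 1 (Fig. 4c)
```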
Similarly, the digraph D_3(γ) with j = 3 is shown in Fig. 5, with the vertex set now being U = {s, 1, 2, 3} instead, and x_{γ,1} and x_{γ,2} updated to the functions in (4.5).

Fig. 5: Illustration of the parametric min-cut problem in Algorithm 2 for the for-loop with j = 3: (a) the weighted digraph D_3(γ); (b) the construction of D_3(γ) from D; (c) the s–3 max-flow and min-cut for D_3(1).

The weighted digraph D_j(γ) can be viewed as the result of processing the weighted digraph D as follows:
1. Augment the digraph D with two new nodes s and t (both outside V) such that
   a. the capacity from j to t is set to infinity;
   b. for v from 1 to j − 1, if x_{γ,v} < 0, add an arc with capacity −x_{γ,v} from s to v, else add an arc with capacity x_{γ,v} from v to t.
2. Remove all outgoing arcs from node j and contract node t to node j.
3. Remove all incoming arcs of nodes j + 1, ..., |V| and contract the nodes to node s.
This procedure is illustrated for D_2(γ) and D_3(γ) in Fig. 4b and Fig. 5b respectively, where the red dotted arcs are removed, and the nodes circled together by blue lines are contracted.

Step 2 implies the formula in Line 4 directly. Step 3 gives

    c_γ(s, v) = max{0, −x_{γ,v}} + Σ_{u ∈ V\[j]} c(u, v),

but this reduces to the formula in Line 3 because the last summation is zero by the assumption (3.1) that all the arcs point from a node with a smaller label to a node with a larger label. For the same reason, in Line 5, we do not need to set c_γ(v, w) = c(v, w) for w ∈ [v−1], as these capacities are zero by default. In contrast with [15], the additional node contraction and edge removal in Steps 2 and 3 above reduce the number of vertices from |V| to |U| = j and therefore the complexity of solving (4.7).

For each j ∈ V, (4.7) can be solved by the parametric max-flow algorithm in [16] in O(j³√|E|) time by invoking O(j) times the preflow algorithm [20, 21] implemented with the highest-level selection rule [22], which in turn runs in O(j²√|E|) time. The procedure is described in Algorithm 3, which returns the characterization of the solution B∗(γ) to (4.7) as

    B∗(γ) = B_ℓ  for γ ∈ [γ_ℓ′, γ_{ℓ+1}′), ℓ ∈ {0, ..., N′}        (4.9)

for some integer N′ > 0, where γ_0′ := −∞, γ_{N′+1}′ := +∞ and B_0 := [j]. Note also that B∗(γ) = {j} for sufficiently large γ, and so B_{N′} = {j}.

To solve (4.7) for any fixed j, we assume the following subroutine

    [f∗, T∗] = MaxFlow(c̄, Ū, s̄, t̄, f̄),        (4.11)

which takes as arguments the capacity function c̄ (fixed and not parametric), the vertex set Ū on which c̄ is defined, the source node s̄ ∈ Ū, the sink node t̄ ∈ Ū, and a valid preflow f̄ associated with the weighted digraph defined by the previous arguments. It returns the maximum s̄–t̄ flow f∗ and the inclusion-wise minimum set T∗ that solves

    min_{T ⊆ Ū\{s̄} : t̄ ∈ T} c̄(Ū \ T, T),        (4.12)

and is referred to as the minimum s̄–t̄ cut.
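The subroutine MaxFlow (4.11) is left abstract here; the complexity bounds assume the preflow algorithm of [20, 21] with highest-level selection [22]. For experimentation, any max-flow routine that also reports the inclusion-wise minimum sink side suffices; the minimum sink side is the set of nodes that can still reach the sink in the residual graph. A self-contained Edmonds–Karp sketch (ours, and asymptotically slower than the preflow implementation assumed above):

```python
from collections import deque

def max_flow_min_sink_cut(cap, nodes, s, t):
    """Returns (flow value, flow dict, T*), where T* is the inclusion-wise
    minimum sink side of a minimum s-t cut, cf. (4.11) and (4.12)."""
    res = {(u, v): c for (u, v), c in cap.items()}      # residual capacities
    for (u, v) in list(res):
        res.setdefault((v, u), 0)
    flow_value = 0
    while True:
        parent, q = {s: None}, deque([s])               # BFS augmenting path
        while q and t not in parent:
            u = q.popleft()
            for v in nodes:
                if v not in parent and res.get((u, v), 0) > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(res[e] for e in path)                 # bottleneck, augment
        for (u, v) in path:
            res[(u, v)] -= aug
            res[(v, u)] += aug
        flow_value += aug
    T, changed = {t}, True                              # reverse reachability
    while changed:
        changed = False
        for (u, v), r in res.items():
            if r > 0 and v in T and u not in T and u != s:
                T.add(u)
                changed = True
    return flow_value, {e: cap[e] - res[e] for e in cap}, T

# D_3(1) from Fig. 5c, with x_{1,1} = -1 and x_{1,2} = 0 per (4.5):
cap = {('s', 1): 1, ('s', 2): 0, (1, 3): 5, (2, 3): 1, (1, 2): 1}
print(max_flow_min_sink_cut(cap, ['s', 1, 2, 3], 's', 3))
# value 1, flow on s->1->3 as in (4.13), and T* = {1, 2, 3} = U \ {s}
```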
Roughly speaking, γ⁺ in Line 2 is a value of γ such that (4.7) (i.e., B∗(γ)) is constant for γ ≥ γ⁺. Similarly, γ⁻ in Line 3 is a value of γ such that (4.7) is constant for γ ≤ γ⁻. When γ⁻ = γ⁺, there is only one critical value γ_1′ of γ, where B∗(γ) changes from B_0 = [j] to B_1 = {j}.

Algorithm 3: Solving the parametric min-cut problem in (4.7) using the parametric max-flow algorithm [16].
  Input: The weighted digraph D_j(γ) on vertex set U with capacity function c_γ created in Algorithm 2.
  Output: A list L containing (γ_ℓ′, B_ℓ) for ℓ ∈ [N′] that characterizes (4.7) as in (4.9).
   1: create empty lists L and PL;
   2: γ⁺ ← max_{v∈[j−1]} { c([v−1], v) + c(v, [j]\[v]) };
   3: γ⁻ ← min{ max{µ_v, c(v, j)} | v ∈ [j−1] };
   4: if γ⁻ = γ⁺ then
   5:   add (γ⁻, {j}) to L and return L;
   6: end
   7: set f as the zero flow 0 from s to j;
   8: [f∗, T∗] ← MaxFlow(c_{γ⁻}, U, s, j, f);
   9: add (γ⁻, γ⁺, f∗, {s}, {j}) to PL;
  10: while PL is not empty do
  11:   withdraw any element (γ⁻, γ⁺, f, S, T) from PL;
  12:   compute γ̄ ∈ [γ⁻, γ⁺] as the solution to
            c_γ(S, U \ S) = c_γ(U \ T, T);        (4.10)
  13:   define a weighted digraph D̄_j with vertex set Ū ← ([j] \ (S ∪ T)) ∪ {s, j} and capacity function c̄ : Ū² → ℝ initialized to 0;
  14:   for v in Ū \ {s, j} do
  15:     c̄(s, v) ← Σ_{u∈S} c_γ̄(u, v);
  16:     c̄(v, j) ← Σ_{w∈T} c_γ̄(v, w);
  17:     c̄(v, w) ← c(v, w) for all w ∈ Ū \ {s, j, v};
  18:   end
  19:   for v in Ū \ {j} do
  20:     f̄(v, j) ← min{ Σ_{w∈T} f(v, w), c̄(v, j) } and f̄(j, v) ← −f̄(v, j) to ensure anti-symmetry;
  21:     f̄(v, w) ← f(v, w) for all w ∈ Ū \ {j, v};
  22:   end
  23:   [f∗, T∗] ← MaxFlow(c̄, Ū, s, j, f̄);
  24:   if T∗ = {j} then
  25:     add (γ̄, T) to L;
  26:   else
  27:     add (γ⁻, γ̄, f, S, T ∪ T∗) and (γ̄, γ⁺, f∗, S ∪ (Ū \ T∗), T) to PL;
  28:   end
  29: end

To illustrate the above, consider j = 2, i.e., with D_2(γ) shown in Fig. 4a as the input to Algorithm 3. Then, Lines 2–3 give

    γ⁺ = c(1,2) = 1
    γ⁻ = max{µ_1, c(1,2)} = 1,

where the last equality is because µ_1 is initialized to −∞ by Line 1 of Algorithm 1. Since γ⁻ = γ⁺ in this case, the algorithm returns at Line 5 the list

    L = [(1, {2})],  i.e., (γ_1′, B_1) = (1, {2}).

This gives the desired B∗(γ) in (4.3). Fig. 4c shows the digraph D_2(γ) at γ = 1. It can be seen that both {1,2} and {2} are solutions to the minimization in (4.7).

If γ⁻ ≠ γ⁺ (or more specifically, γ⁻ < γ⁺), then the interval (γ⁻, γ⁺) may contain other critical values of γ where B∗(γ) changes. The critical values are then computed iteratively by the preflow algorithm MaxFlow (4.11) (Lines 8 and 23), applied to the digraph D̄_j with capacities derived from those of D_j(γ) (Lines 15–17), with γ evaluated at some value γ̄ ∈ (γ⁻, γ⁺) satisfying (4.10). This either resolves B∗(γ) for the entire interval (in which case the solution is updated in Line 25) or reduces the problem to two smaller subproblems for later processing (i.e., with the original interval (γ⁻, γ⁺) replaced by the two smaller intervals (γ⁻, γ̄) and (γ̄, γ⁺) in Line 27).
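Solving (4.10) amounts to intersecting two piecewise-linear functions of γ, the left side non-decreasing and the right side non-increasing, which is the O(j)-time step discussed below. A sketch (ours) for the first iteration of the j = 3 example, with the two sides hard-coded from the closed forms derived below:

```python
def seg_at(f, x0):
    # active (start, slope, intercept) piece of f at x0
    return [s for s in f if s[0] <= x0][-1]

def solve_eq(lhs, rhs, lo, hi):
    # Solve lhs(x) = rhs(x) on [lo, hi], where lhs - rhs is non-decreasing
    # and both functions are piecewise linear (the setting of (4.10)).
    knots = sorted({lo, hi} | {s for s, _, _ in lhs + rhs if lo < s < hi})
    for a, b in zip(knots, knots[1:]):
        _, m1, c1 = seg_at(lhs, a)
        _, m2, c2 = seg_at(rhs, a)
        da = m1 * a + c1 - (m2 * a + c2)
        db = m1 * b + c1 - (m2 * b + c2)
        if da <= 0 <= db:              # the crossing lies in [a, b]
            return a if m1 == m2 else (c2 - c1) / (m1 - m2)

# l.h.s. and r.h.s. of (4.10) in the first iteration for j = 3 (see below):
lhs = [(float('-inf'), 0, 1), (1, 2, -1)]    # 1 if g < 1 else 2g - 1
rhs = [(float('-inf'), -1, 7), (1, 0, 6)]    # 7 - g if g < 1 else 6
print(solve_eq(lhs, rhs, 1, 6))              # 3.5
```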
To illustrate the procedure above, consider j = 3, i.e., with D_3(γ) shown in Fig. 5a as the input to Algorithm 3. Then, Lines 2–3 give

    γ⁺ = max{ c(1,2) + c(1,3), c(1,2) + c(2,3) } = max{1+5, 1+1} = 6
    γ⁻ = min{ max{µ_1, c(1,3)}, max{µ_2, c(2,3)} } = min{ max{1,5}, max{−∞,1} } = 1,

where we used the values µ_1 = 1 and µ_2 = −∞ from (4.5). Since γ⁻ < γ⁺ in this case, Line 5 is skipped. Line 8 invokes the preflow algorithm for the graph D_3(γ⁻) = D_3(1) shown in Fig. 5c. The min-cut is T∗ = [3] = U \ {s} (where U = {s,1,2,3} is the vertex set of D_3) by the construction of γ⁻. The max-flow is

    f∗(s,1) = f∗(1,3) = 1
    f∗(1,s) = f∗(3,1) = −1        (4.13)

and 0 otherwise. Note that the second line of equations ensures the anti-symmetry property of a flow function, i.e.,

    f∗(w,v) = −f∗(v,w)        (4.14)

for all pairs of distinct nodes v and w. The flow along each arc is indicated in Fig. 5c by the parentheses next to the corresponding capacity of the arc.

The tuple (γ⁻, γ⁺, f∗, {s}, {j}) is then added to PL in Line 9 and retrieved (and deleted from PL) subsequently inside the while-loop (Line 11). With γ⁻ = 1, γ⁺ = 6, S = {s} and T = {j} in Line 11, the l.h.s. of (4.10) is given as (see Fig. 5a)

    c_γ(S, U\S) = c_γ(s, {1,2,3})
                = [1 if γ < 1 else γ] + [0 if γ < 1 else γ − 1]
                = 1 if γ < 1 else 2γ − 1,

and the r.h.s. of (4.10) is given as

    c_γ(U\T, T) = c_γ({s,1,2}, 3)
                = 5 + [2 − γ if γ < 1 else 1]
                = 7 − γ if γ < 1 else 6.

γ̄ is computed as the solution to (4.10), namely γ̄ = 3.5. In general, such a value must exist and is unique because U\S and T are optimal solutions to (4.7) at γ⁻ and γ⁺ respectively. The computation is in O(j) time since both sides of the equation are piecewise linear with at most O(j) breakpoints.

The new weighted digraph D̄_j with the capacity function c̄ assigned in the first for-loop (Lines 15–17) can be obtained from D_j(γ) by
1) setting γ = γ̄ and contracting S to the source node s,
2) contracting T to the sink node j, and then
3) removing the incoming arcs to s and the outgoing arcs from j.
The second for-loop turns f into a valid preflow f̄ of D̄_j.

Recall that, for the current example, γ̄ = 3.5 in the last execution of the algorithm. Fig. 6a shows two digraphs, where the top one is the digraph D_3(γ) at γ = γ̄ = 3.5 and the bottom one is the new weighted digraph D̄_3. The sets S, T and the flow f indicated on the top digraph D_3(3.5) satisfy (4.10), while D̄_3 is annotated with the max-flow f∗ and the min-cut T∗ computed by Line 23. Note that, since T∗ = {1,3} ≠ {j}, Line 25 will be skipped. Instead, Line 27 adds the following two tuples to the list PL, which becomes

    PL = [(1, 3.5, f, {s}, {1,3}), (3.5, 6, f∗, {s,2}, {3})].        (4.15)

Repeating the while-loop with the first element retrieved from PL, it can be shown that (4.10) is solved by the value γ̄ = 2. Similar to Fig. 6a, Fig. 6b shows the digraph D_3(2) at the top and D̄_3 at the bottom. It can be verified that S and T satisfy (4.10) for the top graph and that T∗ is the min-cut in the bottom graph. Since T∗ = {3} in this case, a new element (2, {1,3}) is added to L in Line 25.

Finally, repeating the while-loop again with the last element retrieved from PL in (4.15), it can be shown that γ̄ = 5. Fig. 6c again gives D_3(5) and the min-cut T∗ = {3}, in which case a new element (5, {3}) is added to L, again in Line 25. Since PL is now empty, the algorithm terminates with

    L = [(2, {1,3}), (5, {3})],  i.e., (γ_1′, B_1) = (2, {1,3}) and (γ_2′, B_2) = (5, {3}).

This gives the desired B∗(γ) in (4.6), which yields the desired PSP in (3.6) and therefore the info-clustering solution in (2.11) for the PIN model (2.1a).

Fig. 6: Illustration of the parametric max-flow algorithm in Algorithm 3: (a) D̄_3 from D_3(3.5); (b) D̄_3 from D_3(2); (c) D̄_3 from D_3(5). In each panel, the top digraph shows D_3(γ̄) with the sets S and T and the flow f, and the bottom digraph shows D̄_3 with the max-flow f∗ (in parentheses) and the min-cut T∗.
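Putting the pieces together, the divide-and-conquer skeleton of Algorithm 3 can be sketched as follows. This rendering is ours and omits the flow reuse and contraction that give the stated complexity: min-cuts are recomputed by brute force and (4.10) is solved by bisection (valid since the difference of the two sides is non-decreasing in γ). It reproduces the list L for j = 3.

```python
from itertools import combinations

def d3_caps(gamma):
    # Capacities of D_3(gamma); x_{gamma,1}, x_{gamma,2} as in (4.5).
    x1 = -1 if gamma < 1 else -gamma
    x2 = 1 - gamma
    return {('s', 1): max(0, -x1), ('s', 2): max(0, -x2),
            (1, 3): max(0, x1) + 5, (2, 3): max(0, x2) + 1, (1, 2): 1}

def min_sink_side(cap, j):
    # Brute-force (4.7): inclusion-wise minimum sink side of a min s-j cut.
    sides = [set(R) | {j} for r in range(j)
             for R in combinations(range(1, j), r)]
    val = lambda T: sum(cp for (u, v), cp in cap.items()
                        if u not in T and v in T)
    best = min(map(val, sides))
    return set.intersection(*[T for T in sides if val(T) == best])

def gamma_bar(lo, hi, S, T, U, tol=1e-9):
    # Bisection for (4.10); lhs - rhs is non-decreasing in gamma.
    def cut(g, A, B):
        return sum(cp for (u, v), cp in d3_caps(g).items()
                   if u in A and v in B)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cut(mid, S, U - S) >= cut(mid, U - T, T):
            hi = mid
        else:
            lo = mid
    return round(hi, 6)          # the example's breakpoints are rational

def breakpoints(lo, hi, T_lo, T_hi, U, j=3):
    # Divide and conquer of Algorithm 3 (simplified): T_lo and T_hi are the
    # minimum sink sides at gamma = lo and gamma = hi respectively.
    gb = gamma_bar(lo, hi, U - T_lo, T_hi, U)
    T_star = min_sink_side(d3_caps(gb), j)
    if T_star == T_hi:           # gb is the only critical value in (lo, hi)
        return [(gb, sorted(T_hi))]
    return (breakpoints(lo, gb, T_lo, T_star, U) +
            breakpoints(gb, hi, T_star, T_hi, U))

U = {'s', 1, 2, 3}
print(breakpoints(1, 6, {1, 2, 3}, {3}, U))   # [(2.0, [1, 3]), (5.0, [3])]
```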
V. CONCLUSION

We have adapted the parametric max-flow algorithm for computing the PSP into an info-clustering algorithm that clusters a graphical network based on the information flow over its edges. The overall running time is O(|V|³√|E|), where |V| is the size of the network and |E| is the number of edges or communication links. The algorithm simplifies the general info-clustering algorithm by a few orders of magnitude, and it is applicable to systems, such as social networks, where similarity can be measured by mutual information.

To implement the algorithm in a large-scale social network, the preflow algorithm may be made distributed and adaptive: servers may be deployed in different parts of the network to measure and store the information exchange rates of different pairs of nodes. The push and relabel operations in the preflow algorithm can be done locally by the servers first and then communicated to other servers when necessary. The preflow of the network may be stored in conjunction with the clustering solution, so that the clusters can be updated incrementally over time based on the changes of the information exchange rates.

The allocation of the servers and other resources may also be adapted to the clustering solution. For instance, as intra-cluster communication is more frequent than inter-cluster communication, the nodes in a cluster with larger mutual information may be assigned to the same server so that changes in the network can be updated more frequently without much communication overhead among the servers.

REFERENCES

[1] C. Chan, A. Al-Bashabsheh, Q. Zhou, T. Kaced, and T. Liu, "Info-clustering: A mathematical theory for data clustering," IEEE Transactions on Molecular, Biological and Multi-Scale Communications, vol. 2, no. 1, pp. 64–91, June 2016.
[2] C. Chan, A. Al-Bashabsheh, J. Ebrahimi, T. Kaced, and T. Liu, "Multivariate mutual information inspired by secret-key agreement," Proceedings of the IEEE, vol. 103, no. 10, pp. 1883–1913, Oct. 2015.
[3] H. Narayanan, "The principal lattice of partitions of a submodular function," Linear Algebra and its Applications, vol. 144, pp. 179–216, 1990.
[4] Y. Wu and P. Yang, "Minimax rates of entropy estimation on large alphabets via best polynomial approximation," IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3702–3720, 2016.
[5] K. Nagano, Y. Kawahara, and S. Iwata, "Minimum average cost clustering," in Advances in Neural Information Processing Systems, 2010, pp. 1759–1767.
[6] C. Chan and T. Liu, "Clustering of random variables by multivariate mutual information on Chow–Liu tree approximations," in Fifty-Third Annual Allerton Conference on Communication, Control, and Computing, Monticello, Illinois, Sep. 2015.
[7] A. J. Butte and I. S. Kohane, "Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements," in Pacific Symposium on Biocomputing, vol. 5, 2000, pp. 418–429.
[8] S. Nitinawarat, C. Ye, A. Barg, P. Narayan, and A. Reznik, "Secret key generation for a pairwise independent network model," IEEE Trans. Inf. Theory, vol. 56, no. 12, pp. 6482–6489, Dec. 2010.
[9] S. Nitinawarat and P. Narayan, "Perfect omniscience, perfect secrecy, and Steiner tree packing," IEEE Trans. Inf. Theory, vol. 56, no. 12, pp. 6490–6500, Dec. 2010.
[10] C. Chan, "Matroidal undirected network," in Proc. IEEE Int. Symp. on Inf. Theory, July 2012, pp. 1498–1502.
[11] C. Chan, "The hidden flow of information," in Proc. IEEE Int. Symp. on Inf. Theory, St. Petersburg, Russia, Jul. 2011.
[12] S. Fujishige and S. Iwata, "Minimizing a submodular function arising from a concave function," Discrete Applied Mathematics, vol. 92, no. 2, pp. 211–215, 1999.
[13] M. Queyranne, "Minimizing symmetric submodular functions," Mathematical Programming, vol. 82, no. 1–2, pp. 3–12, 1998.
[14] S. Jegelka, H. Lin, and J. A. Bilmes, "On fast approximate submodular minimization," in Advances in Neural Information Processing Systems, 2011, pp. 460–468.
[15] V. Kolmogorov, "A faster algorithm for computing the principal sequence of partitions of a graph," Algorithmica, vol. 56, no. 4, pp. 394–412, 2010.
[16] G. Gallo, M. D. Grigoriadis, and R. E. Tarjan, "A fast parametric maximum flow algorithm and applications," SIAM Journal on Computing, vol. 18, no. 1, pp. 30–55, 1989.
[17] C. Chan and L. Zheng, "Mutual dependence for secret key agreement," in Proceedings of the 44th Annual Conference on Information Sciences and Systems, 2010.
[18] S. Fujishige, "Polymatroidal dependence structure of a set of random variables," Information and Control, vol. 39, no. 1, pp. 55–72, 1978.
[19] A. Schrijver, Combinatorial Optimization: Polyhedra and Efficiency. Springer, 2002.
[20] A. V. Goldberg, "Efficient graph algorithms for sequential and parallel computers," Ph.D. dissertation, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1987.
[21] A. V. Goldberg and R. E. Tarjan, "A new approach to the maximum-flow problem," Journal of the ACM, vol. 35, no. 4, pp. 921–940, 1988.
[22] B. V. Cherkassky and A. V. Goldberg, "On implementing the push-relabel method for the maximum flow problem," Algorithmica, vol. 19, no. 4, pp. 390–410, 1997.