Correlation Clustering with Partial Information

Erik D. Demaine and Nicole Immorlica*

Laboratory for Computer Science, MIT, Cambridge, MA 02139, USA.
edemaine, [email protected]

Abstract. We consider the following general correlation-clustering problem [1]: given a graph with real edge weights (both positive and negative), partition the vertices into clusters to minimize the total absolute weight of cut positive edges and uncut negative edges. Thus, large positive weights (representing strong correlations between endpoints) encourage those endpoints to belong to a common cluster; large negative weights encourage the endpoints to belong to different clusters; and weights with small absolute value represent little information. In contrast to most clustering problems, correlation clustering specifies neither the desired number of clusters nor a distance threshold for clustering; both of these parameters are effectively chosen to be the best possible by the problem definition.

Correlation clustering was introduced by Bansal, Blum, and Chawla [1], motivated by both document clustering and agnostic learning. They proved NP-hardness and gave constant-factor approximation algorithms for the special case in which the graph is complete (full information) and every edge has weight +1 or −1. We give an O(log n)-approximation algorithm for the general case based on a linear-programming rounding and the "region-growing" technique. We also prove that this linear program has a gap of Ω(log n), and therefore our approximation is tight under this approach. We also give an O(r^3)-approximation algorithm for K_{r,r}-minor-free graphs. On the other hand, we show that the problem is APX-hard, and any o(log n)-approximation would require improving the best approximation algorithms known for minimum multicut.

1 Introduction

Clustering objects into groups is a common task that arises in many applications such as data mining, web analysis, computational biology, facility location, data compression, marketing, machine learning, pattern recognition, and computer vision.
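As a concrete illustration of the objective defined in the abstract, the following sketch evaluates the correlation-clustering cost of a fixed partition: the total absolute weight of positive edges that are cut plus negative edges that are uncut. The edge weights, vertex names, and cluster assignment below are illustrative examples, not data from the paper.

```python
def clustering_cost(edges, cluster_of):
    """Correlation-clustering objective for a given partition.

    edges: iterable of (u, v, w) triples with real weight w.
    cluster_of: dict mapping each vertex to its cluster label.
    Returns total absolute weight of cut positive edges
    plus uncut negative edges.
    """
    cost = 0.0
    for u, v, w in edges:
        same = cluster_of[u] == cluster_of[v]
        if w > 0 and not same:
            # strongly correlated endpoints were separated
            cost += w
        elif w < 0 and same:
            # anti-correlated endpoints were kept together
            cost += -w
    return cost

# Illustrative instance: a-b strongly correlated, b-c anti-correlated.
edges = [("a", "b", 2.0), ("b", "c", -1.5), ("a", "c", 0.5)]
clusters = {"a": 0, "b": 0, "c": 1}
print(clustering_cost(edges, clusters))  # → 0.5 (only the a-c edge is cut)
```

Note that the number of clusters is not an input: the all-singletons partition here pays all positive weight (2.5) and the one-cluster partition pays all negative weight (1.5), while the partition above pays only 0.5, so the best trade-off emerges from the objective itself.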
Clustering algorithms for these and other objectives have been heavily investigated in the literature. For partial surveys, see e.g. [6, 11, 14, 15, 16, 18]. In a theoretical setting, the objects are usually viewed as points in either a metric space (typically finite) or a general distance matrix, or as vertices in a graph. Typical objectives include minimizing the maximum diameter of a cluster (k-clustering) [8], minimizing the average distance between pairs of clustered

* Research was supported in part by an NSF GRF.

S. Arora et al. (Eds.): APPROX 2003 + RANDOM 2003, LNCS 2764, pp. 1–13, 2003.
© Springer-Verlag Berlin Heidelberg 2003