ebook img

Analysis of a Gibbs sampler method for model based clustering of gene expression data PDF

0.34 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Analysis of a Gibbs sampler method for model based clustering of gene expression data

BIOINFORMATICS Vol.00no.002007 Pages1–8 Analysis of a Gibbs sampler method for model based clustering of gene expression data Anagha Joshia,b, Yves Van de Peera,b, Tom Michoela,b ∗ aDepartmentofPlantSystemsBiology,VIB,Technologiepark927,9052Gent,Belgium, bDepartmentofMolecularGenetics,UGent,Technologiepark927,9052Gent,Belgium 8 ABSTRACT A variety of heuristic clustering methods have been used, such 0 Motivation:Overthelastdecade,alargevarietyofclusteringalgo- as hierarchical clustering (Eisen et al., 1998), k-means (Tavazoie 0 rithms have been developed to detect coregulatory relationships et al., 1999), or self-organizing maps (Tamayo et al., 1999). Alt- 2 among genes from microarray gene expression data. Model based houghthesemethodshavehadanenormousimpact,theirstatistical n clustering approaches have emerged as statistically well grounded propertiesaregenerallynotwellunderstoodandimportantparame- a methods, but the properties of these algorithms when applied to terssuchasthenumberofclustersarenotdeterminedautomatically. J large-scale data sets are not always well understood. An in-depth Therefore, therehasbeenashiftinattentiontowardsmodel based 4 analysiscanrevealimportantinsightsabouttheperformanceofthe clustering approaches in recent years (Yeung et al., 2001; Fraley 1 algorithm,theexpectedqualityoftheoutputclusters,andthepossi- andRaftery,2002;MedvedovicandSivaganesan,2002;Medvedo- bilitiesforextractingmorerelevantinformationoutofaparticulardata vicetal.,2004;Qin,2006; Dahl,2006). Amodelbasedapproach ] M set. assumesthatthedataisgeneratedbyamixtureofprobabilitydis- Results: We have extended an existing algorithm for model based tributions, one for each cluster, and takes explicitly into account Q clustering of genes to simultaneously cluster genes and conditi- the noisyness of gene expression data. It allows for a statistical . ons, and used three large compendia of gene expression data for assessmentoftheresultingclustersandgivesaformalestimatefor o S.cerevisiae to analyze its properties. The algorithmuses a Baye- theexpectednumberofclusters.Toinfermodelparametersandclu- i b sianapproachandaGibbssamplingproceduretoiterativelyupdate sterassignments,standardstatisticaltechniquessuchasExpectation - the clusterassignmentofeach geneand condition. Forlarge-scale MaximizationorGibbssamplingareused(Liu,2002). q [ data sets, the posterior distribution is strongly peaked on a limited In this paper we use a novel model based clustering method numberofequiprobableclusterings.AGOannotationanalysisshows whichbuildsupon the method recentlyintroduced byQin(2006). 1 that these local maxima are all biologically equally significant, and We address two key questions that have remained largely unans- v thatsimultaneouslyclusteringgenesandconditionsperformsbetter wered for model based clustering methods in general, namely 3 thanonlyclusteringgenesandassumingindependentconditions.A convergence of the Gibbs sampler for very large data sets, and 3 collectionofdistinctequivalentclusteringscanbesummarizedasa non-heuristic reconstruction of gene clusters from the posterior 0 weightedgraphonthesetofgenes,fromwhichweextractfuzzy,over- probabilitydistributionofthestatisticalmodel. 2 . lappingclustersusingagraphspectral method. Thecoresofthese Inthemodel used byQin(2006), itisassumedthat theexpres- 1 fuzzyclusterscontaintightsetsofstronglycoexpressedgenes,while sionlevelsofgenesinoneclusterarerandomsamplesdrawnfrom 0 the overlaps exhibit relations between genes showing only partial a Gaussian distribution and expression levels of different experi- 8 coexpression. mentalconditionsareindependent.Wehaveextendedthismodelto 0 Availability: GaneSh, a Java packagefor coclustering, isavailable allowdependenciesbetweendifferentconditionsinthesamecluster. : v underthetermsoftheGNUGeneralPublicLicensefromourwebsite Medvedovicetal.(2004)usedamultivariatenormaldistributionto Xi athttp://bioinformatics.psb.ugent.be/software. take into account correlation among experimental conditions. Our Contact:[email protected] approachconsistsofclusteringtheconditionswithineachgeneclu- r a Supplementaryinformation:availableonourwebsiteat ster, assuming that the expression levels of the genes in one gene http://bioinformatics.psb.ugent.be/supplementary data/anjos/gibbs cluster for the conditions in one condition cluster are drawn from oneGaussian distribution. Hence our model isamodel for coclu- stering or two-way clustering of genes and conditions. The same 1 INTRODUCTION statisticalmodelwasalsousedinourrecentapproachtoreconstruct SincetheseminalpaperbyEisenetal.(1998),nowalmostadecade transcriptionregulatorynetworks(Michoeletal.,2007).Thecoclu- ago,clusteringformsthebasisforextractingcomprehensibleinfor- steringiscarriedoutbyaGibbssamplerwhichiterativelyupdates mation out of large-scale gene expression data sets. Clusters of theassignmentofeachgene,andwithineachgeneclustertheassi- coexpressedgenestendtobeenrichedforspecificfunctionalcate- gnment of each experimental condition, using the full conditional gories (Eisenet al., 1998), share cis-regulatory sequences intheir distributionsofthemodel. promoters (Tavazoieet al., 1999), or form thebuilding blocks for ItisknownthataGibbssamplermayhavepoor mixingproper- reconstructingtranscriptionregulatorynetworks(Segaletal.,2003). tiesifthedistributionbeingapproximatedismulti-modalanditwill thenhaveaslowconvergencerate(Liu,2002). Previousstudiesof ∗Correspondingauthor,E-mail:[email protected] ©OxfordUniversityPress2007. 1 A.Joshietal Gibbssamplersformodelbasedclusteringhavenotreportedconver- belong to one fuzzy cluster with very high probability we obtain gencedifficulties(MedvedovicandSivaganesan,2002;Medvedovic tightclusterswhichshowhigherfunctionalcoherencecomparedto etal.,2004;Dahl,2006).Inthosestudies,onlydatasetswitharela- standardclusters.Keepingalsogeneswhichbelongwithlowerbut tively small number of genes (upto a few 100) (Medvedovic and still significant probability to multiple fuzzy clusters, we can ten- Sivaganesan, 2002; Medvedovic et al., 2004), or a small number tatively identify multifunctional genes or relations between genes ofexperimentalconditions(lessthan10)(Dahl,2006)wereconsi- showingonlypartialcoexpression. Weshowthatourresultsarein dered, and special sampling techniques such as reverse annealing goodagreementwithpreviousfuzzyclusteringapproachestogene (Medvedovic et al., 2004) or merge-split proposals (Dahl, 2006) expressiondata(GaschandEisen,2002).Webelievethatourfuzzy weresufficienttogenerateawellmixingGibbssampler.Weobserve clusteringmethodtosummarizetheposteriordistributionwillbeof that for data sets of increasing size the correlation between two generalinterestforallmodelbasedclusteringapproachesandsol- Gibbssamplerrunsaswellasthenumberofclustersolutionsvisited vestheproblemsassociatedtoheuristicclusteringsofthepairwise inonerunafterburn-insteadilydecreases.Thismeansthatforlarge- probabilitymatrix. scaledatasets,theposteriordistributionisverystronglypeakedon Allouranalysesareperformedonthreelarge-scalepubliccom- multiplelocalmodes.Sincethepeaksaresostrong,weapproximate pendia of gene expression data for S. cerevisiae (Spellman et al., theposteriordistributionbyaveragingovermultiplerunsperformed 1998;Gaschetal.,2000;Hughesetal.,2000). inparallel, each converging quickly toasinglemode. Bycompu- tingthecorrelationbetweendifferentaveragesofthesamenumber 2 METHODS of runs we are able to show that the number of distinct modes is Mathematicalmodel relativelysmallandaccurateapproximationstotheposteriordistri- bution can be obtained withas few as 10 modes for around 6000 ForanexpressionmatrixwithNgenesandMconditions,wedefineacoclu- genes. steringasapartition ofthegenesintoK geneclusters Gk, together with foreachgenecluster,apartitionofthesetofconditionsintoL condition Toidentifythefinaloptimalclustering,thetraditionalapproachis k toselectoutofalltheclusteringsvisitedbytheGibbssamplerthe clusters Ek,l. Weassumethat all data points in acocluster {(i,m): i ∈ onewhichmaximizestheposteriordistribution(maximumaposte- Gk,m ∈ Ek,l} are random samples from the same normal distribution. ThismodelgeneralizesthemodelusedbyQin(2006),wherethepartitionof riori(MAP)clustering).However, weshowthatforlargedatasets conditionsisalwaysfixedatthetrivialpartitionintosingletonsets. thedifferencesinlikelihoodbetweenthedifferentlocalmaximaare Givenasetofmeansandprecisions(µkl,τkl),acoclusteringCdefinesa extremely small and statistically insignificant, such that the MAP probabilitydensityondatamatricesD=(xim)by clusteringisasgoodastakinganylocalmaximumatrandom.AGO K Lk (Ashburneretal.,2000)analysisofthedifferentmodesshowsthat p D|C,(µkl,τkl) = p(xim|µkl,τkl). alsofrom the biological point of view any difference between the ` ´ kY=1lY=1i∈YGkm∈YEk,l local modesisinsignificant. Taking intoaccount thefullposterior Weuseauniformprioronthesetofcoclusteringswithnormal-gammacon- distributionismoredifficultsincedifferentclusteringsmayhavea jugatepriorsfortheparametersµ andτ .UsingBayes’rulewefindthe kl kl differentnumberofclustersandthelabelingofclustersisnotuni- probability of a coclustering C with parameters (µkl,τkl) given the data que(thelabelswitchingproblem(RednerandWalker,1984)).The D. Thenwetake the marginal probability overthe parameters (µkl,τkl) commonsolutiontothisproblemistoconsiderpairwiseprobabili- toobtainthefinalprobabilityofacoclusteringCgiventhedataD,uptoa tiesfortwogenesbeingclusteredtogetherornot(Medvedovicand normalizationconstant: Sivaganesan,2002;Medvedovicetal.,2004;Dahl,2006).Amajor K Lk questionthathasnotyetrecievedafinalanswerishowtoreconstruct p(C|D)∝ p(µ,τ) p(xim|µ,τ)dµdτ, (1) gene clusters from these pairwise probabilities. Medvedovic and kY=1lY=1ZZ i∈YGkm∈YEk,l Sivaganesan (2002) and Medvedovic et al. (2004) use a heuristic wherep(µ,τ)=p(µ|τ)p(τ)with hierarchical clustering on the pairwise probability matrix to form a final clustering estimate. Dahl (2006) introduces a least-squares p(µ|τ)= λ0τ 1/2e−λ02τ(µ−µ0)2, p(τ)= β0α0 τα0−1e−β0τ, method, which selects out of all clusterings visited by the Gibbs 2π Γ(α0) ` ´ sampler the one which minimizes a distance function to the pair- α0,β0,λ0 >0and−∞<µ0 <∞beingtheparametersofthenormal- wiseprobabilitymatrix.Inbothapproaches, theprobabilitymatrix gamma prior distribution. We use the values α0 = β0 = λ0 = 0.1 isreducedtoasinglehardclustering.Thisnecessarilyremovesnon- and µ0 = 0.0, resulting in a non-informative prior. We have compared thenormal-gammapriorwithothernon-informative, conjugate priors, but transitiverelationsbetween genes (such asa low probability for a foundnodifferenceinresults(seeSupplementaryInformation).Thedouble pair of genes to be clustered together even though they both have integral in eq. (1) can be solved exactly in terms of the sufficient stati- relativelyhighprobabilitytobeclusteredwiththesamethirdgene) stics T(n) = xn (n = 0,1,2)foreachcocluster. The whichmayneverthelessbeinformativeandbiologicallymeaningful. kl i∈Gk,m∈Ekl im log-likelihoodorBayesianscoredecomposesasasumofcoclusterscores: Weproposethatthepairwiseprobabilitymatrixreflectsasoftor P fuzzyclusteringofthedata, i.e., genescanbelongtomultipleclu- K Lk sterswithacertainprobability.Toextractthesefuzzyclustersfrom S(C)=logp(C|D)= Skl, (2) thepairwiseprobabilitiesweuseamethodfrompatternrecognition kX=1Xl=1 with theory (Inoue and Urahama, 1999). This method iteratively com- papnurodtebusapbtdhilaeitteylsamrtghaeetsrptirxeo,ibgcaeobnnivsliattrlyuucemtsaatnardixfucbzoyzryrreecsmpluoosvntidenrignwgfritoehimgtehintevtheeeicgtwoenreviogefhctttohoref, Skl=−21Tk(0l)log(2π)+ 12log`λ0+λ0Tk(0l)´−logΓ(α0) thegenesassignedtothelastcluster.Byonlykeepinggeneswhich +logΓ(α0+ 21Tk(0l))+α0logβ0−(α0+ 21Tk(0l))logβ1 2 and anentropymeasure β1=β0+ 21hTk(2l)− (TTk(k(1l0l)))2i+ λ02`(λT0k(1l+)−Tk(µ0l0))TTk(k0(l0l))´2. Hfuzzy= N21ln2Xij h(Fij), (6) whereNisthedimensionofthesquarematrixF and Gibbssampleralgorithm h(q)=−qln(q)−(1−q)ln(1−q)for0≤q≤1. WeuseaGibbssamplertosamplecoclusteringsfromtheposteriordistribu- tion(1).Thealgorithmiterativelyupdatestheassignmentofgenestogene Forahardclustering (Fij = 0or1forall i,j), Hfuzzy = 0, andfora clusters,andforeachgenecluster,theassignmentofconditionstocondition maximallyfuzzyclustering(Fij =0.5foralli,j),Hfuzzy = 1.Inreality, clustersasfollows: thematrixFisverysparse(mostgenepairswillneverbeclusteredtogether), 1. Initialization: randomly assign N genes to arandom K0 number of soWHefuzazsysuremmeatihnastsamfaulzlzeyvegnenfoer-greenalefmuzaztyrixclFustiesripnrgosd.ucedbyafuzzyclu- geneclusters,andforeachcluster,randomlyassignM conditionstoa steringofthegenes,i.e.,weassumethateachgeneihasaprobabilityp randomL numberofconditionclusters. ik k,0 tobelongtoeachclusterk,suchthat kpik =1.Toextracttheseproba- 2. ForNcycles,removearandomgeneifromitscurrentcluster.Foreach bilitiesfromF weuseagraphspectralmethod(InoueandUrahama,1999), genecluster k, calculate theBayesian score S(Ci→k), where Ci→k originally developed forpattern recogPnition andimage analysis, modified denotesthecoclusteringobtainedfromCbyassigninggeneitocluster heretoenforcethenormalizationconditionsonp .Afuzzyclusterisrepre- ik k,keepingallotherassignmentsofgenesandconditionsequal,aswell sentedbyacolumnvectorw = (w1,...,wN)T, withwi theweightof astheprobabilityS(Ci→0)forthegenetobealoneinitsowncluster. geneiinthiscluster,normalizedsuchthatkwk2 =wTw= w2 =1. i i AssigngeneitooneofthepossibleK+1geneclusters, whereK Thecohesiveness oftheclusterwithrespecttothegene-genematrixF is iQskth∝eceuSrr(eCnit→nku)m,nboerrmoafligzeendesuclcuhsttehrast, ackcoQrdkin=g t1o.theprobabilities definedaswTFw=PijwwiTFFijwwj.BytheRayleigh-RitzthePorem, 3. Fmorfreoamchitgsecnuerrcelnutstcelruskte,r.foFroMreaccyhclceosPn,drietimonovceluastrearnld,ocmalccuolnadteititohne mw6=ax0 wTw =v1TFv1=λ1, BayesianscoreS(Ck,m→l).Assignconditionmtooneofthepossible whereλ1isthelargesteigenvalueofF andv1thecorresponding(normali- Lk+1clusters,whereLkisthecurrentnumberofconditionclusters zed)eigenvector.HencethemaximallycohesiveclusterinF isgivenbythe forgeneclusterk,accordingtotheprobabilities Ql ∝ eS(Ck,m→l), eigenvectorofthelargesteigenvalue.BythePerron-Frobeniustheorem,this 4. nItoerrmatealsizteepd2suacnhdt3hautnPtillcoQnlve=rge1n.ce.Oneiterationisdefinedasexecu- ethigeemnveemctboerrsishiupnpiqroubeaabnidlitaiellsittosecnlutrsiteesra1rebynopnin1eg=atimvea.xvWj1(,evi1c,ajn).thHeenndceefitnhee tingstep2and3consecutivelyonce,andhenceconsistsofN+K×M genewiththehighestweightinv1isconsideredtheprototypicalgeneforthis samplingsteps(withKthenumberofgeneclustersafterStep1ofthat cluster,anditwillnotbelongtoanyothercluster.Theprobabilitypi1mea- iteration). surestowhatextentothergenesarecoexpressedwiththisprototypicalgene. Tofindthenextmostcohesivecluster,weremovefromF theinformation This coclustering algorithm simulates a Markov chain which satisfies alreadycontainedinthefirstclusterbysetting detailedbalancewithrespecttotheposteriordistribution(1),i.e.,afterasuf- ficientnumberofiterations,theprobabilitytovisitaparticularcoclustering Fi(j2)= 1−pi1Fij 1−pj1, Cisgivenexactlybyp(C|D).Theexpectationvalueofanyrealfunctionf andcomputethelargesteigenpvalueandcorpresponding(normalized)eigen- withrespecttotheposteriordistributioncanbeapproximatedbyaveraging overtheiterationsofasufficientlylongGibbssamplerrun: vectorv2forthismatrix.Theprototypicalgeneforthisclustermayalready have some probability assigned to the previous cluster, so we define the 1 T0+T membershipprobabilitiestothesecondclusterby E(f)=XC f(C)p(C|D)≈ T t=XT0+1f(Ct) (3) pi2=min maxvj2(,vi2,j)(1−pimax1),1−pi1 . “ ” whereCtisthecoclusteringvisitedatiterationtandT0isapossibleburn- Hereimax=argmaxj(v2,j)istheprototypicalgeneforthesecondcluster, inperiod.WesaythattheGibbssamplerhasconvergediftworunsstarting andwetakethe‘min’toensurethat kpikwillneverexceed1. fromdifferentrandominitializationsreturnthesameaverages(3)forasui- ThisprocedureofreducingF andcomputingthelargesteigenvalueand tablesetoftestfunctionsf.Moreprecisely,if{fn}isasetoftestfunctions, correspondingeigenvectortodefinetPhenextclustermembershipprobabili- definean =E1(fn)theaverageoffninthefirstGibbssamplerrun,and tiesisiterateduntiloneofthefollowingstoppingcriteriaismet: bn=E2(fn)theaverageoffninthesecondGibbssamplerrun.Wedefine 1. All entries in the reduced matrix F(k) reach 0, i.e., for all genes, acorrelationmeasureρ(0≤ρ≤1)betweentworunsas k′<kpik′ = 1, and we have completely determined all fuzzy ρ= | nanbn| . (4) Pclustersandtheirmembershipprobabilities. ( Pna2n)( nb2n) 2. Thelargesteigenvalue ofthereducedmatrixF(k) hasrank> 1. In thiscasetheeigenvectorisnolongeruniqueandneednolongerhave FullconvergenceisreachedipfρP=1. P nonnegativeentries,sowecannotmakenewclustermembershippro- Fuzzyclustering babilities outofit. Thismayhappen ifthe(weighted) graphdefined byconnecting genepairs withnon-zero entries inF(k) isnolonger Tokeeptrackofthegeneclusters,independentofthe(varying)numberof stronglyconnected(Perron-Frobeniustheorem). clustersortheirlabeling,weconsiderfunctions Tocomputeoneormoreofthelargesteigenvalues andeigenvectors for 1 ifgeneiandjbelongtothesamegeneclusterinC large sparse matrices such as F and its reductions F(k) we use efficient fij(C)= (5) sparse matrix routines, such as for instance implemented in the Matlab® (0 otherwise functioneigs. Ingeneral,theposteriordistribution(1)isnotconcentratedonasinglecoclu- Datasets steringandthematrixF = (E(fij))ofexpectation values (seeeq. (3)) consistsofprobabilitiesbetween0and1.Toquantifythisfuzzyness,weuse Weusethreelargecompendiaofgeneexpressiondataforbuddingyeast: 3 A.Joshietal 1. Gaschetal.(2000)dataset:expressionin173stressrelatedconditions. 1 2. Hughes et al. (2000) data set: compendium of expression profiles correspondingto300diversemutationsandchemicaltreatments. 0.8 3. Spellmanetal.(1998)dataset: 77conditions foralphafactorarrest, elutriation,andarrestofacdc15temperature-sensitivemutant. ure s Weselectthegenespresentinallthreedatasets(6052genes)and,tobeas ea 0.6 m unbiasedaspossible, nofurtherpostprocessingisdone.WeuseSynTReN n n(VuamnbdeernofBcuolcnkdeitieotnasl.f,o2r0a06sy)nttohegteicnetrraatnescsrimiptuiloanterdegdualtaatosreytsnwetiwthovrkarywiintgh elatio 0.4 1000genes(seealsoSupplementaryInformation). Corr Functionalcoherence 0.2 Toestimatetheoverallbiologicalrelevanceoftheclustersweuseamethod whichcalculatesthemutualinformationbetweenclustersandGOattributes 0 (GibbonsandRoth,2002).ForeachGOslimattribute, wecreateacluster- 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 attributecontingencytablewhererowsareclustersandcolumnsareattribute Number of Iterations status (‘Yes’ if the gene possesses the attribute, ‘No’ if it is not known whetherthegenepossessestheattribute). Thetotal mutualinformation is Fig.1. TraceplotofthecorrelationmeasureρbetweentwodifferentGibbs definedasthesumofmutualinformationsbetweenclustersandindividual samplerrunsasafunctionofthenumberofiterations, forasmalldataset GOattributes: (100genes,10conditions,topcurve)andalargedataset(1000genes,173 MI= H(C)+H(A)−H(C,A) (7) conditions,bottomcurve).BothdatasetsaresubsetsoftheGaschetal.data set. XA whereCisaclusteringofthegenes,AisaGOattributeandHisShannon’s entropy,H =− ipilog(pi),andthepiareprobabilitiesobtainedfrom detailatasingleGibbssamplerrun.Itturnsoutthatthecorrelation thecontingencytables. P measure between two successive iterations reaches 1 very rapidly andremainsunchanged afterwards(SeeSupplementary Figure2). 3 RESULTSANDDISCUSSION Sinceeachiterationinvolvesalargenumberofsamplingsteps(i.e., ConvergenceoftheGibbssampleralgorithm alargenumberofpossibleconfigurationchanges),thisimpliesthat Westudyconvergenceusingthetestfunctionsf whichindicateif theGibbssamplerveryrapidlyfindsalocalmaximumoftheposte- ij geneiandjareclusteredtogetherornot(seeeq.(5)intheMethods) riordistributionfromwhichitcannolonger escape. Weconclude and compute thecorrelation measure ρ between different runs for thattheposteriordistributionissupportedonmultiplelocalmaxima thissetoffunctions(seeeq.(4)intheMethods).Inadditiontothe whichoverlaponlypartially,andwithvalleysinbetweenthatcan- correlation measure, wealso compute theentropy measure H notbecrossedbytheGibbssampler. Theselocalmaximaallhave fuzzy (seeeq.(6)intheMethods).Thisparametersummarizesthe‘shape’ approximately the same log-likelihood (see for instance the small oftheposteriordistribution:avalueof0correspondstohardcluste- varianceinFigure4below)andarethereforeallequallymeaningful. ringwhichimpliesthatthedistributioniscompletelysupportedona Theprobabilityratiobetween peaksandvalleysissolarge(expo- singlesolution,themorepositiveH is,themorethedistribution nential in the size of the data set) that an accurate approximation fuzzy issupportedonmultiplesolutions. to the posterior distribution is given by averaging over the local IntheanalysisbelowweusesubsetsfromtheGaschetal.dataset maximaonly.Thosecanbeuncoveredbyperformingmultipleinde- withavaryingnumberofgenesandconditionsandperformmultiple pendentruns,eachconvergingveryquicklyononeofthemaxima, Gibbssamplerrunswithalargenumberofiterations.Oneiteration and there is no need for special techniques to also sample in bet- involvesareassignmentofallgenesandallconditionsinallclusters, weenlocalmaxima. Thenumber of local maxima(Gibbssampler andhenceinvolvesN+M KsamplingstepsintheGibbssampler, runs)necessaryforagoodapproximationcanbeestimatedasfol- × whereN isthenumberofgenes,M thenumberofconditions,and lows.Weperform150independentGibbssamplerrunsandcompute Kthenumberofclustersatthatiteration(typicallyK √N). foreachthepairwisegene-geneclusteringprobabilitymatrixF (see ∼ First we consider a very small data set (100 genes, 10 conditi- Methods). For each k = 1,...,50, we take twonon-overlapping ons). WestarttwoGibbssamplerrunsinparallelandcomputethe setsof k solutionsand computetheaverageof theirpairwisepro- correlationmeasure ρat each iteration, see Figure1. In thiscase, bability matricesF. Then, we compute the correlation measure ρ ρapproaches itsmaximum value ρ = 1 inlessthan 5000 iterati- betweenthosetwoaverages.Thisisrepeatedseveraltimes,depen- onsandtheGibbssamplergeneratesawellmixingchainwhichcan dingonthenumberofnon-overlappingsetsthatcanbechosenfrom easilyexplorethewholespace.Non-zerovaluesoftheentropymea- thepoolof150solutions.Ifforagivenkthecorrelationisalways sureH (0.105 0.003)indicatethattheposteriordistribution 1,thenthereareatmostklocalmaxima.Figure2showsthatask fuzzy ± issupportedonmultipleclusteringsofthegenes. increases,thecorrelationquicklyreachesclosetothisperfectvalue NextweruntheGibbssampleralgorithmonadatasetwith1000 1.Thisimpliesthatthenumberoflocalmaximaisnottoolargeand genes and all 173 conditions. Unlike in the previous situation we agoodapproximationtotheposteriordistributioncanbeobtained observethatthecorrelationbetweentwoGibbssamplerrunssatura- inthiscasealreadywith10to20solutions.SupplementaryFigure1 teswellbelow1(seeFigure1).HencetheGibbssamplerdoesnot showsanexampleofhardclustersformedasaresultofasinglerun convergetotheposterior distributioninonerun. Wecangainfur- andfuzzyclustersformedbymergingtheresultof10independent therunderstandingforthelackofconvergencebylookinginmore runs. 4 1 The rapid convergence of the log-likelihood shows that the Gibbs samplerreachesthelocalmaximaveryquicklyandthelowvariance shows that the different local maxima are all equally likely. The 0.9 averageover 10runsoftheGOmutualinformationscore(seeeq. e (7) in the Methods) shows the same rapid convergence and small sur 0.8 variance(seeSupplementaryFigure6), implyingthatthedifferent a e m maximaarebiologicallyequallymeaningfulaccordingtothisscore. on 0.7 Thecorrelationbetweendifferentaveragesof10Gibbssamplerruns elati reaches0.85,avalueweconsiderhighenoughforagoodapproxi- Corr 0.6 mationof theposterior distribution. The other twodata setsshow preciselythesamebehavior(seeSupplementaryFigures4and5). 0.5 0.4 -40000 0 5 10 15 20 25 30 35 40 45 50 Number of cluster solutions merged -50000 -60000 Fig.2. Correlationmeasureρbetweendifferentaveragesofthesamenum- beroflocalmaximaforadatasetof1000genesand173conditions(subset ore -70000 c oftheGaschetal.dataset). d s -80000 o o h eli -90000 k In Figure 3, we keep the same 1000 genes and select an incre- ogli -100000 L asingnumber ofconditions. Asthedatasetincreases, theentropy -110000 measureH decreases, meaning theclustersbecomeincreasin- fuzzy glyhard.Simultaneously,thecorrelationmeasureρdecreasesfrom -120000 about0.85to0.55(seeSupplementaryFigure3).Weconcludethat -130000 thedepthofthevalleysbetweendifferentlocalmaximaoftheposte- 0 10 20 30 40 50 60 70 80 90 100 riordistributionincreaseswiththesizeofthedatasetanditbecomes Number of Iterations increasingly more difficult for the Gibbs sampler to escape from thesemaximaandvisitthewholespaceinonerun. Fig.4. Traceplotoftheaveragelog-likelihoodscoreandstandarddeviation for10GibbssamplerrunsfortheSpellmanetal.dataset. 0.045 0.04 Two-wayclusteringversusone-wayclustering 0.035 Our coclustering algorithm extends the CRC algorithm of Qin (2006) by also clustering the conditions for each cluster of genes e 0.03 ur (‘two-way clustering’), instead of assuming they are always inde- s ea 0.025 pendent(‘one-wayclustering’).Wecomparetheclusteringofgenes m py 0.02 forthethreeyeastdatasetsusingbothmethods, bycomputingthe ntro averagenumberofclustersinferred(K),theaveragelog-likelihood E 0.015 score and the average GO mutual information score for 10 inde- 0.01 pendent runsof eachalgorithm. TheresultsaretabulatedinTable 1 and 2. For all three data sets, both the log-likelihood score and 0.005 theGOmutualinformationscorearehigher(better)forourmethod. 0 TheincreaseinGOmutualinformationscoreisespeciallysignifi- 0 20 40 60 80 100 120 140 160 cantincaseoftheHughesetal.dataset.Thisdatasethasveryfew Number of Conditions overexpressedorrepressedvaluesandifeachconditionisconside- red independent, there are very few distinct profiles which results Fig.3. EntropymeasureHfuzzy fordatasetswith1000genesandvarying intheformationofveryfewclusters( 15for6052genes). Also numberofconditions(subsetsoftheGaschetal.dataset). ∼ clusteringtheconditionsgivesmoremeaningfulresultssincediffe- rentiallyexpressedconditionsformseparateclustersfromonelarge backgroundclusterofnon-differentiallyexpressedconditions. Analysisofwholegenomedatasets Forsimulateddatasets,clustersaredefinedassetsofgenessha- IfweruntheGibbssampleralgorithmonthethreewholegenome ring the same regulators in the synthetic regulatory network, and yeast data sets, we are in the situation where the algorithm very thetruenumberofclustersisknown.Hereweconsideragenenet- rapidlygetsstuckinalocalmaximum.InFigure4weplottheave- workwhosetopologyissubsampledfromanE.colitranscriptional rageBayesianlog-likelihoodscore(seeeq.(2)intheMethods)for network(VandenBulcke etal.,2006) with1000 genes, ofwhich 10 different Gibbs sampler runs for the Spellman et al. data set. 105transcriptionfactors,and286clusters.Fortwo-wayclustering, 5 A.Joshietal Table1. One-wayclustering,averagesfor10differentGibbssamplerruns. 350 300 Dataset Avg.K Avg.log-likelihoodscore Avg.MI 250 s Gaschetal. 52.9(2.6) −6.101(0.014)×105 1.771(0.031) er st Hughesetal. 14.9(0.5) 2.530(0.002)×106 0.588(0.044) Clu 200 Spellmanetal. 49.7(2.2) −7.183(0.037)×104 1.491(0.032) of er 150 b m u N 100 Table2. Two-wayclustering,averagesfor10differentGibbssamplerruns. 50 Dataset Avg.K Avg.log-likelihoodscore Avg.MI 0 0 50 100 150 200 250 300 350 400 450 500 Gaschetal. 84.5(2.5) −5.586(0.012)×105 1.912(0.033) Number of Conditions Hughesetal. 85.5(2.7) 2.798(0.004)×106 1.511(0.045) Spellmanetal. 65.4(4.2) −5.112(0.011)×104 1.612(0.032) Fig.5. Numberofgeneclustersforasimulateddatasetwith1000genes andavaryingnumberofconditions,fortwo-wayclustering(topdatapoints (×))andone-wayclustering(bottomdatapoints(+)) asweincreasethenumber of conditionsinthesimulateddataset, moreclustersareformedandthenumberofclusterssaturatesclose byaveraging over10different Gibbssampler runs. FortheGasch tothetruenumber(seeFigure5).Forone-wayclustering,addition etal. andHughesetal. datasets, fullfuzzyclusteringisachieved ofconditionsdoesnotaffecttheinferrednumberofclusterswhich with500fuzzyclusters(all6052geneshavetotalassignmentpro- isanorderofmagnitudesmallerthanthetruenumber (seeFigure babilityPkpik >0.98).FortheSpellmanetal.datasetthesecond 5).Fortwo-wayclustering,duetotheclusteringofconditions, the stoppingcriteriumismetafterproducing321fuzzyclusters. numberofmodelparametersisreduced,andgreaterstatisticalaccu- Ingeneral,weobservethatthealgorithmfirstproducesonevery racycanbeachieved, evenwhenthenumber ofgenesinacluster largefuzzy cluster corresponding toan average expression profile becomessmall. thatalmostallgenescanrelateto.Thisclusterisofnointerestfor Thecorrelationmeasureρbetweentrueclustersandinferredclu- further analysis. Then it produces a number of fuzzy clusters of stersalsoshowsahighervaluefortwo-wayclusteringoverone-way varying size which show interesting coexpression profiles and are (SupplementaryFigure8). usefulforfurtheranalysis. Forthethreedatasetsconsideredhere, Unlike for simulated data sets, the inferred number of clusters thisnumber isaround 100, consistent withtheaveragenumber of does not depend much upon the number of conditions for real clustersindifferent Gibbs sampler runs (see Table2). Theremai- biological data sets (Supplementary Figure 7), i.e., even if more ning fuzzy clusters are typically very small and consist mostly of conditionsareadded,thealgorithmdoesnotgeneratemoreclusters. noise. Liketheveryfirstcluster, theyareofnointerestforfurther Thisisbecauseinsimulateddata,everyadditionofaconditionadds analysis. newinformation,butforrealdatasetsthatmightnotbethecase.In Since every gene belongs toevery cluster, we use a probability ordertogetthetrueclustersfromtheexpressiondata,wedonotonly cutoff to remove from each cluster the genes which belong to it needmoreconditionsbutalsothateachnew conditioncontributes withaverysmallprobability.Thesmallerthecutoff,themoregenes information different from the information already available from belongtoacluster,whichresultsintomorefuzzyclustersandvice thepreviousconditions. Thismightbeareasonwhythealgorithm versa.Table3showsthetotalnumberofgenesassignedtoatleast clusters6052genesinonly 80clusters(seeTable2). one fuzzy cluster with different cutoff values and in brackets the ∼ numberofgenesassignedtoatleasttwofuzzyclusters. Fuzzyclusters The goal of merging different Gibbs sampler solutions and for- Ouralgorithmreturnsasummaryoftheposteriordistributioninthe mingfuzzyclustersistoextractadditionalinformationoutofadata formofagene-genematrixwhoseentriesaretheprobabilitiesthata setthatisnotcapturedbyasinglehardclusteringsolution.Thiscan pairofgenesisclusteredtogether.Toconvertthesepairwiseproba- beachievedintwoways.First,byobtainingtightclustersoffewbut bilitiesbacktoclustersweuseagraphspectralmethodasexplained highly coexpressed genes with a high probability cutoff. Second, in the Methods. The method produces fuzzy overlapping clusters by characterizing genes which belong to multiple clusters with a whereeachgeneibelongstoeachfuzzyclusterk withaprobabi- significantprobability. lity p , such that P p = 1. The size of a fuzzy cluster k is Forallthreedatasets,ataprobabilitycutoffof0.5,wegetasub- ik k ik defined as P p . The algorithm iteratively produces new fuzzy setofgeneswhichbelongtoonlyoneclusterwithhighprobability. i ik clusters until all the information in the pairwise matrix is conver- Table3 showsthat each dataset retainsat least20% of itsgenes. tedintoclusters(1st stoppingcriterium, seeMethods), oruntilthe Thesearesetsofstronglycoexpressedgeneswhichclustertogether mathematicalconditionsunderlyingthealgorithmceasetohold(2nd inalmosteveryhardclustersolution. Ribosomal genesshow such stoppingcriterium,seeMethods).Weappliedthealgorithmtopair- astrongcoexpression patterninallthethreedatasetswheremost wiseprobability matrices for each of the three data sets, obtained genesbelongtothisclusterwithaprobabilitycloseto1(seeFigure 6 Table 3. Number of genes clustered and number of genes ERP2,RET2,RET3,SEC13,SEC21,SEC24andothers.Cluster34 belonging to multiple clusters with different membership containsgenesrepressedundernitrogenstressandstationarystate. probabilitycutoffvalues. 20percentofthegenesincluster27alsobelongtocluster34with a significant membership. These include genes encoding for ER vesicle coat proteins like RET2, RET3, SEC13 and others which Dataset 0.1 0.3 0.5 areinduced under DTTstress aswell as repressed under nitrogen stress and stationary state. Also RIO1, an essential serine kinase, Gaschetal. 6045(4356) 4062(344) 1781(0) belongstotwoclusterswithasignificantprobability.Itclusterswith Hughesetal. 6052(4554) 3959(34) 2254(0) Spellmanetal. 6052(5187) 3158(139) 1255(0) genesinvolvedinribosomalbiogenesisandassembly(Gaschetal. data cluster 3) as well as with genes functioning as generators of precursormetabolitesandenergy(Gaschetal. datacluster7). We findsimilarobservationsfortheHughesetal. andSpellmanetal. datasets.GenesCLN1,CLN2andotherDNAsynthesisgeneslike 6). At least 75% of all the genes in cluster 2 (Gasch et al. data), CLB6 which are known to be regulated by SBF during S1 phase cluster3(Hughesetal.data)andcluster2(Spellmanetal.data)are (Kochetal.,1996)belongtocluster19(Spellmanetal.data).They locatedinribosome. alsobelongwithsignificantprobabilitytocluster4(Spellmanetal. data).Morethanonethirdofthegenesincluster4arepredictedto becellcycleregulatedgenes. CONCLUSION Wehavedevelopedanalgorithmtosimultaneouslyclustergenesand conditionsandsamplesuchcoclusteringsfromaBayesianprobabi- listicmodel.Forlargedatasets,themodelissupportedonmultiple equivalent local maxima. The average of these local maxima can berepresentedbyamatrixofpairwisegene-geneclusteringproba- bilitiesandwehaveintroducedanewmethodforextractingfuzzy, Fig.6. RibosomalgenesformatightclusterintheHughesetal.dataset. overlappingclustersfromthismatrix.Thismethodisabletoextract (Duetospaceconstraintsonlythefirstfewgenesareshown;forthecomplete information out of the data set that is not available from a single, figure,seetheSupplementaryInformation.) hardclustering. Localbutverystrongcoexpressionpatternscanalsobedetected FUNDING byourmethod.Cluster15oftheGaschetal.datasetconsistsofonly EarlyStage Marie Curie Fellowship to A.J.; Postdoctoral Fellow- 4genesclusteredtogetherwithprobability1(seeFigure7).These shipoftheResearchFoundationFlanders(Belgium)toT.M. fourgenes,GAL1,GAL2,GAL7,andGAL10,areenzymesinthe galactosecatabolicpathwayandrespondtodifferentcarbonsources duringsteadystate. Theyarestronglyupregulatedwhengalactose ACKNOWLEDGEMENT isusedasacarbonsource(2nd experimentclusterinFigure7)and Wethank Steven Maere and Vanessa Vermeirssen for helpful dis- stronglydownregulatedwithanyothersugarasacarbonsource(1st cussions. experimentclusterinFigure7).Ineveryhardclustersolution,these 4genesareclusteredtogetheralongwithothergenes. Bymerging REFERENCES thesehardclustersolutionstoformfuzzyclusters,wegetatightbut moremeaningfulclusterwithonly4genes. Ashburner, M., Ball, C.A., Blake, J. A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis,S.,Matese,J.C.,Richardson,J.E.,Ringwald,M.,Rubin, G.M.,andSherlock,G.(2000). Geneontology:toolfortheuni- ficationofbiology.TheGeneOntologyConsortium. NatGenet, 25,25–29. Dahl,D.B.(2006). Model-basedclusteringforexpressiondatavia aDirichletprocessmixturemodel. InK.-A.Do, P.Mu¨ller, and Fig.7. FourgenesGAL1, GAL2, GAL7andGAL10formatightcluster M.Vannucci,editors,Bayesianinferenceforgeneexpressionand showingconditionalcoexpressionintheGaschetal.dataset. proteomics,pages201–218.CambridgeUniversityPress. Eisen,M.B.,Spellman,P.T.,Brown,P.O.,andBotstein,D.(1998). Table 3 shows that many genes belong to two or more clusters Clusteranalysisanddisplayofgenome-wideexpressionpatterns. withasignificantprobability.FortheGaschetal.dataset,wefind ProcNatlAcadSciUSA,95(25),14863–14868. similarobservationsasin(GaschandEisen,2002).Cluster27con- Fraley,C.andRaftery,A.E.(2002).Model-basedclustering,discri- tains genes localized in endoplasmic reticulum (ER) and induced minantanalysis,anddensityestimation.JAmerStatisticalAssoc, underdithiothreitol(DTT)stresslikeFKB2,JEM1, ERD2,ERP1, 97,611–631. 7 A.Joshietal Gasch, A. P. and Eisen, M. B. (2002). Exploring the conditio- learningalgorithmsusingsimulateddata. BMCBioinformatics, nalcoregulationofyeastgeneexpressionthroughfuzzyk-means 8Suppl2,S5. clustering. GenomeBiol,3(11),RESEARCH0059. Qin,Z.S.(2006).Clusteringmicroarraygeneexpressiondatausing Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., weighted Chinese restaurant process. Bioinformatics, 22(16), Eisen, M. B., Storz, G., Botstein, D., andBrown, P.O.(2000). 1988–1997. Genomic expression programs in the response of yeast cells to Redner, R.A.andWalker, H.F.(1984). Mixturedensities, maxi- environmentalchanges. MolBiolCell,11(12),4241–4257. mum likelihood, and the EM algorithm. SIAM Review, 26(2), Gibbons, F. D. and Roth, F. P. (2002). Judging the quality of 195–239. geneexpression-basedclusteringmethodsusinggeneannotation. Segal,E.,Shapira,M.,Regev,A.,Pe’er,D.,Botstein,D.,Koller,D., GenomeRes,12(10),1574–1581. andFriedman,N.(2003). Modulenetworks: identifyingregula- Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., tory modules and their condition-specific regulators from gene Stoughton, R., Armour, C. D., Bennett, H.A., Coffey, E., Dai, expressiondata. NatGenet,34,166–167. H.,He,Y.D.,Kidd,M.J.,King,A.M.,Meyer,M.R.,Slade,D., Spellman,P.T.,Sherlock,G.,Zhang,M.Q.,Iyer,V.R.,Anders,K., Lum, P.Y.,Stepaniants, S.B.,Shoemaker, D.D.,Gachotte, D., Eisen,M.B.,Brown,P.O.,Botstein,D.,andFutcher,B.(1998). Chakraburtty,K.,Simon,J.,Bard,M.,andFriend,S.H.(2000). Comprehensive identification of cell cycle-regulated genes of Functional discovery via a compendium of expression profiles. theyeastSaccharomycescerevisiaebymicroarrayhybridization. Cell,102(1),109–126. MolBiolCell,9(12),3273–3297. Inoue,K.andUrahama,K.(1999). Sequentialfuzzyclusterextrac- Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., tion by a graph spectral method. Pattern Recogn. Lett., 20(7), Dmitrovsky, E., Lander, E. S., and Golub, T. R. (1999). Inter- 699–705. preting patterns of gene expression with self-organizing maps: Koch, C., Schleiffer, A., Ammerer, G., and Nasmyth, K. methods and application to hematopoietic differentiation. Proc (1996). Switchingtranscriptiononandoffduringtheyeastcell NatlAcadSciUSA,96(6),2907–2912. cycle:Cln/Cdc28kinasesactivateboundtranscriptionfactorSBF Tavazoie,S.,Hughes,J.D.,Campbell,M.J.,Cho,R.J.,andChurch, (Swi4/Swi6)atstart,whereasClb/Cdc28kinasesdisplaceitfrom G. M. (1999). Systematic determination of genetic network thepromoterinG2. GenesDev,10(2),129–141. architecture. NatGenet,22(3),281–285. Liu, J. S. (2002). Monte Carlo strategies in scientific computing. Van den Bulcke, T., Van Leemput, K., Naudts, B., van Remor- Springer. tel, P., Ma, H., Verschoren, A., De Moor, B., and Marchal, K. Medvedovic, M. and Sivaganesan, S. (2002). Bayesian infi- (2006). SynTReN:ageneratorofsyntheticgeneexpressiondata nitemixturemodelbasedclusteringofgeneexpressionprofiles. for design and analysis of structure learning algorithms. BMC Bioinformatics,18(9),1194–1206. Bioinformatics,7,43. Medvedovic,M.,Yeung,K.Y.,andBumgarner,R.E.(2004).Baye- Yeung, K. Y., Fraley, C., Murua, A., Raftery, A. E., and Ruzzo, sian mixture model based clustering of replicated microarray W.L.(2001). Model-based clusteringanddatatransformations data. Bioinformatics,20(8),1222–1232. forgeneexpressiondata. Bioinformatics,17(10),977–987. Michoel, T., Maere, S., Bonnet, E., Joshi, A., Saeys, Y., Vanden Bulcke,T.,VanLeemput,K.,vanRemortel,P.,Kuiper,M.,Mar- chal,K.,andVandePeer,Y.(2007). Validatingmodulenetwork 8

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.