


Motif Discovery through Predictive Modeling of Gene Regulation

Manuel Middendorf¹, Anshul Kundaje², Mihir Shah², Yoav Freund²,⁴,⁵, Chris H. Wiggins³,⁴, and Christina Leslie²,⁴,⁵

¹ Department of Physics
² Department of Computer Science
³ Department of Applied Mathematics
⁴ Center for Computational Biology and Bioinformatics
⁵ Center for Computational Learning Systems
Columbia University, New York, NY 10027
[email protected]
http://www.cs.columbia.edu/compbio/medusa

February 6, 2008

arXiv:q-bio/0701021v1 [q-bio.GN] 15 Jan 2007

Abstract

We present MEDUSA, an integrative method for learning motif models of transcription factor binding sites by incorporating promoter sequence and gene expression data. We use a modern large-margin machine learning approach, based on boosting, to enable feature selection from the high-dimensional search space of candidate binding sequences while avoiding overfitting. At each iteration of the algorithm, MEDUSA builds a motif model whose presence in the promoter region of a gene, coupled with activity of a regulator in an experiment, is predictive of differential expression.
In this way, we learn motifs that are functional and predictive of regulatory response rather than motifs that are simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model of the transcriptional control logic that can predict the expression of any gene in the organism, given the sequence of the promoter region of the target gene and the expression state of a set of known or putative transcription factors and signaling molecules. Each motif model is either a k-length sequence, a dimer, or a PSSM that is built by agglomerative probabilistic clustering of sequences with similar boosting loss. By applying MEDUSA to a set of environmental stress response expression data in yeast, we learn motifs whose ability to predict differential expression of target genes outperforms motifs from the TRANSFAC dataset and from a previously published candidate set of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed binding sites associated with environmental stress response from the literature.

1 Introduction

One of the central challenges in computational biology is the elucidation of mechanisms for gene transcriptional regulation using functional genomic data. The problem of identifying binding sites for transcription factors in the regulatory sequences of genes is a key component in these computational efforts. While there is a vast literature on this subject, only a few different conceptual approaches have been tried, and each of these standard approaches has its limitations.

The most widely-used methodology for computational discovery of putative binding sites is based on clustering genes (usually by similarity of gene expression profiles, sometimes combined with annotation data) and searching for motif patterns that are overrepresented in the promoter sequences of these genes, in the belief that they may be coregulated. Popular motif discovery programs in this paradigm include MEME [1], Consensus [2], Gibbs Sampler [3], AlignACE [4] and many others. The cluster-first methodology has several drawbacks.
First, it is not always true that genes with correlated gene expression profiles are in fact coregulated genes whose regulatory regions contain common binding sites. Moreover, by focusing on coregulated genes, one fails to consider more complicated combinatorial regulatory programs and the overlapping regulatory pathways that can affect different sets of genes under different conditions. Recently, more sophisticated graphical models for gene expression data have been introduced to try to partition genes into "transcriptional modules" [5] (clusters of genes that obey a common transcriptional program depending on a small number of regulators) or to learn overlapping clusters of this kind [6]. These graphical model approaches use the abstraction of modules to give an interpretable representation of putative relationships between genes and to suggest biological hypotheses. One expects that using these more complex clustering algorithms as a preprocessing step for motif discovery would lead to improved identification of true binding sites; however, it is difficult to assess how much of an advantage one might obtain.

Another well-established motif discovery approach is the innovative REDUCE method [7] and related algorithms [8, 9]. REDUCE avoids the cluster-first methodology by considering the genome-wide expression levels given by a single microarray experiment, and it discovers sequences whose presence in promoter sequences correlates with differential expression. Since REDUCE uses linear regression to iteratively identify putative binding sites, it must enforce strict tests of statistical significance to avoid overfitting in a large parameter space corresponding to the set of all possible sequence candidates. Therefore, REDUCE can find the strongest signals in a dataset but will not attempt to find more subtle sites that affect fewer genes. Since the algorithm fits parameters independently for each microarray experiment, the issue of condition-specific regulation enters the analysis only as a post-processing step rather than through simultaneous training from multiple conditions.
In this paper, we introduce a new motif discovery algorithm called MEDUSA (Motif Element Discrimination Using Sequence Agglomeration) that learns putative binding sites associated with condition-specific regulation in a large gene expression dataset. MEDUSA works by extracting binding site motifs that contribute to a predictive model of gene regulation. More specifically, MEDUSA builds motif models whose presence in the promoter region of a gene, together with the activity of regulators in an experiment, is predictive of differential expression. Like REDUCE, MEDUSA avoids the cluster-first methodology and builds a single regulatory model to explain the response of all target genes. However, unlike REDUCE, MEDUSA learns from multiple and diverse gene expression experiments, using the expression states of a set of known regulators to represent condition-specific regulatory conditions. Moreover, MEDUSA is based on a classification approach (using large-margin machine learning) rather than linear regression, to avoid overfitting in the high-dimensional search space of candidate binding sequences. In addition to discovering binding site motifs, MEDUSA produces a model of the condition-specific transcriptional control logic that can predict the expression of any gene, given the gene's promoter sequence and the expression state of a set of known transcription factors and signaling molecules.

The core of MEDUSA is a boosting algorithm that adds a binding site motif (coupled with a regulator whose activity helps predict up/down regulation of genes whose promoters contain the motif) to an overall gene regulation model at each boosting iteration. Each motif model is either a k-length sequence (or "k-mer"), a dimer, or a PSSM. The PSSMs are generated by considering the most predictive k-mer features (Fig. 2) selected at a given round of boosting that are associated with a common regulator; we then perform agglomerative probabilistic clustering of these k-mers into PSSMs, and we select from all the candidate PSSMs seen during clustering the one that minimizes boosting loss (Fig. 2).
In experiments on a set of environmental stress response expression data in yeast, we learn motifs together with regulation models that achieve accurate prediction of up/down regulation of target genes in held-out experiments. In fact, we show that the performance of the learned motifs for prediction of differential expression in test data is stronger than the performance of motifs from the TRANSFAC dataset or from a previously published candidate set of PSSMs. For these environmental stress response experiments, we also show that MEDUSA retrieves many experimentally confirmed binding sites from the literature.

We first introduced the idea of predictive modeling of gene regulation with the GeneClass algorithm [10]. However, GeneClass uses a fixed set of candidate motifs as an input to the algorithm and cannot perform motif discovery. We note also that there have been previous efforts to incorporate motif discovery in an integrative model for sequence and expression data using the probabilistic graphical model framework [11]. This graphical model approach again uses the abstraction of "modules" to learn sets of motifs associated with clusters of genes, giving a high-level modular representation of gene regulation. As explained above, MEDUSA does not produce an abstract module representation. However, it has two advantages over graphical model methods. First, MEDUSA uses a large-margin learning approach that helps to improve the generalization of the learned motifs and regulation model, and we can evaluate prediction accuracy on held-out experiments to assess our confidence in the model. Second, training graphical models requires special expertise to avoid poor local minima in a complex optimization problem, while MEDUSA can be run "out-of-the-box". Code for MEDUSA is publicly available and can be downloaded from the supplementary website for the paper, http://www.cs.columbia.edu/compbio/medusa.
2 Methods

2.1 Learning Algorithm

MEDUSA learns binding site motifs together with a predictive gene regulation model using a specific implementation of AdaBoost, a general discriminative learning algorithm proposed by Freund and Schapire [12]. AdaBoost's basic idea is to iteratively apply variants of a simple, weakly discriminative learning algorithm, called the weak learner, to different weightings of the same training set. The only requirement of the weak learner is that it predicts the class label of interest with greater than 50% accuracy. At each iteration, weights are recalculated so that examples which were misclassified at the previous iteration are more highly weighted. Finally, all of the weak prediction rules are combined into a single strong rule using a weighted majority vote. As discussed in [13], boosting is a large-margin classification algorithm, able to learn from a potentially large number of candidate features while maintaining good generalization error (that is, without over-fitting the training data).

The discretization of expression data (see Sect. 3.2) into up- and down-regulated expression levels allows us to formulate the problem of predicting regulatory response of target genes as the binary classification task of learning to predict up and down examples. Rather than viewing each microarray experiment as a training example, MEDUSA considers all genes and experiments simultaneously and treats every gene-experiment pair as a separate instance, dramatically increasing the number of training examples available. For every gene-experiment example, the gene's expression state in the experiment (up- or down-regulation) gives the output label $y_{ge} = \pm 1$. As we explain below (see Sect. 3.2), positive and negative examples correspond to statistically significant up- and down-regulated expression levels; examples with baseline expression levels are omitted from training.

The inputs to the learner are (i) the promoter sequences of the target genes and (ii) the discretized expression levels of a set of putative regulator genes.
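As a rough illustration of this framing, the following Python sketch builds one training example per (gene, experiment) pair, drops baseline pairs, and computes exponential boosting weights of the form $w_{ge} = e^{-y_{ge} F(g,e)}$. The function names, the dictionary layout, and the toy data are assumptions made for illustration only; they are not taken from the MEDUSA code, which is implemented in MATLAB.

```python
import math

def build_examples(expr):
    """expr maps gene -> list of discretized states (+1, -1, 0), one per experiment.
    Returns [((gene, experiment_index), label), ...], omitting baseline (0) entries."""
    return [((g, e), y)
            for g, states in expr.items()
            for e, y in enumerate(states)
            if y != 0]

def example_weights(examples, score):
    """Boosting weight w_ge = exp(-y_ge * F(g, e)) for each example.
    `score` is a callable returning the current prediction score F(g, e)."""
    return {key: math.exp(-y * score(key)) for key, y in examples}

# Toy data (hypothetical): two genes, three experiments.
expr = {"geneA": [+1, 0, -1], "geneB": [0, 0, +1]}
examples = build_examples(expr)

# Before any rule is learned, F = 0 and every example has weight 1.
weights0 = example_weights(examples, lambda key: 0.0)

# After a hypothetical rule that scores +0.5 everywhere, the misclassified
# down-regulated example gains weight and the correct ones lose weight.
weights1 = example_weights(examples, lambda key: 0.5)
```

Treating every non-baseline pair as its own example is what lets a single model be trained across all experiments at once, rather than one model per experiment.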
The sequence data is represented only via occurrence or non-occurrence of a sequence element or motif. A full discussion of how MEDUSA determines a set of sequence and motif candidates to be considered at each round of boosting is given in Sect. 2.2. Let the binary matrix $M_{\mu g}$ indicate the presence ($M_{\mu g} = 1$) or absence ($M_{\mu g} = 0$) of a motif $\mu$ in the promoter sequence of gene $g$, and let the binary matrices $P^{\sigma}_{\pi e}$ indicate the up-regulation ($\sigma = +$) or down-regulation ($\sigma = -$) of a regulator $\pi$ in experiment $e$ ($P^{\sigma}_{\pi e} = 1$ if regulator $\pi$ is in state $\sigma$ in experiment $e$, and $P^{\sigma}_{\pi e} = 0$ otherwise). Our weak rules split the gene-experiment examples in the training data by asking questions of the form '$M_{\mu g} P^{\sigma}_{\pi e} = 1$?'; i.e., 'Is motif $\mu$ present, and is regulator $\pi$ in state $\sigma$?'. In this way, each rule introduced corresponds to a putative interaction between a regulator and some sequence element in the promoter of the target gene that it regulates.

The weak rules are combined by weighted majority vote using the structure of an alternating decision tree [14, 10]. An example is given in Fig. 1. The weak rules are shown in rectangles. Their associated weights, indicating the strength of their contribution to the majority vote, are shown in ovals. If the {motif presence, regulator state} condition for a particular rule holds in the example considered, the weight of the rule is added to the final prediction score. The weight can be either positive or negative, contributing to up- or down-regulation respectively. Rules that appear lower in the tree are conditionally dependent on the rules in ancestor nodes. For example, in Fig. 1, only if USV1 is up-regulated and both motifs GTACGGA and AGGGAT are present is the score 0.285 added to the prediction score.

Figure 1: Example of an alternating decision tree. The rectangles represent weak rules, learned by MEDUSA, that split gene-experiment examples in the training data. Examples for which the condition holds follow the path further down the tree ('y') and have their scores incremented by the prediction score given in the ovals. The final prediction is the sum of all scores that the example reaches.
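The scoring procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not MEDUSA's implementation: the nested-dictionary tree layout and the function name `adt_score` are assumptions, and the tree itself is hypothetical. Only the motifs GTACGGA and AGGGAT, the regulator USV1, and the weight 0.285 come from the Fig. 1 example; the parent weight 0.1 and the regulator attached to the child rule are invented for the sketch.

```python
def adt_score(nodes, motifs, reg_states):
    """Sum the weights of all rules reached by one gene-experiment example.

    nodes      : list of rule nodes at the current tree position
    motifs     : set of motifs present in the gene's promoter
    reg_states : dict regulator -> +1 / -1 / 0 in this experiment
    A rule fires when its motif is present AND its regulator is in the required
    state (the M * P = 1 question); child rules are only reached if it fires."""
    total = 0.0
    for node in nodes:
        fires = (node["motif"] in motifs
                 and reg_states.get(node["regulator"]) == node["state"])
        if fires:
            total += node["weight"]
            total += adt_score(node["children"], motifs, reg_states)
    return total

# Hypothetical tree: weight 0.1 and the child's regulator are made up.
tree = [{
    "motif": "GTACGGA", "regulator": "USV1", "state": +1, "weight": 0.1,
    "children": [{
        "motif": "AGGGAT", "regulator": "USV1", "state": +1, "weight": 0.285,
        "children": [],
    }],
}]

both_motifs = adt_score(tree, {"GTACGGA", "AGGGAT"}, {"USV1": +1})
one_motif = adt_score(tree, {"GTACGGA"}, {"USV1": +1})
usv1_down = adt_score(tree, {"GTACGGA", "AGGGAT"}, {"USV1": -1})
```

The sign of the returned score would give the predicted direction of regulation and its absolute value the confidence; a child rule contributes only when every ancestor rule on its path has fired.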
The tree structure is thus able to reveal combinatorial interactions between regulators and/or motifs. The sign of the final prediction score gives the prediction, and the absolute value of the score indicates the level of confidence. In this work, we consider both sequences and position-specific scoring matrices (PSSMs) (an example is shown in the lower right node of Fig. 1) as putative motifs (see Sect. 2.2).

Each iteration of the boosting algorithm results in the addition of a new node (corresponding to a new weak rule) to the tree. The weak rule and its position in the tree at which it is added are chosen by minimizing the boosting loss over all possible combinations of motifs, regulators, and regulator-states, and over all possible positions ("preconditions") in the current tree. A pseudo-code description is given in Fig. 2.

The implementation uses efficient sparse matrix multiplication in MATLAB, exploiting the fact that our motif-regulator features are outer products of motif occurrence vectors and regulator expression vectors, and allows us to scale up to significantly larger datasets than in [10].

Definitions:
    $\hat{c}$ = precondition associated with a specific position in the tree
    $c_{\mu\pi\sigma}$ = weak rule associated with motif $\mu$ and regulator $\pi$ in state $\sigma$
    $w_{ge}$ = weight of example $(g,e)$
    $W[c(g,e)] = \sum_{c(g,e)=1} w_{ge}$, for a given condition $c$
    $\neg c$ = not $c$
    $Z(\hat{c},\mu,\pi,\sigma)$ = boosting loss $= W[\neg\hat{c}] + 2\sqrt{W[\hat{c} \wedge c_{\mu\pi\sigma}]\, W[\hat{c} \wedge \neg c_{\mu\pi\sigma}]}$
    $y_{ge}$ = label of example $(g,e)$
    $T$ = total number of boosting iterations
    $F_t(g,e)$ = prediction function at iteration $t$
    $\alpha_t$ = weight of weak rule $t$ contributing to the final prediction score

Initialization:
    $F_0(g,e) = 0$, for all $(g,e)$

Main loop:
    for $t = 1 \ldots T$
        $w_{ge} = e^{-y_{ge} F_{t-1}(g,e)}$
        call Hierarchical Motif Clustering (Sec. 2.2); get a set of proposed PSSMs.
        minimize boosting loss: $c^* = \arg\min_{\hat{c},\mu,\pi,\sigma} Z(\hat{c},\mu,\pi,\sigma)$
        calculate weight of the new weak rule $c^*$: $\alpha_t = \frac{1}{2} \ln \frac{W[c^* \wedge (y_{ge} = +)]}{W[c^* \wedge (y_{ge} = -)]}$
        add new node $c^*$ with weight $\alpha_t$ to the tree
        $F_t(g,e) = F_{t-1}(g,e) + \alpha_t c^*(g,e)$
    end for

    $\mathrm{sign}(F_T(g,e))$ = prediction for example $(g,e)$
    $|F_T(g,e)|$ = prediction confidence for $(g,e)$

Figure 2: Pseudo-code description of the learning algorithm.

2.2 Hierarchical Motif Clustering

At each boosting iteration, MEDUSA considers all occurrences of k-mers ($k = 2, 3, \ldots, 7$) and dimers with a gap of up to 15 bp (see Sect. 3.4) in the promoter sequence of each gene as candidate motifs. Since slightly different sequences might in fact be instances of binding sites for the same regulator, MEDUSA performs a hierarchical motif clustering algorithm to generate more general candidate PSSMs as binding site models. The motif clustering uses k-mers and dimers associated with low boosting loss as a starting point to build PSSMs: these sequences are viewed as seed PSSMs, and then the algorithm proceeds by iteratively merging similar PSSMs, as described below. The generated PSSMs are then considered as additional putative motifs for the learning algorithm.

A position-specific scoring matrix (PSSM) is represented by a probability distribution $p(x_1, x_2, \ldots, x_n)$ over sequences $x_1 x_2 \ldots x_n$, where $x_i \in \{A, C, G, T\}$. The emission probabilities are assumed to be independent at every position such that $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_i(x_i)$. For a given input sequence the PSSM returns a log-odds score $S = \sum_{i=1}^{n} \ln\left(p_i(x_i)/p_{bg}(x_i)\right)$ with respect to background probabilities $p_{bg}$. A score threshold can then be chosen to define whether the input sequence is a hit or not.

When comparing two PSSMs, we allow possible offsets between the two starting positions. In order to give them the same lengths, we pad either the left or right ends with the background distribution. We then define a distance measure $d(p,q)$ as the minimum over all possible position offsets of the JS entropy [15] between two PSSMs $p$ and $q$:
\[
d(p,q) \equiv \min_{\text{offsets}} \left[ w_1 D_{KL}(p \,\|\, w_1 p + w_2 q) + w_2 D_{KL}(q \,\|\, w_1 p + w_2 q) \right],
\]

where $D_{KL}$ is the Kullback-Leibler divergence [15]. By using $p(x_1 \ldots x_n) = \prod_{i=1}^{n} p_i(x_i)$ and $\sum_{x_i} p_i(x_i) = 1$ (and the analogous equations for $q$) one can easily show that $D_{KL}(p \,\|\, q) = \sum_{i=1}^{n} D_{KL}(p_i \,\|\, q_i)$. The relative weights of the two PSSMs, $w_1$ and $w_2$, are here defined as $w_{1,2} = N_{1,2}/(N_1 + N_2)$, where $N_1$, $N_2$ are the numbers of target genes for the given PSSM. Note that this distortion measure is not affected by adding more "padded" background elements either before or after the PSSM. Our merge criterion is similar to the one used in the agglomerative information bottleneck algorithm [16], though we also consider offsets in our merges.

At every boosting iteration, we first find the weak rule $c_{tmp}$, among all possible combinations of regulators, regulator-states and sequence motifs (k-mers and dimers), that minimizes boosting loss. The 100 motifs with lowest loss appearing with the same regulator, regulator-state, and precondition as in $c_{tmp}$ are then input to the hierarchical clustering algorithm. Sequence motifs can be regarded as PSSMs with 0/1 emission probabilities, smoothed by background probabilities. By iteratively joining the PSSMs with smallest $d(p,q)$, the clustering proposes a set of 99 PSSMs from various stages of the hierarchy. At every merge of two PSSMs, the score threshold associated with the new PSSM is found by optimizing the boosting loss. Note also that the new PSSM can be longer than either of the two PSSMs used in the merge, due to the procedure of merging with offsets; in this way, we can obtain candidate PSSMs longer than the maximum seed k-mer length of 7. The number of target genes, which determines the weight of the PSSM for further clustering, is calculated by counting the number of promoter sequences which score above the threshold. The new node that is then added to the alternating decision tree is the weak rule that minimizes boosting loss considering all sequence motifs and PSSMs.

3 Statistical Validation

3.1 Dataset

We use the environmental stress response (ESR) dataset of Gasch et al. [17], which consists of 173 cDNA microarray experiments measuring the expression of 6152 S. cerevisiae genes in response to diverse environmental perturbations. All measurements are given as $\log_2$ expression values (fold-change with respect to an unstimulated reference condition). Note that our analysis does not require a normalization to a zero-mean, unit-variance distribution, as is often employed; instead we wish to retain the meaning of the true zero (that is, the reference state).

3.2 Discretization

We discretize expression data by using a noise model that accounts for intensity-specific effects in the raw data from both the Cy3 (R) and Cy5 (G) channels. In order to estimate the null model, we use the three replicate unstimulated experiments published with the same dataset [17]. Plots of $M = \log_2(R/G)$ versus $A = \log_2(\sqrt{RG})$ (Fig. 3) show the intensity-specific distribution of the noise in the expression values. We compute the cumulative empirical null distribution of $M$ conditioned on $A$ by binning the $A$ variable into small bin sizes, maintaining a good resolution while having sufficient data points per bin. For any expression value $(M, A)$ of a gene in an experiment, we estimate a p-value based on the null distribution conditioned on $A$, and we use a p-value cutoff of 0.05 to discretize the expression values into +1, -1 or 0 (up-regulation, down-regulation, or baseline). The discretization allows us to formulate the prediction problem as a classification task.

Figure 3: Expression discretization. A noise distribution is empirically estimated using data from three unstimulated reference experiments. The noise model takes into account intensity-specific effects. By choosing a p-value cutoff of 0.05 we discretize differential expression into up-regulated, down-regulated, and baseline levels.

3.3 Candidate Regulators

The regulator set consists of 475 genes (transcription factors, signaling molecules, kinases and phosphatases), including 466 which are used in Segal et al. [5] and 9 generic (global) regulators obtained from Lee et al. [18].
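The discretization step of Sect. 3.2 can be sketched as follows. This is only an illustration under stated assumptions: the fixed bin width, the use of separate upper-tail and lower-tail empirical p-values, the treatment of empty bins as baseline, and all function names are choices made for the sketch, since the text does not specify these details.

```python
from collections import defaultdict

def fit_null(null_points, bin_width=1.0):
    """Group null M values by intensity bin. null_points is a list of (M, A)
    pairs from the unstimulated reference experiments. Fixed-width binning
    of A is an assumption of this sketch."""
    bins = defaultdict(list)
    for m, a in null_points:
        bins[int(a // bin_width)].append(m)
    return bins

def discretize(m, a, bins, bin_width=1.0, cutoff=0.05):
    """Return +1 (up), -1 (down) or 0 (baseline) for one (M, A) measurement,
    using empirical tail fractions of the null M values in the matching A bin."""
    null = bins.get(int(a // bin_width))
    if not null:
        return 0  # no null data at this intensity: call it baseline (assumption)
    p_up = sum(1 for x in null if x >= m) / len(null)
    p_down = sum(1 for x in null if x <= m) / len(null)
    if p_up < cutoff:
        return 1
    if p_down < cutoff:
        return -1
    return 0

# Toy null distribution (hypothetical): 101 M values spread over [-0.5, 0.5],
# all at intensity A = 10.2.
null_points = [(i / 100 - 0.5, 10.2) for i in range(101)]
bins = fit_null(null_points)

up = discretize(2.0, 10.5, bins)
down = discretize(-2.0, 10.5, bins)
baseline = discretize(0.0, 10.5, bins)
```

Conditioning on the intensity bin means a given fold change can be called significant at high intensity, where noise is small, and baseline at low intensity, where noise is larger.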
3.4 Motif Set

We scan the 500 bp 5'-UTR promoter sequences of all S. cerevisiae genes from the Saccharomyces Genome Database (SGD) for all occurring k-mer motifs ($k = 2, 3, \ldots, 7$). We also include 3-3 and 4-4 dimer motifs allowing a middle gap of up to 15 bp. We restrict the set of all dimers to those whose two components have specific relationships, consistent with most known dimer motifs: equal, reversed, complements, or reverse-complements. As described in Sect. 2.2, we use an information-theoretic, hierarchical clustering scheme to infer a set of PSSMs at each boosting iteration. The complete candidate motif set is then the union of all k-mers, dimers, and PSSMs, with a cardinality of 10962 + 1184 + 99 = 12245.

3.5 Cross-validation

We divide the 173 microarray experiments into five folds, keeping replicate experiments in the same fold. We then perform five-fold cross-validation, training the classifier on four folds and testing it on the held-out fold. The learning algorithm is run for 700 boosting iterations. The average test-loss for prediction on all genes in held-out experiments is 13.4 ± 3.9%.

For comparison, we run the same learning algorithm with experimentally-confirmed or computationally-predicted motifs in the literature. In these runs, the hierarchical motif clustering is left out, and the set of putative motifs contains only those that were proposed in the literature.

The TRANSFAC database [19] contains a library of known and putative binding sites which can be used to scan the promoter sequence of every gene. After removing redundant sites, we compile a list of 354 motifs. The boosting algorithm with the same number of iterations and the same folds for cross-validation gives a higher test-loss of 20.8 ± 2.8%. The compiled TRANSFAC motifs thus have a much weaker strength in predicting gene expression than the motifs found by MEDUSA.

The same comparison was performed with a list of 356 motifs found in [20] by using a state-of-the-art Gibbs sampling algorithm on groups of genes clustered by expression data and annotation information. These motifs also gave weaker predictive strength than those discovered by MEDUSA, with an average test-loss of 16.1 ± 3.5%.

We are thus able to identify motifs which have a significantly stronger prediction accuracy (on independent held-out experiments) than motifs previously identified in the literature.

4 Biological Validation

To confirm that MEDUSA can retrieve biologically meaningful motifs, we run additional experiments, randomly holding out 10% of the (gene, experiment) examples and training MEDUSA on the remaining examples. We learn ungapped k-mers and dimers simultaneously. After 1000 iterations, we obtain a test loss of 11% and a set of 1000 PSSMs. We then compare to several known and putative binding sites, consensus sequences and PSSMs from five databases: TRANSFAC [19], TFD, SCPD, YPD and a set of PSSMs found by AlignACE [20]. After converting the sequences and consensus patterns to PSSMs, smoothed by background probabilities, we compare all PSSMs with the ones found by MEDUSA using $d(p,q)$ (see Sect. 2.2) as a distance measure. We define the best match for each of MEDUSA's PSSMs as the PSSM that is closest to it in terms of $d(p,q)$.

Each node in the alternating decision tree defines a particular subset of genes, namely those having at least one example that passes through the particular node. In this way, we can associate motifs with Gene Ontology (GO) annotations by looking for enriched GO annotations in the gene subsets, and we can estimate the putative functions of the targets of a transcription factor that might bind to the PSSM in each node. We see matches to variants of the STRE element, the binding site for the MSN2 and MSN4 general stress response transcription factors. The genes passing through nodes containing these PSSMs are significantly enriched for the GO terms carbohydrate metabolism, response to stress and energy pathways, consistent with the known functions of MSN2/4. GCR1 and RAP1 are known to transcriptionally regulate ribosomal genes, consistent with enriched GO annotations associated with the nodes of the specific PSSMs. The heat shock factor HSF1, which binds to the heat shock element (HSE), plays a primary role in stress response to heat as well as several other stresses.
The heat shock element exists as a palindromic sequence of the form NGAANNTTCN. We find almost an exact HSE in the tree. In S. cerevisiae, several important responses to oxidative and redox stresses are regulated by Yap1p, which binds to the YRE element. We find several strongly matching variants of the YRE. It is interesting to note that comparison of PSSMs from AlignACE with our PSSMs revealed the PAC and RRPE motifs to be among the top three matches. These PSSMs also appear in the top 10 iterations in the tree, indicating they are also strongly predictive of the target gene expression. Both these putative regulatory motifs have been studied in great depth with respect to their roles in rRNA processing and transcription as well as their combinatorial interactions. The enriched GO annotations of these nodes are the same as their putative functions. The tree contains 122 dimer motifs with variable gaps. These include the HSE motif (GAANNNTTC), HAP1 motif (CCGN*CCG), GIS1 motif (AGGGGCCCCT) as well as variants of the CCG everted repeat. Several important biologically verified PSSMs learned by MEDUSA are given in Fig. 4. A complete comparison study of MEDUSA's PSSMs with each of the above mentioned databases as well as Gene Ontology analysis is available on the online supplementary website.

An added advantage of MEDUSA is that we can study the regulators whose mRNA expression is predictive of the expression of targets. These regulators are paired with the learned PSSMs. Of the 475 regulators (transcription factors, kinases, phosphatases and signaling molecules) used in the study, 234 are present in the tree. We can rank
