MANUSCRIPTCATEGORY: ORIGINALPAPER BIOINFORMATICS ORIGINAL PAPER Vol.28no.42012,pages538–545 doi:10.1093/bioinformatics/btr713 Gene expression AdvanceAccesspublicationJanuary4,2012 Quantifying the white blood cell transcriptome as an accessible window to the multiorgan transcriptome Isaac S. Kohane and Vladimir I. Valtchinov∗,† National Center for Biomedical Computing Informatics for Integrating Biology and the Bedside (i2b2), Boston, MA 02115, USA AssociateEditor:TreyIdeker ABSTRACT For those studies seeking to find the mechanisms/genes Motivation: We investigate and quantify the generalizability of the dysregulatedinthetissueofinterest,thelargemajorityhavebeen white blood cell (WBC) transcriptome to the general, multiorgan structured very similarly: a direct comparison is made between transcriptome. We use data from the NCBI’s Gene Expression expression profiling of peripheral blood and the specific tissue Omnibus (GEO) public repository to define two datasets for wherethediseaseisknowntodevelop.Analternativemethodisto comparison,WBCandOO(OtherOrgan)sets. establishthedegreeofisomorphismbetweentheperipheralblood Results: Comprehensivepair-wisecorrelationandexpressionlevel transcriptome and the overlap in expression profiles from a fixed profilesarecalculatedforbothdatasets(withsizesof81and1463, number of representative human tissues (e.g. brain, colon, heart, respectively).WehaveusedmappingandrankingacrosstheGene kidney, etc., altogether nine tissue types) (see Liew et al., 2006). Ontology(GO)categoriestoquantifysimilaritybetweenthetwosets. Tothesurpriseofmany,theseinvestigationshaveproventoshow GO mappings of the most correlated and highly expressed genes considerable shared transcriptome across these tissues. We adopt fromthetwodatasetstightlymatch,withthenotableexceptionsof hereasystematicsurveyapproachusingtheever-growingmountain componentsoftheribosome,celladhesionandimmuneresponse. of public microarray expression data of bothWBC and dozens of Thatis,10877or48.8%ofallmeasuredgenesdonotchange>10% othertissues,underavarietyofconditionsandpathophysiological ofrankrangebetweenWBCandOO;only878(3.9%)changerank states. Our goal is to provide a quantitative estimate of the >50%. Two trans-tissue gene lists are defined, the most changing generalizability of gene expression findings in WBC to those in andtheleastchanginggenesinexpressionrank.Wealsoprovidea other tissues.We seek to identify the most robust similarities and general, quantitative measure of the probability of expression rank correspondingdifferencesinagenome-wideexpressionprofiles,and andcorrelationprofileintheOOsystemgiventheexpressionrank toquantifythemappropriately.Wehypothesizetherewillberobust andcorrelationprofileintheWBCdataset. correlationsthatsurvivethewell-knownvariabilityofthemultiply Contact: [email protected] sourced gene expression database such as the Gene Expression Supplementary information: Supplementary data are available at Omnibus (GEO) (Barrett et al., 2005), that allow large fractions Bioinformaticsonline. oftheperipheralbloodtranscriptometobereflectedinotherorgans andtissues. ReceivedonMarch8,2011;revisedonDecember6,2011;accepted This hypothesis (heretofore referred to as the WBC relevance onDecember25,2011 hypothesis)entailsthefollowingthreequestions:(i)towhatextent do those genes with the highest levels of expression in the WBC transcriptome also have high levels of expression in other organ 1 INTRODUCTION systems? Specifically, can we quantify how the rank of the gene One of the reasons that the classification of malignancy was one in WBC informs us of the rank in non-WBC tissues? (ii) How of the earliest human applications of microarray transcriptional do correlations between pairs of genes in WBC inform us of profiling(Golubetal.,1999)wasthatthetissuetobecharacterized, their correlation in non-WBC tissues? (iii) How does the overall thetumor,wasextractedasamatterofaroutinesurgicaloncological correlationstructure[e.g.RelevanceNetwork(Butteetal.,1999)] care.Otherclinicaldomainshavelaggedbecausethetissueinvolved ofWBC compare to those of other tissues?To the extent that the in a pathophysiology may not be reasonably obtained from a WBC relevance hypothesis is supported we explore whether the living individual (e.g. in behavioral or psychiatric disorders) or informativeness of WBC gene expression is broad or varies by the organ specificity of a disease is unclear (e.g. type II diabetes specificfunctionalcategories.Welookatboththebroadcategories mellitusorhypertension).Increasingly,investigatorshaveexplored of gene annotation (Ashburner et al., 2000; Dennis et al., 2003), thepossibilityofclassifying,prognosticatingandcharacterizingthe suchasapoptosis,aswellasexaminingindividualgenes.Weused mechanismofdiseasesusingthegeneexpressionpatternsmeasured the NCBI’s GEO to find experiments measuringWBC expression inwhitebloodcells(WBCs)(Coppolaetal.,2008;Padmosetal., and a large number of experiments on non-WBC tissues that we 2008;Scherzeretal.,2007;Washizukaetal.,2009). lump into the Other Organ (OO) category. To minimize the noise frominter-platformcomparisons(Nimgaonkaretal.,2003),weused onlythosedataobtainedontheGEO’sGPL96platformforboththe ∗Towhomcorrespondenceshouldbeaddressed. WBCandOOcategories.Thisresultedinatotalof1463samples †Alsoat:NewAtlanticTechnologyGroup,LLC,Newton,MA02468,USA. ©TheAuthor(s)2012.PublishedbyOxfordUniversityPress. ThisisanOpenAccessarticledistributedunderthetermsoftheCreativeCommonsAttributionNon-CommercialLicense(http://creativecommons.org/licenses/ by-nc/3.0),whichpermitsunrestrictednon-commercialuse,distribution,andreproductioninanymedium,providedtheoriginalworkisproperlycited. [10:3714/3/2012Bioinformatics-btr713.tex] Page:538 538–545 MANUSCRIPTCATEGORY: ORIGINALPAPER WBCtranscriptome Expression matrix Outlier Removal: re- n Bulk Download format: move lowest 5 % of nitio GEO Platform: GPL96 probeset-level genes by entropy Defi s et ata S WBC Set: 81 samples OO Set: 1,463 samples D Compute pair-wise correlations and gene expression ranks in WBC and OO sets Sets n GO mapping of top 1000 most correlated and expressed genes: global FDR thresholds wee Null hypothesis (H0) is similarity by chance bet y arit Predictive Models: across correlation and expression levels spaces mil Formula: OO ~ WBC y si ntif ua “Most Changing” genes “Least Changing” genes Q Analysis of most correlated genes: Human Housekeeping Genes es RelNet and “most connected” genes vs. ys Most/Least-changing Genes al An hoc Expression of human homolog TFs in DAVID GO enrichment analysis: d- WBC/OO: Mahoney Atlas tissue of expression A Fig.1. Adiagramrepresentingtheworkflowoftheanalysis.(i)‘DataSetsDefinition’includesbulkdownloadoftheGEOarchivefilesfortheGPL96 platform,parsingoutallGSMsamplefileswithsimilarprobeset-averagedexpressionmatrix,detectingandremovingoutliergenesandconstructingtwosets forcomparison,WBCandOOsets.(ii)‘QuantifySimilaritybetweensets’include:useGOmappingofthemostcorrelated(highlyexpressed)genes(as definedbytheirFDRq-valuethresholds)toquantifychangeacrosssets;thenullhypothesisH0isoverlapbetweencorrespondingsetsfromWBCandOO bychance);usealinearmodelandgeneralleastsquaresandPCAtoquantifyrelationshipsbetweenOOandWBCacrossexpressionandcorrelationprofiles, respectively;definetwolistof‘most-changing’and‘least-changing’fromWBCtoOOgenes,acrossexpressionprofiles.(iii)In‘Ad-hocAnalyses’,wefirst constructtheRelNetsofthe200mostcorrelatedWBCgenesandtheircorrespondingOOpairsrankschanges.WenextlookedattheTFshumanhomologs fromtheMahoneyAtlasasexpressedinWBC.Anotherstepislookingathowthemost-changingandleast-changinggenesfromstep(ii)arerepresentedin thelistofhumanhousekeepinggenes.Finally,werunGOenrichmentanalysisofthe‘least-changing’geneswithrespectto‘tissue-of-expression’,inDAVID. fortheOOand81samplesfortheWBCset.TheOOsetconsists datasetsthatweretakenbytheGPL96platform,correspondingtotheHG- ofabout95%solidtissuesamplesandabout8%arecancer-related U133AchipbyAffymetrix,andonlyincludedprobeset-averagedexpression tissues(seeSection2.2). levels.AcopyoftheGEOdatabasewasdownloadedonApril5,2006.An outlierdetectionandremovalanalysiswasperformed(Butte,1999).There are1463and81samplesintheOOandWBCdataset,respectively. 2 MATERIALS AND METHODS TheoverallflowoftheanalysisissummarizedinFigure1.Themajorsteps 2.2 QuantifytissuetypesintheOOset arelistedinthefigurecaptionandarealsoexplainedindetail. ToquantifythetypesoftissuesacrosstheOOset,weutilizetheRpackage GEOquerytodownloadandparseouttheannotationforeachofthe1463 samplesassignedtotheset.FourofthefieldsintheGSMannotationwere 2.1 GEOdatasets used:‘title’,‘organism’,‘source’and‘description’;thelistispresentedin We have selected the NCBI’s public archive GEO as a source of the theSupplementaryTableS1. microarray expression data (Barrett et al., 2005). To minimize the noise Overall,27GSMsamplesarenolongeravailablefromGEOandthus across platforms (Nimgaonkar et al., 2003), we have elected to use only cannotbeincludedinthestatistics.Thereare1387samplesthatarefrom 539 [10:3714/3/2012Bioinformatics-btr713.tex] Page:539 538–545 MANUSCRIPTCATEGORY: ORIGINALPAPER I.S.KohaneandV.I.Valtchinov 2.5 Definitionofchangescores Giventworankedlistsofgenes,wehavedefinedachangescoreofeach geneasthesignedintegerequaltothenumberofpositionsitsrankchanges betweentheWBCandOOdatasets. 2.6 TestforindependenceofintersectionofGO categories Wehaveutilizedtheχ2 testofindependencetoanalyzetheresultslisted inTable2fortheWBCandOOoverlap.EachofthelinesinTable2was fedintochisq.test()inR(ver.2.10.1)withtheparameter‘simulate.p.value =TRUE’setsothataMonteCarlosamplingsimulationoftheP-valueis performed.TheaverageP=0.0004998isbasedon2000MCsimulations. 2.7 Generalleastsquaresandprincipalcomponent analysisfittingprocedures For the expression profile modeling procedure, we first applied the low entropyfilerandthenhaveremovedtheribosome-relatedgenesbyremoving all genes that are annotated to belong to these two GO categories: (i)5840,annotation=“ribosome”and(ii)5842,annotation=“cytosoliclarge ribosomalsubunit(sensuEukaryota)”.Wehaveretainedthemorerestrictive low-entropycondition(fortheWBCcase)tokeepthelistofgenesequal Fig.2. DistributionofsolidtissuesinthesamplesoftheOOset.Atotal intheOOandWBCexpressionsets.Bothlow-entropyandtheremovalof of 1436 GSM samples annotations were parsed out and reviewed with ribosomegenesproceduresamountedtotake300probesoffthegeneslist. 27samplesfromtheoriginal2006downloadthatwasnolongeravailable Forthelistofremaininggenes,wehaveaveragedtheexpressionlevelsfor inGEO. eachofthemoverthenumberofthesamplesintheircorrespondingdataset. TheresultingarrayswerefedintotheRlm()functionandalinearmodel wasusedforthefunctionalrelationshipbetweenOOandWBCsets. For the correlation profiles case, a similar procedure was used for solid-tissue organs, or ∼95% of all samples in the OO set. One hundred removingoutliers(low-entropyfilter)andtheribosomal-relatedsetofgenes. andsixteen(or∼8%)aresamplesthatcomefrommalignanttissues.The Foreachcorrelationmetric,wehavetransformedtostandardz-scoresand computedtheFDRq-valuesforthosepairs,andonlyretainedq-valuebelowa resultingtissuecategoriesdistributionisshowninFigure2. thresholdofα<0.05.TheunificationoftheOOandWBCsetswasperformed andthisenlargedsetofcorrelationcoefficientsbetweenthegenepairswas used as the basis for performing both general least squares and principal 2.3 Pair-wisecorrelationcomputation componentanalysis(PCA)inthesamplesversusOOandWBCgene-pair Foreachofthedatasetswehaveabout∼250millionpair-wisecorrelations correlation space. The corresponding sizes of the data matrices analyzed were 30619 × 2 and 61190 × 2 for the MIC and Pearson’s correlation betweenthe22283REF_IDsontheGPL96.Computationswerecarriedout coefficients,respectively.Weusedastandardsingularvaluedecomposition onanHPCClusteratPartnersHealthCareSystem,Inc. methodfromthe‘pcaMethods’Rpackageandlm()functionforthelinear leastsquares.ResultsaresummarizedinTable3. 2.4 ‘TopN’mosthighlyexpressedandcorrelatedgene 2.8 TissueofexpressionenrichmentanalysisinDAVID listsasfalsediscoveryratecutoffparameters Asetof48REF_IDsfortheGOcategoryofsugar-bindinggenes(GO:5529) TheselectionofthetopNusedfortheGOcategoriesmappingstatisticswas was submitted to DAVID. These probesets were mapped to 36 internal basedonafalsediscoveryrate(FDR)(Benjamini,1995;Schweder,1982; DAVIDIDs.Onlythe‘TissueofExpression’submenuoptionswereselected. withthespecificq-valuethresholdsdescribedbelow. Afunctionalannotationclusteringwasperformedwiththedefaultoptions. For the most highly expressed gene list, we have transformed the ResultsarelistedinSupplementaryTableS6. expressionlevelsaveragedoverallsamplestoa‘standardscore’(z-score), followedbyacalltothe‘fdrtool’Rpackagetodeterminethecorresponding q-values(Strimmer,2008).BothOOandWBCcaseswerecomputedand 3 RESULTS theminimumofthetwocutoffparameterswastakentobetheglobalFDR cutoffparameter.ThespecificchoiceofN=1000usedintheanalysisbelow 3.1 Foldandrankanalysis correspondstoanFDRcutoffq-value=0.02. To first determine whether there was sufficient similarity between Similarly for the case of top N most correlated gene lists, we WBCandothertissues,wefocusedouranalysesonthosegenesthat have transformed to z-scores for the Mutual Information Content (MIC) hadbehaviorsthatweremostlylikelytobeconsistent:geneswith (Strimmer,2008)andalsoused|r|forthecaseofthePearson’scorrelation high levels of expression, with low entropies (i.e. low variances), coefficient.Q-valueswerecomputedforbothsimilaritymeasurecasesand andwithhighcorrelationswithothergenesacrosstheentireGEO the minimum of the two was taken. The most correlated pairs list was datasetsontheGPL96platform(seeSection2). transformedtogenelistsbyaddingthetwomembersofallpairslessthe overlapsetofgenesbetweenthemembers.Here,anFDRq-valueof0.005 Solely reviewing the top N = 1000 most correlated genes correspondsapproximatelytoN=1000(mostcorrelatedgenes)parameter, (i.e.thosewithhighcorrelationswithothergenesintheirrespective forwhichcomparisonresultsarelistedbelow. GEOdatasets)andthe1000mostexpressedgenesintheWBCand 540 [10:3714/3/2012Bioinformatics-btr713.tex] Page:540 538–545 MANUSCRIPTCATEGORY: ORIGINALPAPER WBCtranscriptome OOcategories(seeSection2.4foradiscussiononthechoiceofN), Table2. RepresentationofGOcategoriesin1000mostexpressed(‘E’,in wenotedthattheGOcategoriestowhichthese1000genesbelong lastcolumn)orcorrelated(‘C’)genes (amere4.5%ofthetotalnumberofgenesmeasuredontheGPL96 platform)hadaveryhighoverlap.ThedistributionofdistinctGO Intersection WBC OO E/C categorieswithineachofthethreemainGOhierarchiesforOOand WBC datasets is shown in Figure 3. The hierarchies are ordered Cell 116 53 65 C MF 258 179 238 C by the number of genes annotated for OO and the corresponding BP 280 220 233 C numberofannotationsinWBCshowninthesecondcolumnofthe Cell 144 18 87 E figure.Withthenotableexceptionsofcomponentsoftheribosome, MF 303 246 233 E cell adhesion and the immune response, the annotations of these BP 322 267 257 E top correlated genes in OO samples are well represented by the mostcorrelatedgenesinWBCsamples.Forgraphicalcompactness, Figure 3 only shows the 30 most annotated GO categories in OO distribution of change score is centered on zero for all three GO samples. If we extend the analysis to all categories, then the GO hierarchies.Thetop1000mostexpressedgeneshaveamuchtighter categoriesthatchangethemost(asdefinedbytheirChangeScore distribution(morelowchangescores)thanthe1000mostcorrelated describedinSection2)includetheribosomallocationandrelated genes. processes as before, but also includes nucleic acid binding, and Abroaderexaminationofwhichgeneschangethemostintheir spliceosome-related processes as shown in the second column of rank(byentropyorbyexpression)withinthe22283genesmeasured Table1. ontheGPL96platformrevealsasteepdeclineinthechangeinrank But this focuses on the differences rather than the larger of expression as illustrated in Supplementary Figure S2. That is, similaritiesbetweentheWBCandOOtranscriptomessuggestedby 10877 or 48.8% of all measured genes do not change >10% of thebroadsilhouettesinFigure3.Indeed,90%oftheGOcategories rank range betweenWBC and OO. Only 878 (3.9%) change rank change <1.37% for cellular components, 1.83% for molecular >50%.Thereissomewhatmorechangeinrankofgenesbyentropy function and 2.89% for the biological processes hierarchies, or in asrevealedinSupplementaryFigureS3.Thatis,5084(22.8%)of averageofslightlyover2%.ThesimilaritiesbetweenWBCandOO allgenesdonotchange>10%ofrankrangeinentropy,and4356 arefurtherquantifiedbyexaminingwhichGOcategoriesareshared (19.5%)by>50%. by the 1000 most highly expressed or the 1000 most correlated TheindividualgenesthatchangetheleastinrankacrossWBCand genes in the two transcriptomes, respectively, as summarized in OO,areitemizedinSupplementaryTableS3asarethosegenesthat Table 2. In all instances, the intersection is larger than any of changethemost.Intheformerlist,areincludedHBB(hemoglobin the distinct sets (ranging from 27% larger to 700% larger). The βchain),H3F3AH3histone,family3A,MED6(mediatorofRNA details of the particular GO categories summarized in this table polymeraseIItranscription,subunit6homolog),FTL(ferritin,light are given in Supplementary Table S2. Of course, if all the genes polypeptide),ACTG1(actin,gamma1),B2M(β-2-microglobulin) on the chip were used in this analysis, rather than the top 1000, several HLA-related genes, several hemoglobin-related genes, thenallGOannotationswouldintersectcompletelyacrossthetwo S100A9 (S100 calcium binding protein A9—calgranulin B). In transcriptomes,bydefinition.However,withonly4.5%ofthegenes the latter list, are included the FCGR3B (Fc fragment of IgG, measured, the measured degree of overlap is highly improbable low affinity IIIb, receptor—CD16b), PSAP (prosaposin—variant (P=0.000499 by using Monte Carlo sampling simulation in χ2 Gaucherdiseaseandvariantmetachromaticleukodystrophy),CA1 test,seeSection2).SupplementaryFigureS1illustratestheoverall (carbonicanhydraseI—mosthighlyexpressedinerythrocytes)and stabilityofnumbersofannotationbyGOcategorybyplottingthe UBB(ubiquitinB).Incontrasttoexpressionrank,UBBchangesthe frequency of change scores for each GO category.As shown, the leastinentropyrankbetweenWBCandOO.Conversely,GAPDH ishighlyinvariantasviewedbychangeinrankinexpression,but is among the most changed in rank in entropy across WBC and Table1. ThemostchangingGOcategoriesfromWBCtoOO OO. Many more genes have similar rank profiles in expression andentropy:Forexample,H3F3A(H3histone)isjustasinvariant GOcategoriesof1000most GOcategoriesof1000most in rank as in expression of HLA genes, ribosomal genes and correlatedgenes expressedgenes several of the hemoglobins. The expression of HBB in avascular and non-hematopoietic tissues has been previously documented Cell Ribosome(21→121) Mitochondrion(72→129) (e.g.Manserghetal.,2008). Cytosoliclargeribosomal Extracellularmatrix(sensu subunit(sensuEukaryota) Metazoa)(7→37) (1→34) 3.2 QuantitativemapbetweenWBCandOO MF Nucleicacidbinding Transmembranereceptor transcriptomes (130→56) activity(9→1) Structuralconstituentof Sugarbinding(16→3) Having established the broad correlation between OO and WBC ribosome(34→178) expression profiles, we next take the logical step of trying to find BP NuclearmRNAsplicing,via Nucleosomeassembly a quantitative relationship between the two transcriptomes. The spliceosome(47→9) (3→14) exact question we ask is: can one come up with the expression Proteinbiosynthesis sensoryperceptionofsound rank in the OO system given the expression rank in WBC? Or, (62→193) (12→2) evenmorebroadly,giventhebroaderexpressionprofile(valuesand cross-correlations)ofagroupofWBCgeneshowcanonepredict 541 [10:3714/3/2012Bioinformatics-btr713.tex] Page:541 538–545 MANUSCRIPTCATEGORY: ORIGINALPAPER I.S.KohaneandV.I.Valtchinov Fig.3. DistributionofGOcategoriesin1000mostcorrelatedgenesinWBCandinOO. theircorrespondingrankvaluesandcorrelationsintheOOset,and showsimilarstructurewherealargepartofthevarianceiscarried withwhatprobability?Indoingthis,wetreatWBCexpression(or bythefirstPC. correlation)profileasanindependentrandomvariableanddetermine the conditional probability distribution of the OO expression or 3.3 Relevancenetworksand‘most-connected’genes correlationcoefficientsgiventheWBCones. WemodeledtheOOexpressionlevelsandcorrelationcoefficients Relevance networks generate graph of connected nodes in which as a function of their WBC counterparts using linear predictive eachnoderepresentsageneandeachedgerepresentacorrelation models. The ribosome-related genes were removed as previously metric (e.g. Pearson’s) that exceeds a statistically significant described. The parameters for the resulting linear least squares threshold(Butteetal.,1999,2000). models as well as the P-values measuring the goodness-of-fit for Shown in Supplementary Figure S4 is the relevance network the expression and for the correlation profiles are summarized in for the WBC transcriptome depleted for ribosomal genes. The Table 3 together with the PCA model results for the correlation most highly connected genes include RRAGB (Ras-related GTP profiles space.The two PCs for the MIC and r correlation spaces bindingB),MAG(myelin-associatedglycoprotein,memberofthe 542 [10:3714/3/2012Bioinformatics-btr713.tex] Page:542 538–545 MANUSCRIPTCATEGORY: ORIGINALPAPER WBCtranscriptome Table3. Linearleastsquaresfitcoefficientsascomputedfor(i)expression It is instructive to note in passing that the results reported here profile,andlinearsquaresandPCAwithtwoPCsfor(ii)cross-correlation do not depend on the actual permutation thresholds in the two usingMIC,overtheFDRq<0.05truncatedcompletephasespaceof∼250 comparison sets. Since the two sets are very different in size (81 million ‘points’ in the phase space of the pairs (WBC, OO) correlation for WBC versus 1463 for OO), we did not use a permutation numbers,and(iii)sameasin(ii)forPearson’scorrelationcoefficient schemetoselectsubsetsforcomparison(Kerr,2009).Instead,we definethenumberofgenesusedforthecorrelationanalysesbased Linearleastsquaresfitting:OO∼A+B×WBC on global FDR thresholds. We built the relevance network of the EstimateSE;t-value;Pr(>|t|) WBCcorrelationprofilebasedonafixednumber(thetop200,as Expressionprofiles (A)8.846590;0.297784;29.71<2e-16∗∗∗ reportedinSupplementaryFigureS1,mostcorrelatedpairs)asitis (B)0.2064040.001263;163.48<2e-16∗∗∗ ofillustrativepurposesonly. ResidualSE 40.47on21981DF MultipleR2 0.5487 3.4 Transcriptionfactorsexpressionpatterns: F-statistics 2.672e+04on1and21981DF, theMahoneyatlas P <2.2e-16 Correlation,MIC (A)5.73524;0.01937;296.2<2e-16∗∗∗ The Mahoney is an atlas of transcription factors expressed in (B)−1.42333;0.00641;−222.1<2e-16∗∗∗ the murine brain (Gray et al., 2004). We determined how highly ResidualSE 0.3749on30617DF expressed these TF’s were in the WBC. First, the homologues MultipleR2 0.6169 of these 2342 human TF’s were mapped to their 2168 human AdjustedR2 0.6169 homologuesbyreciprocalbestfirstmatch(Moreno-Hagelsiebetal., F-statistics 4.931e+04on1and30617DF 2008).Ofthose,1212aremeasuredontheHG-U133Amicroarray. P <2.2e-16 PCA:R2 PC1:0.926;PC2:0.07403 Theexpressionranksofthese1212wereobtainedfromtheWBC Correlation,Pearson (A)4.116740.01349248.5<2e-16∗∗∗ transcriptome and are illustrated in Supplementary Figure S6.As (B)−0.42123;0.00984−309.3<2e-16∗∗∗ shown,thedistributionofranksisuniformovertherangeofranks ResidualSE 5.2748on60188DF measuredbyGPL96. MultipleR2 0.4751 AdjustedR2 0.4751 F-statistics 1.721e+05on1and60188DF 3.5 Geneontologyenrichmentanalysis P <2.2e-16 We used DAVID online tools (Dennis et al., 2003) to perform PCA:R2 PC1:0.8314;PC2:0.1686 enrichment analysis of the set of 500 least-changing genes with respecttoGeneOntology(GO)categories.TheDAVIDfunctional See Section 2 for exact algorithmic steps. Significance level is encoded as annotation resulted in the 110 clusters listed in Supplementary ‘***’<0.0001. Table S4. The top cluster (enrichment score of 20.35) is clearly reflecting the existence of a robust and well-correlated cluster of ribosome-related genes (see discussion about Relevance graphs immunoglobulinsuper-familythoughttobeinvolvedintheprocess above). ofmyelinationandknowntomediatecertainmyelin–neuroncell– cellinteractions),BMP7(bonemorphogeneticprotein7),DOPEY1 (dopey family member 1), GTF2H5 (general transcription factor 4 DISCUSSION IIH, polypeptide 5, involved with DNA repair mechanisms, and nucleotideexcisionrepairinparticular),PURA(purine-richelement If the tools of functional genomics can be applied to peripheral binding protein A, deletion of this gene has been associated blood cell samples to develop biomarkers for other organs, two withmyelodysplasticsyndromeandacutemyelogenousleukemia), questions are apparent: To what extent does expression in WBCs EZH1 (enhancer of zeste homolog, a transcriptional regulator reflectexpressioninotherorgansystems?WhyshouldWBCsreflect and as a component of protein complexes that stably maintain expressioninotherorgansystems?Thisinvestigationfocusesonthe heterochromatin.) and PGDS (prostaglandin D2 synthase, plays a first question and we review some possibilities with regard to the role in the production of prostanoids in the immune system and secondattheendofthisdiscussion. mast cells). The rank of these 200 top correlated gene pairs in Before a further discussion of our findings, here we comment thisWBCrelevancenetworkarecomparedwiththeranksofthose onthegeneralvalidityofourapproach.Twosetsweredefinedand samegenepairsintheOOset,inSupplementaryFigureS5.Itdoes compared, theWBC and OO.The OO set is composed of a large revealasignificantconcordanceintheranksofcorrelationingene numberofdifferentsolidtissues(∼95%ofallsamplescomefrom pairs from WBC and OO, even though the overall shape of the suchtissue,seeFig.2).Also,therewasnodiscriminationbetween presenteddistributionissomewhatpeaked—thelargestbin(N=50) diseasedornormalsamplesinOOortheWBCsets(i.e.∼8%ofthe correspondstoarankchangeacrossWBCtoOOofmaximumof samplesinOOarefrommalignanttissues).Inasense,wecompare 11.4%. Ninety percent of all WBC pairs do not change rank by WBC expression to expression of an ‘averaged or generalized’ >31positionsor15.5%.ThetopsharedgenepairsareSF3B1and humantissue.Aquestionmightariseaboutthegeneralvalidityof DDX5(thefirstisinthespliceosomeandtheotherisDEADdomain such a procedure. To address it, we have redone the bulk of our involved in spliceosome), YME1L1 and FNTA (the first is ATP- analyses for the case of an OO set defined as a superposition of dependentmetalloproteasespecifictomitochondriaandthesecond five samples taken at random from the solid tissue distribution in isafarnesyltransferase,alsoinvolvedinproteinmetabolismbutin Figure2.Thespecificdifferencesseendonotaffectthequantitative thecytoplasm). conclusionsreportedbelow(resultsnotshown). 543 [10:3714/3/2012Bioinformatics-btr713.tex] Page:543 538–545 MANUSCRIPTCATEGORY: ORIGINALPAPER I.S.KohaneandV.I.Valtchinov Abird’seyeviewofthecorrespondencebetweentheWBCand system,thenameofthemajorhistocompatibilitycomplex(MHC) OO transcriptome is provided by the GO (Ashburner, 2000). We in humans and are expected to be widely expressed across most have performed a systematic mapping of the three main branches tissues (Shiina, 2009). Another story is the ‘oxygen transport’ ofGOcategoriesinthecasesofthemostcorrelatedandexpressed category as exemplified by a very stable levels of expression of genes, across the WBC and OO datasets. From the results listed HBB (hemoglobin β chain) and several other hemoglobin-related inFigure1andTable1,oneisabletocharacterizethesimilarities genes found in the list between WBC and OO sets. One remote andthedifferencebetweentheWBCandOOtranscriptomes.First, but distinct possibility is the presence of contaminating red blood thetopmostcommonannotationsacrossthedatasetsshowatight cells(RBCs)inthesamplesofbothgroups.Forexample,albumin, match, with the notable exception of the ribosome, cell adhesion transferrinandsomeothermajorplasmaproteinsarequiteabundant and the immune response GO codes; the nucleic acid binding in muscle tissues, but the presence of hemoglobin-related genes and spliceosome-related processes are added to this when one in the WBC data is much more counterintuitive. To rule out the considers all GO categories that change the most between WBC possibilityofRBCcontamination,wehavereviewedtheannotation and OO. Overall, 90% of the GO categories change only ∼2% in fortheGEOsamplesassignedtotheWBCdataset(Supplementary averageacrossdatasets.Next,focusingontheGOcategoriesshared TableS6).Althoughitisdifficulttojudgetheexactsamplecollection betweenWBCandOOintheexpressionandcorrelationspaces,the protocol from just the description field in the GEO GSM format, intersectionsbetweenthedistinctgenesetsacrossGObranchesare it appears that most studies had followed the proper protocols. consistentlylargerthanthedistinctsetsthemselves,whichisahighly Therefore, the finding that HBB (hemoglogin β chain) and other improbable for only 4.5% of measured genes (P=0.0004998 by hemoglobin-relatedgenesarewidelyexpressedinWBCisatrue,but MonteCarlosamplingsimulationintheχ2test;Table2).Thisresult neverthelessasurprisingone.TheexpressionofHBBinavascular rulesouttheenrichment-by-chancehypothesisandsuggeststhehigh andnon-hematopoietictissueshasbeenpreviouslydocumentedthus levelofoverlapinGOannotationcategoriesacrosstheWBCand providing an example of another human tissue (and embryonic OO.SupplementaryFigureS2alsodemonstratestheoverallstability development stage) where hemoglobin units might possibly play ofGOcategorieschangeacrossWBCandOO,inbothexpression noveldevelopmentroles(Manserghetal.,2008). andcorrelationspaces,asexemplifiedbytheirchangescores. Next, we have reviewed the list of 567 human housekeeping Two of the most changing (from WBC to OO) GO categories genes (Eisenberg et al., 2003) against the two lists presented in and their corresponding genes are the ‘sensory perception of Supplementary Table S3. It appears that the housekeeping genes sound’(biologicalprocess,GOcodeBP:7605)and‘sugarbinding’ are well represented on both least changing and most changing (molecular function, MF:5529) listed in the second column from WBC to OO lists. Some of the most stable are e.g. HLA- in Table 1. Some of the individual genes in the BP:7605 A/B/C (major histocompatibility complex, class I, A/B/C), TPT1 category are WRD1 (WD repeat domain 1), MYH14 (myosin, (Tumorprotein,translationallycontrolled1),VIM(Vimentin);some heavy polypeptide 14) and DIAPH1 [diaphanous homolog 1 ofthemostchangingaree.g.TUBB(tubulin,β),PSAP(prosaposin), (Drosophila)]. Sugar-binding most frequent genes are LGALS8 JUNB(Jun-Boncogene),CREBBP(CREBbindingprotein). [lectin, galactoside-binding, soluble, 8 (galectin 8)], PKD1 Finally,wewouldliketobrieflytouchuponsomeofthebiological [polycystic kidney disease 1 (autosomal dominant)] and BCAN and physiological underpinnings of why the WBC transcriptome (brevican).Additionally, the ‘nucleosome assembly’BP:6334 GO might reflect the state of (including disease) another tissue or set category is quite overrepresented in going from OO to the WBC oftissues.First,asbloodcellscontactandinteractwithallhuman datasets (3 → 14). Some of the genes included in this GO tissuesandtransportandconveybioactivemolecules(i.e.oxygen, categoryareH3F3A(H3histone,family3A),NAP1L4(nucleosome metabolites, nutrient, cytokines and hormones, etc.), there is a assembly protein 1-like 4) and HIST1H1T (histone 1, H1t). The distinctpossibilitythattheWBCwillthemselvesreflectthestate(s) completelistsofthegenesincludedinthesetwoGOcategoriesare ofthosetissues.Conceivably,thesesubtlechangesduetointeraction listedinSupplementaryTableS5. of WBCs with diseased tissues might trigger specific changes in To cross-validate the enrichment of sugar-binding and sensory thegeneexpressionoftheWBCsparallelingtheinitialstimulus.A perception of sound categories in WBC, we have performed recentstudycomparingexpressioninninehumantissueswiththe enrichment analysis in DAVID using the ‘Tissue of Expression’ WBCtranscriptomehasfoundthat∼80%oftheWBCexpression annotationoption(seeSection2).Forthesugar-bindingindividual profile as being shared with any given tissue (Liew, 2006).After genes, Clusters #3 (enrichment score of 0.59) and Cluster #9 all,a‘cellisacell’andthemostfundamentalcharacteristicsofany (enrichmentscoreof0.38)clearlyindicatetherelativeenrichmentof cellsareexpectedtobesharedacrossallhumantissues.Secondly, thisgroupofgeneswithrespecttoWBCsasatissueofexpression. the WBC are the ‘etiological organ or tissue’ for some diseases, Asimilaranalysisofthehearinggeneshasyieldedamuchweaker a good example being asthma which is not a disease of the lungs WBC tissue association results bordering the method’s sensitivity (Weiss et al., 2009). Thirdly, some diseases affect all tissues, e.g. (resultsnotshown). Pompe’s disease affects the function of the lysosomes, which are We next took a more detailed look at some of the individual foundinalltypesoftissue(Vellodi,2005). genesintheleast-changinggroupaspresentedintheSupplementary Table S2.AGO terms enrichment analysis of those genes reveals 5 CONCLUSION that two of the most overrepresented types of categories are ‘oxygen transport’ (GO: 15671, BP), and ‘antigen presentation’- WestudythesimilaritybetweentheWBCandOOtranscriptomes related categories (GO:19882, GO:19883 and GO: 19884, BP; usingpublicrepositoryexpressiondata(GEO).TheOOsetshasa SupplementaryMaterial).Theseantigenpresentationcategoriesare wide representation of solid tissues (∼95%) and normal (∼92%) of course the genes part of the human leukocyte antigen (HLA) tissuesamples. 544 [10:3714/3/2012Bioinformatics-btr713.tex] Page:544 538–545 MANUSCRIPTCATEGORY: ORIGINALPAPER WBCtranscriptome We utilize GO mapping and ranking across correlation and Benjamini,Y.andHochberg,Y.(1995)Controllingthefalsediscoveryrate:apractical expression level profiles selected by FDR thresholds to quantify andpowerfulapproachtomultipletesting.J.R.Stat.Soc.B,57,289–300. similarity between comparison sets. We first identify that Butte,A.J.andKohane,I.S.(1999)Mutualinformationrelevancenetworks:functional genomicclusteringusingpairwiseentropymeasurements.PacificSymposiumon componentsoftheribosome,celladhesionandimmuneresponseare Biocomputing,5,415–426. amongthemostpronounceddifferencesbetweenthetranscriptomes. Butte,A.J.etal.(2000)DiscoveringfunctionalrelationshipsbetweenRNAexpression We next build predictive models of the rank in OO given the andchemotherapeuticsusceptibilityusingrelevancenetworks.Proc.NatlAcad. rank in WBC, across correlation (using linear least squares) and Sci.USA,97,121826. Coppola,G. et al. (2008) Gene expression study on peripheral blood identifies expressionlevelspaces(usinggeneralleastsquaresandPCA),after progranulinmutations.Ann.Neurol.,64,92–96. removing the ribosome-related genes. Two trans-tissue gene lists Dennis,G. Jr et al. (2003) DAVID: Database for Annotation, Visualization, and werealsodefined,themost-andleast-changinginexpressiongenes. IntegratedDiscovery.GenomeBiol.,4,P3. Wealsoconsidertheindividualgenesonbothendsofthespectrum Eisenberg,E.andLevanon,E.Y.(2003)Humanhousekeepinggenesarecompact.Trends ofWBC-OOchangeandgainfurtherinsightusingadhocanalyses, Genet.,19,362–365. Golub,T.R.etal.(1999)Molecularclassificationofcancer:classdiscoveryandclass includinglookingatTFexpressioninWBCfromMahoneyatlasas predictionbygeneexpressionmonitoring,Science,286,531–537. wellasGOenrichmentintissueofexpression. Gray,P.A.etal.(2004)Mousebrainorganizationrevealedthroughdirectgenome-scale Wereportonanoverallverytight,quantitativematchbetweenthe TFexpressionanalysis.Science,306,2255. WBChumantranscriptomeandthetranscriptomeofageneralized Kathleen,K. (2009) Comments on the analysis of unbalanced microarray data. Bioinformatics,25,2035–2041. set of solid tissues, across both correlation and expression level Kuo,W.P. et al. (2002) Analysis of matched mRNA measurements from different spaces.Thesefindingsareessentiallyindependentonexactsubset microarraytechnologies,Bioinformatics,18,405–412. oftissuesinthegeneralizedOOsetindicatingasharedfundamental Liew,C.C.etal.(2006)Theperipheralbloodtranscriptomedynamicallyreflectssystem- biologyconnectionbetweentheWBCsandOOsinthehumanbody. widebiology:apotentialdiagnostictool.J.Lab.Clin.Med.,147,126–132. Our results underscore the utility of the peripheral blood cells in Mansergh,F.C. et al. (2008) Developmentally regulated expression of hemoglobin subunitsinavasculartissues.Int.J.Dev.Biol.,52,873-886. applyingthefunctionalgenomicstoolsfordiscoveringbiomarkers Moreno-Hagelsieb,G. and Latimer,K. (2008) Choosing BLAST options for better forotherorgans. detectionoforthologsasreciprocalbesthits.Bioinformatics,24,319–324. Nimgaonkar,A.etal.(2003)Reproducibilityofgeneexpressionacrossgenerationof Affymetrixmicroarrays.BMCBioinformatics,4,26. ACKNOWLEDGEMENTS Padmos,R.C. et al. (2008)Adiscriminating messenger RNAsignature for bipolar disorderformedbyanaberrantexpressionofinflammatorygenesinmonocytes. We are thankful to the PHS HPC Cluster staff for computational Arch.Gen.Psychiatry,65,395–407. support. Scherzer,C.R.etal.(2007)MolecularmarkersofearlyParkinson’sdiseasebasedon geneexpressioninblood.Proc.NatlAcad.Sci.USA,104,955–960. Funding: Library of Medicine grant number (R01 LM 010125 to Schweder,T. and Spjøtvoll,E. (1982) Plots of p-values to evaluate many tests I.S.K.). simultaneously.Biometrika,69,493–502. Strimmer,K. (2008) A unified approach to false discovery rate estimation. BMC ConflictofInterest:nonedeclared. Bioinformatics,9,303. Shiina,T.etal.(2009)TheHLAgenomiclocimap:expression,interaction,diversity anddisease.J.Hum.Genet.,54,15–39. REFERENCES Vellodi,A.(2005)Lysosomalstoragedisorders.Br.J.Haematol.,128,413–431. Washizuka,S. et al. (2009) Expression of mitochondrial complex I subunit gene Ashburner,M.etal.(2000)Geneontology:toolfortheunificationofbiology.TheGene NDUFV2inthelymphoblastoidcellsderivedfrompatientswithbipolardisorder OntologyConsortium.Nat.Genet.,25,25–29. andschizophrenia.Neurosci.Res.,63,199–204. Barrett,T.etal.(2005)NCBIGEO:miningmillionsofexpressionprofiles–database Weiss,S.T.etal.(2009)Asthmageneticsandgenomics2009.Curr.Opin.Genet.Dev., andtools.NucleicAcidsRes.,33,D562–D566. 19,279–282. 545 [10:3714/3/2012Bioinformatics-btr713.tex] Page:545 538–545