Repsilberetal.BMCBioinformatics2010,11:27 http://www.biomedcentral.com/1471-2105/11/27 RESEARCH ARTICLE Open Access Biomarker discovery in heterogeneous tissue samples -taking the in-silico deconfounding approach Dirk Repsilber1*, Sabine Kern2, Anna Telaar1, Gerhard Walzl3, Gillian F Black3, Joachim Selbig2, Shreemanta K Parida4, Stefan HE Kaufmann4, Marc Jacobsen4,5 Abstract Background: For heterogeneous tissues, such as blood, measurements of gene expression are confounded by relative proportions of cell types involved. Conclusions have to rely on estimation of gene expression signals for homogeneous cell populations, e.g. by applying micro-dissection, fluorescence activated cell sorting, or in-silico deconfounding. We studied feasibility and validity of a non-negative matrix decomposition algorithm using experimental gene expression data for blood and sorted cells from the same donor samples. Our objective was to optimize the algorithm regarding detection of differentially expressed genes and to enable its use for classification in the difficult scenario of reversely regulated genes. This would be of importance for the identification of candidate biomarkers in heterogeneous tissues. Results: Experimental data and simulation studies involving noise parameters estimated from these data revealed that for valid detection of differential gene expression, quantile normalization and use of non-log data are optimal. We demonstrate the feasibility of predicting proportions of constituting cell types from gene expression data of single samples, as a prerequisite for a deconfounding-based classification approach. Classification cross-validation errors with and without using deconfounding results are reported as well as sample- size dependencies. Implementation of the algorithm, simulation and analysis scripts are available. Conclusions: The deconfounding algorithm without decorrelation using quantile normalization on non-log data is proposed for biomarkers that are difficult to detect, and for cases where confounding by varying proportions of cell types is the suspected reason. In this case, a deconfounding ranking approach can be used as a powerful alternative to, or complement of, other statistical learning approaches to define candidate biomarkers for molecular diagnosis and prediction in biomedicine, in realistically noisy conditions and with moderate sample sizes. Background most widely used material is blood, which is frequently For studies involving heterogeneous tissue samples, sampled for diagnostic or prognostic purposes. Blood is detection of differential gene expression from molecular frequently used as surrogate tissue in many clinical stu- profiles, as well as identification of biomarkers is a pro- dies for reasons of accessibility, ease of storage and pro- blem of validity: molecular profile variation and changes cessing. Valid biomarkers from blood are thus often in cell type proportions between tissue samples are con- targeted [2]. Regarding tissue heterogeneity, however, founded [1-4]. However, heterogeneous tissues are fre- blood is an extreme example since inter-individual dif- quently used (e.g. blood, tumor) and further confounded ferences and disease-specific changes, amongst other in pathological situations where diseased tissue is fre- reasons, lead to high variability in composition ([5,6], cf. quently infiltrated by immune cell populations. The our data, figure 1). Cell sorting of blood cells, or - in the case of solid tis- sues - micro-dissection [7], depend on sophisticated *Correspondence:[email protected] equipment. Hence, biomarker studies under field condi- 1DepartmentofGeneticsandBiometry,ResearchInstitutefortheBiologyof tions, especially in resource-poor countries, have to rely FarmAnimals,Wilhelm-StahlAllee2,D18196Dummerstorf,Germany ©2010Repsilberetal;licenseeBioMedCentralLtd.ThisisanOpenAccessarticledistributedunderthetermsoftheCreative CommonsAttributionLicense(http://creativecommons.org/licenses/by/2.0),whichpermitsunrestricteduse,distribution,and reproductioninanymedium,providedtheoriginalworkisproperlycited. Repsilberetal.BMCBioinformatics2010,11:27 Page2of15 http://www.biomedcentral.com/1471-2105/11/27 TB 0 0 1 %] 80 p.[ o 0 r 6 p e yp 40 ell t 0 c 2 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 TST+ 0 0 1 %] 80 p.[ o 0 r 6 p e yp 40 ell t 0 CD3 c 2 CD14 Others 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 samples Figure1Experimentallydefinedproportionsofdifferentbloodcelltypes.IndividualproportionsofPBMCsaredepicted(CD3+cells,CD14+ cells,andOthers)forgroupsofTBpatientsandTST+individuals.Celltypeproportionsarehighlyvariable,evenbetweenindividualswithina group. on molecular profiling from whole blood samples. Ide- proportions of this cell type in the study samples have ally, biomarkers with prominent and clear signals can be been determined. used which remain detectable in spite of varying cell Such cell type-specific gene is differentially expressed type populations. However, biomarker signals for more if the interaction term in the linear model subtle differences are most likely not detectable due to confounding tissue compositions. Figure 2 gives an over- yi 0 1pi 2gpi i (1) view over possible scenarios: is significant. Here, y depicts the log-ratio of gene Figure 2A, shows the non-problematic case for homo- i expression signals for a specific gene in a common geneous tissue (e.g. culture of homogeneous cell popula- reference design (sample i), but it could also be a vector tions under synchronizing conditions) without any of log-intensities for one-color chips after normalization. confounding or interpretation problems (left), or tissue b is the overall mean for this gene, representing the with fixed cell type proportions. For these cases, there is 0 background signal (without any cells of the cell type no confounding problem [3]. exclusively expressing this gene). The binary factor g Figure 2B, refers to two cases for which in-silico represents patient (g = 1) or control status (if g = 0) for approaches exist for deconfounding: The simple case the respective sample, and p denotes the proportion of (figure 2B, left) refers to a situation in which a gene of i the immune cell population in question as confounding interest is exclusively expressed in a certain cell type factor. The variable g × p indicates the interaction effect (one amongst others, in varying proportions), and the of study group and immune cell proportion. Finally, ε i denotes the residual for sample i. An important Repsilberetal.BMCBioinformatics2010,11:27 Page3of15 http://www.biomedcentral.com/1471-2105/11/27 Figure2Casesofgene expressionintissuecontext.(A)Non-problematiccases;(B) Confoundingofcelltypeproportionsandcell-type specificgeneexpression:simpleandproblematiccase,deconfoundingispossible;(C)worstcase:geneexpressiondependsoncelltype proportion,deconfoundingnotpossible. assumption for this modeling approach is that single-cell problematic case for which it is harder to disentangle gene expression is independent of cell type proportions. influences of single-cell gene expression and variation in For an example of this type of analysis see the contribu- cell type proportions. A few similar approaches exist tion by Jacobsen et al. (2006) [1]. Similar problems and dealing with such a case, all employing an iterative opti- their solutions were presented by Kriete and Boyce mization of the decomposition as given by equation 2: (2005) [8] combining tissue composition data and gene expression data, as well as by Gosh (2004) [9], for the X SC. (2) latter without estimates of the cell type proportions. If Here, X denotes the classical gene expression matrix gene expression is no longer restricted to a specific cell (genes by samples). S, the signature matrix, gives the type (as in figure 2B, right), we are dealing with the cell type specific gene expression profiles (genes by cell Repsilberetal.BMCBioinformatics2010,11:27 Page4of15 http://www.biomedcentral.com/1471-2105/11/27 types), and C, the concentration matrix, gives cell type gene expression, peripheral blood mononuclear cells proportions over samples (cell types by samples). (PBMC), were isolated and cell type proportions deter- An alternative formulation is given in equation 3 for mined. CD3+-cells (T-lymphocytes) were enriched in the mixture of two cell types (with cell type specific these samples and collected for RNA preparation (for expression signatures s and s for gene i): more details on the experimental dataset see Methods). 1,i 2,i Resulting data contain proportion and cell type specific xi c s (1c )s (3) gene expression profile for the most prominent RNA k k 1,i k 2,i containing cell type in blood, as well as the whole blood where xk denotes the expression value of the ith gene gene expression signal of the same samples. This design i in the kth heterogeneous sample and 0 ≤ c ≤ 1 denotes constitutes a valuable validation dataset for testing and k the fraction of the first cell type in the kth mixture; further developing an algorithm for deconfounding, as equivalent expressions are used in [4,10,11]. Venet et al. estimated cell type specific gene expression profiles can (2001) [10] were first to study this approach. In their be compared to those of FACS-sorted cells. contribution they made use of a de-correlation In our contribution, we study applicability and optimi- approach, which tends to improve the reconstruction of zation of the deconfounding approach for detection of simulated cell type specific gene expression profiles. differential regulation of features in a univariate Experimental data were also used but without the possi- approach, as well as an approach using deconfounding bility to validate their deconfounding results in a for the classification task, towards identification of bio- straight forward way. Lu et al. (2003) [11] described a marker panels in heterogeneous tissues. similar approach for analyzing yeast cell cycle expression patterns. Likewise, Stuart et al. [12] investigated prostate Methods tumor tissue. Lahdesmaki et al. (2005) [4] for the first Experimental data time introduced an approach, which also estimated the Gene expression data are part of the Grand Challenges appropriate numbers of cell types for deconfounding in Global Health Project: Grant Number 37772, “Bio- analysis. markers of protective immunity against Tuberculosis in However, none of these prior approaches systemati- the context of HIV/AIDS in Africa” (funded by the Bill cally studied: & Melinda Gates Foundation through the Grand Chal- lenges in Global Health Initiative; http://www.biomar- - reconstruction of cell type specific gene expression kers-for-tb.net/. PBMC from 40 TB cases and from 40 profiles validated with experimental data; healthy household contact controls were extracted and - sample size effects; analyzed by flow cytometry for proportions of CD3+ T- - realistic simulation parameter settings derived from lymphocytes and CD14+mononuclear phagocytes as appropriate experimental data, with noise conditions described before [1]. All donors gave informed consent. as in a typical clinical study; This study was approved by local ethics committees in - the power of detection of differential gene expres- Stellenbosch (South Africa) (N05/11/187) and Berlin sion in comparison with a classical approach; (EA 10 1/176/07, Germany). - how to use a deconfounding approach in a classifi- Signals of gene expression in whole blood as well as in cation task. CD3+ cells, for the Human Whole Genome Oligo 44K Agilent arrays (GE2_44k_1005) were measured accord- These are the core objectives which our study aims to ing to manufacturer’s protocols. The microarray design contribute. was an independent swop design as recommended by The experimental basis includes an experimental gene Landgrebe et al. [13]: 50% of each group ("TB”, “TST+”, expression data set of 40 Agilent two-color arrays for “TST-”) were labelled with Cy3, the other half using two groups of a field study: tuberculosis patients Cy5. Pairs for hybridization on an array were chosen to (denoted TB cases) and healthy household contacts with match regarding age and gender. For validation of the a positive tuberculin skin test (denoted TST+, healthy deconfounding algorithm we used CD3+ proportions. controls).This dataset is part of the Grand Challenges in CD3+ cells sorted by fluorescence-activated cell sorting Global Health Project: Grant Number 37772, “Biomar- (FACS) were subjected to RNA extraction and microar- kers of protective immunity against Tuberculosis in the ray measurements of gene expression following the context of HIV/AIDS in Africa” (funded by the Bill & same procedure as for the whole blood samples. More Melinda Gates Foundation through the Grand Chal- details about the observational field study as well as the lenges in Global Health Initiative). From each of the gene expression dataset will be published separately (see enrolled individuals, RNA was prepared from a whole- http://www.biomarkers-for-tb.net/publications). blood sample. From the same samples, cells with active Repsilberetal.BMCBioinformatics2010,11:27 Page5of15 http://www.biomedcentral.com/1471-2105/11/27 Gene expression data were normalized using R-pack- 3. ∑ c = 1 for all samples j (i.e. cell type propor- i ij age limma [14]: background correction using the tions sum to 100%) method normexp [15], lowess normalization was applied for each array (within array normalisation), quantile nor- Our implementation is available as an R-package and malisation on the set of all arrays (between array nor- has additional options for using quantile normalization malisation) as recommended [16]. As proposed by [17], instead of global normalization proposed previously gene expression intensities for both groups were [10]. Moreover, it is possible to run the deconfounding obtained as in Equations 4 and 5 from re-parameterizing on log-values of the normalized intensities or on non- the normalized log-ratios (M) and mean log-intensities log data. Finally, our implementation does not apply the (A) resulting from the limma analysis. de-correlation proposed by Venet et al. [10]. To assign the right cell type for each of the estimated MlogI logI A 1(logI logI ) (4) profiles, our implementation relies on a majority count TB control 2 TB control decision involving the estimated gene expression profiles from n = 9 markers. Five of these markers are con- marker 1 1 sidered to be expressed exclusively for a specific cell logITB A 2M logIcontrol A 2M (5) type (positive marker genes) and the remaining four exclusively not in this cell type (negative marker genes). Summarizing, for each of the 40 TB cases and the 40 Marker genes were chosen according to a priori molecu- healthy household contact controls we were able to ana- lar immunological knowledge. For our experimental lyze gene expression data of whole blood as well as for dataset we used CD3D, CD3E, CD3G, CD2 and CD7 as the sorted CD3+ cells of the same samples together with positive markers, and CD19, FCGR1A, CD14 and their FACS-measured cell type proportions for the CD3 MARCO as negative markers for the CD3+ T cells. + cell population. Simulated data Deconfounding algorithm: implementation and Cell type specific gene expression profiles (columns of enhancements the signature matrix S) were simulated according to a The basis of our deconfounding algorithm was imple- gamma distribution such that expectation value and var- mented as proposed by Venet et al. (2001) [10] and Lah- iance were those of the experimental data (shape a = desmaki et al. (2005) [4] using R [18]: 12.5 and scale b = 0.65): input X and n normalize columns of X (either centre, I ~(a,b) (6) gene or by quantile normalization) generate start values for S and C As by Venet et al. [10], biological variance was mod- apply constraints to S and C (see below) eled as multiplicative error term (cid:1), technical variance as (*) fix S, calculate C using lsqnonneg- additive error term (cid:1). For our experimental data, varia- algorithm tion was found to increase with mean signal intensities. apply constraints for S Therefore, we decided to model a constant coefficient of fix C, calculate S using lsqnonneg- variation instead of standard deviation: algorithm apply constraints for C Xgene Igene (cid:2) (7) if | X - SC | < a or number iterations > b then EXIT and report S and C where h = 0.17 and (cid:1) ~ N(0, c · Igene), using c = 0.1 as else continue at (*) estimated from our experimental data. Gene expression values for negative marker genes had where X is the gene expression matrix measured from expression X = 6.0, positive marker genes had heterogeneous tissue (rows: genes, columns: samples), S marker, neg X = 12.0 in the expressing cell type - as and C as in equation 2, iteration exit criteria were set a marker, pos observed for the marker genes in our experimental = 0.1 and b = 100. The Least squares non-negative study. Cell type proportions, C , were drawn from the matrix factorization algorithm is implemented as in the sim uniform distribution between cell type specific maxi- MATLAB function lsqnonneg [19]. The constraints are: mum and minimum values as in our experimental flow cytometry data. The simulated gene expression matrix, 1. S non-negative and normalized (either centered, X , was calculated from simulated cell type-specific or by quantile normalization [16]) sim 2. 0 ≤ c ≤ 1 for all elements of C (cell type i, gene expression profiles, Ssim, and simulated cell type ij proportions, C , corresponding to equation 2: sample j) sim Repsilberetal.BMCBioinformatics2010,11:27 Page6of15 http://www.biomedcentral.com/1471-2105/11/27 after deconfounding. Simulated whole blood gene X S C (8) sim sim sim expression data were analyzed using the t-test, ranking To investigate the algorithm’s capabilities regarding candidates for differential expression using p-values and detection of differential expression of single features and - to enable a direct comparison - considering the 100 for classification, two groups of gene expression profiles top candidates as positive candidates for differential were simulated, e.g. corresponding to TB patients and expression. The cell type-specific gene expression pro- TST+ controls in our experimental data. We simulated files (columns of the signature matrix Sˆ) estimated n = 100 individuals in each group. For each gene from deconfounding were ranked using absolute log- sample expression profile n ∈ {1000, 10000} genes were fold-change values. Also here the 100 top candidates genes considered, with n = 10 and n ∈ {20, 600} dif- were chosen. markers diff ferentially expressed biomarkers. Classification in the case of reverselyregulated Differential expression was simulated by adding Δ ∈ differentially expressed biomarkers diff 1, 2, 5} to the expression values of the biomarker genes The worst-case scenario for biomarker detection in het- in the first cell type. Figure 3 illustrates the generation erogeneous tissues arises when cell types involved of simulated profiles. express differentially regulated biomarkers in opposite Power study: valid biomarkers with and without directions. In this case, in the tissue RNA isolate, signals deconfounding for differential expression likely cancel each other and We simulated a gene expression experiment with sam- hamper detection of respective biomarkers markedly. To ples mixed out of two cell types (CD3 and other) for identify a possible exit strategy, we conducted a simula- 10,000 genes, where 600 genes were differentially tion study for this worst-case scenario, again considering expressed. For the differentially expressed genes we noise values estimated from the experimental data in simulated all eight possible combinations of NEUTRAL, this study. UP and DOWN. Sample sizes of the two groups under To exemplify the worst-case classification task, we comparison (alike TB and TST+ healthy control) varied simulated differential gene expression as above, but also from n ∈ {4, 10, 20, 40, 80, 120}. Simulated gene subtracted the same value from the expression values of samples expression data were analyzed as the experimental data. the second cell type. This way, for all cells in the mix- As for the latter we were able to analyze a simulated ture averaged over all samples, no differential expression whole blood sample (mixture of the two cell types) as is expected, while for the single cell types it is more or well as the two cell type-specific gene expression profiles less evident. Gene expression profiles for new samples, simulated S profiles simulated I profiles + ST+ simulated celCl tDyp3e "CD3" (S matrix) TST nonnoe n nooisisee TB vs. T 1.0 TB vs. 3 pr. 0.0 pr. 0 ene ex −1.0 ne ex −3 g e 0 2000 4000 6000 8000 10000 diff. 0 2000 4g0e0n0e number6000 8000 10000 diff. g gene number + ST+ simulateOd tcheellr tsype "Others" TST ememppiriirciacla nl oniosiese pr. TB vs. T 0.01.0 pr. TB vs. 03 ene ex −1.0 ne ex −3 g e 0 2000 4000 6000 8000 10000 diff. 0 2000 4g0e0n0e number6000 8000 10000 diff. g gene number gene position in S matrix Figure3Simulatedcelltype-specificgeneexpressionprofiles.Left:Smatricesforcell-typesCD3+andOthers.Right:Imatrixbeforeandafter addingempiricalnoise. Repsilberetal.BMCBioinformatics2010,11:27 Page7of15 http://www.biomedcentral.com/1471-2105/11/27 for validation of the trained classifiers in the classifica- Results tion scenario, were generated using the identical signa- As the experimental data offered gene expression pro- ture matrices, S , as for the training step, but with files for whole blood, i.e. a heterogeneous tissue which sim new values for the concentration matrices as well as for is a mixture of several cell types, and in addition the the noise term realizations. gene expression profiles from CD3+ cells of the same Canonical classification approach samples, and the respective CD3+ proportions (deter- For feature selection, t-tests were used to identify bio- mined by FACS), we were able to use this information marker candidates from the simulated heterogeneous as a basis for a validation study for the proposed decon- tissue gene expression data: The top n (cid:1) {10, 20} founding algorithm. cand were chosen to train a linear discriminant function as In addition, to methodologically optimize the decon- classificator. Classification errors in a validation (500 founding algorithm as well as to investigate its usability new cases simulated) for this classical classification to detect differentially expressed genes and biomarkers approach were then compared to a deconfounding rank- usable for classification of new patients (with only whole ing approach, which is described in the following. blood expression profiles measured) - we had to rely on Deconfounding ranking approach simulation studies. For the training dataset, a deconfounding analysis was Summarizing, our study was designed to answer four run and n candidates top ranked for differential questions. For which data scale and algorithm settings cand expression were picked from gene-wise mean absolute do we achieve: differences between the corresponding columns of the estimated signature matrices, Sˆ, for the two groups. In - The best estimate of cell type-specific expression addition, using the simulated whole blood expression profiles (columns of signature matrix)? Data basis: data, X , from the training dataset, a random forest experimental data. sim predictor was trained to estimate the cell type propor- - The best marker-based identification of recon- tions Cˆ , resulting from the deconfounding algorithm structed cell type-specific gene expression profiles run from the same training-data [20,21]. (columns of Sˆ)? Data basis: simulation study (para- Input to this statistical learning step were the gene meters estimated from experimental data). expression data in X for the n = 20 marker - The largest power to detect differential expression? sim markers genes. For each new individual during the validation Data basis: simulation study (parameters estimated part of the study, cell type proportions were estimated from experimental data). from the simulated whole blood gene expression profile - The smallest prediction errors for the classification using the trained random forest machine. Deconfound- task? Data basis: simulation study (parameters esti- ing results Sˆ of the training dataset for the two groups mated from experimental data). A and B were then multiplied with the estimated cell type proportions for the new individual, to result in Reconstruction of cell type-specific gene expression group-specific gene expression profiles xˆA and profiles and cell type proportions in experimental data .,sample with xˆB with the cell type proportions of the sam- The deconfounding algorithm was applied to the whole .,sample ple in question. The actual gene expression signals of blood gene expression matrices for both groups of indi- the sample at the chosen n biomarker loci were then viduals (TB and TST+) both using the quantile normali- cand compared to these group-specific gene expression zation as well as global mean normalization approach matrices and the following summary score computed: for log- and non-log-intensities. Numbers of cell types was set to n = 2. Deconfounding results - estimated CT ncand abs(x xgrouup ) (9) cell type-specific gene expression profiles Sˆ as well as group gene, new individual gene, new individual cell type proportions Cˆ - could be compared to gene1 the actual experimental data (see figure 4, figure 5 and Classification was based on choosing the group for table 1): which g was minimal. Figure 4 displays mean values of the measured CD3+ group Implementation and availability expression profile in TB patients against both estimated R-package deconf implementing the deconfounding columns of the signature matrix Sˆ. algorithm and options, R-scripts for data simulation, For the displayed example in figure 4, non-log data data analysis and an anonymized part of the experimen- were quantile normalized: Experimental data show con- tal dataset is available as additional file 1 (Windows R- siderable variation when compared to the estimates after package) and additional file 2 (tar-gz archive). deconfounding. As expected, cell type 1 is evidently bet- ter correlated with the experimental CD3+ profile than Repsilberetal.BMCBioinformatics2010,11:27 Page8of15 http://www.biomedcentral.com/1471-2105/11/27 Table 1Profile reconstruction versusdifferential gene expression:alternatives fordeconfounding algorithm settings Optimaldeconfoundingalgorithmsettings log/quant log/notquant notlog/quant notlog/notquant correconstr 0.86 0.85 0.73 0.68 DGEpower 0.47 0.46 70 0.62 Correlationsofmeasuredandestimatedcelltype-specificgeneexpressionprofiles("correconstr.”)aswellaspowerfordetectionofdifferentialexpression("DGE power”,seetext)–forallfourcombinationsofusinglogsornot,applyingquantileorglobalmeannormalization,respectively,forthedeconfoundingalgorithm. cell type 2. The correlation is best for large expression - and also a typical value for such type of clinical study values. involving high-throughput analyses. The effect of sample Also, referring to figure 5, even though there is lower size is clearly distinguishable for simulation results using correlation between experimental and estimated cell n = 4 (figure 6, left) and n = 120 (figure 6, sample sample type proportions, the indicated regression lines in the right) respectively. scatter plots for experimental and estimated proportions Cell type assignment using markers (simulation study) show the correct tendencies for the respective cell types. The deconfounding algorithm itself does not assign a Table 1 (first row) depicts correlations of mean mea- cell type to the estimated cell type specific expression sured profiles with the estimates from deconfounding profiles (columns of Sˆ). Therefore, to find out in which results for the comparison between the four methodolo- of the two possible orders the two estimated cell type gical algorithmic alternatives. profiles (CD3+ and others) reside, one has to rely on Deconfounding quality as function of sample size expression signals of cell type specific markers. Regard- (simulation study) ing the analysis of the experimental data, such markers To investigate the influence of sample size on the qual- were chosen based on a priori immunological knowl- ity of deconfounding results, we had to rely on simula- edge. In our simulation studies, we simulated 5-10 posi- tion studies which were aimed at mirroring tive CD3+ marker genes, which were expressed at high experimental data distribution and noise as realistically levels (simulated level for X = 12), whereas marker, CD3 as possible. Figure 6 (middle panel) shows the simula- these marker genes showed a low mean expression in tion results for n = 20, which approximates the the alternative cell type (simulated level for X sample marker, other sample size for the GC6 experimental data (cf. figure 4) = 6). These expression levels were used as observed for cell type 1 cell type 2 4 Counts 4 721 e 3 3 fil 217 o pr 181 D3 2 145 2 C 109 d re 1 73 1 u as 55 e m 37 0 0 19 8 −1 1 −1 −1 0 1 2 3 −4 −2 0 2 4 estimated CD3 profile estimated CD3 profile Figure4Validationofestimatedgeneexpressionprofiles.ValidationofgeneexpressionprofileestimateswithexperimentaldatafromFACS sortedCD3+cells:Leftpanel:measuredgeneexpressionintensitiesforCD3+cellsversusintensitiesestimatedforcelltype1.Rightpanel: measuredgeneexpressionintensitiesforCD3+cellsversusintensitiesestimatedforcelltype2. Repsilberetal.BMCBioinformatics2010,11:27 Page9of15 http://www.biomedcentral.com/1471-2105/11/27 A B ct1(2)/CD3 (TB) ct2(2)/CD3 (TB) %] %] 3 [ 70 3 [ 70 D D C 0 C 0 n 5 n 5 gi gi ri 0 ri 0 o 3 o 3 45 55 65 35 45 55 estimated ct1 [%] estimated ct2 [%] C D ct1(2)/Other (TB) ct2(2)/Other (TB) %] %] er [ 60 er [ 60 h h Ot 0 Ot 0 n 4 n 4 gi gi ori 20 ori 20 45 55 65 35 45 55 estimated ct1 [%] estimated ct2 [%] Figure5Validationofestimatedcelltypeproportions.Validationofcelltypeproportionestimateswithexperimentaldata(FACScountsfor CD3+cells):A:measuredproportionofCD3+cellsversusestimatedproportionsforcelltype1.B:measuredproportionofCD3+cellsversus estimatedproportionsforcelltype2.C:measuredproportionofnon-CD3+cellsversusestimatedproportionsforcelltype1.D:measured proportionofnon-CD3+cellsversusestimatedproportionsforcelltype2.Linearregressionlinesaredisplayedinred. 4 samples 20 samples 120 samples ) 2 g o (l e fil o r p 3 D C al n gi ri o estimated CD3 profile (log2, quant) Figure6Profileestimatesforsimulateddata.Geneexpressionprofileestimatesforsimulateddata,realisticnoise,quantilenormalisation,and samplesizesof4(leftpanel),20(middlepanel),or120samples(rightpanel). Repsilberetal.BMCBioinformatics2010,11:27 Page10of15 http://www.biomedcentral.com/1471-2105/11/27 the experimental data. Another group of marker genes cases A and B, for which differential gene expression is was simulated in the reverse manner. Figure 7 shows either in the same direction in both cell-types or differ- the distributions of estimated marker gene expression ential in one cell type only. However, for small sample levels from simulated data after deconfounding employ- sizes in all cases, and especially also for large sample ing global mean (A) or quantile normalization (B). Here, sizes in figure 8C, the deconfounding ranking approach use of the robust quantile normalization was rewarding detects more of the true differentially expressed genes for this critical step: Lack of a possibility to assign the than the t-test. As it is for this worst-case scenario (fig- right cell types thwarts the analysis as a whole. It is also ure 8C), where differentially expressed signals of the cell evident, that marker gene expression levels were esti- types involved cancel each other, we aimed at assessing mated mostly correctly regarding relative values in both application of the deconfounding ranking approach for cell types, whereas absolute gene expression levels were the classification objective for this case. scaled down in the estimates. However, to be able to As power for detection of differential expression use these cell type specific estimated marker gene (“DGE power”) we define the proportion of truly differ- expression levels to assign the right cell types it is only entially expressed genes in the 100 top-ranked 100 can- necessary that positive markers have top expression didates. Table 1 (second row) depicts this power for levels in the cell type exclusively expressing them. detection of differential expression for four algorithmic Valid detection of cell type-specific differential gene alternatives. Choosing quantile normalization for inten- expression (simulation study) sity values and using non-log values gives optimal Because we want to study the use of deconfounding for results. biomarker discovery, in our power-study we compared Applying the deconfounding approach for classification the t-test and our deconfounding approach regarding (simulation study) their power to detect differential gene expression (candi- As an important objective is to find biomarkers from date biomarkers). Figure 8 shows the central results: t- the estimated cell type specific gene expression signa- test and deconfounding approach show comparable tures resulting from the deconfounding, we have to results for higher sample sizes (40 ≤ n ≤ 120) and show how such biomarkers could be applied to a new sample patient’s whole blood expression dataset. The decon- founding algorithm results in estimates for the signature A matrix and the concentration matrix for a given group of samples. In our case, the procedure uses simulated gene expression profiles of 40 individuals (per study y sit8 group) to estimate two cell type-specific gene expression n e profiles (CD3+ and NotCD3+). It is, however, not possi- r int4 ble to use a single individual’s profile for deconfounding, e k as for a single case there is no information available mar0 about how a change in cell type proportions influences measured gene expression signals. To enable the use of 1 2 3 4 5 6 7 8 9 10 the deconfounding results for classification of a new individual, we have to either measure or estimate a sin- B gle individual’s cell type proportions. To estimate cell type proportions from a single whole blood expression y sit8 profile we employed a random forest machine to learn n e to predict cell type proportions from simulated whole r int4 blood gene expression data using the training dataset ke and the deconfounding estimates of Cˆ . For a new indi- ar0 vidual, this trained random forest was then used to esti- m mate cell type proportions. 1 2 3 4 5 6 7 8 9 10 These were multiplied to the group-specific signature mmaarrkkeerr ppoossitiitoinon matrices estimated by deconfounding from the two groups in the training data. The resulting group-specific Figure7Celltype-specificassignmentusingmarkers.Meanand rangeformarkerintensitiesafterdeconfoundingwithout(A)and gene expression matrices - based on cell type propor- withquantilenormalization(B).Cell-typesCD3+(red)andOther tions as in the new individual - were used in a majority (blue).Thefirstfivemarkerpositionsarepositivemarkers(exclusively votes comparison approach and the individual classified expressedinCD3+),theremainingfivearenegativemarkers(not accordingly. expressedinCD3+).