Ecketal.BMCBioinformatics (2017) 18:441 DOI10.1186/s12859-017-1843-1 RESEARCH ARTICLE Open Access Interpretation of microbiota-based diagnostics by explaining individual classifier decisions A. Eck1*, L. M. Zintgraf2, E. F. J. de Groot3, T. G. J. de Meij4, T. S. Cohen2, P. H. M. Savelkoul1,6, M. Welling2,5 and A. E. Budding1 Abstract Background: Thehumanmicrobiotaisassociatedwithvariousdiseasestatesandholdsagreatpromisefornon-invasive diagnostics.However,microbiotadataischallengingfortraditionaldiagnosticapproaches:Itishigh-dimensional,sparse andcomprisesofhighinter-personalvariation.Stateoftheartmachinelearningtoolsarethereforeneededtoachieve thisgoal.Whilethesetoolshavetheabilitytolearnfromcomplexdataandinterpretpatternsthereinthatcannotbe identifiedbyhumans,theyoftenoperateasblackboxes,offeringnoinsightintotheirdecision-makingprocess.Inmost cases,itisdifficulttorepresentthelearningofaclassifierinacomprehensibleway,whichmakesthempronetobe mistrusted,orevenmisused,inaclinicalenvironment.Inthisstudy,weaimtoelucidatemicrobiota-basedclassifier decisionsinabiologicallymeaningfulcontexttoallowtheirinterpretation. Results:Weappliedamethodforexplanationofclassifierdecisionsontwomicrobiotadatasetsofincreasingcomplexity: gutversusskinmicrobiotasamples,andinflammatoryboweldiseaseversushealthygutmicrobiotasamples.The algorithm simulates bacterial species as being unknown to a pre-trained classifier, and measures its effect on the outcome. Consequently, each patient is assigned a unique quantitative estimation of which species in their microbiota defined the classification of their sample. The algorithm was able to explain the classifier decisions well, demonstrated by our validation method, and the explanations were biologically consistent with recent microbiota findings. Conclusions: Application of a method for explaining individual classifier decisions for complex microbiota analysis proved feasible and opens perspectives on personalized therapy. Providing an explanation to support a microbiota-based diagnosis could guide decisions of clinical microbiologists, and has the potential to increase their confidence in the outcome of such decision support systems. This may facilitate the development of new diagnostic applications. Keywords: Microbiota, Inflammatory bowel disease (IBD), Supervised classification, IS-pro, Machine learning Background of microbiota analysis has been extensively studied, for The human microbiota refers to the trillions of microor- example, for disorders like inflammatory bowel disease ganismslivingonandwithinourbody,whichareinvolved (IBD) and diverticulitis [3–6] and has led to increasing inmanybiologicalprocessesnecessarytomaintainhuman effortstodevelopclinicalapplications. health.Thesemicrobialcommunitiesareassociatedwitha However, using microbiota for diagnostic purposes widerangeofdiseasestates[1,2],andholdagreatprom- imposes several challenges. Studying the human micro- ise for non-invasive diagnostics. The diagnostic potential biota is an active field of research. As knowledge is still accumulating, a full biological understanding of the underlying mechanisms is missing, limiting the diagnos- *Correspondence:[email protected] tic efforts to data-driven approaches, such as data 1DepartmentofMedicalMicrobiologyandInfectionControl,VUUniversity mining. The data itself is high-dimensional and has high medicalcenter,Amsterdam,TheNetherlands Fulllistofauthorinformationisavailableattheendofthearticle ©TheAuthor(s).2017OpenAccessThisarticleisdistributedunderthetermsoftheCreativeCommonsAttribution4.0 InternationalLicense(http://creativecommons.org/licenses/by/4.0/),whichpermitsunrestricteduse,distribution,and reproductioninanymedium,providedyougiveappropriatecredittotheoriginalauthor(s)andthesource,providealinkto theCreativeCommonslicense,andindicateifchangesweremade.TheCreativeCommonsPublicDomainDedicationwaiver (http://creativecommons.org/publicdomain/zero/1.0/)appliestothedatamadeavailableinthisarticle,unlessotherwisestated. Ecketal.BMCBioinformatics (2017) 18:441 Page2of13 inter-personal variability. Consequently, identification of the diagnostic Porto-criteria for pediatric IBD. Fresh meaningful patterns isthwarted. fecal samples were collected in sterile containers and Therefore, machine learning (ML) tools are particu- immediately stored in the freezer at −20 °C as was larly useful in this field. Compared to humans, classifiers previouslydescribed[9]. can more easily identify patterns in high-dimensional data, being able to learn to distinguish between categor- MicrobiotaprofilingbyIS-pro ies (e.g., sick vs. healthy) based on examples. A trained All samples were analyzed by intergenic spacer (IS) classifier can then make predictions for new patients, profiling (IS-pro) [8, 9]. IS-pro is a PCR-based technique accompanied by a probability reflecting its certainty. that differentiates bacterial species by the length of the However, contrary to humans, classifiers usually do not 16S–23S rRNA in combination with phylum-specific provide an explanation to their propositions. In most fluorescently labeled PCR primers. Three phylum- cases they are considered as black boxes, taking features specific primers were used: (1) Bacteroidetes, (2) Firmi- (e.g., microbial species) as input and returning a predic- cutes, Actinobacteria,Fusobacteria,andVerrucomicrobia tion. In a diagnostic setting, ML tools are able to assist (FAFV), and (3) Protebacteria. For more details see the us,but still cannotbeblindly trusted. Additionalfile1:SupplementaryMethods. Therefore, in order to effectively incorporate ML tools in microbiota-based diagnostics it is important to pro- Preprocessing vide clinical microbiologists with a sensible, sample- Preprocessing was carried out with the IS-Pro propri- specific explanation for every classifier prediction. This etary software suite (IS-Diagnostics, Amsterdam, the will allow them to weigh the outcome in the clinical and Netherlands) and resulted in microbial profiles, pre- biological context, interpret it and even adjust specific sented as peak profiles. Each peak represents an IS frag- therapy based on microbial species that appear to play a ment and is characterized by a color that corresponds to role inthediagnosis. the phylum (or phyla). The length of the IS fragment, In this paper, we present how an idea originally pro- measured in nucleotides (nt), discriminates bacterial posed by Robnik-Šikonja and Kononenko [7] can be species, and its intensity, measured in relative fluores- implemented for microbiota-based diagnostics. The cence units (RFU), reflects the abundance. Intensity method simulates input features as being unknown to values were log2 transformed. Each peak was taken as a measure their effect on the classification outcome. Con- feature for classification. After excluding all-zero fea- sequently, each input feature is assigned a relevance tures, 914 and 1199 features were left for classification value that reflects its importance for a specific decision intheSVGandIBDdatasets,respectively. of the classifier. With this approach, we aim to provide each patient a unique quantitative estimation of which Supervisedclassification species in their microbiota were used to determine their The following classifiers were used: Linear Support health status by the classifier. Given a trained classifier Vector Machine (SVM), Random Forest (RF), Nearest and a microbiota sample of a new patient, the species Shrunken Centroids (NSC) and Logistic Regression with relevance ranking may help interpret why this patient L2regularization (LR). was given a specific diagnosis, highlight species of inter- These classifiers are especially suitable for high- est,andguidetreatment. dimensionaldataastheyapplyregularizationordimension- This method was applied on two microbiota datasets ality reduction. Examples of their applicability to human generated by IS-pro, a technique designed for micro- microbiotawerepreviouslydescribed[10].Implementation biota profiling in clinical routine [8]. We further also was done using packages of the scikit-learn 0.17.0 Python describe how the calculated explanations can be vali- library: svm.SVC (with linear kernel and probabilistic dated and visualized. outputs); neighbors.nearest_centroid.NearestCentroid (with a 0.1 shrink threshold and probabilities as computed in Methods [11]);ensemble.RandomForestClassifier(with50trees),and Subjectsandsamples linear_model.LogisticRegression(withL2penalty). Two balanced datasets were included: skin versus gut Overall relevance measures of NSC were obtained per microbiota (SVG, n = 94) and gut microbiota of IBD class by taking the difference between the global patients versus healthy individuals (IBD, n = 112). Skin centroid and the class centroid, to which we compared microbiota samples were collected by swabbing the our explanation for the predicted class. For the RF inner lower arm of healthy adults. Swabs were stored in classifier, we used the importance ranking given by RTF buffer immediately after collection and later frozen RandomForestClassifier.feature_importances. Note that at −20 °C. Gut microbiota was sampled from healthy using the weights of a linear SVM as an indication of children and children diagnosed with IBD according to feature importance is wrong and misleading [12]. Ecketal.BMCBioinformatics (2017) 18:441 Page3of13 Similarly, the coefficients of the LR classifier cannot be relevance of the respective input feature. A positive interpreteddirectly. relevance value means that the feature value is a supportive evidencefor the decision, andanegative rele- Algorithmforexplainingclassifierdecisions vance value expresses that the feature value constitutes Robnik-Šikonja and Kononenko [7] proposed that in evidence against the decision (and therefore evidence for order to measurehow important afeature (e.g., abacter- theotherclassinatwo-classproblemlikeours). ial species) is, we can evaluate the classifier output when that feature is considered “unknown”. In our case, we Univariatevs.multivariateapproach seek to evaluate the decision the classifier would make if The method described above is a univariate approach: the information about a single bacterial species is not to estimate a feature’s relevance, we marginalized out available. By simulating this for every feature individu- only this one feature. Intuitively, we would expect the ally, we obtain an overall ranking of the importance of classifier to stay robust to changes in only one feature, eachspeciesfor a sample-specificclassification. for example when features are redundant. In this case, it The authors proposed three strategies to evaluate the might be necessary to remove several features at once to classifier while leaving one feature out: simply declare have a noticeable effect on the prediction of the classi- the feature as unknown (which only few classifiers fier. We propose a computationally feasible multivariate allow), re-train the classifier without that feature (which implementation for microbiota data in the Additional is computationally unfeasible for clinical applications if file1:SupplementaryMethods. there are many features), or approximately marginalize the feature out. We chose the latter for our analysis. Analyticvalidationofresultingexplanations Based on this strategy, we used the empirical distribu- When thedata is poorly understood byhumans,it isnot tion of the feature in the dataset, and replaced its value straightforward to assess how good an explanation is. in the given sample with all other possible values it We propose a novel method for the analytic validation obtained in the dataset. The average prediction probabil- of the explanations, since we have not found any such ity over those values was then used to determine the method in literature. The relevance estimation method importance of the feature (see the Additional file 1: described above returns a relevance ranking which indi- Supplementary Methods for the mathematical formula- cates a contribution measure of each feature. Features tion). The algorithm returns a vector of the same size as were sorted according to their relevance in ascending the number of features, where each entry reflects the order (from features with negative influence on the class A B C D E Fig.1Agutmicrobiotasampleanditspeaks’relevancevalues:(a)Agutmicrobiotasample:Peakscorrespondtobacterialspecies;Peakheights correspondtoabundance.(b)Relevancevaluescalculatedforeachpresentpeakintheprofilein(a).(c)Relevancevaluescalculatedforeach zero-valuedpeakintheprofilein(a).(d)Averagegutmicrobiotaprofileacrossallsamples.(e)Averageskinmicrobiotaprofileacrossallsamples; FAFV:Firmicutes,Actinobacteria,FusobacteriaandVerrucomicrobia;RelevancewascalculatedbasedonSVMpredictions Ecketal.BMCBioinformatics (2017) 18:441 Page4of13 score, to irrelevant features, to the most important Skinvs.gutmicrobiota ones). Then, we successively marginalized out a growing This section presents the results for a simple case study: number of features, starting with the least important theSVGdataset. ones, until no features were left. We used the predicted class for the analysis, and observed how the class prob- Explainingasingleprediction ability changed based on the features' subset used for Figure1displaystheoutputoftheexplanationalgorithm the classification. The class probability should rise when for a single prediction of the classifier. For this example, features with negative evidence for the predicted class we selected a gut microbiota sample correctly classified are ignored, remain stable if features of zero importance by a linear SVM. Bacterial species abundances (feature are removed, and decline when highly discriminative values) are shown in Fig. 1a, and the relevance values featuresareignored. calculated for each feature are illustrated below (Fig. 1b We also assessed our results based on pairwise corre- and c). To facilitate interpretation, we split the relevance lations between relevance vectors of a sample calculated calculated for positive-valued features (present peaks, by different classifiers, as well as between sample- Fig. 1b) from that of zero-valued features (absent peaks, specific relevancemeasures andglobalones. Fig. 1c), since zero-abundance peaks also hold informa- tion but are not visible in the profile and therefore can be visually confusing. Results In general, for a single gut microbiota sample, we In this study, we calculated explanations for individual expect that common gut colonizers would obtain classifier decisions using two microbiota datasets. Each positive relevance values when they are found in the sample is represented as a peak profile, composed of sample, and negative relevance values in case they are peaks representing different bacterial species and their not detected. On the other hand, common skin colo- abundances,which aretaken asfeaturesforclassification nizers should obtain negative relevance values when (see Methods). they are found in the sample and positive values The first dataset, denoted “SVG” (Skin Versus Gut), otherwise. This was indeed observed for this example, consisted of 47 skin microbiota samples that were classi- with common members of the gut microbiota, such fied against 47 gut microbiota samples (both of healthy subjects). Since themicrobiota composition thatinhabits Table1Mostrelevantbacterialspeciesforclassificationofthe gutmicrobiotasampleshowninFig.1 skin or gut is very different, this classification task is relatively easy and can be directly interpreted by experts. Speciesname IS-pro Presence Relevance Common peak(s) habitat This dataset was used to demonstrate how the method Alistipesputredinis 235 + 0.43 Gut works and how the results can be presented, while allowing aqualitativeevaluation ofthe method. Alistipesfinegoldii 396 + 0.4 Gut The second dataset, denoted “IBD”, consisted of 56 231 + 0.33 Gut IBD gut microbiota samples (20 Ulcerative Colitis Lachnospiraceaesp. 605 + 0.25 Gut patients and 36 Crohn’s disease patients) classified Bacteroidesfragilis 537 + 0.22 Gut against 56 healthy gut microbiota samples. This data is Staphylococcusepidermidis 313 – 0.2 Skin more complex and not so well understood, and classi- Odoribactersplanchnicus 307 + 0.18 Gut fiers outperform human expertise. It is those cases that are interesting for clinical applications, and for which ex- Bacteroidessp. 549 + 0.16 Gut planationsforclassifierdecisionsareespeciallydesirable. Prevotellasp. 438 + 0.15 Gut The performance obtained by the classifiers in terms Lachnospiraceaesp. 491 + 0.15 Gut of prediction accuracy in a 10-fold cross-validation test Alistipesfinegoldii 230 – −0.14 Gut was: SVM - 98% and 81%, RF - 99% and 81%, NSC - Bacteroidesvulgatus 478 + 0.14 Gut 99% and 79%, and LR- 100% and 78%, for SVG and IBD Streptococcusmitis 296 – 0.13 Skin datasets,respectively. Each classifier’sdecision isaccompanied byaprobabil- Sutterellawadsworthensis 661 + 0.12 Gut ity value reflecting the certainty of the classifier (all of unclassifiedBacteroidete 455 + 0.09 Gut the above classifiers are either probabilistic or their out- unclassifiedProteobacterium 932 + 0.09 – put can be transformed to probabilities). To explain the Escherichiacoli 735/828 + 0.08 Gut decision, we assigned each input feature a relevance value. A positive value means that the feature’s value was unclassifiedFirmicute 558 + 0.07 Gut ’+’and’-’indicatewhetherapeakispresentorabsentinasample,andneeds supportiveevidenceforthepredictedclass,andanegative tobecoupledwiththerelevancevalueforinterpretation.Rankingisbasedon valuemeansitwasevidenceagainstthepredictedclass. absoluterelevancevalues;ClassificationwasdonebyalinearSVM Ecketal.BMCBioinformatics (2017) 18:441 Page5of13 as Bacteroides and Prevotella species, being assigned visual interpretation, we again separated the present positive relevance when they were found in the sam- peaks (Fig. 2a) from the absent peaks (Fig. 2b). In both ple. This is indicated by the matching peaks between heat maps, the relevance values clustered the samples Fig. 1b and d. Prevalent skin species that were not according to their (mostly correctly) predicted class. detected in this sample, such as Staphylococcus epi- Prevalent colonizers of each habitat stood out with es- dermidis, also obtained positive relevance. This is in- pecially high or low relevance values, depending on the dicated by the matching peaks between Fig. 1c and e. class and the presence status. Top ranked species, by Table 1 lists the top ranked species according to ab- means of average absolute relevance across all samples, solute relevance, together with their presence status are listed in Table 2. in this sample and the body habitat they usually colonize as a validation. Healthyvs.IBDgutmicrobiota While body habitats are relatively easy to classify, and Meta-analysisofallsamples the calculated explanations of the classification are suit- While the proposed method is mainly aimed to explain able for a qualitative interpretation, real-life clinical single predictions, to validate it and gain insight into its dilemmas are usually much more complicated and much behavior across the entire dataset, we calculated expla- less straightforward to interpret. The following use-case, nations for every sample in the dataset. These were in which we classified gut microbiota samples of IBD summarized in heat maps (Fig. 2) to visualize the over- patients against gut microbiota of healthy individuals, is all relevance of features in the dataset. To facilitate a agoodexample. A B Fig.2Skinmicrobiotavs.gutmicrobiota:Clusteredheatmapsofpeaks’relevancevaluesacrossallsamples.Columnscorrespondtosamples, rowstopeaks(orbacterialspecies).Toprelevantspeciesforclassificationareindicated.(a)Heatmapincludingonlytherelevancecalculatedfor positive-abundancepeaksineachsample;(b)Heatmapincludingonlytherelevancecalculatedforzero-abundancepeaksineachsample.Only peakswithmeanabsoluterelevanceabove0.01areshownineachheatmap.Colorkeyfrompinktoblueindicatesrelevancefromlowtohigh, respectively.Foreachsamplethepredictedclass(pred)andthetrueclass(obs)areindicated.Clusteringisbasedoncosinecorrelationmatrix followedbyunweightedpairgroupmethodwitharithmeticmean(UPGMA).RelevancevalueswerecalculatedbasedonlinearSVMpredictions Ecketal.BMCBioinformatics (2017) 18:441 Page6of13 Table2Mostrelevantbacterialspeciesforclassificationofskinmicrobiotaversusgutmicrobiotasamples Speciesname IS-propeak(s) Commonhabitat Rankedtop15by: Alistipesfinegoldii 230/231/395/396/400/407 Gut all Alistipesputredinis 235/236 Gut all Bacteroidesvulgatus 478/479 Gut all Bacteroidesfragilis 537 Gut all Clostridiumperfringens 235 Gut all Streptococcusmitis 296/297 Skin all Staphylococcusepidermidis 312/313 Skin svm,nsc,lr Prevotellasp. 437/438 Gut svm,rf,lr Bacteroidessp. 474 Gut svm,rf,nsc Lactobacillusparacasei 294 Gut svm,nsc Staphylococcusepidermidis 303 Skin nsc,lr Streptococcussp. 321 Gut nsc,lr Bifidobacteriumadolescentis 495 Gut nsc,lr Clostridiumsp. 231 Gut rf Odoribactersplanchnicus 308 Gut rf Streptococcussp. 318 Gut rf Lactobacillussp. 478 Gut rf unclassifiedProteobacterium 941 – rf Lactococcuslactis 344 Gut nsc Propionibacteriumacnes 269 Skin nsc Rankingisbasedonabsolutemeanrelevancevaluesacrossallsamplesperclassifier.Speciesareorderedaccordingtothenumberofclassifiersbywhichthey weretop-ranked.All-allclassifiers;svm-linearSVM;rf-RandomForest;nsc-NearestShrunkenCentroids;lr-L2-regularizedLogisticRegression Explainingasingleprediction Meta-analysisofallsamples Figure 3 illustrates the relevance calculated for three Figure5provides anoverviewof the relevance valuescal- selected IBD individual samples, all correctly classified culatedperpeakforeverysampleinthedataset.Theclus- by a linear SVM. Similar to Fig. 1, peak-specific tering of the samples matched the predicted classes, relevance values are presented in reference to their althoughitfittedlessaccuratelythetrue classofthesam- abundance profile (Fig. 3a and b), which allows a ples, since more samples were misclassified compared to straightforward interpretation of which peaks in the the SVG dataset. The color pattern for prevalent species profile drove the classification towards the IBD class or wascomplementarybetweenthetwoclassesandthepres- against it. In a separate panel (Fig. 3c) the same is ence (Fig. 5a) or absence (Fig. 5b) status, i.e., a prevalent displayed for zero-abundance peaks. The most relevant peak that stronglysupported one classalsosupported the species to explain the classifier’s decision for the top otherclassbyitsabsence,andviceversa.Topexplanatory example arelistedinTable3. speciesbymeansofaverageabsoluterelevanceforclassifi- In a clinical context, such relevance ranking can cationacrossallsamplesareshowninTable4. improve the understanding of a classifier’s output by clinicians, and potentially support their decisions. Discoveringunderlyingsubgroupswithinaclass Species that were assigned high relevance values may IBD is a complex and multifactorial disease character- then guide an intervention according to their abun- ized by high between-patient variability in symptoms dance. For example, in the top case presented in and response to treatment. This variability is attributed Fig. 3, determining a relevance cutoff of 0.25 would to various factors, including the intestinal microbiota result in a short list of species that can be presented composition. We show how the calculated explanations to clinical microbiologists, which is more informative may be used to reveal underlying subgroups within a than just presenting the decision (i.e. class probabil- group of patients (Fig. 6). Most of Crohn’s disease ity) itself (Fig. 4). Based on their expertise, it may be patients in our cohort (32/36) were treated with exclu- advised that eliminating Escherichia coli and/or sup- sive enteral nutrition (EEN), and were evaluated again plementing Alistipes spp. would be beneficial for this after a six-week period. These patients could be grouped particular patient. into two clusters based on the calculated relevance Ecketal.BMCBioinformatics (2017) 18:441 Page7of13 A B C D E Fig.3ThreeIBDgutmicrobiotasamplesandtheircorrespondingpeaks’relevancevaluesillustrateuniquesample-specificrelevanceranking.All sampleswerecorrectlyclassified,withtheclassifier’sprobabilityvaluesdisplayedpersample.(a)Gutmicrobiotasamples:peakscorrespondto bacterialspecies;Peakheightscorrespondtoabundance.Classificationprobabilityisshowntotherightofeachprofile.(b)Relevancevalues calculatedforeachpresentpeakintheprofilesin(a).(c)Relevancevaluescalculatedforeachabsentpeak(i.e.zero-valuedpeak)intheprofiles in(a).(d)AverageIBDgutmicrobiotaprofileacrossallsamples.(e)AveragehealthymicrobiotaprofileacrossallsamplesFAFV:Firmicutes, Actinobacteria,FusobacteriaandVerrucomicrobia;RelevancewascalculatedbasedonSVMpredictions values, with a significant enrichment of patients with thepredictedclass,Fig.7a).Mostfeatureshadanear-zero induced remission in response to EEN in one cluster relevance score, and were therefore not important for the comparedtotheothercluster(p=0.04,chi-squaretest). decision. We observed a rise in certainty when features withhighnegativerelevance wereremoved,followed by a Analyticvalidationofexplanations strong decline in response to removal of features with Since the IBD dataset is poorly understood by humans, top-ranked positive evidence (Fig. 7b). In the middle we sought for a way to analytically validate the results. range, where features with near-zero relevance were ex- Here, we propose a novel validation method that cluded, the validation curve stagnates, indicating that re- consists of subsequently removing a growing number of movingthemhadnoinfluenceontheclassification. features, sorted by their relevance in ascending order. In addition, we validated our results by comparing For a good explanation, the confidence of the classifier explanations calculated by different classifiers by sam- is expected to first go up (evidence against the predic- ple, and comparing sample-specific explanations of a tion is removed), and then start to go down when classifier with the respective model’s global ranking (if important features are taken out. When all features are available by the classifier). The correlation was higher removed, the classifier has no information left and there- when calculated within-classifier, i.e. between the foreitscertaintyshoulddropto0.5inabalanceddataset. sample-specific relevance vectors and the global im- The average validation results for the IBD dataset are portance ranking of the same classifier, than between presented in Fig. 7. Only a small number of features was sample-specific relevance vectors of different classi- very relevant (either as positive or negative evidence for fiers (Additional file 2: Figure S1). Ecketal.BMCBioinformatics (2017) 18:441 Page8of13 Table3MostrelevantbacterialspeciesforthetopIBD context. Our algorithm is based on a method originally microbiotasamplepresentedinFig.3 proposedbyRobnik-ŠikonjaandKononenko[7].It gener- Speciesname IS-propeak(s) Presence Relevance ates a sample-specific feature relevance ranking that, Escherichiacoli 735 + 0.62 combined with an abundance profile, allows us to Alistipesfinegoldii 396 – 0.57 reason about the importance of different species in driving a prediction. This gives an insight into the Alistipesfinegoldii 230 – 0.52 specific pathogenic perturbations of the microbiota of unclassifiedProteobacterium 747 + 0.35 individual patients, which in turn may allow adjusting Lachnospiraceaesp. 606 + 0.3 personalized treatment, supplementing or targeting Alistipesputredinis 235 – 0.29 specific species. unclassifiedFirmicute 326 + −0.26 We first tested our approach on an easily classifiable dataset consisting of gut and skin microbiota samples. Finegoldiamagna 306 + 0.24 Species ranked highly relevant were identified as Parabacteroidesdistasonis 461 + 0.23 common inhabitants of each of these body sites [13]. Morganellamorganii 855 + −0.23 Common gut colonizers, such as Bacteroidetes spp. and Alistipesfinegoldii 400 – 0.21 E. coli, were supportive evidence for the ‘gut’class. Skin Bacteroidesfragilis 605 + −0.21 samples presented an opposite image, with dominant Escherichiacoli 827 + 0.19 skin bacteria, such as S. epidermidis and Streptococcus Alistipesputredinis 236 – 0.18 mitis, driving the classification by their presence. It is important to note that evidence for or against a class Streptococcussp. 321 + 0.17 can be found in the presence of certain species, but also unclassifiedFirmicute 525 + 0.17 intheirabsence. Bacteroidesvulgatus 479 – 0.16 Assessing the explanations for the ‘IBD versus Bacteroidessp. 594 + 0.16 healthy’ case was more challenging. Although findings Alistipesfinegoldii 407 – 0.15 of aberrations in the microbiota composition of IBD Escherichiacoli 828 – −0.14 patients are well established [14–16], they are not consistent regarding specific species. Increased Bacteroidesfragilis 537 – 0.05 amounts of mucosa-associated E. coli, which was ’+’and’-’indicatewhetherapeakispresentorabsentinasample,andneeds assigned high relevance values by all classifiers tested, tobecoupledwiththerelevanceforinterpretation.Rankingisbasedon absoluterelevancevalues;ClassificationwasdonebyalinearSVM were indeed previously detected in IBD patients [17]. We also compared our results to a recent study1 [9] Discussion in which a core microbiota, prevalent and stable over In this study, we present how a method aimed to explain time, was identified in healthy children. Interestingly, individual classifier decisions can be applied to human strong evidence for the IBD class was given by the microbiota data to facilitate the interpretation of absence of peaks originating from species described microbiota-based predictions in a biologically meaningful as members of the pediatric healthy core microbiota, A B Fig.4Outputofaclassifierwithout(a)andwith(b)explanation Ecketal.BMCBioinformatics (2017) 18:441 Page9of13 A B Fig.5IBDgutmicrobiotavs.healthygutmicrobiota:Clusteredheatmapsofpeaks’relevancevaluesacrossallsamples.Columnscorrespondto samples,rowstopeaks(orbacterialspecies).Mostrelevantspeciesforclassificationareindicated.(a)Heatmapincludingonlytherelevance calculatedforpositive-abundancepeaksineachsample;(b)Heatmapincludingonlytherelevancecalculatedforzero-abundancepeaksineach sample.Onlypeakswithmeanabsoluterelevanceabove0.01areshownineachheatmap.Colorkeyfrompinktoblueindicatesrelevancefrom lowtohigh,respectively.Foreachsamplethepredictedclass(pred)andthetrueclass(obs)areindicated.Clusteringisbasedoncosinecorrelation matrixfollowedbyunweightedpairgroupmethodwitharithmeticmean(UPGMA).RelevancevalueswerecalculatedbasedonlinearSVMpredictions such as certain Alistipes, Bacteroides and Prevotella ignored. Indeed, this was observed for all the classifiers species. These species were mostly absent from IBD we tested in the IBD dataset. We could rank features samples, and were assigned consistently high rele- correctly according to their relevance even when a vance values, which resulted in high dendrogram classifier waslesscertainabout itsdecisions andresulted distances between the two clusters that appear in the in lower probabilities, like in the case of RF (Fig. 7). In heat map displaying the absent peaks (Fig. 5b). In our case, the set of important features for one individual general, in all heat maps, clustering based on the cal- sample was quite small, meaning that a short list of the culated relevance values separated the samples mostly mostimportantspeciescouldbeeffectivelypresentedtoa according to their predicted class, which means the physician. Note that although the set of important relevance values explain the classifier decisions well. features was rather small for an individual classification, To analytically validate this, we proposed a method this doesnot meanthat the classifierrelies only on those, that estimates how the classifier’s probability for a class and that they hold all the information.The set ofimport- changeswhenfeaturesof varyingrelevanceareexcluded. antfeaturesvariedwidelybetweensamples(Figs.2and4), This approach brings value since it is not always which corroborates the need for a sample-specific explan- straightforward to assess how good an explanation is, ationmethodformicrobiotadata. especially when we lack knowledge about the domain. A Still, sample-specific explanations were highly corre- good explanation means that the certainty of a classifier lated with the global feature rankings at the model level would decline only when high relevance features are (Additional file 2: Figure S1), indicating that our method Ecketal.BMCBioinformatics (2017) 18:441 Page10of13 Table4MostrelevantbacterialspeciesforclassificationofIBD gutmicrobiotaversushealthygutmicrobiotasamples Speciesname IS-propeak(s) Rankedtop15by Alistipesfinegoldii 230/395/396/400/407 all Alistipesputredinis 235/236 all Odoribactersplanchnicus 307/308 all unclassifiedFirmicute 396 all Escherichiacoli 735/821/827/828 all unclassifiedBacteroidete 251 svm,rf,nsc Lachnospiraceaesp. 491/605 svm,rf,lr Bacteroidesvulgatus 479 svm,nsc Lactobacillussp. 478 svm,lr Streptococcussp. 318 svm,lr Bacteroidessp. 474 svm,lr Streptococcussp. 321 nsc,lr Prevotellasp. 438 rf,nsc Bacteroidesfragilis 537 rf,nsc Clostridiumperfringens 233/235 rf,nsc Lactobacillussp. 244 rf,nsc unclassifiedFirmicute 156 rf Rankingisbasedonabsolutemeanrelevancevaluesacrossallsamplesper classifier.Speciesareorderedaccordingtothenumberofclassifiersbywhich theyweretop-ranked.All-allclassifiers;svm-linearSVM;rf-RandomForest; nsc-NearestShrunkenCentroids;lr-L2-regularizedLogisticRegression conforms to the general rankings. Some discrepancies occur when comparing explanations between the differ- ent classifiers, which represent different algorithm concepts; SVM and LR are linear classifiers, RF is an en- Fig.6Responsetoexclusiveenteralnutrition(EEN)inCrohn’sdisease semble of decision trees, and NSC relies on the distance patients:aclusteredheatmapdisplayingpeaks’relevancevalues between data points to class centroids. While there may acrossCrohn’sdiseasepatientswhoreceivedEENtreatment.Columns correspondtosamples,rowstopeaks(orbacterialspecies).Foreach be several solutions to a problem and different classifiers sample,itisindicatedwhetherthepredictionwascorrect(i.e.the can make different mistakes, we can assume that the samplewasclassifiedas‘IBD’,darkgrey)ornot(thesamplewas better the prediction accuracy ofa classifier is,the closer classifiedas‘healthy,lightgrey),andwhetherremissionwasinducedin the explanation of its decision is to the true difference responsetoEEN(green)ornot(red).Theclusterontheleftisenriched between the classes. Nonetheless, in practice it is also withpatientswithinducedremissioncomparedtotheothercluster ontheright(p=0.04,chi-squaretest).Onlypeakswithmeanabsolute important to pay attention to overfitting, which is a relevanceabove0.01areshown.Colorkeyfrompinktoblueindicates common concern in small microbiota datasets. While in relevancefromlowtohigh,respectively.Clusteringisbasedoncosine our experiments we did not observe overfitting, it is im- correlationmatrixfollowedbyunweightedpairgroupmethodwith portant to take into account that in such cases the clas- arithmeticmean(UPGMA).Relevancevalueswerecalculatedbasedon sifier may fit to random noise in the data, and the linearSVMpredictions explanations mightnotbesensible. The calculated relevance can also be used to reveal underlying subgroups within a patient group. As an location, a biomarker to predict the response or to example, we focused on response to EEN, an effective adjust an individualized application of EEN in chil- therapeutic measure to induce remission in active dren is not yet available. Our results suggest a role of Crohn’s disease pediatric patients. EEN consists of a the microbiota in the response to EEN, and illustrate complete liquid diet [18] and leads to the induction the potential the explanations hold to unveil patterns of remission in approximately 85% of patients. In our in the data. cohort, remission was induced in 18 out of 32 pa- Several other techniques also aim to interpret the tients. While several factors may influence the re- learning of a classifier. These are mostly focused on sponse to EEN, including disease duration and feature ranking and feature selection, which are global
Description: