Faculty of Sciences
Department of Biochemistry, Physiology and Microbiology
Laboratory of Microbiology
2004-2005

Knowledge Accumulation of Microbial Data Aiming at a Dynamic Taxonomic Framework

Peter Dawyndt

Promotors: Prof. Dr. Hans De Meyer
           Prof. Dr. ir. Jean Swings

Dissertation submitted in fulfillment of the requirements for the degree of Doctor (Ph.D.) in Sciences, Computer Science

A Journey Through Life

And I think over again
My small adventures
When with a shore wind
I drifted out in my kayak
And thought I was in danger.
My fears,
Those small ones
That I thought so big
For all the vital things
I had to get and to reach.
And yet, there is only
One great thing,
The only thing:
To live to see in huts and on journeys
The great day that dawns
And the light that fills the world.
Song of the Kitlinguharmiut (Copper Eskimo), from the report of the Fifth Thule Expedition (1921-1924)

Examination Committee

Members of the reading committee

Prof. Dr. Brian Austin
School of Life Sciences, Heriot-Watt University, Edinburgh, Scotland

Prof. Dr. Bernard De Baets
Department of Applied Mathematics, Biometrics and Process Control, Ghent University, Belgium

Prof. Dr. Hans De Meyer (promotor)
Department of Applied Mathematics and Computer Science, Ghent University, Belgium

Prof. Dr. Mats Gyllenberg
Department of Mathematics and Statistics, University of Helsinki, Finland

Prof. Dr. Timo Koski
Department of Mathematics, Linköping Institute of Technology, Linköping University, Sweden

Prof. Dr. ir. Jean Swings (promotor)
BCCM™/LMG Bacteria Collection & Laboratory of Microbiology, Department of Biochemistry, Physiology and Microbiology, Ghent University, Belgium

Other members of the examination committee

Prof. Walter Bossaert
Department of Applied Mathematics and Computer Science, Ghent University, Belgium

Prof. Dr. Armand De Clercq
Department of Applied Mathematics and Computer Science, Ghent University, Belgium

Prof. Dr. Albert Hoogewijs (chairman)
Department of Pure Mathematics and Computer Algebra, Ghent University, Belgium

Prof. Dr. Micah Krichevsky
United States Federation of Culture Collections (USFCC), Bionomics International, Wheaton, MD, USA

Dr. Marc Vancanneyt
BCCM™/LMG Bacteria Collection, Department of Biochemistry, Physiology and Microbiology, Ghent University, Belgium

Dr. Luc Vauterin
Applied Maths BVBA, Sint-Martens-Latem, Belgium

Contents

Table of Contents
List of Figures
List of Tables
List of Abbreviations
Acknowledgments
1 Landscaping Bacterial Taxonomy
2 Integrated Strain Database
2.1 Introduction
2.2 Construction of an integrated strain database
2.2.1 Equational theory for the microbial labelling system
2.2.2 Algorithm for incremental learning of label equivalences
2.3 Error detection/correction strategies
2.3.1 Basic error detection and correction
2.3.2 Integrated strain history
2.4 Data quality assessment
2.5 Linking autonomous microbial data sources
2.5.1 Managing cross-references between BRCs and EMBL
2.5.2 Advanced dynamic queries
2.6 Conclusions and future perspectives
3 Min-transitive Approximations
3.1 Introduction
3.2 Equivalence relations
3.3 Transitive closure
3.4 Transitive openings
3.4.1 T-transitive openings of a similarity relation
3.4.2 The binary tree representation of min-transitive openings
3.4.3 The complete linkage clustering algorithm
3.4.4 A new min-transitive opening algorithm
3.4.5 Numerical example
3.4.6 Measurement of average deviations
3.5 Alternative transitive approximations
3.5.1 T-transitive approximations of a similarity relation
3.5.2 A first new min-transitive approximation algorithm
3.5.3 Numerical example
3.5.4 A second new min-transitive approximation algorithm
3.5.5 Numerical example
3.5.6 Measurement of average deviations
3.5.7 Min-transitive approximations using median linkage
3.6 Conclusions and future perspectives
4 Sliding Window Discretization
4.1 Introduction
4.2 Genotypic fingerprinting techniques
4.2.1 AFLP
4.3 Comparison of fingerprint patterns
4.4 Pairwise curve matching
4.4.1 Cosine measure
4.4.2 Pearson's product moment correlation
4.5 Band matching
4.6 Pairwise band matching
4.6.1 Simple band matching
4.6.2 Closest band matching
4.6.3 First band matching
4.7 Multiple band matching
4.7.1 Equal-width band matching
4.7.2 Histogram-based band matching
4.8 Sliding window discretization
4.9 Band pattern similarity quantification
4.10 Minimization of stochastic complexity
4.10.1 Stochastic complexity principles
4.10.2 BinClass implementation
4.10.3 A simple example
4.10.4 Finding the optimal α-cut for hierarchical classifications
4.11 Application to the taxonomy of Vibrionaceae
4.11.1 Ecological and taxonomical traits of the family Vibrionaceae
4.11.2 fAFLP fingerprinting on selection of bacterial strains
4.11.3 Discretization of fAFLP fingerprint patterns
4.11.4 Classification of binary vectors
4.11.5 Comparison of the alternative classifications
4.11.6 Evaluation of classification by domain expert
4.12 Conclusions and future perspectives
5 Improved Discriminatory Power of FAME Analysis
5.1 Introduction
5.2 FAME database construction
5.2.1 Cellular fatty acids
5.2.2 Chromatographic fatty acid decomposition