Faculty of Sciences
Department of Biochemistry, Physiology and Microbiology
Laboratory of Microbiology
2004-2005
Knowledge Accumulation of
Microbial Data Aiming at a
Dynamic Taxonomic Framework
Peter Dawyndt
Promotors: Prof. Dr. Hans De Meyer
Prof. Dr. ir. Jean Swings
Dissertation submitted in fulfillment of the requirements for the degree of
Doctor (Ph.D.) in Sciences, Computer Science
A Journey Through Life
And I think over again
My small adventures
When with a shore wind
I drifted out in my kayak
And thought I was in danger.
My fears,
Those small ones
That I thought so big
For all the vital things
I had to get and to reach.
And yet, there is only
One great thing,
The only thing:
To live to see in huts and on journeys
The great day that dawns
And the light that fills the world.
Song of the Kitlinguharmiut (Copper Eskimo), from
the report of the Fifth Thule Expedition (1921-1924)
Examination Committee
Members of the reading committee
Prof. Dr. Brian Austin
School of Life Sciences, Heriot-Watt University, Edinburgh, Scotland
Prof. Dr. Bernard De Baets
Department of Applied Mathematics, Biometrics and Process Control,
Ghent University, Belgium
Prof. Dr. Hans De Meyer (promotor)
Department of Applied Mathematics and Computer Science, Ghent University, Belgium
Prof. Dr. Mats Gyllenberg
Department of Mathematics and Statistics, University of Helsinki, Finland
Prof. Dr. Timo Koski
Department of Mathematics, Linköping Institute of Technology,
Linköping University, Sweden
Prof. Dr. ir. Jean Swings (promotor)
BCCM™/LMG Bacteria Collection & Laboratory of Microbiology,
Department of Biochemistry, Physiology and Microbiology,
Ghent University, Belgium
Other members of the examination committee
Prof. Walter Bossaert
Department of Applied Mathematics and Computer Science, Ghent University, Belgium
Prof. Dr. Armand De Clercq
Department of Applied Mathematics and Computer Science, Ghent University, Belgium
Prof. Dr. Albert Hoogewijs (chairman)
Department of Pure Mathematics and Computer Algebra, Ghent University, Belgium
Prof. Dr. Micah Krichevsky
United States Federation of Culture Collections (USFCC)
Bionomics International, Wheaton, MD, USA
Dr. Marc Vancanneyt
BCCM™/LMG Bacteria Collection,
Department of Biochemistry, Physiology and Microbiology, Ghent University, Belgium
Dr. Luc Vauterin
Applied Maths BVBA, Sint-Martens-Latem, Belgium
Contents
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Acknowledgments
1 Landscaping Bacterial Taxonomy
2 Integrated Strain Database
2.1 Introduction
2.2 Construction of an integrated strain database
2.2.1 Equational theory for the microbial labelling system
2.2.2 Algorithm for incremental learning of label equivalences
2.3 Error detection/correction strategies
2.3.1 Basic error detection and correction
2.3.2 Integrated strain history
2.4 Data quality assessment
2.5 Linking autonomous microbial data sources
2.5.1 Managing cross-references between BRCs and EMBL
2.5.2 Advanced dynamic queries
2.6 Conclusions and future perspectives
3 Min-transitive Approximations
3.1 Introduction
3.2 Equivalence relations
3.3 Transitive closure
3.4 Transitive openings
3.4.1 T-transitive openings of a similarity relation
3.4.2 The binary tree representation of min-transitive openings
3.4.3 The complete linkage clustering algorithm
3.4.4 A new min-transitive opening algorithm
3.4.5 Numerical example
3.4.6 Measurement of average deviations
3.5 Alternative transitive approximations
3.5.1 T-transitive approximations of a similarity relation
3.5.2 A first new min-transitive approximation algorithm
3.5.3 Numerical example
3.5.4 A second new min-transitive approximation algorithm
3.5.5 Numerical example
3.5.6 Measurement of average deviations
3.5.7 Min-transitive approximations using median linkage
3.6 Conclusions and future perspectives
4 Sliding Window Discretization
4.1 Introduction
4.2 Genotypic fingerprinting techniques
4.2.1 AFLP
4.3 Comparison of fingerprint patterns
4.4 Pairwise curve matching
4.4.1 Cosine measure
4.4.2 Pearson’s product moment correlation
4.5 Band matching
4.6 Pairwise band matching
4.6.1 Simple band matching
4.6.2 Closest band matching
4.6.3 First band matching
4.7 Multiple band matching
4.7.1 Equal-width band matching
4.7.2 Histogram-based band matching
4.8 Sliding window discretization
4.9 Band pattern similarity quantification
4.10 Minimization of stochastic complexity
4.10.1 Stochastic complexity principles
4.10.2 BinClass implementation
4.10.3 A simple example
4.10.4 Finding the optimal α-cut for hierarchical classifications
4.11 Application to the taxonomy of Vibrionaceae
4.11.1 Ecological and taxonomical traits of the family Vibrionaceae
4.11.2 fAFLP fingerprinting on selection of bacterial strains
4.11.3 Discretization of fAFLP fingerprint patterns
4.11.4 Classification of binary vectors
4.11.5 Comparison of the alternative classifications
4.11.6 Evaluation of classification by domain expert
4.12 Conclusions and future perspectives
5 Improved Discriminatory Power of FAME Analysis
5.1 Introduction
5.2 FAME database construction
5.2.1 Cellular fatty acids
5.2.2 Chromatographic fatty acid decomposition