KNOWLEDGE DISCOVERY
IN BIOINFORMATICS
Techniques, Methods, and Applications
Edited by
XIAOHUA HU
Drexel University, Philadelphia, Pennsylvania
YI PAN
Georgia State University, Atlanta, Georgia
Copyright © 2007 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the
Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978)
750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030,
(201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts
in preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be suitable
for your situation. You should consult with a professional where appropriate. Neither the publisher nor
author shall be liable for any loss of profit or any other commercial damages, including but not limited to
special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our
Customer Care Department within the United States at (800) 762-2974, outside the United States at
(317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may
not be available in electronic formats. For more information about Wiley products, visit our web site at
www.wiley.com.
Wiley Bicentennial Logo: Richard J. Pacifico
Library of Congress Cataloging-in-Publication Data:
Knowledge discovery in bioinformatics : techniques, methods, and applications
/ edited by Xiaohua Hu, Yi Pan.
p. cm.
ISBN 978-0-471-77796-0
1. Bioinformatics. 2. Computational biology. I. Hu, Xiaohua (Xiaohua Tony)
II. Pan, Yi, 1960–
[DNLM: 1. Computational Biology–methods. 2. Medical Informatics–methods.
QU 26.5 K73 2007]
QH506.K5564 2007
570'.285–dc22
2006032495
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
CONTENTS
Contributors xiii
Preface xvii
1 Current Methods for Protein Secondary-Structure Prediction
Based on Support Vector Machines 1
Hae-Jin Hu, Robert W. Harrison, Phang C. Tai, and Yi Pan
1.1 Traditional Methods 2
1.1.1 Statistical Approaches 2
1.1.2 Machine Learning Approaches 2
1.2 Support Vector Machine Method 8
1.2.1 Introduction to SVM 8
1.2.2 Encoding Profile 10
1.2.3 Kernel Functions 11
1.2.4 Tertiary Classifier Design 15
1.2.5 Accuracy Measure of SVM 20
1.3 Performance Comparison of SVM Methods 22
1.4 Discussion and Conclusions 23
References 23
2 Comparison of Seven Methods for Mining Hidden Links 27
Xiaohua Hu, Xiaodan Zhang, and Xiaohua Zhou
2.1 Analysis of the Literature on Raynaud’s Disease 27
2.2 Related Work 29
2.3 Methods 30
2.3.1 Information Measures 31
2.3.2 Ranking Methods 31
2.3.3 Seven Methods 32
2.4 Experiment Results and Analysis 37
2.4.1 Data Set 37
2.4.2 Chi-Square, Chi-Square Association Rule, and Mutual
Information Link ABC Methods Compared 38
2.4.3 Chi-Square ABC Method: Semantic Check for Mining
Implicit Connections 38
2.4.4 Chi-Square and Mutual Information Link
ABC Methods 40
2.5 Discussion and Conclusions 43
Acknowledgments 43
References 44
3 Voting Scheme–Based Evolutionary Kernel Machines
for Drug Activity Comparisons 45
Bo Jin and Yan-Qing Zhang
3.1 Granular Kernel and Kernel Tree Design 46
3.1.1 Definitions 46
3.1.2 Granular Kernel Properties 47
3.2 GKTSESs 48
3.3 Evolutionary Voting Kernel Machines 51
3.4 Simulations 53
3.4.1 Data Set and Experimental Setup 53
3.4.2 Experimental Results and Comparisons 53
3.5 Conclusions and Future Work 54
Acknowledgments 55
References 55
4 Bioinformatics Analyses of Arabidopsis thaliana
Tiling Array Expression Data 57
Trupti Joshi, Jinrong Wan, Curtis J. Palm, Kara Juneau, Ron Davis,
Audrey Southwick, Katrina M. Ramonell, Gary Stacey, and Dong Xu
4.1 Tiling Array Design and Data Description 58
4.1.1 Data 58
4.1.2 Tiling Array Expression Patterns 59
4.1.3 Tiling Array Data Analysis 59
4.2 Ontology Analyses 61
4.3 Antisense Regulation Identification 63
4.3.1 Antisense Silencing 63
4.3.2 Antisense Regulation Identification 63
4.4 Correlated Expression Between Two DNA Strands 67
4.5 Identification of Nonprotein Coding mRNA 68
4.6 Summary 69
Acknowledgments 69
References 70
5 Identification of Marker Genes from High-Dimensional
Microarray Data for Cancer Classification 71
Jiexun Li, Hua Su, and Hsinchun Chen
5.1 Feature Selection 73
5.1.1 Taxonomy of Feature Selection 73
5.1.2 Evaluation Criterion 73
5.1.3 Generation Procedure 76
5.2 Gene Selection 78
5.2.1 Individual Gene Ranking 78
5.2.2 Gene Subset Selection 79
5.2.3 Summary of Gene Selection 82
5.3 Comparative Study of Gene Selection Methods 83
5.3.1 Microarray Data Descriptions 83
5.3.2 Gene Selection Approaches 83
5.3.3 Experimental Results 84
5.4 Conclusions and Discussion 85
Acknowledgments 85
References 85
6 Patient Survival Prediction from Gene Expression Data 89
Huiqing Liu, Limsoon Wong, and Ying Xu
6.1 General Methods 91
6.1.1 Kaplan–Meier Survival Analysis 91
6.1.2 Cox Proportional-Hazards Regression 93
6.2 Applications 95
6.2.1 Diffuse Large-B-Cell Lymphoma 95
6.2.2 Lung Adenocarcinoma 97
6.2.3 Remarks 98
6.3 Incorporating Data Mining Techniques to Survival Prediction 98
6.3.1 Gene Selection by Statistical Properties 99
6.3.2 Cancer Subtype Identification via Survival
Information 100
6.4 Selection of Extreme Patient Samples 103
6.4.1 Short- and Long-Term Survivors 103
6.4.2 SVM-Based Risk Scoring Function 103
6.4.3 Results 104
6.5 Summary and Concluding Remarks 108
Acknowledgments 109
References 109
7 RNA Interference and microRNA 113
Shibin Qiu and Terran Lane
7.1 Mechanisms and Applications of RNA Interference 114
7.1.1 Mechanism of RNA Interference 114
7.1.2 Applications of RNAi 117
7.1.3 RNAi Computational and Modeling Issues 120
7.2 Specificity of RNA Interference 121
7.2.1 Computational Representation of RNAi 121
7.2.2 Definition of Off-Target Error Rates 122
7.2.3 Feature Maps of Mismatch, Bulge, and Wobble 124
7.2.4 Positional Effect 125
7.2.5 Results for RNAi Specificity 125
7.2.6 Silencing Multiple Genes 128
7.3 Computational Methods for microRNAs 129
7.3.1 Prediction of microRNA Genes 130
7.3.2 Prediction of miRNA Targets 131
7.4 siRNA Silencing Efficacy 132
7.4.1 siRNA Design Rules 132
7.4.2 Efficacy Prediction with Support Vector Regression 134
7.5 Summary and Open Questions 136
7.5.1 siRNA Efficacy and Target mRNA Secondary Structures 137
7.5.2 Dynamics of Target mRNA and siRNA 137
7.5.3 Integration of RNAi into Network Models 137
Appendix: Glossary 138
References 140
8 Protein Structure Prediction Using String Kernels 145
Huzefa Rangwala, Kevin DeRonne, and George Karypis
8.1 Protein Structure: Granularities 146
8.1.1 Secondary-Structure Prediction 146
8.1.2 Protein Tertiary Structure 148
8.2 Learning from Data 149
8.2.1 Kernel Methods 150
8.3 Structure Prediction: Capturing the Right Signals 150
8.4 Secondary-Structure Prediction 151
8.4.1 YASSPP Overview 152
8.4.2 Input Sequence Coding 153
8.4.3 Profile-Based Kernel Functions 154
8.4.4 Performance Evaluation 154
8.5 Remote Homology and Fold Prediction 157
8.5.1 Profile-Based Kernel Functions 158
8.5.2 Performance Evaluation 161
8.6 Concluding Remarks 165
References 165
9 Public Genomic Databases: Data Representation,
Storage, and Access 169
Andrew Robinson, Wenny Rahayu, and David Taniar
9.1 Data Representation 170
9.1.1 FASTA Format 170
9.1.2 Genbank Format 171
9.1.3 Swiss-Prot Format 172
9.1.4 XML Format 176
9.2 Data Storage 180
9.2.1 Multidatabase Repositories 180
9.3 Data Access 183
9.3.1 Single-Database Access Point 183
9.3.2 Cross-Reference Databases 186
9.3.3 Multiple-Database Access Points 186
9.3.4 Tool-Based Interfaces 192
9.4 Discussion 194
9.5 Conclusions 194
References 194
10 Automatic Query Expansion with Keyphrases and POS
Phrase Categorization for Effective Biomedical
Text Mining 197
Min Song and Il-Yeol Song
10.1 Keyphrase Extraction-Based Pseudo-Relevance Feedback 198
10.1.1 Keyphrase Extraction Procedures 199
10.1.2 Keyphrase Ranking 200
10.1.3 Query Translation into DNF 202
10.2 Query Expansion with WordNet 203
10.3 Experiments on Medline Data Sets 203
10.4 Conclusions 205
References 206
11 Evolutionary Dynamics of Protein–Protein Interactions 209
L. S. Swapna, B. Offmann, and N. Srinivasan
11.1 Class I Glutamine Amidotransferase–Like Superfamily 211
11.1.1 DJ-1/PfpI Family 213
11.1.2 Comparison of Quaternary Structures of DJ-1
Family Members 214
11.2 Drifts in Interfaces of Close Homologs 214
11.2.1 Comparison of Quaternary Structures of Intracellular
Protease and Hypothetical Protein YhbO 216
11.2.2 Comparison of Quaternary Structures of Intracellular
Protease and DJ-1 218