ANALYZING THE LARGE NUMBER OF VARIABLES IN BIOMEDICAL AND SATELLITE IMAGERY

PHILLIP I. GOOD

A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2011 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Good, Phillip I.
Analyzing the large number of variables in biomedical and satellite imagery / Phillip I. Good.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-92714-4 (pbk.)
1. Data mining. 2. Mathematical statistics. 3. Biomedical engineering–Data processing. 4. Remote sensing–Data processing. I. Title.
QA76.9.D343G753 2011
066.3'12–dc22 2010030988

Printed in Singapore

oBook ISBN: 978-0-470-93727-3
ePDF ISBN: 978-0-470-93725-9
ePub ISBN: 978-1-118-00214-8

10 9 8 7 6 5 4 3 2 1

CONTENTS

Preface

1 VERY LARGE ARRAYS
  1.1 Applications
  1.2 Problems
  1.3 Solutions

2 PERMUTATION TESTS
  2.1 Two-Sample Comparison
    2.1.1 Blocks
  2.2 k-Sample Comparison
  2.3 Computing the p-Value
    2.3.1 Monte Carlo Method
    2.3.2 An R Program
  2.4 Multiple-Variable Comparisons
    2.4.1 Euclidean Distance Matrix Analysis
    2.4.2 Hotelling's T²
    2.4.3 Mantel's U
    2.4.4 Combining Univariate Tests
    2.4.5 Gene Set Enrichment Analysis
  2.5 Categorical Data
  2.6 Software
  2.7 Summary

3 APPLYING THE PERMUTATION TEST
  3.1 Which Variables Should Be Included?
  3.2 Single-Value Test Statistics
    3.2.1 Categorical Data
    3.2.2 A Multivariate Comparison Based on a Summary Statistic
    3.2.3 A Multivariate Comparison Based on Variants of Hotelling's T²
    3.2.4 Adjusting for Covariates
    3.2.5 Pre–Post Comparisons
    3.2.6 Choosing a Statistic: Time-Course Microarrays
  3.3 Recommended Approaches
  3.4 To Learn More

4 BIOLOGICAL BACKGROUND
  4.1 Medical Imaging
    4.1.1 Ultrasound
    4.1.2 EEG/MEG
    4.1.3 Magnetic Resonance Imaging
      4.1.3.1 MRI
      4.1.3.2 fMRI
    4.1.4 Positron Emission Tomography
  4.2 Microarrays
  4.3 To Learn More

5 MULTIPLE TESTS
  5.1 Reducing the Number of Hypotheses to Be Tested
    5.1.1 Normalization
    5.1.2 Selection Methods
      5.1.2.1 Univariate Statistics
      5.1.2.2 Which Statistic?
      5.1.2.3 Heuristic Methods
      5.1.2.4 Which Method?
  5.2 Controlling the Overall Error Rate
    5.2.1 An Example: Analyzing Data from Microarrays
  5.3 Controlling the False Discovery Rate
    5.3.1 An Example: Analyzing Time-Course Data from Microarrays
  5.4 Gene Set Enrichment Analysis
  5.5 Software for Performing Multiple Simultaneous Tests
    5.5.1 AFNI
    5.5.2 Cyber-T
    5.5.3 dChip
    5.5.4 ExactFDR
    5.5.5 GESS
    5.5.6 HaploView
    5.5.7 MatLab
    5.5.8 R
    5.5.9 SAM
    5.5.10 ParaSam
  5.6 Summary
  5.7 To Learn More

6 THE BOOTSTRAP
  6.1 Samples and Populations
  6.2 Precision of an Estimate
    6.2.1 R Code
    6.2.2 Applying the Bootstrap
    6.2.3 Bootstrap Reproducibility Index
    6.2.4 Estimation in Regression Models
  6.3 Confidence Intervals
    6.3.1 Testing for Equivalence
    6.3.2 Parametric Bootstrap
    6.3.3 Blocked Bootstrap
    6.3.4 Balanced Bootstrap
    6.3.5 Adjusted Bootstrap
    6.3.6 Which Test?
  6.4 Determining Sample Size
    6.4.1 Establish a Threshold
  6.5 Validation
    6.5.1 Cluster Analysis
    6.5.2 Correspondence Analysis
  6.6 Building a Model
  6.7 How Large Should the Samples Be?
  6.8 Summary
  6.9 To Learn More

7 CLASSIFICATION METHODS
  7.1 Nearest Neighbor Methods
  7.2 Discriminant Analysis
  7.3 Logistic Regression
  7.4 Principal Components
  7.5 Naive Bayes Classifier
  7.6 Heuristic Methods
  7.7 Decision Trees
    7.7.1 A Worked-Through Example
  7.8 Which Algorithm Is Best for Your Application?
    7.8.1 Some Further Comparisons
    7.8.2 Validation Versus Cross-validation
  7.9 Improving Diagnostic Effectiveness
    7.9.1 Boosting
    7.9.2 Ensemble Methods
    7.9.3 Random Forests
  7.10 Software for Decision Trees
  7.11 Summary

8 APPLYING DECISION TREES
  8.1 Photographs
  8.2 Ultrasound
  8.3 MRI Images
  8.4 EEGs and EMGs
  8.5 Misclassification Costs
  8.6 Receiver Operating Characteristic
  8.7 When the Categories Are As Yet Undefined
    8.7.1 Unsupervised Principal Components Applied to fMRI
    8.7.2 Supervised Principal Components Applied to Microarrays
  8.8 Ensemble Methods
  8.9 Maximally Diversified Multiple Trees
  8.10 Putting It All Together
  8.11 Summary
  8.12 To Learn More

Glossary of Biomedical Terminology

Glossary of Statistical Terminology

Appendix: An R Primer
  R1 Getting Started
    R1.1 R Functions
    R1.2 Vector Arithmetic
  R2 Store and Retrieve Data
    R2.1 Storing and Retrieving Files from Within R
    R2.2 The Tabular Format
    R2.3 Comma Separated Format
  R3 Resampling
    R3.1 The While Command
  R4 Expanding R's Capabilities
    R4.1 Downloading Libraries of R Functions
    R4.2 Programming Your Own Functions

Bibliography

Author Index

Subject Index