ebook img

Fault-Tolerant Systems PDF

399 Pages·2007·12.87 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Fault-Tolerant Systems

FAULT TOLERANT SYSTEMS In Praise of Fault Tolerant Systems “Faultattackshaverecentlybecomeaseriousconcerninthesmartcardindustry. “Fault Tolerant Systems” provides the reader with a clear exposition of these at- tacksand the protection strategiesthatcanbeusedto thwartthem.Amustread forpractitionersandresearchersworkinginthefield.” DavidNaccache,Ecolenormalesupérieure “Understandingthefundamentalsofanarea,whetheritisgolforfaulttolerance, isaprerequisitetodevelopingexpertiseinthearea.KrishnaandKoren’sbookcan provide a reader with this underlying foundation for fault tolerance. This book isparticularlytimelybecausethedesignoffault-tolerantcomputingcomponents, such as processors and disks, is becoming increasingly important to the main- streamcomputingindustry.” ShubuMukherjee,Director,FACT-AMIGroup,IntelCorporation “Professors Koren and Krishna, have written a modern, dual purpose text that firstpresentsthebasicsfaulttolerancetoolsdescribingvariousredundancytypes both at the hardware and software levels followed by current research topics. It reviews fundamental reliability modeling approaches, combinatorial blocks and Markovchaintechniques.Notably,thereisacompletechapteron statisticalsim- ulation methods that offers guidance to practical evaluations as well as one on fault-tolerant networks. All chapters, which are clearly written including illumi- natingexamples,haveextensivereferencelistswherebystudentscandelvedeeper into almost any topic. Several practical and commercial computing systems that incorporate fault tolerance are detailed. Furthermore, there are two chapters in- troducingcurrentfaulttoleranceresearchchallenges,cryptographicsystemsand defectsinVLSIdesigns.” RobertRedinbo,UCDavis “ThefieldofFault-TolerantComputinghasadvancedconsiderablyinthepastten yearsandyetnoefforthasbeenmadetoputtogethertheseadvancesintheform of a book or a comprehensive paper for the students starting in this area. This is thefirstbookIknowofinthepast10yearsthatdealswithhardwareandsoftware aspectsoffaulttolerantcomputing,isverycomprehensive,andiswrittenasatext forthecourse.” KewalSaluja,UniversityofWisconsin,Madison FAULT TOLERANT SYSTEMS Israel Koren C. Mani Krishna AMSTERDAM•BOSTON•HEIDELBERG•LONDON NEWYORK•OXFORD•PARIS•SANDIEGO SANFRANCISCO•SINGAPORE•SYDNEY•TOKYO MorganKaufmannPublishersisanimprintofElsevier Publisher DenisePenrose PublishingServicesManager GeorgeMorrison ProductionEditor DawnmarieSimpson AssistantEditor KimberleeHonso CoverDesign AlisaAndreola CoverIllustration YaronKoren TextDesign GeneHarris Composition VTEX Copyeditor GraphicWorldPublishingServices Proofreader GraphicWorldPublishingServices Indexer GraphicWorldPublishingServices Interiorprinter TheMaple–VailBookManufacturingGroup Coverprinter PhoenixColor,Inc. MorganKaufmannPublishersisanimprintofElsevier. 500SansomeStreet,Suite400,SanFrancisco,CA94111 Thisbookisprintedonacid-freepaper. (cid:1)c2007,Elsevier,Inc.Allrightsreserved. Designationsusedbycompaniestodistinguishtheirproductsareoftenclaimedastrademarksorregistered trademarks.InallinstancesinwhichMorganKaufmannPublishersisawareofaclaim,theproductnames appearininitialcapitalorallcapitalletters.Readers,however,shouldcontacttheappropriatecompanies formorecompleteinformationregardingtrademarksandregistration. Nopartofthispublicationmaybereproduced,storedinaretrievalsystem,ortransmittedinanyformorby anymeans—electronic,mechanical,photocopying,scanning,orotherwise—withoutpriorwrittenpermis- sionofthepublisher. PermissionsmaybesoughtdirectlyfromElsevier’sScience&TechnologyRightsDepartmentinOxford,UK: phone:(+44)1865843830,fax:(+44)1865853333,E-mail:[email protected] yourrequestonlineviatheElsevierhomepage(http://elsevier.com),byselecting“Support&Contact”then “CopyrightandPermission”andthen“ObtainingPermissions.” LibraryofCongressCataloging-in-PublicationData Koren,Israel,1945- Faulttolerantsystems/IsraelKoren,C.ManiKrishna. p.cm. Includesbibliographicalreferencesandindex. ISBN0-12-088525-5(alk.paper) 1. Fault-tolerantcomputing.2.Computersystems–Reliability. I.Krishna,C.M.II.Title. QA76.9.F38K672007 004.2–dc22 2006031810 ISBN13:978-0-12-088568-8 ISBN10:0-12-088568-9 ForinformationonallMorganKaufmannpublications,visitour Websiteatwww.mkp.comorwww.books.elsevier.com PrintedintheUnitedStates 06 07 08 09 10 5 4 3 2 1 Contents Foreword xi Preface xiii Acknowledgements xvii AbouttheAuthors xix 1 Preliminaries 1 1.1 FaultClassification 2 1.2 TypesofRedundancy 3 1.3 BasicMeasuresofFaultTolerance 4 1.3.1 TraditionalMeasures 5 1.3.2 NetworkMeasures 6 1.4 OutlineofThisBook 7 1.5 FurtherReading 9 References 10 2 Hardware Fault Tolerance 11 2.1 TheRateofHardwareFailures 11 2.2 FailureRate,Reliability,andMeanTimetoFailure 13 2.3 CanonicalandResilientStructures 15 2.3.1 SeriesandParallelSystems 16 2.3.2 Non-Series/ParallelSystems 17 2.3.3 M-of-NSystems 20 2.3.4 Voters 23 2.3.5 VariationsonN-ModularRedundancy 23 2.3.6 DuplexSystems 27 2.4 OtherReliabilityEvaluationTechniques 30 2.4.1 PoissonProcesses 30 2.4.2 MarkovModels 33 v vi Contents 2.5 Fault-ToleranceProcessor-LevelTechniques 36 2.5.1 WatchdogProcessor 37 2.5.2 SimultaneousMultithreadingforFaultTolerance 39 2.6 ByzantineFailures 41 2.6.1 ByzantineAgreementwithMessageAuthentication 46 2.7 FurtherReading 48 2.8 Exercises 48 References 53 3 Information Redundancy 55 3.1 Coding 56 3.1.1 ParityCodes 57 3.1.2 Checksum 64 3.1.3 M-of-NCodes 65 3.1.4 BergerCode 66 3.1.5 CyclicCodes 67 3.1.6 ArithmeticCodes 74 3.2 ResilientDiskSystems 79 3.2.1 RAIDLevel1 79 3.2.2 RAIDLevel2 81 3.2.3 RAIDLevel3 82 3.2.4 RAIDLevel4 83 3.2.5 RAIDLevel5 84 3.2.6 ModelingCorrelatedFailures 84 3.3 DataReplication 88 3.3.1 Voting:Non-HierarchicalOrganization 89 3.3.2 Voting:HierarchicalOrganization 95 3.3.3 Primary-BackupApproach 96 3.4 Algorithm-BasedFaultTolerance 99 3.5 FurtherReading 101 3.6 Exercises 102 References 106 4 Fault-Tolerant Networks 109 4.1 MeasuresofResilience 110 4.1.1 Graph-TheoreticalMeasures 110 4.1.2 ComputerNetworksMeasures 111 4.2 CommonNetworkTopologiesandTheirResilience 112 4.2.1 MultistageandExtra-StageNetworks 112 4.2.2 CrossbarNetworks 119 4.2.3 RectangularMeshandInterstitialMesh 121 4.2.4 HypercubeNetwork 124 Contents vii 4.2.5 Cube-ConnectedCyclesNetworks 128 4.2.6 LoopNetworks 130 4.2.7 AdhocPoint-to-PointNetworks 132 4.3 Fault-TolerantRouting 135 4.3.1 HypercubeFault-TolerantRouting 136 4.3.2 Origin-BasedRoutingintheMesh 138 4.4 FurtherReading 141 4.5 Exercises 142 References 145 5 Software Fault Tolerance 147 5.1 AcceptanceTests 148 5.2 Single-VersionFaultTolerance 149 5.2.1 Wrappers 149 5.2.2 SoftwareRejuvenation 152 5.2.3 DataDiversity 155 5.2.4 SoftwareImplementedHardwareFaultTolerance(SIHFT) 157 5.3 N-VersionProgramming 160 5.3.1 ConsistentComparisonProblem 161 5.3.2 VersionIndependence 162 5.4 RecoveryBlockApproach 169 5.4.1 BasicPrinciples 169 5.4.2 SuccessProbabilityCalculation 169 5.4.3 DistributedRecoveryBlocks 171 5.5 Preconditions,Postconditions,andAssertions 173 5.6 Exception-Handling 173 5.6.1 RequirementsfromException-Handlers 174 5.6.2 BasicsofExceptionsandException-Handling 175 5.6.3 LanguageSupport 177 5.7 SoftwareReliabilityModels 178 5.7.1 Jelinski–MorandaModel 178 5.7.2 Littlewood–VerrallModel 179 5.7.3 Musa–OkumotoModel 180 5.7.4 ModelSelectionandParameterEstimation 182 5.8 Fault-TolerantRemoteProcedureCalls 182 5.8.1 Primary-BackupApproach 182 5.8.2 TheCircusApproach 183 5.9 FurtherReading 184 5.10 Exercises 186 References 188 viii Contents 6 Checkpointing 193 6.1 WhatisCheckpointing? 195 6.1.1 WhyisCheckpointingNontrivial? 197 6.2 CheckpointLevel 197 6.3 OptimalCheckpointing—AnAnalyticalModel 198 6.3.1 TimeBetweenCheckpoints—AFirst-OrderApproximation 200 6.3.2 OptimalCheckpointPlacement 201 6.3.3 TimeBetweenCheckpoints—AMoreAccurateModel 202 6.3.4 ReducingOverhead 204 6.3.5 ReducingLatency 205 6.4 Cache-AidedRollbackErrorRecovery(CARER) 206 6.5 CheckpointinginDistributedSystems 207 6.5.1 TheDominoEffectandLivelock 209 6.5.2 ACoordinatedCheckpointingAlgorithm 210 6.5.3 Time-BasedSynchronization 211 6.5.4 DisklessCheckpointing 212 6.5.5 MessageLogging 213 6.6 CheckpointinginShared-MemorySystems 217 6.6.1 Bus-BasedCoherenceProtocol 218 6.6.2 Directory-BasedProtocol 219 6.7 CheckpointinginReal-TimeSystems 220 6.8 OtherUsesofCheckpointing 223 6.9 FurtherReading 223 6.10 Exercises 224 References 226 7 Case Studies 229 7.1 NonStopSystems 229 7.1.1 Architecture 229 7.1.2 MaintenanceandRepairAids 233 7.1.3 Software 233 7.1.4 ModificationstotheNonStopArchitecture 235 7.2 StratusSystems 236 7.3 CassiniCommandandDataSubsystem 238 7.4 IBMG5 241 7.5 IBMSysplex 242 7.6 Itanium 244 7.7 FurtherReading 246 References 247 8 Defect Tolerance in VLSI Circuits 249 8.1 ManufacturingDefectsandCircuitFaults 249 Contents ix 8.2 ProbabilityofFailureandCriticalArea 251 8.3 BasicYieldModels 253 8.3.1 ThePoissonandCompoundPoissonYieldModels 254 8.3.2 VariationsontheSimpleYieldModels 256 8.4 YieldEnhancementThroughRedundancy 258 8.4.1 YieldProjectionforChipswithRedundancy 259 8.4.2 MemoryArrayswithRedundancy 263 8.4.3 LogicIntegratedCircuitswithRedundancy 270 8.4.4 ModifyingtheFloorplan 272 8.5 FurtherReading 276 8.6 Exercises 277 References 281 9 Fault Detection in Cryptographic Systems 285 9.1 OverviewofCiphers 286 9.1.1 SymmetricKeyCiphers 286 9.1.2 PublicKeyCiphers 295 9.2 SecurityAttacksThroughFaultInjection 296 9.2.1 FaultAttacksonSymmetricKeyCiphers 297 9.2.2 FaultAttacksonPublic(Asymmetric)KeyCiphers 298 9.3 Countermeasures 299 9.3.1 SpatialandTemporalDuplication 300 9.3.2 Error-DetectingCodes 300 9.3.3 AreTheseCountermeasuresSufficient? 304 9.3.4 FinalComment 307 9.4 FurtherReading 307 9.5 Exercises 307 References 308 10 Simulation Techniques 311 10.1 WritingaSimulationProgram 311 10.2 ParameterEstimation 315 10.2.1 PointVersusIntervalEstimation 315 10.2.2 MethodofMoments 316 10.2.3 MethodofMaximumLikelihood 318 10.2.4 TheBayesianApproachtoParameterEstimation 322 10.2.5 ConfidenceIntervals 324 10.3 VarianceReductionMethods 328 10.3.1 AntitheticVariables 328 10.3.2 UsingControlVariables 330 10.3.3 StratifiedSampling 331 10.3.4 ImportanceSampling 333

Description:
The mixture of hardware and software discussions in the book is appealing. The authors describe the theory behind designing and modelling systems against failures in their components. There is a brief coverage of coding theory. This is a subject on which ample numbers of books have been dedicated. T
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.