Delivering Affordable Fault-tolerance to Commodity Computer Systems

by
Shuguang Feng

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
(Computer Science and Engineering)
in The University of Michigan
2011

Doctoral Committee:
Associate Professor Scott Mahlke, Chair
Professor David Blaauw
Professor Trevor N. Mudge
Professor Dennis M. Sylvester
Assistant Professor Thomas Wenisch
Pradip Bose, IBM T.J. Watson

© Shuguang Feng 2011
All Rights Reserved

To my parents Jinan and Pinfang.

ACKNOWLEDGEMENTS

Over the last few years there have been many moments when I felt that this dissertation would never be written. The fact that I can now place the finishing touches on this document means that there is no shortage of people to whom I owe a great debt of gratitude. Since I cannot possibly list all those that have blessed my time in Ann Arbor, I encourage everyone that I have omitted to track me down, so that I might apologize over a meal the next time we meet.

First I would like to thank my advisor, Scott Mahlke. His expert guidance and passion for his work were crucial to my success. The existence of this dissertation owes itself, in large part, to Scott’s drive and unshakable optimism, which on many occasions resurrected research directions I had already resigned to abandon. Although we often disagreed, his eagerness to debate ideas and willingness to entertain dissent left an indelible mark on my approach to problem solving.

I also want to thank Jason Blome and Eric Karl who played an instrumental role in the early months of my graduate career. Their advice and guidance as I took my initial steps into academic research was invaluable.

Next, I would like to thank Shantanu Gupta and Amin Ansari. Having endured countless hours of meetings, and endless rounds of paper revisions, these two have become more than simply co-authors. Collaborating on nearly all of our work, we have witnessed each other’s failures as well as shared in each other’s successes.

I would also like to thank the members (and honorary members) of the CCCP research group. I have often felt that graduate school was as much about maintaining one’s sanity as it was about conducting innovative research. Whether it was our infamous coffee breaks, or seemingly random office discussions ranging from Middle East politics to Russell Peters, my labmates provided plenty of much-needed distractions.

I am also grateful to the amazing friends I made through Harvest and Knox. It is hard for me to imagine being able to survive the last few years without these friendships to remind me that there does actually exist a world outside of the office. I would especially like to thank Greg (“Magnum Opus”) Davidson, Erik Kim, Manoj Cheriyan, and Matthew Remy for sharing their Ph.D. experiences with me, and showing me that, despite what trials may come, it is always possible to remember who holds tomorrow.

I also want to thank Grace Hwang for putting up with me for the last four years. Despite not seeing each other for weeks, sometimes months, at a time, she has been an encouragement throughout this Ph.D. journey. Her patience with me during particularly stressful weeks was a wonderful comfort, allowing me to selfishly vent over the phone while keeping things in perspective. Although never ideal, Grace made juggling a long-distance relationship as painless as I had any right to hope for.

Finally, I would like to thank my parents. They have sacrificed so much for me and my sister. I may never truly appreciate, much less acknowledge, the extent of their love and commitment to us. The chance to write this thesis is simply one in a long series of opportunities afforded to me because of their sacrifice.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT

CHAPTER

I. Introduction
    1.1 Dependable Computing for the Masses
    1.2 Reliability Taxonomy
        1.2.1 Threats to Reliable Computing
        1.2.2 Anatomy of Fault-tolerant Computing
    1.3 Conventional Solutions
    1.4 Contributions
II. Self-calibrating Online Wearout Detection
    2.1 Introduction
    2.2 Device-level Wearout Analysis
        2.2.1 Gate Oxide Breakdown
        2.2.2 HSPICE Analysis
    2.3 Microarchitecture-level Wearout Analysis
        2.3.1 Microprocessor Implementation
        2.3.2 Power, Temperature, and MTTF Calculations
        2.3.3 Wearout Simulation
    2.4 Wearout Detection
        2.4.1 Online Delay Profiling
        2.4.2 Failure Prediction Algorithm
        2.4.3 Implementation Details
    2.5 Experimental Analysis
        2.5.1 Overhead and Accuracy
        2.5.2 Dynamic Variations
    2.6 Related Work
    2.7 Summary
III. Maestro: Orchestrating Lifetime Reliability in Chip Multiprocessors
    3.1 Introduction
    3.2 Scheduling for Damaged Cores and Dynamic Workloads
        3.2.1 Failure Mechanism Review
        3.2.2 Existing Scheduling Schemes
        3.2.3 Workload Variation
        3.2.4 Implications for Mean Time to Failure
    3.3 System Design
        3.3.1 Health Monitoring
        3.3.2 Maestro Virtualization Layer
    3.4 Evaluation and Analysis
        3.4.1 Adaptive Lifetime Simulation
        3.4.2 Lifetime Throughput Enhancement
        3.4.3 Failure Distributions
        3.4.4 Sensitivity to System Utilization
        3.4.5 Sensitivity to Sensor Noise
        3.4.6 Sensor Selection
    3.5 Summary
IV. Shoestring: Probabilistic Soft Error Reliability on the Cheap
    4.1 Introduction
    4.2 Background and Motivation
        4.2.1 Soft Error Rate
        4.2.2 Solution Landscape and Shoestring
    4.3 System Design
        4.3.1 Compiler Overview
        4.3.2 Preliminary Classification
        4.3.3 Vulnerability Analysis
        4.3.4 Code Duplication
    4.4 Experimental Methodology
        4.4.1 Fault Model and Injection Framework
        4.4.2 Outcome Classification
        4.4.3 System Support
    4.5 Evaluation and Analysis
        4.5.1 Preliminary Fault Injection
        4.5.2 Program Analysis
        4.5.3 Overheads and Fault Coverage
    4.6 Related Work
    4.7 Summary
V. Encore: Low-cost, Fine-grained Transient Fault Recovery
    5.1 Introduction
    5.2 Recovering from Transient Faults
        5.2.1 Recovery with Fine-grained Re-execution
        5.2.2 The Role of Idempotence
    5.3 Encore
        5.3.1 Identifying Inherent Idempotence
        5.3.2 Instrumentation
        5.3.3 Region Formation
        5.3.4 Encore Heuristics
    5.4 Experimental Methodology
        5.4.1 Compilation Framework
        5.4.2 Recoverability Coverage Model
        5.4.3 Performance Modeling
    5.5 Evaluation and Analysis
        5.5.1 Region Idempotence
        5.5.2 Dynamic Execution Breakdown
        5.5.3 Overheads
        5.5.4 Full-system Reliability
    5.6 Related Work
        5.6.1 Fault Detection
        5.6.2 System Recovery
    5.7 Summary
VI. Conclusion

BIBLIOGRAPHY

LIST OF FIGURES

Figure

1.1 A “reliability pipeline” depicting the different pieces of a comprehensive reliability strategy. The relative location of each component to the transient-permanent boundary represents the extent to which recent research into transient (wearout-induced permanent) faults has studied that particular aspect of reliability.
2.1 Impact of OBD-induced oxide leakage current on standard cell propagation delays.
2.2 HSPICE simulation traces for inverter with degraded PMOS (slowdown).
2.3 HSPICE simulation traces for inverter with degraded PMOS (speedup).
2.4 OpenRisc 1200 embedded microprocessor.
2.5 Derived workload-dependent steady state temperature and MTTF for the OR1200 CPU core. An ambient temperature of 333 K was used for Hotspot.
2.6 The observed slowdown of signals from the ALU result bus as a result of OBD effects over the lifetime of one instance of an OR1200 processor core.
2.7 Online delay profiling unit.
2.8 Sensitivity analysis of TRIX parametrization.
2.9 Design and organization of the wearout detection unit.
2.10 Scaling of the WDU and DPU area and power as the number of signals monitored scales.
2.11 Analysis of TRIX analysis efficacy in predicting failure.
2.12 Impact of temperature on logic gate delay.
3.1 Variation of module temperatures across SPEC2000 workloads. All temperatures are normalized to T_max, the peak temperature seen across all benchmarks and modules (83°C).
3.2 Head-to-head comparisons of applu (SPECFP), vpr (SPECINT), and wupwise (SPECFP). No one benchmark in (a), (b), or (c) strictly dominates the other (with respect to temperature) across all modules.
3.3 Projected core lifetime based on execution of applu and vpr as a function of the module identified as the weakest structure. Values are normalized to the best achievable MTTF.
3.4 A high-level block diagram of the Maestro introspective reliability management system. Dynamic monitoring of sensor feedback and detailed characterization of workload behavior enables Maestro to improve lifetime system reliability with wearout-centric scheduling.
3.5 Chromosome structure.
3.6 Steps involved in reproduction. S_0, S_1, S_2, S_3 are the parental candidates. S_c is the resulting child chromosome after initial crossover. S'_c and S''_c are the states of the child chromosome after conflict resolution and mutation, respectively.
3.7 CMP details. All simulation results, unless otherwise stated, are presented for a CMP configured with 16 cores.
3.8 The adaptive simulation used to accelerate lifetime reliability simulations while incurring minimal experimental error.
3.9 Performance of wearout-centric scheduling policies versus CMP size and failure threshold.
3.10 Failure distributions for individual cores and the 16-core CMP with a failure threshold of 8 cores and 100% utilization. Trendlines are added (between markers) to improve readability.
3.11 Impact of CMP utilization on reliability enhancement.
3.12 Sensitivity to sensor noise. Although random sensor noise can be removed with the appropriate filtering, systematic error due to manufacturing tolerances is more problematic.
3.13 Performance of wearout-centric scheduling with different sensors. Results are shown for a failure threshold of 1 core to favor the temperature sensor and access counter based approaches.
4.1 The soft error rate trend for processor logic across a range of silicon technology nodes. The Nominal curve illustrates past and present trends while the Vscale L, Vscale M, and Vscale H curves assume low, medium and high amounts (respectively) of voltage scaling in future deep submicron technologies. The user-visible failure rates highlighted at 45 nm and 16 nm are calculated assuming a 92% system-wide masking rate.
4.2 Fault coverage versus dynamic instruction penalty trade-off for two existing fault detection schemes: symptom-based detection and instruction duplication-based detection. Also indicated is the region of the solution space targeted by Shoestring. The mapping of fault coverage to user-visible failure rate (dashed horizontal lines) is with respect to a single chip in a 16 nm technology node with aggressive voltage scaling (Vscale H).
4.3 A representative example of performance optimized code (loop unrolled).
4.4 A standard compiler flow augmented with Shoestring’s reliability-aware code generation passes.