
Built-in Fault-Tolerant Computing Paradigm for Resilient Large-Scale Chip Design: A Self-Test, Self-Diagnosis, and Self-Repair-Based Approach PDF

318 Pages·2023·11.84 MB·English


Xiaowei Li • Guihai Yan • Cheng Liu

Built-in Fault-Tolerant Computing Paradigm for Resilient Large-Scale Chip Design
A Self-Test, Self-Diagnosis, and Self-Repair-Based Approach

Xiaowei Li
State Key Lab of Processors
Institute of Computing Technology, Chinese Academy of Sciences
Beijing, China

Guihai Yan
State Key Lab of Processors
Institute of Computing Technology, Chinese Academy of Sciences
Beijing, China

Cheng Liu
State Key Lab of Processors
Institute of Computing Technology, Chinese Academy of Sciences
Beijing, China

ISBN 978-981-19-8550-8
ISBN 978-981-19-8551-5 (eBook)
https://doi.org/10.1007/978-981-19-8551-5

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

To all the students and colleagues in Integrated Circuit Design Group in the State Key Lab of Computer Architecture.

Preface

If your computer crashes, you can revive it by a reboot, an empirical solution that usually turns out to be effective. The rationale behind this solution is that transient faults, either in hardware or software, can be fixed by refreshing the machine state. Such a “silver bullet,” however, could be futile in the future because the faults, especially those existing in the hardware such as Integrated Circuit (IC) chips, cannot be eliminated by refreshing. What we need is a more sophisticated mechanism to steer the system back on the right track. The “magic cure” is the on-chip fault-tolerant mechanism, which relies on a suite of built-in design-for-reliability logic, including fault detection, fault diagnosis, and fault recovery, working in a unified manner.
With shrinking semiconductor feature sizes and the continuous scaling of IC designs, silicon defects caused by manufacturing defects, radiation particles, or progressive aging are almost inevitable and critically affect both the yield and quality of IC products. In this context, over the past decade we have successfully applied on-chip fault-tolerant computing mechanisms to a set of different chip designs, including generic circuits, general-purpose processors, network-on-chips, and deep learning processors, and have gradually formulated a systematic built-in fault-tolerant computing paradigm, which can be used to guide IC designs against these typical silicon defects. In addition to the basic fault detection, fault diagnosis, and fault recovery, the proposed built-in fault-tolerant computing paradigm also provides additional benefits, such as facilitating graceful performance degradation, mitigating the impact of verification blind spots, and improving the chip yield.

In this book, we mainly illustrate the built-in fault-tolerant computing paradigm with practical demonstrations on generic circuits, general-purpose processors, network-on-chips, and deep learning processors. The entire book consists of six chapters. Chapter 1 presents the background of fault-tolerant chip designs and an overview of the built-in fault-tolerant computing paradigm. Chapter 2 presents on-line fault detection, on-chip path delay measurement, and lifetime fault-tolerant pipeline design for generic circuits. Chapter 3 investigates the vulnerability of general-purpose processors under silicon defects and presents a core salvaging approach, particularly for multi-core processor architectures. Chapter 4 focuses on fault-tolerant network-on-chip designs from distinct angles, including topology reconfiguration, routing design, and architecture design. Chapter 5 focuses on built-in fault-tolerant deep learning processors fabricated with both conventional CMOS-based technology and emerging ReRAM-based technology.
Chapter 6 concludes this book with a brief summary of the proposed built-in fault-tolerant computing paradigm and a discussion of future fault-tolerant computing directions on large-scale VLSI designs.

The majority of the content in this book is collected from peer-reviewed papers of Guihai Yan, Cheng Liu, Lei Zhang, Wen Li, Songwei Pei, Songjun Pan, Bingzhang Fu, Ying Wang, and Hang Lu, supervised by both Prof. Xiaowei Li and Prof. Huawei Li, who lead the Integrated Circuit Design Group in the State Key Lab of Computer Architecture, and has already been published in the journals TVLSI, TCAD, TC, JCST, and Journal of China Science. Prof. Xiaowei Li organized this book in general. Prof. Guihai Yan mainly worked on Chaps. 2 and 3. Prof. Cheng Liu worked on Chaps. 1, 4, 5, and 6. Dr. Jingya Wu also helped a lot to edit this book. Prof. Huawei Li and Prof. Guojie Luo reviewed this book. Prof. Tim Cheng wrote the foreword for this book. All these efforts are indispensable for this book and greatly appreciated.

The techniques presented in this book are partly selected from research funded by the National Key Research and Development Program of China under grant 2020YFB1600201, and the National Natural Science Foundation of China (NSFC) under grants No. 62174162, 62090024, 61902375, U20A20202, and 61876173.

Xiaowei Li
Beijing, China, May 2022

Foreword

Hardware systems must have sufficient robustness to cope with failures resulting from various variability and reliability concerns. This requirement not only applies to safety-critical advanced systems in avionics and automotive applications but also becomes a necessity for consumer electronics where cost has been a serious constraint. For integrated circuits, device geometry shrinkage, very low power supply levels, and ultra-high operating speeds have significantly reduced noise margins and increased variations in process, device, and design parameters. These continuing trends in technology scaling have resulted in lower reliability and higher design uncertainty for highly integrated chips.
Not just technology, the environment, energy, thermal resources, and even applications have also contributed to greater variations and more diverse sources of errors. Thus, high variability and low reliability have become the predominant challenges for chip design and manufacturing.

While verification, test, and fault tolerance technologies have been foundational disciplines for multiple decades, for which the readers can find good textbooks for their respective basic knowledge, principles, techniques, and solutions, these fields all continue to evolve and advance, and some have even reinvented themselves, in order to tackle the enormous variability and reliability challenges. As a result, new and more effective and efficient solutions continue to emerge, replacing classical approaches for designing and manufacturing robust and reliable hardware.

For fault tolerance, a suite of techniques has been developed, ranging from built-in redundancy and online reconfiguration capability to tolerate errors, to built-in self-test/-diagnosis/-repair to recover from errors, to post-fabrication tuning/adaptation capability (either off-line or online) to bypass errors, to automatic compensation to alleviate the negative effect caused by variations, or to dynamic adaptation to mask environmental noise and transient errors; some of these have even advanced from the proof-of-concept and prototyping stages to actual deployment.

Researchers at the Institute of Computing Technology of Chinese Academy of Sciences have been among the most productive and impactful research groups in addressing the technical challenges and contributing new solutions in this area. Over the past decade, they have developed and employed a number of built-in and/or online fault-tolerant solutions. Their solutions are either generic, broadly applicable to digital designs and general-purpose processors, or specific to special-purpose designs including network-on-chips and deep learning processors. This book gives in-depth and coherent explanations of these very interesting results.
Particularly, the solutions are introduced in a unified “3S” framework supporting a built-in fault-tolerant computing paradigm, where “3S” stands for self-test, self-diagnosis, and self-repair (or self-recovery). The description of each technique also includes clarification of the key differences from the conventional counterparts which I am sure the readers will find informative and insightful. It is commendable that the authors have done an outstanding job in producing this self-contained book covering multiple aspects of built-in fault-tolerant design for resilient chips. Publishing this book also serves very well for motivating research graduate students and researchers to gain the latest results and insight into this subject of significant importance.

Kwang-Ting (Tim) Cheng (郑光廷)
Hong Kong University of Science and Technology
December 12, 2022

Contents

1 Introduction
  1.1 Typical On-Chip Faults
    1.1.1 Process Variation
    1.1.2 Manufacturing Defects
    1.1.3 Chip Aging
    1.1.4 Soft Errors
    1.1.5 Intermittent Faults
    1.1.6 Emerging Technologies Induced Defects
  1.2 Conventional Fault-Tolerant Chip Design Wisdom
    1.2.1 Design for Test
    1.2.2 Design for Diagnosis
    1.2.3 Design for Reliability
  1.3 Built-In Fault-Tolerant Computing Paradigm
    1.3.1 Self-test
    1.3.2 Self-diagnosis
    1.3.3 Self-repair
    1.3.4 General Benefits
  1.4 Summary
  References
2 Fault-Tolerant Circuits
  2.1 On-Line Fault Detection
    2.1.1 Challenges for On-Line Fault Detection
    2.1.2 Stability Violation Based Fault Detection
    2.1.3 Timing Constrains Exploration
    2.1.4 On-Line Fault Detection Architecture
    2.1.5 Experiment Result Analysis
    2.1.6 Discussion
  2.2 On-Chip Path Delay Measurement
    2.2.1 Path Delay Measurement and Fault Tolerance
    2.2.2 Path Delay Measurement Circuits
