
High-level estimation and exploration of reliability for multi-processor system-on-chip

210 Pages·2018·14.812 MB·English

Preview High-level estimation and exploration of reliability for multi-processor system-on-chip

Computer Architecture and Design Methodologies

Zheng Wang
Anupam Chattopadhyay

High-level Estimation and Exploration of Reliability for Multi-Processor System-on-Chip

Series editors
Anupam Chattopadhyay, Noida, India
Soumitra Kumar Nandy, Bangalore, India
Jürgen Teich, Erlangen, Germany
Debdeep Mukhopadhyay, Kharagpur, India

Twilight zone of Moore’s law is affecting computer architecture design like never before. The strongest impact on computer architecture is perhaps the move from unicore to multicore architectures, represented by commodity architectures like general purpose graphics processing units (GPGPUs). Besides that, deep impact of application-specific constraints from emerging embedded applications is presenting designers with new, energy-efficient architectures like heterogeneous multi-core, accelerator-rich System-on-Chip (SoC). These effects together with the security, reliability, thermal and manufacturability challenges of nanoscale technologies are forcing computing platforms to move towards innovative solutions. Finally, the emergence of technologies beyond conventional charge-based computing has led to a series of radical new architectures and design methodologies.

The aim of this book series is to capture these diverse, emerging architectural innovations as well as the corresponding design methodologies. The scope will cover the following.
Heterogeneous multi-core SoC and their design methodology
Domain-specific Architectures and their design methodology
Novel Technology constraints, such as security, fault-tolerance and their impact on architecture design
Novel technologies, such as resistive memory, and their impact on architecture design
Extremely parallel architectures

More information about this series at http://www.springer.com/series/15213

Zheng Wang
Shenzhen Institutes of Advanced Technology
Chinese Academy of Sciences
Shenzhen
China

Anupam Chattopadhyay
School of Computer Science and Engineering
Nanyang Technological University
Singapore
Singapore

ISSN 2367-3478
ISSN 2367-3486 (electronic)
Computer Architecture and Design Methodologies
ISBN 978-981-10-1072-9
ISBN 978-981-10-1073-6 (eBook)
DOI 10.1007/978-981-10-1073-6
Library of Congress Control Number: 2017943095

© Springer Science+Business Media Singapore 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication.
Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Acknowledgements

This book is the result of my work as a research associate at the Institute for Communication Technologies and Embedded Systems (ICE) at the RWTH Aachen University. During this time I have been accompanied and supported by many people. It is my great pleasure to take this opportunity to thank them.

My most sincere thanks go to my advisors, Prof. Dr.-Ing. Anupam Chattopadhyay and Prof. Dr.-Ing. Tobias Noll. Prof. Chattopadhyay has been extremely helpful and tremendously inspiring throughout my Ph.D. study. Prof. Noll has been impressively knowledgeable while patient with my ideas and mistakes. Their thoughtful advice has greatly contributed to this work and influenced me for my future career.

Special thanks go to my defense committee members, Prof. Andrei Vescan and Prof. Renato Negra, for spending their time, offering the oral exam, giving me feedback, and attending my defense session.

Several colleagues at ICE and EECS have assisted and encouraged me during the past five years in my work and personal life. Among them I would like to show my deep appreciation to Ayesha Khalid, Zoltán Rákossy and Michael Meixner. Furthermore, I would like to thank my students Xiao Wang, Chao Chen, Lai Wang, Renlin Li, Hui Xie, Liu Yang, Saumitra Chafekar, Alessandro Littarru, Shazia Kanwal, Kolawole Soretire, Emmanuel Ugwu, Dan Yue, Kapil Singh, Piyush Sharma and Sai Rama Usha Ayyagari for their consistent contribution.
I would also like to thank my family: my parents and parents-in-law for supporting me spiritually throughout writing this book and my life in general. And finally, infinite gratitude to my beloved wife as well as my son.

December 2016
Zheng Wang

Contents

1 Introduction
  1.1 Contribution
  1.2 Outline
2 Background
  2.1 Reliability Definition
  2.2 Fault, Error and Failure
  2.3 Hardware Faults
    2.3.1 Origins
    2.3.2 Fault Models
  2.4 Soft Error
    2.4.1 Evaluation Metrics
    2.4.2 Scaling Trends
3 State-of-the-Art
  3.1 Fault Injection and Simulation
    3.1.1 Physical Fault Injection
    3.1.2 Simulated Fault Injection
    3.1.3 Emulated Fault Injection
  3.2 Analytical Reliability Estimation
    3.2.1 Architecture Vulnerability Factor Analysis
    3.2.2 Probabilistic Transfer Matrix
    3.2.3 Design Diversity Estimation
  3.3 Architectural Fault-Tolerant Techniques
    3.3.1 Traditional Fault-Tolerant Techniques
    3.3.2 Approximate Computing
  3.4 System-Level Fault Tolerant Techniques
    3.4.1 Reliability-Aware Task Mapping
    3.4.2 Fault-Tolerant Network Design
4 High-Level Fault Injection and Simulation
  4.1 Architectural Fault Injection
    4.1.1 Methodologies
    4.1.2 Flow of LISA-Based Fault Injection
    4.1.3 Timing Fault Injection
    4.1.4 Experimental Results
    4.1.5 Summary
  4.2 System-Level Fault Injection
    4.2.1 Fault Injection for System Modules
    4.2.2 Experimental Results
    4.2.3 Summary
  4.3 Statistical Fault Injection for Impact Evaluation of Application Performances
    4.3.1 Setup and Case Study
    4.3.2 Modeling of Timing Errors
    4.3.3 Experiments of Statistical FI
    4.3.4 Summary
  4.4 High-Level Processor Power/Thermal/Delay Joint Modeling Framework
    4.4.1 High-Level Power Modeling and Estimation
    4.4.2 LISA-Based Thermal Modeling
    4.4.3 Thermal-Aware Delay Simulation
    4.4.4 Automation Flow and Overhead Analysis
    4.4.5 Summary
5 Architectural Reliability Estimation
  5.1 Analytical Reliability Estimation Technique
    5.1.1 Operation Reliability Model
    5.1.2 Instruction Error Rate
    5.1.3 Application Error Rate
    5.1.4 Analytical Reliability Estimation for RISC Processor
    5.1.5 Summary
  5.2 Probabilistic Error Masking Matrix
    5.2.1 Logic Masking in Digital Circuits
    5.2.2 PeMM for Processor Building Blocks
    5.2.3 PeMM Characterization
    5.2.4 Approximate Error Prediction Framework
    5.2.5 Results in Error Prediction
    5.2.6 Summary
  5.3 Reliability Estimation Using Design Diversity
    5.3.1 Design Diversity
    5.3.2 Graph-Based Diversity Analysis
    5.3.3 Results in Diversity Estimation
    5.3.4 Summary
6 Architectural Reliability Exploration
  6.1 Opportunistic Redundancy
    6.1.1 Opportunistic Protection
    6.1.2 Implementation
    6.1.3 Experimental Results
    6.1.4 Summary
  6.2 Asymmetric Reliability
    6.2.1 Asymmetric Reliability
    6.2.2 Exploration of Asymmetric Reliability
    6.2.3 Summary
  6.3 Statistical Error Confinement
    6.3.1 Proposed Error Confinement Method
    6.3.2 Realizing the Proposed Error Confinement in an RISC Processor
    6.3.3 Case Study and Statistical Analysis
    6.3.4 Results
    6.3.5 Summary
7 System-Level Reliability Exploration
  7.1 System-Level Reliability Exploration Framework
    7.1.1 Platform and Task Manager Firmware
    7.1.2 Core Reliability Aware Task Mapping
    7.1.3 Experimental Results
    7.1.4 Summary
  7.2 Reliable System-Level Design Using Node Fault Tolerance
    7.2.1 Node Fault Tolerance in Graph
    7.2.2 Construct NFT for Generic Graph
    7.2.3 Verify NFT Graphs Using Task Mapping
    7.2.4 Experiments for Node Fault Tolerance
    7.2.5 Summary
8 Conclusion and Outlook
  8.1 Conclusion
  8.2 Outlook
Curriculum Vitae
Glossary
Bibliography

List of Figures

Fig. 1.1 Overall flow of high-level reliability estimation and exploration
Fig. 2.1 SER scale trend for SRAM and DRAM [177] Copyright ©2010 IEEE
Fig. 2.2 SER scale trend for combinatorial logic [172] Copyright ©2002 IEEE
Fig. 4.1 LISA-based fault injection and evaluation flow [215] Copyright ©2013 IEEE
Fig. 4.2 Fault injection through disturbance signals in LISA operation [215] Copyright ©2013 IEEE
Fig. 4.3 Graphical user interface for fault configuration and evaluation
Fig. 4.4 Simulator extension for injection of delay faults
Fig. 4.5 Exemplary EMR with increasing duration of fault (RISC) [215] Copyright ©2013 IEEE
Fig. 4.6 Exemplary EMR with increasing count of fault (RISC) [215] Copyright ©2013 IEEE
Fig. 4.7 Exemplary EMR with increasing duration of fault (VLIW) [215] Copyright ©2013 IEEE
Fig. 4.8 System-level fault injection on virtual prototype [208] Copyright ©2014 ACM
Fig. 4.9 H.264 decoder with fault injection [209] Copyright ©2014 ACM
Fig. 4.10 Median filter: original and filtered image [201] Copyright ©2014 ACM
Fig. 4.11 Median filter: reliability exploration [201] Copyright ©2014 ACM
Fig. 4.12 Performance and fault injection rate of the median benchmark for (a) model B based on STA @0.7 V, and (b) model B+ with supply voltage noise [37] Copyright ©2016 ACM
