Reliability of Computer Systems and Networks

546 Pages·2002·3.684 MB·English

Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design
Martin L. Shooman
Copyright 2002 John Wiley & Sons, Inc.
ISBNs: 0-471-29342-3 (Hardback); 0-471-22460-X (Electronic)

RELIABILITY OF COMPUTER SYSTEMS AND NETWORKS
Fault Tolerance, Analysis, and Design

MARTIN L. SHOOMAN
Polytechnic University and Martin L. Shooman & Associates

A Wiley-Interscience Publication
JOHN WILEY & SONS, INC.

Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley & Sons, Inc., is aware of a claim, the product names appear in initial capital or ALL CAPITAL LETTERS. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Copyright 2002 by John Wiley & Sons, Inc., New York. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic or mechanical, including uploading, downloading, printing, decompiling, recording or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-6011, fax (212) 850-6008, E-Mail: PERMREQ@WILEY.COM.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold with the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional person should be sought.

ISBN 0-471-22460-X

This title is also available in print as ISBN 0-471-29342-3.

For more information about Wiley products, visit our web site at www.Wiley.com.
To Danielle Leah and Aviva Zissel

CONTENTS

Preface

1 Introduction
  1.1 What is Fault-Tolerant Computing?
  1.2 The Rise of Microelectronics and the Computer
    1.2.1 A Technology Timeline
    1.2.2 Moore’s Law of Microprocessor Growth
    1.2.3 Memory Growth
    1.2.4 Digital Electronics in Unexpected Places
  1.3 Reliability and Availability
    1.3.1 Reliability Is Often an Afterthought
    1.3.2 Concepts of Reliability
    1.3.3 Elementary Fault-Tolerant Calculations
    1.3.4 The Meaning of Availability
    1.3.5 Need for High Reliability and Safety in Fault-Tolerant Systems
  1.4 Organization of the Book
    1.4.1 Introduction
    1.4.2 Coding Techniques
    1.4.3 Redundancy, Spares, and Repairs
    1.4.4 N-Modular Redundancy
    1.4.5 Software Reliability and Recovery Techniques
    1.4.6 Networked Systems Reliability
    1.4.7 Reliability Optimization
    1.4.8 Appendices
  General References
  References
  Problems

2 Coding Techniques
  2.1 Introduction
  2.2 Basic Principles
    2.2.1 Code Distance
    2.2.2 Check-Bit Generation and Error Detection
  2.3 Parity-Bit Codes
    2.3.1 Applications
    2.3.2 Use of Exclusive OR Gates
    2.3.3 Reduction in Undetected Errors
    2.3.4 Effect of Coder–Decoder Failures
  2.4 Hamming Codes
    2.4.1 Introduction
    2.4.2 Error-Detection and -Correction Capabilities
    2.4.3 The Hamming SECSED Code
    2.4.4 The Hamming SECDED Code
    2.4.5 Reduction in Undetected Errors
    2.4.6 Effect of Coder–Decoder Failures
    2.4.7 How Coder–Decoder Failures Effect SECSED Codes
  2.5 Error-Detection and Retransmission Codes
    2.5.1 Introduction
    2.5.2 Reliability of a SECSED Code
    2.5.3 Reliability of a Retransmitted Code
  2.6 Burst Error-Correction Codes
    2.6.1 Introduction
    2.6.2 Error Detection
    2.6.3 Error Correction
  2.7 Reed–Solomon Codes
    2.7.1 Introduction
    2.7.2 Block Structure
    2.7.3 Interleaving
    2.7.4 Improvement from the RS Code
    2.7.5 Effect of RS Coder–Decoder Failures
  2.8 Other Codes
  References
  Problems

3 Redundancy, Spares, and Repairs
  3.1 Introduction
  3.2 Apportionment
  3.3 System Versus Component Redundancy
  3.4 Approximate Reliability Functions
    3.4.1 Exponential Expansions
    3.4.2 System Hazard Function
    3.4.3 Mean Time to Failure
  3.5 Parallel Redundancy
    3.5.1 Independent Failures
    3.5.2 Dependent and Common Mode Effects
  3.6 An r-out-of-n Structure
  3.7 Standby Systems
    3.7.1 Introduction
    3.7.2 Success Probabilities for a Standby System
    3.7.3 Comparison of Parallel and Standby Systems
  3.8 Repairable Systems
    3.8.1 Introduction
    3.8.2 Reliability of a Two-Element System with Repair
    3.8.3 MTTF for Various Systems with Repair
    3.8.4 The Effect of Coverage on System Reliability
    3.8.5 Availability Models
  3.9 RAID Systems Reliability
    3.9.1 Introduction
    3.9.2 RAID Level 0
    3.9.3 RAID Level 1
    3.9.4 RAID Level 2
    3.9.5 RAID Levels 3, 4, and 5
    3.9.6 RAID Level 6
  3.10 Typical Commercial Fault-Tolerant Systems: Tandem and Stratus
    3.10.1 Tandem Systems
    3.10.2 Stratus Systems
    3.10.3 Clusters
  References
  Problems

4 N-Modular Redundancy
  4.1 Introduction
  4.2 The History of N-Modular Redundancy
  4.3 Triple Modular Redundancy
    4.3.1 Introduction
    4.3.2 System Reliability
    4.3.3 System Error Rate
    4.3.4 TMR Options
  4.4 N-Modular Redundancy
    4.4.1 Introduction
    4.4.2 System Voting
    4.4.3 Subsystem Level Voting
  4.5 Imperfect Voters
    4.5.1 Limitations on Voter Reliability
    4.5.2 Use of Redundant Voters
    4.5.3 Modeling Limitations
  4.6 Voter Logic
    4.6.1 Voting
    4.6.2 Voting and Error Detection
  4.7 N-Modular Redundancy with Repair
    4.7.1 Introduction
    4.7.2 Reliability Computations
    4.7.3 TMR Reliability
    4.7.4 N-Modular Reliability
  4.8 N-Modular Redundancy with Repair and Imperfect Voters
    4.8.1 Introduction
    4.8.2 Voter Reliability
    4.8.3 Comparison of TMR, Parallel, and Standby Systems
  4.9 Availability of N-Modular Redundancy with Repair and Imperfect Voters
    4.9.1 Introduction
    4.9.2 Markov Availability Models
    4.9.3 Decoupled Availability Models
  4.10 Microcode-Level Redundancy
  4.11 Advanced Voting Techniques
    4.11.1 Voting with Lockout
    4.11.2 Adjudicator Algorithms
    4.11.3 Consensus Voting
    4.11.4 Test and Switch Techniques
    4.11.5 Pairwise Comparison
    4.11.6 Adaptive Voting
  References
  Problems

5 Software Reliability and Recovery Techniques
  5.1 Introduction
    5.1.1 Definition of Software Reliability
    5.1.2 Probabilistic Nature of Software Reliability
  5.2 The Magnitude of the Problem
  5.3 Software Development Life Cycle
    5.3.1 Beginning and End
    5.3.2 Requirements
    5.3.3 Specifications
    5.3.4 Prototypes
    5.3.5 Design
    5.3.6 Coding
    5.3.7 Testing
    5.3.8 Diagrams Depicting the Development Process
  5.4 Reliability Theory
    5.4.1 Introduction
    5.4.2 Reliability as a Probability of Success
    5.4.3 Failure-Rate (Hazard) Function
    5.4.4 Mean Time To Failure
    5.4.5 Constant-Failure Rate
  5.5 Software Error Models
    5.5.1 Introduction
    5.5.2 An Error-Removal Model
    5.5.3 Error-Generation Models
    5.5.4 Error-Removal Models
  5.6 Reliability Models
    5.6.1 Introduction
    5.6.2 Reliability Model for Constant Error-Removal Rate
    5.6.3 Reliability Model for Linearly Decreasing Error-Removal Rate
    5.6.4 Reliability Model for an Exponentially Decreasing Error-Removal Rate
  5.7 Estimating the Model Constants
    5.7.1 Introduction
    5.7.2 Handbook Estimation
    5.7.3 Moment Estimates
    5.7.4 Least-Squares Estimates
    5.7.5 Maximum-Likelihood Estimates
  5.8 Other Software Reliability Models
    5.8.1 Introduction
    5.8.2 Recommended Software Reliability Models
    5.8.3 Use of Development Test Data
    5.8.4 Software Reliability Models for Other Development Stages
    5.8.5 Macro Software Reliability Models
  5.9 Software Redundancy
    5.9.1 Introduction
    5.9.2 N-Version Programming
    5.9.3 Space Shuttle Example
  5.10 Rollback and Recovery
    5.10.1 Introduction
    5.10.2 Rebooting
    5.10.3 Recovery Techniques
    5.10.4 Journaling Techniques
    5.10.5 Retry Techniques
    5.10.6 Checkpointing
    5.10.7 Distributed Storage and Processing
  References
  Problems

6 Networked Systems Reliability
  6.1 Introduction
  6.2 Graph Models
  6.3 Definition of Network Reliability
  6.4 Two-Terminal Reliability
    6.4.1 State-Space Enumeration
    6.4.2 Cut-Set and Tie-Set Methods
    6.4.3 Truncation Approximations
    6.4.4 Subset Approximations
    6.4.5 Graph Transformations
  6.5 Node Pair Resilience
  6.6 All-Terminal Reliability
    6.6.1 Event-Space Enumeration
    6.6.2 Cut-Set and Tie-Set Methods
    6.6.3 Cut-Set and Tie-Set Approximations
    6.6.4 Graph Transformations
    6.6.5 k-Terminal Reliability
    6.6.6 Computer Solutions
  6.7 Design Approaches
    6.7.1 Introduction
    6.7.2 Design of a Backbone Network: Spanning-Tree Phase
    6.7.3 Use of Prim’s and Kruskal’s Algorithms
    6.7.4 Design of a Backbone Network: Enhancement Phase
    6.7.5 Other Design Approaches
  References
  Problems

7 Reliability Optimization
  7.1 Introduction
  7.2 Optimum Versus Good Solutions
