Fault-Tolerant Parallel and Distributed Systems

by

DIMITER R. AVRESKY
Department of Electrical and Computer Engineering
Boston University
Boston, MA

and

DAVID R. KAELI
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA

Springer Science+Business Media, LLC

ISBN 978-1-4613-7488-6
ISBN 978-1-4615-5449-3 (eBook)
DOI 10.1007/978-1-4615-5449-3

Library of Congress Cataloging-in-Publication Data
A C.I.P. Catalogue record for this book is available from the Library of Congress.

Copyright © 1998 by Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 1998
Softcover reprint of the hardcover 1st edition 1998

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Kluwer Springer Science+Business Media, LLC.

Printed on acid-free paper.

Contents

Preface

Part I: Fault-Tolerant Protocols

1. Comparing Synchronous and Asynchronous Group Communication
   F. Cristian
2. Using Static Total Causal Ordering Protocols to Achieve Ordered View Synchrony
   K.-Y. Siu and M. Iyer
3. A Fail-Aware Datagram Service
   C. Fetzer and F. Cristian

Part II: Fault-Tolerant Distributed Systems

4. Portable Checkpointing for Heterogeneous Architectures
   V. Strumpen and B. Ramkumar
5. A Checkpointing-Recovery Scheme for Domino-Free Distributed Systems
   F. Quaglia, B. Ciciani, and R. Baldoni
6. Overview of a Fault-Tolerant System
   A. Pruscino
7. An Efficient Recoverable DSM on a Network of Workstations: Design and Implementation
   A.-M. Kermarrec and C. Morin
8. Fault-Tolerant Issues of Local Area MultiProcessors (LAMP) Storage Subsystem
   Q. Li, E. Hong, and A. Tsukerman
9. Fault-Tolerance Issues in RDBMS on SCI-Based Local Area MultiProcessor (LAMP)
   Q. Li, A. Tsukerman, and E. Hong

Part III: Dependable Systems

10. Distributed Safety-Critical Systems
    P.J. Perrone and B.W. Johnson
11. Dependability and Other Challenges in the Collision Between Computing and Telecommunication
    Y. Levendel
12. A Unified Approach for the Synthesis of Scalable and Testable Embedded Architectures
    P.B. Bhat, C. Aktouf, V.K. Prasanna, S. Gupta, and M.A. Breuer
13. A Fault-Robust SPMD Architecture for 3D-TV Image Processing
    A. Chiari, B. Ciciani, and M. Romero

Part IV: Fault-Tolerant Parallel Systems

14. A Parallel Algorithm for Embedding Complete Binary Trees in Faulty Hypercubes
    S.B. Choi and A.K. Somani
15. Fault-Tolerant Broadcasting in a K-ary N-cube
    B. Broeg and B. Bose
16. Fault Isolation and Diagnosis in Multiprocessor Systems with Point-to-Point Communication Links
    K. Chakrabarty, M.G. Karpovsky, and L.B. Levitin
17. An Efficient Hardware Fault-Tolerant Technique
    S.H. Hosseini, O.A. Abulnaja, and K. Vairavan
18. Reliability Evaluation of a Task Under a Hardware Fault-Tolerant Technique
    O.A. Abulnaja, S.H. Hosseini, and K. Vairavan
19. Fault Tolerance Measures for m-ary n-dimensional Hypercubes Based on Forbidden Faulty Sets
    J. Wu and G. Guo
20. Dynamic Fault Recovery for Wormhole-Routed Two-Dimensional Meshes
    D.R. Avresky and C.M. Cunningham
21. Fault-Tolerant Dynamic Task Scheduling Based on Dataflow Graphs
    E. Maehle and F.-J. Markus
22. A Novel Replication Technique for Implementing Fault-Tolerant Parallel Software
    A. Cherif, M. Suzuki, and T. Katayama
23. User-Transparent Checkpointing and Restart for Parallel Computers
    B. Bieker and E. Maehle

Index

Preface

The most important use of computing in the future will be in the context of the global "digital convergence" where everything becomes digital and everything is inter-networked.
Applications will be dominated by the storage, search, retrieval, analysis, exchange and updating of information in a wide variety of forms. Heavy demands will be placed on systems by many simultaneous requests. And, fundamentally, all this shall be delivered at much higher levels of dependability, integrity and security.

Increasingly, large parallel computing systems and networks are providing unique challenges to industry and academia in dependable computing, especially because of the higher failure rates intrinsic to these systems. The challenge in the last part of this decade is to build a system that is both inexpensive and highly available.

A machine cluster built of commodity hardware parts, with each node running an OS instance and a set of applications extended to be fault resilient, can satisfy the new stringent high-availability requirements.

The focus of this book is to present recent techniques and methods for implementing fault-tolerant parallel and distributed computing systems.

Section I, Fault-Tolerant Protocols, considers basic techniques for achieving fault tolerance in communication protocols for distributed systems, including synchronous and asynchronous group communication, static total causal ordering protocols, and a fail-aware datagram service that supports communication by time.

The paper "Comparing Synchronous and Asynchronous Group Communication" presents a common framework for describing synchronous and asynchronous group communication services and compares the properties that each can provide to simplify replicated programming. Group communication services, such as membership and atomic broadcast, simplify the maintenance of state replica consistency despite random communication delays, failures and recoveries.
In distributed systems, high service availability can be achieved by letting a group of servers replicate the service state; if some servers fail, the surviving ones know the service state and can continue to provide the service.

The paper "Using Static Total Causal Ordering Protocols to Achieve Ordered View Synchrony" describes a view-synchronous totally ordered message delivery protocol for a dynamic asynchronous process group in an asynchronous communication environment. The protocol can handle asynchronous process or link failures and also the simultaneous joining of multiple groups of processes.

The paper "A Fail-Aware Datagram Service" presents a fail-aware datagram service that supports communication by time: it delivers all messages whose computed one-way transmission delays are smaller than a given bound as "fast" and all other messages as "slow". The fail-aware datagram service is the foundation of all other fail-aware services, such as fail-aware clock synchronization, fail-aware membership and fail-aware atomic broadcast.

In Section II, Fault-Tolerant Distributed Systems, we consider different methods and approaches for achieving fault tolerance in distributed systems, such as portable checkpointing for heterogeneous architectures, a checkpointing-recovery scheme ensuring domino-freeness, dependable cluster systems, recoverable distributed shared memory (DSM) on a network of workstations (NOW), and a fault-tolerant Scalable Coherent Interface (SCI)-based local area multiprocessor.

The paper "Portable Checkpointing for Heterogeneous Architectures" shows an approach that enables a failed computation to be recovered on a different processor architecture. Sequential C programs are compiled into fault-tolerant C programs, whose checkpoints can be migrated across heterogeneous networks and restarted on binary-incompatible architectures.
The paper "A Checkpointing-Recovery Scheme for Domino-Free Distributed Systems" presents a checkpointing-recovery scheme for distributed systems. The proposed checkpointing algorithm ensures the progression of the recovery line while reducing the number of checkpoints compared with previous proposals. The goal is achieved by introducing an equivalence relation between local checkpoints of a process and by exploiting the process's event history.

The paper "Overview of a Fault-Tolerant System" describes a hardware architecture based on a cluster of commodity parts and a set of software cluster services that help in the design, implementation and deployment of fault-resilient software. Depending on the use of these services and mechanisms, the system can reach different levels of fault tolerance and reliability characteristics.

Networks of Workstations (NOW) have become a convenient and less expensive alternative to parallel architectures for the execution of long-running parallel applications. The paper "An Efficient Recoverable DSM on a Network of Workstations: Design and Implementation" presents the realization and performance evaluation of ICARE, a recoverable DSM (RDSM) associated with a process checkpointing mechanism. ICARE tolerates a single permanent node failure transparently to parallel applications, which continue their execution on the remaining nodes. A prototype of ICARE is fully operational on an ATM network of workstations running the CHORUS micro-kernel.

The paper "Fault-Tolerant Issues of Local Area Multiprocessor (LAMP) Storage Subsystem" discusses three main fault tolerance issues of the LAMP storage subsystem: system configurability for fault tolerance and performance, fast error detection and recovery, and fast logical volume reconstruction. Local Area MultiProcessor (LAMP) is a network of workstations with a shared physical memory. It uses low-latency, high-bandwidth interconnections and provides remote DMA support.
The interconnection network of LAMP is based on the Scalable Coherent Interface (SCI, IEEE Std 1596), which provides cache-coherent, physically shared memory for multiprocessors via its bus-like point-to-point connections with high bandwidth and low latency. The paper "Fault-Tolerance Issues in RDBMS on SCI-Based Local Area MultiProcessor (LAMP)" explores the issues related to the implementation of database systems on LAMP, particularly the fault-tolerance issues.

In Section III, Dependable Systems, we consider general models and features of distributed safety-critical systems using commercial off-the-shelf (COTS) components, service dependability in telecomputing systems constructed with off-the-shelf components offering scalability and graceful degradation, a scalable and testable heterogeneous embedded architecture based on COTS components for high-end signal processing applications, and a fault-tolerant SPMD hierarchical architecture for real-time processing of video signals.

The paper "Distributed Safety-Critical Systems" presents an overview of the problems encountered by those designing safety-critical systems, along with the fundamentals, definitions and concepts employed in their design. A taxonomy that classifies the design solution space for safety-critical systems is presented.

The paper "Dependability and Other Challenges in the Collision Between Computing and Telecommunication" describes a distributed system composed of off-the-shelf components which can deliver advanced telecommunication services. The main difficulty in realizing services with this approach lies in the need to create a robust, dependable system. The resources and their servers are heterogeneous and may be distributed locally or globally in the network. This architecture offers scalability and congestion management, and poses the significant challenge of overall service dependability.
A new concept, that of scalable and testable embedded systems, is introduced in the paper "A Unified Approach for the Synthesis of Scalable and Testable Embedded Architectures". Parallel heterogeneous architectures based on COTS (Commercial Off-The-Shelf) components are becoming increasingly attractive as computing platforms for high-end signal processing applications such as radar and sonar. In comparison with traditional custom VLSI designs, these architectures offer advantages of flexibility, high performance, rapid design time, easy upgradability, and low cost. The paper describes a unified approach for the synthesis of scalable architectures based on COTS components. The approach is illustrated through a concrete example of a signal processing application.

A fault-tolerant SPMD hierarchical architecture for real-time processing of video signals is introduced in the paper "A Fault-Robust SPMD Architecture for 3D-TV Image Processing". Fault-tolerance characteristics are evaluated by comparing the images produced by the system with and without faults in the architecture.

Section IV, Fault-Tolerant Parallel Systems, considers embedding complete binary trees into a faulty hypercube interconnection architecture, single-node broadcasting in a faulty k-ary n-cube, a software-implemented system-level testing technique for multiprocessor systems with dedicated communication links, reliable execution of tasks and concurrent diagnosis of faulty processors and links, conditional connectivity for the m-ary n-dimensional hypercube, on-line recovery from intermittent and permanent faults within the links and nodes in two-dimensional meshes, fault tolerance in parallel computers based on checkpointing, self-diagnosis and rollback recovery, a functional and attribute-based language for programming fault-tolerant applications, and user-transparent backward error recovery for message-passing systems.
A scheme that can be used recursively in parallel to map a complete binary tree into a hypercube interconnection architecture with some faulty nodes is proposed in the paper "A Parallel Algorithm for Embedding Complete Binary Trees in Faulty Hypercubes". Two algorithms are described: one for a fault-free hypercube and the other for a faulty hypercube. It is shown that the scheme has a low time complexity compared with that of the existing algorithms.

The paper "Fault-Tolerant Broadcasting in a K-ary N-cube" describes an algorithm for one-to-all broadcasting in a k-ary n-cube. The algorithm, called the Partner Fault-Tolerant Algorithm, is non-redundant and fault-tolerant, and broadcasts correctly given n-1 or fewer faults. The time complexity of the algorithm is given.

The paper "Fault Isolation and Diagnosis in Multiprocessor Systems with Point-to-Point Communication Links" presents an approach that combines distributed system-level testing with processor self-test and ensures fault-free operation by disconnecting all faulty processors and links from the system. The placement of monitors has been determined for several multiprocessor topologies, including trees, hypercubes and meshes.

The paper "An Efficient Hardware Fault-Tolerant Technique" shows that an efficient hardware fault-tolerant technique allows reliable execution of tasks and concurrent diagnosis of faults while processors and communication channels are subject to failure.

The paper "Reliability Evaluation of a Task Under a Hardware Fault-Tolerant Technique" presents an efficient technique that increases each task's reliability when processors and communication channels are subject to failure.

The concept of a forbidden set is exploited in the paper "Fault Tolerance Measures for m-ary n-dimensional Hypercubes Based on Forbidden Faulty Sets" to achieve fault tolerance in hypercubes.
In general, there are many ways to define a forbidden (feasible) faulty set, depending on the topology of the system, the application environment, statistical analysis of fault patterns, and the distribution of fault-free nodes.

An algorithm for detecting and compensating for intermittent and permanent faults within the links and nodes of parallel computers, having an NxN