DEPENDABLE NETWORK COMPUTING

THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE

DEPENDABLE NETWORK COMPUTING
edited by
Dimiter R. Avresky
Boston University, Boston, MA, USA

Springer Science+Business Media, LLC

Library of Congress Cataloging-in-Publication Data
Dependable network computing / edited by Dimiter R. Avresky.
p. cm. -- (Kluwer international series in engineering and computer science; v. 538)
Includes bibliographical references.
ISBN 978-1-4613-7053-6
ISBN 978-1-4615-4549-1 (eBook)
DOI 10.1007/978-1-4615-4549-1
1. Computer networks. 2. Parallel processing (Electronic computers) 3. Electronic data processing--Distributed processing. I. Avresky, D. R. (Dimitri Ranguelov), 1944- . II. Kluwer international series in engineering and computer science; SECS 538.
TK5105.5 D467 2000
004.6--dc21
99-048352

Copyright © 2000 by Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 2000
Softcover reprint of the hardcover 1st edition 2000
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC.
Printed on acid-free paper.

Contents

Preface

Section 1 - Dependable Software and Large Storage Systems: Key Components for Dependable Network Computing
1. Dependability of Software-Based Critical Systems
   Jean-Claude Laprie
2. An Analysis of Error Behavior in a Large Storage System
   Nisha Talagala and David Patterson

Section 2 - Dependable Broadcast and Protocols in Computer Networks
3. Totally Ordered Broadcast in the Face of Network Partitions
   Idit Keidar and Danny Dolev
4. On the Possibility of Group Membership Protocols
   Massimo Franceschetti and Jehoshua Bruck
5. Reliable Clocks for Unreliable Process Groups
   A. Mostefaoui, Michel Raynal and M. Takizawa
6. Implementing Protocols with Synchronous Objects
   Claude Petitpierre and Antonio J. Restrepo Zea

Section 3 - Analysis of Computer Networks
7. Automated Formal Analysis of Networks
   J. N. Reed, D. M. Jackson, B. Deianov and G. M. Reed
8. A BDD Approach to Dependability Analysis of Distributed Computer Systems with Imperfect Coverage
   H. Zang, H. Sun and K. Trivedi

Section 4 - Fault-Tolerant Routing and Reconfiguration in Computer Networks
9. Fault-Tolerant Routing in the Internet Without Flooding
   Paolo Narvaez, Kai-Yeung Siu and Hong-Yi Tzeng
10. Dynamic Reconfiguration in High Speed Local Area Networks
    Jose Duato, Rafael Casado, Francisco J. Quiles and Jose L. Sanchez
11. Minimal and Adaptive Fault-Tolerant Routing in ServerNet 2D Torus Network
    D. R. Avresky, J. Acosta, Vl. Shurbanov and Z. McCaffrey

Section 5 - Fault-Tolerant Interconnection Networks
12. Tolerating Faults in Counting Networks
    Marc D. Riedel and Jehoshua Bruck
13. Fault-Tolerant Multicasting in 2-D Meshes Using Extended Safety Levels
    Xiao Chen, Jie Wu and Dajin Wang

Section 6 - Dependable Distributed and Mobile Computing
14. Dependable Distributed and Mobile Computing - Utilizing Time to Enhance Recovery from Failures
    W. Kent Fuchs, Nuno Neves and Kuo-Feng Ssu
15. Design and Implementation of Fault-Tolerant Parallel Software in a Distributed Environment Using a Functional Language
    M. Toyoshima, A. Cherif, M. Suzuki and T. Katayama
16. Overhead of Coordinated Checkpointing Protocols for Message Passing Parallel Systems
    Bernd Bieker and Erik Maehle
17. A Multi-Level Layered System Approach to On-Line Testing
    H. Levendel

Section 7 - Dependable Real-Time Distributed Systems
18. Task Schedule Algorithms for Fault Tolerance in Real-Time Embedded Systems
    N. Kandasamy, J. Hayes and B. T. Murray
19. The Rapids Simulator: A Testbed for Evaluating Scheduling, Allocation and Fault-Recovery Algorithms in Distributed Real-Time Systems
    M. Allalouf, J. Chang, G. Dirairaj, J. Haines, V. R. Lakamraju, K. Toutireddy, O. S. Unsal, I. Koren and C. M. Krishna
20. Fault-Tolerant Dynamic Scheduling of Object-Based Tasks in Multiprocessor Real-Time Systems
    Indranil Gupta, G. Manimaran and C. Siva Ram Murthy

Index

Preface

Dependable network computing is becoming a key component of our economic and social life. In particular, with the "revolution of the Internet", traffic and the number of users are growing by an order of magnitude every 16 months. Every day millions of users rely on their computers and networks in real time for electronic commerce and for trading on Wall Street. This requires highly dependable networks, software, servers, and large storage systems. Different adaptive fault-tolerant routing and dynamic reconfiguration techniques should be implemented to avoid faulty nodes, links and routers in the network. The routing protocols in the network should be deadlock- and livelock-free. Quality of Service (QoS) should be ensured, and the percentage of dropped packets has to be small despite the large volume of traffic, congested networks and the presence of failures. According to recent studies, the mean time between failures (MTTF) for the majority of Internet backbone paths is 28 days or less. Specifically, routers exhibit an average MTTF of fifteen days. Computer scientists and researchers therefore have to treat the dependability of computer networks as a first-priority task, because these networks have high intrinsic failure rates. Failures in the Internet backbone can easily generate millions of dollars of losses in electronic-commerce revenue and interrupt the normal work of millions of end users. The challenge is to build networks, based on Components Off The Shelf (COTS), that are inexpensive, accessible, scalable and dependable. The chapters in this book provide insight into many of these problems and others that will challenge researchers and application developers of dependable computer networks.

Section I - "Dependable Software and Large Storage Systems: Key Components for Dependable Network Computing"

The paper "Dependability of Software-Based Critical Systems" presents directions for designing dependable software-based critical systems. Based on rich statistics of computer-related failures, it shows that failures are becoming more and more distributed (the AT&T outage in the USA and the credit-card authorization denial in France). It demonstrates the dominance of design faults in client-server networks. It also shows that interaction faults are becoming the second source of system failures. Design faults can affect hardware as well (residual bugs in Intel processors). Verification and validation efforts are approaching 75% of the total cost of critical software.
The following driving forces for implementing dependable systems are identified in the paper: a) cost-effective, highly dependable systems via re-use, b) evolution towards integration (vs. federation), and c) fault evolution. Based on these conclusions, the following recommendations are made: a) supplement off-line validation with on-line protection via fault tolerance, b) extend the applicability of dependability measures to dependability prediction, c) establish a theory of composability of dependability properties, and d) build dependability-explicit development processes.

The paper "An Analysis of Error Behavior in a Large Storage System" analyzes the failure characteristics of a 3.2-terabyte disk storage system based on Redundant Arrays of Inexpensive Disks (RAID). The application for this storage system is a web-accessible image collection. A switched Ethernet network connects a set of PCs, each of which hosts a set of disks through SCSI. The error rates for disks, the network and SCSI are presented. The largest sources of errors in the system are SCSI errors (timeouts and parity) and network errors (49% and over 40% of all errors, respectively). By contrast, data disk errors account for only around 4% of errors overall, even though disks make up 90% of the components of the system. Network errors are more likely than SCSI errors to force the OS of a node to be restarted. The results show that network errors are heavily correlated across machines and that all single points of failure must be removed from highly available storage systems.

Section II - "Dependable Broadcast and Protocols in Networks"

An algorithm for Totally Ordered Broadcast in the face of network partitions and process failures, using an underlying group communication service (GCS) as a building block, is presented in the paper "Totally Ordered Broadcast in the Face of Network Partitions". It guarantees that if a majority of the processes form a connected component, then these processes eventually deliver all messages sent by any of them, in the same order. The Totally Ordered Broadcast algorithm is called COReL. It is designed as a high-level service atop a group communication service that provides totally ordered group multicast and membership services. COReL uses the GCS as a failure detector and as a building block for reliable communication within connected network components. Messages are totally ordered within each connected network component by means of logical timestamps (TSs), which are delivered along with the messages. The TS total order preserves the causal partial order. The GCS delivers messages at each process in TS order.
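As a rough illustration of the timestamp ordering just described (this sketch is not the COReL algorithm itself; the Process class, the tie-breaking by process id and the Python rendering are assumptions made for this example), Lamport-style logical clocks stamped on each message give every process the same delivery order, and that order extends the causal order:

    # Illustrative sketch only: logical timestamps delivered with each message
    # induce a total order that preserves the causal partial order.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Process:
        pid: int
        clock: int = 0
        delivered: List[Tuple[int, int, str]] = field(default_factory=list)

        def send(self, payload: str) -> Tuple[int, int, str]:
            # Tick the local clock and stamp the outgoing message with (TS, pid).
            self.clock += 1
            return (self.clock, self.pid, payload)

        def receive(self, msg: Tuple[int, int, str]) -> None:
            ts, _, _ = msg
            # Advance the local clock past the received timestamp (Lamport rule),
            # so causally later sends always carry larger timestamps.
            self.clock = max(self.clock, ts) + 1
            self.delivered.append(msg)

        def total_order(self) -> List[Tuple[int, int, str]]:
            # Sorting by (TS, pid) gives every process the same delivery order.
            return sorted(self.delivered)

    if __name__ == "__main__":
        p1, p2 = Process(pid=1), Process(pid=2)
        m1 = p1.send("a")          # (1, 1, "a")
        p2.receive(m1)
        m2 = p2.send("b")          # causally after m1, so it carries a larger TS
        p1.receive(m2)
        p1.receive(m1)             # p1 also records its own message
        p2.receive(m2)
        assert p1.total_order() == p2.total_order()

COReL itself layers reliable delivery and recovery after partitions on top of the GCS; the sketch only shows why sorting by timestamp (with process ids breaking ties) yields one agreed order at all processes.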
Distributed systems such as distributed databases, web servers and firewall servers consist of a group of processes that cooperate in order to complete specific tasks. Processes may also cooperate to achieve fault tolerance. A Group Membership Protocol is of particular use in such systems. The paper "On the Possibility of Group Membership Protocols" identifies the main assumptions required for proving the impossibility of Group Membership in asynchronous systems and proposes an algorithm that solves the problem using a weak liveness property. The algorithm is fully distributed, it does not need any extension of the asynchronous model of concurrent computation in terms of global failure detectors, and it can tolerate any number of failures and recoveries. Progress of the algorithm can easily be guaranteed in practice in real-world systems. The failure model allows processes to crash, silently halting their execution. It is assumed that a process communicates with its local failure detector through a special receive-only channel on which it may receive a new list of ids of processes not suspected to have crashed; this list is called the local connectivity view of the process. The specification of the group membership algorithm is given by means of four properties: agreement, termination, validity and safety.

The paper "Reliable Logical Clocks for Unreliable Process Groups" considers a logical clock system for asynchronous group-based systems. Clocks are associated not with processes but with groups: each group has a logical clock with which it can timestamp its events. The fact that the system is asynchronous (no bounds on process speeds, no bound on message transfer delays) and that, within each group, processes may crash makes the problem non-trivial. The asynchronous group clock protocol uses two underlying building blocks (Reliable Multicast to Multigroups and Consensus).

A systematic methodology based on synchronous active objects for implementing programs that use communication protocols is discussed in the paper "Implementing Protocols with Synchronous Objects". A synchronous object can postpone the execution of one of its methods, when it is called from another object, until it is ready to execute it, and it can wait for several calls in parallel, executing the first one that is ready. Methods are executed atomically, i.e. without interleaving their statements with the statements of other methods. A synchronous object contains an internal activity that runs in parallel with the activities of the other synchronous objects, and it may suspend its execution on parallel wait statements. Synchronous objects allow finite state machines to be realized in a simple manner, which makes the implementation of protocols, often specified in this form, straightforward. The concept of synchronous objects has been integrated into C++ (which then becomes sC++) and into Java. sC++ adds a parallel wait statement and the inheritance of synchronous objects. Several examples are presented: the sliding window protocol, a TCP client-server configuration, a CORBA library and a distributed consensus algorithm. It is shown how concurrent programs written with synchronous objects can be analyzed for the existence of deadlocks, and the analysis is demonstrated for a given program.
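A loose approximation of the synchronous-object idea can be sketched with ordinary threading primitives (the chapter works in sC++, not Python; the BoundedBuffer example, its method names and the use of a condition variable are assumptions made here): a caller's method invocation is postponed until the object is ready for it, and method bodies never interleave because they run under the object's monitor.

    # Rough Python approximation of a synchronous object: calls are postponed
    # until the object is ready, and methods execute without interleaving.
    import threading
    from collections import deque

    class BoundedBuffer:
        def __init__(self, capacity: int) -> None:
            self._items: deque = deque()
            self._capacity = capacity
            self._ready = threading.Condition()

        def put(self, item) -> None:
            with self._ready:
                # Postpone the call until the object is ready to accept it.
                while len(self._items) >= self._capacity:
                    self._ready.wait()
                self._items.append(item)
                self._ready.notify_all()

        def get(self):
            with self._ready:
                # Postpone the call until an item is available.
                while not self._items:
                    self._ready.wait()
                item = self._items.popleft()
                self._ready.notify_all()
                return item

    if __name__ == "__main__":
        buf = BoundedBuffer(capacity=2)
        consumer = threading.Thread(target=lambda: print(buf.get(), buf.get()))
        consumer.start()
        buf.put("frame-1")
        buf.put("frame-2")
        consumer.join()

sC++ expresses the same behavior directly in the language (parallel wait statements, an internal activity per object) rather than through explicit locks and condition variables.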
Section III - "Analysis of Computer Networks"

New techniques which should prove valuable for formally reasoning about modern multiservice networks are presented in the paper "Automated Formal Analysis of Networks". These techniques are presented in the CSP/FDR model-checking formalism. CSP/FDR belongs to the class of formalisms that combine programming languages and finite state machines. A novel induction technique is described which can be used to verify end-to-end properties of certain arbitrarily configured networks. This technique should prove extremely valuable for reasoning about livelock and deadlock freedom in complex network protocols exercised by arbitrary numbers of network nodes. A powerful notion of refinement intuitively captures the idea that one system implements another. Mechanical support for refinement checking is provided by the FDR refinement checker, which also checks for system properties such as deadlock or livelock. Its applicability is illustrated with an example patterned after the Resource reSerVation Protocol (RSVP), a protocol designed to support resource reservation for high-bandwidth multicast transmissions over IP networks. The leaky bucket algorithm, which attempts to smooth traffic burstiness at a network node, is also analyzed.
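For reference, a minimal sketch of the classic leaky bucket shaper follows (this is a generic textbook rendering in Python, not the CSP/FDR model analyzed in the chapter; the capacity and leak_rate parameters are illustrative assumptions): bursts fill the bucket, the bucket drains at a constant rate, and arrivals that would overflow it are dropped or marked.

    # Minimal sketch of the classic leaky bucket traffic shaper.
    class LeakyBucket:
        def __init__(self, capacity: float, leak_rate: float) -> None:
            self.capacity = capacity    # maximum burst the bucket can absorb
            self.leak_rate = leak_rate  # units drained per second (smoothed output)
            self.level = 0.0
            self.last_time = 0.0

        def offer(self, size: float, now: float) -> bool:
            """Return True if a packet of `size` is admitted at time `now`."""
            # Drain the bucket at the constant leak rate since the last arrival.
            elapsed = now - self.last_time
            self.level = max(0.0, self.level - elapsed * self.leak_rate)
            self.last_time = now
            if self.level + size <= self.capacity:
                self.level += size      # admit: the packet joins the bucket
                return True
            return False                # bucket would overflow: drop (or mark)

    if __name__ == "__main__":
        bucket = LeakyBucket(capacity=3.0, leak_rate=1.0)
        arrivals = [(0.0, 2.0), (0.1, 2.0), (2.5, 2.0)]  # (time, packet size)
        for t, size in arrivals:
            print(t, "admitted" if bucket.offer(size, t) else "dropped")

In the run above the back-to-back burst at t = 0.1 is rejected, while the same-sized packet arriving after the bucket has drained is admitted, which is exactly the smoothing effect the formal analysis reasons about.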
A dependability analysis of computer networks with imperfect coverage, based on Binary Decision Diagrams (BDDs), is proposed in the paper "A BDD Approach to Dependability Analysis of Distributed Computer Systems with Imperfect Coverage". The approach avoids the high computational complexity of large systems and significantly reduces the memory required when the number of disjoint products is large. These features of the algorithm make it possible for the authors to study some practical, large distributed systems.

Section IV - "Fault-Tolerant Routing and Reconfiguration in Computer Networks"

Link-state protocols such as OSPF are the dominant routing technology in today's Internet. Despite their many advantages, these protocols require the flooding of new information across the entire routing area after any change in link state (e.g. a link failure). A scheme that restores loop-free routing in a link-state routing protocol environment with minimum communication overhead is described in the paper "Fault-Tolerant Routing in the Internet without Flooding". The scheme restores all paths traversing the failed link by performing local updates only on the neighboring routers. This approach is also useful for diverting traffic when a link becomes particularly congested. Because very few routers need to be informed, the operation can be done quickly and as frequently as needed. The Branch Update Algorithm is proposed, which can deliver a packet to its destination without traveling in any loops for any