Igor Schagaev · Eugene Zouev · Kaegi Thomas
Software Design for Resilient Computer Systems
Second Edition
Igor Schagaev
IT-ACS Ltd
Stevenage, UK

Eugene Zouev
Department of Informatics
Technopolis
Innopolis, Kazan, Russia

Kaegi Thomas
IT-ACS Ltd
Stevenage, UK
ISBN 978-3-030-21243-8
ISBN 978-3-030-21244-5 (eBook)
https://doi.org/10.1007/978-3-030-21244-5
1st edition: © Springer International Publishing Switzerland 2016
2nd edition: © Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
What I consider to be the strongest point of this work, indeed its main advantage, is
the extension of the winning strategy of military pilots to a multimodal complex: “If
your action leads to the unexpected, step back and play anew.” To recognize that
there is a hierarchy of response options and then choose the least obvious sequence,
yet the one leading to survival, is a human miracle. To apply this consistently to an
array of systems run by different algorithms is new engineering. Contrary to our
experiences with autonomous and semi-autonomous multimodular systems, we can
avoid creating irreconcilable paradoxes (shutdowns or total failures). In fact, they
can be resisted if our causational design logic is augmented (or replaced) with an
interactive-transformation-interactive approach. This changes our thinking from
compensating for some top-down hierarchy of (event) causes to enabling response
“negotiation” across and between all system modules. One way to view the
Resilient System Theory is to recognize that it can resist unacceptable outcomes by
negotiating multiple options to resolve multiple conflicts at multiple levels. The
introduction of “system resilience” requires nonlinear logic and redistributed
capacity for flexible coordination and re-coordination of internal regime conditions
and parameters. From this perspective, the survival of a multimodal complex is
achieved not by insisting on the maximum recovery from losses in or of its key
component(s) but by achieving a total system response and behavior with at least
minimally optimal integration and recovery under a variety of disorienting,
disabling, or dysfunctional conditions. The resilience theory of Prof. Schagaev and his
colleagues promises to integrate this conceptual framework into a radically
purposeful engineering-design framework.
Boris Gorbis, Los Angeles
Stevenage, UK Igor Schagaev
Kazan, Russia Eugene Zouev
Kaegi Thomas
Introduction for Second Edition
When in 1989 an anonymous reviewer commented on my short paper that “this
classification should be extended to description of distributed systems” (Yet
another approach to classification of redundancy, CIM IMEKO Symposium 1990,
Helsinki, pp. 117–124), I was really excited, because people in the research
community were thinking much deeper and wider than myself (I had just defended my
Ph.D.).
Further, fault tolerance was migrating to dependability (Jean Claude Laprie was
an indisputable authority and expert in this domain, see more at
www.springer.com/gb/book/9783709191729), which later emerged as the concept of resilience.
In principle, all these new properties had concrete reasoning and meaning behind
them: when something erroneous happens, any system of our design should be able
to cope with the problem. Options vary, as well as circumstances and area of
application, thus:
– If it stops the error propagating and freezes in a safe state, it is fail-stop, or
fail-safe;
– If it can cope with permanent faults inside the system, it is a fault-tolerant
system;
– When it continues with reduced functionality, it is graceful degradation;
– If it is designed with attention having been paid to reliability, availability, and
maintenance or serviceability, it is a dependable system;
– If it is capable of tolerating obstacles caused by internal and external factors and
can spring back, recover, and continue, then a system can be considered as
resilient.
There are two major ways to achieve any of the properties mentioned above: at
system level or at local level (technological). Obviously, any reasonable
combination of both levels is also welcome. We do not want to repeat our papers and
books (https://www.springer.com/gb/book/9783319150680,
https://www.springer.com/gb/book/9783319468129) but to incorporate into the second
edition any significant progress that has emerged.
Speaking about ICT systems, especially safety-critical and real-time ones, we
might think about the implementation of resilience from the system level down
through to hardware and systems software. In addition, we need to consider that
each of the parts will both interact with and support each other.
Non-Functional Requirements (NFRs) of each part of the system were
considered, such as:
– Performance;
– Reliability;
– Efficiency (mostly energy efficiency).
Therefore, the systems that we design should be PRE-smart and provide these
properties throughout the life cycle.
Neither of our books to date has appeared as complete. These books have been
used in China, Switzerland, Russia, and USA (mostly by Masters and Ph.D. students),
and we have received substantial feedback, such as:
• While reliability of hardware and availability at the system level are explained
and fine, there are no sections or chapters about performance, especially where
parallel and distributed systems are concerned;
• How to apply (as mentioned in the above review) the classification and
properties of resilience for and within distributed systems;
• How real-time and safety-critical applications should be treated considering the
system resilience: rules for system and for packages, have they changed?
It was especially satisfying when we discovered that these segments are being
updated by researchers around the globe, providing excellent contributions to the
content.
Thus, our book became an evolving system in itself, aggregating our further
efforts with the efforts and results of our colleagues from China, Switzerland, UK,
and Russia. Our book has therefore become itself resilient, benefiting from the
contributions from the following:
The performance chapter (including element-level performance and parallel design)
was prepared using materials and contributions from:
– Professor Hao Kai, Shantou University, China;
– Simon Monkman, IT-ACS Ltd researcher.
System software chapters were part of substantial efforts from:
– Professor Eugeny Zuev and his team in Technopolis, Kazan, Russia.
In turn, the consideration of system-level resilience for distributed systems,
requested in 1989, was developed as two chapters: system level and algorithmic
implementation, prepared by me and Stephen Farrell. In these chapters, we have
introduced a concept of desperation (for transactions within distributed systems)
and show that our existing and new results, some even patented
(https://www.ipo.gov.uk/p-ipsum/Case/PublicationNumber/GB2448351), can be
extremely useful in making the whole network really resilient and achieving far
better service for applications, especially when a critical level of their use is assumed.
The structure of the book is now as the figure below illustrates:
[Figure: structure of the book, grouped into four areas: Fault Tolerance; Proposed System Software and Hardware; System Software for Fault Tolerance; Future: Resilience of Distributed Systems]
Contents
1 Introduction
References
2 Hardware Faults
2.1 Introduction
2.2 Single Event Effects and Other Deviations
2.3 Conclusion
References
3 Fault Tolerance: Theory and Concepts
3.1 Introduction to Reliability Theory
3.2 Connection Between Reliability and Fault Tolerance
3.3 Models for Fault Tolerance
3.4 Chapter Conclusion
References
4 Generalized Algorithm of Fault Tolerance (GAFT)
4.1 The Generalized Algorithm of Fault Tolerance
4.2 Definition of Fault Tolerance by GAFT
4.3 Example of Possible GAFT Implementation
4.4 GAFT Properties: Performance, Reliability, Coverage
4.5 Reliability Evaluation for Fault Tolerance
4.6 Hardware Redundancy and Reliability
4.6.1 Hardware Redundancy: Reliability Analysis
4.7 Conceptual Summary
References
5 GAFT Generalization: A Principle and Model of Active System Safety
5.1 GAFT Extension: The Method of Active System Safety
5.2 GAFT Derivation: A Principle of Active System Safety
5.3 Dependency Matrix
5.4 Recovery Matrix
5.5 PASS Tracing Algorithm
5.5.1 Forward Tracing Algorithm
5.5.2 Backward Tracing Algorithm
5.6 Chapter Summary
References
6 System Software Support for Hardware Deficiency: Functions and Features
6.1 System Software Life Cycle Versus Fault Tolerance
6.2 System Software Phases
References
7 Testing, Checking, and Hardware Syndrome
7.1 Hardware-Checking Process
7.2 Analysis of Checking Process
7.2.1 The System Model
7.2.2 Diagnostic Process Algorithm
7.2.3 Procedure T1
7.2.4 Extension of the Diagnostic Procedure
7.2.5 Testing of Time-Sharing Systems
7.2.6 FT Scheduling
7.3 System Monitoring of Checking Process: A Syndrome
7.3.1 Access and Location of the Syndrome
7.3.2 Memory Configuration
7.3.3 Interfacing Zone: The Syndrome as Memory Configuration Mechanism
7.3.4 Graceful Degradation Approach and Implementation
7.3.5 Reconfiguration of Other Hardware Devices
7.4 Software Support for Hardware Reconfiguration
7.4.1 Software Support for Degradation
7.4.2 Hardware Condition Monitor
7.4.3 Hardware Condition Monitor: System Software Support
7.5 Hardware Reconfiguration Outlook
7.6 Summary
References
8 Recovery Preparation
8.1 Runtime System Support for Fault Tolerance and Reconfigurability
8.2 Overview of Existing Backward Recovery Techniques