
RεL: A Fault Tolerance Linguistic Structure for Distributed Applications

Vincenzo De Florio and Geert Deconinck
Katholieke Universiteit Leuven
Electrical Engineering Department, ELECTA Division,
Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium
E-mail: [email protected]

arXiv:1401.3683v1 [cs.DC] 15 Jan 2014

Abstract

The embedding of fault tolerance provisions into the application layer of a programming language is a non-trivial task that has not found a satisfactory solution yet. Such a solution is very important, and the lack of a simple, coherent and effective structuring technique for fault tolerance has been termed by researchers in this field as the “software bottleneck of system development”. The aim of this paper is to report on the current status of a novel fault tolerance linguistic structure for distributed applications characterized by soft real-time requirements. A compliant prototype architecture is also described. The key aspect of this structure is that it makes it possible to decompose the target fault-tolerant application into three distinct components, respectively responsible for (1) the functional service, (2) the management of the fault tolerance provisions, and (3) the adaptation to the current environmental conditions. The paper also briefly mentions a few case studies and preliminary results obtained exercising the prototype.

1. Introduction

1.1. Trusting computer services

Human society more and more expects and relies on the good quality of complex services supplied by computers: Computer services are becoming more and more vital, in the sense that a lack of timely delivery ever more often can have a severe impact on capitals, the environment, and even human lives. This state of facts is the consequence of the tremendous growth in both the complexity and the crucial character of roles nowadays assigned to computers. The extent of this process could hardly be foreseen in the early days of modern computing: In those days the main role of computers was basically that of fast solvers of numerical problems, which made it to some extent acceptable that outages and wrong results could occur rather often (footnote 1). Computer failures were a bothersome fact to be accepted and lived with. The very same increase in computer reliability and performance pushed up the introduction of computer services until they actually permeated our society. Consequently, what we call the criticality of computer services (that is, the magnitude of the consequences of a computer failure) has dramatically increased and, with it, the need for guarantees that computer failures can be avoided or their extent bounded. Dependability, or the trustworthiness of a computer system such that reliance can justifiably be placed on the service it delivers [22], became a fundamental requirement.

Footnote 1. This excerpt from a report on the ENIAC activity [31] gives an idea of how dependable computers were in 1947: “power line fluctuations and power failures made continuous operation directly off transformer mains an impossibility [...] down times were long; error-free running periods were short [...]”. After many considerable improvements, still “trouble-free operating time remained at about 100 hours a week during the last 6 years of the ENIAC’s use”, i.e., a reliability of about 60%!

Devising methods to fulfil the requirement for dependability of computer services has been and still is a hot research topic. We are not going to review those methods, but merely observe that they can be classified according to the (physical or virtual) machine they address: as an example, hardware fault tolerance (HFT) is the name of the class of methods that target physical faults and aim at preventing them from bringing the physical machine to a failure. We believe HFT is an important requirement to achieve a truly dependable computer service, as it addresses the foundation of the hierarchy of machines that collectively supply that service. Likewise we are convinced that, as any computer service is the result of the concurrent progress of a hierarchy of machines, service dependability may be best reached through a strategy that targets the whole of the hierarchy: Failing to consider one piece means weakening a link in the chain, creating a single point of overall service failure.

The top of the hierarchy, the application layer, is no exception. On the contrary, a design fault at this level may well be as jeopardizing as a physical fault in the hardware machine, for the application layer is the very “place” where the service is specified (in its more abstract terms).

It is this general purpose character that makes it so difficult to devise an application level fault tolerance (ALFT) strategy: Indeed, while effective solutions have been found, e.g., for the hardware, the operating system, and the middleware layers, the problem of an effective system structure for expressing fault tolerance provisions in the application layer of computer programs is still an open one.

Structuring techniques provide means to control complexity, the latter being a relevant factor for preventing the introduction of design faults. This fact and the ever increasing complexity of today’s distributed software justify the need for simple, coherent, and effective structures for the expression of fault tolerance in the application software. This paper describes the “recovery language approach” (RεL), i.e., a structuring technique for the expression of the fault tolerance design aspects in applications characterized by soft real-time requirements. The RεL technique in particular addresses three requirements of fault-tolerant software design:

R1 Separation of the functional and fault tolerance design aspects, such that the two design concerns do not conflict with each other.

R2 Dynamic adaptability to varying environmental conditions, obtained through a sort of dynamic linking of the fault tolerance executable code.

R3 A syntactical structure capable of hosting a wide class of fault tolerance (FT) provisions (footnote 2).

Footnote 2. By “FT provision” we mean any strategy (e.g. recovery blocks), or mechanism (such as watchdog timers), that can be used to introduce FT aspects into an application.

The above requirements are met by exploiting RεL’s capability to partition the design complexity of a distributed application into three components:

1. An application-specific component realizing the functional specification.

2. A special-purpose component dealing with the management of the FT provisions.

3. A special-purpose component responsible for the run-time adaptation of the FT provisions to the current environmental conditions.

The structure of this paper is as follows: Section 2 introduces the elements of our approach. Section 3 describes a RεL-compliant prototype software architecture that has been developed in the framework of the two ESPRIT projects EFTOS (“embedded fault-tolerant supercomputing”) [15] and TIRAN (“tailorable fault tolerance frameworks for embedded applications”) [2]. That architecture focuses on component 2. Section 3 also mentions a few case studies where RεL is proving its effectiveness. The paper is concluded by Sect. 4, which also provides the reader with the elements of a new RεL-compliant architecture. Such architecture, which is being developed in the framework of the IST-2000-25434 project DepAuDE (“Dependability for embedded Automation systems in Dynamic Environments with intra-site and inter-site distribution aspects”), is to fully exploit the capabilities of RεL. The key goal of this architecture is to realize all the special-purpose components of a fully RεL-compliant distributed architecture, leaving to the user the sole management of the service specification.

2. The Recovery Language Approach

This section describes RεL, a FT linguistic structuring technique for distributed applications with soft real-time constraints. By structuring technique we mean a set of methods by means of which it is possible to express and to manage some FT provision. In the following, we will characterize both the above “methods”: expressing and managing a FT provision. Furthermore, in order to characterize our technique with respect to the existing ones, we will make use, informally, of a “base” of structural properties, namely

sc: separation of design concerns,

a: adaptability to a varying environment, and

sa: syntactical adequacy, i.e., the adequacy of the technique at hosting a FT provision, averaged on the set of possible FT provisions.

Clearly the above properties respectively match requirements R1, R2 and R3. In what follows we will show that RεL is a simple, coherent, and effective FT linguistic structure that provides satisfactory values of the three structural properties (sc, a, sa) in the domain of soft real-time, distributed applications.

In RεL two distinct programming languages are available to the programmer: a service language, i.e., the programming language addressing the functional design concerns, and a special-purpose linguistic structure (called “recovery language”) for the expression of error recovery and reconfiguration tasks. This recovery language comes into play either asynchronously, as soon as an error is detected by an underlying error detection layer, or when some erroneous condition is signaled by the application processes. Error recovery and reconfiguration are specified as a set of guarded actions, i.e., actions that require a pre-condition to be fulfilled in order to be executed.
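The notion of a guarded action can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `GuardedAction`, `run_recovery`, `task_a` and `task_b`, and the string-valued state table, are all hypothetical and serve only to show a pre-condition being evaluated before its action is allowed to run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model: entity name -> current condition ("ok", "transient-fault", ...).
State = Dict[str, str]

@dataclass
class GuardedAction:
    guard: Callable[[State], bool]    # pre-condition, queried against the current state
    action: Callable[[State], None]   # executed only when the guard holds

def run_recovery(actions: List[GuardedAction], state: State) -> int:
    """Evaluate each guarded action in order; return how many actions fired."""
    fired = 0
    for ga in actions:
        if ga.guard(state):
            ga.action(state)
            fired += 1
    return fired

def restart_task_a(state: State) -> None:
    # Stand-in for a real restart: only the bookkeeping is modelled here.
    state["task_a"] = "restarted"

def restart_task_b(state: State) -> None:
    state["task_b"] = "restarted"

recovery = [
    GuardedAction(lambda s: s.get("task_a") == "transient-fault", restart_task_a),
    GuardedAction(lambda s: s.get("task_b") == "transient-fault", restart_task_b),
]

state = {"task_a": "transient-fault", "task_b": "ok"}
fired = run_recovery(recovery, state)
```

Only the first guard holds for this state, so only `task_a` is touched; the second action is skipped because its pre-condition is false.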
Recovery actions deal with coarse-grained entities of the application and the system, and pre-conditions query the current state of those entities. An example of a recovery action is the following one:

    if a transient fault affects “task 10”:
        restart task 10
        notify the group of tasks to which task 10 belongs
    end

A larger example of guards and actions can be seen in Sect. 3, where a prototype RεL-compliant architecture is described.

An important added value of RεL is that it allows for the expression of the recovery actions to be done in a design and programming context other than the one in which the expression of the functional service takes place. This minimizes non-functional code intrusion and hence enhances property sc.

The execution of the recovery actions is done via a fixed (i.e., special-purpose) scheme, portrayed in the sequence diagram of Fig. 1: as soon as an error is detected, a notification describing that event is sent to a distributed entity responsible for the collection and the management of these notifications. Let us call such entity the “backbone” (BB). Immediately after storing each notification, the guards of the recovery actions are evaluated. Guards evaluation is done by querying the BB. When a guard is found to be true, its corresponding actions are executed, otherwise they are skipped.

Figure 1. Scheme of execution of a RεL-compliant application: together with the application, two special-purpose tasks are running: a system-wide database management system (we call it the “backbone”), which stores error detection notifications sent by a periphery of detection tools, and a “recovery application”, i.e., a task responsible for the execution of the recovery actions. The diagram describes the execution of the user-specified recovery actions. The dotted line represents a jump to the execution of the next guarded action, if any. Error recovery ends when the last guarded action is evaluated.

The just sketched strategy represents the way RεL performs its management of the FT provisions to be embedded in the target application. An important consequence of the adoption of this strategy is that the functional executable code and the non-functional executable code are distinct: the former implements the user tasks, while the latter is given by a proper coding of the recovery actions. This makes it possible to decompose the design process into two distinct phases. When the interface between the two “aspects” is simple and well-defined, this provides a way to control the design complexity, which decreases development times and costs. In the current implementation, described in Sect. 3, the recovery actions are translated into a “recovery pseudo-code” (we call it r-code) that is interpreted by an r-code virtual machine. Currently, the r-code can either be read from a file or “hardwired” in the r-code virtual machine. The separability of the r-code from the functional code provides the elements for the approach described in Sect. 4, which focuses on adaptability and FT software reuse.

The above strategy clearly focuses on the error recovery step of FT. In order to minimize the code intrusion due to error detection and fault masking, we envisaged a configuration language that allows the user to set up ready-to-use instances of provisions selected from a custom library of single-version FT mechanisms, including, e.g., a watchdog timer or a voting tool. These instances are also instrumented in such a way as to forward transparently their notifications to the BB. Notifications include, e.g., a watchdog timer’s alarm, or a caught division-by-zero exception, or a minority input value to a voting tool. An example of configuration language can be seen in Sect. 3. The same translator that turns the recovery actions into the r-code is used in that case to write the source files with the configured instances.

2.1. System and Application Models

The target system for RεL is assumed to be a distributed or parallel system. Basic components are nodes, tasks, and the network. A node can be, e.g., a workstation in a networked cluster or a processor in a MIMD parallel computer. Tasks are independent threads of execution running on the nodes. The network system allows tasks on different nodes to communicate with each other. Nodes can be commercial-off-the-shelf hardware components with no special provisions for hardware FT. A general-purpose operating system (OS) is required on each node. No special purpose, distributed, or fault-tolerant OS is required. The system obeys the timed asynchronous distributed system model [5]:

• Tasks communicate through the network via a datagram service with omission/performance failure semantics [4].

• Services are timed: specifications prescribe not only the outputs and state transitions that should occur in response to inputs, but also the time intervals within which a client task can expect these outputs and transitions to occur.

• Tasks (including those related to the OS and the network) have crash/performance failure semantics [4].

• Tasks have access to a node-local hardware clock. If more than one node is present, clocks on different nodes have a bounded drift rate.

• A “time-out” service is available at application-level: using it, tasks can schedule the execution of events so that they occur at a given future point in time, as measured by their local clock.

In particular, this model allows a straightforward modeling of system partitioning: as a consequence of sufficiently many omission or performance communication failures, correct nodes may be temporarily disconnected from the rest of the system during so-called periods of instability [5]. A message passing library is assumed to be available, built on the datagram service. Such library offers asynchronous, non-blocking multicast primitives. As clearly explained in [5], the above hypotheses match well today’s distributed systems based on networked workstations; as such, they represent a general model with no practical restriction.

The following assumptions characterize the user application:

• The service is supplied by a distributed application.

• It is written or is to be written in a procedural or object-oriented language such as C or Java.

• The application is non safety-critical.

• The target application is characterized by soft real-time requirements. In particular, performance failures may occasionally show up during error recovery.

• Inter-process communication takes place by means of the functions in the above mentioned message passing library. Higher-level communication services, if available, must be based on the message passing library as well.

As suggested, e.g., in [28], any effective design including dependability goals requires provisions, located at all levels, to avoid, remove, or tolerate faults. Hence, as an application-level structuring technique, RεL is complementary to other approaches addressing FT at system level, i.e., hardware-level and OS-level FT. In particular, a system-level architecture such as GUARDS [25], which is based on redundancy and on hardware and OS provisions for systematic management of consensus, appears to be particularly appropriate for being coupled with RεL, which offers application-level provisions for N-version programming and replication (see Sect. 3).

2.2. Work-flow of RεL

This section describes the work-flow corresponding to the adoption of the RεL approach. Figure 2 summarizes the work-flow. The following basic steps have been foreseen:

• In the first steps (labels 1 and 2 in the cited figure), the designer describes the key application and system entities, such as tasks, groups of tasks, and nodes. The main tool for this phase is the configuration language.

• Next (step 3), the designer configures a number of basic FT tools (BTs) he or she has decided to use. The configuration language is used for this. The output of steps 1–3 is the configuration code.

• Next (step 4), the designer defines which conditions need to be caught, and which actions should follow each caught condition. The resulting list is coded as a number of guarded actions via a recovery language.

• The configuration code and the recovery code are then converted via the translator into a set of C header files, C fragments, and system-specific configuration files (steps 5 and 6). These files represent: configured instances of the BTs, of the system and of the application; initialization files for the communication management functions; user preferences for the BB; and the recovery pseudo-code.

• On steps 7–9, the application source code and a set of configured instances of BTs are compiled in order to produce the executable codes of the application.

• Next, the BB and the recovery interpreter are compiled on steps 10–13.

Figure 2. A work-flow diagram for RεL. Labels refer to usage steps and are described in Sect. 2.2.

The resulting components, i.e., the executable codes of the application, the backbone, and RINT, represent the entities portrayed in Fig. 2.

In the following we briefly summarize the specific differences between ours and other novel approaches.

2.3. Specific Differences with respect to Other Approaches

Numerous techniques have been devised in the past to solve the problem of optimal and flexible development of dependability services to be embedded in the application layer of a computer program. In [6], some of these approaches are critically reviewed and qualitatively assessed with respect to a set of structural attributes (separation of design concerns, syntactical adequacy and adaptability). A non-exhaustive list of the systems and projects implementing these approaches is also given in the cited reference. In particular, approaches based on metaobject protocols [20] (MOPs), FT distributed programming languages [27] and aspect-oriented programming [21] (AOP) are reviewed therein.

Metaobject Protocols. The key idea behind MOPs is that of “opening” the implementation of the run-time executive of an object-oriented language like C++ or Java so that the developer can adopt and program different, custom semantics, adjusting the language to the needs of the user and to the requirements of the environment. Using MOPs, the programmer can modify the behavior of fundamental features like method invocation, object creation and destruction, and member access. The key concept behind MOPs is that of computational reflection, or the causal connection between a system and a meta-level description representing structural and computational aspects of that system [24]. An architecture supporting this approach is FRIENDS [17]. FRIENDS implemented a number of FT provisions (e.g., replication, group-based communication, synchronization, voting) as MOPs.

A number of studies confirm that MOPs reach efficiency in some cases [20], though no experimental or analytical evidence makes it possible to estimate the practicality and the applicability of this approach [26, 23]. MOPs only support object-oriented programming languages and require special extensions or custom programming languages.

Aspect-oriented Programming Languages. Aspect-oriented programming [21] is a programming methodology and a structuring technique that explicitly addresses, at system-wide level, the problem of the best code structure to express different, possibly conflicting design goals like for instance high performance, optimal memory usage, or dependability.

Developed as a Xerox PARC project, AspectJ is an aspect-oriented extension to the Java programming language [19, 23]. A study has been carried out on the capability of AspectJ as an AOP language supporting exception detection and handling [23]. It has been shown how AspectJ can be used to develop so-called “plug-and-play” exception handlers: libraries of exception handlers that can be plugged into many different applications. This translates into better support for managing different configurations at compile-time.

Up to now, no AOP tool or programming language exists for flexible development of dependable services: AspectJ only addresses exception detection and handling. Remarkably enough, the authors of a recent study on AspectJ and its support to this field conclude [23] that “whether the properties of AspectJ [documented in this paper] lead to programs with fewer implementation errors and that can be changed easier, is still an open research topic that will require serious usability studies as AOP matures”.

3. The ariel Configuration and Recovery Language

This section describes a prototypic architecture based on RεL that has been developed during the recently ended project TIRAN. In the following, in Sect. 3.1 we present the contents of TIRAN. The main components of the TIRAN architecture are then briefly introduced in Sect. 3.2. In particular, the TIRAN recovery language, ariel, is reported in Sect. 3.3 and a few case studies in Sect. 3.4.

3.1. The TIRAN Project

The main objective of project TIRAN (ESPRIT 28620) has been to develop a software framework that provides fault-tolerant capabilities to automation systems. Application-level support to FT is provided by means of a RεL-compliant architecture, which is described in the rest of this section. The framework provides a library of software FT provisions that are parametric and support an easy configuration process. Using the framework, application developers are allowed to select, configure and integrate provisions for fault masking, error detection, isolation and recovery among those offered by the library. The goal of the project is to provide a tool that significantly reduces the development times and costs of a new dependable system. The target market segment concerns non-safety-critical distributed soft-real-time embedded systems [3]. TIRAN explicitly adopts formal techniques to support requirement specification and predictive evaluation [16]. This, together with the intensive testing on pilot applications, is exploited in order to:

• Assess the correctness of the framework.

• Quantify the fulfillment of time, dependability and cost requirements.

• Provide guidelines to the configuration process of the users.

Most of this framework has been designed for being platform independent. A single version of the framework has been written in the C programming language making use of a library of “basic services” (BSL) developed by the TIRAN consortium. The TIRAN framework is currently running on Windows-NT, Windows-CE, the Virtuoso microkernel [29], VxWorks, and the TEX microkernel [30].

The project results, driven by industrial users’ requirements and market demand, are being integrated into the Virtuoso microkernel and adopted by ENEL and SIEMENS within their application fields.

Figure 3. A representation of the TIRAN elements. The central, whiter layers constitute the TIRAN framework. This same structure is replicated on each processing node of the system.

3.2. The TIRAN Framework

Figure 3 draws the TIRAN architecture and positions its main components into it. In particular, the box labeled “Ariel” represents the TIRAN recovery language, ariel. The central, whiter layers represent the TIRAN framework. In particular:

• Level 0 hosts the BSL (see Sect. 3), which gives system-independent access to the services provided by the underlying run-time system.

• Level 1 services are provided by a set of BTs for error detection and fault masking (level 1.1) and by another set addressing isolation, recovery and reconfiguration (level 1.2). These services are not distributed on multiple nodes.

• Level 2 hosts the TIRAN BB [10]. This is the component responsible for the management of the distributed database (DB) that maintains records describing errors detected by Level 1.1 BTs. It also includes a time-out management system, called TOM [11], and a recovery interpreter, RINT, actually a virtual machine executing the r-code. The BB executes an algorithm, described in [10], which allows it to tolerate node and component crashes and to withstand partitioning caused by temporary periods of communication instability. The BB straightforwardly supports the α-count fault identification mechanism [1] by feeding α-count filters immediately after the arrival of each new error detection notification. In Fig. 3, the edge connecting RINT to ariel means that RINT actually implements (executes) the ariel programs. Note the control and data messages that flow from BB to TOM, DB, and RINT. RINT also sends control messages to the isolation and recovery BTs. These are low-level messages that request specific recovery actions. Data messages flow also from BB to a monitoring tool [13].

• Dependable mechanisms (DMs), i.e., high-level, distributed FT tools exploiting the services of the BB and of the BTs, are located at level 3. These tools include a distributed voting tool [9], a distributed synchronization tool, and a data stabilizer [12]. The DMs receive notifications from RINT in order to execute reconfigurations such as, for instance, introducing a spare task to take over the role of a failed task.

The layers around the TIRAN framework in Fig. 3 represent (from the layer at the bottom and proceeding counter-clockwise):

• The run-time system.

• The functional application layer and the recovery language application layer (again, box labeled “Ariel”).

• A monitoring tool, for hypermedia rendering of the current state of the system within the windows of a WWW browser.

The next section focuses on the key component of the TIRAN prototype, namely, the ariel recovery language.

3.3. The ariel Language

Within TIRAN, a single syntactical framework, provided by the ariel language, serves the application designer as both a configuration and a recovery language. ariel is a language with a syntax somewhat similar to that of the UNIX shells. ariel deals with five basic types: “nodes”, “tasks”, “groups”, integers, and real numbers. A node is a uniquely identifiable processing node of the system, e.g., a processor of a MIMD supercomputer. A task is a uniquely identifiable process or thread in the system. A group is a uniquely identifiable collection of tasks, possibly running on different nodes. Nodes, tasks, and groups are generically called entities. Entities are uniquely identified via non-negative integers; for instance, NODE3 or N3 refer to the processing node currently configured as number 3. Symbolic constants can be “imported” from C language header files through the statement INCLUDE. When curly brackets appear around a string, the value of the corresponding symbolic constant is returned.

The key statement in ariel is the IF, which is used to code a recovery action as follows:

    IF [ guard ] THEN actions,

where a guard checks whether an entity, according to the current contents of the database, is in one of the following states: active; affected by a fault; affected by a transient fault; isolated; restarted. A guard can also check the current “phase” of a task, e.g., its current algorithmic step, that the task can declare via a custom BSL function. Actions can be guards (which makes it possible to represent recovery actions as trees) and remote or local commands for: sending messages to tasks and groups; terminating, isolating, starting or restarting an entity. Restarting a node means rebooting it, terminating a node means performing a node shutdown. Isolating a task means disabling its communication descriptors. A local command is executed by the local BB component, while a remote one is first sent to the corresponding BB component and then executed by it.

ariel also allows its BTs to be configured. For instance, the following syntax:

    INCLUDE "mydefinitions.h"
    WATCHDOG {MYWD} WATCHES TASK {MYTASK}
        HEARTBEATS EVERY {HEARTBEAT} MS
        ON ERROR WARN TASK {CONTROLLER}
    END WATCHDOG

produces a source code configuring a watchdog that, once enabled by its first heartbeat, expects new such messages every HEARTBEAT milliseconds, or sends task CONTROLLER an alarm message. Note that in this case the error detection code intrusion is reduced to the function call for sending heartbeats. Configuration also includes replicated tasks and N-version programming. Syntaxes for retry blocks and consensus recovery blocks have also been implemented.

The ariel translator, called “art”, produces both the configured instances of the BTs and the recovery pseudo-code (r-code). The latter can either be output as a binary file, to be read by RINT at run-time, or as an include file to be compiled with RINT. This r-code is then re-executed by RINT each time the backbone notifies it that a new event has been stored in the database, as described in Fig. 1.

3.4. Case Studies

The ariel language and the TIRAN framework have been exercised in the course of project EFTOS and project TIRAN on a number of case studies, in application domains as different as postal automation, electrical substation automation, and airport lighting systems. These case studies were formulated by two members of the EFTOS and TIRAN consortia (Siemens and ENEL) and have their origin within the internal strategies of those companies. One of these case studies is reported in [14]. Another noteworthy case study has been the development of a Level 3 FT mechanism supporting distributed voting. This tool exploits two features of ariel: first, it makes use of spare components: error recovery strategies like reconfiguration and graceful degradation (when spares are exhausted) can be expressed in terms of ariel scripts and result in no code intrusion. Secondly, it exploits the built-in support of the α-count fault identification mechanism in order to let the user express different error recovery strategies depending on the nature of the corresponding faults. This makes it possible to express recovery actions such as:

    IF [ FAULTY TASK {MYTASK} ]
    THEN
        IF [ TRANSIENT TASK {MYTASK} ]
        THEN Conservative strategy
             (e.g., restart the task)
        ELSE Reconfiguration
        FI
    FI.

This aims at keeping reconfiguration as the ultimate solution in order to minimize the rate at which redundancy is “consumed”. Markov modeling of this approach shows that it considerably enhances reliability [6]. For the sake of brevity we refer to the cited sources for a full description of the case studies and their evaluation.

4. Conclusions and Future Work

A novel fault tolerance linguistic structure for distributed applications has been briefly described. Such structure is at the core of the strategy that is currently being designed within IST-2000-25434 Project “DepAuDE” to allow dependable real-time applications with intra-site and inter-site distribution aspects to adapt to a changing environment ([8] briefly mentions the key ideas behind the DepAuDE strategy). The design of the elements of the architecture sketched in this paper, which explicitly addresses requirements R1, R2 and R3, is one of the goals of DepAuDE. As mentioned before, RεL is being used in several case studies with promising results. One of these case studies is described in [14]. The adoption of a recovery language within a generative communication infrastructure (such as the one of LINDA [18]) is also currently being experimented with [7].

Acknowledgements. This project is partly supported by the IST-2000-25434 Project “DepAuDE”. Geert Deconinck is a Postdoctoral Fellow of the Fund for Scientific Research - Flanders (Belgium) (FWO).

References

[1] A. Bondavalli, S. Chiaradonna, F. Di Giandomenico, and F. Grandoni. Threshold-based mechanisms to discriminate transient from intermittent faults. IEEE Trans. on Computers, 49(3):230–245, March 2000.

[2] O. Botti, V. De Florio, G. Deconinck, S. Donatelli, A. Bobbio, A. Klein, H. Kufner, R. Lauwereins, E. Thurner, and E. Verhulst. TIRAN: Flexible and portable fault tolerance solutions for cost effective dependable applications. In P. Amestoy, editor, Proc. of the 5th Euro-Par Conference, Lecture Notes in Computer Science, volume 1685, pages 1166–1170, Toulouse, France, August/September 1999. Springer-Verlag, Berlin.

[3] O. Botti, V. De Florio, G. Deconinck, R. Lauwereins, F. Cassinari, S. Donatelli, A. Bobbio, A. Klein, H. Kufner, E. Thurner, and E. Verhulst. The TIRAN approach to reusing software implemented fault tolerance. In Proc. of the 8th Euromicro Workshop on Parallel and Distributed Processing (Euro-PDP’00), pages 325–332, Rhodos, Greece, January 2000. IEEE Comp. Soc. Press.

[4] F. Cristian. Understanding fault-tolerant distributed systems. Communications of the ACM, 34(2):56–78, February 1991.

[5] F. Cristian and C. Fetzer. The timed asynchronous distributed system model. IEEE Trans. on Parallel and Distributed Systems, 10(6):642–657, June 1999.

[6] V. De Florio. A Fault-Tolerance Linguistic Structure for Distributed Applications. PhD thesis, Dept. of Electrical Engineering, University of Leuven, October 2000. ISBN 90-5682-266-7.

[7] V. De Florio and G. Deconinck. A parallel processing model based on generative communication and recovery languages. In Proc. of the 14th Int.l Conference on Software & Systems Engineering and their Applications (ICSSEA 2001), Paris, France, December 2001.

[8] V. De Florio and G. Deconinck. On some key requirements of mobile application software. In Proc. of the 9th Annual IEEE International Conference and Workshop on the Engineering of Computer Based Systems (ECBS), Lund, Sweden, April 2002. IEEE Comp. Soc. Press.

[9] V. De Florio, G. Deconinck, and R. Lauwereins. Software tool combining fault masking with user-defined recovery strategies. IEE Proceedings – Software, 145(6):203–211, December 1998. Special Issue on Dependable Computing Systems. IEE in association with the British Computer Society.

[10] V. De Florio, G. Deconinck, and R. Lauwereins. An algorithm for tolerating crash failures in distributed systems. In Proc. of the 7th Annual IEEE International Conference and Workshop on the Engineering of Computer Based Systems (ECBS), pages 9–17, Edinburgh, Scotland, April 2000. IEEE Comp. Soc. Press.

[11] V. De Florio, G. Deconinck, and R. Lauwereins. Application-level time-out support for real-time systems. In Proc. of the 6th IFAC Workshop on Algorithms and Architectures for Real-Time Control (AARTC’2000), pages 31–36, Palma de Mallorca, Spain, May 2000.

[12] V. De Florio, G. Deconinck, R. Lauwereins, and S. Graeber. Design and implementation of a data stabilizing software tool. In Proc. of the 9th Euromicro Workshop on Parallel and Distributed Processing (Euro-PDP’01), Mantova, Italy, February 2001. IEEE Comp. Soc. Press.

[13] V. De Florio, G. Deconinck, M. Truyens, W. Rosseel, and R. Lauwereins. A hypermedia distributed application for monitoring and fault-injection in embedded fault-tolerant parallel programs. In Proc. of the 6th Euromicro Workshop on Parallel and Distributed Processing (Euro-PDP’98), pages 349–355, Madrid, Spain, January 1998. IEEE Comp. Soc. Press.

[15] [...] (Reusable software solutions for more fault-tolerant) Industrial embedded HPC applications. Supercomputer, XIII(69):23–44, 1997.

[16] G. Dondossola and O. Botti. Fault tolerance specification: proposal of a method combining semi-formal and formal approaches. In Proc. of the International Conference on Fundamental Approaches to Software Engineering (FASE 2000), held in the framework of the European Joint Conferences on Theory and Practice of Software (ETAPS 2000), Lecture Notes in Computer Science, volume 1783, pages 82–96, Berlin, Germany, March 2000. Springer-Verlag, Berlin.

[17] J.-C. Fabre and T. Pérennou. A metaobject architecture for fault-tolerant distributed systems: The FRIENDS approach. IEEE Transactions on Computers, 47(1):78–95, January 1998.

[18] D. Gelernter. Generative communication in Linda. ACM Trans. on Prog. Languages and Systems, 7(1), January 1985.

[19] G. Kiczales. AspectJ(tm): aspect-oriented programming using Java(tm) technology. In Proc. of the Sun’s 2000 Worldwide Java Developer Conference (JavaOne), San Francisco, California, June 2000. Slides available at URL http://aspectj.org/servlets/AJSite?channel=documentation&subChannel=papersAndSlides.

[20] G. Kiczales, J. des Rivières, and D. G. Bobrow. The Art of the Metaobject Protocol. The MIT Press, Cambridge, MA, 1991.

[21] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Videira Lopes, J.-M. Loingtier, and J. Irwin. Aspect-oriented programming. In Proc. of the European Conference on Object-Oriented Programming (ECOOP), Lecture Notes in Computer Science, volume 1241, Finland, June 1997. Springer, Berlin.

[22] J.-C. Laprie. Dependable computing and fault tolerance: Concepts and terminology. In Proc. of the 15th Int. Symposium on Fault-Tolerant Computing (FTCS-15), pages 2–11, Ann Arbor, Mich., June 1985. IEEE Comp. Soc. Press.

[23] M. Lippert and C. Videira Lopes. A study on exception detection and handling using aspect-oriented programming. In Proc. of the 22nd International Conference on Software Engineering (ICSE’2000), Limerick, Ireland, June 2000.

[24] P. Maes. Concepts and experiments in computational reflection. In Proc. of the Conference on Object-
Oriented Programming Systems, Languages, and Ap- [14] V. De Florio, S. Donatelli, and G. Dondossola. Flex- plications (OOPSLA-87), pages 147–155, Orlando, ible development of dependability services: An expe- FL, 1987. rience derived from energy automation systems. In [25] D. Powell, J. Arlat, L. Beus-Dukic, A. Bondavalli, Proc. of the 9th Annual IEEE International Confer- P. Coppola, A. Fantechi, E. Jenn, C. Rabéjac, and ence and Workshop on the Engineering of Computer A. Wellings. GUARDS: A generic upgradable archi- Based Systems (ECBS), Lund, Sweden, April 2002. tectureforreal-timedependablesystems.IEEETrans. IEEE Comp. Soc. Press. on Parallel and Distributed Systems, 10(6):580–599, [15] G. Deconinck, T. Varvarigou, O. Botti, V. De Flo- June1999. rio, A. Kontizas, M. Truyens, W. Rosseel, R. Lauw- [26] B. Randell and J. Xu. The evolution of the recov- ereins, F. Cassinari, S. Graeber, and U. Knaak. ery block concept. In M. Lyu, editor, Software Fault 9 Tolerance, chapter1,pages 1–21. JohnWiley&Sons, New York,1995. [27] B. Robben. Language Technology and Metalevel Ar- chitectures for Distributed Objects. PhD thesis, Dept. ofComputerScience,UniversityofLeuven,May1999. [28] J. H. Saltzer, D. P. Reed, and D. D. Clark. End- to-end arguments in system design. ACM Trans. on Computer Systems, 2(4):277–288, 1984. [29] E. Systems. Virtuoso v.4 reference manual, 1998. [30] TXT. TEX User Manual. TXT Ingegneria Informat- ica, Milano, Italy, 1997. [31] M. H. Weik. The ENIAC story. ORDNANCE — The Journal of the American Ordnance Associ- ation, January-February 1961. Available at URL http://ftp.arl.mil/∼mike/comphist/eniac-story.html. 10
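
Note on the recovery strategy of Section 3.4. The nested IF of the ariel fragment (restart a task while its fault still looks transient, reconfigure only otherwise) can be illustrated with a minimal Python sketch. It is not the TIRAN implementation: the score update is a simplified reading of the α-count heuristic of [1], and the names AlphaCount, decay, threshold and recovery_action, as well as the numeric values, are hypothetical choices made only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class AlphaCount:
    """Simplified alpha-count filter (assumed form, illustrative parameters)."""
    decay: float = 0.5      # hypothetical decay factor, 0 < decay < 1
    threshold: float = 3.0  # hypothetical threshold on the score
    alpha: float = 0.0      # current score for the monitored task

    def record(self, error_detected: bool) -> None:
        # An error raises the score by one; an error-free round lets it decay.
        if error_detected:
            self.alpha += 1.0
        else:
            self.alpha *= self.decay

    def is_transient(self) -> bool:
        # Below the threshold the fault is still treated as transient.
        return self.alpha < self.threshold


def recovery_action(faulty: bool, counter: AlphaCount) -> str:
    """Mirror the nested IF: conservative restart first, reconfiguration last."""
    if not faulty:
        return "none"
    return "restart" if counter.is_transient() else "reconfigure"
```

With these assumed values, isolated errors separated by error-free rounds keep the score low and lead to a restart, whereas a burst of consecutive errors pushes the score over the threshold and leads to reconfiguration, so that spares are consumed only when the fault appears to persist.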

