ebook img

An Algorithm for Tolerating Crash Failures in Distributed Systems PDF

0.66 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview An Algorithm for Tolerating Crash Failures in Distributed Systems

An Algorithm for Tolerating Crash Failures in Distributed Systems VincenzoDeFlorio, Geert Deconinck,and Rudy Lauwereins KatholiekeUniversiteitLeuven, Electrical EngineeringDepartment,ACCA Group, 6 Kard. Mercierlaan 94, B-3001 Heverlee,Belgium. 1 E-mail:{vincenzo.deflorio|lauwerin}@gmail.com 0 2 n Abstract toembedintohis/herapplicationsotoenhanceitsdepend- a J ability.Thesetoolsinclude,e.g.,watchdogtimers,traphan- 6 In the framework of the ESPRIT project 28620 “TIRAN” dlers,localanddistributedvotingtools. 1 (tailorablefaulttoleranceframeworksforembeddedappli- ThemaindifferencebetweenTIRANandotherlibraries cations), a toolsetof error detection, isolation, andrecov- with similar purposes, e.g., ISIS [3] or HATS [12], is the ] C erycomponentsisbeingdesignedtoserveasabasicmeans adoption of a special component, located between the ba- for orchestrating application-level fault tolerance. These sictoolsetandtheuserapplication. Thisentityistranspar- D tools will be used either as stand-alone componentsor as entlyreplicatedoneachnodeofthesystem,tokeeptrackof . s theperipheralcomponentsofadistributedapplication,that eventsoriginatingin the basic layer (e.g., a missing heart- c we call “the backbone”. The backbone is to run in the beatfromataskguardedbyawatchdog)orintheuserap- [ backgroundof the user application. Its objectives include plication(e.g.,thespawningofanewtask),andtoallowthe 1 (1)gatheringandmaintainingerror detectioninformation orchestrationofsystem-wideerrorrecoveryandreconfigu- v producedbyTIRANcomponentslikewatchdogtimers,trap ration.Wecallthiscomponent“thebackbone”. 1 3 handlers,orbyexternaldetectionservicesworkingatker- In order to perform these tasks, the backbone hooks to 2 nelor driver level, and (2) using this information at error each active instance of the basic tools and is transparently 4 recovery time. In particular, those TIRAN tools related to informedofanyerrordetectionorfaultmaskingeventtak- 0 errordetectionandfaultmaskingwillforwardtheirdeduc- ing place in the system. Similarly, it also hooks to a li- . 1 tionstothebackbonethat,inturn,willmakeuseofthisin- braryofbasicservices.Thislibraryincludes,amongothers, 0 formationtoorchestrateerrorrecovery,requestingrecovery functionsforremotecommunication,fortask creationand 6 and reconfigurationactions to those tools related to error management,andtoaccessthelocalhardwareclock.These 1 : isolationandrecovery. Clearlyakeypointinthisapproach functions are instrumented so to transparently forward to v is guaranteeing that the backbone itself tolerates internal thebackbonenotificationsofeventslikethecreationorthe i X and external faults. In this article we describe one of the terminationofathread.Speciallow-levelservicesatkernel r means that are used within the TIRAN backbone to fulfill oratdriver-levelarealsohookedtothebackbone—forex- a thisgoal: adistributedalgorithmfortoleratingcrashfail- ample,onacustomboardbasedonAnalogDevicesADSP- urestriggeredby faultsaffectingatmostallbutoneofthe 21020 DSP and on IEEE 1355-compliant communication components of the backbone or at most all but one of the chips [1], communication faults are transparently notified nodes of the system. We call this the algorithm of mutual tothebackbonebydriver-leveltools. Doingthis,aninfor- suspicion. mation stream flows on each node from different abstract layersto the localcomponentof the backbone. Thiscom- ponentmaintainsand updatesthis informationin the form 1. Introduction ofasystemdatabase,alsoreplicatingitondifferentnodes. Wheneveranerrorisdetectedorafaultismasked,those In the framework of the ESPRIT project 28620 TIRAN tools related to error detection and fault mask- “TIRAN” [4], a toolset of error detection, isolation, and ing forward their deductionsto the backbonethat, in turn, recoverycomponents is being developed to serve as a ba- makesuseofthisinformationtomanageerrorrecovery,re- sicmeansfororchestratingapplication-levelsoftwarefault questingrecoveryandreconfigurationactionstothosetools tolerance.Thebasiccomponentsofthistoolsetcanbecon- relatedtoerrorisolationandrecovery(seeFig.1). sideredasready-madesoftwaretoolsthatthedeveloperhas The specification of which actions to take is to be sup- 2.Basicassumptions The target system is a distributed system consisting of n nodes(n ≥ 1). Nodesareassumedto belabeledwith a uniquenumber in {0,...,n−1}. The backboneis a dis- tributedapplicationconsistingofntripletsoftasks: (taskD,taskI,taskR). Basically task D bears its name from the fact that it deals with the system database of the backbone, while task I manages “I’m alive” signals, and task R deals with error recovery(seefurtheron). Wecall“agent”anysuchtriplet. At initialisation time, on each node there is exactly one agent,identifiedthroughthelabelofthenodewhereitruns on. For any 0 ≤ k < n, we call “t[k]” task t of agent Figure1.Eachprocessingnodeofthesystem k.Agentsplaytworoles:coordinator(alsocalledmanager) hosts oneagent of thebackbone. Thiscom- andassistant(alsocalledbackupagentorsimply“backup”). ponent is the intermediary of the backbone Intheinitial,correctstatethereisjustonecoordinator.The onthatnode. Inparticular,itgathersinforma- choiceofwhichnodeshouldhostthecoordinator,aswellas tion from the TIRAN tools for error detection system configurationandnodelabelingisdoneatcompile and fault containment (grey circles) and for- timethroughaconfigurationscript. wards requests to those tools managing er- Figure2displaysabackbonerunningonafournodesys- ror containment and recovery (light grey cir- tem (aParsytecXplorerMIMDenginebasedonPowerPC cles). These latter execute recovery actions microprocessors). Inthiscase, nodezerohoststhecoordi- and possibly, at test time, fault injection re- natorwhilenodes1–3executeasassistants. Intheportrayed quests. situationnoerrorshavebeendetected,andthisisrendered withgreencirclesandastatuslabelequalto“OK”. Eachagent, beita coordinatororanassistant, executes anumberofcommontasks. Amongthesetaskswehave: 1. Interfacing the instances of the basic tools of the pliedbytheuserintheformofa“recoveryscript”,asortof framework—thisistobecarriedoutbytaskD. ancillaryapplicationcontextdevotedtoerrorrecoverycon- cerns, consisting of a number of guarded commands: the 2. Organizing/maintaining data gathered from the executionofblocksofbasic recoveryactions, e.g., restart- instances—alsospecificoftaskD. ingagroupoftasks,rebootinganodeandsoforth,isthen subject to the evaluation of boolean clauses based on the 3. Errorrecoveryandreconfigurationmanagement(task currentcontentsofthe system database—forfurtherinfor- R). mationonthissubjectsee[9,11]. 4. Self-check. This takes place througha distributed al- Clearly a key point in this scheme is guaranteeing that gorithm that is executed by task D and task I in all thebackboneitselftoleratesinternalaswellexternalfaults. agentsinthecasethatn>1. Inthisarticlewedescribeoneofthemeansthathavebeen designedwithintheTIRANbackbonetofulfillthisgoal: a This article does not cover points 1–3; in particular distributedalgorithmfor toleratingcrash failurestriggered points1and2are dealtwithasreportedin[10], whilethe byfaultsaffectingatmostallbutoneofthecomponentsof issue of errorrecoveryandreconfigurationmanagementis the backbone. We call this the algorithm of mutualsuspi- describedin[9].Inthefollowing,wedescribethealgorithm cion.Asin[6],weassumeatimedasynchronousdistributed mentionedatpoint4. Asthekeypointofthisalgorithmis systemmodel[5]. Furthermore,weassumetheavailability the fact that the coordinator and assistants mutually ques- ofasynchronouscommunicationmeans. Noatomicbroad- tiontheir state, we callthisthe algorithmofmutualsuspi- castprimitiveisrequired. cion(AMS). Figure2. A Netscape browser offersaglobal view of the TIRAN backbone: number, role, and state of each component is displayed. “DIR net” is a nickname for the backbone. “OK” means that no faults have been de- Figure3.Arepresentationofthealgorithmof tected. When the user selects a component, themanager. furtherinformationrelatedtothatcomponent is displayed. Thesmall icon on bottomlinks toapagewithinformationrelatedtoerrorre- covery. sistantisalive)messageaspartoftheirtaskD. • Foreach0≤k <n,taskD[k]periodicallysetsaflag. ThisflagisperiodicallyclearedbytaskI[k]. 3. The algorithm ofmutual suspicion • If, for any valid j, task I[j] finds that the flag has This Section describes AMS. Simulation and testing not been set during the period just elapsed, task I[j] show that AMS is capable of tolerating crash failures of broadcastsaTEIF(thisentityisfaulty)message. up to n −1 agents (some or all of which may by caused byanodecrash). Thisimpliesfail-stopbehaviour,thatcan • Whenanyagent,saytheagentonnodeg,doesnotre- bereachedinhardware,e.g.,byarchitecturesbasedondu- ceiveanytimelymessagefromanotheragent,saythe plicationandcomparison[14],andcoupledwithothertech- agent on node f, be that a MIA or a TAIA message, niqueslike,e.g.,controlflowmonitoring[15]ordiversifica- thenagentgentersaso-calledsuspicion-periodbyset- tionandcomparison[13].Ifacoordinatororitsnodecrash, tingflagsus[f]. Thisstateleadstothreepossiblenext a non-crashedassistant becomes coordinator. Whenevera states,correspondingtotheseevents: crashedagentisrestarted,possiblyafteranodereboot,itis inserted againin the backboneas an assistant. Let us first 1. Agentg receivesaTEIFfromtaskI[f]withina sketchthestructureofAMS.Letmbethenodewherethe specifictimeperiod,saytclockticks. coordinatorfirstrunson.Inshort: 2. Agent g receives a (late) TAIA from task D[f] withintclockticks. • ThecoordinatorperiodicallybroadcastsaMIA(man- ageris alive) message as partof its task D, the assis- 3. No message is receivedby agentg fromneither tantsperiodicallysendthecoordinatoraTAIA(thisas- taskI[f]nortaskD[f]withintclockticks. Figure5.Arepresentationoftheconjointac- tion of task I and task D on the “I’m alive” flag. suchthat,foranytime-relatedclausec∈C, Figure4.Arepresentationofthealgorithmof a(c)=message“clausechaselapsed” ∈M. theassistant. Task A[j] monitors a set of time-related clauses and, each time one of them occurs, say clause c, it sends task D[j] message a(c). Messages are sent via asynchronous State1correspondstodeduction“agentonnodef has primitivesbasedonmailboxes. crashed, though node f is still operational”. State 2 In particular, the coordinator, which we assume to run translates into deduction “both agent on node f and initiallyonnodem,instructsitstaskAsothatthefollowing nodef areoperational,thoughforsomereasonagentf clausesbemanaged: oritscommunicationmeanshavebeensloweddown”. State 3 is the detection of a crash failure for node f. • (MIA SEND, j, MIA SENDTIMEOUT), j different Thesedeductionsleadtoactionsaimingatrecovering fromm: everyMIA SENDTIMEOUTclock ticks, a agent f or (if possible) the whole node f, possibly messageoftype(MIA SEND,j)shouldbesenttotask electing a new coordinator. In the present version of D[m],i.e.,taskDonthecurrentnode. Thislatterwill AMS,the electionalgorithmissimplycarriedoutas- respondtosucheventbysendingeachtaskD[j]aMIA sumingthenextcoordinatortobetheassistantonnode (managerisalive)message. m+1modn. • (TAIA RECV, j, TAIA RECVTIMEOUT) for each j different from m: every TAIA RECVTIMEOUT Ascomplianttothetimedasynchronousdistributedsys- clock ticks at most, a message of type TAIA is to be tem model, we assume the presence of an alarm manager received from task D[j]. The arrival of such a mes- task (task A) on each node of the system, spawned at ini- sage or of any other “sign of life” from task D[j] tialization time by the agent. Task A is used to translate translates also in renewing the corresponding alarm. time-relatedclauses,e.g.,“tclocktickshaveelapsed”,into On the other hand, the arrival of a message of type message arrivals. Letus call task A[j] the task A running (TAIA RECV, k), k different from m, sent by task on node j, 0 ≤ j < n. Task A may be represented as a A[m],warnstaskD[m]thatassistantonnodeksentno function sign of life throughout the TAIA RECVTIMEOUT- clock-tickperiodjustelapsed. Thismakestask D[m] a:C →M setitsflagsus[k]. • (I’M ALIVE SET,m, I’M ALIVE SETTIMEOUT): every I’M ALIVE SETTIMEOUT clock ticks, task A[m] sendstaskD[m]anI’M ALIVE SET message. As a response to this, task D[m] sets the “I’m alive” flag,amemoryvariablethatissharedbetweentasksof typeDandI. Furthermore,wheneverflagsus[k]isset,foranykdiffer- entfromm,thefollowingclauseissenttotaskAforbeing managed: • (TEIFRECV,k,TEIFRECVTIMEOUT):thisclause simplyaskstaskA[m]toschedulethesendingofmes- sage(TEIFRECV, k)to taskD[m] afterTEIFRECV TIMEOUTclockticks.Thisactioniscanceledshould alateTAIAmessagearrivetotaskD[m]fromtaskk, or should a TEIF message from task I[k] arrive in- stead. Inthefirstcase,sus[k] isclearedand(possibly empty) actions corresponding to a slowed down task D[k] are taken. In the latter case, task D[k] is as- sumed to have crashed, its clauses are removed from thelistofthosemanagedbytaskA[m],andflagsus[k] iscleared.ItisassumedthattaskI[k]willtakecarein thiscaseofrevivingtaskD[k]. Anyfuturesignoflife from task D[k] is assumed to mean that task D[k] is backinoperation.InsuchacasetaskD[k]wouldthen be re-entered in the list of operationalassistants, and task D[m] would then request task A[m] to include again an alarm of type (MIA SEND, k, MIA SEND TIMEOUT) and an alarm of type (TAIA RECV, k, TAIA RECVTIMEOUT)initslist. Ifa(TEIFRECV, k)messagereachestaskD[m],theentirenodekisas- sumed to have crashed. Node recovery may start at this point, if available, or a warning message should besenttoanexternaloperatorsothat,e.g.,nodek be rebooted. Similarlyanyassistant,saytheoneonnodek,instructs itstaskAsothatthefollowingclausesbemanaged: • (TAIA SEND, m, TAIA SENDTIMEOUT): every TAIA SENDTIMEOUT clock ticks, a message of type (TAIA SEND, m) is to be sent to task D[k], i.e., task D on the current node. This latter will respond to such event by sending task D[m] (i.e., the manager) a TAIA (this agent is alive) message. Should a data message be sent to the manager in the middle of TAIA SENDTIMEOUT-clock-tick period, alarm (TAIA SEND, m, TAIA SENDTIMEOUT) is renewed. This may happen for instance because one ofthebasicTIRANtoolsforerrordetectionreportsan eventtotaskD[k].Sucheventmustbesenttotheman- Figure6.Pseudo-codeofthecoordinator. ager for it to update its database. In this case we say thattheTAIAmessageissentinpiggybackingwiththe eventnotificationmessage. • (MIA RECV, m, MIA RECVTIMEOUT): every MIA RECVTIMEOUTclockticksatmost,amessage oftypeMIAistobereceivedfromtaskD[m],i.e.,the manager.Thearrivalofsuchamessageorofanyother “sign of life” from the manager translates also in re- newingthecorrespondingalarm. Ifamessageoftype (MIA RECV,m),isreceivedfromtaskD[k]andsent bytask A[k], this meansthatno sign of life hasbeen receivedfromthemanagerthroughouttheMIA RECV TIMEOUT-clock-tickperiodjustelapsed.Thismakes taskD[k]setflagsus[m]. • (I’M ALIVE SET, k, I’M ALIVE SETTIMEOUT): every I’M ALIVE SETTIMEOUT clock ticks, task A[k] sends task D[k] an I’M ALIVE SET message. As a response to this, task D[k] sets the “I’m alive” flag. Furthermore,wheneverflagsus[m]isset, thefollowing clauseissenttotaskA[k]forbeingmanaged: • (TEIFRECV, m, TEIFRECVTIMEOUT): this clause simply asks task A[k] to postpone sending message(TEIFRECV,k)totaskD[k]ofTEIFRECV TIMEOUT clock ticks. This action is canceled should a late MIA message arrive to task D[k] from the manager, or should a TEIF message from task I[m]arriveinstead. Inthefirstcase,sus[m]iscleared andpossiblyemptyactionscorrespondingtoaslowed downmanageraretaken. Inthelattercase,taskD[m] isassumedtohavecrashed,itsclauseisremovedfrom thelistofthosemanagedbytaskA[k],andflagsus[m] iscleared. Itisassumedthattask I[m] willtakecare inthiscaseofrevivingtaskD[m]. Anyfuturesignof lifefromtaskD[m]isassumedtomeanthattaskD[m] isbackin operation. Insuch a case taskD[m] would be demotedto the role of assistant and enteredin the list of operational assistants. The role of coordinator wouldthenhavebeenassigned, via anelection, to an agentformerlyrunningasassistant. Ifa(TEIFRECV,m)messagereachestaskD[k],theen- tirenodeofthemanagerisassumedtohavecrashed. Node recoverymaystartatthispoint. Anelectiontakesplace— thenextassistant(modulon)iselectedasnewmanager. Also task I oneachnode,say nodek, instructsits task Asothatthefollowingclausebemanaged: • (I’M ALIVECLEAR, k, I’M ALIVECLEAR TIMEOUT): every I’M ALIVECLEARTIMEOUT clock ticks task A[k] sends task I[k] an I’M ALIVE CLEAR message. As a response to this, task I[k] clearsthe“I’malive”flag. Figure7.Pseudo-codeoftheassistant. Figures 3, 4, and 5 supply a pictorial representation of this algorithm. Figure 6 and Fig. 7 respectively show a pseudo-codeofthecoordinatorandoftheassistant. 3.1.Thealarmmanagerclass resilience. The backbone is currently being implemented fortargetplatformsbasedonWindowsCE,VxWorks, and This section briefly describes task A. This task makes TEX [2]. A GSPN modelof the algorithm of mutualsus- useofaspecialclasstomanagelistsofalarms[8].Theclass picionhasbeendevelopedbytheUniversityofTurin,Italy, allowsthe clientto “register”alarms, specifyingalarm-ids andhasbeenusedtovalidateandevaluatethesystem.Sim- anddeadlines. ulationsofthismodelprovedtheabsenceofdeadlocksand Oncethefirstalarmisentered,thetaskmanagingalarms livelocks. Measurementsoftheoverheadsinfaultfreesce- creates a linked-list of alarms and polls the top of the list. nariosandwhenfaultsoccurwillalsobecollectedandanal- For each new alarm to be inserted, an entry in the list is ysed. foundandthelistismodifiedaccordingly. Ifthetopentry expires,auser-definedalarmfunctionisinvoked. Thisisa Acknowledgments. Thisprojectispartlysupportedbyan generalmechanismthatallowsto associate anyeventwith FWOKredietaanNavorsersandbytheESPRIT-IVproject theexpiringofanalarm. Inthecaseofthebackbone,task 28620“TIRAN”. AonnodeksendsamessagetotaskD[k]—thesameresult may also be achieved by sending an UNIX signal to task References D[k]. Special alarmsare defined as “cyclic”, i.e., they are automaticallyrenewedateachnewexpiration,afterinvok- [1] Anonymous. IEEEstandardforHeterogeneousInter- ingthealarmfunction.Aspecialfunctionrestartsanalarm, Connect(HIC)(Low-cost,low-latencyscalableserial i.e., it deletes and re-entersan entry. It is also possible to interconnectforparallelsystemconstruction).Techni- temporarilysuspendanalarmandre-enableitafterwards. calReport1355-1995(ISO/IEC14575),IEEE,1995. [2] Anonymous. TEXUserManual. TXTIngegneriaIn- 4. Current status andfuture directions formatica,Milano,Italy,1997. AprototypalimplementationoftheTIRANbackboneis [3] K. P. Birman. Replication and fault tolerance in running on a Parsytec Xplorer, a MIMD engine, using 4 the Isis system. ACM Operating Systems Review, PowerPC nodes. The system has been tested and proved 19(5):79–86,1985. to be able to tolerate a numberof software-injectedfaults, [4] Oliver Botti, Vincenzo De Florio, Geert Decon- e.g., componentand node crashes (see Fig. 8 and Fig. 9). inck,SusannaDonatelli,AndreaBobbio,AxelKlein, Faults are scheduled as another class of alarmsthat, when H.Kufner,RudyLauwereins,E.Thurner,andE.Ver- triggered, send a fault injection message to the local task hulst. TIRAN: Flexible and portable fault tolerance D or task I. The specification of which fault to inject is solutions for cost effective dependable applications. readbythebackboneatinitialisationtimefromafilecalled In P. Amestoy et al., editors, Proc. of the 5th Int. “.faultrc”. The user can specify fault injectionsby editing Euro-ParConference (EuroPar’99),Lecture Notes in thisfile,e.g.,asfollows: Computer Science, volume 1685, pages 1166–1170, INJECTCRASHONCOMPONENT1 Toulouse,France, August/September1999.Springer- AFTER5000000TICKS Verlag,Berlin. INJECTCRASHONNODE0 [5] Flaviu Cristian and Christof Fetzer. The timed asyn- AFTER10000000TICKS. chronous distributed system model. IEEE Trans. ThefirsttwolinesinjectacrashontaskD[1]after5seconds on Parallel and Distributed Systems, 10(6):642–657, fromtheinitialisationofthebackbone,thesecondonesin- June1999. jectasystemrebootofnode0after10seconds.Viafaultin- [6] FlaviuCristianandFrankSchmuck.Agreeingonpro- jectionitisalsopossibletoslowdownartificiallyacompo- cessorgroupmembershipinasynchronousdistributed nentforagivenperiod.Sloweddowncomponentsaretem- systems. TechnicalReportCSE95-428,UCSD,1995. porarilyandautomaticallydisconnectedandthenaccepted againintheapplicationwhentheirperformancegoesback [7] VincenzoDeFlorio,GeertDeconinck,MarioTruyens, tonormalvalues.ScenariosarerepresentedintoaNetscape Wim Rosseel and Rudy Lauwereins. A hyperme- windowwhereamonitoringapplicationdisplaysthestruc- dia distributed application for monitoring and fault- ture of the user application, mapsthe backboneroles onto injection in embedded fault-tolerant parallel pro- the processingnodesof the system, andconstantlyreports grams. In Proc. of the 6th Euromicro Workshop on abouttheeventstakingplaceinthesystem[7]. ParallelandDistributedProcessing(PDP’98),pages This system has been recently redeveloped so to en- 349–355,Madrid, Spain, January1998.IEEEComp. hance its portability and performance and to improve its Soc.Press. Figure 8. A fault is injected on node 0 of a system of four nodes. Node 0 hosts the manager of the backbone. Inthetopleftpicturetheuserselectsthefaulttobeinjectedandconnectstoaremotely controllable Netscape browser. The top right picture shows this latter as it renders the shape and state of the system. The textual window reports the current contents of the list of alarms used by task A[0]. In thebottomleft picturethecrash of node0has been detected and a newmanager has beenelected. Onelection,themanagerexpectsnode0tobebackinoperationafterarecoverystep. Thisrecovery step isnotperformed in thiscase. As aconsequence,node0isdetected asinactive andlabeledas“KILLED”. [8] Vincenzo De Florio, Geert Deconinck, and Rudy tems, volume2, pages98–104,Milan, Italy, Septem- Lauwereins. Atime-outmanagementsystemforreal- ber1999.IEEEComp.Soc.Press. time distributed applications. Submitted for publica- [11] Geert Deconinck, Mario Truyens, Vincenzo De Flo- tionin IEEETrans.onComputers. rio,WimRosseel,RudyLauwereins,andRonnieBel- mans. Aframeworkbackboneforsoftwarefaulttoler- [9] Vincenzo De Florio, Geert Deconinck, and Rudy anceinembeddedparallelapplications.InProc.ofthe Lauwereins. The recovery language approach for 7thEuromicro Workshop onParallelandDistributed software-implementedfault tolerance. Submitted for Processing (PDP’99), pages 189–195, Funchal, Por- publication in ACM Transactions on Computer Sys- tugal,February1999.IEEEComp.Soc.Press. tems. [12] YennunHuang and ChandraM.R. Kintala. Software fault tolerance in the application layer. In Michael [10] Geert Deconinck, Vincenzo De Florio, Rudy Lauw- Lyu, editor, Software Fault Tolerance, chapter 10, ereins,andRonnieBelmans.Asoftwarelibrary,acon- pages231–248.JohnWiley&Sons,NewYork,1995. trol backbone and user-specified recovery strategies to enhance the dependability of embedded systems. [13] D. Powell. Preliminary definition of the GUARDS In Proc. of the 25th Euromicro Conference (Euromi- architecture. TechnicalReport96277,LAAS-CNRS, cro ’99), Workshop on Dependable Computing Sys- January1997. Figure9.When theuserselectsacircular icon intheWeb pageof Fig.2,thebrowserreplieswitha listingofalltheeventsthattookplaceonthecorrespondingnode. Here,alistofeventsoccurredon taskD[1]duringtheexperimentshowninFig.8isdisplayed. Eventsarelabeledwithanevent-idand with the time of occurrence (in seconds). Note in particular event 15, corresponding to deduction “anodehascrashed”,and events16–18,in whichtheelectionofthemanagertakesplaceandtask D[1] takes over the role of the former manager, task D[0]. Restarting as manager, task D[1] resets thelocalclock. [14] D.K.Pradhan. Fault-TolerantComputerSystemsDe- sign. Prentice-Hall,UpperSaddleRiver,NJ,1996. [15] M. Schuette and J. P. Shen. Processor control flow monitoringusingsignaturedinstructionstreams.IEEE Trans.onComputers,36(3):264–276,March1987.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.