NASA Langley's Research and Technology-Transfer Program in Formal Methods

Ricky W. Butler, James L. Caldwell, Victor A. Carreño, C. Michael Holloway, Paul S. Miner
Assessment Technology Branch
NASA Langley Research Center
Hampton, Virginia

Ben L. Di Vito
VíGYAN Inc.
Hampton, Virginia

Abstract

This paper presents an overview of NASA Langley's research program in formal methods. The major goals of this work are to make formal methods practical for use on life critical systems, and to orchestrate the transfer of this technology to U.S. industry through use of carefully designed demonstration projects. Several direct technology transfer efforts have been initiated that apply formal methods to critical subsystems of real aerospace computer systems. The research team consists of five NASA civil servants and contractors from Odyssey Research Associates, SRI International, and VíGYAN Inc.

1 Rationale For Formal Methods Research Program

NASA Langley Research Center has been developing techniques for the design and validation of flight critical systems for over two decades. Although much progress has been made in developing methods to accommodate physical failures, design flaws remain a serious problem [1, 2, 3, 4, 5, 6, 7]. A 1991 report by the National Center For Advanced Technologies (footnote 1) identified "Provably Correct System Specification" and "Verification Formalism For Error-Free Specification" as key areas of research for future avionics software and ultrareliable electronics systems [8].

Footnote 1: A technical council funded by the Aerospace Industries Association of America (AIA) that represents the major U.S. aerospace companies engaged in the research, development and manufacture of aircraft, missiles and space systems, and related propulsion, guidance, control and other equipment.

1.1 Why Formal Methods Are Necessary

Digital systems (both hardware and software) are notorious for their unpredictable and unreliable behavior:

    Studies have shown that for every six new large-scale software systems that are put into operation, two others are cancelled. The average software development project overshoots its schedule by half; larger projects generally do worse. And three quarters of all large systems are "operating failures" that either do not function as intended or are not used at all.

    Despite 50 years of progress, the software industry remains years -- perhaps decades -- short of the mature engineering discipline needed to meet the demands of an information-age society [6].

Lauren Ruth Wiener describes the software problem in her book, Digital Woes: Why We Should Not Depend Upon Software:

    Software products -- even programs of modest size -- are among the most complex artifacts that humans produce, and software development projects are among our most complex undertakings. They soak up however much time or money, however many people we throw at them.

    The results are only modestly reliable. Even after the most thorough and rigorous testing some bugs remain. We can never test all threads through the system with all possible inputs [5].

The hardware industry also faces serious difficulties, as evidenced by the recent design error in the Pentium floating point unit. In response to an outcry over the design flaw in the Pentium floating point unit, Intel's President, Andy Grove, wrote on the comp.sys.intel Internet bulletin board:

    After almost 25 years in the microprocessor business, I have come to the conclusion that no microprocessor is ever perfect; they just come closer to perfection with each stepping. In the life of a typical microprocessor, we go thru [sic] half a dozen or more such steppings....

In a recent Washington Post article, Michael Schrage wrote:

    Pentium type problems will prove to be the rule -- rather than the isolated, aberrant exceptions -- as new generations of complex hardware and software hit the market. More insidious errors and harmful bugs are inevitable. That is the new reality [9].

For life critical systems, errors may mean disaster. The potential for errors is high, because these systems must not only perform their functions correctly, but also must be able to recover from the effects of failing components. Often the physical fault tolerance features of these systems are more complex and susceptible to design errors than any of the basic functions of the system. John Rushby writes:

    Organization of redundancy and fault-tolerance for ultra-high reliability is a challenging problem: redundancy management can account for half the software in a flight control system and, if less than perfect can itself become the primary source of system failure [10].

In a comprehensive assessment of formal methods [11], John Rushby discusses several notorious examples of such failures. These include the following:

• The asynchronous operation of the AFTI-F16 and sensor noise led each channel to declare the other channels failed in flight test 44. The plane was flown home on a single channel. Other potentially disastrous bugs were detected in flight tests 15 and 36.

• The HiMAT crash landed without its landing gear due to a design flaw. The problem was traced to a timing change in the software that had survived extensive testing.

• A bug in the YC-14 redundancy management was found during flight test. The bug caused a large mistracking between redundant channels.

• In flight tests of the X31, the control system went into a reversionary mode four times in the first nine flights, usually due to a disagreement between the two air data sources.

• The nationwide saturation of the AT&T switching systems on January 15, 1990 was caused by a timing problem in a fault-recovery mechanism.

• The first Shuttle mission (STS-1) was scrubbed because the fifth backup computer could not be synchronized with the other four.

Three basic strategies are advocated for handling design flaws in life critical systems:

1. Testing (Lots of it)

2. Design Diversity (i.e. software fault tolerance: N-version programming, recovery blocks, etc.)

3. Fault Avoidance (i.e. formal specification and verification, automatic program synthesis, reusable modules)

The problem with life testing is that in order to measure ultrareliability one must test for exorbitant amounts of time. For example, to measure a $10^{-9}$ probability of failure for a 1 hour mission one must test for more than $10^{9}$ hours (114,000 years).

The basic idea behind design diversity is to use separate design and implementation teams to produce multiple versions from the same specification. At run-time, non-exact threshold voters are used to mask the effect of a design error in one of the versions. The hope is that the design flaws will manifest errors independently or nearly so. By assuming independence, one can obtain ultrareliable-level estimates of system reliability, even with failure rates for the individual versions on the order of $10^{-4}$/hour. Unfortunately, the independence assumption has been rejected at the 99% confidence level in several experiments for low reliability software [12, 13].

Furthermore, the independence assumption cannot be validated for high reliability software because of the exorbitant test times required. If one cannot assume independence then one must measure correlations. This is infeasible as well; it requires as much testing time as life-testing the system, because the correlations must be in the ultrareliable region in order for the system to be ultrareliable. Therefore, it is not possible, within feasible amounts of testing time, to establish that design diversity achieves ultrareliability. Consequently, design diversity can create an "illusion" of ultrareliability without actually providing it. For a more detailed discussion, see [14].

We believe that formal methods offer the only intellectually defensible method for handling design faults. Since the often quoted $1 - 10^{-9}$ reliability is clearly beyond the range of quantification, we have no choice but to develop life critical systems in the most rigorous manner available to us, which is use of formal methods.

1.2 What are Formal Methods

Engineering relies heavily on mathematical models and calculation to make judgments about designs. This is in stark contrast to the way in which software systems are designed -- with ad hoc technique and after-implementation testing. Formal methods bring to software and hardware design the same advantages that other engineering endeavors have exploited: mathematical analysis based on models. Formal methods are used to specify and model the behavior of a system and to formally verify that the system design and implementation satisfy functional and safety properties. In principle, these techniques can produce error-free design; however, this requires a complete verification from the requirements down to the implementation, which is rarely done in practice.

Thus, formal methods are the applied mathematics of computer systems engineering. They serve a similar role in computer design as Computational Fluid Dynamics (CFD) plays in aeronautical design, providing a means of calculating and hence predicting what the behavior of a digital system will be prior to its implementation.

The tremendous scientific potential of formal methods has been recognized by theoreticians for a long time, but the formal techniques have remained the province of a few academicians, with only a few exceptions such as the Transputer [15] and the IBM CICS project [16]. The first five years of NASA Langley's program have advanced the capabilities of formal methods to the point where commercial exploitation is near.

There are many different types of formal methods with various degrees of rigor. The following is a useful (first-order) taxonomy of the degrees of rigor in formal methods:

Level-1: Formal specification of all or part of the system.

Level-2: Formal specification at two or more levels of abstraction and paper and pencil proofs that the detailed specification implies the more abstract specification.

Level-3: Formal proofs checked by a mechanical theorem prover.

Level 1 represents the use of mathematical logic, or a specification language that has a formal semantics, to specify the system. This can be done at several levels of abstraction. For example, one level might enumerate the required abstract properties of the system, while another level describes an implementation that is algorithmic in style.

Level 2 formal methods go beyond Level 1 by developing pencil-and-paper proofs that the concrete levels logically imply the abstract, property-oriented levels. Level 3 is the most rigorous application of formal methods. Here one uses a semi-automatic theorem prover to make sure that all of the proofs are valid. The Level 3 process of convincing a mechanical prover is really a process of developing an argument for an ultimate skeptic who must be shown every detail.

It is important to realize that formal methods are not an all-or-nothing approach. The application of formal methods to the most critical portions of a system is a pragmatic and useful strategy. Although a complete formal verification of a large complex system is impractical at this time, a great increase in confidence in the system can be obtained by the use of formal methods at key locations in the system. For more information on the basic principles of formal methods, see [17].

2 Goals of Our Program, Strategy, and Research Team

The major goals of the NASA Langley research program are to make formal methods practical for use on life critical systems developed in the United States, and to orchestrate the transfer of this technology to industry through use of carefully designed demonstration projects. Our intention is to concentrate our research efforts on the technically challenging areas of digital flight-control systems design that are currently beyond the state-of-the-art, while initiating demonstration projects in problem domains where current formal methods are adequate. The challenge of the demonstration projects should not be underestimated. That which is feasible for experts that have developed the tools and methods is often difficult for practitioners in the aerospace industry. There is often a long "learning curve" associated with the tools, the tools are not production-quality, and the tools have few or no examples for specific problem domains. Therefore, we are setting up cooperative efforts between industry and the developers of the formal methods to facilitate the technology transfer process.

This strategy leverages the huge investment of ARPA and the National Security Agency in development of tools and concentrates on the problems specific to the aerospace problem domain. NASA Langley has not sponsored the development of any general-purpose theorem provers. However, the technology transfer projects have led to significant improvements in the Prototype Verification System (PVS) theorem prover [10] that SRI International (SRI) is developing. Several domain-specific tools are being sponsored: (1) Tablewise, (2) VHDL-analysis tool, and (3) DRS. These tools are discussed in later sections.

It is also important to realize that formal methods include a large class of mathematical techniques and tools. Methods appropriate for one problem domain may be totally inappropriate for other problem domains. The following are some of the specific domains in which our program has concentrated: (1) architectural-level fault tolerance, (2) clock-synchronization, (3) interactive consistency, (4) design of hardware devices such as microprocessors, memory management units, DMA controllers, (5) asynchronous communication protocols, (6) design and verification of application-specific integrated circuits (ASICS), (7) Space Shuttle software, (8) navigation software, (9) decision tables, (10) railroad signaling systems.

We are also interested in applying formal methods to many different portions of the life-cycle, such as (1) requirements analysis, (2) high-level design, (3) detailed design, and (4) implementation.

Often, there is a sizable effort associated with the development of the background mathematical theories needed for a particular problem domain. Although such theories are reusable and in the long run can become "cost-effective", the initial costs can be a deterrent for industry. Therefore, one of the goals of the NASA Langley program is to build a large body of background theories needed for aerospace applications.

We also have been involved with standards activities in order to strengthen the United States commitment to safety.

2.1 Technology Transfer

The key to successful technology transfer is building a cooperative partnership with a customer. In order for this partnership to work, NASA Langley must become directly involved in specific problem domains of the aerospace industry (footnote 2). NASA must also effectively communicate its basic research accomplishments in a manner that reveals a significant potential benefit to the aerospace community. Equally important is the need for industry to make an investment to work together with NASA on joint projects to devise demonstration projects that are realistic and practical. The ultimate goal of our technology transfer process is for formal methods to become the "state-of-the-practice" for U.S. industry development of ultrareliable digital avionics systems. However, before we can develop new tools and techniques suitable for adoption by industry, we must work with the system developers in industry to understand their needs. We must also overcome the natural skepticism that industry has of any new technology.

Footnote 2: To date, our efforts have concentrated on the aerospace industry, but we are actively seeking partners from other industries also.

Our basic approach to technology transfer is as follows. The first step is to find an industry representative who has become interested in formal methods, believes that there is a potential benefit of such methods, and is willing to work with us. The next step is to fund our formal methods research team to apply formal methods to an appropriate example application. This process allows the industry representative to see what formal methods are and what they have to offer, and it allows us (the formal methods team) to learn the design and implementation details of state-of-the-practice components so we can better tailor our tools and techniques to industry's needs. If the demonstration project reveals a significant potential benefit, the next stage of the technology transfer process is for the industry representative to initiate an internal formal methods program, and begin a true cooperative partnership with us.

Another important part of our technology transfer strategy is working with the Federal Aviation Administration (FAA) to update certification technology with respect to formal methods. If the certification process can be redefined in a manner that awards credit for the use of formal methods, a significant step towards the transfer of this technology to the commercial aircraft industry will have been accomplished.

Langley has also been sponsoring a series of workshops on formal methods. The first workshop, held in August 1990, focused on building cooperation and communication between U.S. formal methods researchers [18]. The second, held in August 1992, focused on education of the U.S. aerospace industry about formal methods [19]. A third workshop will be held in May 1995.

Another component of our technology transfer strategy is to use NASA's Small Business Innovative Research (SBIR) program to assist small businesses in the development of commercially viable formal methods tools and techniques. The first contracts under the program began in early 1994.

Finally, to facilitate technology transfer, information on NASA Langley's formal methods research is available on the Internet via either anonymous FTP or World Wide Web. PostScript and DVI versions of many research papers are available through anonymous FTP on machine deduction.larc.nasa.gov (IP address: 128.155.18.16) in directory pub/fm. This directory, and much more information, is also available through World Wide Web, using the following Uniform Resource Locator:

    http://atb-www.larc.nasa.gov/fm-top.html

2.2 FAA/RTCA Involvement

As the federal agency responsible for certification of civil air transports, the FAA shares our interest in promising approaches to engineering and validating ultrareliable flight-control systems. Additionally, because the FAA must approve any new methodologies for developing life critical digital systems for civil air transports, their acceptance of formal methods is a necessary precursor to its adoption by industry system designers. We are working with Pete Saraceni of the FAA Technical Center and Mike DeWalt, FAA National Resource Specialist for Software, to ensure that our program is relevant to the certification process. The FAA has co-sponsored some of our work. John Rushby of SRI gave a tutorial on formal methods at an FAA Software Advisory Team (SWAT) meeting at their request. The SWAT team suggested that we include an assessment of formal methods in an ongoing Guidance Control Software (GCS) experiment in our branch; Odyssey Research Associates (ORA) developed a formal specification of the GCS application.

John Rushby has written a chapter for the FAA Digital Systems Validation Handbook Volume III on formal methods [20]. The handbook provides detailed information about digital system design and validation and is used by the FAA certifiers. In preparation for this chapter, Rushby produced a comprehensive analysis of formal methods [11].

George Finelli, the former assistant Branch Head of the System Validation Methods Branch (the Branch in which the formal methods team worked before NASA Langley's reorganization in 1994) and a member of the RTCA committee formed to develop DO-178B, together with Ben Di Vito (VíGYAN Inc.), was instrumental in including formal methods as an alternate means of compliance in the DO-178B standard.

Currently, members of the Langley staff are involved in RTCA committees SC-180 (Airborne Electronic Hardware) and SC-182 (Minimal Operating Performance Standard for an Airborne Computer Resource).

2.3 Team

The Langley formal methods program involves both local researchers and industrial/academic researchers working under contract to NASA Langley. Currently the local team consists of five civil servants and one contractor (VíGYAN Inc.). The lead NASA Langley formal methods researcher, Ricky W. Butler, may be contacted through electronic mail to [email protected].

NASA Langley has recently awarded two five-year task-assignment contracts specifically devoted to formal methods (from the competitive NASA RFP 1-132-DIC.1021). The selected contractors were SRI International (SRI) and Odyssey Research Associates (ORA). This was a follow-on contract from the previous competitive contract that had awarded three contracts to SRI, ORA, and Computational Logic Inc. (CLI).

3 Current Technology Development and Transfer Projects

3.1 AAMP5/AAMP-FV Project

In 1993, NASA Langley initiated a joint project involving Collins Commercial Avionics and SRI International. The goal was to investigate the application of formal techniques to a commercial microprocessor design, the Collins AAMP5 microprocessor. The AAMP5 is the latest member of the CAPS/AAMP family of microprocessors and is object code compatible with the AAMP2 processor [21]. The CAPS/AAMP family of microprocessors has been widely used by the commercial and military aerospace industries. Some examples of use of earlier members of the family include: (1) Boeing 747-400 Integrated Display System (IDS), (2) Boeing 737-300 Electronic Flight Instrumentation System (EFIS), (3) Boeing 777 Flight Control Backdrive, (4) Boeing 757, 767 Autopilot Flight Director System (AFDS), and (5) military and commercial Global Positioning (GPS) Systems. The first phase of the project consisted of the formal specification of the AAMP5 instruction set and microarchitecture using SRI's PVS [22, 23]. While formally specifying the microprocessor, two design errors were discovered in the microcode. These errors were uncovered as a result of questions raised by the formal methods researchers at Collins and SRI while seeking to formally specify the behavior of the microprocessor [24]. The Collins formal methods team believes that this effort has prevented two significant errors from going into the first fabrication of the microprocessor.

The second phase of the project consisted of formally verifying the microcode of a representative subset of the AAMP5 instructions. Collins seeded two errors in the microcode provided to SRI in an attempt to assess the effectiveness of formal verification. Both of these errors (and suggested corrections) were discovered while proving the microcode correct [24]. It is noteworthy that both the level 2 and level 3 applications of formal methods were successful in finding bugs. Based on the success of the AAMP5 project, a new effort has been initiated with Rockwell-Collins to apply formal methods in the design level verification of a microprocessor, currently designated as AAMP-FV.

3.2 Tablewise Project

Under NASA funding, Odyssey Research Associates is working with Honeywell Air Transport Systems Division (Phoenix) to study the incorporation of formal methods into the company's software development processes. Because Honeywell uses decision tables to specify the requirements and designs for much of their software (footnote 3), ORA is developing a prototype tool, called Tablewise, to analyze the characteristics of decision tables. Tablewise uses a generalization of Binary Decision Diagrams to determine if a particular table is exclusive (for every combination of parameter values, at most one action can be chosen) and exhaustive (for every combination of parameter values, at least one action can be chosen). The tool is also capable of automatically generating documentation and Ada code from a decision table. We consider this a level 3 application of formal methods: although a general purpose prover is not used, the analysis is mechanized in a computer program.

Footnote 3: A decision table is a tabular format for defining the rules that choose a particular action to perform based on the values of certain parameters.

In 1995, ORA will develop algorithms to handle advanced analysis of decision tables. Two particular areas of analysis that will be considered are testing of additional properties of tables and techniques for efficiently handling partitioned tables. The Honeywell personnel involved in the project hope that the concepts developed in the Tablewise project can be incorporated into an industrial-strength tool that will significantly reduce the effort required to develop new software.

3.3 Union Switch and Signal

As part of a joint research agreement, NASA Langley formal methods researchers are collaborating with engineers at Union Switch and Signal (US&S) to use formal methods in the design of railway switching and control applications. Railway switching control systems, like digital flight control systems, are safety critical systems. US&S is the leading U.S. supplier of railway switching control systems. Their Advanced Technology Group, led by Dr. Joseph Profeta, has applied formal methods in past efforts and turned to NASA for expertise in integrating these techniques into their next generation products.

The initial project, started in 1993, was a cooperative effort between NASA, US&S, and Odyssey Research Associates. The result of this first year's work was a formal mathematical model of a railway switching network, defined in two levels. The top level of the model provides the mechanisms for defining the basic concepts: track, switches, trains and their positions and control liners of a train (i.e. how far down the track it has clearance to travel.) The second level is a formalization of the standard scheme used in railroad control, the block model control system. A level 2 proof that the fixed block control system is "safe" with respect to the top level model has been completed. Models of US&S proprietary control schemes were also formulated.

The European formal methods community has addressed safety properties of certain components of railroad control systems, but the work there has typically been at lower levels. The cooperative work with US&S is unique in that a high level model of a railroad system has been described and used to analyze the safety of various control schemes.

The next phase of the collaborative effort will concentrate on formal modeling and analysis of the fault-tolerant core of US&S's next generation fail-stop control architecture.

3.4 Space Applications

A team spread across three NASA centers has been formed to study the application and technology transfer of formal methods to NASA space programs. A consortium of researchers and practitioners from LaRC, JSC, and JPL, together with support from Loral Space Information Systems, SRI International, and VíGYAN Inc., has been actively pursuing this objective since late 1992. The near term goal is to define and carry out pilot projects using portions of existing large-scale space programs. The long term goal is to enable organizations such as Loral to reduce formal methods to practice on programs of national importance.

The NASA Formal Methods Demonstration Project for Space Applications focuses on the use of formal methods for requirements analysis because the team believes that formal methods are more practically applied to requirements analysis than to late-lifecycle development phases [25]. A series of trial projects was conducted and cost effectiveness data were collected. The team's efforts in 1993 were concentrated on a single pilot project (discussed in a subsequent section), while efforts beginning in 1994 have been more diffuse.

NASA Langley's primary role in 1994 included support for two Space Shuttle software change requests (CR). One CR concerns the integration of new Global Positioning System (GPS) functions while the other concerns a new function to control contingency aborts known as Three Engine Out (3 E/O). Both of these tasks involve close cooperation among formal methods researchers at NASA Langley, VíGYAN Inc., and SRI International with requirements analysts from Loral Space Information Systems.

The Space Shuttle is to be retrofitted with GPS receivers in anticipation of the TACAN navigation system being phased out by the DoD. Additional navigation software will be incorporated to process the position and velocity vectors generated by these receivers. A decision was made to focus the trial formal methods task on just a few key areas because the CR itself is very large and complex. A set of preliminary formal specifications was developed for the new Shuttle navigation principal functions known as GPS Receiver State Processing and GPS Reference State Processing, using the language of SRI's Prototype Verification System (PVS). While writing the formal specifications, 43 minor discrepancies were detected in the CR and these have been reported to Loral requirements analysts.

The Three Engine Out (3 E/O) Task is executed each cycle during powered flight until either a contingency abort maneuver is required or progress along the powered flight trajectory is sufficient to preclude a contingency abort even if three main engines fail. The 3 E/O task consists of two parts: 3 E/O Region Selection and 3 E/O Guidance. 3 E/O Region Selection is responsible for selecting the type of external tank (ET) separation maneuver and assigning the corresponding region index. 3 E/O Guidance monitors ascent parameters and determines if an abort maneuver is necessary.

We have developed and analyzed a formal model of the series of sequential maneuvers that comprise the 3 E/O algorithm. To date, 20 potential issues have been found, including undocumented assumptions, logical errors, and inconsistent and imprecise terminology. These findings are listed as potential issues pending review by the 3 E/O requirements analyst.

The GPS and 3 E/O tasks have continued into 1995. We hope to get formal methods incorporated as a requirements analysis technique for Space Shuttle software. In addition, NASA Langley contributed to a NASA guidebook under development by the inter-center team. The first volume of the guidebook is intended for managers of NASA projects who will be using formal methods in requirements analysis activities. A second volume is planned that will be aimed at practitioners. NASA will publish the first volume early in 1995, with the second volume expected by early 1996.

3.5 NASA Small Business Innovative Research Program

In 1993, a formal methods subtopic was a part of the NASA Small Business Innovative Research (SBIR) solicitation. Two proposals were selected for 6-month Phase I funding for 1994: VHDL Lightweight Tools, by Odyssey Research Associates, and DRS -- Derivation Reasoning System, A Digital Design Derivation System for Hardware Synthesis, by Derivation Systems, Inc. of Bloomington, Indiana. After the completion of the Phase I efforts, both companies were selected for continued Phase II funding. Contracts for these efforts just recently began.

4 Past Efforts

This section describes previous work in each of the following four focus areas: fault-tolerant systems, verification of software, verification of hardware devices, and civil air transport requirements specification. This section omits much of the early work described at COMPASS 91 [26].

4.1 Fault-tolerant Systems

The goal of this focus area was to create a formalized theory of fault tolerance including redundancy management, clock synchronization, Byzantine agreement, voting, etc. Much of the theory developed here is applicable to future fault-tolerant systems designs. A detailed design of a fault-tolerant reliable computing base, the Reliable Computing Platform (RCP), has been developed and proven correct. It is hoped that the RCP will serve as a demonstration of the formal methods process and provide a foundation that can be expanded and used for future aerospace applications.

The RCP architecture was designed in accordance with a system design philosophy called "Design For Validation" [27, 28]. The basic tenets of this design philosophy are as follows:

1. A system is designed in such a manner that complete and accurate models can be constructed to estimate critical properties such as reliability and performance. All parameters of the model that cannot be deduced from the logical design must be measured. All such parameters must be measurable within a feasible amount of time.

2. The design process makes tradeoffs in favor of designs that minimize the number of parameters that must be measured in order to reduce the validation cost. A design that has exceptional performance properties yet requires the measurement of hundreds of parameters (for example, by time-consuming fault-injection experiments) would be rejected over a less capable system that requires minimal experimentation.

3. The system is designed and verified using rigorous mathematical techniques. It is assumed that the formal verification makes system failure due to design faults negligible so the reliability model does not include transitions representing design errors.

4.

Thus, a major objective of this approach is to minimize the amount of experimental testing required and maximize the ability to reason mathematically about correctness of the design. Although testing cannot be eliminated from the design/validation process, the primary basis of belief in the dependability of the system must come from analysis rather than from testing.

4.1.1 The Reliable Computing Platform

The Reliable Computing Platform dispatches control-law application tasks and executes them on redundant processors. The intended applications are safety critical with reliability requirements on the order of $1 - 10^{-9}$. The reliable computing platform performs the necessary fault-tolerant functions and provides an interface to the network of sensors and actuators.

The RCP operating system provides the applications software developer with a reliable mechanism for dispatching periodic tasks on a fault-tolerant computing base that appears to him as a single ultrareliable processor. The top level of the hierarchy describes the operating system as a function that sequentially invokes application tasks. This view of the operating system will be referred to as the uniprocessor specification (US), which is formalized as a state transition system and forms the basis of the specification for the RCP. Fault tolerance is achieved by voting results computed by the replicated processors operating on the same inputs. Interactive consistency checks on sensor inputs and voting of actuator outputs require synchronization of the replicated processors. The second level in the hierarchy (RS) describes the operating system as a synchronous system where each replicated processor executes the same application tasks. The existence of a global time base, an interactive consistency mechanism and a reliable voting mechanism are assumed at this level.

Level 3 of the hierarchy (DS) breaks a frame into four sequential phases. This allows a more explicit modeling of interprocessor communication and the time phasing of computation, communication, and voting. At the fourth level (DA), the assumptions of the synchronous model are discharged through use of the interactive-convergence clock synchronization algorithm [29].

In the LE model, a more detailed specification of the activities on a local processor is presented. In particular, three areas of activity are elaborated in
The reliability (or performance) model is shown detail: (1) task dispatching and execution, (2) mini- to be accurate with respect to the system imple- malvoting,and (3) interprocessor communicationvia mentation. This is accomplished analyticallynot mailboxes. An intermediate model, DA minv, that experimentally. simpli(cid:12)edthe constructionofthe LEmodelwasused. Of primary importance in the LE speci(cid:12)cation is the use of a memorymanagementunit by the local exec- MaximumClock Skew Property utive in order to prevent the overwriting of incorrect " memorylocations while recovering fromthe e(cid:11)ects of j a transient fault. Synchronization Algorithm The top two levels of the RCP were originally for- " mallyspeci(cid:12)edinstandardmathematicalnotationand j connected via mathematical(i.e. level 2 formalmeth- DigitalCircuit Implementation ods) proof [30, 31, 32]. Under the assumption that a majority of processors is working in each frame, the proof establishes that the replicated system com- Figure 1: Hierarchical Veri(cid:12)cation of Clock Synchro- putes the same results as a single processor system nization not subject to failures. Su(cid:14)cient conditions were de- veloped that guarantee that the replicated system re- erty of the form: covers fromtransient faults withina bounded amount oftime. SRI subsequently generalized the modelsand 8 non-faulty p;q:jCp(t)(cid:0)Cq(t)j<(cid:14) constructed a mechanical proof in Ehdm [33]. Next, the local team developed the third and fourth level where (cid:14) is the maximum clock skew guaranteed by models. The top two levels and the two new models the algorithmas long as a su(cid:14)cient number of clocks (i.e. DSandDA) were then speci(cid:12)ed inEhdmandall (andthe processors theyareattachedto)areworking. ofthe proofs were done mechanicallyusing the Ehdm The function Cp(t) gives the value of clock p at real 5.2 prover [34, 35]. time t. 
The middle level in the hierarchy is a math- Both the DA minv model and the LE model ematical de(cid:12)nition of the synchronization algorithm. were speci(cid:12)ed formally and have been veri(cid:12)ed using The bottom level is a detailed digital design of a cir- the Ehdm veri(cid:12)cation system[36]. All RCP speci(cid:12)- cuitthatimplementsthe algorithm. Thebottomlevel cations and proofs are available electronically via the is su(cid:14)ciently detailed to maketranslation into silicon Internet using anonymous FTP or World Wide Web straight forward. (WWW) access. Anonymous FTP access is avail- The veri(cid:12)cation process involves two important able through the host deduction.larc.nasa.govus- steps: (1) veri(cid:12)cation that the algorithmsatis(cid:12)es the ing the path pub/fm/larc/RCP-specs. WWW ac- maximumskew property and (2) veri(cid:12)cation that the cess to the FTP directory is provided through the digital circuitry correctly implements the algorithm. NASA LangleyFormalMethods Programhome page: The (cid:12)rst step was completed by SRI International. http://atb-www.larc.nasa.gov/fm-top.html The (cid:12)rst such proof was accomplished during the de- signandveri(cid:12)cationofSIFT[29]. The proofwasdone 4.1.2 Clock Synchronization by hand in the style of journal proofs. More recently this proof step was mechanically veri(cid:12)ed using the Theredundancymanagementstrategiesofvirtuallyall Ehdm theorem prover[39, 40]. In addition, SRI me- fault-tolerantsystems depend onsomeformofvoting, chanically veri(cid:12)ed Schneider’s clock synchronization which in turn depends on synchronization. Although paradigm[41] using Ehdm[42, 43]. A further general- in many systems the clock synchronization function izationwas foundat NASA Langley[44]4. The design has not been decoupled from the applications (e.g. 
of a digital circuit to distribute clock values in sup- the redundant versions of the applications synchro- port of fault-tolerant synchronization was completed nize by messages), research and experience have led by SRI and was partially veri(cid:12)ed.5 CLI reproduced ustobelievethatsolvingthesynchronizationproblem the SRI veri(cid:12)cation of the interactive convergence al- independently from the applications design can pro- gorithmusing the Boyer-Moore theorem prover [45]. vide signi(cid:12)cant simpli(cid:12)cation of the system [37, 38]. NASA Langley researchers designed and imple- The operating system is built on top of this clock- mented a fault-tolerant clock synchronization circuit synchronization foundation. Of course, the correct- capable of recovery from transient faults [46, 47, 44]. ness of this foundation is essential. Thus, the clock The top-level speci(cid:12)cation for the design is the Ehdm synchronization algorithmand itsimplementationare 4The boundeddelay assumptionwas shown to follow from primecandidatesforformalmethods. Theveri(cid:12)cation theotherassumptionsofthetheory. strategy shown in (cid:12)gure 1 is being explored. 5UnliketheNASAcircuit,theSRIintentisthattheconver- The top-level in the hierarchy is an abstract prop- gencealgorithmbeimplementedinsoftware. veri(cid:12)cation of Schneider’s paradigm. The circuit was circuitandverifyingthecombinationhasnotyetbeen implementedwithprogrammablelogicdevices(PLDs) explored. and FOXI (cid:12)ber optic communicationschips [48]. Thambidurai and Park [58] introduced a fault Using a combination of formal techniques, a veri- model that classi(cid:12)ed faults into three categories: (cid:12)ed clock synchronization circuit design has alsobeen asymmetric, symmetric, and benign. They further developed[49]. 
The principal design tool was the Dig- suggested the need for and developed an algorithm ital Design Derivation system (DDD) developed by thathadcapabilitiesbeyondthatoftheearlierByzan- Indiana University[50]. Some design optimizations tinegeneralsalgorithms. Inparticular,theiralgorithm thatwerenotpossiblewithinDDDwere veri(cid:12)edusing can mask the e(cid:11)ects of a less severe class of faults, in PVS. amore e(cid:11)ective way. SRIhas formallyveri(cid:12)ed an im- proved version of this algorithm[59, 60, 61] The newly developed hybrid-fault theory was then 4.1.3 Byzantine Agreement Algorithms applied to the analysis of the Charles Stark Draper Labs\Fault-TolerantProcessor"(FTP).Auniquefea- Fault-tolerantsystems,althoughinternallyredundant, ture of this architecture is its use of \interstages" to mustdeal with single-source informationfromthe ex- relay messages between processors. These are signi(cid:12)- ternal world. For example, a (cid:13)ight control system is cantly smaller than a processor and lead to an asym- builtaroundthe notionoffeedbackfromphysicalsen- metric architecture that is far more e(cid:14)cient than the sorssuchasaccelerometers,positionsensors,andpres- traditional Byzantine agreement architectures. The sure sensors. Although these can be replicated (and SRI work not only formalized the existing informal they usually are), the replicates do not produce iden- analysis but extended it to cover a wider range of tical results. To use bit-by-bit majority voting, all of faultybehavior[62]. thecomputationalreplicatesmustoperateonidentical Also SRI subsequently generalized their clock syn- input data. Thus, the sensor values (the complete re- chronization work to encompass the hybrid fault dundant suite) must be distributed to each processor model[63]. inamannerwhichguaranteesthatallworkingproces- sorsreceiveexactlythesamevalueeveninthepresence 4.2 Veri(cid:12)cation of Software ofsomefaultyprocessors. 
ThisistheclassicByzantine Generals problem [51]; algorithms to solve the prob- Our past software veri(cid:12)cation projects are de- lem are called Byzantine agreement algorithms. CLI scribed in this section. investigated the formal veri(cid:12)cation and implementa- tion of such algorithms. They formally veri(cid:12)ed the 4.2.1 Formal Speci(cid:12)cation of Space Shuttle original Marshall, Shostak, and Lamport version of Jet Select this algorithmusing the Boyer Moore theorem prover [52]. They also implemented this algorithm down to NASA Langley worked with NASA Johnson Space theregister-transferlevelanddemonstratedthatitim- Center and the Jet Propulsion Laboratory (JPL) in plements the mathematical algorithm [53], and then a study to explore the feasibility and utility of ap- subsequently veri(cid:12)ed the design down to a hardware plying mechanically-supported formal methods to re- description language HDL developed at CLI [54]. A quirements analysis for space applications. The team more e(cid:14)cient mechanical proof of the oral messages worked jointlyto develop a formalspeci(cid:12)cation ofthe algorithmwas also developed by SRI[55]. Jet Select function ofthe NASA Space Shuttle, which ORA also investigated the formal veri(cid:12)cation of is a portion of the Shuttle’s Orbit Digital Auto-Pilot Byzantine Generals algorithms. They focused on the (DAP). Although few proofs were produced for this practicalimplementationofaByzantine-resilientcom- speci(cid:12)cation, 46 issues were identi(cid:12)ed and several mi- munications mechanismbetween Mini-Cayuga micro- nor errors were found in the requirements. A second processors [56, 57]. The Mini-Cayuga is a small but speci(cid:12)cation was produced for an abstract (i.e., high formally veri(cid:12)ed microprocessor developed by ORA. level) representation of the Jet Select requirements. It is a research prototype and has not been fabri- Thisabstraction,alongwiththe24proofsofkeyprop- cated. 
The communicationscircuitry wouldserve asa erties, wasaccomplishedinunder 2workmonths,and foundationforafault-tolerantarchitecture. Itwasde- although it only uncovered 6 issues, several of these signed assuming that the underlying processors were issues weresigni(cid:12)cant. Eventhislevel1applicationof synchronized (say by a clock synchronization circuit). formalmethods was able to uncover hidden problems The issues involved with connecting the Byzantine in a highlycritical and mature FSSR speci(cid:12)cation for communications circuit with a clock synchronization Shuttle.
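
As a small illustration of the maximum clock skew property stated in Section 4.1.2, the following Python sketch checks whether a set of clock readings sampled at one real time t stays within a skew bound. The function names, the integer tick representation, and the sample values are assumptions introduced here for illustration only; they are not taken from the Ehdm specifications or from the verified circuit designs.

```python
from itertools import combinations


def max_skew(readings):
    """Return the largest pairwise difference |C_p(t) - C_q(t)| among the readings."""
    values = list(readings.values())
    # With fewer than two clocks there is no pair to compare, so the skew is 0.
    return max((abs(a - b) for a, b in combinations(values, 2)), default=0)


def satisfies_skew_bound(readings, faulty, delta):
    """Check the skew property over the clocks that are not listed as faulty."""
    nonfaulty = {p: c for p, c in readings.items() if p not in faulty}
    return max_skew(nonfaulty) < delta


# Hypothetical clock values (in ticks) sampled at the same real time t.
readings = {"p1": 100002, "p2": 100000, "p3": 99999, "p4": 250000}

# The property quantifies only over non-faulty clocks, so p4 is excluded first.
print(satisfies_skew_bound(readings, faulty={"p4"}, delta=5))
# Treating every clock as non-faulty lets the wild value of p4 break the bound.
print(satisfies_skew_bound(readings, faulty=set(), delta=5))
```

A check of this kind examines only one sampled instant. The proofs cited in Section 4.1.2 establish the bound for all times under the stated fault assumptions, which reflects the emphasis on analysis over testing in the Design For Validation philosophy.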