Probabilistic Planning for Behavior-Based Robots*

Amin Atrash and Sven Koenig
College of Computing
Georgia Institute of Technology
Atlanta, Georgia 30332-0280
{amin,skoenig}@cc.gatech.edu

* This research is supported by DARPA/U.S. Army SMDC contract #DASG60-99-C-0081. Approved for public release; distribution unlimited. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the sponsoring organizations and agencies or the U.S. government. Copyright (c) 2001, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Partially Observable Markov Decision Process models (POMDPs) have been applied to low-level robot control. We show how to use POMDPs differently, namely for sensor planning in the context of behavior-based robot systems. This is possible because solutions of POMDPs can be expressed as policy graphs, which are similar to the finite state automata that behavior-based systems use to sequence their behaviors. An advantage of our system over previous POMDP navigation systems is that it is able to find close-to-optimal plans since it plans at a higher level and thus with smaller state spaces. An advantage of our system over behavior-based systems that need to be programmed by their users is that it can optimize plans during missions and thus deal robustly with probabilistic models that are initially inaccurate.

Introduction

Mobile robots have to deal with various kinds of uncertainty, such as noisy actuators, noisy sensors, and uncertainty about the environment. Behavior-based robot systems, such as MissionLab (Endo et al. 2000), can operate robustly in the presence of uncertainty (Arkin 1998). Their operation is controlled by plans in the form of finite state automata, whose states correspond to behaviors and whose arcs correspond to observations. These finite state automata have to be programmed by the users of the system at the beginning of a mission. However, plans generated by humans are rarely optimal because they involve complex tradeoffs. Consider, for example, a simple sensor-planning task, where a robot has to decide how often to sense before it starts to act. Since the sensors of the robot are noisy, it may have to sense multiple times. On the other hand, sensing takes time. How often the robot should sense depends on the amount of sensor noise, the cost of sensing, and the consequences of acting based on wrong sensor information.

In this paper, we develop a robot architecture that uses Partially Observable Markov Decision Process models (POMDPs) (Sondik 1978) for planning and combines them with MissionLab. POMDPs provide an elegant and theoretically grounded way of probabilistic planning (Cassandra, Kaelbling, & Littman 1994). So far, they have been used mainly to solve low-level planning tasks for mobile robots, such as path following and localization (Fox, Burgard, & Thrun 1998; Mahadevan, Theocharous, & Khaleeli 1998; Cassandra, Kaelbling, & Kurien 1996; Simmons & Koenig 1995). In this paper, we show that POMDPs can also be used to solve higher-level planning tasks for mobile robots. The key idea behind our robot architecture is that POMDP planners can generate policy graphs rather than the more popular value surfaces. Policy graphs are similar to the finite state automata of MissionLab. A further advantage of our robot architecture is that it uses POMDPs with small state spaces. When POMDPs are used for low-level planning, the state spaces are often large and finding optimal or close-to-optimal plans becomes extremely time-consuming (Papadimitriou & Tsitsiklis 1987). Thus, existing robot systems have so far only been able to use greedy POMDP planning methods that produce extremely suboptimal plans (Koenig & Simmons 1998). Our robot architecture, on the other hand, is able to find close-to-optimal plans.

In the following, we first give an example of sensor planning and then give overviews of behavior-based robotics and POMDPs using this example. Next, we describe how our robot architecture combines these ideas by transforming the output of the POMDP planner (policy graphs) into the input of MissionLab (finite state automata). Finally, we report on two experiments that show that the ability to optimize plans during missions is important because the resulting system is able to deal robustly with probabilistic models that are initially inaccurate.
Example: Sensor Planning

We use the following sensor-planning example throughout this paper, which is similar to an example used in (Cassandra, Kaelbling, & Littman 1994). Assume that a police robot attempts to find wounded hostages in a building. When it is at a doorway, it has to decide whether to search the room. The robot can either use its microphone to listen for terrorists (OBSERVE); enter the room, look around, leave the room, and proceed to the next doorway (ENTER ROOM AND PROCEED); or move to the next doorway right away (PROCEED). The cost of OBSERVE is always 5, and the cost of PROCEED is always 50. Each room is occupied by terrorists with probability 0.5. OBSERVE reports either that the room is occupied by terrorists (OBSERVE OCCUPIED) or not (OBSERVE EMPTY). Although the microphone always detects the absence of terrorists, it fails to detect the presence of terrorists with probability 0.2. Multiple observations are drawn independently from this probability distribution, which is not completely unrealistic for sound. The robot gets a reward for entering a room, as an incentive to find wounded hostages. However, it also gets a penalty if the room is occupied by terrorists since the terrorists might destroy it. If the room is not occupied by terrorists (ROOM EMPTY), then ENTER ROOM AND PROCEED results in a reward of 100. However, if the room is occupied by terrorists (ROOM OCCUPIED), then ENTER ROOM AND PROCEED results in a penalty of 500. The main decision that the robot has to make is how often to OBSERVE and, depending on the sensor observations, whether to PROCEED to the next doorway right away or to first ENTER the ROOM AND then PROCEED.
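To make the tradeoff concrete, the following short Python sketch (ours, purely illustrative and not part of the system described in this paper; the helper name is our own) applies Bayes' rule with the numbers from the example to compute the probability that the room is still occupied after k consecutive OBSERVE EMPTY reports, along with the immediate expected reward of entering:

```python
# Illustrative sketch of the sensing tradeoff in the example above.
# All numbers come from the example; the code is not part of MissionLab.

PRIOR_OCCUPIED = 0.5     # each room is occupied with probability 0.5
MISS_PROB      = 0.2     # OBSERVE reports EMPTY for an occupied room with probability 0.2
COST_OBSERVE   = -5      # reward (negative cost) of OBSERVE
COST_PROCEED   = -50     # reward of PROCEED
REWARD_EMPTY   = 100     # ENTER ROOM AND PROCEED, room empty
PENALTY_OCC    = -500    # ENTER ROOM AND PROCEED, room occupied

def p_occupied_after_empty_reports(k: int) -> float:
    """Posterior probability that the room is occupied after k independent
    OBSERVE EMPTY reports (an empty room always yields EMPTY)."""
    occ = PRIOR_OCCUPIED * MISS_PROB ** k
    emp = (1.0 - PRIOR_OCCUPIED)
    return occ / (occ + emp)

for k in range(4):
    p_occ = p_occupied_after_empty_reports(k)
    enter = (1.0 - p_occ) * REWARD_EMPTY + p_occ * PENALTY_OCC
    print(f"k={k}: P(occupied)={p_occ:.3f}, "
          f"immediate value of entering={enter:.1f}, of proceeding={COST_PROCEED}")
```

Each additional OBSERVE costs 5, and the value of the movement actions also depends on what happens at later doorways, so the right number of observations is exactly the kind of discounted, long-horizon tradeoff that the POMDP planner described below optimizes.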
Behavior-Based Robotics

Behavior-based robotics uses a tight coupling between sensing and acting to operate robustly in the presence of uncertainty. The robot always executes a behavior such as "move to the doorway" or "enter the room." To sequence these behaviors, behavior-based robotics often uses finite state automata whose states correspond to behaviors and whose arcs correspond to triggers (observations). The current state dictates the behavior of the robot. When an observation is made and there is an edge labeled with this observation that leaves the current state, the current state changes to the state pointed to by the edge. Since the finite state automata are based on behaviors and triggers, the robot does not require a model of the world or complete information about the current state of the world. For example, a robot does not need to know the number of doorways or the distances between them.

We use a robot system based on MissionLab (Endo et al. 2000). MissionLab provides a large number of behaviors and triggers with which users can build finite state automata that can then be executed on a variety of robots or in simulation. The finite state automata have to be programmed by the users of the system at the beginning of a mission. This has the disadvantage that MissionLab cannot optimize the finite state automata during the mission, for example, when it learns more accurate probabilities or when the environment changes. Furthermore, humans often assume that sensors are accurate. Their plans are therefore often suboptimal. We address this issue by developing a robot architecture which uses a POMDP planner to generate plans based on probabilistic models of the world.

POMDPs

POMDPs consist of a finite set of states S, a finite set of observations O, and an initial state distribution π. Each state s ∈ S has a finite set of actions A(s) that can be executed in it. The POMDP further consists of a transition function p, where p(s'|s,a) denotes the probability with which the system transitions from state s to state s' when action a is executed; an observation function q, where q(o|s,a) denotes the probability of making observation o when action a is executed in state s; and a reward function r, where r(s,a) denotes the finite reward (negative cost) that results when action a is executed in state s.

A POMDP process is a stream of <state, observation, action, reward> quadruples. The POMDP process is always in exactly one state and makes state transitions at discrete time steps. The initial state of the POMDP process is drawn according to the probabilities π(s). Thus, p(s_t = s) = π(s) for t = 1. Assume that at time t the POMDP process is in state s_t ∈ S. Then, a decision maker chooses an action a_t from A(s_t) for execution. This results in reward r_t = r(s_t, a_t) and observation o_t ∈ O, which is generated according to the probabilities p(o_t = o) = q(o|s_t, a_t). Next, the POMDP process changes state. The successor state s_{t+1} ∈ S is selected according to the probabilities p(s_{t+1} = s) = p(s|s_t, a_t). This process repeats forever.
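The POMDP for the sensor-planning task (Figure 1 below) is small enough to write down directly. The following Python sketch encodes it with the probabilities and rewards from the example and samples one step of the <state, observation, action, reward> process just described. The dictionary-based representation and the step/sample helpers are our own illustrative choices, not the input format of MissionLab or of the POMDP planner used in the paper.

```python
import random

# The sensor-planning POMDP (states, actions, observations as in Figure 1).
S   = ["ROOM_EMPTY", "ROOM_OCCUPIED"]                    # s1, s2
A   = ["ENTER_ROOM_AND_PROCEED", "OBSERVE", "PROCEED"]   # a1, a2, a3
OBS = ["OBSERVE_EMPTY", "OBSERVE_OCCUPIED"]              # o1, o2

pi = {"ROOM_EMPTY": 0.5, "ROOM_OCCUPIED": 0.5}           # initial state distribution

# r[s][a]: reward for executing action a in state s.
r = {"ROOM_EMPTY":    {"ENTER_ROOM_AND_PROCEED": 100,  "OBSERVE": -5, "PROCEED": -50},
     "ROOM_OCCUPIED": {"ENTER_ROOM_AND_PROCEED": -500, "OBSERVE": -5, "PROCEED": -50}}

def p(s, a):
    """Transition distribution: OBSERVE leaves the state unchanged; the movement
    actions lead to a new doorway whose room is occupied with probability 0.5."""
    if a == "OBSERVE":
        return {s: 1.0}
    return {"ROOM_EMPTY": 0.5, "ROOM_OCCUPIED": 0.5}

def q(s, a):
    """Observation distribution: only OBSERVE is informative."""
    if a == "OBSERVE":
        if s == "ROOM_EMPTY":
            return {"OBSERVE_EMPTY": 1.0, "OBSERVE_OCCUPIED": 0.0}
        return {"OBSERVE_EMPTY": 0.2, "OBSERVE_OCCUPIED": 0.8}
    return {"OBSERVE_EMPTY": 0.5, "OBSERVE_OCCUPIED": 0.5}    # uninformative

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def step(s, a):
    """One step of the POMDP process: (reward, observation, successor state)."""
    return r[s][a], sample(q(s, a)), sample(p(s, a))

state = sample(pi)
print(step(state, "OBSERVE"))
```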
As an example, Figure 1 shows the POMDP that corresponds to our sensor-planning task. The robot starts at a doorway without knowing whether the room is occupied. Thus, it is in state ROOM OCCUPIED with probability 0.5 and ROOM EMPTY with probability 0.5 but does not know which one it is in. In both states, the robot can OBSERVE, PROCEED to the next doorway, or ENTER ROOM AND PROCEED. OBSERVE does not change the state, but the sensor observation provides information about it. PROCEED and ENTER ROOM AND PROCEED both result in the robot being at the next doorway and thus again in state ROOM OCCUPIED with probability 0.5 and ROOM EMPTY with probability 0.5. The observation probabilities and rewards of the actions are as described above.

Figure 1: POMDP for the sensor-planning task, with states S = {s1: Room Empty, s2: Room Occupied}, actions A(s1) = A(s2) = {a1: Enter Room and Proceed, a2: Observe, a3: Proceed}, observations O = {o1: Observe Empty, o2: Observe Occupied}, initial distribution p(s1) = p(s2) = 0.5, and the rewards, transition probabilities, and observation probabilities given in the text.

Policy Graphs

Assume that a decision maker has to determine which action to execute for a given POMDP at time t. The decision maker knows the specification of the POMDP, executed the actions a_1, ..., a_{t-1}, and made the observations o_1, ..., o_{t-1}. The objective of the decision maker is to maximize the average total reward over an infinite planning horizon, which is E(Σ_{t=1}^∞ γ^(t-1) r_t), where γ ∈ (0,1] is a discount factor. The discount factor specifies the relative value of a reward received after t action executions compared to the same reward received one action execution earlier. One often uses a discount factor slightly smaller than one because this ensures that the average total reward is finite, no matter which actions are chosen. (We use γ = 0.99.) In our case, the robot is the decision maker who executes movement and sensing actions and receives information about the state of the world from inaccurate sensors, such as the microphone. We let the robot maximize the average total reward over an infinite horizon because it searches a large number of rooms.

It is a fundamental result of operations research that optimal behaviors for the robot can be expressed either as value surfaces or as policy graphs (Sondik 1978). Value surfaces are mappings from probability distributions over the states to values. The robot calculates the expected value of the probability distribution over the states that results from the execution of each action and then chooses the action that results in the largest expected value. Policy graphs are graphs whose vertices correspond to actions and whose directed edges correspond to observations. The robot executes the action that corresponds to its current vertex. Then, it makes an observation, follows the corresponding edge, and repeats the process.

It is far more common for POMDP planners to use value surfaces than policy graphs. However, policy graphs allow us to integrate POMDP planning and behavior-based systems because of their similarity to finite state automata. As an example, Figure 2 shows the optimal policy graph for our sensor-planning task. This policy graph specifies a behavior where the robot senses three times before it decides to enter a room. If any of the sensing operations indicates that the room is occupied, the robot decides to move to the next doorway without entering the room.

Figure 2: Policy Graph (vertices: Observe, Enter Room and Proceed, Proceed; directed edges labeled Observe Empty and Observe Occupied; the Start marker points to the first Observe vertex).

Optimal policy graphs can potentially be large but often turn out to be very small (Cassandra, Kaelbling, & Littman 1994). However, finding optimal or close-to-optimal policy graphs is PSPACE-complete in general (Papadimitriou & Tsitsiklis 1987) and thus only feasible for small planning tasks. We decided to use a POMDP planner that was developed by Hansen in his dissertation at the University of Massachusetts at Amherst (Hansen 1998). This POMDP planner can often find optimal or close-to-optimal policy graphs for our POMDP problems in seconds. We use the POMDP planner unchanged, with one small exception: we noticed that many vertices of the policy graph are often unreachable from the start vertex, and we eliminate these vertices using a simple graph-search technique.
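Because the shape of Figure 2 is described completely in the text, we can illustrate how a policy graph is executed. The sketch below is our own illustration (not the output format of Hansen's planner): it encodes a graph that observes up to three times, enters the room only after three OBSERVE EMPTY reports, and proceeds to the next doorway as soon as OBSERVE OCCUPIED is reported, and then runs the execute/observe/follow-edge loop described above. The loop-back edges to the first Observe vertex are our reading of the figure.

```python
# A policy graph of the shape described for Figure 2: vertices are actions,
# directed edges are labeled with observations.
POLICY_GRAPH = {
    # vertex: (action, {observation: successor vertex})
    "observe1": ("OBSERVE", {"OBSERVE_EMPTY": "observe2", "OBSERVE_OCCUPIED": "proceed"}),
    "observe2": ("OBSERVE", {"OBSERVE_EMPTY": "observe3", "OBSERVE_OCCUPIED": "proceed"}),
    "observe3": ("OBSERVE", {"OBSERVE_EMPTY": "enter",    "OBSERVE_OCCUPIED": "proceed"}),
    "enter":    ("ENTER_ROOM_AND_PROCEED",
                 {"OBSERVE_EMPTY": "observe1", "OBSERVE_OCCUPIED": "observe1"}),
    "proceed":  ("PROCEED",
                 {"OBSERVE_EMPTY": "observe1", "OBSERVE_OCCUPIED": "observe1"}),
}
START_VERTEX = "observe1"

def run_policy_graph(execute_action, steps=10):
    """Execute the action of the current vertex, make an observation,
    follow the matching edge, and repeat.
    execute_action must run the behavior and return one of the observation labels."""
    vertex = START_VERTEX
    for _ in range(steps):
        action, edges = POLICY_GRAPH[vertex]
        observation = execute_action(action)
        vertex = edges[observation]
```

Here execute_action stands for whatever actually runs the behavior and returns the resulting observation; in the architecture described below, this role is played by MissionLab's behaviors and triggers.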
The Robot Architecture

Figure 3 shows a flow graph of our robot architecture. The user inputs a POMDP that models the planning task. The robot architecture then uses the POMDP planner to produce a policy graph and removes all vertices that are unreachable from the initial vertex. By mapping the actions of the policy graph to behaviors and the observations to triggers, the policy graph is then transformed into a finite state automaton and used to control the operation of MissionLab. The user still has to input information, but now only the planning task and its parameters (that is, the probabilities and costs) and no longer the plans. Once the finite state automaton is read into MissionLab, we allow the user to examine and edit it, for example, to add additional parts to the mission or to make it part of a larger finite state automaton. Figure 4 shows a screenshot of the policy graph from Figure 2 after it was read into MissionLab and augmented with details about how to implement PROCEED (namely, by marking the current doorway and proceeding along the hallway until the robot is at an unmarked doorway) and ENTER ROOM AND PROCEED (namely, by entering the room, leaving the room, marking the current doorway, and proceeding along the hallway until the robot is at an unmarked doorway). Furthermore, the user decided that it was more robust to start with the behavior that proceeds along the hallway because then the robot can start anywhere in the hallway and not only at doorways.

Figure 3: Robot Architecture (POMDP -> POMDP Planner -> Policy Graph -> Reducer -> Reduced Policy Graph -> MissionLab).

Figure 4: Finite State Automaton obtained from the policy graph of Figure 2 in MissionLab.
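The paper says only that unreachable vertices are eliminated with a simple graph-search technique; one straightforward way to implement this Reducer step, sketched here as an assumption rather than a description of the actual implementation, is a breadth-first search from the start vertex:

```python
from collections import deque

def reduce_policy_graph(policy_graph, start_vertex):
    """Drop every vertex that cannot be reached from the start vertex.
    policy_graph maps vertex -> (action, {observation: successor vertex}),
    as in the sketch above."""
    reachable = {start_vertex}
    queue = deque([start_vertex])
    while queue:
        vertex = queue.popleft()
        _action, edges = policy_graph[vertex]
        for successor in edges.values():
            if successor not in reachable:
                reachable.add(successor)
                queue.append(successor)
    return {v: policy_graph[v] for v in reachable}
```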
Our robot architecture shows that it is possible to integrate POMDP planning and behavior-based systems by mapping policy graphs to finite state automata. However, the semantics of the two differ. POMDPs assume that all observations are discrete and that the robot makes an observation after each action execution. Finite state automata assume that behaviors are continuous and that triggers can be observed at any time during the execution of the behaviors. Two issues need to be addressed in this context. First, we need MissionLab to be able to deal with actions of finite duration. We deal with this problem by adding an ACTION FINISHED trigger to MissionLab. (This extension is not needed for our sensor-planning task.) Second, we need to deal with a potential combinatorial explosion of the number of observations. Most POMDP planners assume that every observation can be made in every state. Consequently, every vertex in a policy graph has one outgoing edge for each possible observation. However, the observations are n-tuples if there are n sensors, and the number of observations can thus be large. This is not a problem for finite state automata since observations that do not cause state transitions do not appear in them. We deal with this problem by omitting subtasks from the POMDP planning task that can be abstracted away or are pre-sequenced and do not need to be planned. For example, ENTER ROOM AND PROCEED is a macro-behavior that consists of a sequence of observations and behaviors, as shown in Figure 4. By omitting the details of ENTER ROOM AND PROCEED, the observations IN ROOM, IN HALLWAY, and MARKED DOORWAY do not need to be considered during planning.

Experiments

We test the performance of our system, both analytically and experimentally, by comparing the average total reward of its plans (that is, optimal plans) against the plans typically generated by users. For our sensor-planning task, users typically create plans that sense only once, no matter what the probabilities and costs are. The robot executes ENTER ROOM AND PROCEED if it senses ROOM EMPTY, otherwise it executes PROCEED. We therefore use this plan as the baseline plan and compare the plans generated by our system against it.

Analytical Results: To demonstrate that our system has an advantage over the previous system because it is able to optimize its plans during missions when it is able to estimate the costs more precisely, we determine analytically how the average total reward of the baseline plan depends on the reward x = r(s2,a1) for entering an occupied room. Let k be the state directly before the robot executes OBSERVE, l the state directly before it executes ENTER ROOM AND PROCEED, and m the state directly before it executes PROCEED. If the robot is in k, then it incurs a cost of 5 for executing OBSERVE. It then transitions from k to l with the probability with which the sensor reports OBSERVE EMPTY, otherwise it transitions to m. Using the notation of Figure 1, the probability with which the sensor reports OBSERVE EMPTY is p(o_t = o1) = q(o1|s1,a2) p(s_t = s1) + q(o1|s2,a2) p(s_t = s2) = 1.0 * 0.5 + 0.2 * 0.5 = 3/5. Consequently, the average total reward v(k) of the baseline plan if the robot starts in k is v(k) = -5 + γ(3/5 v(l) + 2/5 v(m)). Similar derivations result in a system of three linear equations in three unknowns:

  v(k) = -5 + γ(3/5 v(l) + 2/5 v(m))
  v(l) = 1/6 x + 5/6 * 100 + γ v(k)
  v(m) = -50 + γ v(k)

Solving this system of equations yields v(k) = 1241.21 + 4.97x. Figure 5 shows this line together with the average total reward of the plans generated by our system, as a function of x. As can be seen, the number of times the robot has to sense OBSERVE EMPTY before it enters a room increases as it becomes more expensive to enter an occupied room. (The markers show when a change in plan occurs.) The robot pays a cost for the additional sensing operations, but this decreases the probability of entering an occupied room. Changing the plans as x changes allows the average total reward of the plans generated by our system to deteriorate much more slowly than the average total reward of the baseline plan. This result shows that our system has an advantage over the previous system because it is able to adapt plans during missions when the costs can be estimated more precisely. It also shows that our system has an advantage over the previous system because humans are not good at planning with uncertainty and thus their plans are rarely optimal. For example, the original sensor-planning problem has x = -500, and the average total reward of the baseline plan is only -1,246.24 whereas the average total reward of the plan generated by our system is 374.98.

Figure 5: Average Total Rewards vs. Reward for Entering an Occupied Room (x) (Analytical). As x decreases from 0 to -1000, the optimal plan switches from observing 0 times to observing 3 times; the markers show where the optimal plan changes.
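The system of equations is easy to check numerically. The following sketch (ours, using numpy; not part of the paper's tooling) solves it for a given x with γ = 0.99 and reproduces both the constant term of the closed form v(k) = 1241.21 + 4.97x and the baseline value of roughly -1246 at x = -500:

```python
import numpy as np

GAMMA = 0.99

def baseline_value(x: float) -> float:
    """Average total reward v(k) of the baseline plan for a given reward x
    of entering an occupied room, from the three linear equations above."""
    # Unknowns ordered as [v(k), v(l), v(m)]; rows rearranged into A v = b form.
    A = np.array([
        [1.0, -GAMMA * 3 / 5, -GAMMA * 2 / 5],   # v(k) - gamma(3/5 v(l) + 2/5 v(m)) = -5
        [-GAMMA, 1.0, 0.0],                      # v(l) - gamma v(k) = x/6 + 500/6
        [-GAMMA, 0.0, 1.0],                      # v(m) - gamma v(k) = -50
    ])
    b = np.array([-5.0, x / 6 + 5 / 6 * 100, -50.0])
    v_k, _v_l, _v_m = np.linalg.solve(A, b)
    return v_k

print(baseline_value(-500))   # approximately -1246.2, as reported in the text
print(baseline_value(0))      # approximately 1241.2, the constant term of the closed form
```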
Similar results can be observed if the initial probabilities are inaccurate. Figure 6 shows the average total reward of the baseline plan together with the average total reward of the plans generated by our system, as a function of the probability y = q(o2|s2,a2) with which the microphone correctly classifies an occupied room, for both x = -200 and x = -500. As can be seen, the number of times the robot has to sense OBSERVE EMPTY before it enters a room increases as the sensor becomes more noisy. This again allows the average total reward of the plans generated by our system to deteriorate much more slowly than the average total reward of the baseline plan, demonstrating the advantages of our system.

Figure 6: Average Total Rewards vs. Probability of Correctly Classifying an Occupied Room (y = q(o2|s2,a2)) (Analytical), for x = -200 and x = -500. The markers show where the optimal plan changes; it observes between one and five times as y decreases from 100% toward 50%.

Experimental Results: We also performed a simulation study with MissionLab to compare the average total reward of the plans generated by our system against the baseline plan, for y = 0.8 and both x = -200 and x = -500. We used four rooms and averaged over ten runs. Figure 7 shows the results. In both cases, the average total reward of the baseline plan is much smaller than the average total reward of the plans generated by our system. (The table shows that the average total reward of the plans generated by our system actually increased as it became more costly to enter an occupied room. This artifact is due to the reduced probability of entering an occupied room, causing the situation never to occur during our limited number of runs.) These results are similar to the analytical results shown in Figure 5.

                                                Baseline Plan   Optimal Plan
  Reward for Entering Occupied Room x = -200            16.92         282.89
  Reward for Entering Occupied Room x = -500         -1000.42         498.23

Figure 7: Average Total Rewards (Experimental)
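The simulation study itself requires MissionLab, but the effect it measures can be approximated with a quick Monte Carlo sketch of the underlying POMDP (ours, not the paper's experimental setup): run the one-observation baseline plan and the three-observation plan over a sequence of doorways and compare the discounted total rewards. The exact numbers will differ from Figure 7 because the MissionLab runs involve the full navigation behaviors.

```python
import random

GAMMA, MISS, PRIOR = 0.99, 0.2, 0.5    # y = 0.8 corresponds to a miss probability of 0.2

def episode(n_observations, x, rooms=4):
    """Discounted total reward of a plan that enters only after n_observations
    consecutive OBSERVE EMPTY readings; x is the reward for entering an occupied room."""
    total, discount = 0.0, 1.0
    for _ in range(rooms):
        occupied = random.random() < PRIOR
        enter = True
        for _ in range(n_observations):
            total += -5 * discount                        # cost of OBSERVE
            discount *= GAMMA
            heard_empty = (not occupied) or random.random() < MISS
            if not heard_empty:
                enter = False                              # proceed as soon as OCCUPIED is heard
                break
        if enter:
            total += (x if occupied else 100) * discount   # ENTER ROOM AND PROCEED
        else:
            total += -50 * discount                        # PROCEED
        discount *= GAMMA
    return total

def average(n_observations, x, runs=10):
    return sum(episode(n_observations, x) for _ in range(runs)) / runs

print("baseline (sense once), x=-500:", average(1, -500))
print("sense three times,     x=-500:", average(3, -500))
```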
Conclusion

This paper reported on initial work that uses Partially Observable Markov Decision Process models (POMDPs) in the context of behavior-based systems. The insight that makes this combination work is that POMDP planners can generate policy graphs rather than the more popular value surfaces, and policy graphs are similar to the finite state automata that behavior-based systems use to sequence their behaviors. This combination also keeps the POMDPs small, which allows our POMDP planners to find optimal or close-to-optimal plans whereas the POMDP planners of other robot architectures can only find very suboptimal plans.

We used this insight to improve MissionLab, a behavior-based system where the finite state automata had to be programmed by the users of the system at the beginning of the mission. This had the disadvantage that humans are not good at planning with uncertainty, so their plans were rarely optimal, and that the plans could not be improved during the mission. In contrast, our robot architecture is able to find optimal or close-to-optimal plans and to re-optimize them during missions, for example, when it learns more accurate probabilities or when the environment changes.

It is future work to study interfaces that allow users to easily input POMDPs, including probabilities and costs. Also, we intend to implement sampling methods for adapting the probabilities and costs of POMDPs during missions to be able to update the plan during execution. Finally, it is future work to scale up our robot architecture by developing POMDP planners that are able to take advantage of the structure of the POMDP planning tasks and thus are more efficient than current POMDP planners.

References

Arkin, R. 1998. Behavior-Based Robotics. MIT Press.

Cassandra, A.; Kaelbling, L.; and Kurien, J. 1996. Acting under uncertainty: Discrete Bayesian models for mobile robot navigation. In Proceedings of the International Conference on Intelligent Robots and Systems, 963-972.

Cassandra, A.; Kaelbling, L.; and Littman, M. 1994. Acting optimally in partially observable stochastic domains. In Proceedings of the National Conference on Artificial Intelligence, 1023-1028.

Endo, Y.; MacKenzie, D.; Stoychev, A.; Halliburton, W.; Ali, K.; Balch, T.; Cameron, J.; and Chen, Z. 2000. MissionLab: User manual for MissionLab version 4.0. Technical report, College of Computing, Georgia Institute of Technology.

Fox, D.; Burgard, W.; and Thrun, S. 1998. Active Markov localization for mobile robots. Robotics and Autonomous Systems 25:195-207.

Hansen, E. 1998. Solving POMDPs by searching in policy space. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 211-219.

Koenig, S., and Simmons, R. 1998. Xavier: A robot navigation architecture based on partially observable Markov decision process models. In Kortenkamp, D.; Bonasso, R.; and Murphy, R., eds., Artificial Intelligence Based Mobile Robotics: Case Studies of Successful Robot Systems. MIT Press. 91-122.

Mahadevan, S.; Theocharous, G.; and Khaleeli, N. 1998. Rapid concept learning for mobile robots. Autonomous Robots Journal 5:239-251.

Papadimitriou, C., and Tsitsiklis, J. 1987. The complexity of Markov decision processes. Mathematics of Operations Research 12(3):441-450.

Puterman, M. 1994. Markov Decision Processes - Discrete Stochastic Dynamic Programming. Wiley.

Simmons, R., and Koenig, S. 1995. Probabilistic robot navigation in partially observable environments. In Proceedings of the International Joint Conference on Artificial Intelligence, 1080-1087.

Sondik, E. 1978. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research 26(2):282-304.
