Probabilistic Planning for Behavior-Based Robots*

Amin Atrash and Sven Koenig
College of Computing
Georgia Institute of Technology
Atlanta, Georgia 30332-0280
{amin,skoenig}@cc.gatech.edu

* This research is supported by DARPA/U.S. Army SMDC contract #DASG60-99-C-0081. Approved for public release; distribution unlimited. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the sponsoring organizations and agencies or the U.S. government. Copyright © 2001, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract

Partially Observable Markov Decision Process models (POMDPs) have been applied to low-level robot control. We show how to use POMDPs differently, namely for sensor planning in the context of behavior-based robot systems. This is possible because solutions of POMDPs can be expressed as policy graphs, which are similar to the finite state automata that behavior-based systems use to sequence their behaviors. An advantage of our system over previous POMDP navigation systems is that it is able to find close-to-optimal plans since it plans at a higher level and thus with smaller state spaces. An advantage of our system over behavior-based systems that need to be programmed by their users is that it can optimize plans during missions and thus deal robustly with probabilistic models that are initially inaccurate.

Introduction

Mobile robots have to deal with various kinds of uncertainty, such as noisy actuators, noisy sensors, and uncertainty about the environment. Behavior-based robot systems, such as MissionLab (Endo et al. 2000), can operate robustly in the presence of uncertainty (Arkin 1998). Their operation is controlled by plans in the form of finite state automata, whose states correspond to behaviors and whose arcs correspond to observations. These finite state automata have to be programmed by the users of the system at the beginning of a mission. However, plans generated by humans are rarely optimal because they involve complex tradeoffs. Consider, for example, a simple sensor-planning task, where a robot has to decide how often to sense before it starts to act. Since the sensors of the robot are noisy, it may have to sense multiple times. On the other hand, sensing takes time. How often the robot should sense depends on the amount of sensor noise, the cost of sensing, and the consequences of acting based on wrong sensor information.

In this paper, we develop a robot architecture that uses Partially Observable Markov Decision Process models (POMDPs) (Sondik 1978) for planning and combines them with MissionLab. POMDPs provide an elegant and theoretically grounded way for probabilistic planning (Cassandra, Kaelbling, & Littman 1994). So far, they have been used mainly to solve low-level planning tasks for mobile robots such as path following and localization (Fox, Burgard, & Thrun 1998; Mahadevan, Theocharous, & Khaleeli 1998; Cassandra, Kaelbling, & Kurien 1996; Simmons & Koenig 1995). In this paper, we show that POMDPs can also be used to solve higher-level planning tasks for mobile robots. The key idea behind our robot architecture is that POMDP planners can generate policy graphs rather than the more popular value surfaces. Policy graphs are similar to the finite state automata of MissionLab. An advantage of our robot architecture is that it uses POMDPs with small state spaces. When POMDPs are used for low-level planning, the state spaces are often large and finding optimal or close-to-optimal plans for POMDPs becomes extremely time-consuming (Papadimitriou & Tsitsiklis 1987). Thus, existing robot systems have so far only been able to use greedy POMDP planning methods that produce extremely suboptimal plans (Koenig & Simmons 1998). Our robot architecture, on the other hand, is able to find close-to-optimal plans.

In the following, we first give an example of sensor planning and then give overviews of behavior-based robotics and POMDPs using this example. Next, we describe how our robot architecture combines these ideas by transforming the output of the POMDP planner (policy graphs) to the input of MissionLab (finite state automata). Finally, we report on two experiments that show that the ability to optimize plans during missions is important because the resulting system is able to deal robustly with probabilistic models that are initially inaccurate.

Example: Sensor Planning

We use the following sensor-planning example throughout this paper, which is similar to an example used in (Cassandra, Kaelbling, & Littman 1994). Assume that a police robot attempts to find wounded hostages in a building. When it is at a doorway, it has to decide whether to search the room. The robot can either use its microphone to listen for terrorists (OBSERVE); enter the room, look around, leave the room, and proceed to the next doorway (ENTER ROOM AND PROCEED); or move to the next doorway right away (PROCEED).
The cost of OBSERVE is always 5, and the cost of PROCEED is always 50. Each room is occupied by terrorists with probability 0.5. OBSERVE reports either that the room is occupied by terrorists (OBSERVE OCCUPIED) or not (OBSERVE EMPTY). Although the microphone always detects the absence of terrorists, it does not detect the presence of terrorists with probability 0.2. Multiple observations are drawn independently from this probability distribution, which is not completely unrealistic for sound. The robot gets a reward for entering a room, as an incentive to find wounded hostages. However, it also gets a penalty if the room is occupied by terrorists since terrorists might destroy it. If the room is not occupied by terrorists (ROOM EMPTY), then ENTER ROOM AND PROCEED results in a reward of 100. However, if the room is occupied by terrorists (ROOM OCCUPIED), then ENTER ROOM AND PROCEED results in a penalty of 500. The main decision that the robot has to make is how often to OBSERVE and, depending on the sensor observations, whether to PROCEED to the next doorway right away or to first ENTER the ROOM AND then PROCEED.
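As a quick illustration of why repeated sensing helps (a minimal Python sketch of ours, not part of the system described in this paper), Bayes' rule with the 0.5 prior and the 0.2 miss probability gives the probability that a room is still occupied after several consecutive OBSERVE EMPTY readings:

# Illustrative sketch (not from the paper): posterior probability that the
# room is occupied after k consecutive OBSERVE EMPTY readings.
PRIOR_OCCUPIED = 0.5   # each room is occupied with probability 0.5
MISS_PROB = 0.2        # OBSERVE reports EMPTY although the room is occupied

def p_occupied_given_k_empty(k: int) -> float:
    """Bayes' rule: an empty room always yields OBSERVE EMPTY, while an
    occupied room yields OBSERVE EMPTY with probability 0.2 per reading."""
    occupied = PRIOR_OCCUPIED * MISS_PROB ** k
    empty = (1.0 - PRIOR_OCCUPIED)
    return occupied / (occupied + empty)

for k in range(4):
    print(k, round(p_occupied_given_k_empty(k), 4))
# 0 readings: 0.5, 1 reading: 0.1667, 2 readings: 0.0385, 3 readings: 0.0079

After one OBSERVE EMPTY reading the probability of an occupied room drops to 1/6, which is the value that reappears in the analytical results later in the paper; whether that residual risk justifies further sensing depends on the cost of sensing and on the penalty for entering an occupied room.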
Behavior-Based Robotics

Behavior-based robotics uses a tight coupling between sensing and acting to operate robustly in the presence of uncertainty. The robot always executes a behavior such as "move to the doorway" or "enter the room." To sequence these behaviors, behavior-based robotics often uses finite state automata whose states correspond to behaviors and whose arcs correspond to triggers (observations). The current state dictates the behavior of the robot. When an observation is made and there is an edge labeled with this observation that leaves the current state, the current state changes to the state pointed to by the edge. Since the finite state automata are based on behaviors and triggers, the robot does not require a model of the world or complete information about the current state of the world. For example, a robot does not need to know the number of doorways or the distances between them.

We use a robot system based on MissionLab (Endo et al. 2000). MissionLab provides a large number of behaviors and triggers with which users can build finite state automata, which can then be executed on a variety of robots or in simulation. The finite state automata have to be programmed by the users of the system at the beginning of a mission. This has the disadvantage that MissionLab cannot optimize the finite state automata during the mission, for example, when it learns more accurate probabilities or when the environments change. Furthermore, humans often assume that sensors are accurate. Their plans are therefore often suboptimal. We address this issue by developing a robot architecture which uses a POMDP planner to generate plans based on probabilistic models of the world.

POMDPs

POMDPs consist of a finite set of states S, a finite set of observations O, and an initial state distribution π. Each state s ∈ S has a finite set of actions A(s) that can be executed in it. The POMDP further consists of a transition function p, where p(s'|s,a) denotes the probability with which the system transitions from state s to state s' when action a is executed, an observation function q, where q(o|s,a) denotes the probability of making observation o when action a is executed in state s, and a reward function r, where r(s,a) denotes the finite reward (negative cost) that results when action a is executed in state s. A POMDP process is a stream of <state, observation, action, reward> quadruples. The POMDP process is always in exactly one state and makes state transitions at discrete time steps. The initial state of the POMDP process is drawn according to the probabilities π(s). Thus, p(s_t = s) = π(s) for t = 1. Assume that at time t, the POMDP process is in state s_t ∈ S. Then, a decision maker chooses an action a_t from A(s_t) for execution. This results in reward r_t = r(s_t, a_t) and observation o_t ∈ O that is generated according to the probabilities p(o_t = o) = q(o|s_t, a_t). Next, the POMDP process changes state. The successor state s_{t+1} ∈ S is selected according to the probabilities p(s_{t+1} = s) = p(s|s_t, a_t). This process repeats forever.

[Figure 1: POMDP. The POMDP for the sensor-planning task, with states s1 (Room Empty) and s2 (Room Occupied), initial distribution p(s1) = p(s2) = 0.5, observations o1 (Observe Empty) and o2 (Observe Occupied), and actions a1 (Enter Room and Proceed), a2 (Observe), and a3 (Proceed).]

As an example, Figure 1 shows the POMDP that corresponds to our sensor-planning task. The robot starts at a doorway without knowing whether the room is occupied. Thus, it is in state ROOM OCCUPIED with probability 0.5 and ROOM EMPTY with probability 0.5 but does not know which one it is in. In both states, the robot can OBSERVE, PROCEED to the next doorway, or ENTER ROOM AND PROCEED. OBSERVE does not change the state but the sensor observation provides information about it. PROCEED and ENTER ROOM AND PROCEED both result in the robot being at the next doorway and thus again in state ROOM OCCUPIED with probability 0.5 and ROOM EMPTY with probability 0.5. The observation probabilities and rewards of the actions are as described above.
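As a concrete illustration, the sensor-planning POMDP of Figure 1 can be written down directly in this notation. The sketch below is ours (the dictionary-based encoding and the identifier names are not from the described system); the probabilities and rewards are the ones given in the example, with the penalty for entering an occupied room kept as a parameter X = -500.

import random

# Sketch of the sensor-planning POMDP of Figure 1 (the encoding is ours).
S = ["s1_room_empty", "s2_room_occupied"]
O = ["o1_observe_empty", "o2_observe_occupied"]
A = ["a1_enter_room_and_proceed", "a2_observe", "a3_proceed"]
PI = {"s1_room_empty": 0.5, "s2_room_occupied": 0.5}
X = -500  # reward for entering an occupied room

# r(s, a): rewards and costs from the example.
R = {
    ("s1_room_empty", "a1_enter_room_and_proceed"): 100,
    ("s2_room_occupied", "a1_enter_room_and_proceed"): X,
    ("s1_room_empty", "a2_observe"): -5,
    ("s2_room_occupied", "a2_observe"): -5,
    ("s1_room_empty", "a3_proceed"): -50,
    ("s2_room_occupied", "a3_proceed"): -50,
}

def transition(s, a):
    """p(s'|s,a): OBSERVE keeps the state; both movement actions lead to the
    next doorway, whose room is occupied with probability 0.5."""
    if a == "a2_observe":
        return s
    return random.choices(S, weights=[0.5, 0.5])[0]

def observe(s, a):
    """q(o|s,a): only OBSERVE is informative; it always reports EMPTY for an
    empty room and misses terrorists with probability 0.2."""
    if a == "a2_observe":
        p_empty = 1.0 if s == "s1_room_empty" else 0.2
    else:
        p_empty = 0.5  # uninformative after movement actions, as in Figure 1
    return "o1_observe_empty" if random.random() < p_empty else "o2_observe_occupied"

def step(s, a):
    """One <state, observation, action, reward> step of the POMDP process."""
    r = R[(s, a)]
    o = observe(s, a)
    s_next = transition(s, a)
    return s_next, o, r

Everything the controller ever sees is the observation stream produced by step(); the planner's job is to trade the cost of additional OBSERVE actions against the risk captured by X.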
Policy Graphs

Assume that a decision maker has to determine which action to execute for a given POMDP at time t. The decision maker knows the specification of the POMDP, executed the actions a_1 ... a_{t-1}, and made the observations o_1 ... o_{t-1}. The objective of the decision maker is to maximize the average total reward over an infinite planning horizon, which is E(∑_{t=1}^{∞} γ^{t-1} r_t), where γ ∈ (0,1] is a discount factor. The discount factor specifies the relative value of a reward received after t action executions compared to the same reward received one action execution earlier. One often uses a discount factor slightly smaller than one because this ensures that the average total reward is finite, no matter which actions are chosen. (We use γ = 0.99.) In our case, the robot is the decision maker who executes movement and sensing actions and receives information about the state of the world from inaccurate sensors, such as the microphone. We let the robot maximize the average total reward over an infinite horizon because it searches a large number of rooms.

It is a fundamental result of operations research that optimal behaviors for the robot can be expressed either as value surfaces or policy graphs (Sondik 1978). Value surfaces are mappings from probability distributions over the states to values. The robot calculates the expected value of the probability distribution over the states that results from the execution of each action and then chooses the action that results in the largest expected value. Policy graphs are graphs where the vertices correspond to actions and the directed edges correspond to observations. The robot executes the action that corresponds to its current vertex. Then, it makes an observation, follows the corresponding edge, and repeats the process.

[Figure 2: Policy Graph. The optimal policy graph for the sensor-planning task: from the start vertex, any Observe Occupied observation leads to Proceed, while three consecutive Observe Empty observations lead to Enter Room and Proceed.]

It is far more common for POMDP planners to use value surfaces than policy graphs. However, policy graphs allow us to integrate POMDP planning and behavior-based systems because of their similarity to finite state automata. As an example, Figure 2 shows the optimal policy graph for our sensor-planning task. This policy graph specifies a behavior where the robot senses three times before it decides to enter a room. If any of the sensing operations indicates that the room is occupied, the robot decides to move to the next doorway without entering the room.

Optimal policy graphs can potentially be large but often turn out to be very small (Cassandra, Kaelbling, & Littman 1994). However, finding optimal or close-to-optimal policy graphs is PSPACE-complete in general (Papadimitriou & Tsitsiklis 1987) and thus only feasible for small planning tasks. We decided to use a POMDP planner that was developed by Hansen in his dissertation at the University of Massachusetts at Amherst (Hansen 1998). This POMDP planner can often find optimal or close-to-optimal policy graphs for our POMDP problems in seconds. We use the POMDP planner unchanged, with one small exception. We noticed that many vertices of the policy graph are often unreachable from the start vertex and eliminate these vertices using a simple graph-search technique.
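To show how directly such a policy graph maps onto an executable controller, here is a small sketch (ours, with made-up vertex names) of the Figure 2 policy graph as a lookup table: each vertex stores its action, and each observation selects the next vertex.

# Sketch (ours) of the Figure 2 policy graph:
# vertex -> (action, {observation: next vertex}).
POLICY_GRAPH = {
    "observe_1": ("Observe", {"Observe Empty": "observe_2", "Observe Occupied": "proceed"}),
    "observe_2": ("Observe", {"Observe Empty": "observe_3", "Observe Occupied": "proceed"}),
    "observe_3": ("Observe", {"Observe Empty": "enter", "Observe Occupied": "proceed"}),
    "enter": ("Enter Room and Proceed",
              {"Observe Empty": "observe_1", "Observe Occupied": "observe_1"}),
    "proceed": ("Proceed",
                {"Observe Empty": "observe_1", "Observe Occupied": "observe_1"}),
}
START_VERTEX = "observe_1"

def run_policy_graph(execute_action, num_steps):
    """Execute the policy graph: perform the action of the current vertex,
    make an observation, and follow the matching outgoing edge."""
    vertex = START_VERTEX
    for _ in range(num_steps):
        action, edges = POLICY_GRAPH[vertex]
        observation = execute_action(action)  # the robot acts and then senses
        vertex = edges[observation]

Structurally this is already a MissionLab-style finite state automaton: the vertices play the role of behaviors and the observations play the role of triggers, which is exactly the similarity that the robot architecture described next exploits.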
The Robot Architecture

[Figure 3: Robot Architecture. Flow graph: POMDP, POMDP Planner, Policy Graph, Policy Graph Reducer, Reduced Policy Graph, MissionLab.]

Figure 3 shows a flow graph of our robot architecture. The user inputs a POMDP that models the planning task. The robot architecture then uses the POMDP planner to produce a policy graph and removes all vertices that are unreachable from the initial vertex. By mapping the actions of the policy graph to behaviors and the observations to triggers, the policy graph is then transformed to a finite state automaton and used to control the operation of MissionLab. The user still has to input information but now only the planning task and its parameters (that is, the probabilities and costs) and no longer the plans. Once the finite state automaton is read into MissionLab, we allow the user to examine and edit it, for example, to add additional parts to the mission or make it part of a larger finite state automaton. Figure 4 shows a screenshot of the policy graph from Figure 2 after it was read into MissionLab and augmented with details about how to implement PROCEED (namely by marking the current doorway and proceeding along the hallway until the robot is at an unmarked doorway) and ENTER ROOM AND PROCEED (namely by entering the room, leaving the room, marking the current doorway, and proceeding along the hallway until the robot is at an unmarked doorway). Furthermore, the user decided that it was more robust to start with the behavior that proceeds along the hallway because then the robot can start anywhere in the hallway and not only at doorways.

[Figure 4: Finite State Automaton. Screenshot of the policy graph from Figure 2 after it was read into MissionLab and augmented by the user.]

Our robot architecture shows that it is possible to integrate POMDP planning and behavior-based systems. However, there are differences between the policy graphs generated by POMDP planners and the finite state automata of MissionLab, mostly because POMDPs assume that actions are discrete and that the robot makes an observation after each action execution. Finite state automata assume that behaviors are continuous and triggers can be observed at any time during the execution of the behaviors. Two issues need to be addressed in this context. First, we need MissionLab to be able to deal with actions of finite duration. We deal with this problem by adding an ACTION FINISHED trigger to MissionLab. (This extension is not needed for our sensor-planning task.) Second, we need to deal with a potential combinatorial explosion of the number of observations. Most POMDP planners assume that every observation can be made in every state. Consequently, every vertex in a policy graph has one outgoing edge for each possible observation. However, the observations are n-tuples if there are n sensors and the number of observations can thus be large. This is not a problem for finite state automata since observations that do not cause state transitions do not appear in them. We deal with this problem by omitting subtasks from the POMDP planning task that can be abstracted away or are pre-sequenced and do not need to be planned. For example, ENTER ROOM AND PROCEED is a macro-behavior that consists of a sequence of observations and behaviors, as shown in Figure 4. By omitting the details of ENTER ROOM AND PROCEED, the observations IN ROOM, IN HALLWAY, and MARKED DOORWAY do not need to be considered during planning.
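The two mechanical steps in this pipeline, pruning unreachable vertices and renaming graph elements, are simple enough to sketch. The following is an illustrative version of ours, reusing the policy-graph encoding from the earlier sketch; the function names and the behavior_of/trigger_of mappings are hypothetical, not the actual MissionLab or planner code.

from collections import deque

def reachable_vertices(policy_graph, start):
    """Breadth-first search from the start vertex; anything not visited can be
    dropped from the policy graph before it is handed to MissionLab."""
    visited, frontier = {start}, deque([start])
    while frontier:
        _, edges = policy_graph[frontier.popleft()]
        for successor in edges.values():
            if successor not in visited:
                visited.add(successor)
                frontier.append(successor)
    return visited

def reduce_policy_graph(policy_graph, start):
    keep = reachable_vertices(policy_graph, start)
    return {v: spec for v, spec in policy_graph.items() if v in keep}

def to_finite_state_automaton(policy_graph, behavior_of, trigger_of):
    """Map actions to behaviors and observations to triggers, yielding
    (state, trigger, next state) transitions for an FSA-style controller."""
    fsa_states = {v: behavior_of[action] for v, (action, _) in policy_graph.items()}
    fsa_transitions = [
        (v, trigger_of[obs], succ)
        for v, (_, edges) in policy_graph.items()
        for obs, succ in edges.items()
    ]
    return fsa_states, fsa_transitions

Only the pruning and the renaming are shown here; generating the policy graph itself is the job of the POMDP planner.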
Experiments

We test the performance of our system, both analytically and experimentally, by comparing the average total reward of its plans (that is, optimal plans) against the plans typically generated by users. For our sensor-planning task, users typically create plans that sense only once, no matter what the probabilities and costs are. The robot executes ENTER ROOM AND PROCEED if it senses ROOM EMPTY, otherwise it executes PROCEED. We therefore use this plan as the baseline plan and compare the plans generated by our system against it.

Analytical Results: To demonstrate that our system has an advantage over the previous system because it is able to optimize its plans during missions when it is able to estimate the costs more precisely, we determine analytically how the average total reward of the baseline plan depends on the reward x = r(s2,a1) for entering an occupied room. Let k be the state directly before the robot executes OBSERVE, l the state directly before it executes ENTER ROOM AND PROCEED, and m the state directly before it executes PROCEED. If the robot is in k, then it incurs a cost of 5 for executing OBSERVE. It then transitions from k to l with the probability with which the sensor reports OBSERVE EMPTY, otherwise it transitions to m. Using the notation of Figure 1, the probability p(o_t = o1) with which the sensor reports OBSERVE EMPTY is p(o_t = o1) = q(o1|s1,a2) p(s_t = s1) + q(o1|s2,a2) p(s_t = s2) = 1.0 · 0.5 + 0.2 · 0.5 = 3/5. Consequently, the average total reward v(k) of the baseline plan if the robot starts in k is v(k) = -5 + γ(3/5 v(l) + 2/5 v(m)). Similar derivations result in a system of three linear equations in three unknowns:

v(k) = -5 + γ (3/5 v(l) + 2/5 v(m))
v(l) = 1/6 x + 5/6 · 100 + γ v(k)
v(m) = -50 + γ v(k)

Solving this system of equations yields v(k) = 1241.21 + 4.97x.
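The closed form can be checked mechanically; the short sketch below (ours, not from the paper) substitutes v(l) and v(m) into the first equation for γ = 0.99 and reproduces, for example, the baseline value of about -1246 at x = -500 that is quoted below.

GAMMA = 0.99

def baseline_value(x: float) -> float:
    """Solve v(k) = -5 + g*(3/5 v(l) + 2/5 v(m)) with
       v(l) = x/6 + 500/6 + g*v(k) and v(m) = -50 + g*v(k)."""
    g = GAMMA
    # 3/5 v(l) + 2/5 v(m) = x/10 + 30 + g*v(k), hence
    # v(k) * (1 - g**2) = -5 + g*(x/10 + 30).
    return (-5.0 + g * (x / 10.0 + 30.0)) / (1.0 - g ** 2)

print(round(baseline_value(-500), 2))  # approx. -1246.2, i.e. 1241.21 + 4.97x at x = -500
print(round(baseline_value(-200), 2))  # approx. 246.2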
[Figure 5: Average Total Rewards vs. Reward for Entering an Occupied Room (x) (Analytical). The plot compares the baseline plan with the optimal plan; markers indicate where the optimal plan changes between observing 0, 1, 2, and 3 times.]

Figure 5 shows this graph together with the average total reward of the plans generated by our system, as a function of x. As can be seen, the number of times a robot has to sense OBSERVE EMPTY before it enters a room increases as it becomes more expensive to enter an occupied room. (The markers show when a change in plan occurs.) The robot pays a cost for the additional sensing operations but this decreases the probability of entering an occupied room. Changing the plans as x changes allows the average total reward of the plans generated by our system to deteriorate much more slowly than the average total reward of the baseline plan. This result shows that our system has an advantage over the previous system because it is able to adapt plans during missions when the costs can be estimated more precisely. It also shows that our system has an advantage over the previous system because humans are not good at planning with uncertainty and thus their plans are rarely optimal. For example, the original sensor-planning problem has x = -500 and the average total reward of the baseline plan is only -1,246.24 whereas the average total reward of the plan generated by our system is 374.98.

Similar results can be observed if the initial probabilities are inaccurate. Figure 6 shows the average total reward of the baseline plan together with the average total reward of the plans generated by our system, as a function of the probability y = q(o2|s2,a2) with which the microphone correctly classifies an occupied room, for both x = -200 and x = -500. As can be seen, the number of times a robot has to sense OBSERVE EMPTY before it enters a room increases as the sensor becomes more noisy. This again allows the average total reward of the plans generated by our system to deteriorate much more slowly than the average total reward of the baseline plan, demonstrating the advantages of our system.

[Figure 6: Average Total Rewards vs. Probability of Correctly Classifying an Occupied Room (Analytical). The plot compares the baseline plan and the optimal plan for x = -200 and x = -500 as y ranges from 0.5 to 1.0; markers indicate where the optimal plan changes between observing 1 and 5 times.]
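The baseline-plan curves in Figures 5 and 6 can be reproduced by generalizing the linear system above so that the probability of an OBSERVE EMPTY reading and the posterior after one reading depend on y. The sketch below is ours (a numpy-based illustration, not the system's code); the generalization follows the same derivation as for y = 0.8 and reduces to the earlier system at that value.

import numpy as np

GAMMA = 0.99

def baseline_value_xy(x: float, y: float) -> float:
    """Average total reward v(k) of the 'observe once' baseline plan when the
    microphone detects an occupied room with probability y (y = 0.8 above)."""
    g = GAMMA
    p_empty_reading = 0.5 * 1.0 + 0.5 * (1.0 - y)       # p(o_t = o1)
    post_occ = 0.5 * (1.0 - y) / p_empty_reading         # P(occupied | OBSERVE EMPTY)
    # Unknowns ordered as (v(k), v(l), v(m)).
    A = np.array([[1.0, -g * p_empty_reading, -g * (1.0 - p_empty_reading)],
                  [-g, 1.0, 0.0],
                  [-g, 0.0, 1.0]])
    b = np.array([-5.0, post_occ * x + (1.0 - post_occ) * 100.0, -50.0])
    return np.linalg.solve(A, b)[0]

for y in (0.99, 0.9, 0.8, 0.7, 0.6):
    print(y, round(baseline_value_xy(-500, y), 1), round(baseline_value_xy(-200, y), 1))

At y = 0.8 the probability of an OBSERVE EMPTY reading is 3/5 and the posterior is 1/6, so the function reduces to the system above; sweeping y shows the baseline plan degrading as the microphone becomes noisier, consistent with Figure 6.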
Experimental Results: We also performed a simulation study with MissionLab to compare the average total reward of the plans generated by our system against the baseline plan, for y = 0.8 and both x = -200 and x = -500. We used four rooms and averaged over ten runs. Figure 7 shows the results. In both cases, the average total reward of the baseline plan is much smaller than the average total reward of the plans generated by our system. (The table shows that the average total reward of the plans generated by our system actually increased as it became more costly to enter an occupied room. This artifact is due to the reduced probability of entering an occupied room, causing the situation to never occur during our limited number of runs.) These results are similar to the analytical results shown in Figure 5.

Figure 7: Average Total Rewards (Experimental)

                                              Baseline Plan    Optimal Plan
Reward for Entering Occupied Room x = -200    16.92            282.89
Reward for Entering Occupied Room x = -500    -1000.42         498.23

Conclusion

This paper reported on initial work that uses Partially Observable Markov Decision Process models (POMDPs) in the context of behavior-based systems. The insight to making this combination work is that POMDP planners can generate policy graphs rather than the more popular value surfaces, and policy graphs are similar to the finite state automata that behavior-based systems use to sequence their behaviors. This combination also keeps the POMDPs small, which allows our POMDP planners to find optimal or close-to-optimal plans whereas the POMDP planners of other robot architectures can only find very suboptimal plans.

We used this insight to improve MissionLab, a behavior-based system where the finite state automata had to be programmed by the users of the system at the beginning of the mission. This had the disadvantage that humans are not good at planning with uncertainty and thus their plans are rarely optimal. In contrast, our robot architecture optimizes the plans itself and can continue to optimize them during missions, for example, when it learns more accurate probabilities or when the environment changes.

It is future work to study interfaces that allow users to easily input POMDPs, including probabilities and costs. Also, we intend to implement sampling methods for adapting the probabilities and costs of POMDPs during missions to be able to update the plan during execution. Finally, it is future work to scale up our robot architecture by developing POMDP planners that are able to take advantage of the structure of the POMDP planning tasks and thus are more efficient than current POMDP planners.

References

Arkin, R. 1998. Behavior-Based Robotics. MIT Press.

Cassandra, A.; Kaelbling, L.; and Kurien, J. 1996. Acting under uncertainty: Discrete Bayesian models for mobile robot navigation. In Proceedings of the International Conference on Intelligent Robots and Systems, 963-972.

Cassandra, A.; Kaelbling, L.; and Littman, M. 1994. Acting optimally in partially observable stochastic domains. In Proceedings of the National Conference on Artificial Intelligence, 1023-1028.

Endo, Y.; MacKenzie, D.; Stoychev, A.; Halliburton, W.; Ali, K.; Balch, T.; Cameron, J.; and Chen, Z. 2000. MissionLab: User manual for MissionLab version 4.0. Technical report, College of Computing, Georgia Institute of Technology.

Fox, D.; Burgard, W.; and Thrun, S. 1998. Active Markov localization for mobile robots. Robotics and Autonomous Systems 25:195-207.

Hansen, E. 1998. Solving POMDPs by searching in policy space. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 211-219.

Koenig, S., and Simmons, R. 1998. Xavier: A robot navigation architecture based on partially observable Markov decision process models. In Kortenkamp, D.; Bonasso, R.; and Murphy, R., eds., Artificial Intelligence Based Mobile Robotics: Case Studies of Successful Robot Systems. MIT Press. 91-122.

Mahadevan, S.; Theocharous, G.; and Khaleeli, N. 1998. Rapid concept learning for mobile robots. Autonomous Robots Journal 5:239-251.

Papadimitriou, C., and Tsitsiklis, J. 1987. The complexity of Markov decision processes. Mathematics of Operations Research 12(3):441-450.

Puterman, M. 1994. Markov Decision Processes - Discrete Stochastic Dynamic Programming. Wiley.

Simmons, R., and Koenig, S. 1995. Probabilistic robot navigation in partially observable environments. In Proceedings of the International Joint Conference on Artificial Intelligence, 1080-1087.

Sondik, E. 1978. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research 26(2):282-304.