
Imitating Driver Behavior with Generative Adversarial Networks

Alex Kuefler¹, Jeremy Morton², Tim Wheeler², and Mykel Kochenderfer²

arXiv:1701.06699v1 [cs.AI] 24 Jan 2017

¹ Alex Kuefler is in the Symbolic Systems Program at Stanford University, Stanford, CA 94305, USA. [email protected]
² Jeremy Morton, Tim Wheeler, and Mykel Kochenderfer are in the Department of Aeronautics and Astronautics at Stanford University, Stanford, CA 94305, USA. {jmorton2, wheelert, mykel}@stanford.edu

Abstract—The ability to accurately predict and simulate human driving behavior is critical for the development of intelligent transportation systems. Traditional modeling methods have employed simple parametric models and behavioral cloning. This paper adopts a method for overcoming the problem of cascading errors inherent in prior approaches, resulting in realistic behavior that is robust to trajectory perturbations. We extend Generative Adversarial Imitation Learning to the training of recurrent policies, and we demonstrate that our model outperforms rule-based controllers and maximum likelihood models in realistic highway simulations. Our model reproduces emergent behavior of human drivers, such as lane change rate, while maintaining realistic control over long time horizons.

I. INTRODUCTION

Accurate human driver models are critical for realistic simulation of driving scenarios, and have the potential to significantly advance research in automotive safety. Traditionally, human driver modeling has been the subject of both rule-based and data-driven approaches. Early rule-based attempts include parametric models of car-following behavior, with strong built-in assumptions about road conditions [1] or driver behavior [2]. The Intelligent Driver Model (IDM) [3] extended this work by capturing asymmetries between acceleration and deceleration, preferred free road and bumper-to-bumper headways, and realistic braking behavior. Such car-following models were later extended to multilane conditions with controllers like MOBIL [4], which maintains a utility function and "politeness parameter" to capture intelligent driver behavior in both acceleration and turning. These controllers are all largely characterized by smooth, collision-free driving, but rely on assumptions about driver behavior and a small set of parameters that may not generalize well to diverse driving scenarios.

In contrast, imitation learning (IL) approaches rely on data typically provided through human demonstration in order to learn a policy that behaves similarly to an expert. These policies can be represented with expressive models, such as neural networks, with less interpretable parameters than those used by rule-based methods. Prior human behavior models for highway driving have relied on behavioral cloning (BC), which treats IL as a supervised learning problem, fitting a model to a fixed dataset of expert state-action pairs [5–8]. ALVINN [9], an early BC approach, trained a neural network to map raw images and rangefinder inputs to discrete turning actions. Recent advances in computing and deep learning have allowed this approach to scale to realistic scenarios, such as parking lot, highway, and markerless road conditions [10]. These BC approaches are conceptually sound [11], but tend to fail in practice as small inaccuracies compound during simulation. Inaccuracies lead the policy to states that are underrepresented in the training data (e.g., an ego-vehicle edging towards the side of the road), which leads to yet poorer predictions, and ultimately to invalid or unseen situations (e.g., off-road driving). This problem of cascading errors [12] is well known in the sequential decision making literature and has motivated work on alternative IL methods, such as inverse reinforcement learning (IRL) [14].

Inverse reinforcement learning assumes that the expert follows an optimal policy with respect to an unknown reward function. If the reward function is recovered, one can simply use RL to find a policy that behaves identically to the expert. This imitation extends to unseen states; in highway driving, a vehicle that is perturbed toward the lane boundaries should know to return toward the lane center. IRL thus generalizes much more effectively and does not suffer from many of the problems of BC. Because of these benefits, some recent efforts in human driver modeling emphasize IRL [15, 16]. However, IRL approaches are typically computationally expensive in their recovery of an expert cost function. Instead, recent work has attempted to imitate expert behavior through direct policy optimization, without first learning a cost function [13, 17]. Generative Adversarial Imitation Learning (GAIL) [17] in particular has performed well on a number of benchmark tasks, leveraging the insight that expert behavior can be imitated by training a policy to produce actions that a binary classifier mistakes for those of an expert.

In this work, we apply GAIL to the task of modeling human highway driving behavior. Our major contributions are twofold. First, we extend GAIL to the optimization of recurrent neural networks, showing that such policies perform with greater fidelity to expert behavior than feedforward counterparts. Second, we apply our models to a realistic highway simulator, where expert demonstrations are given by real-world driver trajectories included in the NGSIM dataset [18, 19]. We demonstrate that policy networks optimized by GAIL capture many desirable properties of earlier IL models, such as reproducing emergent driver behavior and assigning high likelihood to expert actions, while simultaneously reducing the collision and off-road rates necessary for long-horizon highway simulations. Unlike past work, our model learns to map raw LIDAR readings and simple, hand-picked road features to continuous actions, adjusting only turn-rate and acceleration at each time step.
II. PROBLEM FORMULATION

We regard highway driving as a sequential decision making task in which the driver obeys a stochastic policy π(a | s) mapping observed road conditions s to a distribution over driving actions a. Given a class of policies π_θ parameterized by θ, we seek to find the policy that best mimics human driving behavior. We adopt an IL approach to infer this policy from a dataset consisting of a sequence of state-action tuples (s_t, a_t). IL can be performed using BC or reinforcement learning.

A. Behavioral Cloning

Behavioral cloning solves a regression problem in which the policy parameterization is obtained by maximizing the likelihood of the actions taken in the training data. BC works well for states adequately covered by the training data. It is forced to generalize when predicting actions for states with little or no data coverage, which can lead to poor behavior. Unfortunately, even if simulations are initialized in common states, the stochastic nature of the policies allows small errors in action predictions to compound over time, eventually leading to states that human drivers infrequently visit and that are not adequately covered by the training data. Poorer predictions can cause a feedback cycle known as cascading errors [20].

In a highway driving context, cascading errors can lead to off-road driving and collisions. Datasets rarely contain information about how human drivers behave in these situations, which can lead BC policies to act erratically when they encounter such states.

Behavioral cloning has been successfully used to produce driving policies for simple behaviors such as car-following on freeways, in which the state and action space can be adequately covered by the training set. When applied to learning general driving models with nuanced behavior and the potential to drive out of lane, BC only produces accurate predictions up to a few seconds [5, 6].

B. Reinforcement Learning

Reinforcement learning (RL) instead assumes that drivers in the real world follow an expert policy π_E whose actions maximize the expected, global return

R(\pi, r) = \mathbb{E}_\pi \left[ \sum_{t=0}^{T} \gamma^t \, r(s_t, a_t) \right],    (1)

weighted by a discount factor γ ∈ [0, 1). The local reward function r(s_t, a_t) may be unknown, but it fully characterizes expert behavior such that any policy optimizing R(π, r) will perform indistinguishably from π_E.

Learning with respect to R(π, r) has several advantages over maximum likelihood BC in the context of sequential decision making [21]. First, r(s_t, a_t) is defined for all state-action pairs, allowing an agent to receive a learning signal even from unusual states. In contrast, BC only receives a learning signal for those states represented in a labeled, finite dataset. Second, unlike labels, rewards allow a learner to establish preferences between mildly undesirable behavior (e.g., hard braking) and extremely undesirable behavior (e.g., collisions). And finally, RL maximizes the global, expected return on a trajectory, rather than local instructions for each observation. Once preferences are learned, a policy may take mildly undesirable actions now in order to avoid awful situations later. As such, reinforcement learning algorithms provide robustness against cascading errors.
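As a concrete reading of Eq. (1), the minimal sketch below estimates a policy's expected return by Monte Carlo rollouts. The policy, environment, and reward interfaces are hypothetical stand-ins used only for illustration; they are not part of the paper.

```python
import numpy as np

def estimate_return(policy, env, reward_fn, gamma=0.99, horizon=100, n_rollouts=50):
    """Monte Carlo estimate of R(pi, r) from Eq. (1).

    Hypothetical interfaces for illustration only:
      policy(s)     -> action a
      env.reset()   -> initial state s
      env.step(a)   -> (next state s, done flag)
      reward_fn(s, a) -> local reward r(s, a)
    """
    returns = []
    for _ in range(n_rollouts):
        s = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            total += discount * reward_fn(s, a)   # accumulate gamma^t * r(s_t, a_t)
            discount *= gamma
            s, done = env.step(a)
            if done:
                break
        returns.append(total)
    return np.mean(returns)
```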
III. POLICY REPRESENTATION

Our learned policy must be able to capture human driving behavior, which involves:

• Non-linearity in the desired mapping from states to actions (e.g., large corrections in steering to avoid collisions caused by small changes in the current state).
• High-dimensionality of the state representation, which must describe properties of the ego-vehicle, in addition to surrounding cars and road conditions.
• Stochasticity, because humans may take different actions each time they encounter a given traffic scene.

To address the first and second points, we represent all learned policies π_θ using neural networks. To address the third point, we interpret the network's real-valued outputs given input s_t as the mean µ_t and logarithm of the diagonal covariance log ν_t of a Gaussian distribution. Actions are chosen by sampling a_t ∼ π_θ(a_t | s_t). An example feedforward model is shown in Fig. 2. We evaluate both feedforward and recurrent network architectures.
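As a minimal sketch of this output parameterization (assuming the two-dimensional turn-rate/acceleration action used later in the paper), the snippet below converts a network's outputs µ_t and log ν_t into a sampled action. The function interface is illustrative and not taken from the paper.

```python
import numpy as np

def sample_action(mu_t, log_nu_t, rng=np.random.default_rng()):
    """Sample a_t ~ N(mu_t, diag(nu_t)), where the network outputs the mean
    mu_t and the logarithm of the diagonal covariance log nu_t."""
    std = np.sqrt(np.exp(log_nu_t))           # diagonal standard deviations
    return mu_t + std * rng.standard_normal(np.shape(mu_t))

# e.g. a two-dimensional action [turn-rate (rad/s), acceleration (m/s^2)]
a_t = sample_action(np.array([0.0, 0.5]), np.array([-3.0, -2.0]))
```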
Feedforward neural networks directly map inputs to outputs. The most common architecture, multilayer perceptrons (MLPs), consists of alternating layers of tunable weights and element-wise nonlinearities. Neural networks have gained widespread popularity due to their ability to learn robust hierarchical features from complicated inputs [22, 23], and have been used in automotive behavioral modeling for action prediction in car-following contexts [6, 24–27], lateral position prediction [28], and maneuver classification [29].

The feedforward MLP is limited in its ability to adequately address partially observable environments. In real-world driving, sensor error and occlusions may prevent the driver from seeing all relevant parts of the driving state. By maintaining sufficient statistics of past observations in memory, recurrent policies [30, 31] disambiguate perceptually similar states by acting with respect to histories of, rather than individual, observations. In this work, we represent recurrent policies using Gated Recurrent Unit (GRU) networks due to their comparable performance with fewer parameters than other architectures [32].

We use similar architectures for the feedforward and recurrent policies. The recurrent policies consist of five feedforward layers that decrease in size from 256 to 32 neurons, with an additional GRU layer consisting of 32 neurons. Exponential linear units (ELU) were used throughout the network, which have been shown to combat the vanishing gradient problem while supporting a zero-centered distribution of activation vectors [33]. The MLP policies have the same architecture, except the GRU layer is replaced with an additional feedforward layer. For each network architecture, one policy is trained through BC and one policy is trained through GAIL. In all, we trained four neural network policies: GAIL GRU, GAIL MLP, BC GRU, and BC MLP.
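The sketch below mirrors the recurrent architecture described above (five feedforward ELU layers tapering from 256 down to 32 units, plus a 32-unit GRU layer). The exact taper of the hidden layers, the 51-dimensional input, the two-dimensional action, the use of PyTorch, and the separate linear heads for µ_t and log ν_t are illustrative assumptions rather than details given in the paper.

```python
import torch
import torch.nn as nn

class GRUPolicy(nn.Module):
    """Sketch of the recurrent driving policy: an ELU feedforward trunk
    followed by a GRU cell, with linear heads producing the Gaussian mean
    and log diagonal covariance over actions."""

    def __init__(self, obs_dim=51, act_dim=2,
                 hidden=(256, 128, 64, 48, 32), gru_dim=32):
        super().__init__()
        layers, last = [], obs_dim
        for h in hidden:                      # five feedforward ELU layers, 256 down to 32
            layers += [nn.Linear(last, h), nn.ELU()]
            last = h
        self.trunk = nn.Sequential(*layers)
        self.gru = nn.GRUCell(last, gru_dim)  # the MLP variant swaps this for another ELU layer
        self.mu_head = nn.Linear(gru_dim, act_dim)
        self.log_nu_head = nn.Linear(gru_dim, act_dim)

    def forward(self, obs, h):
        h = self.gru(self.trunk(obs), h)
        return self.mu_head(h), self.log_nu_head(h), h

# one simulation step; the action is then sampled as in the previous sketch
policy = GRUPolicy()
h = torch.zeros(1, 32)
mu_t, log_nu_t, h = policy(torch.zeros(1, 51), h)
```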
IV. POLICY OPTIMIZATION

Contrary to BC, which is trained with traditional regression techniques, reinforcement learning policies do not have training labels for individual actions. Controller performance is instead evaluated by expected return. This approach is problematic in modeling human drivers, as the reward function r(s_t, a_t) is unknown. We first discuss a method for training a policy with a known reward function and then provide a method for learning the reward function.

A. Trust Region Policy Optimization

Policy gradient algorithms are a particularly effective class of reinforcement learning techniques for optimizing differentiable policies, including neural networks. As with standard backpropagation, network parameters are optimized using gradient-based updates, but the gradient can only be approximated using simulated rollouts of the policy interacting with the environment. This empirical gradient estimate typically exhibits a high amount of variance. In practice, this variance can cause parameter updates that do not improve or even reduce performance. In this work, we use Trust Region Policy Optimization (TRPO) to learn our human driving policies [34]. TRPO updates policy parameters through a constrained optimization procedure that enforces that a policy cannot change too much in a single update, and hence limits the damage that can be caused by noisy gradient estimates.

Although the true reward function that governs the behavior of any particular human driver is unknown, domain knowledge can be used to craft a surrogate reward function such that a policy maximizing this quantity will realize a similar stochastic state-action mapping as π_E. Drivers avoid collisions and going off road, while also favoring smooth driving and minimizing lane offset. If such features can be combined into a reward function that closely approximates the true reward function for human driving r(s_t, a_t), then modeling driver behavior reduces to RL. However, handcrafting an accurate reward function is often difficult, which motivates the use of Generative Adversarial Imitation Learning.

B. Generative Adversarial Imitation Learning

Although r(s_t, a_t) is unknown, a surrogate reward r̃(s_t, a_t) may be learned directly from data, without making use of domain knowledge. GAIL [17] trains a policy to perform expert-like behavior by rewarding it for "deceiving" a classifier trained to discriminate between policy and expert state-action pairs. Consider a set of simulated state-action pairs X_θ = {(s_1, a_1), (s_2, a_2), ..., (s_T, a_T)} sampled from π_θ and a set of expert pairs X_E sampled from π_E. For a neural network D_ψ parameterized by ψ, the GAIL objective is given by:

\max_\psi \min_\theta V(\theta, \psi) = \mathbb{E}_{(s,a) \sim X_E}\left[\log D_\psi(s, a)\right] + \mathbb{E}_{(s,a) \sim X_\theta}\left[\log\left(1 - D_\psi(s, a)\right)\right].    (2)

When fitting ψ, Equation (2) can simply be interpreted as a sigmoid cross entropy objective, maximized by minibatch gradient ascent. Positive examples are sampled from X_E and negative examples are sampled from rollouts generated by interactions of π_θ with the simulation environment. However, V(θ, ψ) is non-differentiable with respect to θ, requiring optimization via RL.

In order to fit π_θ, a surrogate reward function can be formulated from Eq. (2) as:

\tilde{r}(s_t, a_t; \psi) = -\log\left(1 - D_\psi(s_t, a_t)\right),    (3)

which approaches infinity as tuples (s_t, a_t) drawn from X_θ become indistinguishable from elements of X_E based on the predictions of D_ψ. After performing rollouts with a given set of policy parameters θ, surrogate rewards r̃(s_t, a_t; ψ) are calculated and TRPO is used to perform a policy update. Although r̃(s_t, a_t; ψ) may be quite different from the true reward function optimized by experts, it can be used to drive π_θ into regions of the state-action space similar to those explored by π_E.
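A minimal sketch of the discriminator side of Eqs. (2) and (3): D_ψ is fit with a sigmoid cross entropy loss in which expert pairs are positive examples and policy rollout pairs are negative examples, and the surrogate reward of Eq. (3) is then computed for the rollouts before the TRPO policy update (the TRPO step itself is omitted). The discriminator architecture, optimizer settings, and numerical-stability epsilon are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn

# D_psi: a simple MLP over concatenated (state, action) vectors (architecture assumed)
disc = nn.Sequential(
    nn.Linear(51 + 2, 128), nn.ELU(),
    nn.Linear(128, 1),
)
opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(expert_sa, policy_sa):
    """One minibatch ascent step on Eq. (2): expert (s, a) pairs are labeled 1,
    policy rollout pairs are labeled 0."""
    logits_e = disc(expert_sa)
    logits_p = disc(policy_sa)
    loss = bce(logits_e, torch.ones_like(logits_e)) + \
           bce(logits_p, torch.zeros_like(logits_p))
    opt.zero_grad()
    loss.backward()
    opt.step()

def surrogate_reward(policy_sa):
    """Eq. (3): r~(s_t, a_t; psi) = -log(1 - D_psi(s_t, a_t))."""
    with torch.no_grad():
        d = torch.sigmoid(disc(policy_sa))
    return -torch.log(1.0 - d + 1e-8)         # small epsilon keeps the reward finite
```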
V. DATASET

We use the public Next-Generation Simulation (NGSIM) datasets for US Highway 101 [19] and Interstate 80 [18]. NGSIM provides 45 minutes of driving at 10 Hz for each roadway. The US Highway 101 dataset covers an area in Los Angeles approximately 640 m in length with five mainline lanes and a sixth auxiliary lane for highway entrance and exit. The Interstate 80 dataset covers an area in the San Francisco Bay Area approximately 500 m in length with six mainline lanes, including a high-occupancy vehicle lane and an onramp.

Traffic density in both datasets transitions from uncongested to full congestion and exhibits a high degree of vehicle interaction as vehicles merge on and off the highway and must navigate in congested flow. The diversity of driving conditions and the forced interaction of traffic participants make these sources particularly useful for behavioral studies. The trajectories were smoothed using an extended Kalman filter [35] on a bicycle model and projected to lanes using centerlines extracted from the NGSIM CAD files. Cars, trucks, buses, and motorcycles are in the dataset, but only car trajectories were used for model training.

VI. EXPERIMENTS

In this work, we use GAIL and BC to learn policies for two-dimensional highway driving. The performance of these policies is subsequently evaluated relative to baseline models.

Fig. 1: GAIL diagram. Circles represent values, rectangles represent operations, and dotted arrows represent recurrence relations. Whereas policy π_θ states s_t and actions a_t are evaluated by discriminator D_ψ during both training of π_θ and D_ψ, expert pairs (boxed) are sampled only while training D_ψ.

Fig. 2: Architecture for the feedforward multilayer perceptron driving policy. The network outputs µ and covariance parameters ν, which are used to construct a Gaussian distribution over driver actions.

A. Environment

All experiments were conducted with the rllab reinforcement learning framework [36]. The simulation environment is a driving simulation on the NGSIM 80 and 101 road networks. Simulations are initialized to match frames from the NGSIM data, and the ego vehicle is randomly chosen from among the traffic participants in the frame. Simulations are run for 100 steps at 10 Hz and are ended prematurely if the ego vehicle is involved in a collision, drives off road, or drives in reverse.

The ego vehicle is driven according to a bicycle model with acceleration and turn-rate sampled from the policy network. All other traffic participants are replayed directly from the NGSIM data, but are augmented with emergency braking in the event of an imminent rear-end collision with the ego vehicle. Specifically, if the acceleration predicted by the Intelligent Driver Model (IDM) [3] is less than an activation threshold of −2 m/s², the vehicle then accelerates according to the IDM while tracking the closest lane centerline. The IDM is parameterized with a desired speed equal to the vehicle's speed at transition, a minimum spacing of 1 m, a desired time headway of 0.5 s, a nominal acceleration of 3 m/s², and a comfortable braking deceleration of 2.5 m/s².
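For reference, a sketch of the standard IDM acceleration rule from [3] with the parameters listed above. Its role here is only the emergency braking fallback for replayed NGSIM vehicles; the function interface is an assumption for illustration.

```python
import math

def idm_accel(v, v_des, gap, closing_speed,
              s0=1.0, T=0.5, a_max=3.0, b=2.5, delta=4):
    """Standard Intelligent Driver Model acceleration [3].

    v             : speed of the following vehicle (m/s)
    v_des         : desired speed, set to the vehicle's speed at transition (m/s)
    gap           : bumper-to-bumper distance to the leader (m)
    closing_speed : v - v_leader (m/s)
    Parameters follow the simulation environment: minimum spacing 1 m, time
    headway 0.5 s, nominal acceleration 3 m/s^2, comfortable deceleration 2.5 m/s^2.
    """
    s_star = s0 + v * T + v * closing_speed / (2.0 * math.sqrt(a_max * b))
    return a_max * (1.0 - (v / v_des) ** delta - (s_star / max(gap, 0.1)) ** 2)

# Replayed NGSIM vehicles switch from trajectory playback to IDM control when
# idm_accel(...) drops below the -2 m/s^2 activation threshold.
```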
B. Features

All experiments use the same set of features, which can be decomposed into three sets. The first set, the core features, are eight scalar values that provide basic information about the vehicle's odometry, dimensions, and the lane-relative ego state. These are listed in Table I.

TABLE I: Core features used by the neural networks.

Feature                 | Units | Description
Speed                   | m/s   | longitudinal speed
Vehicle Length          | m     | bounding box length
Vehicle Width           | m     | bounding box width
Lane Offset             | m     | lateral centerline offset
Lane-Relative Heading   | rad   | heading angle in the Frenet frame
Lane Curvature          | 1/m   | curvature of the closest centerline point
Marker Dist. (L)        | m     | lateral distance to the left lane marking
Marker Dist. (R)        | m     | lateral distance to the right lane marking

The core features alone are insufficient to describe the local context. Information about neighboring vehicles and the local road structure must be incorporated as well. Several approaches exist for encoding such information in hand-selected features relevant to the driving task [37]. Rather than restrict the model to a subset of vehicle relationships, we introduce a more general and flexible feature representation.

In addition to the core features, a set of LIDAR-like beams emanating from the vehicle are used to gather information about its surroundings. These beams measure both the distance and range rate of the first vehicle struck by them, up to a maximum range. Our work used a maximum range of 100 m, with 20 range and range rate beams, each spaced uniformly in complete 360° coverage around the ego vehicle's center, as shown in Fig. 3.

Fig. 3: LIDAR-like beams used for measuring range and range rate.

Finally, a set of three indicator features are used to identify when the ego vehicle encounters undesirable states. These features take on a value of one whenever the ego vehicle is involved in a collision, drives off road, or travels in reverse, and are zero otherwise. All features were concatenated into a single 51-element vector and fed into each model.

The previous action taken by the ego vehicle is not included in the set of features provided to the policies. We found that policies can develop an over-reliance on previous actions at the expense of relying on the other features contained in their input. To counteract this problem, we studied the effect of replacing the previous actions with random noise early in the training process. However, it was found that even with these mitigations the inclusion of previous actions had a detrimental effect on policy performance.
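A small sketch summarizing the observation layout described above: 8 core features, 20 beam ranges, 20 beam range rates, and 3 indicator features concatenated into the 51-element vector, with beam angles spaced uniformly over 360°. How the individual ranges and range rates are measured is assumed to be handled elsewhere by the simulator.

```python
import numpy as np

N_BEAMS, MAX_RANGE = 20, 100.0
# beam angles spaced uniformly over 360 degrees around the ego vehicle's center
BEAM_ANGLES = np.linspace(0.0, 2.0 * np.pi, N_BEAMS, endpoint=False)

def build_observation(core, ranges, range_rates, collided, offroad, reversing):
    """Concatenate the 51-element observation: 8 core features (Table I),
    20 beam ranges, 20 beam range rates, and 3 indicator features."""
    indicators = np.array([float(collided), float(offroad), float(reversing)])
    obs = np.concatenate([core, ranges, range_rates, indicators])
    assert obs.shape == (51,)
    return obs
```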
C. Baseline Models

The first baseline that we use to compare against our deep policies is a static Gaussian (SG) model, which is an unchanging Gaussian distribution π(a | s) = N(a | µ, Σ) fit using maximum likelihood.

The second baseline model is a BC approach using mixture regression (MR) [6]. The model has been used for model-predictive control and has been shown to work well in simulation and in real-world drive tests. Our MR model is a Gaussian mixture over the joint space of the actions and features, trained using expectation maximization [38]. The stochastic policy is formed from the weighted combination of the Gaussian components conditioned on the features. Greedy feature selection is used during training to select a subset of predictors up to a maximum feature count threshold while minimizing the Bayesian information criterion [39].

The final baseline model uses a rule-based controller to govern the lateral and longitudinal motion of the ego vehicle. The longitudinal motion is controlled by the Intelligent Driver Model with the same parameters as the emergency braking controller used in the simulation environment. For the lateral motion, MOBIL [4] is used to select the desired lane, with a proportional controller used to track the lane centerline. A small amount of noise is added to both the lateral and longitudinal accelerations to make the controller nondeterministic.
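The static Gaussian baseline reduces to a single maximum likelihood Gaussian fit over all recorded actions, independent of state. A sketch, assuming the NGSIM actions are stacked into an (N, 2) array of turn-rate/acceleration pairs:

```python
import numpy as np

def fit_static_gaussian(actions):
    """Maximum likelihood fit of pi(a | s) = N(a | mu, Sigma), ignoring s.
    `actions` is an (N, 2) array of recorded turn-rate/acceleration pairs."""
    mu = actions.mean(axis=0)
    sigma = np.cov(actions, rowvar=False)     # MLE up to the N/(N-1) factor
    return mu, sigma

def sample_static_gaussian(mu, sigma, rng=np.random.default_rng()):
    """Draw one action from the state-independent policy."""
    return rng.multivariate_normal(mu, sigma)
```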
D. Validation

To evaluate the relative performance of each model, a systematic validation procedure was performed. For each model, 1,000 ten-second scenes were simulated 20 times each in an environment identical to the one used to train the GAIL policies. As these rollouts were performed, several metrics were extracted to quantify the ability of each model to simulate human driver behavior.

1) Root-Weighted Square Error: The root-weighted square error (RWSE) captures the deviation of a model's probability mass from real-world trajectories [40]. For a predicted variable v over m trajectories, we estimate the RWSE by sampling n = 20 simulated traces per recorded trajectory:

\mathrm{RWSE}_H = \sqrt{\frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( v_H^{(i)} - \hat{v}_H^{(i,j)} \right)^2 },    (4)

where v_H^{(i)} is the true value in the ith trajectory at time horizon H and \hat{v}_H^{(i,j)} is the simulated variable under sample j for the ith trajectory at time horizon H. We extract the RWSE in predictions of global position, centerline offset, and speed over time horizons up to 5 s.
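A direct transcription of Eq. (4) for a single horizon H, assuming the recorded values are stored in an (m,) array and the simulated samples in an (m, n) array:

```python
import numpy as np

def rwse(true_vals, sim_vals):
    """Root-weighted square error at a fixed horizon H (Eq. 4).

    true_vals : shape (m,)   -- v_H^(i), the recorded value per trajectory
    sim_vals  : shape (m, n) -- v_hat_H^(i,j), with n = 20 simulated traces each
    """
    m, n = sim_vals.shape
    sq_err = (true_vals[:, None] - sim_vals) ** 2
    return np.sqrt(sq_err.sum() / (m * n))
```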
2) Kullback-Leibler Divergence: Driver models should produce distributions over emergent quantities that match those observed in real-world data. For each model, empirical distributions were computed over speed, acceleration, turn-rate, jerk, and inverse time-to-collision (iTTC) over simulated trajectories. The closeness between the simulated and real-world distributions was quantified using the Kullback-Leibler (KL) divergence. Piecewise uniform distributions with 100 evenly spaced bins were used.

3) Emergent Behavior: We also extracted a set of emergent metrics that indicate model imitation performance in relation to the NGSIM dataset. These additional metrics are the lane change rate, the offroad duration, the collision rate, and the hard brake rate.

The lane change rate is the average number of times a vehicle makes a lane change within a 10-second trajectory. Offroad duration is the average number of time steps per trajectory that a vehicle spends more than 1 m outside the closest outer road marker. The collision rate is the fraction of trajectories in which the ego vehicle intersects with another traffic participant. The hard brake rate captures the frequency at which a model chooses to brake harder than −3 m/s².

The environment in which validation occurs is not entirely realistic, as the non-ego vehicles have pre-recorded trajectories and do not always properly respond to deviations of the ego vehicle from its original trajectory, leading to an artificially high number of collisions. Hence, we also extract the hard brake rate to help quantify how often dangerous driving situations occur.
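A sketch of the KL comparison for one quantity (e.g., acceleration): both the real-world and simulated samples are binned into the same 100 evenly spaced bins, and the discrete KL divergence is computed between the resulting piecewise uniform distributions. The direction KL(real ‖ simulated) and the smoothing constant for empty bins are assumptions, not details given in the paper.

```python
import numpy as np

def kl_divergence(real_samples, sim_samples, n_bins=100, eps=1e-12):
    """Discrete KL(real || simulated) between piecewise uniform (histogram)
    distributions over a shared set of evenly spaced bins."""
    lo = min(real_samples.min(), sim_samples.min())
    hi = max(real_samples.max(), sim_samples.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(real_samples, bins=bins)
    q, _ = np.histogram(sim_samples, bins=bins)
    p = p / p.sum() + eps                     # normalize counts, smooth empty bins
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))
```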
VII. RESULTS

Validation results for root-weighted square error are given in Fig. 4. The feedforward BC model has the best short-horizon performance, but then begins to accumulate error over longer time horizons. GAIL produces more stable trajectories, and its short-term predictions perform well. One can clearly see the IDM + MOBIL controller adhere to the lane centerline, so its lane offset error is close to a constant 0.5, which demonstrates that human drivers do not always closely track the nearest lane centerline.

Fig. 4: The root-weighted square error for each candidate model vs. prediction horizon (global position, lane offset, and speed over horizons up to 5 s). Deep policies outperform the other methods.

KL divergence scores are given in Fig. 5. The KL divergence results show very good tracking for SG in everything but jerk. SG cannot overfit and always takes the average action, but its performance in the other metrics is poor. GAIL GRU performs well on the iTTC, speed, and acceleration metrics. It does poorly with respect to turn-rate and jerk. This poor performance is likely due to the fact that, on average, the GAIL GRU policy takes similar actions to humans, but oscillates between actions more than humans. For instance, rather than outputting a turn-rate of zero on straight road stretches, it will alternate between outputting small positive and negative turn-rates. In comparison with the GAIL policies, the BC policies are worse with respect to iTTC. The BC GRU version has the largest KL divergence in acceleration, mostly due to its accelerations being generally small in magnitude, but does reasonably well with turn-rate and jerk.

Fig. 5: The KL divergence for various emergent metrics (iTTC, speed, acceleration, turn-rate, and jerk) pulled from 10 s trajectories.

Validation results for emergent variables are given in Fig. 6. The emergent values show that GAIL policies outperform the BC policies. The GAIL GRU policy has the closest match to the data everywhere except for hard brakes (it rarely takes extreme actions). Mixture regression largely performs better than SG and is on par with the BC policies, but is still susceptible to cascading errors. Offroad duration is perhaps the most striking statistic; only GAIL (and of course IDM + MOBIL) are able to stay on the road for extended stretches. SG never brakes hard because it only drives straight, and it has a high collision rate as a consequence. It is interesting that the collision rate for IDM + MOBIL is roughly the same as the collision rate for GAIL GRU, despite the fact that IDM + MOBIL should not collide. The inability of other vehicles within the simulation environment to fully react to the ego vehicle may explain this phenomenon.

Fig. 6: Emergent values for each model (lane change rate, offroad duration, collision rate, and hard brake rate), compared against NGSIM.

The results demonstrate that GAIL-based models capture many desirable properties of both rule-based and machine learning methods, while avoiding common pitfalls. With the exception of the hand-coded controller, GAIL policies achieve the lowest collision and off-road driving rates, considerably outperforming baseline and similarly structured BC models. Moreover, GAIL achieves a lane change rate closer to real human driving than any other method against which it is compared.

Furthermore, extending GAIL to recurrent policies leads to improved performance. This result is an interesting contrast with the BC policies, where the addition of recurrence largely does not lead to better results. Thus, we find that recurrence by itself is insufficient for addressing the detrimental effects that cascading errors can have on BC policies.

VIII. CONCLUSIONS

This paper demonstrates the effectiveness of deep imitation learning as a means of training driver models that perform realistically over long time horizons, while simultaneously capturing microscopic, human-like behavior. Our contributions have been to (1) extend Generative Adversarial Imitation Learning to the optimization of recurrent policies, and to (2) apply this technique to the creation of a new, intelligent model of highway driving that outperforms the state of the art on several metrics. Although behavioral cloning still outperforms Generative Adversarial Imitation Learning on short (∼2 s) horizons, its greedy behavior prevents it from achieving realistic driving over an extended period. The use of policy optimization by Generative Adversarial Imitation Learning enables us to overcome this problem of cascading errors to produce long-term, stable trajectories. Furthermore, the use of policy representation by deep, recurrent neural networks enables us to learn directly from general sensor inputs (i.e., LIDAR distance and range rate) that can capture arbitrary traffic states and simulate partial observability.

We have argued that reinforcement learning schemes incorporating surrogate reward functions overcome problems arising from supervised learning in highway driver modeling. As such, future work may wish to explore ways of combining other reward signals with our own. Whereas Generative Adversarial Imitation Learning captures human-like behavior present in the dataset, simulators may also wish to enforce certain behaviors (e.g., explicitly modeling driver style) by combining the learned, surrogate reward with a reward function crafted from hand-picked features. An engineered reward function could also be used to penalize the oscillations in acceleration and turn-rate produced by the GAIL GRU. Finally, we offer our model of human driving behavior as an important element for simulating realistic highway conditions. Future work will apply our model to decision making and safety validation. The code associated with this paper can be found at https://github.com/sisl/gail-driver.

ACKNOWLEDGMENT

This material is based upon work supported by the Ford Motor Company, Robert Bosch LLC, and the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-114747.

REFERENCES

[1] P. A. Seddon, "A program for simulating the dispersion of platoons of road traffic", Simulation, vol. 18, no. 3, pp. 81–90, 1972.
[2] P. G. Gipps, "A behavioural car-following model for computer simulation", Transportation Research Part B: Methodological, vol. 15, no. 2, pp. 105–111, 1981.
[3] M. Treiber, A. Hennecke, and D. Helbing, "Congested traffic states in empirical observations and microscopic simulations", Physical Review E, vol. 62, no. 2, p. 1805, 2000.
[4] A. Kesting, M. Treiber, and D. Helbing, "General lane-changing model MOBIL for car-following models", Journal of the Transportation Research Board, 2007.
[5] T. A. Wheeler, P. Robbel, and M. Kochenderfer, "Analysis of microscopic behavior models for probabilistic modeling of driver behavior", in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2016.
[6] S. Lefèvre, C. Sun, R. Bajcsy, and C. Laugier, "Comparison of parametric and non-parametric approaches for vehicle speed prediction", in American Control Conference (ACC), pp. 3494–3499, 2014.
[7] T. Gindele, S. Brechtel, and R. Dillmann, "Learning context sensitive behavior models from observations for predicting traffic situations", in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2013.
[8] G. Agamennoni, J. Nieto, and E. Nebot, "Estimation of multivehicle dynamics by considering contextual information", IEEE Transactions on Robotics, vol. 28, no. 4, pp. 855–870, 2012.
[9] D. A. Pomerleau, "ALVINN: an autonomous land vehicle in a neural network", DTIC Document, Tech. Rep. AIP-77, 1989.
[10] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, and J. Zhao, "End to end learning for self-driving cars", IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[11] U. Syed and R. E. Schapire, "A game-theoretic approach to apprenticeship learning", in Advances in Neural Information Processing Systems (NIPS), 2007.
[12] S. Ross and J. Bagnell, "Efficient reductions for imitation learning", in International Conference on Artificial Intelligence and Statistics, 2010.
[13] J. Ho, J. K. Gupta, and S. Ermon, "Model-free imitation learning with policy optimization", arXiv preprint arXiv:1605.08478, 2016.
[14] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning", in International Conference on Machine Learning (ICML), 2004.
[15] D. S. González, J. Dibangoye, and C. Laugier, "High-speed highway scene prediction based on driver models learned from demonstrations", in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2016.
[16] D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan, "Planning for autonomous cars that leverages effects on human actions", in Robotics: Science and Systems, 2016.
[17] J. Ho and S. Ermon, "Generative adversarial imitation learning", arXiv preprint arXiv:1606.03476, 2016.
[18] J. Colyar and J. Halkias, "US highway 80 dataset", Federal Highway Administration (FHWA), Tech. Rep. FHWA-HRT-06-137, 2006.
[19] J. Colyar and J. Halkias, "US highway 101 dataset", Federal Highway Administration (FHWA), Tech. Rep. FHWA-HRT-07-030, 2007.
[20] J. A. Bagnell, "An invitation to imitation", DTIC Document, Tech. Rep. CMU-RI-TR-15-08, 2015.
[21] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[22] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations", in International Conference on Machine Learning (ICML), 2009.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks", in Advances in Neural Information Processing Systems (NIPS), 2012.
[24] J. Hongfei, J. Zhicai, and N. Anning, "Develop a car-following model using data collected by 'five-wheel system'", in IEEE International Conference on Intelligent Transportation Systems (ITSC), vol. 1, 2003.
[25] S. Panwai and H. Dia, "Neural agent car-following models", IEEE Transactions on Intelligent Transportation Systems, vol. 8, no. 1, pp. 60–70, 2007.
[26] A. Khodayari, A. Ghaffari, R. Kazemi, and R. Braunstingl, "A modified car-following model based on a neural network model of the human driver effects", IEEE Transactions on Systems, Man, and Cybernetics, vol. 42, no. 6, pp. 1440–1449, 2012.
[27] J. Morton, T. A. Wheeler, and M. J. Kochenderfer, "Analysis of recurrent neural networks for probabilistic modeling of driver behavior", IEEE Transactions on Intelligent Transportation Systems, 2016.
[28] Q. Liu, B. Lathrop, and V. Butakov, "Vehicle lateral position prediction: a small step towards a comprehensive risk assessment system", in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2014.
[29] P. Boyraz, M. Acar, and D. Kerr, "Signal modelling and hidden Markov models for driving manoeuvre recognition and driver fault diagnosis in an urban road scenario", in IEEE Intelligent Vehicles Symposium, 2007.
[30] D. Wierstra, A. Förster, J. Peters, and J. Schmidhuber, "Recurrent policy gradients", Logic Journal of the IGPL, vol. 18, no. 5, pp. 620–634, 2010.
[31] N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver, "Memory-based control with recurrent neural networks", arXiv preprint arXiv:1512.04455, 2015.
[32] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling", arXiv preprint arXiv:1412.3555, 2014.
[33] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)", arXiv preprint arXiv:1511.07289, 2015.
[34] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz, "Trust region policy optimization", in International Conference on Machine Learning (ICML), 2015.
[35] R. E. Kalman, "A new approach to linear filtering and prediction problems", Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
[36] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control", arXiv preprint arXiv:1604.06778, 2016.
[37] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "DeepDriving: Learning affordance for direct perception in autonomous driving", in IEEE International Conference on Computer Vision (ICCV), 2015.
[38] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning. Springer, 2001.
[39] G. Schwarz, "Estimating the dimension of a model", The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[40] J. Cox and M. J. Kochenderfer, "Probabilistic airport acceptance rate prediction", in AIAA Modeling and Simulation Conference, 2016.
