Brain4Cars: Car That Knows Before You Do via Sensory-Fusion Deep Learning Architecture

Ashesh Jain1,2, Hema S. Koppula1,2, Shane Soh2, Bharad Raghavan2, Avi Singh1, and Ashutosh Saxena3
Cornell University1, Stanford University2, Brain Of Things Inc.3
{ashesh,hema}@cs.cornell.edu, [email protected], {shanesoh,bharadr,asaxena}@cs.stanford.edu

Abstract—Advanced Driver Assistance Systems (ADAS) have made driving safer over the last decade. They prepare vehicles for unsafe road conditions and alert drivers if they perform a dangerous maneuver. However, many accidents are unavoidable because by the time drivers are alerted, it is already too late. Anticipating maneuvers beforehand can alert drivers before they perform the maneuver and also give ADAS more time to avoid or prepare for the danger.

In this work we propose a vehicular sensor-rich platform and learning algorithms for maneuver anticipation. For this purpose we equip a car with cameras, Global Positioning System (GPS), and a computing device to capture the driving context from both inside and outside of the car. In order to anticipate maneuvers, we propose a sensory-fusion deep learning architecture which jointly learns to anticipate and fuse multiple sensory streams. Our architecture consists of Recurrent Neural Networks (RNNs) that use Long Short-Term Memory (LSTM) units to capture long temporal dependencies. We propose a novel training procedure which allows the network to predict the future given only a partial temporal context. We introduce a diverse data set with 1180 miles of natural freeway and city driving, and show that we can anticipate maneuvers 3.5 seconds before they occur in real-time with a precision and recall of 90.5% and 87.4% respectively.

Fig. 1: Anticipating maneuvers. Our algorithm anticipates driving maneuvers performed a few seconds in the future. It uses information from multiple sources including videos (face camera and road camera), vehicle dynamics, GPS, and street maps to anticipate the probability of different future maneuvers.

I. INTRODUCTION

Over the last decade cars have been equipped with various assistive technologies in order to provide a safe driving experience. Technologies such as lane keeping, blind spot check, pre-crash systems, etc., are successful in alerting drivers whenever they commit a dangerous maneuver [43]. Still, in the US alone more than 33,000 people die in road accidents every year, the majority of which are due to inappropriate maneuvers [2]. We therefore need mechanisms that can alert drivers before they perform a dangerous maneuver in order to avert many such accidents [56].

In order to anticipate maneuvers, we reason with the contextual information from the surrounding events, which we refer to as the driving context. We obtain this driving context from multiple sources. We use videos of the driver inside the car and of the road in front, the vehicle's dynamics, global position coordinates (GPS), and street maps; from this we extract a time series of multi-modal data from both inside and outside the vehicle. The challenge lies in modeling the temporal aspects of driving and fusing the multiple sensory streams. In this work we propose a specially tailored approach for anticipation in such sensory-rich settings.

Anticipation of the future actions of a human is an important perception task with applications in robotics and computer vision [39, 77, 33, 34, 73]. It requires the prediction of future events from a limited temporal context.
This differentiates anticipation from activity recognition [73], where the complete temporal context is available for prediction. Furthermore, in sensory-rich robotics settings like ours, the context for anticipation comes from multiple sensors. In such scenarios the end performance of the application largely depends on how the information from different sensors is fused. Previous works on anticipation [33, 34, 39] usually deal with a single data modality and do not address anticipation for sensory-rich robotics applications. Additionally, they learn representations using shallow architectures [30, 33, 34, 39] that cannot handle long temporal dependencies [6].

In this work we address the problem of anticipating maneuvers that a driver is likely to perform in the next few seconds. Figure 1 shows our system anticipating a left turn maneuver a few seconds before the car reaches the intersection. Our system also outputs probabilities over the maneuvers the driver can perform. With this prior knowledge of maneuvers, driver assistance systems can alert drivers about possible dangers before they perform the maneuver, thereby giving them more time to react. Some previous works [22, 41, 50] also predict a driver's future maneuver. However, as we show in the following sections, these methods use limited context and/or do not accurately model the anticipation problem.

In order to address the anticipation problem more generally, we propose a Recurrent Neural Network (RNN) based architecture which learns rich representations for anticipation. We focus on sensory-rich robotics applications, and our architecture learns how to optimally fuse information from different sensors. Our approach captures temporal dependencies by using Long Short-Term Memory (LSTM) units. We train our architecture in a sequence-to-sequence prediction manner (Figure 2) such that it explicitly learns to anticipate given a partial context, and we introduce a novel loss layer which helps anticipation by preventing over-fitting.

Fig. 2: (Left) Training an RNN for anticipation in a sequence-to-sequence prediction manner: a training example maps (x_1, x_2, ..., x_T) to y, and the network explicitly learns to map the partial context (x_1, ..., x_t) ∀t to the future event y. (Right) At test time the network's goal is to anticipate the future event as soon as possible, i.e. by observing only a partial temporal context (x_1, x_2, ..., x_t).

Fig. 3: Variations in the data set. Images from the data set [30] for a left lane change. (Left) Views from the road facing camera. (Right) Driving styles of the drivers vary for the same maneuver.

We evaluate our approach on a driving data set with 1180 miles of natural freeway and city driving collected across two states – from 10 drivers and with different kinds of driving maneuvers. The data set is challenging because of the variations in routes and traffic conditions, and the driving styles of the drivers (Figure 3). We demonstrate that our deep learning sensory-fusion approach anticipates maneuvers 3.5 seconds before they occur with 84.5% precision and 77.1% recall while using an out-of-the-box face tracker. With more sophisticated 3D pose estimation of the face, our precision and recall increase to 90.5% and 87.4% respectively. We believe that our work creates scope for new ADAS features to make roads safer. In summary our key contributions are as follows:
• We propose an approach for anticipating driving maneuvers several seconds in advance.
• We propose a generic sensory-fusion RNN-LSTM architecture for anticipation in robotics applications.
• We release the first data set of natural driving with videos from both inside and outside the car, GPS, and speed information.
• We release an open-source deep learning package NeuralModels which is especially designed for robotics applications with multiple sensory streams.
Our data set and deep learning code are publicly available at: http://www.brain4cars.com

II. RELATED WORK

Our work builds upon previous works on assistive vehicular technologies, anticipating human activities, learning temporal models, and computer vision methods for analyzing the human face.

Assistive features for vehicles. The latest cars available in the market come equipped with cameras and sensors to monitor the surrounding environment. Through multi-sensory fusion they provide assistive features like lane keeping, forward collision avoidance, adaptive cruise control, etc.
These systems warn drivers when they perform a potentially dangerous maneuver [59, 68]. Driver monitoring for distraction and drowsiness has also been extensively researched [21, 55]. Techniques like eye-gaze tracking are now commercially available (Seeing Machines Ltd.) and have been effective in detecting distraction. Our work complements existing ADAS and driver monitoring techniques by anticipating maneuvers several seconds before they occur.

Closely related to us are previous works on predicting the driver's intent. Vehicle trajectory has been used to predict the intent for lane change or turn maneuvers [9, 22, 41, 44]. Most of these works ignore the rich context available from cameras, GPS, and street maps. Previous works have addressed maneuver anticipation [1, 50, 15, 67] through sensory fusion from multiple cameras, GPS, and vehicle dynamics. In particular, Morris et al. [50] and Trivedi et al. [67] used a Relevance Vector Machine (RVM) for intent prediction and performed sensory fusion by concatenating feature vectors. We will show that such hand-designed concatenation of features does not work well. Furthermore, these works do not model the temporal aspect of the problem properly. They assume that informative contextual cues always appear at a fixed time before the maneuver. We show that this assumption is not true, and in fact the temporal aspect of the problem should be carefully modeled. In contrast to these works, our RNN-LSTM based sensory-fusion architecture captures long temporal dependencies through its memory cell and learns rich representations for anticipation through a hierarchy of non-linear transformations of the input data. Our work is also related to works on driver behavior prediction with different sensors [26, 21, 20], and vehicular controllers which act on these predictions [59, 68, 18].
Anticipation and Modeling Humans. Modeling of human motion has given rise to many applications, anticipation being one of them. Anticipating human activities has been shown to improve human-robot collaboration [73, 36, 46, 38, 16]. Similarly, forecasting human navigation trajectories has enabled robots to plan sociable trajectories around humans [33, 8, 39, 29]. Feature matching techniques have been proposed for anticipating human activities from videos [57]. Modeling human preferences has enabled robots to plan good trajectories [17, 60, 28, 31]. Similar to these works, we anticipate human actions, which are driving maneuvers in our case. However, the algorithms proposed in the previous works do not apply in our setting. In our case, anticipating maneuvers requires modeling the interaction between the driving context and the driver's intention. Such interactions are absent in the previous works, and they use shallow architectures [6] that do not properly model the temporal aspects of human activities. They further deal with a single data modality and do not tackle the challenges of sensory fusion. Our problem setup involves all these challenges, for which we propose a deep learning approach which efficiently handles temporal dependencies and learns to fuse multiple sensory streams.

Analyzing the human face. The vision approaches related to our work are face detection and tracking [69, 76], statistical models of the face [10], and pose estimation methods for the face [75]. The Active Appearance Model (AAM) [10] and its variants [47, 74] statistically model the shape and texture of the face. AAMs have also been used to estimate the 3D pose of a face from a single image [75] and in the design of assistive features for driver monitoring [55, 63]. In our approach we adapt off-the-shelf face detection [69] and tracking algorithms [58] (see Section VI). Our approach allows us to easily experiment with more advanced face detection and tracking algorithms. We demonstrate this by using the Constrained Local Neural Field (CLNF) model [4], tracking 68 fixed landmark points on the driver's face and estimating the 3D head pose.

Learning temporal models. Temporal models are commonly used to model human activities [35, 49, 71, 72]. These models have been used in both discriminative and generative fashions. The discriminative temporal models are mostly inspired by the Conditional Random Field (CRF) [42], which captures the temporal structure of the problem. Wang et al. [72] and Morency et al. [49] propose dynamic extensions of the CRF for image segmentation and gesture recognition respectively. On the other hand, generative approaches for temporal modeling include various filtering methods, such as Kalman and particle filters [64], Hidden Markov Models, and many types of Dynamic Bayesian Networks [51]. Some previous works [9, 40, 53] used HMMs to model different aspects of the driver's behaviour. Most of these generative approaches model how latent (hidden) states influence the observations. However, in our problem both the latent states and the observations influence each other. In the following sections, we will describe the Autoregressive Input-Output HMM (AIO-HMM) for maneuver anticipation [30] and will use it as a baseline to compare our deep learning approach. Unlike AIO-HMM, our deep architecture has an internal memory which allows it to handle long temporal dependencies [24]. Furthermore, the input features undergo a hierarchy of non-linear transformations through the deep architecture, which allows learning rich representations.

Two building blocks of our architecture are Recurrent Neural Networks (RNNs) [54] and Long Short-Term Memory (LSTM) units [25]. Our work draws upon ideas from previous works on RNNs and LSTM from the language [62], speech [23], and vision [14] communities. Our approach to the joint training of multiple RNNs is related to the recent work on hierarchical RNNs [19]. We consider RNNs in a multi-modal setting, which is related to the recent use of RNNs in image captioning [14]. Our contribution lies in formulating activity anticipation in a deep learning framework using RNNs with LSTM units. We focus on sensory-rich robotics applications, and our architecture extends previous works doing sensory fusion with feed-forward networks [52, 61] to the fusion of temporal streams. Using our architecture we demonstrate state-of-the-art performance on maneuver anticipation.

III. OVERVIEW

We first give an overview of the maneuver anticipation problem and then describe our system.

A. Problem Overview

Our goal is to anticipate driving maneuvers a few seconds before they occur. This includes anticipating a lane change before the wheels touch the lane markings or anticipating whether the driver keeps straight or makes a turn when approaching an intersection. This is a challenging problem for multiple reasons.
First, it requires modeling the context from different sources. Information from a single source, such as a camera capturing events outside the car, is not sufficiently rich. Additional visual information from within the car can also be used. For example, the driver's head movements are useful for anticipation – drivers typically check for the side traffic while changing lanes and scan the cross traffic at intersections.

Second, reasoning about maneuvers should take into account the driving context at both local and global levels. Local context requires modeling events in the vehicle's vicinity such as the surrounding vision, GPS, and speed information. On the other hand, factors that influence the overall route contribute to the global context, such as the driver's final destination.

Third, the informative cues necessary for anticipation appear at variable times before the maneuver, as illustrated in Figure 5. In particular, the time interval between the driver's head movement and the occurrence of the maneuver depends on many factors such as the speed, traffic conditions, etc.

In addition, appropriately fusing the information from multiple sensors is crucial for anticipation. Simple sensory fusion approaches like concatenation of feature vectors perform poorly, as we demonstrate through experiments. In our proposed approach we learn a neural network layer for fusing the temporal streams of data coming from different sensors. Our resulting architecture is end-to-end trainable via back propagation, and we jointly train it to: (i) model the temporal aspects of the problem; (ii) fuse multiple sensory streams; and (iii) anticipate maneuvers.

Fig. 4: System Overview. Our system anticipating a left lane change maneuver. (a) Setup: we process multi-modal data including GPS, speed, street maps, and events inside and outside of the vehicle using video cameras (face camera and road camera). (b) Vision algorithms: the vision pipeline extracts visual cues such as the driver's head movements. (c) Inside and outside vehicle features: the inside and outside driving context is processed to extract expressive features. (d, e) Model and anticipation: using our deep learning architecture we fuse the information from outside and inside the vehicle and anticipate the probability of each maneuver.

Fig. 5: Variable time occurrence of events. Left: The events inside the vehicle before the maneuvers (left lane change, right lane change, right turn). We track the driver's face along with many facial points. Right: The trajectories generated by the horizontal motion of facial points (pixels) 't' seconds before the maneuver. The X-axis is time and the Y-axis is the pixels' horizontal coordinates. Informative cues appear during the shaded time interval. Such cues occur at variable times before the maneuver, and the order in which the cues appear is also important.

B. System Overview

For maneuver anticipation our vehicular sensory platform includes the following (as shown in Figure 4):
1) A driver-facing camera inside the vehicle. We mount this camera on the dashboard and use it to track the driver's head movements. This camera operates at 25 fps.
2) A camera facing the road, mounted on the dashboard to capture the (outside) view in front of the car. This camera operates at 30 fps. The video from this camera enables additional reasoning on maneuvers. For example, when the vehicle is in the left-most lane, the only safe maneuvers are a right-lane change or keeping straight, unless the vehicle is approaching an intersection.
3) A speed logger for vehicle dynamics, because maneuvers correlate with the vehicle's speed, e.g., turns usually happen at lower speeds than lane changes.
4) A Global Positioning System (GPS) for localizing the vehicle on the map. This enables us to detect upcoming road artifacts such as intersections, highway exits, etc.

Using this system we collect 1180 miles of natural city and freeway driving data from 10 drivers. We denote the information from the sensors with a feature vector x. Our vehicular system gives a temporal sequence of feature vectors {(x_1, x_2, ..., x_t, ...)}. For now we do not distinguish between the information from different sensors; later, in Section V-B, we introduce sensory fusion. We formally define our feature representations in Section VI and describe our data set in Section VIII-A. We now formally define anticipation and present our deep learning architecture.

IV. PRELIMINARIES

We now formally define anticipation and then present our Recurrent Neural Network architecture. The goal of anticipation is to predict an event several seconds before it happens given the contextual information up to the present time. The future event can be one of multiple possibilities. At training time a set of temporal sequences of observations and events {(x_1, x_2, ..., x_T)_j, y_j}_{j=1}^N is provided, where x_t is the observation at time t, y is the representation of the event (described below) that happens at the end of the sequence at t = T, and j is the sequence index. At test time, however, the algorithm receives an observation x_t at each time step, and its goal is to predict the future event as early as possible, i.e. by observing only a partial sequence of observations {(x_1, ..., x_t) | t < T}. This differentiates anticipation from activity recognition [70, 37], where the complete observation sequence is available at test time. In this paper, x_t is a real-valued feature vector and y = [y^1, ..., y^K] is a vector of size K (the number of events), where y^k denotes the probability of the temporal sequence belonging to the event k such that Σ_{k=1}^K y^k = 1. At the time of training, y takes the form of a one-hot vector with the entry in y corresponding to the ground truth event set to 1 and the rest to 0.
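To make the problem setup concrete, the sketch below lays out one training example and a test-time partial context in the format just described; the sequence length and feature dimensionality are illustrative assumptions and not values fixed by the paper.

```python
import numpy as np

# Assumed sizes for illustration only.
T = 30        # length of a full observation sequence
D = 6         # dimensionality of each observation x_t
K = 5         # number of possible events (maneuvers)

# One training example: the full sequence (x_1, ..., x_T) with its event label y.
x_seq = np.random.randn(T, D)          # observations x_1 ... x_T
y = np.zeros(K)
y[2] = 1.0                             # one-hot vector marking the ground-truth event

# At test time only a partial prefix {(x_1, ..., x_t) | t < T} is available.
t = 10
x_partial = x_seq[:t]
```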
In this work we propose a deep RNN architecture with Long Short-Term Memory (LSTM) units [25] for anticipation. Below we give an overview of the standard RNN and LSTM which form the building blocks of our architecture.

A. Recurrent Neural Networks

A standard RNN [54] takes in a temporal sequence of vectors (x_1, x_2, ..., x_T) as input, and outputs a sequence of vectors (h_1, h_2, ..., h_T) also known as high-level representations. The representations are generated by non-linear transformation of the input sequence from t = 1 to T, as described in the equations below:

h_t = f(W x_t + H h_{t-1} + b)        (1)
y_t = softmax(W_y h_t + b_y)          (2)

where f is a non-linear function applied element-wise, and y_t is the softmax probabilities of the events having seen the observations up to x_t. W, H, b, W_y, b_y are the parameters that are learned. Matrices are denoted with bold, capital letters, and vectors are denoted with bold, lower-case letters. In a standard RNN a common choice for f is tanh or sigmoid. RNNs with this choice of f suffer from a well-studied problem of vanishing gradients [54], and hence are poor at capturing long temporal dependencies which are essential for anticipation. A common remedy to vanishing gradients is to replace tanh non-linearities by Long Short-Term Memory cells [25]. We now give an overview of LSTM and then describe our model for anticipation.

B. Long Short-Term Memory Cells

LSTM is a network of neurons that implements a memory cell [25]. The central idea behind LSTM is that the memory cell can maintain its state over time. When combined with an RNN, LSTM units allow the recurrent network to remember long term context dependencies. LSTM consists of three gates – input gate i, output gate o, and forget gate f – and a memory cell c. See Figure 6 for an illustration.

Fig. 6: Internal working of an LSTM unit.

At each time step t, LSTM first computes its gates' activations {i_t, f_t} (3), (4) and updates its memory cell from c_{t-1} to c_t (5); it then computes the output gate activation o_t (6), and finally outputs a hidden representation h_t (7). The inputs into LSTM are the observations x_t and the hidden representation from the previous time step h_{t-1}. LSTM applies the following set of update operations:

i_t = σ(W_i x_t + U_i h_{t-1} + V_i c_{t-1} + b_i)               (3)
f_t = σ(W_f x_t + U_f h_{t-1} + V_f c_{t-1} + b_f)               (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c x_t + U_c h_{t-1} + b_c)    (5)
o_t = σ(W_o x_t + U_o h_{t-1} + V_o c_t + b_o)                   (6)
h_t = o_t ⊙ tanh(c_t)                                            (7)

where ⊙ is an element-wise product and σ is the logistic function. σ and tanh are applied element-wise. W_*, V_*, U_*, and b_* are the parameters; further, the weight matrices V_* are diagonal. The input and forget gates of LSTM participate in updating the memory cell (5). More specifically, the forget gate controls the part of the memory to forget, and the input gate computes new values, based on the current observation, that are written to the memory cell. The output gate together with the memory cell computes the hidden representation (7). Since the LSTM cell activation involves summation over time (5) and derivatives distribute over sums, the gradient in LSTM gets propagated over a longer time before vanishing. In the standard RNN, we replace the non-linearity f in equation (1) by the LSTM equations given above in order to capture long temporal dependencies. We use the following shorthand notation to denote the recurrent LSTM operation:

(h_t, c_t) = LSTM(x_t, h_{t-1}, c_{t-1})                         (8)
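For reference, a minimal NumPy sketch of one LSTM update following equations (3)–(7) is shown below; the parameter container, shapes, and the representation of the diagonal matrices V_* as vectors are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update following equations (3)-(7).
    p is a dict of parameters W_*, U_*, V_*, b_*; the diagonal matrices V_*
    are stored as vectors and applied element-wise."""
    i = sigmoid(p['Wi'] @ x_t + p['Ui'] @ h_prev + p['Vi'] * c_prev + p['bi'])  # (3) input gate
    f = sigmoid(p['Wf'] @ x_t + p['Uf'] @ h_prev + p['Vf'] * c_prev + p['bf'])  # (4) forget gate
    c = f * c_prev + i * np.tanh(p['Wc'] @ x_t + p['Uc'] @ h_prev + p['bc'])    # (5) memory cell update
    o = sigmoid(p['Wo'] @ x_t + p['Uo'] @ h_prev + p['Vo'] * c + p['bo'])       # (6) output gate
    h = o * np.tanh(c)                                                          # (7) hidden representation
    return h, c                                                                 # shorthand of Eq. (8)
```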
We now describe our RNN architecture with LSTM units for anticipation. Following this, we describe a particular instantiation of our architecture for maneuver anticipation where the observations x come from multiple sources.

V. NETWORK ARCHITECTURE FOR ANTICIPATION

In order to anticipate, an algorithm must learn to predict the future given only a partial temporal context. This makes anticipation challenging and also differentiates it from activity recognition. Previous works treat anticipation as a recognition problem [34, 50, 57] and train discriminative classifiers (such as SVM or CRF) on the complete temporal context. However, at test time these classifiers only observe a partial temporal context and make predictions within a filtering framework. We model anticipation with a recurrent architecture which unfolds through time. This lets us train a single classifier that learns to handle partial temporal contexts of varying lengths.

Furthermore, anticipation in robotics applications is challenging because the contextual information can come from multiple sensors with different data modalities. Examples include autonomous vehicles that reason from multiple sensors [3] or robots that jointly reason over perception and language instructions [48]. In such applications the way information from different sensors is fused is critical to the application's final performance. We therefore build an end-to-end deep learning architecture which jointly learns to anticipate and fuse information from different sensors.

Fig. 7: Sensory fusion RNN for anticipation. (Bottom) In the Fusion-RNN each sensory stream is passed through its own independent RNN. (Middle) High-level representations from the RNNs are then combined through a fusion layer and a softmax layer produces the predictions y_{t-1}, y_t, y_{t+1}. (Top) In order to prevent over-fitting early in time, the loss exponentially increases with time: loss = −Σ e^{−(T−t)} log(y^k_t).

A. RNN with LSTM units for anticipation

At the time of training, we observe the complete temporal observation sequence and the event {(x_1, x_2, ..., x_T), y}. Our goal is to train a network which predicts the future event given a partial temporal observation sequence {(x_1, x_2, ..., x_t) | t < T}. We do so by training an RNN in a sequence-to-sequence prediction manner. Given training examples {(x_1, x_2, ..., x_T)_j, y_j}_{j=1}^N we train an RNN with LSTM units to map the sequence of observations (x_1, x_2, ..., x_T) to the sequence of events (y_1, ..., y_T) such that y_t = y, ∀t, as shown in Fig. 2. Trained in this manner, our RNN will attempt to map all sequences of partial observations (x_1, x_2, ..., x_t) ∀t ≤ T to the future event y. This way our model explicitly learns to anticipate. We additionally use LSTM units, which prevent the gradients from vanishing and allow our model to capture long temporal dependencies in human activities.[1]

[1] Driving maneuvers can take up to 6 seconds and the value of T can go up to 150 with a camera frame rate of 25 fps.

B. Fusion-RNN: Sensory fusion RNN for anticipation

We now present an instantiation of our RNN architecture for fusing two sensory streams: {(x_1, ..., x_T), (z_1, ..., z_T)}. In the next section we will describe these streams for maneuver anticipation.

An obvious way to allow sensory fusion in the RNN is by concatenating the streams, i.e. using ([x_1; z_1], ..., [x_T; z_T]) as input to the RNN. However, we found that this sort of simple concatenation performs poorly. We instead learn a sensory fusion layer which combines the high-level representations of the sensor data. Our proposed architecture first passes the two sensory streams {(x_1, ..., x_T), (z_1, ..., z_T)} independently through separate RNNs (9) and (10). The high-level representations from both RNNs {(h^x_1, ..., h^x_T), (h^z_1, ..., h^z_T)} are then concatenated at each time step t and passed through a fully connected (fusion) layer which fuses the two representations (11), as shown in Figure 7. The output representation from the fusion layer is then passed to the softmax layer for anticipation (12). The following operations are performed from t = 1 to T:

(h^x_t, c^x_t) = LSTM_x(x_t, h^x_{t-1}, c^x_{t-1})      (9)
(h^z_t, c^z_t) = LSTM_z(z_t, h^z_{t-1}, c^z_{t-1})      (10)
Sensory fusion:  e_t = tanh(W_f [h^x_t; h^z_t] + b_f)   (11)
y_t = softmax(W_y e_t + b_y)                            (12)

where W_* and b_* are model parameters, and LSTM_x and LSTM_z process the sensory streams (x_1, ..., x_T) and (z_1, ..., z_T) respectively. The same framework can be extended to handle more sensory streams.
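The sketch below shows the per-time-step computation of equations (9)–(12) in NumPy; the LSTM step functions are passed in as callables (e.g. the lstm_step sketch above), and the parameter names are assumptions made only for illustration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def fusion_rnn_forward(x_seq, z_seq, lstm_x, lstm_z, Wf, bf, Wy, by, hidden_dim):
    """Forward pass of the Fusion-RNN, equations (9)-(12).
    lstm_x and lstm_z are LSTM step functions, each mapping
    (input_t, h_prev, c_prev) -> (h_t, c_t)."""
    hx = np.zeros(hidden_dim); cx = np.zeros(hidden_dim)
    hz = np.zeros(hidden_dim); cz = np.zeros(hidden_dim)
    y_probs = []
    for x_t, z_t in zip(x_seq, z_seq):
        hx, cx = lstm_x(x_t, hx, cx)                        # (9)  outside-context stream
        hz, cz = lstm_z(z_t, hz, cz)                        # (10) inside-context stream
        e_t = np.tanh(Wf @ np.concatenate([hx, hz]) + bf)   # (11) sensory fusion layer
        y_probs.append(softmax(Wy @ e_t + by))              # (12) maneuver probabilities y_t
    return y_probs
```

The design choice mirrors the text: each modality keeps its own recurrent representation, and fusion happens only at the level of the learned high-level representations rather than at the raw-feature level.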
C. Exponential loss-layer for anticipation

We propose a new loss layer which encourages the architecture to anticipate early, while also ensuring that the architecture does not over-fit the training data early in time when there is not enough context for anticipation. When using the standard softmax loss, the architecture suffers a loss of −log(y^k_t) for the mistakes it makes at each time step, where y^k_t is the probability of the ground truth event k computed by the architecture using Eq. (12). We propose to modify this loss by multiplying it with an exponential term, as illustrated in Figure 7. Under this new scheme, the loss grows exponentially with time, as shown below:

loss = Σ_{j=1}^{N} Σ_{t=1}^{T} −e^{−(T−t)} log(y^k_t)        (13)

This loss penalizes the RNN exponentially more for the mistakes it makes as it sees more observations. This encourages the model to fix mistakes as early as it can in time. The loss in equation (13) also penalizes the network less for mistakes made early in time, when there is not enough context available. This way it acts like a regularizer and reduces the risk of over-fitting very early in time.
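A minimal sketch of Eq. (13) for a single training sequence is given below; the function name and the flag for falling back to the uniform (standard softmax) loss are assumptions for illustration.

```python
import numpy as np

def anticipation_loss(y_probs, k, exponential=True):
    """Loss for one training sequence, following Eq. (13).
    y_probs: list of predicted probability vectors y_1 ... y_T,
    k: index of the ground-truth event.
    With exponential=False this reduces to the standard softmax loss summed over time."""
    T = len(y_probs)
    total = 0.0
    for t, y_t in enumerate(y_probs, start=1):
        weight = np.exp(-(T - t)) if exponential else 1.0   # small weight early, close to 1 near t = T
        total += -weight * np.log(y_t[k])
    return total
```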
VI. FEATURES

We extract features by processing the inside and outside driving contexts. We do this by grouping the overall contextual information from the sensors into: (i) the context from inside the vehicle, which comes from the driver-facing camera and is represented as a temporal sequence of features (z_1, ..., z_T); and (ii) the context from outside the vehicle, which comes from the remaining sensors: GPS, the road-facing camera, and street maps. We represent the outside context with (x_1, ..., x_T). In order to anticipate maneuvers, our RNN architecture (Figure 7) processes the temporal context {(x_1, ..., x_t), (z_1, ..., z_t)} at every time step t, and outputs softmax probabilities y_t for the following five maneuvers: M = {left turn, right turn, left lane change, right lane change, straight driving}.

Fig. 8: Inside vehicle feature extraction. The angular histogram features extracted at three different time steps for a left turn maneuver. Bottom: Trajectories for the horizontal motion of tracked facial pixels 't' seconds before the maneuver. At t=5 seconds before the maneuver the driver is looking straight, at t=3 looks (left) in the direction of the maneuver, and at t=2 looks (right) in the opposite direction for the crossing traffic. Middle: Average motion vector of tracked facial pixels in polar coordinates. r is the average movement of pixels and the arrow indicates the direction in which the face moves when looking from the camera. Top: Normalized angular histogram features.

Fig. 9: Improved features for maneuver anticipation. (a) Features using the KLT tracker; (b) features using the CLNF tracker. We track facial landmark points using the CLNF tracker [4], which results in more consistent 2D trajectories compared to the KLT tracker [58] used by Jain et al. [30]. Furthermore, the CLNF also gives an estimate of the driver's 3D head pose.

A. Inside-vehicle features

The inside features z_t capture the driver's head movements at each time instant t. Our vision pipeline consists of face detection, tracking, and feature extraction modules. We extract head motion features per frame, denoted by φ(face). We compute z_t by aggregating φ(face) over every 20 frames, i.e., z_t = Σ_{i=1}^{20} φ(face_i) / ||Σ_{i=1}^{20} φ(face_i)||.

Face detection and tracking. We detect the driver's face using a trained Viola-Jones face detector [69]. From the detected face, we first extract visually discriminative (facial) points using the Shi-Tomasi corner detector [58] and then track those facial points using the Kanade-Lucas-Tomasi (KLT) tracker [45, 58, 66]. However, the tracking may accumulate errors over time because of changes in illumination due to the shadows of trees, traffic, etc. We therefore constrain the tracked facial points to follow a projective transformation and remove the incorrectly tracked points using the RANSAC algorithm. While tracking the facial points, we lose some of the tracked points with every new frame. To address this problem, we re-initialize the tracker with new discriminative facial points once the number of tracked points falls below a threshold [32].

Head motion features. For maneuver anticipation the horizontal movement of the face and its angular rotation (yaw) are particularly important. From the face tracking we obtain face tracks, which are 2D trajectories of the tracked facial points in the image plane. Figure 8 (bottom) shows how the horizontal coordinates of the tracked facial points vary with time before a left turn maneuver. We represent the driver's face movements and rotations with histogram features. In particular, we take matching facial points between successive frames and create histograms of their corresponding horizontal motions (in pixels) and angular motions in the image plane (Figure 8). We bin the horizontal and angular motions using [≤ −2, −2 to 0, 0 to 2, ≥ 2] and [0 to π/2, π/2 to π, π to 3π/2, 3π/2 to 2π], respectively. We also calculate the mean movement of the driver's face center. This gives us φ(face) ∈ R^9 facial features per frame. The driver's eye gaze is also a useful feature. However, robustly estimating 3D eye gaze in outdoor environments is still a topic of research, and orthogonal to this work on anticipation. We therefore do not consider eye-gaze features.

3D head pose and facial landmark features. Our framework is flexible and allows incorporating more advanced face detection and tracking algorithms. For example, we replace the KLT tracker described above with the Constrained Local Neural Field (CLNF) model [4] and track 68 fixed landmark points on the driver's face. CLNF is particularly well suited for driving scenarios due to its ability to handle a wide range of head pose and illumination variations. As shown in Figure 9, CLNF offers us two distinct benefits over the features from KLT: (i) while discriminative facial points may change from situation to situation, tracking fixed landmarks results in consistent optical flow trajectories, which adds to robustness; and (ii) CLNF also allows us to estimate the 3D head pose of the driver's face by minimizing the error in the projection of a generic 3D mesh model of the face w.r.t. the 2D location of landmarks in the image. The histogram features generated from the optical flow trajectories, along with the 3D head pose features (yaw, pitch and roll), give us φ(face) ∈ R^12 when using the CLNF tracker. In Section VIII we present results with the features from KLT, as well as results with the richer features obtained from the CLNF model.
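The sketch below illustrates the binning and the 20-frame aggregation described above; treating the mean displacement magnitude of the tracked points as the face-center motion is an assumption made only to keep the example self-contained.

```python
import numpy as np

def face_frame_features(dx, dy):
    """Per-frame head-motion features, a sketch of phi(face) in R^9.
    dx, dy: arrays of horizontal and vertical displacements (in pixels) of the
    facial points matched between two consecutive frames."""
    horiz_hist = np.array([np.sum(dx <= -2),
                           np.sum((dx > -2) & (dx <= 0)),
                           np.sum((dx > 0) & (dx < 2)),
                           np.sum(dx >= 2)], dtype=float)                 # horizontal-motion bins
    angles = np.arctan2(dy, dx) % (2 * np.pi)
    ang_hist, _ = np.histogram(angles, bins=[0, np.pi / 2, np.pi,
                                             3 * np.pi / 2, 2 * np.pi])   # angular-motion bins
    mean_move = np.mean(np.hypot(dx, dy))                                 # mean face-center movement (assumed proxy)
    return np.concatenate([horiz_hist, ang_hist.astype(float), [mean_move]])

def inside_feature_z(per_frame_phi):
    """z_t: normalized sum of phi(face) over a block of 20 consecutive frames."""
    s = np.sum(per_frame_phi, axis=0)
    return s / (np.linalg.norm(s) + 1e-8)
```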
B. Outside-vehicle features

The outside feature vector x_t encodes information about the outside environment, such as the road conditions, vehicle dynamics, etc. In order to get this information, we use the road-facing camera together with the vehicle's GPS coordinates, its speed, and the street maps. More specifically, we obtain two binary features from the road-facing camera indicating whether a lane exists on the left side and on the right side of the vehicle. We also augment the vehicle's GPS coordinates with the street maps and extract a binary feature indicating if the vehicle is within 15 meters of a road artifact such as intersections, turns, highway exits, etc. We also encode the average, maximum, and minimum speeds of the vehicle over the last 5 seconds as features. This results in an x_t ∈ R^6 dimensional feature vector.

VII. BAYESIAN NETWORKS FOR MANEUVER ANTICIPATION

In this section we propose alternate Bayesian networks [30] based on the Hidden Markov Model (HMM) for maneuver anticipation. These models form a strong baseline against which to compare our sensory-fusion deep learning architecture.

Driving maneuvers are influenced by multiple interactions involving the vehicle, its driver, outside traffic, and occasionally global factors like the driver's destination. These interactions influence the driver's intention, i.e. their state of mind before the maneuver, which is not directly observable. In our Bayesian network formulation, we represent the driver's intention with discrete states that are latent (or hidden). In order to anticipate maneuvers, we jointly model the driving context and the latent states in a tractable manner. We represent the driving context as a set of features described in Section VI. We now present the motivation for the Bayesian networks and then discuss our key model, the Autoregressive Input-Output HMM (AIO-HMM).

Fig. 10: AIO-HMM. The model has three layers: (i) Input (top): this layer represents the outside vehicle features x_1, ..., x_T; (ii) Hidden (middle): this layer represents the driver's latent states h_1, ..., h_T; and (iii) Output (bottom): this layer represents the inside vehicle features z_1, ..., z_T. This layer also captures temporal dependencies of the inside vehicle features. T represents time.

A. Modeling driving maneuvers

Modeling maneuvers requires temporal modeling of the driving context. Discriminative methods, such as the Support Vector Machine and the Relevance Vector Machine [65], which do not model the temporal aspect, perform poorly on anticipation tasks, as we show in Section VIII. Therefore, a temporal model such as the Hidden Markov Model (HMM) is better suited to model maneuver anticipation.

An HMM models how the driver's latent states generate both the inside driving context (z_t) and the outside driving context (x_t). However, a more accurate model should capture how events outside the vehicle (i.e. the outside driving context) affect the driver's state of mind, which then generates the observations inside the vehicle (i.e. the inside driving context). Such interactions can be modeled by an Input-Output HMM (IOHMM) [7]. However, modeling the problem with an IOHMM does not capture the temporal dependencies of the inside driving context. These dependencies are critical for capturing smooth and temporally correlated behaviours such as the driver's face movements. We therefore present the Autoregressive Input-Output HMM (AIO-HMM), which extends IOHMM to model these observation dependencies. Figure 10 shows the graphical model for modeling maneuvers. We learn a separate AIO-HMM model for each maneuver. In order to anticipate maneuvers, during inference we determine which model best explains the past several seconds of the driving context based on the data log-likelihood. In Appendix A we describe the training and inference procedure for AIO-HMM.
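The inference rule just described amounts to scoring the recent driving context under each per-maneuver model and picking the best-scoring one; the sketch below captures that idea only, with the `models` dict and the `log_likelihood` method as hypothetical stand-ins rather than the paper's actual API.

```python
def anticipate_with_bayesian_networks(models, x_recent, z_recent):
    """Sketch of per-maneuver inference: select the maneuver whose learned model
    best explains the recent driving context by data log-likelihood.
    `models` maps each maneuver name to a trained model object (hypothetical)."""
    scores = {m: model.log_likelihood(x_recent, z_recent) for m, model in models.items()}
    return max(scores, key=scores.get)
```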
VIII. EXPERIMENTS

In this section we first give an overview of our data set and then present the quantitative results. We also demonstrate our system and algorithm on real-world driving scenarios. Our video demonstrations are available at: http://www.brain4cars.com.

A. Driving data set

Our data set consists of natural driving videos with both inside and outside views of the car, its speed, and the global position system (GPS) coordinates.[2] The outside car video captures the view of the road ahead. We collected this driving data set under fully natural settings without any intervention.[3] It consists of 1180 miles of freeway and city driving and encloses 21,000 square miles across two states. We collected this data set from 10 drivers over a period of two months. The complete data set has a total of 2 million video frames and includes diverse landscapes. Figure 11 shows a few samples from our data set. We annotated the driving videos with a total of 700 events containing 274 lane changes, 131 turns, and 295 randomly sampled instances of driving straight. Each lane change or turn annotation marks the start time of the maneuver, i.e., before the car touches the lane or yaws, respectively. For all annotated events, we also annotated the lane information, i.e., the number of lanes on the road and the current lane of the car. Our data set is publicly available at http://www.brain4cars.com.

[2] The inside and outside cameras operate at 25 and 30 frames/sec.
[3] Protocol: We set up the cameras, GPS and speed recording device in the subjects' personal vehicles and left it to record the data. The subjects were asked to ignore our setup and drive as they would normally.

Fig. 11: Our data set is diverse in drivers and landscape.

Algorithm 1: Maneuver anticipation
  Input: Features {(x_1, ..., x_T), (z_1, ..., z_T)} and prediction threshold p_th
  Output: Predicted maneuver m*
  Initialize m* = driving straight
  while t = 1 to T do
    Observe features (x_1, ..., x_t) and (z_1, ..., z_t)
    Estimate probability y_t of each maneuver in M
    m*_t = argmax_{m ∈ M} y_t
    if m*_t ≠ driving straight & y_t{m*_t} > p_th then
      m* = m*_t
      break
    end if
  end while
  Return m*
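For clarity, a direct Python rendering of Algorithm 1 is sketched below; the `estimate_probs` callable standing in for the trained anticipation model is an assumption made for illustration.

```python
def maneuver_anticipation(estimate_probs, x_seq, z_seq, p_th, T):
    """Sketch of Algorithm 1.
    estimate_probs(x_prefix, z_prefix) is assumed to return a dict mapping each
    maneuver in M (including 'driving straight') to its probability y_t."""
    m_star = 'driving straight'
    for t in range(1, T + 1):
        y_t = estimate_probs(x_seq[:t], z_seq[:t])        # observe (x_1..x_t) and (z_1..z_t)
        m_t = max(y_t, key=y_t.get)                       # argmax over maneuvers in M
        if m_t != 'driving straight' and y_t[m_t] > p_th:
            m_star = m_t                                  # commit to the anticipated maneuver
            break
    return m_star
```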
B. Baseline algorithms

We compare the following algorithms:
• Chance: Uniformly randomly anticipates a maneuver.
• SVM [50]: Support Vector Machine is a discriminative classifier [11]. Morris et al. [50] take this approach for anticipating maneuvers.[4] We train the SVM on 5 seconds of driving context by concatenating all frame features to get an R^3840 dimensional feature vector.
• Random-Forest [12]: This is also a discriminative classifier that learns many decision trees from the training data, and at test time it averages the predictions of the individual decision trees. We train it on the same features as SVM with 150 trees of depth ten each.
• HMM: This is the Hidden Markov Model. We train the HMM on a temporal sequence of feature vectors that we extract every 0.8 seconds, i.e., every 20 video frames. We consider three versions of the HMM: (i) HMM E: with only outside features from the road camera, the vehicle's speed, GPS and street maps (Section VI-B); (ii) HMM F: with only inside features from the driver's face (Section VI-A); and (iii) HMM E+F: with both inside and outside features.
• IOHMM: Jain et al. [30] modeled driving maneuvers with this Bayesian network. It is trained on the same features as HMM E+F.
• AIO-HMM: Jain et al. [30] proposed this Bayesian network for modeling maneuvers. It is trained on the same features as HMM E+F.
• Simple-RNN (S-RNN): In this architecture sensor streams are fused by simple concatenation and then passed through a single RNN with LSTM units.
• Fusion-RNN-Uniform-Loss (F-RNN-UL): In this architecture sensor streams are passed through separate RNNs, and the high-level representations from the RNNs are then fused via a fully-connected layer. The loss at each time step takes the form −log(y^k_t).
• Fusion-RNN-Exp-Loss (F-RNN-EL): This architecture is similar to F-RNN-UL, except that the loss grows exponentially with time, −e^{−(T−t)} log(y^k_t).

[4] Morris et al. [50] considered a binary classification problem (lane change vs driving straight) and used an RVM [65].

Our RNN and LSTM implementations are open-sourced and available at NeuralModels [27]. For the RNNs in our Fusion-RNN architecture we use a single-layer LSTM of size 64 with sigmoid gate activations and tanh activation for the hidden representation. Our fully connected fusion layer uses tanh activation and outputs a 64 dimensional vector. Our overall architecture (F-RNN-EL and F-RNN-UL) has nearly 25,000 parameters that are learned using RMSprop [13].

C. Evaluation protocol

We evaluate an algorithm based on its correctness in predicting future maneuvers. We anticipate maneuvers every 0.8 seconds, where the algorithm processes the recent context and assigns a probability to each of the four maneuvers: {left lane change, right lane change, left turn, right turn} and a probability to the event of driving straight. These five probabilities together sum to one. After anticipation, i.e. when the algorithm has computed all five probabilities, the algorithm predicts a maneuver if its probability is above a threshold p_th. If none of the maneuvers' probabilities are above this threshold, the algorithm does not make a maneuver prediction and predicts driving straight. However, when it predicts one of the four maneuvers, it sticks with this prediction and makes no further predictions for the next 5 seconds or until a maneuver occurs, whichever happens earlier. After 5 seconds or after a maneuver has occurred, it returns to anticipating future maneuvers. Algorithm 1 shows the inference steps for maneuver anticipation.

During this process of anticipation and prediction, the algorithm makes (i) true predictions (tp): when it predicts the correct maneuver; (ii) false predictions (fp): when it predicts a maneuver but the driver performs a different maneuver; (iii) false positive predictions (fpp): when it predicts a maneuver but the driver does not perform any maneuver (i.e. drives straight); and (iv) missed predictions (mp): when it predicts driving straight but the driver performs a maneuver. We evaluate the algorithms using their precision and recall scores:

Pr = tp / (tp + fp + fpp)        Re = tp / (tp + fp + mp)

where the denominator of Pr is the total number of maneuver predictions and the denominator of Re is the total number of maneuvers. The precision measures the fraction of the predicted maneuvers that are correct, and the recall measures the fraction of the maneuvers that are correctly predicted. For true predictions (tp) we also compute the average time-to-maneuver, where time-to-maneuver is the interval between the time of the algorithm's prediction and the start of the maneuver.

We perform cross-validation to choose the number of the driver's latent states in the AIO-HMM and the threshold on probabilities for maneuver prediction. For SVM we cross-validate for the parameter C and the choice of kernel from Gaussian and polynomial kernels. The parameters are chosen as the ones giving the highest F1-score on a validation set. The F1-score is the harmonic mean of the precision and recall, defined as F1 = 2 * Pr * Re / (Pr + Re).
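As a small worked reference for the metrics just defined, the sketch below computes precision, recall and F1 from the four prediction counts; the function name is illustrative.

```python
def precision_recall_f1(tp, fp, fpp, mp):
    """Precision, recall and F1-score from the prediction counts defined above."""
    pr = tp / (tp + fp + fpp)     # fraction of maneuver predictions that are correct
    re = tp / (tp + fp + mp)      # fraction of maneuvers that are correctly predicted
    f1 = 2 * pr * re / (pr + re)
    return pr, re, f1
```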
TABLE I: Maneuver Anticipation Results. Average precision, recall and time-to-maneuver are computed from 5-fold cross-validation. Standard error is also shown. Algorithms are compared on the features from Jain et al. [30]. Columns give, for each setting, Pr (%) / Re (%) / Time-to-maneuver (s).

Method | Lane change | Turns | All maneuvers
Chance | 33.3 / 33.3 / - | 33.3 / 33.3 / - | 20.0 / 20.0 / -
Morris et al. [50] SVM | 73.7±3.4 / 57.8±2.8 / 2.40 | 64.7±6.5 / 47.2±7.6 / 2.40 | 43.7±2.4 / 37.7±1.8 / 1.20
Random-Forest | 71.2±2.4 / 53.4±3.2 / 3.00 | 68.6±3.5 / 44.4±3.5 / 1.20 | 51.9±1.6 / 27.7±1.1 / 1.20
HMM E | 75.0±2.2 / 60.4±5.7 / 3.46 | 74.4±0.5 / 66.6±3.0 / 4.04 | 63.9±2.6 / 60.2±4.2 / 3.26
HMM F | 76.4±1.4 / 75.2±1.6 / 3.62 | 75.6±2.7 / 60.1±1.7 / 3.58 | 64.2±1.5 / 36.8±1.3 / 2.61
HMM E+F | 80.9±0.9 / 79.6±1.3 / 3.61 | 73.5±2.2 / 75.3±3.1 / 4.53 | 67.8±2.0 / 67.7±2.5 / 3.72
IOHMM | 81.6±1.0 / 79.6±1.9 / 3.98 | 77.6±3.3 / 75.9±2.5 / 4.42 | 74.2±1.7 / 71.2±1.6 / 3.83
AIO-HMM (our final Bayesian network) | 83.8±1.3 / 79.2±2.9 / 3.80 | 80.8±3.4 / 75.2±2.4 / 4.16 | 77.4±2.3 / 71.2±1.3 / 3.53
S-RNN | 85.4±0.7 / 86.0±1.4 / 3.53 | 75.2±1.4 / 75.3±2.1 / 3.68 | 78.0±1.5 / 71.1±1.0 / 3.15
F-RNN-UL | 92.7±2.1 / 84.4±2.8 / 3.46 | 81.2±3.5 / 78.6±2.8 / 3.94 | 82.2±1.0 / 75.9±1.5 / 3.75
F-RNN-EL (our final deep architecture) | 88.2±1.4 / 86.0±0.7 / 3.42 | 83.8±2.1 / 79.9±3.5 / 3.78 | 84.5±1.0 / 77.1±1.3 / 3.58

D. Quantitative results

We evaluate the algorithms on maneuvers that were not seen during training and report the results using 5-fold cross-validation. Table I reports the precision and recall scores under three settings: (i) Lane change: when the algorithms only predict for the left and right lane changes. This setting is relevant for highway driving where the prior probabilities of turns are low; (ii) Turns: when the algorithms only predict for the left and right turns; and (iii) All maneuvers: here the algorithms jointly predict all four maneuvers. All three settings include the instances of driving straight.

Table I compares the performance of the baseline anticipation algorithms, the Bayesian networks, and the variants of our deep learning model. All algorithms in Table I use the same feature vectors and the KLT face tracker, which ensures a fair comparison. As shown in the table, overall the best algorithm for maneuver anticipation is F-RNN-EL, and the best performing Bayesian network is AIO-HMM. F-RNN-EL significantly outperforms AIO-HMM in every setting. This improvement in performance is because RNNs with LSTM units are very expressive models with an internal memory. This allows them to model the much needed long temporal dependencies for anticipation. Additionally, unlike AIO-HMM, F-RNN-EL is a discriminative model that does not make any assumptions about the generative nature of the problem. The results also highlight the importance of modeling the temporal nature of the data. Classifiers like SVM and Random Forest do not model the temporal aspects and hence perform poorly.
The performance of several variants of our deep architecture, reported in Table I, justifies our design decisions for reaching the final fusion architecture. When predicting all maneuvers, F-RNN-EL gives 6% higher precision and recall than S-RNN, which performs a simple fusion by concatenating the two sensor streams. F-RNN, on the other hand, models each sensor stream with a separate RNN and then uses a fully connected layer to fuse the high-level representations at each time step. This form of sensory fusion is more principled since the sensor streams represent different data modalities. In addition, exponentially growing the loss further improves the performance. Our new loss scheme penalizes the network proportionally to the length of context it has seen. When predicting all maneuvers, we observe that F-RNN-EL shows an improvement of 2% in precision and recall over F-RNN-UL. We conjecture that exponentially growing the loss acts like a regularizer: it reduces the risk of our network over-fitting early in time when there is not enough context available. Furthermore, the time-to-maneuver remains comparable for F-RNN with and without the exponential loss.

The Bayesian networks AIO-HMM and HMM E+F adopt different sensory fusion strategies. AIO-HMM fuses the two sensory streams using an input-output model, while HMM E+F performs early fusion by concatenation. As a result, AIO-HMM gives 10% higher precision than HMM E+F for jointly predicting all the maneuvers. AIO-HMM further extends IOHMM by modeling the temporal dependencies of events inside the vehicle. This results in better performance: on average AIO-HMM precision is 3% higher than that of IOHMM, as shown in Table I. Another important aspect of anticipation is the joint modeling of the inside and outside driving contexts. HMM F learns only from the inside driving context, while HMM E learns only from the outside driving context. The performances of both models are therefore lower than that of HMM E+F, which learns jointly from both contexts.

Table II compares the fpp of different algorithms. False positive predictions (fpp) happen when an algorithm predicts
