Driving Behaviors for ADAS and Autonomous Driving XI

Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California

Outline
• Online Vehicle Traj. Pred. using Policy Anticipation Net and Optim.-based Context Reasoning (3.3)
• Interaction-aware Kalman Neural Networks for Trajectory Prediction (4.25)
• Predicting Vehicle Behaviors Over An Extended Horizon Using Behavior Interaction Network (6.3)
• Rules of the Road: Predicting Driving Behavior with a Convolutional Model of Semantic
Interactions (6.21)
• DROGON: A Causal Reasoning Framework for Future Trajectory Forecast (7.31)
• Multiple Futures Prediction (12.6)
• Real-time Multi-target Path Prediction and Planning for Autonomous Driving aided by FCN (9.17)
• PRECOG: PREdiction Conditioned On Goals in Visual Multi-Agent Settings (ICCV.10.27)
• Human Driver Behavior Prediction based on UrbanFlow (11.9)
• VisionNet: A Drivable-space-based Interactive Motion Pred. Net for Autonomous Driving (1.8)

Online Vehicle Trajectory Prediction using Policy Anticipation
Network and Optimization-based Context Reasoning
• This paper presents an online two-level vehicle trajectory prediction framework for urban
autonomous driving where there are complex contextual factors, such as lane geometries, road
constructions, traffic regulations and moving agents.
• This method combines high-level policy anticipation with low-level context reasoning.
• A long short-term memory (LSTM) network anticipates the vehicle’s driving policy (e.g., forward,
yield, turn left, turn right, etc.) using its sequential history observations.
• The policy is then used to guide a low-level optimization-based context reasoning process.
• It is essential to incorporate the prior policy anticipation due to the multimodal nature of the
future trajectory.
• Moreover, contrary to existing regression-based trajectory prediction methods, the optimization-
based reasoning process can cope with complex contextual factors.
• The final output of the two-level reasoning process is a continuous trajectory that automatically
adapts to different traffic configurations and accurately predicts future vehicle motions.

Illustration of the two-level reasoning methodology at an intersection. The two reference
lines corresponding to the two possible policies (turn left or go forward) are shown in cyan,
and the predicted trajectory is shown in green. In this example, the high-level policy is
first anticipated (namely, turn left) and the relevant contextual information (lane geometry,
construction, other agents) is then used in the optimization-based trajectory prediction.

During the high-level reasoning, the sequential state
observations are fed to the policy anticipation network,
which provides the future policy that a vehicle is likely to
execute. The policy anticipation network is based on an RNN
encoder structure. Together with the map information,
the policy can be properly interpreted in the driving
context and a reference prediction is generated and fed to
the optimization-based context reasoning process. The
optimization process renders various environment
observations and encodes them into the multi-layer cost
map structure. A non-linear optimization process is then
conducted to generate the predicted vehicle trajectory.
Illustration of the two-level reasoning framework
The policy interpretation module combines the policy
anticipation results with a local map, so that the
optimization-based context reasoning can start with a
reasonable initial guess.
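As a rough illustration of the policy interpretation step, the sketch below picks the most likely anticipated policy and looks up the matching reference line in a local map. The policy names, the dictionary-based map and the helper `interpret_policy` are assumptions made for this example, not the paper's interface.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def interpret_policy(policy_probs: Dict[str, float],
                     reference_lines: Dict[str, List[Point]]) -> Tuple[str, List[Point]]:
    """Return the most likely policy and its reference line (hypothetical helper)."""
    # Only keep policies for which the local map actually offers a reference line.
    feasible = {name: p for name, p in policy_probs.items() if name in reference_lines}
    if not feasible:
        raise ValueError("no anticipated policy matches the local map")
    best = max(feasible, key=feasible.get)
    return best, reference_lines[best]


# Toy intersection: the map offers a forward line and a left-turn line only.
probs = {"forward": 0.2, "turn_left": 0.7, "turn_right": 0.1}
lines = {"forward": [(0.0, 0.0), (0.0, 10.0)],
         "turn_left": [(0.0, 0.0), (-5.0, 5.0)]}
policy, initial_guess = interpret_policy(probs, lines)
```

The returned reference line plays the role of the initial guess handed to the optimization-based context reasoning.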

Illustration of the policy interpretation. The local reference lines extracted
from the map are marked in cyan. The left illustrates that when the vehicle
(ID 0) has not shown any intention to turn, it is anticipated to be executing a
“forward” policy, so the forward reference line is extracted. After the vehicle
shows a left-turn pattern, the left-turn reference line is extracted. The local
context region is marked in the transparent cyan area.

Illustration of the multi-layer cost map
structure. The top-left image is captured
from CARLA, and a road construction site is
marked in yellow. The top-right figure
shows the static layer with repulsive forces
(cost) pointing to the free space. The
bottom-left image illustrates the costs
induced by the desired velocity, and the
bottom-right image shows the cost induced
by the red light. Different forces may be
conflicting (as around the dashed box).
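A minimal sketch of how layered costs might be combined, assuming three hand-written layers (static obstacles, desired velocity, red light) and a weighted sum. The quadratic cost shapes, the weights and all function names are illustrative assumptions; the paper's actual cost terms are not given here.

```python
import math
from typing import Iterable, Tuple


def static_cost(x: float, y: float, obstacles: Iterable[Tuple[float, float]],
                radius: float = 2.0) -> float:
    """Repulsive cost that grows as (x, y) gets closer than `radius` to an obstacle."""
    cost = 0.0
    for ox, oy in obstacles:
        d = math.hypot(x - ox, y - oy)
        if d < radius:
            cost += (radius - d) ** 2
    return cost


def velocity_cost(v: float, v_desired: float) -> float:
    """Penalize deviation from the desired velocity."""
    return (v - v_desired) ** 2


def red_light_cost(gap_to_stop_line: float, light_is_red: bool, margin: float = 5.0) -> float:
    """Repulsive cost near (or past) the stop line while the light is red."""
    if not light_is_red or gap_to_stop_line >= margin:
        return 0.0
    return (margin - gap_to_stop_line) ** 2


def total_cost(x, y, v, gap_to_stop_line, obstacles, v_desired, light_is_red,
               weights=(1.0, 0.1, 1.0)) -> float:
    """Weighted sum of the three layers (weights are arbitrary placeholders)."""
    w_static, w_vel, w_light = weights
    return (w_static * static_cost(x, y, obstacles)
            + w_vel * velocity_cost(v, v_desired)
            + w_light * red_light_cost(gap_to_stop_line, light_is_red))
```

Near a red light the velocity layer pulls the vehicle forward while the red-light layer pushes it back, which is the kind of conflict the dashed box in the figure points at.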

Illustration of red light offence. When the kinetic energy of the
vehicle can overcome the artificial repulsive force induced by
the red light, the red light offence is captured and a warning is
provided by our prediction system.

Interaction-aware Kalman Neural Networks for
Trajectory Prediction
• Forecasting the motion of surrounding obstacles (vehicles, bicycles, pedestrians, etc.) benefits
the on-road motion planning for intelligent and autonomous vehicles.
• Complex scenes always yield great challenges in modeling the patterns of surrounding traffic.
• For example, one main challenge comes from intractable interaction effects in a complex traffic system.
• This paper proposes a multi-layer architecture Interaction-aware Kalman Neural Networks
(IaKNN) which involves an interaction layer for resolving high-dimensional traffic environmental
observations as interaction-aware accelerations, a motion layer for transforming the accelerations
to interaction-aware trajectories, and a filter layer for estimating future trajectories with a Kalman
filter network.
• Drawing on multiple traffic data sources, the end-to-end trainable approach fuses dynamic and
interaction-aware trajectories, boosting the prediction performance.

Specifically, an effective prediction subsystem needs to
handle the on-road challenges including noisy sensing
information and complex traffic scenes. Existing on-road
prediction subsystems fall into three categories,
namely the physics-based motion model, the maneuver-
based motion model, and the interaction-aware motion
model. IaKNN is a multi-layer architecture consisting of
three layers, namely an interaction layer, a motion layer,
and a filter layer. The interaction layer is a deep neural
network with multiple convolution layers placed before
the LSTM encoder-decoder architecture. The motion
layer is similar to the existing physics-based motion
model which transforms accelerations into trajectories
by using kinematics models. The filter layer consists of
mainly a Kalman filter for optimally estimating the
future trajectories based on the interaction-aware
trajectories outputted by the motion layer.
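The motion and filter layers can be sketched in one dimension as follows. This is a simplified illustration, assuming a constant-acceleration kinematic step and a scalar Kalman update with fixed variances. In IaKNN the noise covariances are produced by a neural network, and treating the interaction-aware position as the "measurement" is an assumption of this sketch.

```python
from typing import Tuple


def motion_step(pos: float, vel: float, acc: float, dt: float) -> Tuple[float, float]:
    """Constant-acceleration kinematics: turn an acceleration into the next state."""
    new_pos = pos + vel * dt + 0.5 * acc * dt * dt
    new_vel = vel + acc * dt
    return new_pos, new_vel


def kalman_fuse(pred_mean: float, pred_var: float,
                meas: float, meas_var: float) -> Tuple[float, float]:
    """Scalar Kalman update: fuse a predicted state with a noisy measurement."""
    gain = pred_var / (pred_var + meas_var)
    mean = pred_mean + gain * (meas - pred_mean)
    var = (1.0 - gain) * pred_var
    return mean, var


# Interaction-aware position from the motion layer (acceleration assumed given).
interaction_pos, _ = motion_step(pos=0.0, vel=10.0, acc=2.0, dt=1.0)
# Fuse it with a purely dynamic prediction (here: constant velocity, 10.0 m).
fused_pos, fused_var = kalman_fuse(pred_mean=10.0, pred_var=1.0,
                                   meas=interaction_pos, meas_var=1.0)
```

With equal variances the fused estimate lands halfway between the two trajectories; learned, time-varying variances would shift that balance per step.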

Illustration of the IaKNN model: at timestamp t, the environmental observation flows into
the interaction layer, which generates the interaction-aware acceleration. The motion layer then computes the
interaction-aware trajectory of the vehicles w.r.t. the Vehicle Dynamic Model (VDM). Finally, time-varying
multi-agent Kalman neural networks run over the predicted time horizon L to fuse the dynamic trajectory and
the interaction-aware trajectory. In particular, the time-varying process and measurement noises in the filter
layer are zero-mean Gaussian noises whose covariances are produced by a gated-structure neural network.

Illustration of RMSE and NLL of the models CV, V-LSTM, S-LSTM,
C-VGMM+VIM, CS-LSTM, IaKNN-NoFL, and IaKNN.
The predicted trajectories and the real ones are
drawn in blue and green color, respectively.

Predicting Vehicle Behaviors Over An Extended
Horizon Using Behavior Interaction Network
• Anticipating possible behaviors of traffic participants is an essential capability of autonomous vehicles.
• Many behavior detection and maneuver recognition methods only have a very limited prediction
horizon that leaves inadequate time and space for planning.
• To avoid unsatisfactory reactive decisions, it is essential to count long-term future rewards in
planning, which requires extending the prediction horizon.
• This paper uncovers that clues to vehicle behaviors over an extended horizon can be found in
vehicle interaction, which makes it possible to anticipate the likelihood of a certain behavior, even
in the absence of any clear maneuver pattern.
• An RNN is adopted for observation encoding and, on top of it, a vehicle behavior interaction network
(VBIN) captures the vehicle interaction from the hidden states and connection feature of each
interaction pair.
• The output is a probabilistic likelihood of multiple behavior classes, which matches the
multimodal and uncertain nature of the distant future.

Illustration of the benefit of extending the prediction horizon. Assume that the green vehicle is the
ego vehicle with the planning module, while the transparency of the vehicles represents the time
elapsed. For a detection-based method (on the top), the LC prediction is given when the blue vehicle
has a clear LC pattern, which may result in a sudden braking of the ego vehicle due to the late
discovery. However, from the interaction point of view, the blue vehicle is moving at a high speed
and is blocked by the slowly moving red car. The blue vehicle has two interaction choices: brake to
avoid collision or merge into the other lane. By learning from a large number of interaction patterns,
the likelihood of LC can be estimated, even before the blue vehicle has a clear LC maneuver.

Illustration of the popular social pooling strategy. The RNN hidden states in the
same spatial cell are pooled and passed to a fully connected layer to generate the
total social effect on the target vehicle (yellow). Vehicles which have totally
different dynamics in the same cell will share the same weight. However, if the
vehicle that is highlighted with a circle is moving slowly, it should have a large
weight since it blocks the LC route of the target vehicle, but if it is moving much
faster than the target vehicle, it should have little impact on the target vehicle’s LC.

VBIN building block: pairwise interaction unit (PIU). RNNs are
implemented using gated recurrent units (GRUs), and the
hidden states are used as the input of the PIU.
The basic element of the VBIN is the pairwise
interaction unit (PIU).
The PIU learns to weight the social effect of
an interaction pair based on their maneuver
histories and relative dynamics.
PIU takes three inputs: two RNN encodings of
a pair of vehicles, and the connection feature.
The connection feature is extracted via
another feature extraction function, which is
based on the history of both vehicles and
represents the relative states of both vehicles.

• Neighborhood interaction unit (NIU).
• The “neighborhood” is defined by a grid centered at the prediction target and each vehicle is
associated with one cell.
• However, vehicles travel in a semi-structured environment where there are semantic
elements, such as lanes, which makes the occupancy grid not the best choice.
• Taking the highway scenario for example, vehicles tend to interact with the nearest vehicles
in the current and neighboring lanes, and these neighboring vehicles will be informative for
interaction modeling.
• The selection process for a highway scenario is as follows:
• 1) Select the two vehicles at the front and rear in the current lane.
• 2) Select the two vehicles with the closest longitudinal coordinates to the target vehicle in
the neighboring lanes.
• 3) Select the vehicles immediately at the front and the vehicle immediately at the rear w.r.t.
the neighboring vehicle.
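The three selection steps above can be sketched as follows. The `Vehicle` record, the convention that lane index minus one is the left lane, and the dictionary keys are assumptions introduced for this example.

```python
from typing import Dict, List, NamedTuple, Optional


class Vehicle(NamedTuple):
    vid: int
    lane: int   # lane index; lane - 1 is assumed to be the left neighbor lane
    s: float    # longitudinal coordinate along the road, in meters


def nearest_ahead(vehicles: List[Vehicle], lane: int, s: float) -> Optional[Vehicle]:
    ahead = [v for v in vehicles if v.lane == lane and v.s > s]
    return min(ahead, key=lambda v: v.s, default=None)


def nearest_behind(vehicles: List[Vehicle], lane: int, s: float) -> Optional[Vehicle]:
    behind = [v for v in vehicles if v.lane == lane and v.s < s]
    return max(behind, key=lambda v: v.s, default=None)


def select_neighbors(target: Vehicle, vehicles: List[Vehicle]) -> Dict[str, Optional[Vehicle]]:
    others = [v for v in vehicles if v.vid != target.vid]
    # Step 1: front and rear vehicles in the current lane.
    selected = {"front": nearest_ahead(others, target.lane, target.s),
                "rear": nearest_behind(others, target.lane, target.s)}
    for side, lane in (("left", target.lane - 1), ("right", target.lane + 1)):
        # Step 2: vehicle with the closest longitudinal coordinate in the neighboring lane.
        in_lane = [v for v in others if v.lane == lane]
        beside = min(in_lane, key=lambda v: abs(v.s - target.s), default=None)
        selected[side] = beside
        # Step 3: vehicles immediately in front of and behind that neighboring vehicle.
        found = beside is not None
        selected[side + "_front"] = nearest_ahead(others, lane, beside.s) if found else None
        selected[side + "_rear"] = nearest_behind(others, lane, beside.s) if found else None
    return selected


target = Vehicle(0, 1, 50.0)
traffic = [target, Vehicle(1, 1, 70.0), Vehicle(2, 1, 90.0), Vehicle(3, 1, 30.0),
           Vehicle(4, 0, 55.0), Vehicle(5, 0, 80.0), Vehicle(6, 0, 20.0), Vehicle(7, 2, 45.0)]
picked = select_neighbors(target, traffic)
```

Missing neighbors are reported as `None`, so the result always has the same eight slots regardless of traffic density.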

Illustration of the PIU connections for each selected neighboring vehicle

VBIN building block: neighborhood interaction
unit (NIU). All the PIUs are identical.
Illustration of the VBIN structure. All the NIUs are identical.
Behavior decoding. The NIU output represents the social effect applied to the vehicle. It is then concatenated
with the vehicle’s original local maneuver encoding. The concatenated vector now contains both the features
extracted from its own maneuver history and the features extracted from the neighborhood interaction.
Applying the above process to all prediction targets yields a social batch, with each row representing the
combined encoding for each individual vehicle.

Illustration of the dataflow of the system
Scalable deployment. Specifically, since the PIUs and NIUs are all identical, precompute and
reuse the following tensors: 1) precompute the connection features for the neighborhood of
each agent, which forms a tensor, and store an index tensor, which records the index of the
corresponding neighboring vehicle; 2) use the RNN to encode local maneuver features and get
the resulting hidden tensor. After the preparations, run a forward pass. After the output layers of
the NIU, the tensor will be concatenated with the original hidden states and decoded.

Rules of the Road: Predicting Driving Behavior with
a Convolutional Model of Semantic Interactions
• Focus on the problem of predicting future states of entities in complex, real-world driving
scenarios.
• Previous research has used low-level signals to predict short time horizons, and has not
addressed how to leverage key assets relied upon heavily by industry self-driving systems:
(1) large 3D perception efforts which provide highly accurate 3D states of agents with
rich attributes, and (2) detailed and accurate semantic maps of the environment (lanes,
traffic lights, crosswalks, etc.).
• A unified representation which encodes such high-level semantic information in a spatial
grid, allowing the use of deep convolutional models to fuse complex scene context.
• This enables learning entity-entity and entity-environment interactions with simple, feed-
forward computations in each timestep within an overall temporal model of an agent’s
behavior.
• Different ways of modelling the future as a distribution over future states using standard
supervised learning.

Entity future state prediction task on a top-down scene:
A target entity of interest is shown in red, with a real
future trajectory shown in pink. The most likely
predicted trajectory is shown in cyan, with alternate
trajectories shown in green. Uncertainty ellipses
showing 1 standard deviation of uncertainty are
displayed for the most likely trajectory only. Other
entities are rendered in magenta (pedestrians), blue
(vehicles) and orange (bicycles). The ego vehicle which
captured the scene is shown in green. Velocities are
shown as orange lines scaled proportional to 1m/s.
Examples of underlying semantic map information
shown are lane lines, crosswalks and stop lines.

Entity and world context representation

Two different network architectures for occupancy grid maps (predicting Gaussian trajectories instead is done
by simply replacing the convolutional-transpose network with a fully-connected layer).

DROGON: A Causal Reasoning Framework for
Future Trajectory Forecast
• DROGON (Deep RObust Goal-Oriented trajectory prediction Network) for
accurate vehicle trajectory forecast by considering behavioral intention of
vehicles in traffic scenes.
• The main insight is that a causal relationship between intention and behavior of
drivers can be reasoned from the observation of their relational interactions
toward an environment.
• To succeed in causal reasoning, build a conditional prediction model to forecast
goal-oriented trajectories, which is trained with the following stages: (i) relational
inference where we encode relational interactions of vehicles using the
perceptual context; (ii) intention estimation to compute the probability
distribution of intentional goals based on the inferred relations; and (iii) causal
reasoning where we reason about the behavior of vehicles as future locations
conditioned on the intention.
• To properly evaluate the performance of the approach, a new large-scale dataset is
collected at road intersections with diverse interactions of vehicles.

First infer relational interactions of vehicles with each other and with an environment. The following module
estimates the probability distribution of intentional goals (zones). Then, conditionally reason about the goal-
oriented behavior as multiple trajectories being sampled from the estimated distribution.
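The last two stages (intention estimation, then goal-conditioned sampling) can be illustrated with a toy sketch. It assumes the relational inference stage has already produced one score per goal zone. The softmax, the zone centers and the straight-line decoder are placeholders for this example, not DROGON's learned modules.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Trajectory = List[Tuple[float, float]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn zone scores into a probability distribution over intentional goals."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_goal_conditioned(zone_scores: Sequence[float],
                            decode: Callable[[int], Trajectory],
                            num_samples: int,
                            rng: random.Random) -> List[Tuple[int, Trajectory]]:
    """Sample goal zones from the estimated distribution, then decode one trajectory each."""
    probs = softmax(zone_scores)
    zones = rng.choices(range(len(probs)), weights=probs, k=num_samples)
    return [(zone, decode(zone)) for zone in zones]


# Hypothetical zone centers (straight, left, right) relative to the vehicle.
ZONE_CENTERS = [(0.0, 20.0), (-20.0, 0.0), (20.0, 0.0)]


def straight_line_decoder(zone: int) -> Trajectory:
    """Placeholder decoder: four evenly spaced points toward the zone center."""
    gx, gy = ZONE_CENTERS[zone]
    return [(gx * k / 4, gy * k / 4) for k in range(1, 5)]


samples = sample_goal_conditioned([50.0, 0.0, 0.0], straight_line_decoder,
                                  num_samples=3, rng=random.Random(0))
```

Because each trajectory is decoded for a specific sampled zone, multiple plausible futures arise from the intention distribution rather than from noise added to a single regression output.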

Graph models to encode spatio-temporal interactions. (a) A node represents the state of each road user,
whereas (b) it is a visual encoding of spatio-temporal interactions captured from each region of the discretized
grid between adjacent frames.

• A large dataset is collected in the San Francisco Bay Area (San Francisco,
Mountain View, San Mateo, and Santa Cruz), focusing on highly interactive
scenarios at four-way intersections.

Multiple Futures Prediction
• Temporal prediction is critical for making intelligent and robust decisions in complex dynamic
environments.
• Motion prediction needs to model the inherently uncertain future which often contains multiple
potential outcomes, due to multi-agent interactions and the latent goals of others.
• Towards these goals, a probabilistic framework that efficiently learns latent variables to jointly
model the multi-step future motions of agents in a scene.
• This framework is data-driven and learns semantically meaningful latent variables to represent
the multimodal future, without requiring explicit labels.
• Using a dynamic attention-based state encoder, learn to encode the past as well as the future
interactions among agents, efficiently scaling to any number of agents.
• Finally, the model can be used for planning via computing a conditional probability density over
the trajectories of other agents given a hypothetical rollout of the ‘self’ agent.

Examples illustrating the need for multi-modal interactive
predictions. (a): There are a few possible modes for the blue
vehicle. (b and c): Time-lapsed visualization of how interactions
between agents influence each other’s trajectories.

• Formulating a probabilistic framework of continuous space but discrete time system with a finite
(but variable) number of N interacting agents.
• RNNs are typically employed to sequentially model the distribution in a cascade form.
• However, there are two major challenges specific to multi-agent prediction framework:
• (1) Multimodality: the mapping from X to Y is not a function, but rather a 1-to-many mapping.
• (2) Variable-Agents: the number of agents N is variable and unknown.
• It introduces a set of stochastic latent variables zn with the intuition: zn would learn to represent
intentions (left/right/straight) and/or behavior modes (aggressive/conservative).
• Learning maximizes the marginalized distribution, where z is free to learn any latent behavior so
long as it helps to improve the data log-likelihood.
• Each z is conditioned on X at the current time and will influence the distrib. over future states Y.
• A key feature of the MFP is that zn is only sampled once at time t, and must be consistent for the
next T time steps.
• This leads to tractability and more realistic intention/goal modeling.
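A small numeric sketch of the marginalized objective for a single agent with a discrete latent mode: because z is sampled once and held for all T steps, the per-step log-likelihoods are summed inside each mode before marginalizing. The function names and the toy numbers are assumptions for illustration.

```python
import math
from typing import List, Sequence


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def joint_log_probs(log_prior: Sequence[float],
                    step_log_liks: Sequence[Sequence[float]]) -> List[float]:
    """log p(z=k | X) + sum_t log p(y_t | X, z=k): one mode held fixed over all steps."""
    return [lp + sum(steps) for lp, steps in zip(log_prior, step_log_liks)]


def marginal_log_likelihood(log_prior, step_log_liks) -> float:
    """log p(Y | X) = log sum_k p(z=k | X) prod_t p(y_t | X, z=k)."""
    return log_sum_exp(joint_log_probs(log_prior, step_log_liks))


def mode_posterior(log_prior, step_log_liks) -> List[float]:
    """Posterior over modes given the observed future, p(z=k | X, Y)."""
    joint = joint_log_probs(log_prior, step_log_liks)
    norm = log_sum_exp(joint)
    return [math.exp(j - norm) for j in joint]


# Two modes, two future steps (made-up likelihood values).
log_prior = [math.log(0.5), math.log(0.5)]
step_log_liks = [[math.log(0.5), math.log(0.5)],
                 [math.log(0.25), math.log(1.0)]]
mll = marginal_log_likelihood(log_prior, step_log_liks)
posterior = mode_posterior(log_prior, step_log_liks)
```

Maximizing `mll` requires no mode labels: any assignment of behaviors to modes that raises the data log-likelihood is acceptable.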

Graphical model and computation graph of the MFP

• A point-of-view (PoV) transformation is first used to transform the past states to each agent’s own
reference frame by translation and rotation such that +x-axis aligns with agent’s heading.
• Then instantiate an encoding and a decoding RNN per agent.
• Each encoding RNN is responsible for encoding the past observations into a feature vector.
• Scene context is transformed via a CNN into its own feature.
• The features are combined via a dynamic attention encoder, to provide inputs both to the latent
variables as well as to the ensuing decoding RNNs.
• During predictive rollouts, the decoding RNN will predict its own agent’s state at every timestep.
• The predictions will be aggregated and subsequently transformed, providing inputs to every
agent/RNN for the next timestep.
• Latent variables Z provide extra inputs to the decoding RNNs to enable multimodality.
• Finally, the output consists of a 5 dim vector governing a Bivariate Normal distribution.
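Two pieces of this pipeline have closed forms that are easy to sketch: the PoV transformation and the log-density of a bivariate normal. The sketch assumes the 5-dim output is already in the standard parameterization (mu_x, mu_y, sigma_x, sigma_y, rho). How the raw network outputs are mapped to valid parameters is not stated in the slides and is left out.

```python
import math
from typing import Tuple


def to_agent_frame(point: Tuple[float, float], agent_pos: Tuple[float, float],
                   heading: float) -> Tuple[float, float]:
    """Translate by the agent position, then rotate so the heading aligns with +x."""
    dx, dy = point[0] - agent_pos[0], point[1] - agent_pos[1]
    c, s = math.cos(heading), math.sin(heading)
    return (c * dx + s * dy, -s * dx + c * dy)


def bivariate_normal_logpdf(x: float, y: float, mu_x: float, mu_y: float,
                            sigma_x: float, sigma_y: float, rho: float) -> float:
    """Log-density of a bivariate normal with correlation rho (|rho| < 1)."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / one_minus_rho2
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_rho2))
    return -0.5 * quad - log_norm


# An agent at (1, 1) heading north (+y): a point 1 m ahead maps to (1, 0) in its own frame.
local = to_agent_frame((1.0, 2.0), (1.0, 1.0), math.pi / 2)
```

The negative of `bivariate_normal_logpdf` evaluated at the observed next position is the usual per-step training loss for this kind of output head.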

Diagram for dynamic attentional state encoding. MFP uses state
encoding at every timestep to convert the state of surrounding
agents into a feature vector for next-step prediction.
Each agent uses a NN to transform its state
(positions, velocity, acceleration, and heading)
into a key or descriptor, which is then matched via
a radial basis function to a fixed number of “slots”
with learned keys in the encoder network. The
ego agent has a separate slot to send its own
state. Slots are aggregated and further
transformed by a two layer encoder network,
encoding a state (e.g. 128 dim vector). The entire
dynamic encoder can be learned in an end-to-end
fashion. The key-matching is similar to dot-
product attention, however, the use of radial basis
functions allows learning spatially sensitive and
meaningful keys to extract relevant agents.
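A minimal sketch of the key-matching idea, assuming Gaussian radial basis weights between agent keys and slot keys and a weighted sum per slot. In MFP the keys and slot keys are learned; here they are fixed toy vectors, and `bandwidth` is a hypothetical parameter.

```python
import math
from typing import List, Sequence


def rbf_weight(key: Sequence[float], slot_key: Sequence[float], bandwidth: float = 1.0) -> float:
    """Gaussian radial basis similarity between an agent key and a slot key."""
    sq_dist = sum((k - s) ** 2 for k, s in zip(key, slot_key))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def aggregate_slots(agent_keys: Sequence[Sequence[float]],
                    agent_values: Sequence[Sequence[float]],
                    slot_keys: Sequence[Sequence[float]],
                    bandwidth: float = 1.0) -> List[List[float]]:
    """Fill a fixed number of slots with RBF-weighted sums of agent features."""
    dim = len(agent_values[0]) if agent_values else 0
    slots = []
    for slot_key in slot_keys:
        acc = [0.0] * dim
        for key, value in zip(agent_keys, agent_values):
            w = rbf_weight(key, slot_key, bandwidth)
            for i in range(dim):
                acc[i] += w * value[i]
        slots.append(acc)
    return slots


# One agent whose key sits exactly on the first slot and far from the second.
slots = aggregate_slots(agent_keys=[(0.0, 0.0)], agent_values=[[1.0, 2.0]],
                        slot_keys=[(0.0, 0.0), (100.0, 100.0)])
```

The output size depends only on the number of slots, which is what lets the encoder handle any number of surrounding agents.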

(a) CARLA data. (b) Sample rollouts overlaid, showing learned multimodality. (c) MFP learned semantically
meaningful latent modes automatically: triangle: right turn, square: straight ahead, circle: stop.

Real-time Multi-target Path Prediction and
Planning for Autonomous Driving aided by FCN
• Real-time multi-target path planning is a key issue in the field of autonomous driving.
• Although multiple paths can be generated in real-time with polynomial curves, the generated
paths are not flexible enough to deal with complex road scenes such as S-shaped road and
unstructured scenes such as parking lots.
• Search and sampling-based methods, such as A* and RRT and their derived methods, are flexible
in generating paths for these complex road environments.
• However, the existing algorithms require significant time to plan to multiple targets, which greatly
limits their application in autonomous driving.
• A real-time path planning method for multi-targets is proposed.
• Train a fully convolutional neural network (FCN) to predict a path region for the target at first.
• By taking the predicted path region as soft constraints, the A* algorithm is then applied to search
the exact path to the target.
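To illustrate the soft-constraint idea, here is a plain 4-connected grid A* in which cells outside the predicted path region cost extra rather than being forbidden. This is not TiEV A* (no lookup table of expansion directions). The `region` grid stands in for the FCN output and `penalty` is a hypothetical tuning parameter.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def astar_soft(obstacles: List[List[int]], region: List[List[float]],
               start: Cell, goal: Cell, penalty: float = 5.0) -> Optional[List[Cell]]:
    """A* where region[r][c] in [0, 1] is the predicted path-region score of a cell."""
    rows, cols = len(obstacles), len(obstacles[0])

    def heuristic(cell: Cell) -> int:
        # Manhattan distance; admissible because every step costs at least 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    open_heap = [(float(heuristic(start)), 0.0, start)]
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        if g > best[cell]:
            continue  # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or obstacles[nr][nc]:
                continue
            # Soft constraint: leaving the predicted region is allowed but more expensive.
            new_g = g + 1.0 + penalty * (1.0 - region[nr][nc])
            if new_g < best.get(nxt, float("inf")):
                best[nxt] = new_g
                parent[nxt] = cell
                heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None


free = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
predicted = [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
path = astar_soft(free, predicted, start=(0, 0), goal=(2, 2))
```

Because the region only reshapes costs, the search can still leave it when the predicted region is blocked, which a hard mask would not allow.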

Start with the existing TiEV A* path planning method to
automatically generate massive training samples. The
perception information fed to TiEV A* includes the
static obstacle map (white), the dynamic obstacle
object (white), the reference path derived from the
global planning (green). The TiEV A* then searches the
path from the ego position (orange) to the target (red
dot) (a). Then generate the training sample out of the
above planning results. The input of the training
sample is composed of three components: the
obstruction region (red), the reference region (green)
and the target region (blue) (b). The point and line
features are all dilated to regions to facilitate feature
encoding in FCN. The label of the training sample is the
dilated path region (c).

The FCN workflow: Given the combined perception input (the first stage); The three main components are
extracted (the second stage), where the obstacle map is in red, the reference path map in green and the target
map in blue; These three components are merged into a three-channel image which is the input of the FCN
encoder (the third stage); Finally, the FCN is trained against the path labels (the fourth stage).

TiEV A*’s lookup table: This lookup table is
composed of 21x21 grids. Assuming the center
grid is the current expanding point, this lookup
table approximates 368 different directions, so
it also has 368 different actions.
Take all the perception input of the TiEV A* and augment the
data for generating a massive set of training samples. The
obstruction map is augmented by randomly adding simulated
vehicle obstacles along the reference path and its parallel path
to make the environment more complicated.

Augmentation of the reference global path: the figure on the
left represents the planning result of TiEV A*; The graph at the
top right is the results of the randomly shifted reference global
path; The bottom right figure is the FCN label extracted from
the original planning result.

The comparison of the path planning results between the
original TiEV A* and the improved TiEV A*, where the
number of targets is 1, 3, 10, 20, 30, 50 from left to right.

PRECOG: PREdiction Conditioned On Goals in
Visual Multi-Agent Settings
• For autonomous vehicles (AVs) to behave appropriately on roads populated by human-driven
vehicles, they must be able to reason about the uncertain intentions and decisions of other
drivers from rich perceptual information.
• Towards these capabilities, a probabilistic forecasting model of future interactions between a
variable number of agents.
• It performs both forecasting and conditional forecasting, which reasons about how all agents will
likely respond to the goal of a controlled agent (here, the AV).
• Train models on real and simulated data to forecast vehicle trajectories given past positions and
LIDAR.
• This model’s predictions of all agents improve when conditioned on knowledge of the AV’s goal,
further illustrating its capability to model agent interactions.

• Planning means the algorithmic process of producing a sequence of future decisions (in this
model, choices of latent values) likely to satisfy a goal.
• Forecasting means the prediction of a sequence of likely future states; forecasts can either be
single-agent or multi-agent.
• Finally, conditional forecasting means forecasting by conditioning on one or more agent goals.
• By planning an agent’s decisions to a goal and sampling from the other agents’ stochastic
decisions, the model performs multi-agent conditional forecasting.
• The model reasons probabilistically about plausible future interactions between agents given rich
observations of their environment.
• It uses latent variables to capture the uncertainty in other agents’ decisions.
• The key idea is the use of factorized latent variables to model decoupled agent decisions even
though agent dynamics are coupled.
• Factorization across agents and time makes it possible to query the effects of changing an arbitrary agent’s
decision at an arbitrary time step.
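A toy one-dimensional sketch of the factorization: each agent has its own latent value per time step, the dynamics couple the agents through a shared mean function, and conditional forecasting fixes the robot's latents while sampling the others. The mean function `step_mean`, the constant `sigma` and the `rollout` signature are invented for this illustration and are not the paper's model.

```python
import random
from typing import List, Optional, Sequence


def step_mean(positions: Sequence[float], agent: int) -> float:
    """Hypothetical coupled dynamics: advance 1.0 and drift toward the other agents."""
    others = [p for i, p in enumerate(positions) if i != agent]
    pull = (sum(others) / len(others) - positions[agent]) if others else 0.0
    return positions[agent] + 1.0 + 0.1 * pull


def rollout(init: Sequence[float], horizon: int, sigma: float, rng: random.Random,
            robot_plan: Optional[Sequence[float]] = None, robot: int = 0) -> List[List[float]]:
    """Forecast all agents; if robot_plan is given, the robot's latents are decided, not sampled."""
    positions = list(init)
    trajectory = [positions]
    for t in range(horizon):
        nxt = []
        for agent in range(len(positions)):
            if robot_plan is not None and agent == robot:
                z = robot_plan[t]            # decision: latent chosen by the planner
            else:
                z = rng.gauss(0.0, 1.0)      # reaction: latent stays uncertain
            nxt.append(step_mean(positions, agent) + sigma * z)
        positions = nxt
        trajectory.append(positions)
    return trajectory


forecast = rollout([0.0, 10.0], horizon=3, sigma=0.5, rng=random.Random(0))
conditional = rollout([0.0, 10.0], horizon=3, sigma=0.5, rng=random.Random(0),
                      robot_plan=[1.0, 0.0, -1.0])
```

Changing one entry of `robot_plan` changes the robot's state from that step on and, through the coupled mean, the other agent's later states as well.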

The factorized latent variable model of forecasting and planning for 2 agents. In a) use latent variable to
represent variation in agent’s plausible scene-conditioned reactions to all agents, causing uncertainty in
every agent’s future states. Beyond forecasting, admit planning robot decisions by deciding latent
variable = decision (b). Shaded nodes represent observed or determined variables, and square nodes
represent robot decisions. Thick arrows represent grouped dependencies of non-Markovian state “carried
forward”. Note: the latent variable factorizes across agents, isolating the robot’s reaction variable. Human
reactions remain uncertain (unobserved) and uncontrollable (the robot cannot decide), and yet the
robot’s decisions will still influence human drivers (and vice-versa). In c) the implementation.

• Factorization makes it possible to use the model for highly flexible conditional forecasts.
• Conditional forecasts predict how other agents would likely respond to different robot decisions
at different moments in time.
• Since robots are not merely passive observers, but one of potentially many agents, the ability to
anticipate how they affect others is critical to their ability to plan useful, safe, and effective
actions, critical to their utility within a planning and control framework.
• Drivers can appear to take highly stochastic actions in part because of not observing their goals.
• In practical scenarios, the robot knows its own goals, can choose its own actions, and can plan a
course of action to achieve a desired goal.
• While many objectives are valid, use imitative models (IM), which estimate the likeliest state
trajectory an expert “would have taken” to satisfy a goal, based on prior expert demonstrations.
• It generalizes IM to multi-agent environments, and plan w.r.t. the uncertainty of human drivers
close by.

CARLA and nuScenes multi-agent forecasting evaluation

Examples of multi-agent forecasting with the learned ESP model. In each scene, 12 joint samples are shown, and LIDAR colors
are discretized to near-ground and above-ground. Left: (CARLA) the model predicts Car 1 could either turn left or right, while
the other agents’ futures maintain multimodality in their speeds. Center-left: The model predicts Car 2 will likely wait (it is
blocked by Cars 3 and 5), and that Cars 3 and 5 sometimes move forward together, and sometimes stay stationary. Center-right:
Car 2 is predicted to overtake Car 1, which itself is forecasted to continue to wait for pedestrians and Car 2. Right: Car 4 is
predicted to wait for the other cars to clear the intersection, and Car 5 is predicted to either start turning or continue straight.

Human Driver Behavior Prediction based on UrbanFlow
• How autonomous vehicles and human drivers share public transportation systems is an important
problem, as fully automatic transportation environments are still a long way off.
• Understanding human drivers’ behavior can be beneficial for autonomous vehicle decision
making and planning, especially when the autonomous vehicle is surrounded by human drivers
who have various driving behaviors and patterns of interaction with other vehicles.
• An LSTM-based trajectory prediction method for human drivers which can help the autonomous
vehicle make better decisions, especially in urban intersection scenarios.
• Meanwhile, in order to collect human drivers’ driving behavior data in the urban scenario, a
system called UrbanFlow which includes the whole procedure from raw bird’s-eye view data
collection via drone to the final processed trajectories.
• The system is mainly intended for urban scenarios but can be extended to be used for any traffic
scenarios.

The UrbanFlow dataset processing pipeline. The pipeline
includes the drone data collection and process flow from
raw video data to the final trajectory data.
Optimized stabilization method flow

• Video Stabilization. Reference frame, homography matrix, re-alignment.
• Object Detection: RetinaNet was fine-tuned using pre-trained weights from the COCO dataset.
• Map Construction and Coordinate Transition: The first step in the creation of the map is to crop
the area of interest, which in this case is the roads. To address this problem, the image segmentation
network “U-Net” is used. After detecting the road and applying a color filter to
detect the lane markings on the road, the work transforms all the detected vehicle positions from
the original image-based coordinates to the road-based coordinates.
• Local x and y based on the road-based coordinates
• Vehicle length and width
• Section ID
• Lane ID
• Vehicle Tracking and Trajectory Smoothing: After the positions of vehicles have been transformed
into the local (road) coordinates, apply the tracking algorithm to track each car. Meanwhile,
smooth each vehicle’s trajectory.
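Two of the pipeline steps reduce to small numeric operations: mapping a pixel through a homography matrix (as in the re-alignment step of stabilization) and smoothing a tracked coordinate sequence. The sketch assumes a 3x3 matrix given as nested lists and uses a centered moving average as a stand-in, since the slides do not say which smoothing method UrbanFlow uses.

```python
from typing import List, Sequence, Tuple


def apply_homography(H: Sequence[Sequence[float]], point: Tuple[float, float]) -> Tuple[float, float]:
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return (u / w, v / w)


def moving_average(values: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average; the window shrinks at both ends of the sequence."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Scale by 2 and shift by (+1, -1): a made-up transform for illustration.
H = [[2.0, 0.0, 1.0], [0.0, 2.0, -1.0], [0.0, 0.0, 1.0]]
aligned = apply_homography(H, (3.0, 4.0))
smooth_x = moving_average([0.0, 3.0, 6.0])
```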

Transition from original image-based
coordinate to road-based coordinate
Intention Prediction Network Structure
Trajectories Prediction Network Structure
Reference trajectories according to the direction
intention. Follow the center of the lanes.

• Intention Prediction: direction intention and yield intention. The direction intentions include
Going Straight (GS), Turning Left (TL) and Turning Right (TR). The yield intention indicates the
prediction of which car will go through the potential crash point first. For the interacting driver
pairs with intentions of GS and TL or TL and TR, the input states include the positions, velocities,
heading angles and relative distance to the intersection center of both cars of each pair. During
the interaction procedure, the yield motion also changes based on the counterpart’s behavior.
This will contribute as a key factor to the next-step motion planning module and help to generate
a safer and feasible trajectory.
• Trajectories Prediction: Based on the results of direction and yield predictions, a more detailed
trajectory prediction procedure includes more information on the future trajectories. Pt includes
information on velocities and positions of the target car. A reference trajectory is first selected
according to the intention prediction results. Given the reference trajectories with
intersection geometry information, together with the velocities, heading angles and relative
distances to the intersection center of both cars, the network predicts the future trajectories.
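
To make the described inputs concrete, here is a minimal sketch of how the state of an interacting pair might be flattened into one input vector, and how a reference trajectory might be looked up from the predicted direction intention. The names `VehicleState`, `pair_features`, `select_reference` and the lane-center polylines in `REFERENCES` are all hypothetical, and representing velocity as (vx, vy) is an assumption; the slides do not give the exact layout.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class VehicleState:
    x: float        # local road-based x
    y: float        # local road-based y
    vx: float       # velocity components (assumed representation)
    vy: float
    heading: float  # heading angle in radians


def pair_features(a: VehicleState, b: VehicleState, center: Point) -> List[float]:
    """Concatenate position, velocity, heading and distance to the
    intersection center for both cars of an interacting pair."""
    feats: List[float] = []
    for v in (a, b):
        dist = math.hypot(v.x - center[0], v.y - center[1])
        feats.extend([v.x, v.y, v.vx, v.vy, v.heading, dist])
    return feats


def select_reference(direction: str, references: Dict[str, List[Point]]) -> List[Point]:
    """Pick the lane-center reference path for a direction intention."""
    if direction not in references:
        raise KeyError(f"unknown direction intention: {direction}")
    return references[direction]


# Hypothetical lane-center polylines for one approach of an intersection.
REFERENCES: Dict[str, List[Point]] = {
    "GS": [(0.0, -10.0), (0.0, 0.0), (0.0, 10.0)],
    "TL": [(0.0, -10.0), (-2.0, 2.0), (-10.0, 4.0)],
    "TR": [(0.0, -10.0), (2.0, -4.0), (10.0, -4.0)],
}

car_a = VehicleState(3.0, 4.0, 0.0, -5.0, -math.pi / 2)
car_b = VehicleState(-6.0, 8.0, 4.0, 0.0, 0.0)
features = pair_features(car_a, car_b, center=(0.0, 0.0))
reference = select_reference("TL", REFERENCES)
```

The resulting 12-element vector would be one time step of the network input; a sequence of such vectors covers the observation history.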

Direction and yield intention as well as MSE of
trajectories prediction for the target vehicle.

VisionNet: A Drivable-space-based Interactive
Motion Pred. Net for Autonomous Driving
• The comprehension of environmental traffic situation largely ensures the driving safety of
autonomous vehicles.
• The task is hard to address well because collective influence in complex scenarios is only
captured to a limited extent.
• Existing methods only model interactions through spatial relations between the target obstacle
and its neighbors.
• The training stage of the interactions lacks effective supervision; as a result, these models are
far from promising.
• More intuitively, the problem can be transformed into calculating the interaction-aware drivable
spaces, with the CNN-based VisionNet designed for trajectory prediction.
• The VisionNet accepts a sequence of motion states, i.e., location, velocity and acceleration, to
estimate the future drivable spaces.
• The reified interactions significantly increase the interpretability of the VisionNet and refine
the prediction.
• To further advance the performance, an interactive loss is introduced to guide the generation of
the drivable spaces.

Illustration of the motion prediction task. Given a predicted vehicle
(shown in red), the autonomous driving system estimates the driving
spaces of all the surrounding obstacles and infers the interaction-aware
global drivable spaces. The resulting drivable spaces are used
to assist in trajectory generation.

• Occupancy Grid Maps (OGMs) divide a real-world area into mesh grids, where each grid cell
indicates the observation probability of the target object.
• VisionNet is composed of 2 deep networks, the interaction network and the prediction network.
• The interaction network is fed by the past trajectories of all obstacles and computes the Global
DrivAble Spaces (GDAS) from Tobs+1 to Tpred.
• Actually, the GDAS represents a comprehensive influence among the traffic agents.
• Then, depending on the historical trajectory and the GDAS, the prediction network produces the
future trajectory for each obstacle.
• Different from conventional coordinate-based methods, it models the global interactive effects
among obstacles and takes advantage of OGMs for future trajectory prediction.
• The VisionNet can consequently take in interactions and outperform previous methods in both
complex conditions and normal scenarios.
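
As a rough illustration of the OGM representation described above, the sketch below rasterizes observed positions into a grid. It is a simplification: cells are marked 1.0 where the object was observed rather than holding a learned or filtered probability, and the names `to_cell`, `rasterize`, `origin` and `resolution` are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def to_cell(x: float, y: float, origin: Point, resolution: float,
            rows: int, cols: int) -> Optional[Tuple[int, int]]:
    """Map a metric position to a (row, col) cell, or None if off-grid."""
    col = int((x - origin[0]) // resolution)
    row = int((y - origin[1]) // resolution)
    if 0 <= row < rows and 0 <= col < cols:
        return row, col
    return None


def rasterize(positions: List[Point], origin: Point, resolution: float,
              rows: int, cols: int) -> List[List[float]]:
    """Build a binary occupancy grid from observed positions.

    Assumption: 1.0 marks an observed cell; a real OGM would store an
    observation probability per cell.
    """
    grid = [[0.0] * cols for _ in range(rows)]
    for x, y in positions:
        cell = to_cell(x, y, origin, resolution, rows, cols)
        if cell is not None:
            grid[cell[0]][cell[1]] = 1.0
    return grid


# Three observations; the last one falls outside the 3 x 4 grid.
ogm = rasterize([(0.5, 0.5), (2.5, 1.5), (-1.0, 0.0)],
                origin=(0.0, 0.0), resolution=1.0, rows=3, cols=4)
```

One such grid per time step gives the image-like sequence that a CNN-based network can consume in place of raw coordinates.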

Main architecture of the VisionNet

• Under the collective influence of all obstacles, the interaction network aims at predicting the
GDAS in the future.
• An intuitive solution is to predict DrivIng Spaces (DIS) for each obstacle first and then generate
GDAS based on individual results.
• Divide DIS into two parts, the Basic DrivIng Spaces (B-DIS) and the Noise DrivIng Spaces (N-DIS).
• B-DIS represents the fundamental moving region controlled by the measured velocity and
acceleration.
• N-DIS captures the disturbance which is dominated by both the measurement noise and the
process noise.
• VisionNet simultaneously takes the DIS of all obstacles in the past and forecasts the future GDAS.
• The output of the prediction network represents drivable spaces under the collective influence
of the traffic participants.
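
The B-DIS/N-DIS split can be illustrated with a hand-written analog. This is not the learned interaction network: it assumes constant-acceleration extrapolation for the basic region, an isotropic Gaussian spread for the noise part, and an elementwise maximum to merge per-obstacle maps. The names `basic_center`, `dis_grid`, `combine` and `sigma` are hypothetical.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float]
Grid = List[List[float]]


def basic_center(pos: Vec, vel: Vec, acc: Vec, t: float) -> Vec:
    """B-DIS analog: where measured velocity and acceleration carry the
    obstacle after t seconds (constant-acceleration assumption)."""
    return (pos[0] + vel[0] * t + 0.5 * acc[0] * t * t,
            pos[1] + vel[1] * t + 0.5 * acc[1] * t * t)


def dis_grid(center: Vec, sigma: float, rows: int, cols: int,
             resolution: float = 1.0) -> Grid:
    """N-DIS analog: spread the center with an isotropic Gaussian
    (peak value 1.0) to stand in for measurement and process noise."""
    grid: Grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            cx = (c + 0.5) * resolution  # cell-center coordinates
            cy = (r + 0.5) * resolution
            d2 = (cx - center[0]) ** 2 + (cy - center[1]) ** 2
            row.append(math.exp(-d2 / (2.0 * sigma * sigma)))
        grid.append(row)
    return grid


def combine(grids: List[Grid]) -> Grid:
    """Merge per-obstacle maps by elementwise maximum (an assumption)."""
    rows, cols = len(grids[0]), len(grids[0][0])
    return [[max(g[r][c] for g in grids) for c in range(cols)]
            for r in range(rows)]


center_a = basic_center((0.0, 0.0), (2.0, 0.0), (1.0, 0.0), 2.0)
dis_a = dis_grid(center_a, sigma=1.0, rows=4, cols=8)
dis_b = dis_grid((1.5, 2.5), sigma=1.0, rows=4, cols=8)
occupied = combine([dis_a, dis_b])
```

In the slides the network takes the past DIS of all obstacles and forecasts the future GDAS directly, so the merge rule is learned rather than fixed as it is here.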

• The prediction network contains two parts, i.e., feature fusion and image synthesis.
• Feature fusion: use an encoder to embed the observed OGMs into a latent feature map;
• As for drivable spaces, the “self-predictive” information needs to be removed;
• Specifically, the GDAS indicates safety and collision regions in the scene;
• Those collision areas are determined by the movement of all the obstacles, including the
predicted obstacle. In other words, the GDAS contains “self-predictive” information;
• If the GDAS were directly adopted to predict the future motion of the obstacle, it could
confuse the prediction network and lead to poor performance;
• Leverage an ablation mask to filter out “self-predictive” regions in GDAS; an additional
encoder is employed to transform filtered drivable spaces into a latent feature map.
• Image synthesis: employ several upsampling deconvolutional layers to synthesize images.
• With the generated fusion feature map z, there are alternative techniques to produce OGMs
as a sequence, such as Conv-LSTM and CNN.
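
The ablation-mask idea can be sketched as follows, assuming the mask simply zeroes the cells that the predicted obstacle's own driving space covers. The slides do not specify how the mask is built, so `ablation_mask`, `apply_mask` and the `threshold` value are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def ablation_mask(own_dis: Grid, threshold: float = 0.5) -> Grid:
    """Return 0.0 where the predicted obstacle's own driving space is
    strong (>= threshold) and 1.0 elsewhere."""
    return [[0.0 if v >= threshold else 1.0 for v in row] for row in own_dis]


def apply_mask(gdas: Grid, mask: Grid) -> Grid:
    """Filter the global map so the "self-predictive" cells are removed."""
    return [[g * m for g, m in zip(g_row, m_row)]
            for g_row, m_row in zip(gdas, mask)]


# Toy 2 x 3 example: the target obstacle dominates the top-left cell.
gdas = [[0.9, 0.2, 0.0],
        [0.1, 0.8, 0.3]]
own_dis = [[0.9, 0.1, 0.0],
           [0.0, 0.0, 0.0]]
filtered = apply_mask(gdas, ablation_mask(own_dis))
```

After filtering, only the regions attributable to other obstacles remain, which is what the additional encoder would embed into the latent feature map.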

Comparison of Prediction Performance on ETH&UCY Datasets
