Driving Behaviors for ADAS and Autonomous Driving XIII
1. Driving Behaviors for ADAS
and Autonomous Driving XIII
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
2. Outline
• Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles (10.10)
• Traject. Predict. for Auto. Driving based on Multi-Head Attention with Joint Agent-Map Representation (6.4)
• MANTRA: Memory Augmented Networks for Multiple Trajectory Prediction (6.5)
• CoverNet: Multimodal behavior prediction using trajectory sets (CVPR.6.14)
• Motion Prediction using Trajectory Sets and Self-Driving Domain Knowledge (6.8)
• Learning Situational Driving (CVPR.6.14)
• AMENet: Attentive Maps Encoder Network for Trajectory Prediction (6.15)
• MCENET: Multi-Context Encoder Network for Homogeneous Agent Traj. Pred. in Mixed Traffic (6.23)
• Multi-Head Attention based Probabilistic Vehicle Trajectory Prediction (7.4)
• Probabilistic Multi-modal Trajectory Prediction with Lane Attention for Autonomous Vehicles (7.6)
• Traffic Agent Trajectory Prediction Using Social Convolution and Attention Mechanism (7.6)
• Planning on the fast lane: Learning to interact using attention mechanisms in path integral IRL (7.11)
• Vehicle Trajectory Prediction by Transfer Learning of Semi-Supervised Models (7.14)
3. Jointly Learnable Behavior and Trajectory Planning
for Self-Driving Vehicles
• The motion planners used in self-driving vehicles need to generate trajectories that are
safe, comfortable, and obey the traffic rules.
• This is usually achieved by two modules: behavior planner, which handles high-level
decisions and produces a coarse trajectory, and trajectory planner that generates a
smooth, feasible trajectory for the duration of the planning horizon.
• These planners, however, are typically developed separately, and changes in the behavior
planner might affect the trajectory planner in unexpected ways.
• Furthermore, the final trajectory outputted by the trajectory planner might differ
significantly from the one generated by the behavior planner, as they do not share the
same objective.
• This work proposes a jointly learnable behavior and trajectory planner.
• Unlike most existing learnable motion planners, which either address only behavior
planning or use an uninterpretable neural network to represent the entire logic from
sensors to driving commands, this approach features an interpretable cost function on
top of perception, prediction and vehicle dynamics, and a joint learning algorithm that
learns a shared cost function employed by the behavior and trajectory components.
4. Jointly Learnable Behavior and Trajectory Planning
for Self-Driving Vehicles
The learnable motion planner has
discrete and continuous components,
both minimizing the same cost function
with the same set of learned cost weights.
5. Jointly Learnable Behavior and Trajectory Planning
for Self-Driving Vehicles
A: Given a scenario, generate a set of possible SDV behaviors. B: Left and right lane boundaries
and the driving path that are relevant to the intended behavior are considered in the cost
function. C: The SDV geometry used for the spatiotemporal overlap cost is approximated using circles.
D: The SDV yields to pedestrians through stop lines on the driving paths.
6. Jointly Learnable Behavior and Trajectory Planning
for Self-Driving Vehicles
• Motion planners of modern self-driving cars are composed of two modules.
• The behavioral planner is responsible for making high level decisions.
• The trajectory planner takes the decision of the behavioral planner and a coarse trajectory and
produces a smooth trajectory for the duration of the planning horizon.
• Unfortunately these planners are typically developed separately, and changes in the behavioral
planner might affect, in unexpected ways, the trajectory planner.
• Furthermore, the trajectory outputted by the trajectory planner might differ in terms of behavior
from the one returned by the behavioral planner as they do not share the same objective.
• In this motion planner, the behavioral and trajectory planners share the same objective.
• At each planning iteration, depending on the SDV location on the map, a subset of the possible
behaviors, denoted by B(W), is allowed by traffic rules and hence considered for evaluation.
• The planner then generates low-level realizations of the high-level behaviors by generating a set of
trajectories T(b) relative to the driving paths of these behaviors.
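The selection over behaviors and trajectories described above can be sketched as a joint minimization. This is a minimal illustration, assuming the total cost is a weighted sum of sub-costs shared by both planning stages; the function names, behavior labels and numbers are hypothetical and not taken from the paper.

```python
def total_cost(weights, sub_costs):
    """Weighted sum of sub-costs (assumed linear form w . c)."""
    return sum(w * c for w, c in zip(weights, sub_costs))


def select_plan(weights, candidates):
    """Return (behavior, trajectory_index, cost) with the lowest total cost.

    `candidates` maps each allowed behavior b in B(W) to a list of
    sub-cost vectors, one per sampled trajectory in T(b).
    """
    best = None
    for behavior, trajectories in candidates.items():
        for index, sub_costs in enumerate(trajectories):
            cost = total_cost(weights, sub_costs)
            if best is None or cost < best[2]:
                best = (behavior, index, cost)
    return best


# Hypothetical sub-costs per trajectory: [collision, path offset, discomfort]
weights = [10.0, 1.0, 0.5]
candidates = {
    "keep_lane": [[0.2, 0.1, 0.3], [0.0, 0.4, 0.2]],
    "change_left": [[0.0, 0.9, 0.8], [0.1, 0.6, 0.4]],
}
best_behavior, best_index, best_cost = select_plan(weights, candidates)
```

Because the same weight vector scores every candidate, the chosen behavior and the chosen trajectory cannot disagree about the objective.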
7. Jointly Learnable Behavior and Trajectory Planning
for Self-Driving Vehicles
• A safe trajectory for the SDV should not only be collision-free, but also keep a safety distance from
the surrounding obstacles, including both the static and dynamic objects such as vehicles,
pedestrians, cyclists, unknown objects, etc.
• Two costs are defined to capture the spatiotemporal (S-T) overlap and the violation of the safety distance, respectively.
• For this, the SDV polygon is approximated by a set of circles of equal radius along the vehicle, and
the distance from the center of each circle to the object polygon is used to evaluate the cost.
• The SDV is expected to adhere to the structure of the road, so sub-costs are introduced that
measure such violations.
• The driving-path and boundaries that are considered for these sub-costs depend on the candidate
behavior.
• The driving-path cost is the squared distance towards the driving path and the lane boundary cost
is the squared violation distance of a safety threshold.
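A rough sketch of these geometric sub-costs follows. It assumes point obstacles instead of object polygons, a squared penalty on the safety-distance violation, and hypothetical function names and parameters, so it illustrates the idea rather than the paper's exact cost terms.

```python
import math


def circle_centers(x, y, heading, length, n_circles):
    """Centers of n equal circles spaced evenly along the vehicle axis."""
    spacing = length / n_circles
    centers = []
    for i in range(n_circles):
        # Offset of circle i from the vehicle center along its heading.
        offset = -length / 2 + spacing * (i + 0.5)
        centers.append((x + offset * math.cos(heading),
                        y + offset * math.sin(heading)))
    return centers


def safety_distance_cost(centers, radius, obstacle, safety_margin):
    """Squared violation of the safety margin for a point obstacle."""
    clearance = min(math.hypot(cx - obstacle[0], cy - obstacle[1])
                    for cx, cy in centers) - radius
    return max(0.0, safety_margin - clearance) ** 2


def driving_path_cost(lateral_offset):
    """Squared distance to the driving path."""
    return lateral_offset ** 2


def lane_boundary_cost(distance_to_boundary, threshold):
    """Squared violation of a safety threshold to the lane boundary."""
    return max(0.0, threshold - distance_to_boundary) ** 2


centers = circle_centers(0.0, 0.0, 0.0, length=4.5, n_circles=3)
cost = safety_distance_cost(centers, radius=1.0, obstacle=(1.5, 2.0),
                            safety_margin=1.5)
```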
8. Jointly Learnable Behavior and Trajectory Planning
for Self-Driving Vehicles
Left: Headway cost penalizes unsafe distance to leading vehicles. Right:
for each sampled trajectory, a weight function determines how relevant
an obstacle is to the SDV in terms of its lateral offset.
9. Jointly Learnable Behavior and Trajectory Planning
for Self-Driving Vehicles
• When the SDV drives behind a leading vehicle, in either lane-following or lane-change behavior, it
should keep a safe longitudinal distance that depends on the speeds of the SDV and the leading vehicle.
• The headway cost is the violation of the safety distance after the SDV applies a comfortable
constant deceleration, assuming that the leading vehicle applies a hard brake. This requires deciding
which vehicles are leading the SDV at each time-step in the planning horizon.
• Use a weight function of the lateral distance between the SDV and other vehicles to determine
how relevant they are for the headway cost.
• The distance violation costs incurred by vehicles that are laterally aligned with SDV dominate the
cost, compatible with lane change maneuvers where deciding the lead vehicles can be difficult.
• Pedestrians are vulnerable road users and hence require extra caution, which is captured by a yield cost.
• The mission route is represented as a sequence of lanes, which determines all lanes that are
on the route or are connected to the route by permitted lane changes.
• Further terms include a cost-to-go function that captures the value of the final state, a speed-limit cost
that penalizes a trajectory exceeding the speed limit of a lane, and costs for comfortable driving.
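The headway cost described above can be illustrated with constant-deceleration stopping distances. The deceleration values, the safety distance and the linear shape of the lateral weight function are assumptions chosen for this sketch; the slides do not give them.

```python
def stopping_distance(speed, deceleration):
    """Distance covered while braking to a stop at constant deceleration."""
    return speed ** 2 / (2.0 * deceleration)


def lateral_weight(lateral_offset, full_width=1.0, zero_width=3.0):
    """1 for laterally aligned vehicles, decaying linearly to 0 (assumed shape)."""
    d = abs(lateral_offset)
    if d <= full_width:
        return 1.0
    if d >= zero_width:
        return 0.0
    return (zero_width - d) / (zero_width - full_width)


def headway_cost(gap, sdv_speed, lead_speed, lateral_offset,
                 comfort_decel=2.0, hard_decel=6.0, safety_distance=2.0):
    """Weighted violation of the safety distance once both vehicles stop.

    The SDV brakes comfortably while the leader is assumed to brake hard.
    """
    final_gap = (gap + stopping_distance(lead_speed, hard_decel)
                 - stopping_distance(sdv_speed, comfort_decel))
    violation = max(0.0, safety_distance - final_gap)
    return lateral_weight(lateral_offset) * violation


# Both vehicles at 12 m/s with a 20 m gap: the leader stops in 12 m,
# the SDV needs 36 m, so the final gap is -4 m and the violation is 6 m.
aligned = headway_cost(gap=20.0, sdv_speed=12.0, lead_speed=12.0,
                       lateral_offset=0.0)
```

The lateral weight lets a vehicle in the adjacent lane contribute partially, which avoids a hard decision about who the leader is during a lane change.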
10. Jointly Learnable Behavior and Trajectory Planning
for Self-Driving Vehicles
Behavioral decisions include obstacle side
assignment and lane information, which are sent
through the behavioral-trajectory interface.
Example trajectories in a nudging scenario
11. Jointly Learnable Behavior and Trajectory Planning
for Self-Driving Vehicles
The max-margin objective uses a surrogate loss to learn
the sub-cost weights, since selecting the optimal trajectory
within a discrete set is not differentiable. In contrast, the
iterative optimization in the trajectory planner is a
differentiable module, where gradients of the imitation loss
function can be computed using the backpropagation through
time (BPTT) algorithm. Since unrolling the full optimization can
be computationally expensive, it is unrolled only for a truncated
number of steps after a solution is obtained: perform M gradient
descent steps after obtaining the optimal trajectory, and
backpropagate through these M steps only. If the control
obtained from the continuous optimization converges to the
optimum, then backpropagating through a truncated number
of steps approximates the inverse Hessian at the optimum.
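To make the max-margin part concrete, here is a generic structured hinge loss over a discrete set of sampled trajectories. It assumes a linear cost w·f and a per-sample task loss; it is a textbook-style sketch with hypothetical names, not the paper's exact objective.

```python
def cost(weights, features):
    """Linear trajectory cost w . f."""
    return sum(w * f for w, f in zip(weights, features))


def max_margin_loss(weights, expert_features, sampled):
    """Structured hinge loss and a subgradient with respect to the weights.

    `sampled` is a list of (features, task_loss) pairs, where task_loss
    measures how far a sampled trajectory is from the expert one.
    """
    expert_cost = cost(weights, expert_features)
    # Most violating sample: low cost despite a large task loss.
    worst_features, worst_task_loss = max(
        sampled, key=lambda s: s[1] - cost(weights, s[0]))
    loss = expert_cost - cost(weights, worst_features) + worst_task_loss
    if loss <= 0.0:
        return 0.0, [0.0] * len(weights)
    grad = [fe - fs for fe, fs in zip(expert_features, worst_features)]
    return loss, grad


loss, grad = max_margin_loss(
    weights=[1.0, 1.0],
    expert_features=[0.5, 0.5],
    sampled=[([0.2, 0.2], 2.0), ([1.0, 1.0], 0.5)],
)
```

The inner `max` is the non-differentiable selection step; the hinge form replaces it with a surrogate whose subgradient is the feature difference between the expert and the most violating sample.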
12. Jointly Learnable Behavior and Trajectory Planning
for Self-Driving Vehicles
• Behavioral with max-margin (“B+M”) learns the weight vector through max-margin (+M) learning
on the behavioral planner only.
• Full Inference (“B+M +J”) uses the trained weights of “B+M”, and runs the joint inference algorithm
(+J) at test time.
• Full Learning & Inference (“B+M +J +I”) learns the weight vector using the combination of the max-margin
(+M) and imitation (+I) objectives, and runs the joint inference algorithm (+J) at test time.
13. Traject. Predict. for Auto. Driving based on Multi-Head
Attention with Joint Agent-Map Representation
• Predicting the trajectories of surrounding agents is an essential ability for robots navigating
complex real-world environments.
• Autonomous vehicles (AVs), in particular, can generate safe and efficient path plans by predicting
the motion of surrounding road users.
• Future trajectories of agents can be inferred using two tightly linked cues: the locations and past
motion of agents, and the static scene structure.
• The configuration of the agents may uncover which part of the scene is more relevant, while the
scene structure can determine the relative influence of agents on each other's motion.
• To better model the interdependence of the two cues, the work proposes a multi-head attention-based
model that uses a joint representation of the static scene and agent configuration for generating both
keys and values for the attention heads.
• To address the multimodality of future agent motion, use each attention head to generate a
distinct future trajectory of the agent.
• The visualization of attention maps adds a layer of interpretability to the trajectories predicted by
the model.
14. Traject. Predict. for Auto. Driving based on Multi-Head
Attention with Joint Agent-Map Representation
MHA-JAM (MHA with Joint Agent Map representation): Each LSTM encoder generates an encoding
vector of the recent motion of one surrounding agent. The CNN backbone transforms the input map
image to a 3D tensor of scene features. A combined representation of the context is built by
concatenating the surrounding agents' motion encodings and the scene features. Each attention head
models a possible way of interaction between the target (green car) and the combined context features.
Each LSTM decoder receives a context vector and the target vehicle encoding, and generates a
distribution over a possible predicted trajectory conditioned on that context.
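A single attention head of this kind can be sketched with scaled dot-product attention. The sketch assumes the query comes from the target encoding and that keys and values are projections of the combined agent-map features; the learned projections are omitted and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(query, keys, values):
    """Scaled dot-product attention for one head.

    Returns the context vector and the attention weights over the
    context cells (one key/value pair per cell).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(dim)]
    return context, weights


# Two context cells; the first key matches the query more closely.
context, weights = attention_head(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 1.0], [3.0, 3.0]],
)
```

Running several such heads with different projections yields several context vectors, and decoding each one separately is what gives one trajectory mode per head.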
15. Traject. Predict. for Auto. Driving based on Multi-Head
Attention with Joint Agent-Map Representation
Off-road loss: an auxiliary loss function that
penalizes locations predicted by the model that fall
outside the drivable area. It is proportional to the
distance of a predicted location from the nearest
point on the drivable area.
Regression loss: To avoid penalizing plausible trajectories
generated by the model that do not correspond to
the ground truth, a variant of the best-of-L
regression loss is used for training the model. Compute the
negative log-likelihood (NLL) of the ground truth
trajectory under each of the L modes output by the
model and take the minimum of the L NLL values
as the regression loss.
Classification loss: In addition to the regression loss,
a cross-entropy loss is considered.
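The best-of-L regression loss can be sketched as follows. For brevity the sketch assumes each mode is a mean trajectory with a fixed isotropic 2D Gaussian at every time step, which is a simplification of whatever distribution the model actually outputs; the function names are hypothetical.

```python
import math


def trajectory_nll(mean_traj, sigma, ground_truth):
    """NLL of the ground truth under independent isotropic 2D Gaussians."""
    nll = 0.0
    for (mx, my), (gx, gy) in zip(mean_traj, ground_truth):
        sq = (gx - mx) ** 2 + (gy - my) ** 2
        nll += sq / (2.0 * sigma ** 2) + math.log(2.0 * math.pi * sigma ** 2)
    return nll


def best_of_l_loss(modes, ground_truth, sigma=1.0):
    """Return (index of the best mode, minimum NLL over the L modes)."""
    nlls = [trajectory_nll(m, sigma, ground_truth) for m in modes]
    best = min(range(len(nlls)), key=nlls.__getitem__)
    return best, nlls[best]


ground_truth = [(1.0, 0.0), (2.0, 0.0)]
modes = [
    [(1.0, 1.0), (2.0, 2.0)],   # plausible alternative, far from the truth
    [(1.0, 0.0), (2.0, 0.5)],   # close to the ground truth
]
best_mode, loss = best_of_l_loss(modes, ground_truth)
```

Only the closest mode receives a regression gradient, so the other modes are free to represent different plausible futures.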
16. Traject. Predict. for Auto. Driving based on Multi-Head
Attention with Joint Agent-Map Representation
Joint vs. separate agent-map representation for the
attention heads. Two models are compared: (a) a baseline where
attention weights are generated separately for the map
and agent features by generating keys and values for each
set of features independently of the other, and (b) the proposed
formulation where each attention head generates keys
and values based on a joint representation of agent and
map features.
17. Traject. Predict. for Auto. Driving based on Multi-Head
Attention with Joint Agent-Map Representation
18. MANTRA: Memory Augmented Networks for
Multiple Trajectory Prediction
• Autonomous vehicles are expected to drive in complex scenarios with several independent,
non-cooperating agents.
• Path planning for safely navigating in such environments cannot rely only on perceiving the present
location and motion of other agents.
• It instead requires predicting such variables far enough into the future. This work addresses multimodal
trajectory prediction by exploiting a Memory Augmented Neural Network.
• This method learns past and future trajectory embeddings using RNNs and exploits an associative
external memory to store and retrieve such embeddings.
• Trajectory prediction is then performed by decoding in-memory future encodings conditioned
on the observed past.
• It incorporates scene knowledge in the decoding state by learning a CNN on top of semantic
scene maps.
• Memory growth is limited by learning a writing controller based on the predictive capability of
existing embeddings.
• Thanks to the non-parametric nature of the memory module, the trained system can continuously
improve by ingesting novel patterns.
19. MANTRA: Memory Augmented Networks for
Multiple Trajectory Prediction
MANTRA addresses multimodal trajectory
prediction. Obtain multiple future predictions
given an observed past relying on a Memory
Augmented Neural Network.
20. MANTRA: Memory Augmented Networks for
Multiple Trajectory Prediction
Architecture of MANTRA. The encoding of an observed past trajectory is used as key to read
likely future encodings from memory. A multimodal prediction is obtained by decoding each
future encoding, conditioned by the observed past. The surrounding context is processed
by a CNN and fed to the Refinement Module to adjust predictions.
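The memory read step can be sketched as a nearest-neighbor lookup. This assumes cosine similarity between past encodings and a plain Python list as the memory; the function names and the toy two-dimensional encodings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def read_memory(past_encoding, memory, k):
    """Return the future encodings of the k most similar stored pasts.

    `memory` is a list of (past_key, future_encoding) pairs.
    """
    ranked = sorted(memory,
                    key=lambda item: cosine_similarity(past_encoding, item[0]),
                    reverse=True)
    return [future for _, future in ranked[:k]]


memory = [
    ([1.0, 0.0], [0.1, 0.9]),   # past key, future encoding
    ([0.0, 1.0], [0.8, 0.2]),
    ([0.9, 0.1], [0.3, 0.7]),
]
futures = read_memory([1.0, 0.0], memory, k=2)
```

Each retrieved future encoding would then be decoded together with the observed past to give one of the K predicted trajectories.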
21. MANTRA: Memory Augmented Networks for
Multiple Trajectory Prediction
Representation learning: The encoders learn to map past and future points into a
meaningful representation and the decoder learns to reproduce the future. Instead of
using just the future as input, condition the reconstruction process also with an
encoding of the past. Past and future trajectories are encoded separately; a decoder
reconstructs future trajectory only.
23. CoverNet: Multimodal Behavior Prediction
using Trajectory Sets
• CoverNet is a new method for multimodal, probabilistic trajectory prediction for
urban driving.
• Previous work has employed a variety of methods, including multimodal
regression, occupancy maps, and 1-step stochastic policies.
• It frames the trajectory prediction problem as classification over a diverse set of
trajectories.
• The size of this set remains manageable due to the limited number of distinct
actions that can be taken over a reasonable prediction horizon.
• It structures the trajectory set to a) ensure a desired level of coverage of the state
space, and b) eliminate physically impossible trajectories.
• By dynamically generating trajectory sets based on the agent’s current state, the
method’s efficiency is further improved.
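Two pieces of the classification framing can be sketched in a few lines: thinning a pool of trajectories into a covering set, and turning a ground-truth trajectory into a class label. The greedy pass below is a simplified stand-in for a proper set-cover procedure, and the distance choices (maximum point-wise distance for coverage, mean point-wise distance for labeling) are assumptions of this sketch.

```python
import math


def max_l2(traj_a, traj_b):
    """Largest point-wise Euclidean distance between two trajectories."""
    return max(math.dist(p, q) for p, q in zip(traj_a, traj_b))


def mean_l2(traj_a, traj_b):
    """Average point-wise Euclidean distance between two trajectories."""
    return sum(math.dist(p, q) for p, q in zip(traj_a, traj_b)) / len(traj_a)


def greedy_cover(trajectories, epsilon):
    """Keep a trajectory only if no kept one is within epsilon of it."""
    kept = []
    for traj in trajectories:
        if all(max_l2(traj, k) > epsilon for k in kept):
            kept.append(traj)
    return kept


def closest_mode(trajectory_set, ground_truth):
    """Index of the set element closest to the ground truth (class label)."""
    return min(range(len(trajectory_set)),
               key=lambda i: mean_l2(trajectory_set[i], ground_truth))


pool = [
    [(1.0, 0.0), (2.0, 0.0)],
    [(1.0, 0.1), (2.0, 0.2)],   # nearly identical to the first one
    [(1.0, 1.0), (2.0, 2.0)],
]
trajectory_set = greedy_cover(pool, epsilon=0.5)
label = closest_mode(trajectory_set, [(1.0, 0.8), (2.0, 1.7)])
```

A network would then be trained with a standard cross-entropy loss to predict `label` among the elements of `trajectory_set`.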
26. Motion Prediction using Trajectory Sets and Self-
Driving Domain Knowledge
• Predicting the future motion of vehicles has been studied using various
techniques, including stochastic policies, generative models, and regression.
• Recent work has shown that classification over a trajectory set, which
approximates possible motions, achieves state-of-the-art performance and avoids
issues like mode collapse.
• However, map information and the physical relationships between nearby
trajectories are not fully exploited in this formulation.
• This work builds on classification-based approaches to motion prediction by adding an
auxiliary loss that penalizes off-road predictions.
• This auxiliary loss can easily be pretrained using only map information (e.g., off-road
area), which significantly improves performance on small datasets.
• Weighted cross-entropy losses are used to capture spatial-temporal relationships among
trajectories.
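One plausible way to realize these two ideas is sketched below: a soft target that spreads probability mass over trajectories near the ground truth, and an off-road penalty equal to the probability mass placed on off-road trajectories. The exponential weighting, the distance threshold and the form of the penalty are assumptions made for illustration and may differ from the losses used in the paper.

```python
import math


def soft_targets(distances, threshold):
    """Spread target mass over trajectories within `threshold` of the
    ground truth, closer ones getting more; fall back to one-hot."""
    near = [i for i, d in enumerate(distances) if d <= threshold]
    if not near:
        near = [min(range(len(distances)), key=distances.__getitem__)]
    raw = {i: math.exp(-distances[i]) for i in near}
    total = sum(raw.values())
    return [raw.get(i, 0.0) / total for i in range(len(distances))]


def cross_entropy(targets, probabilities, eps=1e-12):
    """Cross-entropy between a target distribution and predictions."""
    return -sum(t * math.log(p + eps) for t, p in zip(targets, probabilities))


def off_road_penalty(probabilities, off_road_mask):
    """Probability mass the model assigns to off-road trajectories."""
    return sum(p for p, off in zip(probabilities, off_road_mask) if off)


# Distances of three set trajectories to the ground truth (metres).
targets = soft_targets([0.2, 0.4, 3.0], threshold=1.0)
predicted = [0.5, 0.3, 0.2]
loss = (cross_entropy(targets, predicted)
        + off_road_penalty(predicted, [False, False, True]))
```

Since the off-road mask depends only on the map and the trajectory set, a penalty of this kind needs no recorded agent trajectories.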
27. Motion Prediction using Trajectory Sets and Self-
Driving Domain Knowledge
Visualization of on-road (black) and
off-road (red) trajectories
Visualization of the target distribution in the
standard cross-entropy formulation (left), and
the weighted cross-entropy loss (right)
28. Motion Prediction using Trajectory Sets and Self-
Driving Domain Knowledge
Results listed as Argoverse | nuScenes
29. Learning Situational Driving
• Human drivers have a remarkable ability to drive in diverse visual conditions and
situations, e.g., from maneuvering in rainy, limited visibility conditions with no lane
markings to turning in a busy intersection while yielding to pedestrians.
• In contrast, state-of-the-art sensorimotor driving models struggle when encountering
diverse settings with varying relationships between observation and action.
• To generalize when making decisions across diverse conditions, humans leverage multiple
types of situation-specific reasoning and learning strategies.
• Motivated by this observation, the work presents a framework for learning a situational driving
policy that effectively captures reasoning under varying types of scenarios.
• The key idea is to learn a mixture model with a set of policies to capture multiple driving
modes.
• First optimize the mixture model through behavior cloning.
• Then refine the model by directly optimizing for the driving task itself, i.e., supervised with
the navigation task reward.
• It is more scalable than methods assuming access to privileged information, e.g.,
perception labels, as it only assumes demonstration and reward-based supervision.
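The mixture idea can be illustrated with a gating function that blends the outputs of several policies. This deterministic blend is a simplification (a probabilistic mixture would weight action distributions instead), and the gate logits, the action layout of steering and throttle, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_action(gate_logits, expert_actions):
    """Blend expert actions with context-dependent mixture weights.

    `gate_logits` would come from a network that looks at the current
    scene; each expert action is a list such as [steer, throttle].
    """
    weights = softmax(gate_logits)
    n = len(expert_actions[0])
    return [sum(w * a[i] for w, a in zip(weights, expert_actions))
            for i in range(n)]


# Two policies with equal gate scores: the result is their average.
action = mixture_action([0.0, 0.0], [[-0.2, 0.5], [0.4, 0.1]])
```

In the two-stage training described above, the policies and the gate would first be fit by behavior cloning and then refined with the task reward.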
30. Learning Situational Driving
Situational Driving. To address the complexity in
learning perception-to-action driving models, we
introduce a situational framework using a behavior
module. The module reasons over current on-road
scene context when composing a set of learned
behavior policies under varying driving scenarios.
Our approach is used to improve over behavior
reflex and privileged approaches in terms of
robustness and scalability.
31. Learning Situational Driving
Approach Overview. The agent learns to
combine a set of expert policies in a context-dependent,
task-optimized manner to robustly
drive in diverse scenarios.
33. AMENet: Attentive Maps Encoder Network for
Trajectory Prediction
• Trajectory prediction is a crucial task in different communities, such as intelligent
transportation systems, computer vision, and mobile robot applications.
• However, predicting the trajectories of heterogeneous road agents (e.g., pedestrians,
cyclists and vehicles) at a microscopic level poses many challenges.
• For example, an agent might be able to choose multiple plausible paths in complex
interactions with other agents in varying environments, and the behavior of each agent is
affected by the various behaviors of its neighboring agents.
• To this end, an end-to-end generative model named Attentive Maps Encoder Network
(AMENet) is proposed for accurate and realistic multi-path trajectory prediction.
• It leverages the target road user’s motion information (i.e., movement along the x- and y-axes in a
Cartesian space) and the interaction information with the neighboring road users at each
time step, which is encoded as dynamic maps that are centered on the target road user.
• A conditional variational auto-encoder module is trained to learn the latent space of
possible future paths based on the dynamic maps and then used to predict multiple
plausible future trajectories conditioned on the observed past trajectories.
34. AMENet: Attentive Maps Encoder Network for
Trajectory Prediction
An overview of the proposed framework. It consists of four modules: the X-Encoder and Y-Encoder are
used for encoding the observed and the future trajectories, respectively. They have a similar structure. The
Sample Generator produces diverse samples of future generations. The Decoder module decodes
the features from the samples produced in the previous step and predicts the future trajectory sequentially.
35. AMENet: Attentive Maps Encoder Network for
Trajectory Prediction
Structure of the X-Encoder. The encoder has
two branches: the upper one is used to
extract motion information of target agents,
and the lower one is used to learn the
interaction information among the
neighboring road users from dynamic maps
over time. Each dynamic map consists of 3
layers that represent orientation, travel
speed and relative position, respectively, each
centered on the target road user.
The motion information and
the interaction information are encoded by
their own LSTM sequentially. The last
outputs of the two LSTMs are concatenated
and forwarded to a fc layer to get the final
output of the X-Encoder.
The Y-Encoder has the same structure as the X-Encoder but
it is used for extracting features from the future trajectories
and only used in the training phase.
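A dynamic map of this kind can be sketched as three small grids centered on the target road user. The grid size, the cell size, the dictionary layout of an agent and the use of a simple occupancy flag for the position layer are assumptions of this sketch; the exact encoding in the paper may differ.

```python
def dynamic_map(target, neighbors, size=8, cell=1.0):
    """Build 3 grids (orientation, speed, position) centered on the target.

    Agents are dicts with keys x, y, heading and speed (hypothetical
    layout). Neighbors outside the grid are ignored.
    """
    layers = [[[0.0] * size for _ in range(size)] for _ in range(3)]
    half = size // 2
    for agent in neighbors:
        # Cell indices of the neighbor relative to the target position.
        col = int((agent["x"] - target["x"]) // cell) + half
        row = int((agent["y"] - target["y"]) // cell) + half
        if 0 <= row < size and 0 <= col < size:
            layers[0][row][col] = agent["heading"]
            layers[1][row][col] = agent["speed"]
            layers[2][row][col] = 1.0  # occupancy marks relative position
    return layers


target = {"x": 0.0, "y": 0.0, "heading": 0.0, "speed": 1.0}
neighbors = [
    {"x": 1.5, "y": -0.5, "heading": 0.3, "speed": 2.0},
    {"x": 10.0, "y": 0.0, "heading": 0.0, "speed": 5.0},  # outside the grid
]
maps = dynamic_map(target, neighbors)
```

One such map per observed time step gives the sequence that the lower branch of the X-Encoder processes.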
37. MCENET: Multi-Context Encoder Network for
Homogeneous Agent Traj. Pred. in Mixed Traffic
• Trajectory prediction in urban mixed-traffic zones (a.k.a. shared spaces) is critical for
many intelligent transportation systems, such as intent detection for autonomous driving.
• However, predicting the trajectories of heterogeneous road agents (pedestrians,
cyclists and vehicles) at a microscopic level poses many challenges.
• For example, an agent might be able to choose multiple plausible paths in complex
interactions with other agents in varying environments.
• Multi-Context Encoder Network (MCENET) is trained by encoding both past and future
scene context, interaction context and motion information to capture the patterns and
variations of the future trajectories using a set of stochastic latent variables.
• At inference time, the past context and motion information of the target agent are combined with
samples of the latent variables to predict multiple realistic future trajectories.
• In experiments on several datasets of varying scenes, it outperforms some of the
recent state-of-the-art methods for mixed traffic trajectory prediction by a large margin
and is more robust in a very challenging environment.
38. MCENET: Multi-Context Encoder Network for
Homogeneous Agent Traj. Pred. in Mixed Traffic
Predicting the future trajectory (d) by observing the past trajectories (c) considering the scene (a)
and grouping context (b). Three kinds of scene context: (1) aerial photograph provides overview
of the environment, (2) segmented map defines the accessible areas respective to road agents’
transport mode and (3) the motion heat map describes the prior of how different agents move.
Different colors denote different agents or agent groups.
39. MCENET: Multi-Context Encoder Network for
Homogeneous Agent Traj. Pred. in Mixed Traffic
The pipeline for the method. The ground truth Y and the associated interaction and scene context are
injected into the input only in training. They are not available at inference. The latent variables are sampled N
times and concatenated with the output of X-Encoder for predicting multiple future paths.
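The inference step in this caption can be sketched as repeated sampling from a standard normal prior. The decoder here is a hypothetical callable standing in for the trained network, and the seed argument exists only to make the sketch reproducible.

```python
import random


def sample_predictions(x_encoding, decoder, n_samples, latent_dim, seed=0):
    """Draw N latent vectors and decode each with the past encoding.

    `x_encoding` is the X-Encoder output as a list of floats; `decoder`
    is any callable mapping the concatenated features to a trajectory.
    """
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_samples):
        # At inference the latent variable is drawn from the prior N(0, I).
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        predictions.append(decoder(x_encoding + z))
    return predictions


def toy_decoder(features):
    """Stand-in decoder: a one-step 'trajectory' built from the features."""
    return [(sum(features), len(features))]


paths = sample_predictions([0.1, 0.2, 0.3], toy_decoder,
                           n_samples=4, latent_dim=2)
```

Different latent samples produce different decoded paths from the same observed past, which is how the multiple predicted futures arise.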
41. Multi-Head Attention based Probabilistic Vehicle
Trajectory Prediction
• Online-capable deep learning model for probabilistic vehicle trajectory prediction.
• A simple encoder-decoder architecture based on multi-head attention.
• It generates the distribution of the predicted trajectories for multiple vehicles in parallel.
• It models the interactions by learning to attend to a few influential vehicles in an
unsupervised manner, which can improve the interpretability of the network.
• Interpretability: The use of multi-head attention improves the interpretability of the
neural network because the model can learn the social relations of neighboring vehicles
in an unsupervised manner.
• Scalability: As the output dimension of multi-head attention adapts to the number of
vehicles, the network can be extended to very dense traffic scenarios. The network is
tested on an autonomous vehicle platform with fewer than 30 surrounding vehicles. The
average computation time is 50 ms.
• Accuracy: The method is verified using naturalistic highway trajectory data and
achieves better performance than the existing methods in terms of positional error.
42. Multi-Head Attention based Probabilistic Vehicle
Trajectory Prediction
The road on the left denotes the input of the prediction model, which consists of the past
trajectories of surrounding vehicles, X, and the lane information, I. The road on the right denotes
the output of the prediction model, which is the distribution of the future trajectories, P(Y|X,I).
43. Multi-Head Attention based Probabilistic Vehicle
Trajectory Prediction
Structure of the attention layer for
both the lane and the vehicles.
44. Probabilistic Multi-modal Trajectory Prediction
with Lane Attention for Autonomous Vehicles
• Trajectory prediction is crucial for autonomous vehicles.
• The planning system not only needs to know the current state of the surrounding objects but also
their possible states in the future.
• As for vehicles, their trajectories are significantly influenced by the lane geometry and how to
effectively use the lane information is of active interest.
• Most of the existing works use rasterized maps to explore road information, which does not
distinguish different lanes.
• This work proposes an instance-aware lane representation.
• By integrating the lane features and trajectory features, a goal-oriented lane attention module is
proposed to predict the future locations of the vehicle.
• The lane representation together with the lane attention module can be integrated into the
widely used encoder-decoder framework to generate diverse predictions.
• Most importantly, each generated trajectory is associated with a probability to handle the
uncertainty.
• It does not suffer from collapsing to one behavior mode and can cover diverse possibilities.
45. Probabilistic Multi-modal Trajectory Prediction
with Lane Attention for Autonomous Vehicles
An overview of this method. The model consists of a trajectory encoder, a lane encoder,
an interaction network, a lane attention module and a final trajectory decoder.
46. Probabilistic Multi-modal Trajectory Prediction
with Lane Attention for Autonomous Vehicles
(a) An example of selected lanes. The blue dot represents the last location of the target vehicle. “s”
and “e” denote the start and end of a road segment, respectively. (b) The architecture of the Lane Encoder.
“conv 1, 64” means 1D convolution with kernel size of 1 and 64 output channels. The final output is a
128-d vector for each lane. (c) The structure of proposed lane attention module.
48. Traffic Agent Trajectory Prediction Using
Social Convolution and Attention Mechanism
• Trajectory prediction is important for the decision-making of autonomous vehicles.
• This paper proposes a model to predict the trajectories of target agents around an autonomous
vehicle.
• The main idea is to consider the historical trajectories of the target agent and the influence of
surrounding agents on the target agent.
• It encodes the target agent's historical trajectories as an attention mask and constructs a social map
to encode the interactive relationship between the target agent and its surrounding agents.
• Given a trajectory sequence, the LSTM networks are firstly utilized to extract the features for all
agents, based on which the attention mask and social map are formed.
• Then, the attention mask and social map are fused to get the fusion feature map, which is
processed by the social convolution to obtain a fusion feature representation.
• Finally, this fusion feature is taken as the input of a variable-length LSTM to predict the trajectory
of the target agent.
• The variable-length LSTM enables the model to handle the case that the number of agents in the
sensing scope is highly dynamic in traffic scenes.
49. Traffic Agent Trajectory Prediction Using
Social Convolution and Attention Mechanism
The target agent is marked by the grey
square. The blue grid region around it is its
grid cell. Input representations are generated for
all agents based on trajectory information.
These representations are passed through
LSTMs and eventually used to construct the
social map, while the target agent’s representation
is encoded as the attention mask. The
product of the attention mask and social map
is passed through ConvNets and then
concatenated with the target agent
tensor to produce a latent representation.
Finally, this latent representation is passed
through an LSTM to generate a trajectory
prediction for the target agent.
50. Traffic Agent Trajectory Prediction Using
Social Convolution and Attention Mechanism
The Results For Trajectory Prediction On BLVD Dataset
The Results Of Different Combination Models
51. Planning on the fast lane: Learning to interact using
attention mechanisms in path integral inverse RL
• General-purpose trajectory planning algorithms for automated driving utilize complex reward
functions to perform a combined optimization of strategic, behavioral, and kinematic features.
• The specification and tuning of a single reward function is a tedious task and does not generalize
over a large set of traffic situations.
• Deep learning approaches based on path integral inverse reinforcement learning have been
successfully applied to predict local situation-dependent reward functions using features of a set
of sampled driving policies.
• Sample-based trajectory planning algorithms are able to approximate a spatio-temporal subspace
of feasible driving policies that can be used to encode the context of a situation.
• However, the interaction with dynamic objects requires an extended planning horizon, which
requires sequential context modeling.
• This work addresses sequential reward prediction over an extended time horizon.
• It proposes a neural network architecture that uses a policy attention mechanism to generate a
low-dimensional context vector by concentrating on trajectories with a human-like driving style.
• In addition, a temporal attention mechanism identifies context switches and allows for stable
adaptation of rewards.
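The path integral maximum-entropy step underlying this line of work can be sketched for a linear reward over a set of sampled policies. The sketch uses fixed reward weights and hypothetical feature vectors; in the work summarized above a neural network predicts situation-dependent rewards, which is not shown here.

```python
import math


def policy_distribution(weights, policy_features):
    """Max-entropy distribution over sampled policies, p ~ exp(reward)."""
    rewards = [sum(w * f for w, f in zip(weights, feats))
               for feats in policy_features]
    m = max(rewards)
    exps = [math.exp(r - m) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def irl_gradient(weights, policy_features, expert_features):
    """Gradient of the demonstration log-likelihood with respect to the
    weights: expert features minus expected features under the model."""
    probs = policy_distribution(weights, policy_features)
    expected = [sum(p * feats[i] for p, feats in zip(probs, policy_features))
                for i in range(len(weights))]
    return [e - x for e, x in zip(expert_features, expected)]


# Two sampled policies with path-integral features; the expert matches the first.
features = [[1.0, 0.0], [0.0, 1.0]]
gradient = irl_gradient([0.0, 0.0], features, expert_features=[1.0, 0.0])
```

A gradient ascent step along `gradient` raises the reward of features the expert exhibits and lowers the reward of features that the sampled policies over-represent.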
52. Planning on the fast lane: Learning to interact using
attention mechanisms in path integral inverse RL
Illustration of planner for automated driving, which samples
policies for our deep inverse RL approach. The z-axis corresponds
to the velocity, whereas the ground plane depicts spatial feature
maps such as distances from the lane centers. A subset of policies
is visualized, where the green triangle shows the optimal policy
and the blue triangles highlight the highest policy attention. The
color gradient corresponds to the policy value. Blue policies have
a high attention activation. The cylindrical objects represent a stop
barrier.
53. Planning on the fast lane: Learning to interact using
attention mechanisms in path integral inverse RL
Neural network architectures for situation-dependent reward prediction. Policy temporal attention architecture
consisting of policy attention and temporal attention mechanism. Inputs are a set of planning cycles each having a
set of policies. Policy encoder generates a latent representation of individual policies. Policy attention mechanism
produces a low-dimensional context vector, which is forwarded to the temporal attention network (TAN). Policy
temporal attention mechanism predicts a mixture reward function given a history of context vectors.
54. Planning on the fast lane: Learning to interact using
attention mechanisms in path integral inverse RL
Overview of average test performance based on expected value difference (EVD), expected distance
(ED), and optimal policy distance (OPD). Tests are conducted on a test dataset, recorded by an expert-
tuned planning algorithm.
55. Vehicle Trajectory Prediction by Transfer
Learning of Semi-Supervised Models
• This work shows that semi-supervised models for vehicle trajectory
prediction significantly improve performance over supervised models on
state-of-the-art real-world benchmarks.
• Moving from supervised to semi-supervised models allows scaling up by
using unlabeled data, increasing the number of images in pre-training from
millions to a billion.
• It performs ablation studies comparing transfer learning of semi-supervised
and supervised models while keeping all other factors equal.
• Within semi-supervised models it compares contrastive learning with
teacher-student methods as well as networks predicting a small number of
trajectories with networks predicting probabilities over a large trajectory
set.
56. Vehicle Trajectory Prediction by Transfer
Learning of Semi-Supervised Models
An example of input and output representations for
mid-level (top) and low-level representations
(bottom). In the top row, the mid-level input
representation is an annotated map of the scene
(top left), with boxes representing agent positions
and colors representing semantic categories. The
output (top right) is a probability distribution over a
set of candidate trajectories. In the bottom row, a
low-level representation uses an image from the
vehicle’s front-facing camera as input (bottom left),
and predicts the future steering wheel angle (bottom
right) and speed of the vehicle.
57. Vehicle Trajectory Prediction by Transfer
Learning of Semi-Supervised Models
• Mid-level representation: an annotated map image to represent the driving
environment. This includes annotations for drivable areas, crosswalks and
walkways using color coding to represent semantic categories. All scenes are
oriented such that the agent under consideration is centered and directed
towards the top of the image. The positions of all agents in the scene are drawn
onto the image, using faded bounding boxes to represent past positions in a
historical window. By encoding all this information into a single map, a large
amount of information is condensed into a single image.
• Low-level representation: use front-facing camera images from the Drive360
dataset as a low-level representation of a driving environment. In addition to the
image, it includes a vector of semantic map data, which includes datapoints such
as the distance to the nearest intersection, the speed limit, and the approximate
road curvature.
58. Vehicle Trajectory Prediction by Transfer
Learning of Semi-Supervised Models
low-level representations
mid-level representations
59. Vehicle Trajectory Prediction by Transfer
Learning of Semi-Supervised Models
Comparison of semi-supervised models used in experiments. The labeled
dataset in all the models consists of 1.2M images. Since SimCLR is trained
on augmentations, there is no measure of unlabeled dataset size.
60. Vehicle Trajectory Prediction by Transfer
Learning of Semi-Supervised Models
Results of CoverNet and MTP on the nuScenes dataset, comparing different semi-supervised and
supervised models to encode the annotated map. For each semi-supervised model, a direct
comparison is made to a supervised model with the same architecture. Semi-supervised models
significantly outperform their supervised counterparts on most metrics.