1. Driving Behaviors for ADAS
and Autonomous Driving XIV
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
2. Outline
• TNT: Target-driveN Trajectory Prediction (CVPR’20)
• Driving Through Ghosts: Behavioral Cloning with False Positives (8’29)
• LiRaNet: E2E Trajectory Prediction using Spatio-Temporal Radar Fusion (10.15)
• SimAug: Learning Robust Representations from Simulation for Trajectory
Prediction (ECCV’20)
• Learning Lane Graph Representations for Motion Forecasting (ECCV’20)
• Implicit Latent Variable Model for Scene-Consistent Motion Forecasting (ECCV’20)
• Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic
Representations (ECCV’20)
3. TNT: Target-driveN Trajectory Prediction
• The key insight is that for prediction within a moderate time horizon, the
future modes can be effectively captured by a set of target states.
• This leads to the target-driven trajectory prediction (TNT) framework.
• TNT has three stages which are trained end-to-end.
• It first predicts an agent’s potential target states T steps into the future, by encoding
its interactions with the environment and the other agents.
• TNT then generates trajectory state sequences conditioned on targets.
• A final stage estimates trajectory likelihoods and a final compact set of trajectory
predictions is selected.
• This is in contrast to previous work which models agent intents as latent
variables, and relies on test-time sampling to generate diverse trajectories.
• TNT is benchmarked on trajectory prediction of vehicles and pedestrians,
outperforming the state of the art on Argoverse Forecasting, INTERACTION,
Stanford Drone and an in-house Pedestrian-at-Intersection dataset.
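The three stages above can be sketched with toy stand-ins for the learned components (the random target scores, the straight-line decoder and the displacement-based ranking are illustrative placeholders, not TNT's actual networks):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_targets(candidates, n_select=6):
    """Stage (a): score target candidates and keep the most likely ones.
    Scores here are random stand-ins for a learned target classifier."""
    scores = rng.random(len(candidates))
    order = np.argsort(scores)[::-1][:n_select]
    return candidates[order]

def estimate_motion(start, target, horizon=30):
    """Stage (b): a trivial stand-in decoder that interpolates a straight
    trajectory from the agent's position to the selected target."""
    steps = np.linspace(0.0, 1.0, horizon)[:, None]
    return start + steps * (target - start)

def score_and_select(trajectories, k=3):
    """Stage (c): rank trajectory hypotheses and keep a compact set of K.
    A real model estimates likelihoods; here we rank by final displacement."""
    lengths = [np.linalg.norm(t[-1] - t[0]) for t in trajectories]
    order = np.argsort(lengths)[:k]
    return [trajectories[i] for i in order]

# Toy run: target candidates sampled around the agent.
agent = np.zeros(2)
candidates = rng.uniform(-20, 20, size=(64, 2))
targets = predict_targets(candidates)
hypotheses = [estimate_motion(agent, t) for t in targets]
final = score_and_select(hypotheses)
print(len(final), final[0].shape)   # 3 trajectories, each (30, 2)
```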
4. TNT: Target-driveN Trajectory Prediction
Illustration of the TNT framework when applied to the vehicle future trajectory prediction task. TNT
consists of three stages: (a) target prediction which proposes a set of plausible targets (stars)
among all candidates (diamonds). (b) target-conditioned motion estimation which estimates a
trajectory (distribution) towards each selected target, (c) scoring and selection which ranks
trajectory hypotheses and selects a final set of trajectory predictions with likelihood scores.
5. TNT: Target-driveN Trajectory Prediction
TNT model overview. Scene context is first encoded as the model’s inputs. Then follows the core
three stages of TNT: (a) target prediction which proposes an initial set of M targets; (b) target-
conditioned motion estimation which estimates a trajectory for each target; (c) scoring and selection
which ranks trajectory hypotheses and outputs a final set of K predicted trajectories.
6. TNT: Target-driveN Trajectory Prediction
TNT supports flexible choices of targets. Vehicle target candidate points
are sampled from the lane centerlines. Pedestrian target candidate
points are sampled from a virtual grid centered on the pedestrian.
8. Driving Through Ghosts: Behavioral Cloning with
False Positives
• In the context of behavioral cloning, perceptual errors at training time can
lead to learning difficulties or wrong policies, as expert demonstrations
might be inconsistent with the perceived world state.
• This work proposes a behavioral cloning approach that can safely
leverage imperfect perception without being conservative.
• The core is a representation of perceptual uncertainty for learning to plan.
• It proposes a new probabilistic birds-eye-view semantic grid to encode the
noisy output of object perception systems.
• It then leverages expert demonstrations to learn an imitative driving policy
using this probabilistic representation.
• Using the CARLA simulator, it can safely overcome critical false positives
that would otherwise lead to catastrophic failures or conservative behavior.
9. Driving Through Ghosts: Behavioral Cloning with
False Positives
It is a probabilistic birds-eye-view semantic representation, Soft BEV, for imitation learning
under perceptual uncertainty. It enables learning safer policies that can ignore false positives.
10. Driving Through Ghosts: Behavioral Cloning with
False Positives
• The observations o are encoded in a birds-eye-view grid, i.e., an N×M×D-dim. tensor where each
slice k represents a category of estimated state (e.g., an object or feature type) together with the
respective estimated confidences.
• Each slice is a matrix of NxM, where each element corresponds to the presence of an estimated
object or feature of type k at that location, weighted by its estimated confidence.
• The resulting input representation is referred to as the Soft BEV.
• It models a driving agent via a deep convolutional policy network with input of Soft BEV.
• The CNN outputs way-points along the future trajectory, used by a PID controller to compute the
control signals for the steering and throttle of the vehicle.
• It consists of a ResNet-18 base network acting as an encoder, followed by three deconvolutional
layers which also have as an input the current speed signal.
• For each of the potential high-level commands (“go left”, “go right”, “go straight”, “follow the
road”), the network predicts multiple output heat-maps which are then converted into way-points
by spatial soft-argmax layers.
• Based on the high-level command, the respective head is used to predict the way-points.
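The heat-map-to-way-point conversion can be sketched as a spatial soft-argmax, i.e., the probability-weighted average of cell coordinates (a minimal NumPy version; the temperature and grid size are illustrative, not the paper's settings):

```python
import numpy as np

def spatial_soft_argmax(heatmap, temperature=1.0):
    """Convert a predicted heat-map into a single (x, y) way-point as the
    softmax-weighted average of cell coordinates (spatial soft-argmax)."""
    h, w = heatmap.shape
    logits = heatmap.ravel() / temperature
    probs = np.exp(logits - logits.max())     # numerically stable softmax
    probs /= probs.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    x = float(np.dot(probs, xs.ravel()))
    y = float(np.dot(probs, ys.ravel()))
    return x, y

# A sharp peak at (row=10, col=25) yields a way-point near (25.0, 10.0).
hm = np.zeros((48, 48))
hm[10, 25] = 50.0
print(spatial_soft_argmax(hm))
```

Because the operation is differentiable, way-point supervision can flow back into the heat-map predictions during training.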
11. Driving Through Ghosts: Behavioral Cloning with
False Positives
Experimental setup: The CARLA simulator provides ground truth features. Perception
noise is applied to the dynamic features, which are then fused into an uncertainty-
scaled birds-eye view representation, the Soft BEV. Together with high-level
commands and speed information it is fed to a CNN that predicts way-points.
13. LiRaNet: E2E Trajectory Prediction using Spatio-
Temporal Radar Fusion
• LiRaNet is an end-to-end trajectory prediction method which utilizes radar sensor
information along with widely used lidar and high-definition (HD) maps.
• Automotive radar provides rich, complementary information, allowing for longer
range vehicle detection as well as instantaneous radial velocity measurements.
• However, there are factors that make the fusion of lidar and radar information
challenging, such as the relatively low angular resolution of radar measurements,
their sparsity and the lack of exact time synchronization with lidar.
• To overcome these challenges, propose an efficient spatio-temporal radar feature
extraction scheme which achieves state-of-the-art performance on multiple
large-scale datasets.
14. LiRaNet: E2E Trajectory Prediction using Spatio-
Temporal Radar Fusion
An example scene from X17k in bird’s eye view where lidar points (light blue) and radar point velocities (orange)
are visualized with labels (white) for current, past and future frames. Vehicle A is a turning bus that has multiple
radar points across frames. By effectively combining them over space and time a full 2D velocity and turning
rate can be recovered. Vehicle B shows the high positional noise that inherently comes with radar. Vehicle C
shows a case with sparse lidar points where implicitly associating them across time can be challenging.
However, radar points present around C can add context for the model to detect and predict the trajectory.
15. LiRaNet: E2E Trajectory Prediction using Spatio-
Temporal Radar Fusion
LiRaNet overview: The radar feature extraction network (A) extracts spatio-temporal features from raw radar points in
2 steps: (1) for each frame we create a graph between the BEV grid cells and radar points to learn spatial features of
each cell using a non-rigid convolution, (2) these spatial features are further fused temporally by stacking across
channel dimension and using an MLP to get a radar feature volume. This feature volume is then fused with feature
volumes from other sensors and fed to a joint perception-prediction network (B) which produces detections and their
future trajectories. An example prediction for a scene from X17k can be seen in (C).
16. LiRaNet: E2E Trajectory Prediction using Spatio-
Temporal Radar Fusion
• The input domain consists of radar points and the output domain consists of BEV cells.
• For each cell j, the features h_j^m for sweep m are calculated as
h_j^m = Σ_{i ∈ A_j^m} g^m(f_i^m ⊕ (x_i^m − x_j^m))
• where A_j^m is the set of associated radar points, x_i^m is the 2D coordinates of the associated
radar point, x_j^m is the 2D coordinate of the BEV cell's center, ⊕ denotes the
concatenation operation, f_i^m is the feature vector for the radar point and g^m(·) is a multi-
layer perceptron (MLP) with learnable weights shared across all the cells.
• A_j^m is calculated using a nearest-neighbor algorithm with a distance threshold.
• By using a threshold larger than the size of a cell, this method compensates for positional
errors in radar.
• For each cell j, the final spatio-temporal feature vector h_j is calculated by concatenating
the per-sweep features h_j^m and using an MLP to combine them.
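A minimal sketch of the per-cell aggregation, assuming a single random linear layer with ReLU as a stand-in for the learned MLP g^m and a sum over associated points:

```python
import numpy as np

def associate(radar_xy, cell_center, radius=2.5):
    """A_j^m: radar points within a distance threshold of the cell centre.
    A radius larger than the cell size tolerates radar position noise."""
    d = np.linalg.norm(radar_xy - cell_center, axis=1)
    return np.nonzero(d < radius)[0]

def cell_feature(radar_xy, radar_feat, cell_center, mlp, out_dim):
    """h_j^m: aggregate the MLP output over associated points; each point
    feeds the MLP its feature vector concatenated with its offset from
    the cell centre."""
    idx = associate(radar_xy, cell_center)
    if len(idx) == 0:
        return np.zeros(out_dim)
    offsets = radar_xy[idx] - cell_center
    inputs = np.concatenate([radar_feat[idx], offsets], axis=1)
    return mlp(inputs).sum(axis=0)

# Stand-in "MLP": one random linear layer + ReLU (weights would be learned).
rng = np.random.default_rng(0)
in_dim, out_dim = 4 + 2, 8
W = rng.standard_normal((in_dim, out_dim))
mlp = lambda x: np.maximum(x @ W, 0.0)

pts = rng.uniform(0, 10, size=(20, 2))       # radar point positions (toy)
feats = rng.standard_normal((20, 4))         # per-point radar features (toy)
h = cell_feature(pts, feats, np.array([5.0, 5.0]), mlp, out_dim)
print(h.shape)   # (8,)
```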
19. SimAug: Learning Robust Representations from
Simulation for Trajectory Prediction
• This paper studies the problem of predicting future trajectories of people in
unseen cameras of novel scenarios and views.
• It approaches this problem through the real-data-free setting in which the model
is trained only on 3D simulation data and applied out-of-the-box to a wide variety
of real cameras.
• It proposes to learn robust representation through augmenting the simulation
training data such that the representation can better generalize to unseen real-
world test data.
• The key idea is to mix the feature of the hardest camera view with the adversarial
feature of the original view.
• The method, referred to as SimAug, achieves promising results on three real-world
benchmarks using zero real training data, and state-of-the-art performance on the
Stanford Drone and VIRAT/ActEV datasets when using in-domain training data.
• Code and models are released at https://next.cs.cmu.edu/simaug.
20. SimAug: Learning Robust Representations from
Simulation for Trajectory Prediction
SimAug is trained on simulation and tested on real unseen videos. Each
training trajectory is represented by multi-view segmentation features extracted
from the simulator. SimAug mixes the feature of the hardest camera view with
the adversarial feature of the original view.
21. SimAug: Learning Robust Representations from
Simulation for Trajectory Prediction
• Each time given a camera view, use it as an anchor to search for the "hardest" view that is most
inconsistent with what the model has learned.
• It uses the classification loss as the criterion, selecting the view with the highest loss:
j* = argmax_j L_cls(x_j, y)
• For the original view, generate an adversarial trajectory by the targeted-FGSM attack:
x_adv = x − ε · sign(∇_x L_cls(x, y_{j*}))
• The attack tries to make the model predict the future locations in the selected "hardest" camera
view rather than the original view.
• In essence, the resulting adversarial feature is "warped" to the "hardest" camera view by a small
perturbation.
• By defending against such adversarial trajectories, the model learns representations that are robust
against variances in camera views.
22. SimAug: Learning Robust Representations from
Simulation for Trajectory Prediction
• It mixes up the trajectory locations of the selected view and the adversarial
trajectory locations by a convex combination over their features and
one-hot location labels:
x_mix = λ · x_adv + (1 − λ) · x_{j*},  y_mix = λ · one-hot(y) + (1 − λ) · one-hot(y_{j*})
• where [y_{h+1}, …, y_T] = L_{h+1:T} are the ground-truth locations of the original view.
• The one-hot(·) function projects a location in xy coordinates into a one-hot
embedding over the predefined grid used in the backbone trajectory prediction
model.
23. SimAug: Learning Robust Representations from
Simulation for Trajectory Prediction
• Backbone network is the Multiverse model (CVPR’20).
• The training algorithm’s pseudo-code is:
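A minimal sketch of one SimAug training step, with toy stand-ins (a quadratic loss_fn, its analytic grad_fn and random per-view features) in place of the learned trajectory model:

```python
import numpy as np

rng = np.random.default_rng(0)

def simaug_step(views, labels, loss_fn, grad_fn, eps=0.05, lam=0.5):
    """One SimAug step (sketch): pick the hardest camera view, craft a
    targeted-FGSM feature for the original view, then mix the two."""
    # 1. Hardest view = the one with the highest classification loss.
    losses = [loss_fn(v, y) for v, y in zip(views, labels)]
    hard = int(np.argmax(losses))
    x, y = views[0], labels[0]            # view 0 is the original/anchor
    # 2. Targeted FGSM: step against the gradient toward the hard label,
    #    "warping" the original feature toward the hardest view.
    x_adv = x - eps * np.sign(grad_fn(x, labels[hard]))
    # 3. Convex combination of adversarial and hardest-view examples.
    x_mix = lam * x_adv + (1 - lam) * views[hard]
    y_mix = lam * y + (1 - lam) * labels[hard]
    return x_mix, y_mix

# Toy stand-ins: quadratic loss to a per-view label, analytic gradient.
views = [rng.standard_normal(16) for _ in range(4)]
labels = [rng.standard_normal(16) for _ in range(4)]
loss_fn = lambda x, y: float(np.sum((x - y) ** 2))
grad_fn = lambda x, y: 2 * (x - y)
x_mix, y_mix = simaug_step(views, labels, loss_fn, grad_fn)
print(x_mix.shape, y_mix.shape)
```

In training, the mixed example (x_mix, y_mix) is what the backbone model is optimized on.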
26. Learning Lane Graph Representations for Motion
Forecasting
• A motion forecasting model that exploits a novel structured map representation
as well as actor-map interactions.
• Instead of encoding vectorized maps as raster images, construct a lane graph
from raw map data to explicitly preserve the map structure.
• To capture the complex topology and long range dependencies of the lane graph,
propose LaneGCN which extends graph convolutions with multiple adjacency
matrices and along-lane dilation.
• To capture the complex interactions between actors and maps, exploit a fusion
network consisting of four types of interactions, actor-to-lane, lane-to-lane, lane-
to-actor and actor-to-actor.
• Powered by LaneGCN and actor-map interactions, the model is able to predict
accurate and realistic multi-modal trajectories.
• This approach significantly outperforms the state-of-the-art on the large scale
Argoverse motion forecasting benchmark.
27. Learning Lane Graph Representations for Motion
Forecasting
It constructs a lane graph from raw map data and uses LaneGCN to extract map features. In parallel,
ActorNet extracts actor features from observed past trajectories. Then FusionNet models the
interactions between the actors themselves and the map, and predicts the future trajectories.
28. Learning Lane Graph Representations for Motion
Forecasting
The model is composed of four modules. (1) ActorNet receives the past actor trajectories as input, and uses 1D convolution to
extract actor node features. (2) MapNet constructs a lane graph from HD maps, and uses a LaneGCN to extract lane node
features. (3) FusionNet is a stack of 4 interaction blocks. The actor to lane block fuses real-time traffic information from actor
nodes to lane nodes. The lane to lane block propagates information over the lane graph and updates lane features. The lane to
actor block fuses updated map information from lane nodes to actor nodes. The actor to actor block performs interactions among
actors. It uses another LaneGCN for the lane to lane block, and spatial attention layers for the other blocks. (4) The prediction
header uses after-fusion actor features to produce multi-modal trajectories.
29. Learning Lane Graph Representations for Motion
Forecasting
LaneGCN is a stack of 4 multi-scale LaneConv residual
blocks, each of which consists of a LaneConv (1, 2, 4, 8,
16, 32) and a linear layer with a residual connection. All
layers have 128 feature channels.
Left: The lane centerline of interest, its predecessor,
successor, left and right neighbor are denoted with red,
orange, blue, purple, and green lines, respectively.
Each centerline is given as a sequence of BEV points
(hollow circles). Right: Derived lane graph with an
example lane node. The lane node of interest, its
predecessor, successor, left and right neighbor are
denoted with red, orange, blue, purple and green
circles respectively.
30. Learning Lane Graph Representations for Motion
Forecasting
LaneConv Operator:
The node feature is parameterized as
x_i = MLP_shape(v_i^end − v_i^start) + MLP_loc(v_i)
where v_i is the location of lane node i and v_i^start, v_i^end are the
two end points of its centerline segment; MLP indicates a multi-layer
perceptron and the two subscripts refer to shape and location,
respectively.
The LaneConv operator is
Y = X W_0 + Σ_i A_i X W_i
where A_i and W_i are the adjacency and the weight matrices
corresponding to the i-th connection type respectively.
The k-dilation LaneConv operator is
Y = X W_0 + A_pre^k X W_pre,k + A_suc^k X W_suc,k
where A_pre^k is the k-th matrix power of A_pre.
In regular grid graphs, a dilated convolution operator can effectively
capture long-range dependencies by enlarging the receptive field;
k-dilation plays the same role along the lane graph.
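The LaneConv update can be sketched in NumPy (the chain adjacencies and random weight matrices are illustrative stand-ins for a learned layer):

```python
import numpy as np

def lane_conv(X, adjacency, W0, Ws):
    """LaneConv (sketch): Y = X W0 + sum_i A_i X W_i, with one adjacency
    matrix A_i and weight matrix W_i per connection type."""
    Y = X @ W0
    for A, W in zip(adjacency, Ws):
        Y += A @ X @ W
    return Y

rng = np.random.default_rng(0)
n, d = 6, 8                                   # lane nodes, feature channels
X = rng.standard_normal((n, d))
# Toy adjacencies: predecessor/successor along a chain of lane nodes.
A_suc = np.eye(n, k=1)
A_pre = np.eye(n, k=-1)
W0, W_pre, W_suc = (rng.standard_normal((d, d)) for _ in range(3))
Y = lane_conv(X, [A_pre, A_suc], W0, [W_pre, W_suc])
print(Y.shape)   # (6, 8)

# k-dilation: replace A_pre by its k-th matrix power to reach k hops away.
A_pre4 = np.linalg.matrix_power(A_pre, 4)
```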
31. Learning Lane Graph Representations for Motion
Forecasting
LaneGCN: this multi-scale layer combining C dilation sizes is denoted LaneConv(k1, …, kC).
• In the model, use spatial attention and LaneGCN to capture a complete set of actor-map interactions.
FusionNet:
• Build a stack of four fusion modules to capture all information between actors and lane nodes, i.e.,
actors to lanes (A2L), lanes to lanes (L2L), lanes to actors (L2A) and actors to actors (A2A).
• The L2L module is implemented with another LaneGCN; the other three modules use spatial attention layers.
Prediction Header: The header has two branches, a regression branch to predict the trajectory of each
mode and a classification branch to predict the confidence score of each mode.
33. Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
• In this paper, aim to learn scene-consistent motion forecasts of complex
urban traffic directly from sensor data.
• In particular, propose to characterize the joint distribution over future
trajectories via an implicit latent variable model.
• It models the scene as an interaction graph and employ powerful graph
neural networks to learn a distributed latent representation of the scene.
• Coupled with a deterministic decoder, it obtains trajectory samples that are
consistent across traffic participants, achieving state-of-the-art results in
motion forecasting and interaction understanding.
• These motion forecasts result in safer and more comfortable motion
planning.
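The sampling scheme can be sketched as follows, with toy lambdas standing in for the learned graph-network encoder and the deterministic decoder; the point is that one latent draw is shared by the whole scene, so each sampled realization is consistent across actors:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scene_futures(actor_feats, encoder, decoder, n_samples=5):
    """Latent variable model (sketch): one latent draw per future
    realization is shared by the whole scene and decoded
    deterministically into all actors' trajectories."""
    mu, log_sigma = encoder(actor_feats)
    futures = []
    for _ in range(n_samples):
        z = mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)
        futures.append(decoder(actor_feats, z))  # deterministic given z
    return futures

# Toy stand-ins for the learned encoder and deterministic decoder.
d_lat, horizon = 4, 10
encoder = lambda f: (f.mean(axis=0, keepdims=True)[:, :d_lat],
                     np.zeros((1, d_lat)))
decoder = lambda f, z: np.tile(z.sum(), (f.shape[0], horizon, 2))
feats = rng.standard_normal((3, 16))             # 3 actors in the scene
futures = sample_scene_futures(feats, encoder, decoder)
print(len(futures), futures[0].shape)   # 5 samples, each (3, 10, 2)
```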
34. Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
Graphical models of trajectory distribution. Dashed arrows/circles denote that only some approaches within
the group use those components. Double circle in (c) denotes that it is a deterministic mapping of its inputs.
Actor Feature Extraction. Given LiDAR and maps, the backbone CNN detects the actors in the scene, and
individual feature vectors per actor are extracted via RRoI Align, followed by a CNN with spatial pooling.
35. Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
Implicit Latent Variable Model encodes the scene into a latent space, from which it can efficiently sample
multiple future realizations in parallel, each with socially consistent trajectories.
38. Perceive, Predict, and Plan: Safe Motion Planning
Through Interpretable Semantic Representations
• In this paper propose an end-to-end learnable network that performs joint
perception, prediction and motion planning for self-driving vehicles and
produces interpretable intermediate representations.
• Unlike existing neural motion planners, its motion-planning costs are
consistent with the perception and prediction estimates.
• This is achieved by a novel differentiable semantic occupancy representation
that is explicitly used as cost by the motion planning process.
• This network is learned end-to-end from human demonstrations.
• The experiments in a large-scale manual-driving dataset and closed-loop
simulation show that the proposed model significantly outperforms state-of-
the-art planners in imitating the human behaviors while producing much
safer trajectories.
39. Perceive, Predict, and Plan: Safe Motion Planning
Through Interpretable Semantic Representations
Overview of the end-to-end learnable autonomy system that takes raw sensor data, an
HD map and a high-level route as input and produces safe maneuvers for the self-
driving vehicle via the novel semantic interpretable intermediate representations.
40. Perceive, Predict, and Plan: Safe Motion Planning
Through Interpretable Semantic Representations
Semantic classes in occupancy forecasting. Colors match between drawing and hierarchy. Shadowed
area corresponds to the SDV route. Black vehicle, pedestrian and bike icons represent the agents'
true current location.
41. Perceive, Predict, and Plan: Safe Motion Planning
Through Interpretable Semantic Representations
Inference diagram of the perception and recurrent occupancy forecasting
model. || symbolizes concatenation along the feature dimension, ⊕ element-
wise sum and ∆ bilinear interpolation used to downscale the occupancy.
42. Perceive, Predict, and Plan: Safe Motion Planning
Through Interpretable Semantic Representations
Examples of the motion planner cost functions: (a) collision, (b) driving-path,
(c) lane boundary, (d) traffic light, (e) comfort, (f) route, (g) progress.
Costs related to comfort, traffic rules and progress along the route are combined
with a safety cost computed against the forecast semantic occupancy.
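A minimal sketch of such an occupancy-based safety cost, assuming a toy 3×3 vehicle footprint and a hand-built occupancy forecast (the real planner uses the learned semantic occupancy and a proper SDV footprint):

```python
import numpy as np

def collision_cost(trajectory, occupancy):
    """Safety cost (sketch): sum the predicted occupancy probability under
    the vehicle footprint at each future timestep of a candidate trajectory.
    occupancy: (T, H, W) forecast grid; trajectory: T (row, col) cells."""
    cost = 0.0
    for t, (r, c) in enumerate(trajectory):
        # Toy 3x3 footprint around the SDV centre cell.
        patch = occupancy[t, max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
        cost += float(patch.sum())
    return cost

rng = np.random.default_rng(0)
T, H, W = 5, 20, 20
occ = rng.random((T, H, W)) * 0.1            # low background occupancy
occ[:, 8:12, 8:12] = 0.9                     # a predicted obstacle
safe = [(2, 2 + t) for t in range(T)]        # trajectory skirting it
risky = [(9, 6 + t) for t in range(T)]       # trajectory driving through it
print(collision_cost(safe, occ) < collision_cost(risky, occ))   # True
```

The planner then selects the candidate trajectory minimizing the combined cost, which is how the occupancy forecast directly shapes the chosen maneuver.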
43. Perceive, Predict, and Plan: Safe Motion Planning
Through Interpretable Semantic Representations
Learn the model parameters by exploiting two loss functions:
Semantic Occupancy Loss (supervising the forecast semantic occupancy)
Planning Loss (supervising the planned trajectory against human demonstrations)
44. Perceive, Predict, and Plan: Safe Motion Planning
Through Interpretable Semantic Representations