This presentation, "Introduction to TRiPOD", covers TRiPOD, a model for human pose dynamics and trajectory forecasting in real-world scenes. TRiPOD uses graph attention networks to model human-human and human-object interactions. It predicts future poses and trajectories and also estimates joint visibility, including cases where a person leaves the scene or becomes occluded. The paper was posted to arXiv in 2021 and uses the Social Motion Forecasting benchmark dataset. The model handles varying interaction levels and joint occlusions through message passing between graphs and curriculum learning.
2. 2
Today's paper
Posted to arXiv on 2021/4/8
Many of the authors are well known
The source code has not been released
The dataset is publicly available
• Social Motion Forecasting (SoMoF) Benchmark
(http://somof.stanford.edu/)
• A workshop using this dataset is reportedly being held at ICCV 2021
(1st Workshop, Benchmark and Challenge on Human Trajectory and Pose Dynamics Forecasting in the Wild)
3. 3
Forecasting human movements (pose dynamics and trajectory)
Real-world applications
including robotics
healthcare
detection of perilous behavioral patterns
Extremely challenging in real-world scenes due to several factors
1. Interactions between people in the scene
2. Objects involved in the scene can provide informative clues
3. Different levels of interactions
• (The movements of the people in a scene are not always highly correlated with each other,
nor with the objects in the scene)
• These different levels of interaction can change over time
4. A person might move outside the sensor field-of-view or be partially/fully occluded by an
object
1. Introduction
5. 5
Existing methods
Pose dynamics forecasting methods [14, 40, 41, 58]
Trajectory forecasting [22, 27]
Problem
Neglect some of these challenging factors
• Do not effectively model all the informative environmental and social interactions in the scene
• Assume that all tracks and/or body joints are always observable in the past and future
Proposed method
Take human pose dynamics and trajectory forecasting one step forward toward more practical
in-the-wild scenarios by considering all these factors together
1. Introduction
6. 6
Model
Model the input skeleton body joints, the social human-human and human-object interactions
with different attention graphs
These two types of information are different by nature
-> Applying an iterative message passing
Humans may keep influencing each other in the future
-> Preserve their spatio-temporal attentional relationships by also modeling them in the future
prediction phase
Address the concept of joint invisibility or body disappearance
Accumulative error in sequential models for long-term sequences
-> Take a curriculum learning approach to train our model
Dataset
No proper benchmark dataset exists for such a real-world problem
-> Introduce a new benchmark by repurposing existing datasets and introducing relevant
evaluation metrics
1. Introduction
7. 7
Pose dynamics forecasting
Human trajectory predictions
Pose dynamics and trajectory forecasting
2. Related Work
8. 8
Our goal: To model the complex human-human and human-object interactions in a way
that can also predict the visibility of every joint
Problem Definition
3. Trajectory and Pose Dynamics Forecasting
Joint visibility: a binary value that is 0 if the joint is invisible
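The binary visibility value can be used to leave invisible joints out of an error measure. The following is a minimal sketch under stated assumptions, not the paper's evaluation metric: the function name `masked_joint_error`, the 2D coordinates, and the choice of a mean Euclidean distance are all hypothetical and only illustrate how a 0/1 visibility flag can mask joints.

```python
import math


def masked_joint_error(pred, target, visible):
    """Mean Euclidean error over joints whose visibility flag is 1.

    pred, target: lists of (x, y) joint coordinates (assumed 2D here).
    visible: list of 0/1 flags, where 0 marks an invisible joint.
    Returns None when no joint is visible, since the error is undefined.
    """
    total, count = 0.0, 0
    for (px, py), (tx, ty), v in zip(pred, target, visible):
        if v == 1:
            total += math.hypot(px - tx, py - ty)
            count += 1
    return total / count if count else None


# Toy example: the third joint is invisible and is ignored.
pred = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
target = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
visible = [1, 1, 0]
error = masked_joint_error(pred, target, visible)
```

Here only the first two joints contribute, so the large offset of the invisible third joint does not affect the result.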
10. 10
A. Attentional Human Pose History
Leverage natural connectivities
Influence of joints on each other is not uniform
Encode the past skeleton history for each person
11. 11
B. Object and Global Scene Features
• visual feature
• geometrical information
• class label
Object detector
Spatio-temporal model to represent the sequence
12. 12
C. Human to Object Attention Module
Graph attention
-> Learn different levels of interactions
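To illustrate how attention can express different levels of interaction, here is a minimal sketch of one human attending over several objects. It is a simplification and not the paper's module: it scores each human-object pair with a plain dot product instead of the learned scoring used in graph attention networks, and the names `softmax` and `attend` and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(human_feat, object_feats):
    """Dot-product attention from one human node to its object nodes.

    Returns (weights, context): one attention weight per object, and the
    weighted sum of object features that would be passed to the human node.
    """
    scores = [sum(h * o for h, o in zip(human_feat, obj)) for obj in object_feats]
    weights = softmax(scores)
    dim = len(human_feat)
    context = [
        sum(w * obj[d] for w, obj in zip(weights, object_feats))
        for d in range(dim)
    ]
    return weights, context


# Toy example: objects 0 and 2 align with the human feature, object 1 does not.
human = [1.0, 0.0]
objects = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
weights, context = attend(human, objects)
```

The object whose feature is unrelated to the human receives a smaller weight, which is the sense in which attention weights can encode a weaker interaction.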
15. 15
F. Future Social Interactions
Message passing result (denoted f, with indices N and p, on the slide)
Dynamically reconsider social interactions in the future
16. 16
Problem in the training phase
The model cannot recover from its accumulating errors at each time step
-> Feeding this error as the input to the next step propagates it throughout the network
Solution
1. Make the final prediction consider both the input and output of the RNN decoder at each
time step
2. Employ the concept of curriculum learning
• Start with easier sub-tasks and gradually increase the difficulty level of the tasks
• Divide our future pose prediction problem for $\tau_f$ frames into $\tau_f / \omega$ frames
G. Training Strategies
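The curriculum idea of starting with short predictions and lengthening them can be sketched as a schedule of prediction horizons. This is a sketch under a loudly stated assumption: the slide does not define ω precisely, so here ω is treated as the number of curriculum stages, with each stage extending the horizon by roughly τf/ω frames. The function name `curriculum_horizons` and the example values are hypothetical.

```python
def curriculum_horizons(tau_f, omega):
    """Return the prediction horizon (in frames) for each curriculum stage.

    Assumption (not confirmed by the source): omega is the number of stages,
    so stage k predicts about k * tau_f / omega frames, and the final stage
    covers all tau_f frames.
    """
    if tau_f <= 0 or omega <= 0:
        raise ValueError("tau_f and omega must be positive")
    # Integer ceiling of tau_f * k / omega for k = 1..omega.
    horizons = [(tau_f * k + omega - 1) // omega for k in range(1, omega + 1)]
    # Remove duplicates that appear when omega is larger than tau_f.
    return sorted(set(horizons))


# Example values chosen only for illustration.
horizons = curriculum_horizons(14, 7)
for stage, horizon in enumerate(horizons, start=1):
    # A training loop would fit the model on the first `horizon` future
    # frames here before moving on to the next, harder stage.
    print(f"stage {stage}: predict {horizon} future frames")
```

Training on short horizons first limits how far an early mistake can propagate through the decoder, and later stages expose the model to the full sequence.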
17. 17
5. Experiments
A, B, C: Interpreting the interactions between humans and objects
D: Being able to estimate occlusion
E, F: Handling agents leaving the scene
18. 18
Future work
Incorporating 3D information (when camera parameters are available)
Considering multi-modal future predictions
6. Conclusion