TRiPOD: Modeling Human-Human and Human-Object Interactions for Pose Forecasting

Introduction to TRiPOD
May 21, 2021
Hitoshi Nishimura

Today's paper
➢ Posted to arXiv on April 8, 2021
➢ Many of the authors are well known
➢ Source code has not been released
➢ The dataset is publicly available
  • Social Motion Forecasting (SoMoF) Benchmark
    (http://somof.stanford.edu/)
  • A workshop using this dataset is reportedly planned for ICCV 2021
    (1st Workshop, Benchmark and Challenge on Human Trajectory and Pose Dynamics Forecasting in the Wild)

1. Introduction
➢ Forecasting human movements (pose dynamics and trajectory)
➢ Real-world applications
  ➢ Robotics
  ➢ Healthcare
  ➢ Detection of perilous behavioral patterns
➢ Extremely challenging in real-world scenes due to several factors
  1. Interactions between people in the scene
  2. Objects involved in the scene can provide informative clues
  3. Different levels of interactions
     • The movements of the people in the scene are not always highly correlated with each other, nor with the objects
     • These levels of interaction can change over time
  4. A person might move outside the sensor field of view or be partially/fully occluded by an object

1. Introduction
➢ Existing methods
  ➢ Pose dynamics forecasting [14, 40, 41, 58]
  ➢ Trajectory forecasting [22, 27]
➢ Problem
  ➢ Existing methods neglect some of these challenging factors
    • They do not effectively model all the informative environmental and social interactions in the scene
    • They assume that all tracks and/or body joints are always observable in the past and future
➢ Proposed method
  ➢ Takes human pose dynamics and trajectory forecasting one step toward more practical in-the-wild scenarios by considering all these factors together

1. Introduction
➢ Model
  ➢ Model the input skeleton body joints and the social human-human and human-object interactions with different attention graphs
  ➢ These two types of information are different by nature
     -> Apply iterative message passing
  ➢ Humans may retain their influence on each other consistently in the future
     -> Preserve their spatio-temporal attentional relationships by also modeling them in the future prediction phase
  ➢ Address joint invisibility and body disappearance
  ➢ Accumulated error in sequential models over long-term sequences
     -> Take a curriculum learning approach to training the model
➢ Dataset
  ➢ No proper benchmark dataset exists for this real-world problem
     -> Introduce a new benchmark by repurposing existing datasets and introducing relevant evaluation metrics

2. Related Work
➢ Pose dynamics forecasting
➢ Human trajectory predictions
➢ Pose dynamics and trajectory forecasting

3. Trajectory and Pose Dynamics Forecasting
Our goal: to model the complex human-human and human-object interactions in a way that can also predict the visibility of every joint
➢ Problem Definition
  ➢ Joint visibility: a binary value that is 0 if the joint is invisible
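
Note: one way a binary visibility flag can be used is to exclude invisible joints when measuring prediction error. The following is a minimal sketch under assumptions that are not in the slides: joints are 2D points, the error is a mean Euclidean distance, and the function name `visible_joint_error` is hypothetical. It is not the benchmark's official metric.

```python
import math


def visible_joint_error(pred, target, visible):
    """Mean Euclidean error over joints whose visibility flag is 1.

    pred, target: lists of (x, y) joint positions (assumed 2D for this sketch).
    visible: list of 0/1 flags, 0 meaning the joint is invisible.
    Returns 0.0 when no joint is visible (an arbitrary choice for this sketch).
    """
    total, count = 0.0, 0
    for (px, py), (tx, ty), v in zip(pred, target, visible):
        if v == 1:  # invisible joints contribute nothing to the error
            total += math.hypot(px - tx, py - ty)
            count += 1
    return total / count if count else 0.0


# Toy example: the third joint is invisible, so its large error is ignored.
pred = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
target = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
visible = [1, 1, 0]
error = visible_joint_error(pred, target, visible)
```

Only the first two joints are counted here, so the badly predicted but invisible third joint does not affect the result.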

A. Attentional Human Pose History
➢ Leverage the natural connectivities of the skeleton
➢ The influence of joints on each other is not uniform
➢ Encode the past skeleton history for each person
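
Note: the idea of non-uniform influence between connected joints can be illustrated with attention restricted to skeleton edges. The sketch below is an assumption-laden toy, not the paper's module: it scores neighbours with a plain dot product (real graph attention layers use learned parameters), and the function name `skeleton_attention` and the three-joint chain are hypothetical.

```python
import math


def skeleton_attention(features, edges):
    """Softmax attention for each joint over itself and its skeleton neighbours.

    features: list of per-joint feature vectors (lists of floats).
    edges: list of (i, j) pairs giving the skeleton connectivity.
    Returns (updated_features, attention_weights).
    """
    n = len(features)
    neighbours = {i: {i} for i in range(n)}  # each joint also attends to itself
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    updated, weights = [], []
    for i in range(n):
        idx = sorted(neighbours[i])
        # Assumption: dot-product score instead of a learned scoring function.
        scores = [sum(x * y for x, y in zip(features[i], features[j])) for j in idx]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        norm = sum(exps)
        alpha = [e / norm for e in exps]
        weights.append(dict(zip(idx, alpha)))
        dim = len(features[i])
        updated.append(
            [sum(a * features[j][d] for a, j in zip(alpha, idx)) for d in range(dim)]
        )
    return updated, weights


# Toy chain: shoulder (0) - elbow (1) - wrist (2)
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
edges = [(0, 1), (1, 2)]
updated, weights = skeleton_attention(features, edges)
```

The shoulder never attends to the wrist because they share no edge, and the weights it assigns to itself and the elbow differ, which is the non-uniform influence the slide refers to.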

B. Object and Global Scene Features
• Visual feature
• Geometrical information
• Class label
Object detector
Spatio-temporal model to represent the sequence

C. Human to Object Attention Module
Graph attention
-> Learn different levels of interactions

D. Social Attention Module
Graph attention
-> Learn different levels of interactions

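E. Message Passing
[Figure: one message-passing step. Symbols shown: $f_n^1$, $f_n^2$, $e_n^1$, $e_n^2$. $f_n^p$ is the node feature of person $p$; $m_{n+1}^p$ is the message to person $p$ at step $n+1$. The figure marks an averaging step ("Average").]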
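
Note: the slide text does not say exactly which quantities are averaged, so the following is only a generic sketch of iterative message passing with an averaging step. It assumes, without confirmation from the source, that the message to a person is the average of an attention-weighted summary of other people and an attention-weighted summary of objects, that the node update is also an average, and that the neighbours stay fixed. All names (`weighted_sum`, `message_passing`, the toy inputs) are hypothetical.

```python
def weighted_sum(weights, features):
    """Attention-weighted sum of neighbour features, keyed by neighbour name."""
    dim = len(next(iter(features.values())))
    return [sum(w * features[k][d] for k, w in weights.items()) for d in range(dim)]


def message_passing(person, others, objects, social_w, object_w, steps):
    """Update one person's node feature for a fixed number of steps.

    Assumptions for this sketch: neighbour features and attention weights are
    fixed, and both the message and the node update use a plain average.
    """
    state = list(person)
    for _ in range(steps):
        from_people = weighted_sum(social_w, others)    # human-human summary
        from_objects = weighted_sum(object_w, objects)  # human-object summary
        # Message at step n+1: average of the two summaries (assumed).
        message = [(a + b) / 2.0 for a, b in zip(from_people, from_objects)]
        # Node update: blend the current feature with the message (assumed).
        state = [(s + m) / 2.0 for s, m in zip(state, message)]
    return state


others = {"q": [2.0, 0.0]}
objects = {"ball": [0.0, 4.0]}
result = message_passing(
    person=[0.0, 0.0],
    others=others,
    objects=objects,
    social_w={"q": 1.0},
    object_w={"ball": 1.0},
    steps=2,
)
```

In a full model every node would be updated at each step and the attention weights would be recomputed; this single-person view only shows how two different sources of information can be merged into one message.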

F. Future Social Interactions
Message passing result: $f_N^p$
Dynamically reconsider social interactions in the future

G. Training Strategies
➢ Problem in the training phase
  ➢ The model cannot recover from the errors it accumulates at each time step
     -> Feeding this error as input to the next step propagates it throughout the network
➢ Solution
  1. Make the final prediction consider both the input and the output of the RNN decoder at each time step
  2. Employ curriculum learning
     • Start with easier sub-tasks and gradually increase the difficulty of the tasks
     • Divide the future pose prediction problem for $\tau_f$ frames into $\tau_f / \omega$ frames
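
Note: a curriculum over the prediction horizon could be scheduled as in the sketch below. It rests on one possible reading of the slide, which is an assumption: training proceeds in $\omega$ stages, and each stage extends the horizon by $\tau_f / \omega$ frames until all $\tau_f$ frames are predicted. The function name `curriculum_horizons` and the example values are hypothetical.

```python
def curriculum_horizons(tau_f, omega):
    """Return the prediction horizon (in frames) used at each curriculum stage.

    Assumed reading: omega stages, each adding tau_f / omega frames.
    """
    if tau_f <= 0 or omega <= 0 or tau_f % omega != 0:
        raise ValueError("this sketch assumes tau_f is a positive multiple of omega")
    step = tau_f // omega
    # Stage 1 is the easiest (shortest horizon); the last stage covers tau_f.
    return [step * stage for stage in range(1, omega + 1)]


# Illustrative values only: 14 future frames split into 7 stages.
horizons = curriculum_horizons(tau_f=14, omega=7)
```

A training loop would then run some number of epochs per stage, computing the loss only on the first `horizon` predicted frames before moving on to the next, longer horizon.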

5. Experiments
A, B, C: Interpreting the interactions between humans and objects
D: Being aware of occlusion and estimating it
E, F: Handling an agent leaving the scene

6. Conclusion
➢ Future work
  ➢ Incorporating 3D information (when camera parameters are available)
  ➢ Considering multi-modal future predictions
