https://mcv-m6-video.github.io/deepvideo-2020/
Self-supervised techniques define surrogate tasks to train machine learning algorithms without the need for human-generated labels. This lecture reviews the state of the art in the field of computer vision, including the baseline techniques based on visual feature learning from ImageNet data.
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
1. @DocXavi
Module 6 - Day 10 - Lecture 1
Self-supervised
Visual Learning
26th March 2020
Xavier Giró-i-Nieto
Deep Learning for Video lectures:
[https://mcv-m6-video.github.io/deepvideo-2020/]
7. 7
A broader picture of types of learning...
Slide inspired by Alex Graves (Deepmind) at
“Unsupervised Learning Tutorial” @ NeurIPS 2018.
...with a teacher ...without a teacher
Active agent... Reinforcement learning
(with extrinsic reward)
Intrinsic motivation /
Exploration.
Passive agent... Supervised learning Unsupervised learning
11. Unsupervised learning
11
● It is how intelligent beings naturally perceive
the world.
● It can save enormous labeling effort when building
an AI agent, compared to a fully supervised approach.
● Vast amounts of unlabelled data.
12. 12
Video lectures on Unsupervised Learning
Kevin McGuinness, UPC DLCV 2016 Xavier Giró, UPC DLAI 2017
16. 16
Self-supervised learning
Reference: Andrew Zisserman (PAISS 2018)
Self-supervised learning is a form of unsupervised learning where the data
provides the supervision.
● A pretext (or surrogate) task must be invented by withholding a part of the
unlabeled data and training the NN to predict it.
Unlabeled data
(X)
17. 17
Self-supervised learning
Reference: Andrew Zisserman (PAISS 2018)
Self-supervised learning is a form of unsupervised learning where the data
provides the supervision.
● By defining a proxy loss, the NN learns representations, which should be
valuable for the actual downstream task.
[Figure: the network's prediction ŷ feeds a proxy loss; representations are learned without labels]
18. 18
Transfer Learning
Target data
E.g. PASCAL
Target model
Target labels
Small
amount of
data/labels
Slide credit: Kevin McGuinness
Instead of training a deep network from scratch for your task...
19. 19
Transfer Learning
Source data
E.g. ImageNet
Source model
Source labels
Target data
E.g. PASCAL
Target model
Target labels
Transfer Learned
Knowledge
Large
amount of
data/labels
Small
amount of
data/labels
Slide credit: Kevin McGuinness
Instead of training a deep network from scratch for your task:
● Take a network trained on a different source domain and/or task
● Adapt it for your target domain and/or task
20. 20
Transfer Learning
Slide credit: Kevin McGuinness
Train deep net on “nearby” task using standard
backprop
● Supervised learning (e.g. ImageNet).
● Self-supervised learning (e.g. Language
Model)
Cut off top layer(s) of network and replace with
supervised objective for target domain
Fine-tune network using backprop with labels for
target domain until validation loss starts to
increase
[Figure: a network (conv1–conv3, fc1) is trained on surrogate data with a surrogate loss; the top fc2 + softmax is then replaced by a new my_fc2 + softmax trained on real data and real labels with the real loss]
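The "cut off the top layer(s) and retrain for the target domain" step can be sketched as a linear probe: freeze the pretrained features and train only a new softmax layer. Below is a minimal numpy sketch on toy features; the data, dimensions, and every name are illustrative assumptions, not the lecture's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_linear_probe(feats, labels, n_classes, lr=0.5, steps=200):
    """Train only a new softmax layer on frozen (pretrained) features."""
    W = np.zeros((feats.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)         # softmax probabilities
        grad = (p - onehot) / len(feats)          # cross-entropy gradient
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# pretend these are frozen encoder features for two well-separated classes
feats = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                   rng.normal(1.0, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
W, b = train_linear_probe(feats, labels, n_classes=2)
accuracy = float(((feats @ W + b).argmax(axis=1) == labels).mean())
```

On a real task, the frozen features would come from the pretrained network's penultimate layer, and fine-tuning would additionally unfreeze some backbone layers.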
21. 21
Self-supervised + Transfer Learning
#Shuffle&Learn Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using
temporal order verification." ECCV 2016. [code] (Slides by Xunyu Lin):
26. 26
Autoencoder (AE)
Fig: “Deep Learning Tutorial” Stanford
Autoencoders:
● Predict at the output the
same input data.
● Do not need labels.
27. 27
Autoencoder (AE)
Slide: Kevin McGuinness (DLCV UPC 2017)
[Figure: data → Encoder (W1) → latent variables h (representation/features) → Decoder (W2) → reconstruction; loss = reconstruction error]
1. Initialize a NN by solving an autoencoding
problem.
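Step 1 above (initialize a network by solving an autoencoding problem) can be illustrated with a minimal linear autoencoder in numpy. The toy data, learning rate, and dimensions are assumptions made for this sketch only.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 2-D points that actually lie on a 1-D line (rank-1 structure)
X = np.outer(np.linspace(-1.0, 1.0, 20), np.array([1.0, 1.0]))

W1 = rng.normal(0.0, 0.5, (2, 1))   # encoder: 2 inputs -> 1 latent unit
W2 = rng.normal(0.0, 0.5, (1, 2))   # decoder: 1 latent unit -> 2 outputs

def mse(A, B):
    return float(np.mean((A - B) ** 2))

initial_error = mse(X @ W1 @ W2, X)
for _ in range(3000):
    H = X @ W1                       # latent code h (the learned feature)
    R = H @ W2                       # reconstruction of the input
    G = 2.0 * (R - X) / X.size       # gradient of the MSE w.r.t. R
    W2 -= 0.05 * H.T @ G             # descend on the decoder...
    W1 -= 0.05 * X.T @ (G @ W2.T)    # ...and on the encoder
final_error = mse(X @ W1 @ W2, X)
```

No labels appear anywhere: the input itself is the target, and the learned W1 is what step 2 reuses as an initialization.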
28. 28
Autoencoder (AE)
Slide: Kevin McGuinness (DLCV UPC 2017)
[Figure: data → Encoder (W1) → latent variables h (representation/features) → Classifier (WC) → prediction y; loss = cross entropy]
1. Initialize a NN solving an autoencoding
problem.
2. Train for final task with “few” labels.
30. 30
Autoencoder (AE)
Fig: “Deep Learning Tutorial” Stanford
Dimensionality reduction:
Use the hidden layer as a
feature extractor of any
desired size.
31. 31
Denoising Autoencoder (DAE)
Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. "Extracting and composing robust features
with denoising autoencoders." ICML 2008.
1. Corrupt the input signal, but predict the clean version at its output.
2. Train for final task with “few” labels.
[Figure: corrupted data → Encoder (W1) → h → Decoder (W2) → reconstruction; loss = reconstruction error w.r.t. the clean data]
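The corruption-then-reconstruction recipe can be sketched in a few lines of numpy. Masking noise is used here as one possible corruption; the paper studies several. All names and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, drop_prob=0.3):
    """Masking noise: randomly zero out a fraction of the input entries.
    The clean x remains the reconstruction target."""
    return x * (rng.random(x.shape) >= drop_prob)

def reconstruction_loss(x_clean, x_recon):
    """Mean squared error against the *clean* signal, not the corrupted one."""
    return float(np.mean((x_clean - x_recon) ** 2))

x = np.ones(1000)        # stand-in for a clean input signal
x_noisy = corrupt(x)     # this corrupted copy is what the encoder sees
loss = reconstruction_loss(x, x_noisy)
```

In a real DAE the loss compares the decoder's output to the clean input; here the corrupted signal itself stands in for an untrained reconstruction, so the loss roughly equals the drop probability.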
34. 34
Eigen, David, Dilip Krishnan, and Rob Fergus. "Restoring an image taken through a window covered with dirt or rain." ICCV
2013.
35. 35
#BERT Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep bidirectional
transformers for language understanding." NAACL 2019 (best paper award).
Take a sequence of words as input,
mask out 15% of them, and ask the
system to predict the missing words
(or a distribution over words).
Masked Autoencoder (MAE)
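The masking step above can be sketched with the standard library; this is a simplification, since real BERT sometimes keeps or randomizes the selected tokens instead of always masking them.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace ~15% of the tokens by a mask symbol and record the originals,
    which become the prediction targets at the masked positions."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # position -> original word to predict
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

sentence = ("the quick brown fox jumps over the lazy dog " * 5).split()
masked, targets = mask_tokens(sentence)
```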
37. 37
Spatial verification
Predict the relative position between two image patches.
#RelPos Doersch, Carl, Abhinav Gupta, and Alexei A. Efros. "Unsupervised visual representation learning by context
prediction." ICCV 2015.
38. 38
Spatial verification
#Jigsaw Noroozi, M., & Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV 2016.
Train a neural network to solve a jigsaw puzzle.
39. 39
Spatial verification
#Jigsaw Noroozi, M., & Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV 2016.
Train a neural network to solve a jigsaw puzzle.
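A hypothetical sketch of how a jigsaw training example could be built: shuffle the patches with a permutation from a fixed set and use the permutation's index as the classification label. The real paper uses a 3x3 grid (9 patches) and a subset of the 9! permutations selected for large Hamming distance; this toy uses a 2x2 grid.

```python
import itertools
import random

def jigsaw_example(patches, permutation_set, seed=0):
    """Shuffle the patches with a permutation drawn from a fixed set; the
    permutation's index in the set is the classification label."""
    rng = random.Random(seed)
    label = rng.randrange(len(permutation_set))
    perm = permutation_set[label]
    return [patches[i] for i in perm], label

# a tiny 2x2 grid (4 patches) and a small permutation set
perms = list(itertools.permutations(range(4)))[:10]
shuffled, label = jigsaw_example(["p0", "p1", "p2", "p3"], perms)
```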
41. 41
#RelPos Doersch, Carl, Abhinav Gupta, and Alexei A. Efros. "Unsupervised visual representation learning by context
prediction." ICCV 2015.
What temporal surrogate tasks can you think of?
42. 42
Temporal coherence
Take temporal order as the supervisory signal for learning
Shuffled
sequences
Binary classification
In order
Not in order
#Shuffle&Learn Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using
temporal order verification." ECCV 2016. [code] (Slides by Xunyu Lin):
43. 43
Temporal coherence
#Shuffle&Learn Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using
temporal order verification." ECCV 2016. [code] (Slides by Xunyu Lin):
Temporal order of frames is
exploited as the supervisory
signal for learning.
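Building the positive (in-order) and negative (shuffled) training tuples can be sketched in plain Python. This loosely follows the Shuffle & Learn sampling scheme; the exact frame-selection heuristics in the paper differ, so treat the `far` index rule as an assumption of this sketch.

```python
def order_verification_examples(frames):
    """Positives: three consecutive frames in their true order (label 1).
    Negatives: same outer frames with the middle one swapped for a
    temporally distant frame (label 0)."""
    positives, negatives = [], []
    half = len(frames) // 2
    for i in range(len(frames) - 2):
        positives.append(((frames[i], frames[i + 1], frames[i + 2]), 1))
        far = (i + half) % len(frames)   # an index well outside the window
        if far not in (i, i + 1, i + 2):
            negatives.append(((frames[i], frames[far], frames[i + 2]), 0))
    return positives + negatives

examples = order_verification_examples(list(range(10)))
```

A binary classifier trained on these tuples learns the "in order / not in order" decision shown on the slide.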
44. 44
Temporal verification
#Odd-one-out Fernando, Basura, Hakan Bilen, Efstratios Gavves, and Stephen Gould. "Self-supervised video
representation learning with odd-one-out networks." ICCV 2017
Train a network to detect which of the video sequences contains frames in the wrong order.
45. 45
Temporal sorting
Lee, Hsin-Ying, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. "Unsupervised representation learning by sorting
sequences." ICCV 2017.
Sort the sequence of frames.
46. 46
#T-CAM Wei, Donglai, Joseph J. Lim, Andrew Zisserman, and William T. Freeman. "Learning and using the arrow of time."
CVPR 2018.
Predict whether the video moves forward or backward.
Arrow of time
48. 48
#TimeCycle Xiaolong Wang, Allan Jabri, Alexei A. Efros, “Learning Correspondence from the Cycle-Consistency of
Time”. CVPR 2019. [code] [Slides by Carles Ventura]
Cycle-consistency
Training: track pixels backward and then forward with pixel embeddings, then train a
NN to match the pairs of associated pixel embeddings.
49. 49
#TimeCycle Xiaolong Wang, Allan Jabri, Alexei A. Efros, “Learning Correspondence from the Cycle-Consistency of
Time”. CVPR 2019. [code] [Slides by Carles Ventura]
Cycle-consistency
Inference (video object segmentation): Track pixels across video sequences.
50. 50
#TCC Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman, “Temporal
Cycle-Consistency Learning”. CVPR 2019
Training: Learn video frame embeddings that must satisfy a temporal alignment
between videos depicting the same activity.
Temporal Cycle-Consistency
51. 51
#TCC Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman, “Temporal
Cycle-Consistency Learning”. CVPR 2019
Temporal Cycle-Consistency
52. 52
#TCC Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman, “Temporal
Cycle-Consistency Learning”. CVPR 2019
Application: Anomaly detection.
Temporal Cycle-Consistency
53. 53
#TCC Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman, “Temporal
Cycle-Consistency Learning”. CVPR 2019
Application: Fine-grained retrieval
Temporal Cycle-Consistency
56. 56
Next word prediction (language model)
Figure:
TensorFlow tutorial
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model." Journal
of machine learning research 3, no. Feb (2003): 1137-1155.
Self-supervised
learning
57. 57
#word2vec #continuousbow Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed
representations of words and phrases and their compositionality." NIPS 2013
the cat climbed a tree
Given context:
a, cat, the, tree
Estimate prob. of
climbed
Self-supervised
learning
Missing word prediction (language model)
58. 58
#word2vec #skipgram Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed
representations of words and phrases and their compositionality." NIPS 2013
Self-supervised
learning
the cat climbed a tree
Given word:
climbed
Estimate prob. of context words:
a, cat, the, tree
Context Words Prediction (language model)
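Generating the skip-gram training pairs for the example sentence is a few lines of Python; the window size and pair layout are the standard word2vec setup, sketched without the negative sampling the paper adds on top.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) training pairs as in word2vec's skip-gram:
    the model is trained to predict each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat climbed a tree".split(), window=2)
```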
59. 59
Next sentence prediction (TRUE/FALSE)
#BERT Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep bidirectional
transformers for language understanding." NAACL 2019 (best paper award).
BERT is also pre-trained for a
binarized next sentence
prediction task:
Select sentence pairs (A, B) where, in
50% of the data, B is the sentence
that follows A, and in the remaining
50% B is a random sentence from the
corpus; the model learns to tell the
two cases apart.
60. 60
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using
LSTMs." In ICML 2015. [Github]
Learning video representations (features) by...
Video Frame Prediction
61. 61
(1) frame reconstruction (AE):
Learning video representations (features) by...
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using
LSTMs." ICML 2015. [Github]
Video Frame Prediction
62. 62
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using
LSTMs." ICML 2015. [Github]
Learning video representations (features) by...
(2) frame prediction
Video Frame Prediction
63. 63
Unsupervised learned features (lots of data) are
fine-tuned for activity recognition (small data).
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using
LSTMs." ICML 2015. [Github]
Video Frame Prediction
64. 64
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
Video frame prediction with a ConvNet.
Video Frame Prediction
65. 65
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
The blurry predictions from the MSE (L2) loss are improved with a multi-scale
architecture, adversarial training and an image gradient difference loss (GDL)
function.
Video Frame Prediction
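The gradient difference loss can be sketched directly from its idea: compare the spatial gradients of the predicted and target frames. This numpy version with alpha=1 is a simplified reading of the paper's formulation; the toy images are assumptions.

```python
import numpy as np

def gradient_difference_loss(y_true, y_pred, alpha=1):
    """GDL: penalize mismatch between the spatial gradient magnitudes of the
    target frame and of the predicted frame, which sharpens blurry outputs."""
    def gx(im):  # horizontal neighbor differences
        return np.abs(im[:, 1:] - im[:, :-1])
    def gy(im):  # vertical neighbor differences
        return np.abs(im[1:, :] - im[:-1, :])
    return float((np.abs(gx(y_true) - gx(y_pred)) ** alpha).sum()
                 + (np.abs(gy(y_true) - gy(y_pred)) ** alpha).sum())

sharp = np.kron(np.array([[0.0, 1.0]]), np.ones((4, 2)))  # vertical edge
blurry = np.full_like(sharp, 0.5)    # an "average everything" prediction
```

An MSE loss alone would tolerate the blurry prediction; the GDL term explicitly charges it for losing the edge.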
66. 66
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
Video Frame Prediction
67. 67
#DrNet Denton, Emily L. "Unsupervised learning of disentangled representations from video." NIPS 2017.
The model learns to disentangle (“separate”) the visual features that correspond
to the:
Object Pose
(wrt the camera)
Object Content
(class)
Frame Prediction + Disentangled features
68. 68
#DrNet Denton, Emily L. "Unsupervised learning of disentangled representations from video." NIPS 2017.
#MCNet R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence
prediction. In ICLR, 2017
100 step video generation on KTH
dataset where:
● green frames indicate
reconstructions
● red frames indicate frame
predictions.
Frame Prediction + Disentangled features
69. 69
#DDPAE Hsieh, Jun-Ting, Bingbin Liu, De-An Huang, Li F. Fei-Fei, and Juan Carlos Niebles. "Learning to
decompose and disentangle representations for video prediction." NeurIPS 2018.
Decompositional Disentangled Predictive Auto-Encoder (DDPAE):
Frame Prediction + Disentangled features
71. 71
Zhang, R., Isola, P., & Efros, A. A. Colorful image colorization. ECCV 2016.
Colorization (still image)
Surrogate task: Colorize a gray scale image.
72. 72
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by
colorizing videos." ECCV 2018. [blog]
Pixel tracking
Pre-training: a NN is trained to point to the location in a reference frame with the most
similar color...
73. 73
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by
colorizing videos." ECCV 2018. [blog]
Pre-training: a NN is trained to point to the location in a reference frame with the most
similar color… computing a loss with respect to the actual color.
Pixel tracking
74. 74
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by
colorizing videos." ECCV 2018. [blog]
For each grayscale pixel to colorize, find the color of most similar pixel embedding
(representation) from the first frame.
Pixel tracking
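The "find the most similar pixel embedding in the first frame and copy its color" step can be sketched as a nearest-neighbour lookup. The paper uses a soft attention over the reference frame; this hard argmax and the toy embeddings are simplifications.

```python
import numpy as np

def copy_colors(ref_embed, ref_colors, tgt_embed):
    """For each target-pixel embedding, copy the color of the most similar
    reference-frame pixel (cosine similarity, hard argmax)."""
    ref = ref_embed / np.linalg.norm(ref_embed, axis=1, keepdims=True)
    tgt = tgt_embed / np.linalg.norm(tgt_embed, axis=1, keepdims=True)
    nearest = (tgt @ ref.T).argmax(axis=1)   # most similar reference pixel
    return ref_colors[nearest]

ref_embed = np.array([[1.0, 0.0], [0.0, 1.0]])   # two reference pixels
ref_colors = np.array([10, 200])                 # their color values
tgt_embed = np.array([[0.9, 0.1], [0.2, 0.8]])   # two grayscale pixels
colors = copy_colors(ref_embed, ref_colors, tgt_embed)
```

Because color is copied through embedding similarity, good embeddings must implicitly track the same object point, which is why tracking "emerges" from this objective.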
75. 75
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by
colorizing videos." ECCV 2018. [blog]
Pixel tracking
Application: colorization from a reference frame.
Reference frame Grayscale video Colorized video
76. 76
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by
colorizing videos." ECCV 2018. [blog]
Application: video object segmentation from an object mask
Pixel tracking
79. Hadsell, Raia, Sumit Chopra, and Yann LeCun. "Dimensionality reduction by learning an invariant mapping." CVPR 2006.
Learn G_W(X) with pairs (X1, X2) which are similar (Y=0) or dissimilar (Y=1).
Contrastive Loss
similar pair (X1, X2) dissimilar pair (X1, X2)
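The Hadsell et al. contrastive loss can be written directly from its definition: similar pairs are pulled together by their squared distance, dissimilar pairs are pushed apart up to a margin. A numpy sketch with illustrative inputs:

```python
import numpy as np

def contrastive_loss(x1, x2, y, margin=1.0):
    """L = (1-Y) * d^2 / 2 + Y * max(0, margin - d)^2 / 2,
    with d the Euclidean distance between the two embeddings."""
    d = np.linalg.norm(np.asarray(x1, float) - np.asarray(x2, float))
    return float(0.5 * ((1 - y) * d ** 2 + y * max(0.0, margin - d) ** 2))

close = (np.array([0.0, 0.0]), np.array([0.1, 0.0]))
l_sim = contrastive_loss(*close, y=0)  # similar pair already close: small loss
l_dis = contrastive_loss(*close, y=1)  # dissimilar pair inside margin: penalized
```

A dissimilar pair already farther apart than the margin contributes zero loss, so the embedding is not pushed apart indefinitely.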
80. 80
Predictive Learning
#CPC Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. "Representation learning with contrastive predictive coding." arXiv
preprint arXiv:1807.03748 (2018).
81. 81
Predictive Learning
#CPCv2 Hénaff, Olivier J., Ali Razavi, Carl Doersch, S. M. Eslami, and Aaron van den Oord. "Data-efficient image recognition
with contrastive predictive coding." arXiv preprint arXiv:1905.09272 (2019).
Patch features (in blue) are
aggregated in the context
network (in red) to predict
the features extracted from
other parts of the image.
83. The Gelato Bet
"If, by the first day of autumn (Sept 23) of 2015, a
method will exist that can match or beat the
performance of R-CNN on Pascal VOC detection,
without the use of any extra, human annotations
(e.g. ImageNet) as pre-training, Mr. Malik promises to
buy Mr. Efros one (1) gelato (2 scoops: one chocolate,
one vanilla)."
Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)
84. Contrastive loss between:
● Encoded features of a query
image.
● A running queue of encoded
keys.
Defining losses over
representations instead of pixels
facilitates abstraction.
#MoCo He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. . Momentum Contrast for Unsupervised Visual Representation Learning.
CVPR 2020. [code]
Momentum Contrast (MoCo)
85. Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. CVPR
2018.
Pretext task: Treat each image instance as a distinct class of its own and train a
classifier to distinguish between individual instance classes (images).
Momentum Contrast (MoCo)
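MoCo's two mechanical ingredients, the momentum-updated key encoder and the FIFO queue of encoded keys, can be sketched with scalars standing in for real parameters and keys. This is an illustration of the bookkeeping only, not the paper's implementation.

```python
import collections
import numpy as np

class MomentumQueue:
    """Sketch of MoCo's machinery: a key encoder updated as a momentum
    (exponential moving) average of the query encoder, plus a FIFO queue
    of encoded keys that serves as a large pool of negatives."""
    def __init__(self, dim, size, m=0.999):
        self.m = m
        self.key_weights = np.zeros(dim)  # stand-in for key-encoder params
        self.queue = collections.deque(maxlen=size)

    def update(self, query_weights, new_keys):
        # momentum update: the key encoder drifts slowly, so keys already
        # sitting in the queue stay consistent with fresh ones
        self.key_weights = (self.m * self.key_weights
                            + (1 - self.m) * query_weights)
        self.queue.extend(new_keys)       # newest keys in, oldest keys out

q = MomentumQueue(dim=2, size=3, m=0.9)
q.update(np.ones(2), new_keys=[1, 2, 3, 4])
```

The bounded queue is what decouples the number of negatives from the batch size, which is MoCo's main practical advantage over in-batch negatives.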
86. 86
Transfer Learning (super. IN-1M)
Source data
E.g. ImageNet
Source model
Source labels
Target data
E.g. PASCAL
Target model
Target labels
Transfer Learned
Knowledge
Large
amount of
data/labels
Small
amount of
data/labels
Slide credit: Kevin McGuinness
87. 87
Transfer Learning (MoCo IN-1M)
Source data
E.g. ImageNet
Source model
Source labels
Target data
E.g. PASCAL
Target model
Target labels
Transfer Learned
Knowledge
Large
amount of
data
Small
amount of
data/labels
Slide concept: Kevin McGuinness
88. 88
Transfer Learning (super. IN-1M vs MoCo IN-1M)
Transferring to VOC object detection: a Faster R-CNN detector (C4-backbone)
is fine-tuned end-to-end.
90. 90
Correlated views by data augmentation
#SimCLR Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. "A simple framework for contrastive
learning of visual representations." arXiv preprint arXiv:2002.05709 (2020). [tweet]
92. 92
Correlated views by data augmentation
Original image cc-by: Von.grzanka
#SimCLR Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. "A simple framework for contrastive
learning of visual representations." arXiv preprint arXiv:2002.05709 (2020). [tweet]
93. 93
#SimCLR Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. "A simple framework for contrastive
learning of visual representations." arXiv preprint arXiv:2002.05709 (2020). [tweet]
Correlated views by data augmentation
ImageNet Linear Classification: unsupervised learned features are frozen and a
supervised linear classifier is trained on top.
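SimCLR's contrastive objective (NT-Xent) over the two augmented views of each image can be sketched in numpy. The batch layout (rows 2k and 2k+1 are the two views of image k) and the temperature value are assumptions of this sketch.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent: for each embedding, maximize the softmax probability of its
    positive (the other view of the same image) against all other samples."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = (z @ z.T) / tau
    np.fill_diagonal(sim, -np.inf)        # a sample is never its own pair
    pos = np.arange(len(z)) ^ 1           # rows 2k and 2k+1 point at each other
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(z)), pos].mean())

# two images, two views each: aligned views give a lower loss than mixed-up ones
aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
shuffled = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```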
94. 94
#MoCo2 Chen, Xinlei, Haoqi Fan, Ross Girshick, and Kaiming He. "Improved Baselines with Momentum Contrastive
Learning." arXiv preprint arXiv:2003.04297 (2020).
MoCo + data augmentation + others
ImageNet Linear Classification: unsupervised learned features are
frozen and a supervised linear classifier is trained on top.
95. 95
#VINCE Gordon, Daniel, Kiana Ehsani, Dieter Fox, and Ali Farhadi. "Watching the World Go By: Representation Learning
from Unlabeled Videos." arXiv preprint arXiv:2003.07990 (2020).
Video Noise Contrastive estimation
Videos provide data augmentation for free.
96. 96
#VINCE Gordon, Daniel, Kiana Ehsani, Dieter Fox, and Ali Farhadi. "Watching the World Go By: Representation Learning
from Unlabeled Videos." arXiv preprint arXiv:2003.07990 (2020).
Video Noise Contrastive estimation
98. Object candidates + Time
Pirk, Sören, Mohi Khansari, Yunfei Bai, Corey Lynch, and Pierre Sermanet. "Online Object Representations with Contrastive
Learning." Learning from Unlabeled Videos (LUV) Workshop, CVPR 2019.
99. Multiview + Time
#TCN Sermanet, Pierre, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google
Brain. "Time-contrastive networks: Self-supervised learning from video." ICRA 2018.
Pre-Training: Learn features with a
triplet loss by:
● Positive: Same time-stamp
from different cameras for
view invariance.
● Negative: Other frames from
the same camera to be
sensitive to temporal ordering.
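The triplet objective described above can be written down directly: the anchor must end up closer to the positive (same timestamp, other view) than to the negative (other timestamp, same view) by a margin. A numpy sketch with illustrative embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative
    squared distances: zero once the margin is satisfied."""
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, d_pos - d_neg + margin)

a = np.zeros(3)
easy = triplet_loss(a, positive=np.zeros(3), negative=np.ones(3))        # satisfied
hard = triplet_loss(a, positive=np.zeros(3), negative=0.1 * np.ones(3))  # violated
```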
100. Inference: Imitation learning for a
robotic arm.
Multiview + Time
#TCN Sermanet, Pierre, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google
Brain. "Time-contrastive networks: Self-supervised learning from video." ICRA 2018.
101. #TCN Sermanet, Pierre, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google
Brain. "Time-contrastive networks: Self-supervised learning from video." ICRA 2018.
106. 106
Further readings
● Jeremy Howard, “Self-supervised learning and computer vision” (fast.ai, 2020)
● Zhang, L., Qi, G. J., Wang, L., & Luo, J. (2019). AET vs. AED: Unsupervised representation learning
by auto-encoding transformations rather than data. CVPR 2019.
● Tian, Yonglong, Dilip Krishnan, and Phillip Isola. "Contrastive Multiview Coding." arXiv preprint
arXiv:1906.05849 (2019).
● Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. (2018). Unsupervised feature learning via non-parametric
instance discrimination. CVPR 2018.
○ Each image is its own class.
○ Memory bank.
○ Triplet loss + Softmax
109. 109
Temporal regularization: ISA
Le, Quoc V., Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. "Learning hierarchical invariant spatio-temporal features for
action recognition with independent subspace analysis." CVPR 2011
Video features are learned with convolutions and poolings combined with an
Independent Subspace Analysis (ISA).
110. 110
Le, Quoc V., Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. "Learning hierarchical invariant spatio-temporal features for
action recognition with independent subspace analysis." CVPR 2011
Feature visualizations.
Temporal regularization: ISA
111. 111
Assumption: adjacent video frames contain semantically similar information.
Autoencoder trained with slowness and sparsity regularizers.
Goroshin, Ross, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. "Unsupervised learning of spatiotemporally
coherent metrics." ICCV 2015.
Temporal regularization: Slowness
112. 112
Jayaraman, Dinesh, and Kristen Grauman. "Slow and steady feature analysis: higher order temporal coherence in video."
CVPR 2016. [video]
Slow feature analysis
● Temporal coherence assumption: features
should change slowly over time in video
Steady feature analysis
● Second order changes also small: changes in
the past should resemble changes in the
future
Train on triplets of frames from video
Loss encourages nearby frames to have slow
and steady features, and far frames to have
different features
Temporal regularization: Slowness
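The slow and steady penalties on a sequence of per-frame features can be sketched as first- and second-difference terms; the exact losses in the papers differ (they work on pairs/triplets with contrastive terms), so this is only the core idea.

```python
import numpy as np

def slow_steady_penalty(z, alpha=1.0):
    """z has shape (T, d): one feature vector per frame.
    slow   -> first differences (change) should be small,
    steady -> second differences (change of change) should be small."""
    slow = float(np.sum((z[1:] - z[:-1]) ** 2))
    steady = float(np.sum((z[2:] - 2 * z[1:-1] + z[:-2]) ** 2))
    return slow + alpha * steady

constant = np.ones((5, 2))                             # perfectly slow
linear = np.linspace(0.0, 1.0, 5)[:, None] * np.ones((1, 2))  # steady drift
jumpy = np.array([[0.0], [1.0], [0.0], [1.0], [0.0]])  # neither
```

A constant feature incurs no penalty, a linear drift pays only the slowness term, and an oscillating feature pays both.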