https://mcv-m6-video.github.io/deepvideo-2020/
Self-supervised techniques define surrogate tasks to train machine learning algorithms without the need for human-generated labels. This lecture reviews the state of the art in the field of computer vision, including the baseline techniques based on visual feature learning from ImageNet data.
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
1. @DocXavi
Module 6 - Day 10 - Lecture 1
Self-supervised
Visual Learning
26th March 2020
Xavier Giró-i-Nieto
Deep Learning for Video lectures:
[https://mcv-m6-video.github.io/deepvideo-2020/]
7. 7
A broader picture of types of learning...
Slide inspired by Alex Graves (Deepmind) at
“Unsupervised Learning Tutorial” @ NeurIPS 2018.
...with a teacher ...without a teacher
Active agent... Reinforcement learning
(with extrinsic reward)
Intrinsic motivation /
Exploration.
Passive agent... Supervised learning Unsupervised learning
11. Unsupervised learning
11
● It is how intelligent beings naturally perceive
the world.
● It can save enormous labeling effort when building
an AI agent, compared to a fully supervised approach.
● Vast amounts of unlabelled data.
12. 12
Video lectures on Unsupervised Learning
Kevin McGuinness, UPC DLCV 2016 Xavier Giró, UPC DLAI 2017
16. 16
Self-supervised learning
Reference: Andrew Zisserman (PAISS 2018)
Self-supervised learning is a form of unsupervised learning where the data
provides the supervision.
● A pretext (or surrogate) task must be invented by withholding a part of the
unlabeled data and training the NN to predict it.
Unlabeled data
(X)
17. 17
Self-supervised learning
Reference: Andrew Zisserman (PAISS 2018)
Self-supervised learning is a form of unsupervised learning where the data
provides the supervision.
● By defining a proxy loss, the NN learns representations, which should be
valuable for the actual downstream task.
[Figure: the network's prediction ŷ feeds a proxy loss; representations are learned without labels]
18. 18
Transfer Learning
Target data
E.g. PASCAL
Target model
Target labels
Small
amount of
data/labels
Slide credit: Kevin McGuinness
Instead of training a deep network from scratch for your task...
19. 19
Transfer Learning
Source data
E.g. ImageNet
Source model
Source labels
Target data
E.g. PASCAL
Target model
Target labels
Transfer Learned
Knowledge
Large
amount of
data/labels
Small
amount of
data/labels
Slide credit: Kevin McGuinness
Instead of training a deep network from scratch for your task:
● Take a network trained on a different source domain and/or task
● Adapt it for your target domain and/or task
20. 20
Transfer Learning
Slide credit: Kevin McGuinness
Train deep net on “nearby” task using standard
backprop
● Supervised learning (e.g. ImageNet).
● Self-supervised learning (e.g. Language
Model)
Cut off top layer(s) of network and replace with
supervised objective for target domain
Fine-tune network using backprop with labels for
target domain until validation loss starts to
increase
[Figure: a network (conv1–conv3, fc1) is trained on surrogate data with a surrogate loss; the top fc2 + softmax is then replaced by a new my_fc2 + softmax trained on real data and real labels with the real loss]
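The "cut off the top layer(s) and retrain for the target domain" step can be sketched as a linear probe: freeze the pretrained features and train only a new softmax layer. Below is a minimal numpy sketch on toy features; the data, dimensions, and every name are illustrative assumptions, not the lecture's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_linear_probe(feats, labels, n_classes, lr=0.5, steps=200):
    """Train only a new softmax layer on frozen (pretrained) features."""
    W = np.zeros((feats.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)         # softmax probabilities
        grad = (p - onehot) / len(feats)          # cross-entropy gradient
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# pretend these are frozen encoder features for two well-separated classes
feats = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                   rng.normal(1.0, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
W, b = train_linear_probe(feats, labels, n_classes=2)
accuracy = float(((feats @ W + b).argmax(axis=1) == labels).mean())
```

On a real task, the frozen features would come from the pretrained network's penultimate layer, and fine-tuning would additionally unfreeze some backbone layers.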
21. 21
Self-supervised + Transfer Learning
#Shuffle&Learn Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using
temporal order verification." ECCV 2016. [code] (Slides by Xunyu Lin):
26. 26
Autoencoder (AE)
Fig: “Deep Learning Tutorial” Stanford
Autoencoders:
● Predict at the output the
same input data.
● Do not need labels.
27. 27
Autoencoder (AE)
Slide: Kevin McGuinness (DLCV UPC 2017)
[Figure: data → Encoder (W1) → latent variables h (representation/features) → Decoder (W2) → reconstruction; loss = reconstruction error]
1. Initialize a NN by solving an autoencoding
problem.
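Step 1 above (initialize a network by solving an autoencoding problem) can be illustrated with a minimal linear autoencoder in numpy. The toy data, learning rate, and dimensions are assumptions made for this sketch only.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 2-D points that actually lie on a 1-D line (rank-1 structure)
X = np.outer(np.linspace(-1.0, 1.0, 20), np.array([1.0, 1.0]))

W1 = rng.normal(0.0, 0.5, (2, 1))   # encoder: 2 inputs -> 1 latent unit
W2 = rng.normal(0.0, 0.5, (1, 2))   # decoder: 1 latent unit -> 2 outputs

def mse(A, B):
    return float(np.mean((A - B) ** 2))

initial_error = mse(X @ W1 @ W2, X)
for _ in range(3000):
    H = X @ W1                       # latent code h (the learned feature)
    R = H @ W2                       # reconstruction of the input
    G = 2.0 * (R - X) / X.size       # gradient of the MSE w.r.t. R
    W2 -= 0.05 * H.T @ G             # descend on the decoder...
    W1 -= 0.05 * X.T @ (G @ W2.T)    # ...and on the encoder
final_error = mse(X @ W1 @ W2, X)
```

No labels appear anywhere: the input itself is the target, and the learned W1 is what step 2 reuses as an initialization.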
28. 28
Autoencoder (AE)
Slide: Kevin McGuinness (DLCV UPC 2017)
[Figure: data → Encoder (W1) → latent variables h (representation/features) → Classifier (WC) → prediction y; loss = cross entropy]
1. Initialize a NN solving an autoencoding
problem.
2. Train for final task with “few” labels.
30. 30
Autoencoder (AE)
Fig: “Deep Learning Tutorial” Stanford
Dimensionality reduction:
Use the hidden layer as a
feature extractor of any
desired size.
31. 31
Denoising Autoencoder (DAE)
Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. "Extracting and composing robust features
with denoising autoencoders." ICML 2008.
1. Corrupt the input signal, but predict the clean version at its output.
2. Train for final task with “few” labels.
[Figure: corrupted data → Encoder (W1) → h → Decoder (W2) → reconstruction; loss = reconstruction error w.r.t. the clean data]
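The corruption-then-reconstruction recipe can be sketched in a few lines of numpy. Masking noise is used here as one possible corruption; the paper studies several. All names and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, drop_prob=0.3):
    """Masking noise: randomly zero out a fraction of the input entries.
    The clean x remains the reconstruction target."""
    return x * (rng.random(x.shape) >= drop_prob)

def reconstruction_loss(x_clean, x_recon):
    """Mean squared error against the *clean* signal, not the corrupted one."""
    return float(np.mean((x_clean - x_recon) ** 2))

x = np.ones(1000)        # stand-in for a clean input signal
x_noisy = corrupt(x)     # this corrupted copy is what the encoder sees
loss = reconstruction_loss(x, x_noisy)
```

In a real DAE the loss compares the decoder's output to the clean input; here the corrupted signal itself stands in for an untrained reconstruction, so the loss roughly equals the drop probability.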
34. 34
Eigen, David, Dilip Krishnan, and Rob Fergus. "Restoring an image taken through a window covered with dirt or rain." ICCV
2013.
35. 35
#BERT Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep bidirectional
transformers for language understanding." NAACL 2019 (best paper award).
Take a sequence of words as input,
mask out 15% of them, and ask the
system to predict the missing words
(or a distribution over words).
Masked Autoencoder (MAE)
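The masking step above can be sketched with the standard library; this is a simplification, since real BERT sometimes keeps or randomizes the selected tokens instead of always masking them.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace ~15% of the tokens by a mask symbol and record the originals,
    which become the prediction targets at the masked positions."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # position -> original word to predict
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

sentence = ("the quick brown fox jumps over the lazy dog " * 5).split()
masked, targets = mask_tokens(sentence)
```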
37. 37
Spatial verification
Predict the relative position between two image patches.
#RelPos Doersch, Carl, Abhinav Gupta, and Alexei A. Efros. "Unsupervised visual representation learning by context
prediction." ICCV 2015.
38. 38
Spatial verification
#Jigsaw Noroozi, M., & Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV 2016.
Train a neural network to solve a jigsaw puzzle.
39. 39
Spatial verification
#Jigsaw Noroozi, M., & Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV 2016.
Train a neural network to solve a jigsaw puzzle.
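A hypothetical sketch of how a jigsaw training example could be built: shuffle the patches with a permutation from a fixed set and use the permutation's index as the classification label. The real paper uses a 3x3 grid (9 patches) and a subset of the 9! permutations selected for large Hamming distance; this toy uses a 2x2 grid.

```python
import itertools
import random

def jigsaw_example(patches, permutation_set, seed=0):
    """Shuffle the patches with a permutation drawn from a fixed set; the
    permutation's index in the set is the classification label."""
    rng = random.Random(seed)
    label = rng.randrange(len(permutation_set))
    perm = permutation_set[label]
    return [patches[i] for i in perm], label

# a tiny 2x2 grid (4 patches) and a small permutation set
perms = list(itertools.permutations(range(4)))[:10]
shuffled, label = jigsaw_example(["p0", "p1", "p2", "p3"], perms)
```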
41. 41
#RelPos Doersch, Carl, Abhinav Gupta, and Alexei A. Efros. "Unsupervised visual representation learning by context
prediction." ICCV 2015.
What temporal surrogate tasks can you think of?
42. 42
Temporal coherence
Take temporal order as the supervisory signal for learning
Shuffled
sequences
Binary classification
In order
Not in order
#Shuffle&Learn Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using
temporal order verification." ECCV 2016. [code] (Slides by Xunyu Lin):
43. 43
Temporal coherence
#Shuffle&Learn Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using
temporal order verification." ECCV 2016. [code] (Slides by Xunyu Lin):
Temporal order of frames is
exploited as the supervisory
signal for learning.
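Building the positive (in-order) and negative (shuffled) training tuples can be sketched in plain Python. This loosely follows the Shuffle & Learn sampling scheme; the exact frame-selection heuristics in the paper differ, so treat the `far` index rule as an assumption of this sketch.

```python
def order_verification_examples(frames):
    """Positives: three consecutive frames in their true order (label 1).
    Negatives: same outer frames with the middle one swapped for a
    temporally distant frame (label 0)."""
    positives, negatives = [], []
    half = len(frames) // 2
    for i in range(len(frames) - 2):
        positives.append(((frames[i], frames[i + 1], frames[i + 2]), 1))
        far = (i + half) % len(frames)   # an index well outside the window
        if far not in (i, i + 1, i + 2):
            negatives.append(((frames[i], frames[far], frames[i + 2]), 0))
    return positives + negatives

examples = order_verification_examples(list(range(10)))
```

A binary classifier trained on these tuples learns the "in order / not in order" decision shown on the slide.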
44. 44
Temporal verification
#Odd-one-out Fernando, Basura, Hakan Bilen, Efstratios Gavves, and Stephen Gould. "Self-supervised video
representation learning with odd-one-out networks." ICCV 2017
Train a network to detect which of the video sequences contains frames in the wrong order.
45. 45
Temporal sorting
Lee, Hsin-Ying, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. "Unsupervised representation learning by sorting
sequences." ICCV 2017.
Sort the sequence of frames.
46. 46
#T-CAM Wei, Donglai, Joseph J. Lim, Andrew Zisserman, and William T. Freeman. "Learning and using the arrow of time."
CVPR 2018.
Predict whether the video moves forward or backward.
Arrow of time
48. 48
#TimeCycle Xiaolong Wang, Allan Jabri, Alexei A. Efros, “Learning Correspondence from the Cycle-Consistency of
Time”. CVPR 2019. [code] [Slides by Carles Ventura]
Cycle-consistency
Training: track pixels backward and then forward with pixel embeddings, then train a
NN to match the pairs of associated pixel embeddings.
49. 49
#TimeCycle Xiaolong Wang, Allan Jabri, Alexei A. Efros, “Learning Correspondence from the Cycle-Consistency of
Time”. CVPR 2019. [code] [Slides by Carles Ventura]
Cycle-consistency
Inference (video object segmentation): Track pixels across video sequences.
50. 50
#TCC Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman, “Temporal
Cycle-Consistency Learning”. CVPR 2019
Training: Learn video frame embeddings that must satisfy a temporal alignment
between videos depicting the same activity.
Temporal Cycle-Consistency
51. 51
#TCC Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman, “Temporal
Cycle-Consistency Learning”. CVPR 2019
Temporal Cycle-Consistency
52. 52
#TCC Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman, “Temporal
Cycle-Consistency Learning”. CVPR 2019
Application: Anomaly detection.
Temporal Cycle-Consistency
53. 53
#TCC Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman, “Temporal
Cycle-Consistency Learning”. CVPR 2019
Application: Fine-grained retrieval
Temporal Cycle-Consistency
56. 56
Next word prediction (language model)
Figure:
TensorFlow tutorial
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model." Journal
of machine learning research 3, no. Feb (2003): 1137-1155.
Self-supervised
learning
57. 57
#word2vec #continuousbow Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed
representations of words and phrases and their compositionality." NIPS 2013
the cat climbed a tree
Given context:
a, cat, the, tree
Estimate prob. of
climbed
Self-supervised
learning
Missing word prediction (language model)
58. 58
#word2vec #skipgram Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed
representations of words and phrases and their compositionality." NIPS 2013
Self-supervised
learning
the cat climbed a tree
Given word:
climbed
Estimate prob. of context words:
a, cat, the, tree
Context Words Prediction (language model)
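Generating the skip-gram training pairs for the example sentence is a few lines of Python; the window size and pair layout are the standard word2vec setup, sketched without the negative sampling the paper adds on top.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) training pairs as in word2vec's skip-gram:
    the model is trained to predict each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat climbed a tree".split(), window=2)
```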
59. 59
Next sentence prediction (TRUE/FALSE)
#BERT Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep bidirectional
transformers for language understanding." NAACL 2019 (best paper award).
BERT is also pre-trained for a
binarized next sentence
prediction task:
Select sentence pairs (A, B) where, in
50% of the data, B is the sentence
that follows A, and in the remaining
50% B is a random sentence from the
corpus; the model learns to tell the
two cases apart.
60. 60
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using
LSTMs." In ICML 2015. [Github]
Learning video representations (features) by...
Video Frame Prediction
61. 61
(1) frame reconstruction (AE):
Learning video representations (features) by...
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using
LSTMs." ICML 2015. [Github]
Video Frame Prediction
62. 62
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using
LSTMs." ICML 2015. [Github]
Learning video representations (features) by...
(2) frame prediction
Video Frame Prediction
63. 63
Unsupervised learned features (lots of data) are
fine-tuned for activity recognition (small data).
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using
LSTMs." ICML 2015. [Github]
Video Frame Prediction
64. 64
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
Video frame prediction with a ConvNet.
Video Frame Prediction
65. 65
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
The blurry predictions from the MSE (L2) loss are improved with a multi-scale
architecture, adversarial training and an image gradient difference loss (GDL)
function.
Video Frame Prediction
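The gradient difference loss can be sketched directly from its idea: compare the spatial gradients of the predicted and target frames. This numpy version with alpha=1 is a simplified reading of the paper's formulation; the toy images are assumptions.

```python
import numpy as np

def gradient_difference_loss(y_true, y_pred, alpha=1):
    """GDL: penalize mismatch between the spatial gradient magnitudes of the
    target frame and of the predicted frame, which sharpens blurry outputs."""
    def gx(im):  # horizontal neighbor differences
        return np.abs(im[:, 1:] - im[:, :-1])
    def gy(im):  # vertical neighbor differences
        return np.abs(im[1:, :] - im[:-1, :])
    return float((np.abs(gx(y_true) - gx(y_pred)) ** alpha).sum()
                 + (np.abs(gy(y_true) - gy(y_pred)) ** alpha).sum())

sharp = np.kron(np.array([[0.0, 1.0]]), np.ones((4, 2)))  # vertical edge
blurry = np.full_like(sharp, 0.5)    # an "average everything" prediction
```

An MSE loss alone would tolerate the blurry prediction; the GDL term explicitly charges it for losing the edge.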
66. 66
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
Video Frame Prediction
67. 67
#DrNet Denton, Emily L. "Unsupervised learning of disentangled representations from video." NIPS 2017.
The model learns to disentangle (“separate”) the visual features that correspond
to the:
Object Pose
(wrt the camera)
Object Content
(class)
Frame Prediction + Disentangled features
68. 68
#DrNet Denton, Emily L. "Unsupervised learning of disentangled representations from video." NIPS 2017.
#MCNet R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence
prediction. In ICLR, 2017
100 step video generation on KTH
dataset where:
● green frames indicate
reconstructions
● red frames indicate frame
predictions.
Frame Prediction + Disentangled features
69. 69
#DDPAE Hsieh, Jun-Ting, Bingbin Liu, De-An Huang, Li F. Fei-Fei, and Juan Carlos Niebles. "Learning to
decompose and disentangle representations for video prediction." NeurIPS 2018.
Decompositional Disentangled Predictive Auto-Encoder (DDPAE):
Frame Prediction + Disentangled features
71. 71
Zhang, R., Isola, P., & Efros, A. A. Colorful image colorization. ECCV 2016.
Colorization (still image)
Surrogate task: Colorize a gray scale image.
72. 72
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by
colorizing videos." ECCV 2018. [blog]
Pixel tracking
Pre-training: a NN is trained to point to the location in a reference frame with the most
similar color...
73. 73
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by
colorizing videos." ECCV 2018. [blog]
Pre-training: a NN is trained to point to the location in a reference frame with the most
similar color… computing a loss with respect to the actual color.
Pixel tracking
74. 74
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by
colorizing videos." ECCV 2018. [blog]
For each grayscale pixel to colorize, find the color of most similar pixel embedding
(representation) from the first frame.
Pixel tracking
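The "find the most similar pixel embedding in the first frame and copy its color" step can be sketched as a nearest-neighbour lookup. The paper uses a soft attention over the reference frame; this hard argmax and the toy embeddings are simplifications.

```python
import numpy as np

def copy_colors(ref_embed, ref_colors, tgt_embed):
    """For each target-pixel embedding, copy the color of the most similar
    reference-frame pixel (cosine similarity, hard argmax)."""
    ref = ref_embed / np.linalg.norm(ref_embed, axis=1, keepdims=True)
    tgt = tgt_embed / np.linalg.norm(tgt_embed, axis=1, keepdims=True)
    nearest = (tgt @ ref.T).argmax(axis=1)   # most similar reference pixel
    return ref_colors[nearest]

ref_embed = np.array([[1.0, 0.0], [0.0, 1.0]])   # two reference pixels
ref_colors = np.array([10, 200])                 # their color values
tgt_embed = np.array([[0.9, 0.1], [0.2, 0.8]])   # two grayscale pixels
colors = copy_colors(ref_embed, ref_colors, tgt_embed)
```

Because color is copied through embedding similarity, good embeddings must implicitly track the same object point, which is why tracking "emerges" from this objective.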
75. 75
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by
colorizing videos." ECCV 2018. [blog]
Pixel tracking
Application: colorization from a reference frame.
Reference frame Grayscale video Colorized video
76. 76
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by
colorizing videos." ECCV 2018. [blog]
Application: video object segmentation from an object mask
Pixel tracking
79. Hadsell, Raia, Sumit Chopra, and Yann LeCun. "Dimensionality reduction by learning an invariant mapping." CVPR 2006.
Learn G_W(X) with pairs (X1, X2) which are similar (Y=0) or dissimilar (Y=1).
Contrastive Loss
similar pair (X1, X2) dissimilar pair (X1, X2)
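The Hadsell et al. contrastive loss can be written directly from its definition: similar pairs are pulled together by their squared distance, dissimilar pairs are pushed apart up to a margin. A numpy sketch with illustrative inputs:

```python
import numpy as np

def contrastive_loss(x1, x2, y, margin=1.0):
    """L = (1-Y) * d^2 / 2 + Y * max(0, margin - d)^2 / 2,
    with d the Euclidean distance between the two embeddings."""
    d = np.linalg.norm(np.asarray(x1, float) - np.asarray(x2, float))
    return float(0.5 * ((1 - y) * d ** 2 + y * max(0.0, margin - d) ** 2))

close = (np.array([0.0, 0.0]), np.array([0.1, 0.0]))
l_sim = contrastive_loss(*close, y=0)  # similar pair already close: small loss
l_dis = contrastive_loss(*close, y=1)  # dissimilar pair inside margin: penalized
```

A dissimilar pair already farther apart than the margin contributes zero loss, so the embedding is not pushed apart indefinitely.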
80. 80
Predictive Learning
#CPC Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. "Representation learning with contrastive predictive coding." arXiv
preprint arXiv:1807.03748 (2018).
81. 81
Predictive Learning
#CPCv2 Hénaff, Olivier J., Ali Razavi, Carl Doersch, S. M. Eslami, and Aaron van den Oord. "Data-efficient image recognition
with contrastive predictive coding." arXiv preprint arXiv:1905.09272 (2019).
Patch features (in blue) are
aggregated in the context
network (in red) to predict
the features extracted from
other parts of the image.
83. The Gelato Bet
"If, by the first day of autumn (Sept 23) of 2015, a
method will exist that can match or beat the
performance of R-CNN on Pascal VOC detection,
without the use of any extra, human annotations
(e.g. ImageNet) as pre-training, Mr. Malik promises to
buy Mr. Efros one (1) gelato (2 scoops: one chocolate,
one vanilla)."
Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)
84. Contrastive loss between:
● Encoded features of a query
image.
● A running queue of encoded
keys.
Defining losses over
representations instead of pixels
facilitates abstraction.
#MoCo He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. . Momentum Contrast for Unsupervised Visual Representation Learning.
CVPR 2020. [code]
Momentum Contrast (MoCo)
85. Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. CVPR
2018.
Pretext task: Treat each image instance as a distinct class of its own and train a
classifier to distinguish between individual instance classes (images).
Momentum Contrast (MoCo)
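MoCo's two mechanical ingredients, the momentum-updated key encoder and the FIFO queue of encoded keys, can be sketched with scalars standing in for real parameters and keys. This is an illustration of the bookkeeping only, not the paper's implementation.

```python
import collections
import numpy as np

class MomentumQueue:
    """Sketch of MoCo's machinery: a key encoder updated as a momentum
    (exponential moving) average of the query encoder, plus a FIFO queue
    of encoded keys that serves as a large pool of negatives."""
    def __init__(self, dim, size, m=0.999):
        self.m = m
        self.key_weights = np.zeros(dim)  # stand-in for key-encoder params
        self.queue = collections.deque(maxlen=size)

    def update(self, query_weights, new_keys):
        # momentum update: the key encoder drifts slowly, so keys already
        # sitting in the queue stay consistent with fresh ones
        self.key_weights = (self.m * self.key_weights
                            + (1 - self.m) * query_weights)
        self.queue.extend(new_keys)       # newest keys in, oldest keys out

q = MomentumQueue(dim=2, size=3, m=0.9)
q.update(np.ones(2), new_keys=[1, 2, 3, 4])
```

The bounded queue is what decouples the number of negatives from the batch size, which is MoCo's main practical advantage over in-batch negatives.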
86. 86
Transfer Learning (super. IN-1M)
Source data
E.g. ImageNet
Source model
Source labels
Target data
E.g. PASCAL
Target model
Target labels
Transfer Learned
Knowledge
Large
amount of
data/labels
Small
amount of
data/labels
Slide credit: Kevin McGuinness
87. 87
Transfer Learning (MoCo IN-1M)
Source data
E.g. ImageNet
Source model
Source labels
Target data
E.g. PASCAL
Target model
Target labels
Transfer Learned
Knowledge
Large
amount of
data
Small
amount of
data/labels
Slide concept: Kevin McGuinness
88. 88
Transfer Learning (super. IN-1M vs MoCo IN-1M)
Transferring to VOC object detection: a Faster R-CNN detector (C4-backbone)
is fine-tuned end-to-end.
90. 90
Correlated views by data augmentation
#SimCLR Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. "A simple framework for contrastive
learning of visual representations." arXiv preprint arXiv:2002.05709 (2020). [tweet]
92. 92
Correlated views by data augmentation
Original image cc-by: Von.grzanka
#SimCLR Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. "A simple framework for contrastive
learning of visual representations." arXiv preprint arXiv:2002.05709 (2020). [tweet]
93. 93
#SimCLR Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. "A simple framework for contrastive
learning of visual representations." arXiv preprint arXiv:2002.05709 (2020). [tweet]
Correlated views by data augmentation
ImageNet Linear Classification: unsupervised learned features are frozen and a
supervised linear classifier is trained on top.
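SimCLR's contrastive objective (NT-Xent) over the two augmented views of each image can be sketched in numpy. The batch layout (rows 2k and 2k+1 are the two views of image k) and the temperature value are assumptions of this sketch.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent: for each embedding, maximize the softmax probability of its
    positive (the other view of the same image) against all other samples."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = (z @ z.T) / tau
    np.fill_diagonal(sim, -np.inf)        # a sample is never its own pair
    pos = np.arange(len(z)) ^ 1           # rows 2k and 2k+1 point at each other
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(z)), pos].mean())

# two images, two views each: aligned views give a lower loss than mixed-up ones
aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
shuffled = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```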
94. 94
#MoCo2 Chen, Xinlei, Haoqi Fan, Ross Girshick, and Kaiming He. "Improved Baselines with Momentum Contrastive
Learning." arXiv preprint arXiv:2003.04297 (2020).
MoCo + data augmentation + others
ImageNet Linear Classification: unsupervised learned features are
frozen and a supervised linear classifier is trained on top.
95. 95
#VINCE Gordon, Daniel, Kiana Ehsani, Dieter Fox, and Ali Farhadi. "Watching the World Go By: Representation Learning
from Unlabeled Videos." arXiv preprint arXiv:2003.07990 (2020).
Video Noise Contrastive estimation
Videos provide data augmentation for free.
96. 96
#VINCE Gordon, Daniel, Kiana Ehsani, Dieter Fox, and Ali Farhadi. "Watching the World Go By: Representation Learning
from Unlabeled Videos." arXiv preprint arXiv:2003.07990 (2020).
Video Noise Contrastive estimation
98. Object candidates + Time
Pirk, Sören, Mohi Khansari, Yunfei Bai, Corey Lynch, and Pierre Sermanet. "Online Object Representations with Contrastive
Learning." Learning from Unlabeled Videos (LUV) Workshop, CVPR 2019.
99. Multiview + Time
#TCN Sermanet, Pierre, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google
Brain. "Time-contrastive networks: Self-supervised learning from video." ICRA 2018.
Pre-Training: Learn features with a
triplet loss by:
● Positive: Same time-stamp
from different cameras for
view invariance.
● Negative: Other frames from
the same camera to be
sensitive to temporal ordering.
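The triplet objective described above can be written down directly: the anchor must end up closer to the positive (same timestamp, other view) than to the negative (other timestamp, same view) by a margin. A numpy sketch with illustrative embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative
    squared distances: zero once the margin is satisfied."""
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, d_pos - d_neg + margin)

a = np.zeros(3)
easy = triplet_loss(a, positive=np.zeros(3), negative=np.ones(3))        # satisfied
hard = triplet_loss(a, positive=np.zeros(3), negative=0.1 * np.ones(3))  # violated
```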
100. Inference: Imitation learning for a
robotic arm.
Multiview + Time
#TCN Sermanet, Pierre, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google
Brain. "Time-contrastive networks: Self-supervised learning from video." ICRA 2018.
101. #TCN Sermanet, Pierre, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google
Brain. "Time-contrastive networks: Self-supervised learning from video." ICRA 2018.
106. 106
Further readings
● Jeremy Howard, “Self-supervised learning and computer vision” (fast.ai, 2020)
● Zhang, L., Qi, G. J., Wang, L., & Luo, J. (2019). AET vs. AED: Unsupervised representation learning
by auto-encoding transformations rather than data. CVPR 2019.
● Tian, Yonglong, Dilip Krishnan, and Phillip Isola. "Contrastive Multiview Coding." arXiv preprint
arXiv:1906.05849 (2019).
● Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. (2018). Unsupervised feature learning via non-parametric
instance discrimination. CVPR 2018.
○ Each image is its own class.
○ Memory bank.
○ Triplet loss + Softmax
109. 109
Temporal regularization: ISA
Le, Quoc V., Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. "Learning hierarchical invariant spatio-temporal features for
action recognition with independent subspace analysis." CVPR 2011
Video features are learned with convolutions and poolings combined with an
Independent Subspace Analysis (ISA).
110. 110
Le, Quoc V., Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. "Learning hierarchical invariant spatio-temporal features for
action recognition with independent subspace analysis." CVPR 2011
Feature visualizations.
Temporal regularization: ISA
111. 111
Assumption: adjacent video frames contain semantically similar information.
Autoencoder trained with slowness and sparsity regularizers.
Goroshin, Ross, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. "Unsupervised learning of spatiotemporally
coherent metrics." ICCV 2015.
Temporal regularization: Slowness
112. 112
Jayaraman, Dinesh, and Kristen Grauman. "Slow and steady feature analysis: higher order temporal coherence in video."
CVPR 2016. [video]
Slow feature analysis
● Temporal coherence assumption: features
should change slowly over time in video
Steady feature analysis
● Second order changes also small: changes in
the past should resemble changes in the
future
Train on triplets of frames from video
Loss encourages nearby frames to have slow
and steady features, and far frames to have
different features
Temporal regularization: Slowness
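The slow and steady penalties on a sequence of per-frame features can be sketched as first- and second-difference terms; the exact losses in the papers differ (they work on pairs/triplets with contrastive terms), so this is only the core idea.

```python
import numpy as np

def slow_steady_penalty(z, alpha=1.0):
    """z has shape (T, d): one feature vector per frame.
    slow   -> first differences (change) should be small,
    steady -> second differences (change of change) should be small."""
    slow = float(np.sum((z[1:] - z[:-1]) ** 2))
    steady = float(np.sum((z[2:] - 2 * z[1:-1] + z[:-2]) ** 2))
    return slow + alpha * steady

constant = np.ones((5, 2))                             # perfectly slow
linear = np.linspace(0.0, 1.0, 5)[:, None] * np.ones((1, 2))  # steady drift
jumpy = np.array([[0.0], [1.0], [0.0], [1.0], [0.0]])  # neither
```

A constant feature incurs no penalty, a linear drift pays only the slowness term, and an oscillating feature pays both.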