Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the Wild
1. Recent Breakthroughs in AI
- Clubhouse Podcasts
YouTube Link: https://www.youtube.com/watch?v=3OxEpGU1unA
Presenter: Sangmin Woo
2021.03.10
Andrej Karpathy, Justin Johnson, Lex Fridman, Richard Socher, Russell Kaplan
2. Talk Summary
Rise of multimodal learning: CLIP and DALL-E
• CLIP efficiently learns visual concepts from natural language supervision
• DALL-E creates images from text captions for a wide range of concepts expressible in natural language
‘Data’ is KING: the importance of data and datasets
• Academia: given a dataset, build a more powerful model vs. industry reality: given a model, collect/generate a dataset
• In fact, much of the innovation comes from data, not models
• Data curation & MLOps will become more important
Will Transformers overtake CNNs? Towards "generalized neural substrates"
• Image = CNN, sequence = RNN → All = Transformer (consolidation of architectures)
• 2020 was the year of the Transformer; all we need is the Transformer!
Lifelong learning (need to consider catastrophic forgetting & semantic shift, …)
• Benchmarking is difficult, since the tasks & models all differ from the previous SOTA
• Then why not fix the model? Model-first benchmark design!
Taking hard data structures and "softening" them to make them differentiable
• The Transformer is a softened version of a hash table (see the sketch below)
• What would be the next generation data structure?
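To make the "softened hash table" intuition concrete, here is a minimal numpy sketch (an illustration, not from the talk): a hash table returns the value of the one key that matches exactly, while attention returns a softmax-weighted blend of all values, which makes the lookup differentiable.

```python
import numpy as np

def hard_lookup(query, keys, values):
    # Hash-table semantics: return the value of the one exactly-matching key.
    idx = next(i for i, k in enumerate(keys) if np.array_equal(query, k))
    return values[idx]

def soft_lookup(query, keys, values):
    # Attention semantics: score every key against the query, softmax the
    # scores, and return a weighted average of all values; differentiable.
    scores = keys @ query / np.sqrt(len(query))  # scaled dot-product
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

keys = np.eye(3)                                    # three orthogonal keys
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(hard_lookup(keys[0], keys, values))           # exactly values[0]
print(soft_lookup(keys[0], keys, values))           # mostly values[0], blended
```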
3. Learning Visual-Linguistic Representation in the Wild
Presenter: Sangmin Woo
2021.03.10
CLIP – OpenAI / DALL-E – OpenAI / ALIGN – Google Research
(+ UniT – Facebook AI Research)
5. Summary
“Scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on
a great variety of image classification datasets”
CLIP is trained on 400M (image, text) pairs found across the internet.
Given an image, CLIP predicts which of 32,768 randomly sampled text snippets was actually paired with it in the dataset.
CLIP learns to recognize a wide variety of visual concepts in images and associate them with
their names.
CLIP models can then be applied to nearly arbitrary visual classification tasks.
6. Introduction
Current approaches have several major problems:
• Datasets are labor-intensive and costly to create.
• Models are good at one task and one task only.
• Models that perform well on benchmarks can have disappointingly poor performance in real-world settings.
CLIP (Contrastive Language–Image Pre-training) aims to address these problems:
• It is trained on images paired with natural-language supervision that is abundantly available on the internet.
• It can be instructed in natural language to perform several classification benchmarks without directly optimizing for the benchmark's performance (similar to the "zero-shot" capabilities of GPT-3); see the sketch below.
• It matches the accuracy of the original ResNet-50 on ImageNet zero-shot without using any of the
1.28M training examples.
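As an illustration of "instructing in natural language", here is a minimal sketch of zero-shot classification with a CLIP-style model. `encode_image` and `encode_text` are hypothetical stand-ins for pretrained encoders that return L2-normalized vectors in the shared embedding space, and the prompt template is one simple choice among many.

```python
import numpy as np

def zero_shot_classify(image, class_names, encode_image, encode_text):
    # Build a "classifier" purely from text: one prompt per class name.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embs = np.stack([encode_text(p) for p in prompts])  # (C, D)
    img_emb = encode_image(image)                            # (D,)
    # With normalized embeddings, the dot product is cosine similarity;
    # the best-matching prompt gives the predicted class.
    scores = text_embs @ img_emb
    return class_names[int(np.argmax(scores))]
```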
7. ImageNet ResNet-101 vs. CLIP ViT-L
Both models show the same accuracy on the ImageNet test set.
In non-ImageNet settings, CLIP significantly outperforms the ImageNet model.
ObjectNet checks a model's ability to recognize objects in many different poses and with many different backgrounds inside homes.
ImageNet Rendition and ImageNet Sketch check a model's ability to recognize more abstract depictions of objects.
8. Approach
CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training
examples.
At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of
the target dataset’s classes.
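A minimal numpy sketch of the symmetric contrastive objective, in the spirit of the pseudocode in the CLIP paper (the fixed temperature value and the assumption of already-projected, L2-normalized embeddings are simplifications):

```python
import numpy as np

def clip_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric cross-entropy over a batch of N matched (image, text) pairs.

    img_embs, txt_embs: (N, D) arrays, assumed L2-normalized.
    """
    logits = img_embs @ txt_embs.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(logits))                # diagonal entries match

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Classify the right text for each image and the right image for each
    # text, then average the two directions.
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```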
10. Qualitative Examples
Random, non-cherry-picked predictions of zero-shot CLIP classifiers on examples from various datasets.
11. Key takeaways
CLIP is highly efficient
• CLIP learns from unfiltered, highly varied, and highly noisy data, and is intended to be used
in a zero-shot manner.
• CLIP (like GPT-2 and GPT-3) can achieve compelling zero-shot performance; however, this requires significant training compute.
• Two algorithmic choices save compute:
– a contrastive objective for connecting text with images
– a Vision Transformer, which gives a 3x gain in compute efficiency over a standard ResNet
12. Key takeaways
An image-to-caption Transformer model struggled at zero-shot transfer: it achieves only 16% accuracy on ImageNet after training on 400M images.
CLIP is much more efficient and achieves the same accuracy roughly 12x faster.
13. Key takeaways
CLIP is flexible and general
• CLIP models are more flexible and general than ImageNet models because they learn a wide range of visual concepts directly from natural language. They can perform many different tasks zero-shot.
• CLIP's zero-shot performance has been validated on over 30 different datasets, covering tasks such as fine-grained object classification, geo-localization, action recognition in videos, and OCR.
• Learning OCR is an exciting behavior that does not occur in standard ImageNet models.
• The best CLIP model outperforms the best publicly available ImageNet model, the Noisy Student
EfficientNet-L2, on 20 out of 26 different transfer datasets.
14. Key takeaways
Across 27 tasks such as fine-grained object classification, OCR, activity recognition in videos, and geo-localization, CLIP models learn more widely useful image representations.
15. Limitations
While CLIP usually performs well at recognizing common objects, it struggles at counting the number of objects in an image and at predicting how close the nearest car is in a photo.
Zero-shot CLIP also struggles, compared to task-specific models, on very fine-grained classification, such as telling the difference between car models, variants of aircraft, or flower species.
CLIP also generalizes poorly to images not covered in its pre-training dataset. For instance, although CLIP learns a capable OCR system, when evaluated on the MNIST dataset, zero-shot CLIP achieves only 88% accuracy, well below the 99.75% of humans on the dataset.
16. Blog link: https://openai.com/blog/dall-e/
YouTube: https://www.youtube.com/watch?v=az-OV47oKvA (for a more detailed and friendly explanation)
17. Summary
“DALL-E is a 12B-parameter autoregressive (AR) Transformer trained to generate images from text descriptions in a zero-shot manner, using 250M text–image pairs collected from the internet”
DALL-E achieves high-quality image generation on the MS-COCO dataset zero-shot, without using any of its training labels; human evaluators preferred its samples over prior work trained on the dataset 90% of the time.
It also exhibits image-to-image translation capabilities.
DALL-E = Salvador Dalí + WALL-E
18. Introduction
GPT-3: text generation
Image GPT: image generation
Jukebox: music generation
DALL-E extends these findings, showing that manipulating visual concepts through language is now within reach.
DALL-E can
• create anthropomorphized versions of animals and objects
• combine unrelated concepts in plausible ways
• render text
• apply transformations to existing images
Qualitative examples: https://openai.com/blog/dall-e/
20. Approach
The goal is to train a transformer to autoregressively model the text and image tokens as a
single stream of data.
However, using pixels directly as image tokens would require an inordinate amount of memory
for high-resolution images.
A discrete variational autoencoder (dVAE) is trained to compress each 256×256 RGB image
into a 32×32 grid of image tokens, each element of which can assume 8192 possible values.
The 256 BPE-encoded text tokens are concatenated with the 32×32 = 1,024 image tokens, and an autoregressive transformer is trained to model the joint distribution over the text and image tokens.
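The token arithmetic works out to a single 1,280-token sequence. A sketch of the stream construction (the pad id and raster-order flattening are assumptions for illustration; the real model also uses separate text/image vocabularies):

```python
import numpy as np

TEXT_LEN = 256             # BPE text token budget
GRID = 32                  # dVAE output grid is 32x32
IMAGE_LEN = GRID * GRID    # = 1024 image tokens
IMAGE_VOCAB = 8192         # each image token takes one of 8192 values

def build_stream(text_tokens, image_grid, pad_id=0):
    # Fix the text to its 256-token budget, flatten the 32x32 grid of image
    # tokens in raster order, and concatenate into one 1280-token sequence
    # that the transformer models autoregressively.
    text = np.asarray(text_tokens)[:TEXT_LEN]
    text = np.pad(text, (0, TEXT_LEN - len(text)), constant_values=pad_id)
    image = np.asarray(image_grid).reshape(IMAGE_LEN)
    assert image.max() < IMAGE_VOCAB
    return np.concatenate([text, image])

stream = build_stream([5, 17, 42], np.zeros((GRID, GRID), dtype=int))
print(stream.shape)  # (1280,)
```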
21. Approach
VQ-VAE (Vector Quantized Variational AutoEncoder) for image compression
Oord et al., Neural Discrete Representation Learning
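The core of VQ-VAE's bottleneck is a nearest-neighbor lookup into a learned codebook. A minimal numpy sketch (sizes are illustrative; the straight-through gradient estimator and codebook losses used in training are omitted):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map continuous encoder features to discrete codebook indices.

    z:        (H, W, D) encoder output
    codebook: (K, D) learned embedding vectors
    Returns token ids (H, W) and quantized features (H, W, D).
    """
    flat = z.reshape(-1, z.shape[-1])                  # (H*W, D)
    # Squared Euclidean distance from every feature to every codebook entry.
    dists = ((flat**2).sum(1, keepdims=True)
             - 2.0 * flat @ codebook.T
             + (codebook**2).sum(1))
    ids = dists.argmin(axis=1).reshape(z.shape[:2])    # hard lookup
    return ids, codebook[ids]

ids, zq = vector_quantize(np.random.randn(32, 32, 64), np.random.randn(8192, 64))
print(ids.shape, zq.shape)  # (32, 32) (32, 32, 64)
```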
22. Approach
Gumbel Softmax
Jang et al., Categorical Reparameterization with Gumbel-Softmax
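DALL-E's dVAE replaces the hard argmax lookup above with the Gumbel-softmax relaxation, so gradients can flow through the discrete choice. A minimal sketch:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=np.random.default_rng(0)):
    # Adding Gumbel(0, 1) noise to logits and taking argmax draws an exact
    # categorical sample; replacing argmax with a temperature-tau softmax
    # yields a differentiable "soft one-hot" that hardens as tau -> 0.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

print(gumbel_softmax(np.log(np.array([0.7, 0.2, 0.1])), tau=0.5))
```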
23. Approach
A discrete variational autoencoder (dVAE) is
trained to compress each 256×256 RGB image
into a 32×32 grid of image tokens, each element
of which can assume 8192 possible values.
The encoder downsamples the spatial resolution by
a factor of 8.
While details are sometimes lost or distorted, the
main features of the image are still typically
recognizable.
26. Summary
“ALIGN (A Large-scale ImaGe and Noisy-text embedding) uses a noisy dataset of over 1B image alt-text pairs, obtained without expensive filtering or post-processing steps, to learn a simple dual-encoder architecture (image and text) by aligning visual-language representations using a contrastive loss”
While representation learning in NLP has transitioned to training on raw text without human
annotations, visual and vision-language representations still rely heavily on curated training
datasets that are expensive or require expert knowledge. This costly curation process limits
the size of datasets and hence hinders the scaling of trained models.
The scale of the corpus can make up for its noise and leads to state-of-the-art representations
even with such a simple learning scheme.
27. Summary
Visual and language representations are jointly learned from noisy image alt-text data and can be used for
vision-only or vision-language task transfer.
Without any fine-tuning, ALIGN powers cross-modal search including image-to-text search, text-to-image
search and even search with joint image+text queries.
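Because both modalities land in one embedding space, a joint image+text query can be sketched as simple vector arithmetic followed by nearest-neighbor search (an illustrative sketch, assuming all embeddings are L2-normalized; the exact query-combination scheme is an assumption):

```python
import numpy as np

def joint_query_search(img_emb, txt_emb, index_embs, k=5):
    # Combine the two modalities by adding their embeddings, re-normalize,
    # and rank the index by cosine similarity to the combined query.
    q = img_emb + txt_emb
    q /= np.linalg.norm(q)
    scores = index_embs @ q           # (num_items,)
    return np.argsort(-scores)[:k]    # ids of the top-k matches
```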
28. Approach
The goal is to align the visual-language representations in a shared latent embedding space
using a simple dual-encoder architecture (image: EfficientNet, text: BERT)
Image and text encoders are learned via a contrastive loss (formulated as a normalized softmax) that pushes the embeddings of matched image-text pairs together while pushing those of non-matched image-text pairs apart.
Considering paired texts as fine-grained labels of images, the image-to-text contrastive loss is analogous to the conventional label-based classification objective; the key difference is that the text encoder generates the "label" weights.
29. Approach
The image (EfficientNet) and text (BERT) encoders are optimized via a contrastive loss (the sum of two normalized softmax losses) that pushes the embeddings of matched image-text pairs (positives) together while pushing those of non-matched image-text pairs (negatives) apart:
• Image-to-text classification loss
• Text-to-image classification loss
Both losses are written out below, using the following notation:
x_i: image embedding in the i-th pair
y_j: text embedding in the j-th pair
N: batch size
σ: (learnable) temperature
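Reconstructed from the normalized-softmax formulation in the ALIGN paper (the embeddings are assumed L2-normalized):

$$\mathcal{L}_{i2t} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(x_i^{\top} y_i/\sigma)}{\sum_{j=1}^{N}\exp(x_i^{\top} y_j/\sigma)}, \qquad \mathcal{L}_{t2i} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(y_i^{\top} x_i/\sigma)}{\sum_{j=1}^{N}\exp(y_i^{\top} x_j/\sigma)}$$

The model is trained on the sum of the two losses.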
32. Summary
“Unified Transformer (UniT) is built upon the transformer encoder-decoder architecture and jointly
learns multiple tasks across different modalities (image & text), ranging from object detection to
language understanding and multimodal reasoning”
UniT model encodes each input modality with an encoder and makes predictions on each task
with a shared decoder over the encoded input representations, followed by task-specific output
heads
Compared to previous efforts on multi-task learning with transformers, UniT shares the same model parameters across all tasks, instead of separately fine-tuning task-specific models, and handles a much wider variety of tasks across different domains.
UniT learns 7 tasks jointly over 8 datasets, achieving performance comparable to well-established prior work in each domain under the same supervision, with a compact set of model parameters.
34. Approach
UniT uses an image encoder, a text encoder, and a joint decoder with per-task query embedding
followed by task-specific heads to make the final outputs for each task.
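A structural sketch of this layout in PyTorch (layer counts, dimensions, and the task/head sizes are hypothetical; the real UniT uses a detection-style decoder and more elaborate heads):

```python
import torch
from torch import nn

class UniTSketch(nn.Module):
    # One encoder per modality, one shared decoder, per-task learned query
    # embeddings, and small task-specific output heads.
    def __init__(self, d=256, n_queries=100,
                 tasks={"detection": 91, "vqa": 3129}):  # hypothetical sizes
        super().__init__()
        enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), 2)
        self.image_encoder, self.text_encoder = enc(), enc()
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), 2)
        self.task_queries = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(n_queries, d)) for t in tasks})
        self.heads = nn.ModuleDict({t: nn.Linear(d, c) for t, c in tasks.items()})

    def forward(self, task, image_feats=None, text_feats=None):
        # Encode whichever modalities the task uses, concatenate them into
        # one memory, then decode with that task's learned queries.
        memory = torch.cat(
            ([self.image_encoder(image_feats)] if image_feats is not None else [])
            + ([self.text_encoder(text_feats)] if text_feats is not None else []),
            dim=1)
        q = self.task_queries[task].unsqueeze(0).expand(memory.size(0), -1, -1)
        return self.heads[task](self.decoder(q, memory))
```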
35. Wrap up
Among the existing architectures, the Transformer is the most generic, because it has less inductive bias than the others.
A new formula, "large Transformer + large-scale dataset", has begun to emerge (CLIP: 400M, DALL-E: 250M, ALIGN: 1B).
All we need is data: the recent big studies talk mostly about how they collected/curated data, not about their models.
Transformers are replacing CNN-based SOTAs, long considered the de facto standard in the image domain, on several benchmarks.
Also, Transformers are indeed strong at multi-modality.