Recent Breakthroughs in AI
- Clubhouse Podcasts
YouTube Link: https://www.youtube.com/watch?v=3OxEpGU1unA
Presenter: Sangmin Woo
2021.03.10
Speakers: Andrej Karpathy, Justin Johnson, Lex Fridman, Richard Socher, Russell Kaplan
 Rise of multimodal learning: CLIP and DALL-E
• CLIP efficiently learns visual concepts from natural language supervision
• DALL-E creates images from text captions for a wide range of concepts expressible in natural language
 ‘Data’ is KING: the importance of data and datasets
• Academia: given a dataset, build a more powerful model vs. reality (industry): given a model, collect/generate the dataset
• In fact, much of the innovation comes from data, not the model
• Data curation & MLOps will become more important
 Will Transformers overtake CNNs? And towards "generalized neural substrates"
• Image = CNN, sequence = RNN → everything = Transformer (consolidation of architectures)
• 2020 was the year of the Transformer; all we need is the Transformer!
 Lifelong learning (need to consider catastrophic forgetting, semantic shift, …)
• Benchmarking is difficult, since the tasks and models all differ from the previous SOTA
• Then why not fix the model? Model-first benchmark design!
 Taking hard data structures and "softening" them to make them differentiable
• The Transformer is a softened version of a hash table
• What would be the next-generation data structure?
Talk Summary https://www.youtube.com/watch?v=3OxEpGU1unA
Learning Visual-Linguistic
Representation in the Wild
Presenter: Sangmin Woo
2021.03.10
CLIP – OpenAI / DALL-E – OpenAI / ALIGN – Google Research
(+ UniT – Facebook AI Research)
Blog Link: https://openai.com/blog/clip/
“Scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on
a great variety of image classification datasets”
 CLIP is trained on 400M (image, text) pairs found across the internet.
 Given an image, CLIP predicts which of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset.
 CLIP learns to recognize a wide variety of visual concepts in images and associate them with
their names.
 CLIP models can then be applied to nearly arbitrary visual classification tasks.
Summary
 Current approaches have several major problems:
 datasets are labor-intensive and costly to create
 models are good at one task and one task only
 models that perform well on benchmarks can have disappointingly poor performance on real-world data
 CLIP (Contrastive Language–Image Pre-training) aims to address these problems:
• It is trained on images paired with natural-language supervision that is abundantly available on the internet.
• It can be instructed in natural language to perform several classification benchmarks, without directly
optimizing for the benchmark’s performance (similar to the “zero-shot” capabilities of GPT-3).
• It matches the accuracy of the original ResNet-50 on ImageNet zero-shot without using any of the
1.28M training examples.
Introduction
 Both models show the same accuracy on the ImageNet test set.
 In non-ImageNet settings, CLIP significantly outperforms the ImageNet-trained model.
 ObjectNet checks a model’s ability to recognize
objects in many different poses and with many
different backgrounds inside homes.
 ImageNet Rendition and ImageNet Sketch check
a model’s ability to recognize more abstract
depictions of objects.
ImageNet ResNet-101 vs. CLIP ViT-L
 CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training
examples.
 At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of
the target dataset’s classes.
Approach
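The two steps above can be condensed into a short PyTorch-style sketch (illustrative only, not OpenAI's implementation; the fixed temperature, embedding size, and the random tensors standing in for encoder outputs are assumptions):

```python
# Minimal PyTorch-style sketch of CLIP's two steps (illustrative, not OpenAI's code):
# (1) contrastive pre-training on a batch of matched (image, text) pairs, and
# (2) zero-shot classification using embedded class names as classifier weights.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the NxN cosine-similarity matrix;
    the diagonal entries are the correct (matched) pairings.
    (In CLIP the temperature is a learned parameter; a constant is assumed here.)"""
    image_emb = F.normalize(image_emb, dim=-1)           # (N, D)
    text_emb = F.normalize(text_emb, dim=-1)             # (N, D)
    logits = image_emb @ text_emb.t() / temperature      # (N, N)
    targets = torch.arange(logits.size(0))               # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +           # image -> text
            F.cross_entropy(logits.t(), targets)) / 2    # text -> image

def zero_shot_classify(image_emb, class_text_emb):
    """Class names/descriptions embedded by the text encoder act as the
    weights of a linear classifier; pick the most similar class per image."""
    image_emb = F.normalize(image_emb, dim=-1)            # (N, D)
    class_text_emb = F.normalize(class_text_emb, dim=-1)  # (C, D)
    return (image_emb @ class_text_emb.t()).argmax(dim=-1)

# Toy usage with random tensors standing in for encoder outputs.
img = torch.randn(8, 512)    # image_encoder(images)
txt = torch.randn(8, 512)    # text_encoder(captions)
cls = torch.randn(10, 512)   # text_encoder(["a photo of a {class}", ...])
print(contrastive_loss(img, txt), zero_shot_classify(img, cls))
```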
 Random, non-cherry-picked predictions of zero-shot CLIP classifiers on examples from various datasets:
Qualitative Examples
 CLIP is highly efficient
• CLIP learns from unfiltered, highly varied, and highly noisy data, and is intended to be used
in a zero-shot manner.
• Like GPT-2 and GPT-3, CLIP can achieve compelling zero-shot performance; however, this requires significant training compute.
• Two algorithmic choices to save compute:
 a contrastive objective for connecting text with images
 the Vision Transformer, which gives a 3x gain in compute efficiency over a standard ResNet
Key takeaways
 An image-to-caption Transformer model struggled at zero-shot transfer: it achieves only 16% accuracy on ImageNet after training on 400M images.
 CLIP is much more efficient and achieves the same accuracy roughly 12x faster.
Key takeaways
 CLIP is flexible and general
• CLIP models are more flexible and general than ImageNet models because they learn a wide range of visual concepts directly from natural language. They are able to perform many different tasks zero-shot.
• CLIP's zero-shot performance has been validated on over 30 different datasets, including tasks such as fine-grained object classification, geo-localization, action recognition in videos, and OCR.
• Learning OCR is an exciting behavior that does not occur in standard ImageNet models.
• The best CLIP model outperforms the best publicly available ImageNet model, the Noisy Student
EfficientNet-L2, on 20 out of 26 different transfer datasets.
Key takeaways
 Across 27 tasks such as fine-
grained object classification, OCR,
activity recognition in videos, and
geo-localization, CLIP models
learn more widely useful image
representations.
Key takeaways
 While CLIP usually performs well at recognizing common objects, it struggles with counting the number of objects in an image and with predicting how close the nearest car is in a photo.
 Zero-shot CLIP also struggles compared to task specific models on very fine-grained
classification, such as telling the difference between car models, variants of aircraft, or flower
species.
 CLIP also still generalizes poorly to images not covered in its pre-training dataset. For instance, although CLIP learns a capable OCR system, when evaluated on the MNIST dataset, zero-shot CLIP achieves only 88% accuracy, well below the 99.75% of humans on the dataset.
Limitations
Blog Link: https://openai.com/blog/dall-e/
YouTube: https://www.youtube.com/watch?v=az-OV47oKvA
(for a more detailed and friendly explanation)
“DALL-E is a 12B-parameter autoregressive (AR) Transformer trained to generate images from text descriptions in a zero-shot manner, using 250M text–image pairs collected from the internet”
 DALL-E achieves high-quality image generation on the MS-COCO dataset zero-shot, without using any of the training labels.
 Its samples are preferred over prior work trained on the dataset by human evaluators 90% of the time.
 It also enables image-to-image translation.
Summary
DALL-E = Salvador Dalí + WALL-E
 GPT-3: text generation
 Image GPT: image generation
 Jukebox: music generation
 DALL-E extends these findings to show that manipulating visual concepts through language is now within reach.
 DALL-E can
• create anthropomorphized versions of animals and objects
• combine unrelated concepts in plausible ways
• render text
• apply transformations to existing images
 Qualitative examples: https://openai.com/blog/dall-e/
Introduction
Overview
 The goal is to train a transformer to autoregressively model the text and image tokens as a
single stream of data.
 However, using pixels directly as image tokens would require an inordinate amount of memory
for high-resolution images.
 A discrete variational autoencoder (dVAE) is trained to compress each 256×256 RGB image
into a 32×32 grid of image tokens, each element of which can assume 8192 possible values.
 256 BPE-encoded text tokens are concatenated with the 32×32 = 1,024 image tokens, and an autoregressive transformer is trained to model the joint distribution over the text and image tokens (a minimal sketch of the combined token stream follows below).
Approach
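A minimal sketch of the combined token stream described above; the constant and function names are hypothetical, and the text vocabulary size is an assumption — only the 256-text-token / 1,024-image-token split comes from the slide:

```python
# Illustrative sketch of how DALL-E forms a single autoregressive token stream
# (hypothetical constants/helpers; only the 256-text / 1,024-image split is from the slide).
import torch

TEXT_LEN = 256            # BPE-encoded text tokens
IMAGE_GRID = 32 * 32      # dVAE image tokens: a 32x32 grid -> 1,024 tokens
TEXT_VOCAB = 16384        # assumed text vocabulary size for this sketch
IMAGE_VOCAB = 8192        # dVAE codebook size (8,192 possible values per token)

def build_stream(text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate text and image tokens into one sequence; image ids are offset
    so the two vocabularies do not collide in the transformer's embedding table."""
    assert text_tokens.shape[-1] == TEXT_LEN and image_tokens.shape[-1] == IMAGE_GRID
    return torch.cat([text_tokens, image_tokens + TEXT_VOCAB], dim=-1)  # (B, 1280)

# The transformer is then trained to predict token t+1 given tokens <= t over this stream.
text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))      # tokenizer output (padded to 256)
image = torch.randint(0, IMAGE_VOCAB, (2, IMAGE_GRID))  # dVAE encoder output
print(build_stream(text, image).shape)                  # torch.Size([2, 1280])
```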
 VQ-VAE (Vector Quantized Variational AutoEncoder) for image compression
Approach
Oord et al., Neural Discrete Representation Learning
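For background, a minimal sketch of the VQ-VAE quantization step (nearest-codebook lookup with a straight-through gradient), following Oord et al.; the tensor shapes and names are illustrative:

```python
# Minimal sketch of the VQ-VAE quantization step (nearest-codebook lookup with a
# straight-through gradient), after Oord et al.; shapes and names are illustrative.
import torch

def vector_quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """z_e: encoder output (B, H, W, D); codebook: (K, D) learned embedding vectors.
    Returns the quantized features and the discrete code indices."""
    flat = z_e.reshape(-1, z_e.shape[-1])           # (B*H*W, D)
    dists = torch.cdist(flat, codebook)             # (B*H*W, K) pairwise L2 distances
    indices = dists.argmin(dim=-1)                  # index of the nearest codebook entry
    z_q = codebook[indices].reshape(z_e.shape)      # snap each feature to its code vector
    z_q = z_e + (z_q - z_e).detach()                # straight-through gradient estimator
    return z_q, indices.reshape(z_e.shape[:-1])

# Toy usage: a 32x32 grid of 64-dim features quantized against an 8,192-entry codebook.
z_e = torch.randn(1, 32, 32, 64)
codebook = torch.randn(8192, 64)
z_q, codes = vector_quantize(z_e, codebook)
print(z_q.shape, codes.shape)  # (1, 32, 32, 64) and (1, 32, 32)
```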
 Gumbel Softmax
Approach
Jang et al., Categorical Reparameterization with Gumbel-Softmax
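A minimal sketch of Gumbel-Softmax sampling as in Jang et al., which lets the dVAE draw approximately one-hot codebook assignments in a differentiable way (variable names are illustrative):

```python
# Minimal sketch of Gumbel-Softmax sampling (Jang et al.): draw approximately one-hot
# samples from a categorical distribution while keeping gradients w.r.t. the logits.
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """logits: (..., K) unnormalized log-probabilities over K categories;
    tau: temperature (samples approach one-hot as tau -> 0)."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)

# Toy usage, e.g., relaxing the dVAE's choice among 8,192 codebook entries.
soft_codes = gumbel_softmax_sample(torch.randn(4, 8192), tau=0.5)
print(soft_codes.shape, soft_codes.sum(dim=-1))  # each row sums to 1
```

PyTorch also provides this relaxation as torch.nn.functional.gumbel_softmax.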
 A discrete variational autoencoder (dVAE) is
trained to compress each 256×256 RGB image
into a 32×32 grid of image tokens, each element
of which can assume 8192 possible values.
 The encoder downsamples the spatial resolution by
a factor of 8.
 While details are sometimes lost or distorted, the
main features of the image are still typically
recognizable.
Approach
Qualitative Examples
“ALIGN (A Large-scale ImaGe and Noisy-text embedding) uses a noisy dataset of over 1B image
alt-text pairs, obtained without expensive filtering or post-processing steps, to learn a simple dual-
encoder architecture (image and text) by aligning visual-language representations using a
contrastive loss”
 While representation learning in NLP has transitioned to training on raw text without human
annotations, visual and vision-language representations still rely heavily on curated training
datasets that are expensive or require expert knowledge. This costly curation process limits
the size of datasets and hence hinders the scaling of trained models.
 The scale of the corpus can make up for its noise and leads to state-of-the-art representations
even with such a simple learning scheme.
Summary
 Visual and language representations are jointly learned from noisy image alt-text data and can be used for
vision-only or vision-language task transfer.
 Without any fine-tuning, ALIGN powers cross-modal search including image-to-text search, text-to-image
search and even search with joint image+text queries.
Summary
 The goal is to align the visual-language representations in a shared latent embedding space using a simple dual-encoder architecture (image: EfficientNet, text: BERT).
 Image and text encoders are learned via a contrastive loss (formulated as a normalized softmax) that pushes the embeddings of matched image-text pairs together while pushing those of non-matched pairs apart.
 Considering paired texts as fine-grained labels of images, the image-to-text contrastive loss is analogous to the conventional label-based classification objective; the key difference is that the text encoder generates the “label” weights.
Approach
 The image (EfficientNet) and text (BERT) encoders are optimized via a contrastive loss (the sum of two normalized softmax losses) that pushes the embeddings of matched image-text pairs (positives) together while pushing those of non-matched pairs (negatives) apart; the two losses are reconstructed below.
• Image-to-text classification loss
• Text-to-image classification loss
Approach
𝑥𝑖: image embedding in the 𝑖-th pair
𝑦𝑗: text embedding in the 𝑗-th pair
𝑁: batch size
𝜎: (learnable) temperature
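The two losses listed above appeared as formula images in the original slide; reconstructed in LaTeX from the symbol definitions (the exact notation in the ALIGN paper may differ slightly):

```latex
L_{\text{i2t}} = -\frac{1}{N}\sum_{i=1}^{N}
  \log \frac{\exp\left(x_i^{\top} y_i / \sigma\right)}
            {\sum_{j=1}^{N} \exp\left(x_i^{\top} y_j / \sigma\right)},
\qquad
L_{\text{t2i}} = -\frac{1}{N}\sum_{i=1}^{N}
  \log \frac{\exp\left(y_i^{\top} x_i / \sigma\right)}
            {\sum_{j=1}^{N} \exp\left(y_i^{\top} x_j / \sigma\right)}
```

The total training objective is the sum of the two.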
Qualitative Examples
“Unified Transformer (UniT) is built upon the transformer encoder-decoder architecture and jointly
learns multiple tasks across different modalities (image & text), ranging from object detection to
language understanding and multimodal reasoning”
 The UniT model encodes each input modality with an encoder and makes predictions for each task with a shared decoder over the encoded input representations, followed by task-specific output heads.
 Compared to previous efforts on multi-task learning with transformers, UniT shares the same model parameters across all tasks instead of separately fine-tuning task-specific models, and handles a much wider variety of tasks across different domains.
 UniT learns 7 tasks jointly over 8 datasets, achieving comparable performance to well-
established prior work on each domain under the same supervision with a compact set of
model parameters.
Summary
 UniT uses an image encoder, a text encoder, and a joint decoder with per-task query embeddings, followed by task-specific heads to make the final outputs for each task (a minimal sketch follows below).
Approach
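A minimal PyTorch-style sketch of this forward pass, under assumed module sizes; it is not the FAIR implementation (the real UniT runs a convolutional backbone before the image encoder and uses task-appropriate heads, e.g., box heads for detection):

```python
# Illustrative UniT-style forward pass: per-modality encoders, a shared decoder
# with task-specific query embeddings, and task-specific output heads.
# Names and sizes are assumptions, not the official implementation.
import torch
import torch.nn as nn

class UniTSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_tasks=7, num_outputs=10):
        super().__init__()
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        # One set of learned queries per task; the decoder itself is shared by all tasks.
        self.task_queries = nn.Parameter(torch.randn(num_tasks, num_queries, d_model))
        # Simplified per-task heads (real UniT uses heads suited to each task).
        self.task_heads = nn.ModuleList(
            [nn.Linear(d_model, num_outputs) for _ in range(num_tasks)])

    def forward(self, image_feats, text_feats, task_id: int):
        # Encode each modality separately, then concatenate along the sequence axis.
        memory = torch.cat(
            [self.image_encoder(image_feats), self.text_encoder(text_feats)], dim=1)
        queries = self.task_queries[task_id].unsqueeze(0).expand(image_feats.size(0), -1, -1)
        decoded = self.decoder(queries, memory)    # shared decoder attends to both modalities
        return self.task_heads[task_id](decoded)   # task-specific output head

# Toy usage with pre-extracted 256-dim image and text features.
model = UniTSketch()
out = model(torch.randn(2, 49, 256), torch.randn(2, 16, 256), task_id=0)
print(out.shape)  # torch.Size([2, 100, 10])
```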
 Among the existing architectures, the Transformer is the most generic because it has less inductive bias than the others.
 A new formula, “large Transformer + large-scale dataset,” has begun to emerge (CLIP: 400M, DALL-E: 250M, ALIGN: 1B+ pairs).
 All we need is data: these recent large-scale studies talk mostly about how they collected/curated the data, not much about the models.
 On several benchmarks, Transformers are replacing CNN-based SOTAs, which were considered the de facto standard in the image domain.
 Also, Transformers are indeed strong at multi-modality.
Wrap up
Thank You
shmwoo9395@{gmail.com, gist.ac.kr}
If you find my presentation interesting and it gives you new inspiration,
please feel free to contact me!
