Transformers
State Of The Art Natural Language Processing
Nilesh Verma
Full Stack Data Scientist
(Amlgo Labs India)
Agenda
• Recent Developments in NLP
• Short History of NLP
• Why Transformers
• Transformers & Their Architecture
• Attention Mechanism
• Workings
• Types of Transformers
• BERT Explained
• Popular State-of-the-Art Language Models
Who I am
• I am Nilesh Verma
• Full Stack Data Scientist at Amlgo Labs India; previously at Xceedance and Samsung AI.
• 2+ years of industry experience.
• AutoWave (audio classification), DeepImageSearch, and DeepTextSearch are some of the interesting Python libraries (open-source contributions) that I developed and maintain.
• 30K+ downloads across these libraries.
• Secured 1st rank in The Great Indian Hiring Hackathon (Nov 2020), based on Foretelling the Retail Price, hosted by MachineHack.
• Placed 3rd in AppScript, a 48-hour hackathon conducted by IEEE APSIT on 6-7 Feb 2021.
• Cleared the NTA-NET and GATE exams on the first attempt.
• B.Sc. and M.Sc. in Computer Science (Gold Medalist).
• Featured in various state-level news coverage for developing real-time COVID-19 detection software based on CT scans.
Recent Developments in NLP
Short History of NLP
• 1954 - Bag of Words (BoW)
• 1972 - TF-IDF
• 2001 - Neural language models (RNN, B-RNN, LSTM)
• 2008 - Multi-Task learning
• 2013 - Word embeddings (Word2Vec)
• 2013 - Neural networks for NLP
• 2014 - Sequence-to-sequence models (Encoder-Decoder)
• 2015 - Attention (For images but found useful for Text too)
• 2017 - Transformer
• 2018 - Pretrained language models (BERT, GPT, T5, etc.)
Transformer?
Why Transformer
• Improved contextual understanding
• Parallelization (faster processing; better utilization of GPU/TPU power)
What is a Transformer?
The Transformer in NLP is a novel architecture that aims to solve sequence-to-
sequence tasks while handling long-range dependencies with ease. It relies entirely
on self-attention to compute representations of its input and
output WITHOUT using sequence-aligned RNNs or convolution.
Transformer Architecture
Transformer Architecture Breakdown
• We see an encoding component, a decoding component, and
connections between them.
Transformer Architecture Breakdown
• The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s
nothing magical about the number six, one can definitely experiment with other arrangements). The
decoding component is a stack of decoders of the same number.
Transformer Architecture Breakdown
• The encoder’s inputs first flow through a self-attention layer
• The outputs of the self-attention layer are fed to a feed-forward neural network.
• The decoder has both those layers, but between them is an attention layer that helps the decoder focus on
relevant parts of the input sentence
Input Preprocessing
• Each word is embedded into a vector of size 512. We'll represent
those vectors with these simple boxes. The embedding only happens
in the bottom-most encoder.
Input Preprocessing
• To give the model a sense of the order of the words, we add positional encoding
vectors -- the values of which follow a specific pattern.
• Real example of positional encoding with a toy embedding size of 4
Input Preprocessing
Here, “pos” refers to the position of the word in the sequence, “d” is the size of the word/token embedding, and “i” indexes the individual dimensions of the embedding (i.e. 0, 1, 2, 3, ...).
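To make the roles of pos, d, and i concrete, here is a minimal NumPy sketch of the sinusoidal positional encoding used in the original paper (assuming an even embedding size; the toy sizes are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))"""
    pos = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]         # (1, d_model // 2)
    angles = pos / np.power(10000.0, (2 * i) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

# Toy example matching the slide: 3 words, embedding size 4
print(positional_encoding(seq_len=3, d_model=4))
```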
Encoder
• The word at each position passes through a self-attention process. Then, they
each pass through a feed-forward neural network -- the exact same network with
each vector flowing through it separately.
Self-Attention
• Attention allowed us to focus on parts of our input sequence while we predicted
our output sequence
“Self attention, sometimes called intra-attention is an attention mechanism
relating different positions of a single sequence in order to compute a
representation of the sequence.”
Self-Attention in Detail
• The first step in calculating self-attention is to create a Query vector, a Key vector, and a Value vector for each word. These vectors are
created by multiplying the embedding by three weight matrices that are learned during training.
• Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They do not have to be
smaller; this is an architectural choice to make the computation of multi-headed attention (mostly) constant.
Multiplying x1 by
the WQ weight matrix
produces q1, the "query"
vector associated with that
word. We end up creating a
"query", a "key", and a
"value" projection of each
word in the input sentence.
What are the “query”, “key”,
and “value” vectors?
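As a rough sketch (toy sizes chosen to mirror the illustration, not the real 512/64 dimensions), the three projections are plain matrix multiplications:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 3                       # toy sizes for illustration

x1 = rng.normal(size=(1, d_model))        # embedding of the first word, e.g. "Thinking"
W_Q = rng.normal(size=(d_model, d_k))     # weight matrices learned during training
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

q1 = x1 @ W_Q   # "query" vector for this word
k1 = x1 @ W_K   # "key" vector
v1 = x1 @ W_V   # "value" vector
```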
Self-Attention in Detail
• The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first
word in this example, “Thinking”. We need to score each word of the input sentence against this word.
The score is calculated by taking the dot
product of the query vector with the key
vector of the respective word we’re
scoring. So if we’re processing the self-
attention for the word in position #1,
the first score would be the dot product
of q1 and k1. The second score would be
the dot product of q1 and k2.
Self-Attention in Detail
• The third and fourth steps are to divide the scores by 8 (the square root of the key-vector dimension used in the paper, 64; this leads to
more stable gradients, and other values are possible, but this is the default), then pass the result through a SoftMax operation. SoftMax
normalizes the scores so they are all positive and add up to 1.
This SoftMax score determines how
much each word will be expressed at
this position. Clearly the word at this
position will have the highest SoftMax
score, but sometimes it’s useful to
attend to another word that is relevant
to the current word.
Self-Attention in Detail
• The fifth step is to multiply each value vector by
the SoftMax score (in preparation to sum them up).
The intuition here is to keep intact the values of the
word(s) we want to focus on, and drown-out
irrelevant words (by multiplying them by tiny
numbers like 0.001, for example).
• The sixth step is to sum up the weighted value
vectors. This produces the output of the self-
attention layer at this position (for the first word).
Matrix Calculation of Self-Attention
The first step is to calculate the Query, Key, and Value
matrices. We do that by packing our embeddings into a
matrix X, and multiplying it by the weight matrices we’ve
trained (WQ, WK, WV).
Finally, since we’re dealing with matrices, we can
condense steps two through six in one formula to
calculate the outputs of the self-attention layer.
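The condensed formula is Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. A minimal NumPy sketch of steps two through six (toy sizes, random values for illustration):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])             # steps 2 and 3
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # step 4: SoftMax over each row
    return weights @ V                                  # steps 5 and 6: weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))                             # 2 words packed into matrix X
W_Q, W_K, W_V = (rng.normal(size=(4, 3)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)                    # one output row per input word
```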
The Beast With Many Heads
• The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves
the performance of the attention layer in two ways:
• It expands the model’s ability to focus on different positions.
• It gives the attention layer multiple “representation subspaces”. With multi-headed attention we have multiple sets of Query/Key/Value
weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is
randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a
different representation subspace.
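A rough sketch of multi-headed attention under these assumptions (each head has its own projection matrices; the head outputs are concatenated and projected back with an additional matrix W_O):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """Each head has its own (W_Q, W_K, W_V); concatenate head outputs, project with W_O."""
    outs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        outs.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
    return np.concatenate(outs, axis=-1) @ W_O           # back to the model dimension

rng = np.random.default_rng(1)
d_model, d_k, n_heads = 512, 64, 8                       # sizes used in the paper
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model))
Z = multi_head_attention(rng.normal(size=(2, d_model)), heads, W_O)  # 2 toy "words"
```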
The Beast With Many Heads
The Beast With Many Heads
The Beast With Many Heads
As we encode the word "it", one attention
head is focusing most on "the animal", while
another is focusing on "tired" -- in a sense, the
model's representation of the word "it" bakes
in some of the representation of both "animal"
and "tired".
The Residuals
• One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-
layer (self-attention, FFNN) in each encoder has a residual connection around it, and is followed by a layer-
normalization step.
Layer Normalization
If we’re to visualize the vectors and the layer-norm
operation associated with self attention, it would look like
this:
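A minimal sketch of this “Add & Norm” step (the gain and bias parameters gamma and beta are learned in practice; the values here are placeholders):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each vector to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer_out, gamma, beta)

d_model = 512
gamma, beta = np.ones(d_model), np.zeros(d_model)   # learned in practice
x = np.random.default_rng(2).normal(size=(2, d_model))
z = add_and_norm(x, x * 0.1, gamma, beta)           # x * 0.1 stands in for a sub-layer output
```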
Final Linear and SoftMax Layer
1. The decoder stack outputs a vector of floats.
How do we turn that into a word? That is the job
of the final Linear layer, which is followed by a
SoftMax layer.
2. The Linear layer is a simple fully connected
neural network that projects the vector
produced by the stack of decoders, into a
much, much larger vector called a logits vector.
3. Let’s assume that our model knows 10,000
unique English words (our model’s “output
vocabulary”) that it’s learned from its training
dataset. This would make the logits vector
10,000 cells wide – each cell corresponding to
the score of a unique word. That is how we
interpret the output of the model followed by
the Linear layer.
4. The SoftMax layer then turns those scores
into probabilities (all positive, all add up to
1.0). The cell with the highest probability is
chosen, and the word associated with it is
produced as the output for this time step.
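A minimal sketch of steps 1-4 with a hypothetical 10,000-word vocabulary (the projection matrix here is random for illustration; in the real model it is learned):

```python
import numpy as np

vocab_size, d_model = 10_000, 512
W_proj = np.random.default_rng(3).normal(size=(d_model, vocab_size))  # learned in practice

decoder_out = np.random.default_rng(4).normal(size=(d_model,))  # vector from the decoder stack
logits = decoder_out @ W_proj                                    # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                             # SoftMax: all positive, sums to 1
next_word_id = int(np.argmax(probs))                             # highest-probability word for this step
```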
Combining It All
• This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked
encoders and decoders, it would look something like this:
Working
Working
Transformers are everywhere!
• Transformer models are used to solve all kinds of NLP tasks (a short pipeline sketch follows this list).
1. Feature Extraction (Get The Vector Representation Of A Text)
2. Fill-Mask (Masked Word Prediction)
3. NER (Named Entity Recognition)
4. Question-Answering
5. Sentiment-Analysis
6. Summarization
7. Text-Generation
8. Translation
9. Zero-Shot-Classification
• The companies and organizations using Transformer models
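As a minimal sketch with the Hugging Face transformers library (the task names above are the pipeline identifiers; default models are downloaded on first use):

```python
from transformers import pipeline

# Sentiment analysis with the default model for the task
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make NLP much easier!"))

# Zero-shot classification: label text without task-specific fine-tuning
zero_shot = pipeline("zero-shot-classification")
print(zero_shot("This talk is about transformer architectures.",
                candidate_labels=["education", "politics", "sports"]))
```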
A bit of Transformer history
Here are some reference points in the (short) history of Transformer models:
A bit of Transformer history
The Transformer architecture was introduced in June 2017. The focus of the original research was
on translation tasks. This was followed by the introduction of several influential models, including:
• June 2018: GPT, the first pretrained Transformer model, used for fine-tuning on various NLP tasks
and obtained state-of-the-art results
• October 2018: BERT, another large pretrained model, this one designed to produce better
summaries of sentences (more on this in the next chapter!)
• February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly
released due to ethical concerns
• October 2019: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory,
and still retains 97% of BERT’s performance
• October 2019: BART and T5, two large pretrained models using the same architecture as the
original Transformer model (the first to do so)
• May 2020: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of
tasks without the need for fine-tuning (called zero-shot learning)
Types of Transformers
This list is far from comprehensive, and is just meant to highlight a few
of the different kinds of Transformer models. Broadly, they can be
grouped into three categories:
• Encoder (ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa): sentence classification, named entity recognition, extractive question answering
• Decoder (CTRL, GPT, GPT-2, Transformer-XL): text generation
• Encoder-decoder (BART, T5, Marian, mBART): summarization, translation, generative question answering
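A minimal sketch of loading one model from each category with Hugging Face transformers (the checkpoints shown are common public ones, chosen here for illustration):

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

encoder = AutoModel.from_pretrained("bert-base-uncased")          # encoder-only
decoder = AutoModelForCausalLM.from_pretrained("gpt2")            # decoder-only
enc_dec = AutoModelForSeq2SeqLM.from_pretrained("t5-small")       # encoder-decoder

# Example use of the encoder-decoder model for translation
tokenizer = AutoTokenizer.from_pretrained("t5-small")
inputs = tokenizer("translate English to German: Hello world", return_tensors="pt")
outputs = enc_dec.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```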
Transformers are language models
• All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as
language models. This means they have been trained on large amounts of raw text in a self-
supervised fashion. Self-supervised learning is a type of training in which the objective is
automatically computed from the inputs of the model. That means that humans are not needed
to label the data!
• This type of model develops a statistical understanding of the language it has been trained on,
but it’s not very useful for specific practical tasks. Because of this, the general pretrained model
then goes through a process called transfer learning. During this process, the model is fine-tuned
in a supervised way — that is, using human-annotated labels — on a given task.
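As a minimal sketch of that transfer-learning step (the texts and label count are placeholders, not from the slides):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Start from the general pretrained model and add a task-specific classification head
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Fine-tuning would then update the model on human-labeled (text, label) pairs,
# e.g. with the transformers Trainer API or a plain PyTorch training loop.
batch = tokenizer(["great movie!", "terrible plot"], padding=True, return_tensors="pt")
outputs = model(**batch)          # outputs.logits has shape (2, num_labels)
```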
BERT (Bidirectional Encoder Representations
from Transformers)
• BERT is a Natural Language Processing Model proposed by
researchers at Google Research in 2018.
• Individual NLP tasks have traditionally been solved by individual
models created for each specific task. That is, until— BERT!
• BERT revolutionized the NLP space by solving 11+ of the most
common NLP tasks (and doing so better than previous models), making it the
jack of all NLP trades.
Fun Fact 😁: You interact with NLP (and likely BERT) almost every single day!
Example of BERT
• BERT helps Google better surface (English) results for nearly all searches since
November of 2020.
• Here’s an example of how BERT helps Google better understand specific searches
like:
BERT’s Architecture
• BERT-base: 12 Transformer layers, hidden size 768, 12 attention heads, 110M parameters, trained on 4 TPUs for 4 days
• BERT-large: 24 Transformer layers, hidden size 1024, 16 attention heads, 340M parameters, trained on 16 TPUs for 4 days
How does BERT Work?
1. Large amounts of training data:
• A massive dataset of 3.3 Billion words has contributed to BERT’s continued success.
• BERT was specifically trained on Wikipedia (~2.5B words) and Google’s Books-Corpus (~800M words). These
large informational datasets contributed to BERT’s deep knowledge not only of the English language but also
of our world! 🚀
How does BERT Work?
2. Masked Language Model:
• MLM enables/enforces bidirectional learning from text by masking (hiding) a word in a sentence and forcing
BERT to bidirectionally use the words on either side of the covered word to predict the masked word. This
had never been done before!
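A minimal fill-mask sketch with a pretrained BERT checkpoint ([MASK] is BERT's mask token; the example sentence is made up):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
# BERT uses the words on BOTH sides of [MASK] to predict the hidden word
for prediction in unmasker("The man went to the [MASK] to buy milk."):
    print(prediction["token_str"], round(prediction["score"], 3))
```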
How does BERT Work?
3. Next Sentence Prediction:
• NSP (Next Sentence Prediction) is used to help BERT learn about relationships between sentences by
predicting if a given sentence follows the previous sentence or not.
Training Inputs
1. We give inputs to BERT using the above structure. The input consists of a pair of sentences, called
sequences, and two special tokens: [CLS] and [SEP].
2. BERT first uses WordPiece tokenization to convert the sequences into tokens, then adds the [CLS]
token at the start and a [SEP] token at the end of each sentence, so the input looks like [CLS] sentence A [SEP] sentence B [SEP].
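A minimal sketch showing the special tokens the BERT tokenizer inserts (the example sentences are made up):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The man went to the store.", "He bought a gallon of milk.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(encoded["token_type_ids"])   # 0s for sentence A, 1s for sentence B (segment embeddings)
```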
Training Inputs
Token Embeddings: token embeddings are obtained by indexing a matrix of size 30000x768 (H). Here, 30000 is
the vocabulary size after WordPiece tokenization. The weights of this matrix are learned during
training.
Training Inputs
Segment Embeddings: For tasks such as question answering, we should specify which segment this
sentence is from. These are either all 0 vectors of H length if the embedding is from sentence 1, or
a vector of 1’s if the embedding is from sentence 2.
Training Output
We define two vectors S and E (which will be learned
during fine-tuning), both having shape (1x768). We then
take a dot product of these vectors with the second
sentence’s output vectors from BERT, giving us some
scores. We then apply SoftMax over these scores to get
probabilities. The training objective is the sum of the log-
likelihoods of the correct start and end positions.
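A minimal NumPy sketch of this start/end scoring (shapes and values are illustrative; T holds BERT's output vectors for the second-segment tokens):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

H, num_tokens = 768, 20
rng = np.random.default_rng(5)
T = rng.normal(size=(num_tokens, H))   # BERT output vectors for the passage tokens
S = rng.normal(size=(H,))              # start vector, learned during fine-tuning
E = rng.normal(size=(H,))              # end vector, learned during fine-tuning

start_probs = softmax(T @ S)           # probability that each token starts the answer
end_probs = softmax(T @ E)             # probability that each token ends the answer
start, end = int(np.argmax(start_probs)), int(np.argmax(end_probs))
```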
BERT Training
Pre-Training
“What is language? What is context?”
Fine-Tuning
“How to use language for a specific task?”
Fine-Tuning
GLUE Benchmark
• GLUE (General Language Understanding Evaluation) benchmark is a group of resources for
training, measuring, and analyzing language models comparatively to one another. These
resources consist of nine “difficult” tasks designed to test an NLP model’s understanding.
GPT (Generative Pre-trained Transformer)
• The OpenAI GPT model was proposed in Improving Language Understanding by Generative Pre-
Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It is a causal
(unidirectional) transformer pre-trained using language modelling on a large corpus with long-
range dependencies, the Toronto Book Corpus.
T5(Text-To-Text Transfer Transformer)
• T5, or Text-to-Text Transfer Transformer, is a Transformer based architecture that uses a
text-to-text approach. Every task – including translation, question answering, and
classification – is cast as feeding the model text as input and training it to generate some
target text. T5 is trained on web text extracted from Common Crawl.
References
• https://jalammar.github.io/illustrated-transformer/
• https://www.analyticsvidhya.com/blog/2019/06/understanding-
transformers-nlp-state-of-the-art-models/
• https://towardsdatascience.com/transformers-89034557de14
• https://www.youtube.com/watch?v=TQQlZhbC5ps&t=60s
• https://arxiv.org/abs/1706.03762