no me lebante ahorita cuz I felt como si me kemara por dentro
[I didn't get up just now cuz I felt like I was burning inside]
jit fi la fin du mois de decembre kan ljaw bared ktir wttalj
[I came at the end of the month of December; the weather was very cold and snowy]
Kibrisa geldigim … god warum? ich mochte nicht hier
[Since I came to Cyprus … god why? I don't want to be here]
Sous la pluie mais beau tout de même, chère Ileana!
Buona giornata a te e a tutti!
[In the rain but lovely all the same, dear Ileana! Have a good day, you and everyone!]
Coridel Ent merilis full tracklist untuk debut mini album Jessica Jung yg akan segera rilis bulan Mei mendatang
[Coridel Ent releases the full tracklist for Jessica Jung's debut mini album, due out this coming May]
Code-mixing or Code-Switching is the mixing of two or more
languages in a conversation or even an utterance.
Processing & Understanding
Mixed Language Data
Monojit Choudhury1, Anirudh Srinivasan1, Sandipan Dandapat2, Kalika Bali1*
1MICROSOFT RESEARCH LAB INDIA
2MICROSOFT INDIA DEVELOPMENT CENTER
EMNLP-IJCNLP Tutorial [T2], 3rd November 2019, Hong Kong
PROLOGUE
You are in safe hands ;p
Why this tutorial?
Code-mixing is hot right now!
Industry is interested
• 50% queries to Ruuh
(Microsoft chatbot) are
code-mixed
• People are talking to Alexa in
code-mixing
• 2-20% posts on Twitter and
Facebook are code-mixed.
(Figure: number of papers in the ACL Anthology with code-mixing or related terms in the title or abstract, rising sharply in recent years from one or two per year to several dozen.)
After this tutorial, you will …
• know how languages interact in multilingual societies
• understand why code-mixing is a difficult (and therefore,
interesting) problem
• be able to appreciate the challenges and nuances of code-
mixed dataset creation
• have some idea about the different NLP tasks and research
that has been happening
• be able to make better and more informed decisions about
designing code-mixed NLP systems
ML approaches and techniques for solving code-mixing are identical to
those for monolingual NLP tasks. Differences exist in…
• Priorities of tasks
• Data collection and preparation strategies
• Optimal use of existing resources
• User-centric design of (code-mixed) NLP systems
Setting mixed expectations …
• Text, not speech
• Design, not implementation
• Deep linguistics, not deep learning
• Mapping the field, not covering all research
• Insights from industry, not building large-scale systems
Outline
• Prologue
• Definitions & some
linguistic primer
• Challenges and Solutions
• SOTA in NLP tasks
• Data and Evaluation
• Language Modeling and
Word Embedding
• Pragmatic and Social
Functions
• Epilogue
BREAK
(10:30 – 11:00)
Definitions and some Linguistic primer
Mixing vs. switching
Matrix language defines the grammatical
structure of the sentence/clause
Sub-clausal syntactic units from another
language, called the embedded language,
can be inserted within the matrix structure.
Code Switching: When matrix changes across
sentences/clauses, but no embedding
Code Mixing: When there is an embedded
language
Lawyer: Minal-ji, aap smile karti rahi? Extra-friendly thi aap? [Ms. Minal, were you smiling and being extra-friendly?]
Minal: I was normal.
Lawyer: What?
Minal: I was normal.
Lawyer: Normal. Khana-pina normal. Hasna… [Normal. Food and drink normal. Smiling…]
Language Interactions in Multilingual Society
(Figure: phenomena of societal multilingualism arranged along two axes. Cognitive integration: low = distinct languages, high = same language. Performance integration: low = infrequent interleaving, high = frequent interleaving. The phenomena span multilingual discourse, loan words/borrowing, code-switching, code-mixing, and fused lects.)
Source: Wikipedia
Code-mixing
• Happens in all multilingual societies
• Is predominantly a spoken language
phenomenon
• Is generally associated with informal
conversations
• Has well-defined socio-pragmatic functions
Challenges
& Solutions
Monolingual as well as multilingual NLP systems break down in the presence of code-mixing
Cortana, aaj Hyderabad ka weather kaisa hai? Is it raining ya sunny day hai?
[Cortana, what's the weather like in Hyderabad today? Is it raining or is it a sunny day?]
Adik… sem brape boleh bwak kenderaan? normal parent question – UiTMLendufornia
[Little one… in which semester can you bring a vehicle?]
Social Media Analytics
Intersteller es una amazing movie!
[Interstellar is an amazing movie!]
Hindi-English Code-Switching on Social Media
In public pages from Facebook
(of Indian celebrities, movies and BBC Hindi News)
• ALL sufficiently long threads were multilingual
• 17.2% of the comments/posts have code-mixing
Bali et al. I am borrowing ya mixing: An analysis of English-Hindi Code-
mixing in Facebook. 1st Workshop on Computational Approaches to
Code-switching, EMNLP 2014
Worldwide language distribution of monolingual and code-switched
tweets computed over 50M Tweets (restricted to the 7 languages)
3.5% tweets are
code-switched
Rijhwani et al. ACL 2017
Geographical Distribution of Code-switching on 8M Tweets from 24 cities
We might praise you in English,
but gaali to Hindi me hi denge! (Rudra et al., EMNLP 2016)
Study of 830K Tweets from Hi-En
bilinguals
1. The native language, Hindi, is
strongly preferred (10 times more)
for negativity and swearing
2. English is used far more for positive
sentiment than negative
3. Language change often corresponds
with changing sentiment
(Figure: fraction of tweets with swear words, Hindi vs. English)
Inferences drawn from data in a single (usually
the majority) language are likely to be misleading
for multilingual societies.
Why is it Challenging?
Problem of Data
Code-mixing is predominantly
a spoken phenomenon.
So no large text corpora.
Model Explosion
With n languages, there are O(n²) potential code-mixed pairs.
Reusing Models
How can monolingual models and data be exploited for code-mixing?
How to solve it?
• Combine monolingual models
• Combine monolingual data
• Use synthetic code-mixed data
Computational Models of Code-Switching
• Supervised i.e., from scratch
• Divide & Conquer
• Combining Monolingual Models
• Zero-shot learning
Annotated Code-
mixed Data
Code-
switched
Model
Computational Models of Code-Switching
• Supervised i.e., from scratch
• Divide & Conquer
• Combining Monolingual Models
• Zero-shot learning
Code-switched
Text or speech
LID
L1 fragment L2 fragment
L1
model
L2
model
Vyas et al. 2014. En-Hi POS Tagging
Computational Models of Code-Switching
• Supervised aka from scratch
• Divide & Conquer
• Combining Monolingual Models
• Zero-shot learning
Code-switched
Text or speech
LID
L1
model
L2
model
Combination
Logic or ML
Solorio and Liu (EMNLP 2008): En-Es POS Tagging
Also Multilingual ASRs
Computational Models of Code-Switching
• Supervised aka from scratch
• Divide & Conquer
• Combining Monolingual Models
• Zero-shot learning
Code-
switched
Model
L1 Data L2 Data
Schuster et al. 2016: Zeroshot translation with Google’s
Multilingual Neural Machine translation System
Artexe and Shwenk. 2019: Massively multilingual
sentence embeddings for zeroshot crosslingual transfer
and beyond.
Speech and
NLP Tasks
Code-mixed Speech and NLP tasks
Every speech and NLP task that takes input beyond lexical information has a code-mixed counterpart
◦ Sub-sentential, sentence, conversation etc.
◦ A few tasks also address morpheme-level code-switching
Code-mixed tasks
• Speech: ASR, TTS
• Text
  ◦ Word level: Lang. Identification, POS Tagging, NER
  ◦ Sentence level: Sentiment Analysis, Language Model, Parsing
• Applications: Question Answering, Machine Translation, Information Retrieval
Area                      #papers   Shared Tasks
Language Identification   39        CALCS 2014, 2016
Sentiment Analysis        23        SemEval 2019, TRAC 2018, ICON 2017
ASR                       24
NER                       13        CALCS 2018
POS                       14        ICON 2016
TTS                       9
Parsing                   6
Language modelling        8
Translation               4
QnA                       4

Statistics of papers from the ACL Anthology that mention code-mixing, code-switching, etc.; for speech work, Interspeech and ICASSP were also considered.
Language Identification
Microsoft ne ek worldwide Hackathon organize kiya
NE Hi Hi En En En Hi
The task is to label each word in a text with a
language from a set L, or as a named entity
◦ Preprocessing for downstream NLP tasks
◦ Techniques include
  ◦ Dictionary look-up
  ◦ Sequence labelling approaches
Wat n awesum movie it wazzzz!
sabko dekhna chahiye
Dilwale vs. Bajirao Mastani: Even
Super-Films Get the Monday
Blues
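The dictionary look-up technique above can be sketched in a few lines. This is a toy illustration, not any system from the tutorial: the two lexicons are invented, and real systems need context and a separate named-entity label (here "Microsoft" would ideally be tagged NE, as in the example sentence above).

```python
# Minimal sketch of dictionary look-up word-level LID (toy lexicons,
# invented for illustration). Each token gets the language whose lexicon
# contains it; ambiguous or unseen tokens fall back to a default label.

EN_LEXICON = {"microsoft", "worldwide", "hackathon", "organize"}
HI_LEXICON = {"ne", "ek", "kiya"}

def dictionary_lid(tokens, default="En"):
    labels = []
    for tok in tokens:
        t = tok.lower()
        if t in HI_LEXICON and t not in EN_LEXICON:
            labels.append("Hi")
        elif t in EN_LEXICON and t not in HI_LEXICON:
            labels.append("En")
        else:
            labels.append(default)  # ambiguous or OOV: needs context
    return labels

tokens = "Microsoft ne ek worldwide Hackathon organize kiya".split()
print(list(zip(tokens, dictionary_lid(tokens))))
```

Note that a pure look-up cannot recover the NE tag for "Microsoft" or disambiguate romanized words shared across lexicons, which is why sequence labelling approaches are used in practice.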
Use of LID
Code-switched
Text or speech
LID
L1
model
L2
model
Combination
Logic or ML
Code-switched
Text or speech
LID
L1 fragment L2 fragment
L1
model
L2
model
Pairwise Language Labeling: Approach
Technique: Use your favorite Sequence Labeling technique
E.g., HMM, Conditional Random Fields, RNN
Data:
◦ EMNLP 2014 Code-Switching Dataset
◦ FIRE Language Detection Dataset
Finer Models
Semi-supervised Learning with Weak Labeling
(Technique: Hidden Markov Models)
Monolingual
(Labeled)
Tweets
Unlabeled
Tweets
Initial Model from Weakly Labeled Data
(Figure: an HMM with a state per language, e.g. En, Ge, Fr, each paired with an "other-token" state XEn, XGe, XFr, plus Start and End. In the initial model the En state has, e.g., transition probability 0.8 to itself, 0.15 to XEn and 0.05 to End.)
Updating the probabilities
(Figure: after updating on the unlabeled tweets, the En transitions shift to, e.g., 0.79 to itself and 0.14 to XEn, with small probabilities of roughly 0.015-0.04 flowing to the other languages and End.)
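Decoding with such an HMM is standard Viterbi over per-language states. The sketch below is a toy two-language version with hand-picked transition and emission probabilities (the 0.8 self-loop echoes the figure above, but all numbers and words are invented, not the trained model):

```python
import math

# Hedged sketch: Viterbi decoding for word-level LID with one HMM state
# per language. All probabilities and the tiny emission tables are toy
# values for illustration, not trained parameters.

STATES = ["En", "Fr"]
TRANS = {("En", "En"): 0.8, ("En", "Fr"): 0.2,
         ("Fr", "Fr"): 0.8, ("Fr", "En"): 0.2}
START = {"En": 0.5, "Fr": 0.5}
EMIT = {
    "En": {"the": 0.3, "concert": 0.2, "moment": 0.1},
    "Fr": {"le": 0.3, "concert": 0.1, "moment": 0.2},
}
OOV = 1e-4  # smoothing mass for unseen words

def viterbi(tokens):
    logp = {s: math.log(START[s]) + math.log(EMIT[s].get(tokens[0], OOV))
            for s in STATES}
    back = []
    for tok in tokens[1:]:
        new, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: logp[p] + math.log(TRANS[(p, s)]))
            new[s] = (logp[prev] + math.log(TRANS[(prev, s)])
                      + math.log(EMIT[s].get(tok, OOV)))
            ptr[s] = prev
        logp, back = new, back + [ptr]
    best = max(STATES, key=logp.get)
    path = [best]
    for ptr in reversed(back):   # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["the", "concert", "le", "moment"]))  # -> ['En', 'En', 'Fr', 'Fr']
```

The high self-transition probability is what makes the model prefer contiguous same-language spans over token-by-token flipping.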
Correctly Labeled:
@crystal_jaimes no me lebante ahorita cuz I felt como si me kemara por dentro ! :o Then I started getting all red , I think im allergic a algo
[… I didn't get up just now cuz I felt like I was burning inside! … I think I'm allergic to something]
What was your favourite moment at the concert ? Was war für euch der schönste Moment ?
[… What was the most beautiful moment for you?]
Errors:
RT @lolsoufixe : remember when pensavam que a minha cadela aka nina se chamava Irina
[… remember when they thought my dog aka Nina was called Irina]
XINGIE , nouvel de disponible dès aujourd'hui release party jeudi aux bains ...
[… new release available from today, release party Thursday at Les Bains …]
Some examples (tokens marked in the original slide as English, other language, or X)
Our current LID system can handle 25 languages:
Catalan, Croatian, Czech, Danish, Estonian, Finnish, French, Hungarian, Indonesian, Italian, Latvian, Malay, Norwegian, Polish, Romanian, Slovak, Slovene, Tagalog, plus the seven evaluation languages listed below.
(Figure: word labeling accuracy of the HMM-based LID system as the number of languages grows: HMM (2), HMM (7), HMM (25). Accuracy stays roughly in the 87-96% range. Evaluation languages: Dutch, English, French, German, Portuguese, Spanish, Turkish.)
Machine Translation
4-6% of the tweets seen by Bing translation are code-mixed
Input: हाँ। मैं हायर एजुकेशन किया हूँ। (haan. main haayar ejukeshan kiya hoon.)
Translation: Yes I have higher ejayuukeshan.

Input: मैं अभी तक शादी नहीं किया हूँ। मतलब अनमैरीड हूँ। (main abhee tak shaadee nahin kiya hoon. matalab anamaireed hoon.)
Translation: I'm not married yet. I mean Anamairid.

Input: हम्म! एक्चुअली, क्रिकेट में मुझे अच्छा लगता हैं। (hamm! ekchualee, kriket mein mujhe achchha lagata hain.)
Translation: Hmm! Ekachualali Ahha, I feel good in cricket.
The problem is more intense if the input is romanized, and less intense if mixed script is used.
Machine Translation for Code-mixed Input
Input: Merci beaucoup à tout le monde pour les messages. Grazie ancora per gli auguri
With language detection (Fr→En MT + It→En MT): "Thanks much to everyone for messages. Thanks again for your good wishes."
Direct Fr→En translation: "Thanks much to everyone for messages. Grazie ancora per gli wishes"
In process of integration with Bing; MT for 7 languages (En, De, Es, Pt, Fr, Tr, Du)
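The detect-then-route pipeline above can be sketched as follows. Everything here is a stand-in: the word lists are toy LID lexicons and the per-language translators are stubs returning fixed strings, not Bing's actual systems.

```python
import itertools

# Hedged sketch of divide-and-conquer MT: run word-level LID, split the
# input into contiguous same-language fragments, translate each fragment
# with that language's MT system, and concatenate. Lexicons and MT
# systems are toy stand-ins, invented for illustration.

FR_WORDS = {"merci", "beaucoup", "à", "tout", "le", "monde",
            "pour", "les", "messages"}
IT_WORDS = {"grazie", "ancora", "per", "gli", "auguri"}

def lid(tok):
    t = tok.lower().strip(".")
    if t in IT_WORDS:
        return "it"
    return "fr"  # toy fallback; real LID uses context

STUB_MT = {  # stand-ins for real fr->en and it->en MT systems
    "fr": lambda frag: "Thanks a lot to everyone for the messages.",
    "it": lambda frag: "Thanks again for the wishes.",
}

def translate_mixed(text):
    out = []
    # groupby merges consecutive tokens that share a language tag
    for lang, group in itertools.groupby(text.split(), key=lid):
        out.append(STUB_MT[lang](" ".join(group)))
    return " ".join(out)

print(translate_mixed(
    "Merci beaucoup à tout le monde pour les messages. "
    "Grazie ancora per gli auguri"))
```

The key design point is the `groupby` step: routing whole fragments rather than single words preserves enough context for each monolingual MT system to produce fluent output.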
MT for code-switching is a hard problem!
“… we can handle input with code-switching … In practice, it is
not too hard to find examples where code-switching in the input
does not result in good outputs; in some cases the model will
simply copy parts of the source sentence instead of translating it.”
Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
November, 2016
Machine Translation
Some insights into code-mixed translation across different scripts
Data
& Evaluation
The Three Fundamental Problems of CM DATA
Where to get the data from?
1
How to
characterize the
nature of code-
mixing?
2
How to label the
data given?
3
Where to get
the data from?
Ideally, transcribed conversational speech
• BANGOR-MIAMI: En-Es, 54 conversations
• SEAME (63 hrs), HKUST (5+15 hrs), CECOS (12 hrs), CUMIX (17 hrs): En-Mandarin
• MCSM: French-Arabic
• Malay-En, Frisian-Dutch, Hindi-English
Next best is text-based chat logs
• WhatsApp and Facebook conversations
• Extracted Twitter conversations
• Human-bot conversations
• Privacy concerns
Where to get
the data from?
Non-conversational text data
• User-generated content on the Web
  ◦ Twitter: most researched, but doesn't allow distribution of tweet contents
  ◦ Facebook: difficult to crawl
  ◦ YouTube, Reddit comments
Scripted conversations
• Movie scripts
• Plays, podcasts, reality shows
Guess Why?
POS tagging accuracies reported on the BANGOR-MIAMI (En-Es) corpus are in the
high 80s to mid 90s, whereas the POS tagging accuracies of the best-performing
systems in the ICON 2017 shared task (En-Hi, En-Ta, …) were in the mid-70s!
◦ More training data?
◦ An inherently more difficult language pair?
◦ Different patterns of code-mixing in the corpora?
Language Interactions in Multilingual Society
(Recap of the earlier figure: multilingual discourse, loan words/borrowing, code-switching, code-mixing, and fused lects, arranged by cognitive integration (low = distinct languages, high = same language) and performance integration (low = infrequent, high = frequent interleaving).)
The Three Fundamental Problems of CM DATA
Where to get the data from?
1
How to
characterize the
nature of code-
mixing?
2
How to label the
data given?
3
Comparing the level of code-mixing
The fraction of words in the matrix language is not a good estimator
(Gambäck and Das, 2014)
Comparing the level of code-mixing
w_L1 w_L1 w_L2 w_L2
vs.
w_L1 w_L2 w_L1 w_L2
Uses the number of code-alternation points per token
(Gambäck and Das, 2016) extended this, considering code-alternation between two utterances
Comparing the level of code-mixing
Ratio-based metrics
M-index (Barnett et al., 2000) – the ratio of languages in the corpus, measuring the inequality of the distribution
Guzman et al., 2017
◦ Language Entropy – the number of bits needed to represent the distribution of languages
◦ I-Index – measures the total probability of switching in the corpus
Comparing the level of code-mixing
Time-course measures (Guzman et al., 2017)
◦ Measure the temporal distribution of C-S across the corpus
◦ Burstiness – bursty vs. periodic switching patterns
◦ The information required to describe the distribution of language spans
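Two of the simplest quantities above can be computed directly from per-token language tags. This sketch follows the spirit of the metrics, not the exact formulas of the cited papers; note how the two tag sequences from the earlier slide differ in switch density while having identical language entropy:

```python
import math
from collections import Counter

# Hedged sketch of two corpus-level mixing measures, computed from
# per-token language tags. Illustrative, not the papers' exact formulas.

def switch_points_per_token(tags):
    """Fraction of adjacent token pairs whose language differs."""
    switches = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
    return switches / len(tags)

def language_entropy(tags):
    """Bits needed to encode the language distribution of the tags."""
    counts = Counter(tags)
    n = len(tags)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

burst  = ["L1", "L1", "L2", "L2"]   # w_L1 w_L1 w_L2 w_L2
period = ["L1", "L2", "L1", "L2"]   # w_L1 w_L2 w_L1 w_L2

print(switch_points_per_token(burst), switch_points_per_token(period))  # 0.25 0.75
print(language_entropy(burst), language_entropy(period))                # 1.0 1.0
```

This is exactly why ratio-based metrics alone are insufficient: both sequences are 50/50 bilingual, but only the switch-point measure separates them.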
CMI of some corpora. Source: http://www.amitavadas.com/code-mixing.html
The Three Fundamental Problems of CM DATA
Where to get the data from?
1
How to
characterize the
nature of code-
mixing?
2
How to label data?
3
Annotation
Standards
• Sentiment, emotion, hate speech
• Information retrieval
• Machine translation
No special treatment needed for code-mixing
• POS Tagging
• Parsing
Monolingual standards need to be adapted
• Word-level Language Detection
• Discourse functions of code-mixing
• ASR Transcription
New standards need to be created
Are UNIVERSAL Tagsets for POS and Dependency labels
adequate for code-mixed languages?
Source: http://www.amitavadas.com/code-mixing.html
Finding
Annotators
IT’S HARD TO FIND MANY BILINGUAL TURKERS
FOR A SPECIFIC LANGUAGE PAIR, AND
IMPOSSIBLE TO FIND EVEN ONE WHO KNOWS ALL
LANGUAGES!
Evaluation of CM systems
EVALUATE AT CODE-MIXING POINTS
Source: Utsab Barman (2019) Automatic Processing of Code-mixed Social Media Content. PhD Thesis. DCU
Evaluation of
CM systems
Evaluate at code-mixing points
Source: Pratapa et al. ACL 2018
Solving Language Models and Word-Embeddings
What is
Language
Modeling
• Assigning probabilities to sequences of words
• p(w₁ w₂ … wₙ) = ∏ₖ₌₁ⁿ p(wₖ | wₖ₋₁ … w₁)
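The chain rule above is usually truncated in practice; as a concrete illustration, here is a bigram approximation p(wₖ | wₖ₋₁) with add-one smoothing, trained on a two-sentence toy Hinglish corpus (the corpus and smoothing choice are invented for illustration):

```python
import math
from collections import Counter

# Hedged sketch of the chain rule with a bigram approximation:
# p(w_k | w_1 ... w_{k-1}) ~ p(w_k | w_{k-1}), add-one smoothed.
# The tiny corpus is invented for illustration.

corpus = [["<s>", "main", "ghar", "ja", "raha", "hoon", "</s>"],
          ["<s>", "main", "office", "ja", "raha", "hoon", "</s>"]]

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent[:-1])            # history counts
    bigrams.update(zip(sent, sent[1:]))   # (prev, word) counts

vocab = {w for sent in corpus for w in sent}

def p_bigram(w, prev):
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + len(vocab))

def sentence_logprob(sent):
    return sum(math.log(p_bigram(w, prev)) for prev, w in zip(sent, sent[1:]))

print(sentence_logprob(["<s>", "main", "office", "ja", "raha", "hoon", "</s>"]))
```

Perplexity, the standard LM evaluation used throughout this part of the tutorial, is just the exponentiated negative average of such log-probabilities per token.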
Why Language
Modeling
• Automatic Speech Recognition (ASR) systems
need an LM
• Downstream tasks like POS tagging and NER need some form of LM
• Today's hot NLP topics, machine translation and language generation, also need LMs
• And how can we forget phone keyboards?
Why Language
Modeling
• Say we have an LM that can properly code mix
• Model can predict words from both
languages
• Model knows when to pick words from each
language
• Model knows when to code mix
• If so, have we solved the problem of code
mixing itself??
Data ! Data ! Data !
• LMs require large amounts of UNLABELLED data
• Unlike other NLP systems that can be trained on smaller amounts of LABELLED data
• Monolingual LMs, trained on
Wikipedia data
Language    No. of Wikipedia articles
English     5.9 M
German      2.3 M
French      2.1 M
Chinese     1.7 M
Esperanto   270 k
Hindi       133 k

Code-Mixed Corpora                    No. of sentences
Hindi-English (Chandu et al., 2018)   59 k
Mandarin-English (SEAME)              56 k
Approaches: Something Simple
• 1 RNN per language
• Take turns outputting tokens
• Which RNN’s turn – determined by a switch
variable
• Switch variable sampled from some distribution
• Garg et al. (2017) Dual Language Models for Code
Switched Speech Recognition.
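The dual-LM generation idea can be sketched as follows, in the spirit of Garg et al. (2017) but not their model: the two "LMs" here are just word lists standing in for per-language RNNs, and the switch variable is a Bernoulli draw.

```python
import random

# Hedged sketch of the dual-LM idea: two toy per-language generators
# take turns emitting tokens, and a Bernoulli switch variable decides
# when to change language. Word lists are stand-ins for trained RNN LMs.

random.seed(0)

LM = {  # placeholders for two per-language RNN language models
    "hi": ["main", "ghar", "ja", "raha", "hoon"],
    "en": ["I", "am", "going", "home", "now"],
}

def generate(n_tokens, p_switch=0.3, lang="hi"):
    out = []
    for i in range(n_tokens):
        if i > 0 and random.random() < p_switch:  # switch variable fires
            lang = "en" if lang == "hi" else "hi"
        out.append((random.choice(LM[lang]), lang))
    return out

print(generate(8))
```

A trained version would condition the switch probability on context rather than sampling it from a fixed distribution, which is precisely the "model the switching constraint" direction discussed below.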
Approaches:
More Complex
• Handle Data Sparsity
• Generate more code-mixed sentences
• Model the switching constraint
• Make the model learn when to switch
• Share context between both RNNs
One approach to handling data sparsity
Language modeling for code-mixing: The role of linguistic
theory based synthetic data, Pratapa et al. [2018]
Use Linguistic Theories ??
• These theories date back to the 1980s
• They assume a syntactic relation between a pair of parallel sentences
  ◦ Equivalence of grammar rules
  ◦ Word- or phrase-level alignment
• They propose ways to model CM and generate sentences
• Sentences generated with a linguistic theory backing them are bound to be better than random mixing
Linguistic
Theories for
CM
• The 3 theories
• Equivalence Constraint (Sankoff and
Poplack [1981]), (Sankoff [1998])
• Functional Head (Belazi et al.
[1994])
• Embedded Matrix
• Li and Fung. 2014. Code switch
language modeling with functional
head constraint used the functional
head theory during the decoding phase
of the LM component of the ASR
Example
• She lives in a white house
• Elle vive en una casa blanca
EM theory
• Replace a subtree in the matrix language with one from the embedded language:
  1. Parallel sentences: "Elle vive en una casa blanca" (matrix) / "She lives in a white house" (embedded)
  2. Detach the matrix subtree "en una casa blanca" and substitute the aligned embedded subtree "in a white house"
  3. Result: "Elle vive in a white house"
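The EM-style swap above reduces to replacing an aligned phrase, given a phrase alignment between the matrix sentence and its embedded-language translation. The alignment here is hand-written for illustration; a real system would induce it from parsed parallel data.

```python
# Hedged sketch of the embedded-matrix swap: given a phrase-level
# alignment between matrix and embedded sentences, replace one aligned
# matrix phrase with its embedded-language counterpart. The alignment
# dictionary is hand-written for this example.

matrix   = "Elle vive en una casa blanca"
embedded = "She lives in a white house"

alignment = {  # matrix phrase -> embedded phrase
    "Elle": "She",
    "vive": "lives",
    "en una casa blanca": "in a white house",
}

def em_swap(sentence, phrase, alignment):
    """Substitute one matrix phrase with its aligned embedded phrase."""
    return sentence.replace(phrase, alignment[phrase])

print(em_swap(matrix, "en una casa blanca", alignment))
# -> Elle vive in a white house
```

Choosing which constituents may be swapped is exactly where the linguistic theories differ; EM permits whole-subtree substitution, while the EC theory below imposes additional well-formedness checks at each switch point.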
EC theory
• All leaf nodes are swappable
• After all swaps, check the constraints:
  ◦ Monolingual fragments must appear as in the original language
  ◦ The EC constraint must be obeyed at every switch point
    ◦ Each node in the tree is assigned a language id from its parent and its children
    ◦ Parent: based on the ordering of non-terminals in the RHS of the rule applied at the parent
    ◦ Child: based on the language of the leaf, propagated upwards
    ◦ Both have to match
(Figure: English, Spanish and code-mixed parse trees, with an ill-formed monolingual fragment and a language tag mismatch marked.)
Training via a Curriculum
• Using a curriculum improves the perplexity of the model
• Baheti et al. (2017) Curriculum Design for Code-switching: Experiments with Language Identification and Language Modeling with Deep Neural Networks
  ◦ Curriculum: monolingual → code-mixed
• Pratapa et al. (2018)
  ◦ Curriculum: generated code-mixed → monolingual → real code-mixed
  ◦ Adding real code-mixed data at the end is very useful
Results from Pratapa, et al. (2018) on LM Perplexity
Other Work - Handle Data Sparsity
• Models that are trained on data
generated by a SeqGAN
• Garg, et al. (2018) Code-switched
Language Models Using Dual
RNNs and Same-Source
Pretraining
• Chang, et al. (2019) Code-
switching Sentence Generation by
Generative Adversarial Networks
and its Application to Data
Augmentation
Chang, et al. (2019) GAN framework for
generating sentences
Other Work - Handle Data Sparsity
• Samanta et al. (2019) A Deep Generative Model for Code-Switched Text use a VAE with an RNN-based encoder and decoder to generate sentences
Samanta, et al. (2019), A VAE framework for generating sentences
Other Work - Handle Data Sparsity
• Winata, et al. (2019) Code-Switched
Language Models Using Neural Based
Synthetic Data from Parallel Sentences
use pointer generator networks to
generate code mixed sentences
• Lee, et al. (2019) Linguistically
Motivated Parallel Data Augmentation
for Code-switch Language Modeling use
the EM theory at a phrase level to
generate code mixed sentences
Pointer Generator networks used in Winata, et al. (2019)
Other Work - Model the switching constraint
• Garg, et al. (2018) Code-switched
Language Models Using Dual RNNs and
Same-Source Pretraining
• Output of 2 RNNs run through linear layer to
get final output
• Also train on data generated by a SeqGAN
Other Work - Model the switching constraint
• Adel, et al. (2013) Combination of Recurrent Neural Networks and Factored
Language Models for Code-Switching Language Modeling
• Adel, et al. (2015) Syntactic and semantic features for code-switching factored
language models
Other Work - Model the switching constraint
• Winata, et al. (2018) Code-Switching Language Modeling using Syntax-Aware
Multi-Task Learning show that adding a POS tag prediction task to the LM shows
improvements in perplexity
• Soto and Hirschberg (2019) Improving Code-Switched Language Modeling Performance Using Cognate Features use features about words with similar origins in both languages
Other Work - Sharing context between the RNNs
• Not as simple a task as it sounds
• No current model is capable of this
• On a related note
  ◦ Multilingual deep contextual embeddings model multiple languages at once
  ◦ Can they be made to code mix?
  ◦ Artetxe and Schwenk (2019) Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond feed an explicit language id as input to the decoder during training
Talking about
embeddings…
Work on Embeddings for code mixing
• Most work - Bilingual Embeddings adapted for some tasks
• What about learning embeddings from CM data?
Bilingual Embeddings
• Summarized in Upadhyay et al. (2016)
• 4 methods
  ◦ Bilingual Skip-Gram Model (BiSkip) - Luong et al. (2015)
  ◦ Bilingual Compositional Model (BiCVM) - Hermann and Blunsom (2014)
  ◦ Bilingual Correlation Based Embeddings (BiCCA) - Faruqui and Dyer (2014)
  ◦ Bilingual Vectors from Comparable Data (BiVCD) - Vulić and Moens (2015)
• All take monolingual embeddings and a corpus aligned at some level
• and project those embeddings into a common space using the alignment
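The projection step can be illustrated in miniature. This is not any of the four methods above; it is a toy least-squares mapping fit on a two-word seed dictionary, with 2-d vectors invented so the arithmetic stays transparent:

```python
# Hedged toy sketch of projecting one embedding space onto another via
# a seed translation dictionary. Vectors and the seed pairs are invented;
# real methods (BiSkip, BiCCA, ...) work in hundreds of dimensions.

EN = {"house": (1.0, 0.0), "dog": (0.0, 1.0), "white": (1.0, 1.0)}
ES = {"casa":  (0.0, 1.0), "perro": (1.0, 0.0), "blanca": (1.0, 1.0)}
SEED = [("house", "casa"), ("dog", "perro")]  # seed translation pairs

# With orthonormal seed vectors, the least-squares map reduces to
# W = sum_i y_i x_i^T (outer products of aligned pairs).
W = [[0.0, 0.0], [0.0, 0.0]]
for en_w, es_w in SEED:
    x, y = EN[en_w], ES[es_w]
    for r in range(2):
        for c in range(2):
            W[r][c] += y[r] * x[c]

def project(vec):
    """Map an English vector into the Spanish space via W."""
    return tuple(sum(W[r][c] * vec[c] for c in range(2)) for r in range(2))

print(project(EN["white"]))  # lands exactly on ES["blanca"] = (1.0, 1.0)
```

The held-out pair ("white", "blanca") shows the point of the exercise: a map fit only on the seed dictionary transfers to words outside it.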
Bilingual Embeddings for CM
• Pratapa, et al. (2018) evaluated these for CM POS tagging and
sentiment analysis
• Pre-trained embeddings performed better than no embeddings
• Embeddings learnt on synthetic code mixed data performed better
Other work on Embeddings
• Winata et al. (2019) Hierarchical Meta-Embeddings for Code-Switching Named Entity Recognition show that amalgamating multiple embeddings (word, subword, and character level) improves downstream tasks
• Lee and Li. (2019) Word and Class Common
Space Embedding for Code-switch Language
Modelling show that when using auxiliary
features as input for an LM, constraining the
embedding space of the words and these
features improves LM perplexity
Winata, et al. (2019)
Takeaways
• So much is possible using linguistic
theories
• Solving LM for CM – solving CM
• Direction of Future Work
• Deep contextual embeddings ?
• Zero shot transfer ?
Socio-pragmatic
Functions of
Code-mixing
Language
Preference:
When and
why do
bilinguals
prefer a
certain
language?
Topic change
Puns
Emphasis
Emotion
Reported Speech
But it’s unpredictable!
Linguistic Studies
Until recently, no large-scale, data-driven validation of the hypotheses.
Fishman (1971):
- Use of English for professional settings, Spanish for informal chat
Dewaele (2004, 2010):
- The native language elicits stronger emotion
- Preferred for emotion expression and swearing
Nguyen (2014):
- Code-choice as a social identity marker
A great place to
start your
exploration!
Computational Linguistics, 2016
Initial
quantitative
studies
• Jurgens, Dimitrov, and Ruths (2014) studied
tweets written in one language but containing
hashtags in another language
• Nguyen, Trieschnigg, and Cornips (2015) studied
users in the Netherlands who tweeted in a
minority language (Limburgish or Frisian) as
well as in Dutch. Most tweets were written in
Dutch, but during conversations users often
switched to the minority language (i.e.,
Limburgish or Frisian).
We might praise you in English,
but gaali to Hindi me hi denge! (Rudra et al., EMNLP 2016)
Study of 830K Tweets from Hi-En
bilinguals
1. The native language, Hindi, is
strongly preferred (10 times more)
for negativity and swearing
2. English is used far more for positive
sentiment than negative
3. Language change often corresponds
with changing sentiment
(Figure: fraction of tweets with swear words, Hindi vs. English)
Predicting Naijá-English code switching
Innocent Ndubuisi-Obi, Sayan Ghosh and David Jurgens (2019) Wetin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior in Nigerian Online Discussions. ACL.
330K articles and accompanying 389K comments
labeled for code switching behavior
Predictive Factors of Naijá usage
• Article topic
• Social setting: number of prior comments, depth of thread
• Social status: number of followers
• Emotion
• Tribal affiliation: Yoruba, Hausa-Fulani, Igbo, etc. (automatically labeled)
Predictive Factors of Naijá usage (Findings)
• Comments deeper in a reply thread are more likely to be Naijá
• Those made in the evening are likely to be conversational with a particular person
• High status → more English, but potential confounds
• Strong sentiment → more Naijá
(Factors considered: article topic; social setting (number of prior comments, depth of thread); social status (number of followers); emotion; tribal affiliation (Yoruba, Hausa-Fulani, Igbo, etc., automatically labeled).)
Worldwide language distribution of monolingual and code-switched
tweets computed over 50M Tweets (restricted to the 7 languages)
3.5% tweets are
code-switched
Rijhwani et al. ACL 2017
Geographical Distribution of Code-switching on 8M Tweets from 24 cities
Fraction of monolingual English
tweets is strongly negatively
correlated (-0.85) with the
fraction of code-switched tweets
This is surprising … especially for
extremely multilingual US cities
(e.g., Houston)
(?) Acculturation takes place much faster in the US
EPILOGUE
If bots could code-mix…
Code-choice as a
Style dimension
VIJAY: ek minute ke liye thoda practical socho. [Think a little practically for a minute.]
VIJAY: Main tumharey angle se hi soch raha hoon... Tum hi uncomfortable feel karogi... bahut time ho gaya hai... bahut fark aa gaya hai [I am thinking from your angle... You are the one who will feel uncomfortable... a lot of time has passed... a lot has changed]
RANI: Kismein? Mujhmein koyi change nahin hai [In what? There is no change in me]
VIJAY: Vohi to baat hai... mujhmein hai... Meri duniya ... bilkul alag hai... ab... you'll not fit in [That's exactly the point... there is in me... My world... is completely different... now... you'll not fit in]
RANI: Matlab? ek dum se main tumharey jitni fancy nahin hoon... [Meaning? All of a sudden I'm not as fancy as you...]
Code-choice accommodation in human-human conversations
• Base rate of a style: how frequently a style (code) is used by a user
• Style accommodation: how frequently a style (code) is used by a user when the preceding utterance contains that style (code)
Bawa et al. Workshop on Computational Approaches to Code-Switching, EMNLP 2018
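The two quantities just defined can be computed directly from conversation logs. The log format below is invented for illustration, and the definitions follow the slide's wording rather than the paper's exact formulation:

```python
# Hedged sketch: base rate of a code per user, and style accommodation
# (how often a user replies in a code when the previous utterance used
# it). The toy log of (speaker, code) turns is invented for illustration.

log = [
    ("A", "hi"), ("B", "hi"), ("A", "en"), ("B", "en"),
    ("A", "hi"), ("B", "hi"), ("A", "hi"), ("B", "en"),
]

def base_rate(user, code):
    """Fraction of the user's own turns that use the given code."""
    turns = [c for u, c in log if u == user]
    return sum(c == code for c in turns) / len(turns)

def accommodation(user, code):
    """Of the user's replies whose preceding turn used `code`,
    the fraction that also use `code`."""
    pairs = [(prev, cur) for (pu, prev), (cu, cur) in zip(log, log[1:])
             if cu == user]
    after_code = [cur for prev, cur in pairs if prev == code]
    return sum(c == code for c in after_code) / len(after_code)

print(base_rate("B", "hi"), accommodation("B", "hi"))
```

Comparing accommodation against the base rate is what makes the measure meaningful: accommodation above the base rate is the "positive accommodation" referred to on the next slide.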
Nudge, don’t push or assume...
• Human-human conversations show "positive accommodation" for the choice of the marked code
• In a wizard-mediated-bot experiment, most users show a very strong preference for a bot that can code-mix
• A small fraction of users have a negative opinion of a code-mixing bot, so it is important to nudge before mixing
Understanding code-
mixing is not a luxury
but a necessity for
chatbots for
multilingual societies
Resources
• Sitaram et al. (2019) A Survey of Code-switched
Speech and Language Processing. Arxiv.
https://arxiv.org/abs/1904.00784
• https://github.com/gentaiscool/code-switching-papers
• Project Melange: https://www.microsoft.com/en-us/research/project/melange
• Please get in touch with us for a comprehensive
list of datasets and resources covered in this
tutorial.
Tutorial me ane ke lie thank you! [Thank you for coming to the tutorial!]
https://www.microsoft.com/en-us/research/project/melange/
Building a Neural Machine Translation System From Scratch
 

More from MariYam371004

SOCIAL_IDENTITY_pptx.pptx
SOCIAL_IDENTITY_pptx.pptxSOCIAL_IDENTITY_pptx.pptx
SOCIAL_IDENTITY_pptx.pptxMariYam371004
 
srategies-for-teaching-language-skills.ppt
srategies-for-teaching-language-skills.pptsrategies-for-teaching-language-skills.ppt
srategies-for-teaching-language-skills.pptMariYam371004
 
EMLex-Diachronic-lexicography-and-lexicology-1.ppt
EMLex-Diachronic-lexicography-and-lexicology-1.pptEMLex-Diachronic-lexicography-and-lexicology-1.ppt
EMLex-Diachronic-lexicography-and-lexicology-1.pptMariYam371004
 
dokumen.tips_what-is-apa-apa-format-is-a-standard-set-of-conventionsrules-for...
dokumen.tips_what-is-apa-apa-format-is-a-standard-set-of-conventionsrules-for...dokumen.tips_what-is-apa-apa-format-is-a-standard-set-of-conventionsrules-for...
dokumen.tips_what-is-apa-apa-format-is-a-standard-set-of-conventionsrules-for...MariYam371004
 
Sociolinguistics 2.pptx
Sociolinguistics 2.pptxSociolinguistics 2.pptx
Sociolinguistics 2.pptxMariYam371004
 
Word-Formation-Presentation.pptx
Word-Formation-Presentation.pptxWord-Formation-Presentation.pptx
Word-Formation-Presentation.pptxMariYam371004
 
lingua-franca-151019083140-lva1-app6892.pdf
lingua-franca-151019083140-lva1-app6892.pdflingua-franca-151019083140-lva1-app6892.pdf
lingua-franca-151019083140-lva1-app6892.pdfMariYam371004
 
Phonetic-Chart-BE.pdf
Phonetic-Chart-BE.pdfPhonetic-Chart-BE.pdf
Phonetic-Chart-BE.pdfMariYam371004
 
lecturer-website-result-final.pdf
lecturer-website-result-final.pdflecturer-website-result-final.pdf
lecturer-website-result-final.pdfMariYam371004
 
5-Listening-Skills-2.ppt
5-Listening-Skills-2.ppt5-Listening-Skills-2.ppt
5-Listening-Skills-2.pptMariYam371004
 
The Conditionals.pdf
The Conditionals.pdfThe Conditionals.pdf
The Conditionals.pdfMariYam371004
 

More from MariYam371004 (19)

SOCIAL_IDENTITY_pptx.pptx
SOCIAL_IDENTITY_pptx.pptxSOCIAL_IDENTITY_pptx.pptx
SOCIAL_IDENTITY_pptx.pptx
 
srategies-for-teaching-language-skills.ppt
srategies-for-teaching-language-skills.pptsrategies-for-teaching-language-skills.ppt
srategies-for-teaching-language-skills.ppt
 
EMLex-Diachronic-lexicography-and-lexicology-1.ppt
EMLex-Diachronic-lexicography-and-lexicology-1.pptEMLex-Diachronic-lexicography-and-lexicology-1.ppt
EMLex-Diachronic-lexicography-and-lexicology-1.ppt
 
dokumen.tips_what-is-apa-apa-format-is-a-standard-set-of-conventionsrules-for...
dokumen.tips_what-is-apa-apa-format-is-a-standard-set-of-conventionsrules-for...dokumen.tips_what-is-apa-apa-format-is-a-standard-set-of-conventionsrules-for...
dokumen.tips_what-is-apa-apa-format-is-a-standard-set-of-conventionsrules-for...
 
DESCRIPTIVE.pptx
DESCRIPTIVE.pptxDESCRIPTIVE.pptx
DESCRIPTIVE.pptx
 
Sociolinguistics 2.pptx
Sociolinguistics 2.pptxSociolinguistics 2.pptx
Sociolinguistics 2.pptx
 
Morphology.2.ppt
Morphology.2.pptMorphology.2.ppt
Morphology.2.ppt
 
chapter_5_3.pptx
chapter_5_3.pptxchapter_5_3.pptx
chapter_5_3.pptx
 
Word-Formation-Presentation.pptx
Word-Formation-Presentation.pptxWord-Formation-Presentation.pptx
Word-Formation-Presentation.pptx
 
lingua-franca-151019083140-lva1-app6892.pdf
lingua-franca-151019083140-lva1-app6892.pdflingua-franca-151019083140-lva1-app6892.pdf
lingua-franca-151019083140-lva1-app6892.pdf
 
Phonetic-Chart-BE.pdf
Phonetic-Chart-BE.pdfPhonetic-Chart-BE.pdf
Phonetic-Chart-BE.pdf
 
verbrevf3.ppt
verbrevf3.pptverbrevf3.ppt
verbrevf3.ppt
 
9Identity.ppt
9Identity.ppt9Identity.ppt
9Identity.ppt
 
Assimilation.pdf
Assimilation.pdfAssimilation.pdf
Assimilation.pdf
 
scl.ppt
scl.pptscl.ppt
scl.ppt
 
meaning.ppt.pdf
meaning.ppt.pdfmeaning.ppt.pdf
meaning.ppt.pdf
 
lecturer-website-result-final.pdf
lecturer-website-result-final.pdflecturer-website-result-final.pdf
lecturer-website-result-final.pdf
 
5-Listening-Skills-2.ppt
5-Listening-Skills-2.ppt5-Listening-Skills-2.ppt
5-Listening-Skills-2.ppt
 
The Conditionals.pdf
The Conditionals.pdfThe Conditionals.pdf
The Conditionals.pdf
 

Recently uploaded

Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersChitralekhaTherkar
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 

Recently uploaded (20)

Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of Powders
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 

MixedLanguageProcessingTutorialEMNLP2019.pptx

  • 1. no me lebante ahorita cuz I felt como si me kemara por dentro jit fi la fin du mois de dece-mbre kan ljaw bared ktir wttalj Kibrisa geldigim … god warum? ich mochte nicht hier Sous la pluie mais beau tout de même, chère Ileana! Buona giornata a te e a tutti! Coridel Ent merilis full tracklist untuk debut mini album Jessica Jung yg akan segera rilis bulan Mei mendatang
  • 2. Code-mixing or Code-Switching is the mixing of two or more languages in a conversation or even an utterance. no me lebante ahorita cuz I felt como si me kemara por dentro jit fi la fin du mois de dece-mbre kan ljaw bared ktir wttalj Kibrisa geldigim … god warum? ich mochte nicht hier Sous la pluie mais beau tout de même, chère Ileana! Buona giornata a te e a tutti! Coridel Ent merilis full tracklist untuk debut mini album Jessica Jung yg akan segera rilis bulan Mei mendatang
  • 3. Processing & Understanding Mixed Language Data Monojit Choudhury1, Anirudh Srinivasan1, Sandipan Dandapat2, Kalika Bali1* 1MICROSOFT RESEARCH LAB INDIA 2MICROSOFT INDIA DEVELOPMENT CENTER E M N L P - I J C N L P T u t o r i a l [ T 2 ]  3 r d N o v e m b e r 2 0 1 9  H o n g K o n g
  • 5. You are in safe hands ;p
  • 6. Why this tutorial? Code-mixing is hot right now! Industry is interested • 50% queries to Ruuh (Microsoft chatbot) are code-mixed • People are talking to Alexa in code-mixing • 2-20% posts on Twitter and Facebook are code-mixed. 2 1 2 1 1 1 4 1 19 6 32 13 59 18 0 10 20 30 40 50 60 70 Number of papers in ACL anthology with code-mixing or related terms in the title or abstract.
  • 7. After this tutorial, you will … • know how languages interact in multilingual societies • understand why code-mixing is a difficult (and therefore, interesting) problem • be able to appreciate the challenges and nuances of code-mixed dataset creation • have some idea about the different NLP tasks and research that has been happening • be able to make better and more informed decisions about designing code-mixed NLP systems
  • 8. ML approaches and techniques for solving code-mixing are identical to those for monolingual NLP tasks. Differences exist in… PRIORITIES OF TASKS DATA COLLECTION AND PREPARATION STRATEGIES OPTIMAL USE OF EXISTING RESOURCES USER-CENTRIC DESIGN OF (CODE-MIXED) NLP SYSTEMS
  • 9. Setting mixed expectations … Text  Speech Design  Implementation Deep Linguistics  Deep learning Map the field  Cover all research Insights from industry  Building large scale systems
  • 10. Outline • Prologue • Definitions & some linguistic primer • Challenges and Solutions • SOTA in NLP tasks • Data and Evaluation • Language Modeling and Word Embedding • Pragmatic and Social Functions • Epilogue BREAK (10:30 – 11:00)
  • 11. Definitions and some Linguistic primer
  • 12. Mixing vs. switching Matrix language defines the grammatical structure of the sentence/clause Sub-clausal syntactic units from another language, called the embedded language, can be inserted within the matrix structure. Code Switching: When matrix changes across sentences/clauses, but no embedding Code Mixing: When there is an embedded language Lawyer: Minal: Lawyer: Minal: Lawyer: Minal-ji, aap smile karti rahi? Extra-friendly thi aap? [Ms. Minal, were you smiling and being extra-friendly] I was normal. What? I was normal. Normal. Khana-pina normal. Hasna [food and drink normal, smiling]
  • 13. Language Interactions in Multilingual Society Cognitive Integration Performance Integration Low = distinct languages High = same language Low = infrequent interleaving High = frequent interleaving Multilingual Discourse Loan words/bor rowing Code- switching Code- mixing Fused lect
  • 15. Code-mixing • Happens in all multilingual societies • Is predominantly a spoken language phenomenon • Is generally associated with informal conversations • Has well-defined socio-pragmatic functions
  • 17. Monolingual as well as Multilingual NLP systems break down in the presence of code-mixing Cortana, aaj Hyderabad ka weather kaisa hai? Is it raining ya sunny day hai? Adik… sem brape boleh bwak kenderaan? normal parent question – UiTMLendufornia Social Media Analytics Intersteller es una amazing movie!
  • 18. Hindi-English Code-Switching on Social Media In public pages from Facebook (of Indian celebrities, movies and BBC Hindi News) • ALL sufficiently long threads were multilingual • 17.2% of the comments/posts have code-mixing Bali et al. I am borrowing ya mixing: An analysis of English-Hindi Code-mixing in Facebook. 1st Workshop on Computational Approaches to Code-switching, EMNLP 2014
  • 19. Worldwide language distribution of monolingual and code-switched tweets, computed over 50M tweets (restricted to the 7 languages). 3.5% of tweets are code-switched. Rijhwani et al. ACL 2017
  • 20. Geographical Distribution of Code-switching on 8M Tweets from 24 cities
  • 21. We might praise you in English, but gaali to Hindi me hi denge! (Rudra et al., EMNLP 2016) Study of 830K Tweets from Hi-En bilinguals 1. The native language, Hindi, is strongly preferred (10 times more) for negativity and swearing 2. English is used far more for positive sentiment than negative 3. Language change often corresponds with changing sentiment Hindi English Fraction of tweets with swear words
  • 22. Inferences drawn from data in a single (usually the majority) language are likely to be misleading for multilingual societies.
  • 23. Why is it Challenging? Problem of Data Code-mixing is predominantly a spoken phenomenon, so there are no large text corpora. Model Explosion With n languages, there are O(n²) potential code-mixed pairs. Reusing Models How to exploit the monolingual models and data for code-mixing.
  • 24. How to solve it? • Combine monolingual models • Combine monolingual data • Use synthetic code-mixed data
  • 25. Computational Models of Code-Switching • Supervised i.e., from scratch • Divide & Conquer • Combining Monolingual Models • Zero-shot learning Annotated Code- mixed Data Code- switched Model
  • 26. Computational Models of Code-Switching • Supervised i.e., from scratch • Divide & Conquer • Combining Monolingual Models • Zero-shot learning Code-switched Text or speech LID L1 fragment L2 fragment L1 model L2 model Vyas et al. 2014. En-Hi POS Tagging
  • 27. Computational Models of Code-Switching • Supervised aka from scratch • Divide & Conquer • Combining Monolingual Models • Zero-shot learning Code-switched Text or speech LID L1 model L2 model Combination Logic or ML Solorio and Liu (EMNLP 2008): En-Es POS Tagging Also Multilingual ASRs
  • 28. Computational Models of Code-Switching • Supervised aka from scratch • Divide & Conquer • Combining Monolingual Models • Zero-shot learning Code- switched Model L1 Data L2 Data Schuster et al. 2016: Zeroshot translation with Google’s Multilingual Neural Machine translation System Artexe and Shwenk. 2019: Massively multilingual sentence embeddings for zeroshot crosslingual transfer and beyond.
  • 30. Code-mixed Speech and NLP tasks Every Speech and NLP task that takes input beyond lexical information has a counterpart code-mixed task ◦ Sub-sentential, sentence, conversation etc. ◦ There are few tasks which address morpheme-level code switching Code-mixed tasks Speech ASR TTS Text Word level Lang. Identification POS Tagging NER Sentence level Sentiment Analysis Language Model Parsing Applications Question Answering Machine Translation Information retrieval
  • 31. Areas #papers Shared Tasks Language Identification 39 CALCS 2014, 2016 Sentiment Analysis 23 Semeval 2019, TRAC 2018, ICON 2017 ASR 24 NER 13 CALCS 2018 POS 14 ICON 2016 TTS 9 Parsing 6 Language modelling 8 Translation 4 QnA 4 Statistics of papers from the ACL anthology that mention code-mixing, code-switching, etc.; for speech work, Interspeech and ICASSP are also considered.
  • 32. Language Identification Microsoft ne ek worldwide Hackathon organize kiya NE Hi Hi En En En Hi The task is to label each word in a text with a language from a set L or a named entity ◦ Preprocessing for the downstream NLP tasks ◦ Techniques include ◦ Dictionary look up ◦ Sequence labelling approaches Wat n awesum movie it wazzzz! sabko dekhna chahiye Dilwale vs. Bajirao Mastani: Even Super-Films Get the Monday Blues
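The dictionary look-up technique mentioned above can be sketched as follows. This is a minimal illustration, not a production LID system: the tiny word lists are hypothetical stand-ins for real lexicons, and the tag set (Hi/En/Unk) is simplified (no separate NE tag).

```python
# Minimal word-level language identification by dictionary look-up.
# HI_WORDS and EN_WORDS are illustrative stand-ins for real lexicons.
HI_WORDS = {"ne", "ek", "kiya", "sabko", "dekhna", "chahiye"}
EN_WORDS = {"worldwide", "hackathon", "organize", "movie", "it"}

def label_words(tokens):
    """Tag each token Hi, En, or Unk by lexicon membership."""
    labels = []
    for tok in tokens:
        t = tok.lower()
        if t in HI_WORDS and t not in EN_WORDS:
            labels.append("Hi")
        elif t in EN_WORDS and t not in HI_WORDS:
            labels.append("En")
        else:
            labels.append("Unk")  # ambiguous, named entity, or out-of-vocabulary
    return labels

print(label_words("Microsoft ne ek worldwide Hackathon organize kiya".split()))
# -> ['Unk', 'Hi', 'Hi', 'En', 'En', 'En', 'Hi']
```

In practice, dictionary look-up alone fails on ambiguous and romanized words (e.g. "it" is both an English pronoun and a plausible Hindi transliteration fragment), which is why the sequence-labelling approaches above are preferred.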
  • 33. Use of LID Code-switched Text or speech LID L1 model L2 model Combination Logic or ML Code-switched Text or speech LID L1 fragment L2 fragment L1 model L2 model
  • 34. Pairwise Language Labeling: Approach Technique: Use your favorite Sequence Labeling technique E.g., HMM, Conditional Random Fields, RNN Data: ◦ EMNLP 2014 Code-Switching Dataset ◦ FIRE Language Detection Dataset
  • 35. Finer Models Semi-supervised Learning with Weak Labeling (Technique: Hidden Markov Models) Monolingual (Labeled) Tweets Unlabeled Tweets Initial Model
  • 36. Initial Model from Weakly Labeled Data [Diagram: HMM with Start and End states and, for each language (En, Ge, Fr), a language state plus an other-word state (X_En, X_Ge, X_Fr)]
  • 37. Updating the probabilities [Diagram: the HMM's transition probabilities re-estimated on unlabeled data; shown values include 0.8, 0.15, 0.05 and 0.015, 0.015, 0.79, 0.04, 0.14]
  • 38. Correctly Labeled: @crystal_jaimes no me lebante ahorita cuz I felt como si me kemara por dentro ! :o Then I started getting all red , I think im allergic a algo What was your favourite moment at the concert ? Was war für euch der schönste Moment ? Errors: RT @lolsoufixe : remember when pensavam que a minha cadela aka nina se chamava Irina XINGIE , nouvel de disponible dès aujourd'hui release party jeudi aux bains ... Some examples English Other language X
  • 39. Our current LID system can handle 25 Languages: Catalan, Indonesian, Czech, Italian, Danish, Latvian, Estonian, Malay, Finnish, Norwegian, French, Polish, Croatian, Romanian, Hungarian, Slovak, Tagalog, Slovene, and the 7-language set Dutch, English, French, German, Portuguese, Spanish, Turkish. [Chart: word labeling accuracy (87-97% axis) for HMM (2), HMM (7) and HMM (25), i.e. models covering 2, 7 and 25 languages]
  • 40. Machine Translation 4-6% of tweets found in Bing translation are code-mixed. Input (Hindi-English, transliterated) → Translation: "haan. main haayar ejukeshan kiya hoon." → "Yes I have higher ejayuukeshan."; "main abhee tak shaadee nahin kiya hoon. matalab anamaireed hoon." → "I'm not married yet. I mean Anamairid."; "hamm! ekchualee, kriket mein mujhe achchha lagata hain." → "Hmm! Ekachualali Ahha, I feel good in cricket." The problem is more intense if the input is Romanized, less intense if mixed script is used.
  • 41. Machine Translation for Code-mixed input Merci beaucoup à tout le monde pour les messages. Grazie ancora per gli auguri Thanks much to everyone for messages. Thanks again for your good wishes. Direct Translation Language Detection Fr En MT It En MT In process of integration with Bing MT for 7 Languages (En, De, Es, Pt, Fr, Tr, Du) Merci beaucoup à tout le monde pour les messages. Grazie ancora per gli auguri Thanks much to everyone for messages. Grazie ancora per gli wishes Fr  En
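The detect-then-route pipeline on the slide above can be sketched as below. This is a toy illustration, not Bing's implementation: `detect_language` and `translate` are hypothetical stand-ins (backed here by a tiny lexicon and canned outputs) for real LID and MT services.

```python
# Sketch of the detect-then-route MT pipeline for code-mixed input:
# segment the input, identify each segment's language, and send each
# segment to the corresponding MT model.

def detect_language(segment):
    """Toy LID: a French function-word lexicon; everything else is 'it' here."""
    fr = {"merci", "beaucoup", "à", "tout", "le", "monde", "pour", "les", "messages."}
    return "fr" if all(w.lower() in fr for w in segment.split()) else "it"

def translate(segment, src, tgt="en"):
    """Stand-in for per-language-pair MT models, with canned outputs."""
    canned = {
        ("fr", "Merci beaucoup à tout le monde pour les messages."):
            "Thanks much to everyone for messages.",
        ("it", "Grazie ancora per gli auguri"):
            "Thanks again for your good wishes.",
    }
    return canned[(src, segment)]

def translate_mixed(text):
    # Split into sentence-level segments and route each through its own MT model.
    segments = [s.strip() for s in text.split("\n") if s.strip()]
    return " ".join(translate(s, detect_language(s)) for s in segments)

print(translate_mixed(
    "Merci beaucoup à tout le monde pour les messages.\nGrazie ancora per gli auguri"))
```

As the slide shows, direct translation with a single model leaves the Italian fragment half-translated ("Grazie ancora per gli wishes"); routing per-segment avoids that, at the cost of depending on segment-level LID accuracy.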
  • 42. MT for code-switching is a hard problem! "… we can handle input with code-switching … In practice, it is not too hard to find examples where code-switching in the input does not result in good outputs; in some cases the model will simply copy parts of the source sentence instead of translating it." Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation November, 2016
  • 43. Machine Translation Some insights into code-mixed translation with different scripts
  • 45. The Three Fundamental Problems of CM DATA 1. Where to get the data from? 2. How to characterize the nature of code-mixing? 3. How to label the given data?
  • 46. Where to get the data from? • BANGOR-MIAMI: En-Es, 54 conversations • SEAME (63 Hrs), HKUST (5+15 hrs), CECOS (12 Hrs), CUMIX (17 hrs): En-Mandarin • MCSM: French-Arabic • Malay-En, Frisian-Dutch, Hindi-English Ideally, transcribed conversational speech • WhatsApp and Facebook conversation • Extracted Twitter conversations • Human-bot conversations • Privacy concern Next best is Text-based chat logs
  • 47. Where to get the data from? • User generated content on the Web • Twitter – most researched, but doesn’t allow distribution of tweet contents • Facebook – difficult to crawl • YouTube, Reddit comments Non-conversational text data • Movie scripts • Plays, podcasts, reality shows Scripted conversations
  • 48. Guess Why? POS tagging accuracies reported on the BANGOR-MIAMI (En-Es) corpus are in high 80s to mid 90s, whereas POS tagging accuracies of the best performing systems in the ICON 2017 shared task (En-Hi, En-Ta, …) was in mid-70s! ◦ More training data ◦ Inherently difficult language pair ◦ Patterns of code-mixing in the corpora are different
  • 49. Language Interactions in Multilingual Society Cognitive Integration Performance Integration Low = distinct languages High = same language Low = infrequent interleaving High = frequent interleaving Multilingual Discourse Loan words/bor rowing Code- switching Code- mixing Fused lect
  • 50. The Three Fundamental Problems of CM DATA 1. Where to get the data from? 2. How to characterize the nature of code-mixing? 3. How to label the given data?
  • 51. Comparing the level of code-mixing The fraction of words in the matrix language is not a good estimator (Gambäck and Das, 2014)
  • 52. Comparing the level of code-mixing w_L1 w_L1 w_L2 w_L2 vs. w_L1 w_L2 w_L1 w_L2: uses the number of code-alternation points per token (Gambäck and Das, 2016). Extended by also considering code alternation between two utterances.
  • 53. Comparing the level of code-mixing Ratio-based metrics M-index (Barnett et al., 2000): uses the ratio of languages in the corpus to measure the inequality of the language distribution Guzman et al., 2017 ◦ Language Entropy: # of bits needed to represent the distribution of languages ◦ I-Index: measures the total probability of switching in the corpus
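The ratio-based metrics above can be sketched over a sequence of word-level language tags as follows. These are simplified implementations that follow the cited definitions; the papers' exact corpus-level computations may differ in detail.

```python
from collections import Counter
from math import log2

def m_index(tags):
    """M-index (Barnett et al., 2000): (1 - sum p_j^2) / ((k - 1) * sum p_j^2),
    where p_j is the fraction of tokens in language j and k is the number of
    languages. 0 = monolingual, 1 = perfectly balanced."""
    counts = Counter(tags)
    k, n = len(counts), len(tags)
    s = sum((c / n) ** 2 for c in counts.values())
    return (1 - s) / ((k - 1) * s)

def language_entropy(tags):
    """Bits needed to represent the distribution of language tags."""
    n = len(tags)
    return -sum((c / n) * log2(c / n) for c in Counter(tags).values())

def i_index(tags):
    """I-index: fraction of adjacent token pairs where the language switches."""
    switches = sum(a != b for a, b in zip(tags, tags[1:]))
    return switches / (len(tags) - 1)

block = ["L1", "L1", "L2", "L2"]   # w_L1 w_L1 w_L2 w_L2
alt   = ["L1", "L2", "L1", "L2"]   # w_L1 w_L2 w_L1 w_L2
print(m_index(block), i_index(block), i_index(alt))
```

Note how the two sequences from the previous slide get the same M-index and language entropy (both are 50/50 mixes) but different I-indices, which is exactly why switch-point metrics are needed alongside ratio-based ones.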
  • 54. Comparing the level of code-mixing Time-course measures (Guzman et al., 2017) ◦ Measure the temporal distribution of C-S across the corpus ◦ Burstiness: bursty vs. periodic switching patterns ◦ Information required to describe the distribution of language spans
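Burstiness can be sketched over the lengths of same-language spans, using the general (σ − μ)/(σ + μ) form of the measure; this is an assumption-laden simplification, and Guzman et al.'s exact corpus-level computation may differ in detail.

```python
from statistics import mean, pstdev

def burstiness(tags):
    """Burstiness of switching: (sigma - mu) / (sigma + mu) over the lengths
    of same-language spans. -1 = perfectly periodic, +1 = maximally bursty.
    NOTE: a simplified sketch of the measure, not the paper's exact recipe."""
    spans, run = [], 1
    for a, b in zip(tags, tags[1:]):
        if a == b:
            run += 1
        else:
            spans.append(run)  # a switch point closes the current span
            run = 1
    spans.append(run)
    mu, sigma = mean(spans), pstdev(spans)
    return (sigma - mu) / (sigma + mu)

print(burstiness(["L1", "L2", "L1", "L2", "L1", "L2"]))  # perfectly periodic: -1.0
```

A long monolingual stretch followed by rapid alternation (bursty switching) yields a value well above -1, distinguishing corpora that the ratio-based metrics would treat as identical.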
  • 56. The Three Fundamental Problems of CM DATA 1. Where to get the data from? 2. How to characterize the nature of code-mixing? 3. How to label data?
  • 57. Annotation Standards • Sentiment, emotion, hate speech • Information retrieval • Machine translation No special treatment needed for code-mixing • POS Tagging • Parsing Monolingual standards need to be adapted • Word-level Language Detection • Discourse functions of code-mixing • ASR Transcription New standards need to be created
  • 58. Are UNIVERSAL Tagsets for POS and Dependency labels adequate for code-mixed languages? SOURCE: HTTP://WWW.AMITAVADAS.COM/CODE-MIXING.HTML
  • 59. Finding Annotators IT’S HARD TO FIND MANY BILINGUAL TURKERS FOR A SPECIFIC LANGUAGE PAIR, AND IMPOSSIBLE TO FIND EVEN ONE WHO KNOWS ALL LANGUAGES!
  • 60. Evaluation of CM systems EVALUATE AT CODE-MIXING POINTS Source: Utsab Barman (2019) Automatic Processing of Code-mixed Social Media Content. PhD Thesis. DCU
  • 61. Evaluation of CM systems Evaluate at code-mixing points Source: Pratapa et al. ACL 2018 Language Modeling Perplexity
  • 62. Language Models and Word-Embeddings
  • 63. What is Language Modeling • Assigning probabilities to sequences of words • p(w1 w2 … wn) = ∏_{k=1}^{n} p(wk | wk-1 … w1)
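The chain-rule decomposition above can be made concrete with a bigram approximation, p(w1 … wn) ≈ ∏_k p(wk | wk-1). The probabilities below are made up for illustration; a real LM would estimate them from a corpus (or with an RNN, as in the later slides).

```python
# Chain-rule sentence probability under a toy bigram model.
# "<s>" is a sentence-start symbol; the probabilities are illustrative only.
BIGRAM = {
    ("<s>", "main"): 0.5,
    ("main", "school"): 0.2,
    ("school", "gaya"): 0.4,
}

def sentence_prob(tokens):
    """p(w1 ... wn) ~= prod_k p(wk | w(k-1)) under the bigram approximation."""
    p, prev = 1.0, "<s>"
    for tok in tokens:
        p *= BIGRAM.get((prev, tok), 1e-6)  # tiny floor for unseen bigrams
        prev = tok
    return p

print(sentence_prob(["main", "school", "gaya"]))  # 0.5 * 0.2 * 0.4 = 0.04
```

For a code-mixed LM, the vocabulary spans both languages, so the model must also learn at which positions a cross-language bigram like ("main", "school") is plausible.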
  • 64. Why Language Modeling • Automatic Speech Recognition (ASR) systems need an LM • Downstream tasks like POS tagging and NER need some form of LM • The hot NLP topics now - Machine Translation and Language Generation - also need LMs • And how can we forget phone keyboards?
  • 65. Why Language Modeling • Say we have an LM that can properly code mix • Model can predict words from both languages • Model knows when to pick words from each language • Model knows when to code mix • If so, have we solved the problem of code mixing itself??
  • 66. Data ! Data ! Data ! • LMs require large amounts of UNLABELLED data • Unlike other NLP systems that can be trained on less LABELLED data • Monolingual LMs are trained on Wikipedia data • Wikipedia articles: English 5.9 M, German 2.3 M, French 2.1 M, Chinese 1.7 M, Esperanto 270 k, Hindi 133 k • Code-mixed corpora (sentences): Hindi-English (Chandu, et al. (2018)) 59 k, Mandarin-English (SEAME) 56 k
  • 67. Approaches: Something Simple • 1 RNN per language • Take turns outputting tokens • Which RNN’s turn – determined by a switch variable • Switch variable sampled from some distribution • Garg et al. (2017) Dual Language Models for Code Switched Speech Recognition.
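The dual-LM idea above can be sketched as follows. This is a toy version under stated assumptions: the two "models" are unigram samplers standing in for per-language RNNs, and the switch variable is a simple Bernoulli draw rather than a learned distribution.

```python
import random

# Toy sketch of a dual language model: one generator per language takes
# turns emitting tokens; a Bernoulli switch variable decides when control
# passes to the other language. L1/L2 word lists are illustrative stand-ins.
L1 = ["main", "ghar", "gaya"]            # "Hindi" unigram model
L2 = ["office", "meeting", "today"]      # "English" unigram model

def generate(n_tokens, p_switch=0.3, seed=0):
    """Generate n_tokens, switching language with probability p_switch."""
    rng = random.Random(seed)
    active, other = L1, L2
    out = []
    for _ in range(n_tokens):
        out.append(rng.choice(active))   # the active "model" emits a token
        if rng.random() < p_switch:      # the switch variable fires
            active, other = other, active
    return out

print(generate(8))
```

Replacing the unigram samplers with trained RNNs and learning p_switch from data recovers the shape of the Garg et al. (2017) approach; the weakness, addressed by the "more complex" approaches on the next slide, is that the two models share no context across the switch.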
  • 68. Approaches: More Complex • Handle Data Sparsity • Generate more code-mixed sentences • Model the switching constraint • Make the model learn when to switch • Share context between both RNNs
  • 69. One approach to handling data sparsity Language modeling for code-mixing: The role of linguistic theory based synthetic data, Pratapa et al. [2018]
  • 70. Use Linguistic Theories ?? • These theories date back to the early 1980s • They assume a syntactic relation between a pair of parallel sentences • Equivalence of grammar rules • Word or phrase level alignment • They propose ways to model CM and generate sentences • Sentences generated with linguistic-theory backing are bound to be better than random mixing
  • 71. Linguistic Theories for CM • The 3 theories • Equivalence Constraint (Sankoff and Poplack [1981], Sankoff [1998]) • Functional Head (Belazi et al. [1994]) • Embedded Matrix • Li and Fung (2014), Code switch language modeling with functional head constraint, used the functional head theory during the decoding phase of the LM component of an ASR system
  • 72. Example • She lives in a white house • Ella vive en una casa blanca
  • 73. EM theory • Replace a subtree in the matrix language with one from the embedded language • She lives in a white house / Ella vive en una casa blanca
  • 74. EM theory • in a white house / Ella vive en una casa blanca
  • 75. EM theory • Ella vive in a white house
  • 76. EC theory • All leaf nodes are swappable • After all swaps, check for constraints • Monolingual fragments must appear as in the original language • The EC constraint must be obeyed at every switch point • Each node in the tree is assigned a language id from its parent and from its children • Parent: based on the ordering of NTs in the RHS of the rule applied at the parent • Child: based on the language of the leaves, propagated upwards • Both have to match
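The bottom-up language-id propagation described above can be sketched as follows. This is a deliberately simplified illustration, not the full EC check of Pratapa et al. (2018): it only computes, for each node, whether all leaves below it agree in language, marking disagreement nodes as potential switch points; the parent-side assignment and the switch-point constraint itself are omitted. The tree encoding is an assumption for illustration.

```python
# A tree node is ("label", [children]); a leaf is ("word", "lang").
def propagate(node):
    """Bottom-up language propagation: a node gets a language id when
    all leaves below it agree, otherwise None, which marks a potential
    switch point that the EC constraint would then have to license."""
    if isinstance(node[1], str):          # leaf: (word, lang)
        return node[1]
    langs = {propagate(child) for child in node[1]}
    return langs.pop() if len(langs) == 1 else None

def is_monolingual_fragment(node):
    """A fragment is monolingual iff propagation yields a single id."""
    return propagate(node) is not None
```

For instance, the subtree for "vive en una casa blanca" propagates to a single Spanish id, while the whole mixed sentence propagates to None at the root, flagging the switch point to be checked.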
  • 77. [Figure: example parse trees labeled English, Spanish, Code-Mixed, and two ill-formed cases: a broken monolingual fragment and a language-tag mismatch]
  • 78. Training via a Curriculum • Using a curriculum improves the perplexity of the model • Baheti, et al. (2017) Curriculum Design for Code-switching: Experiments with Language Identification and Language Modeling with Deep Neural Networks • Monolingual -> Code Mixed • Pratapa, et al. (2018) • Generated Code Mixed -> Monolingual -> Real Code Mixed • Adding Real Code Mixed at the end is very useful • [Table: results from Pratapa, et al. (2018) on LM perplexity]
  • 79. Other Work - Handle Data Sparsity • Models that are trained on data generated by a SeqGAN • Garg, et al. (2018) Code-switched Language Models Using Dual RNNs and Same-Source Pretraining • Chang, et al. (2019) Code-switching Sentence Generation by Generative Adversarial Networks and its Application to Data Augmentation • [Figure: Chang, et al. (2019) GAN framework for generating sentences]
  • 80. Other Work - Handle Data Sparsity • Samanta, et al. (2019) A Deep Generative Model for Code-Switched Text present a VAE with an RNN-based encoder and decoder to generate sentences • [Figure: Samanta, et al. (2019), a VAE framework for generating sentences]
  • 81. Other Work - Handle Data Sparsity • Winata, et al. (2019) Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences use pointer generator networks to generate code mixed sentences • Lee, et al. (2019) Linguistically Motivated Parallel Data Augmentation for Code-switch Language Modeling use the EM theory at a phrase level to generate code mixed sentences Pointer Generator networks used in Winata, et al. (2019)
  • 82. Other Work - Model the switching constraint • Garg, et al. (2018) Code-switched Language Models Using Dual RNNs and Same-Source Pretraining • Output of 2 RNNs run through linear layer to get final output • Also train on data generated by a SeqGAN
  • 83. Other Work - Model the switching constraint • Adel, et al. (2013) Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling • Adel, et al. (2015) Syntactic and semantic features for code-switching factored language models
  • 84. Other Work - Model the switching constraint • Winata, et al. (2018) Code-Switching Language Modeling using Syntax-Aware Multi-Task Learning show that adding a POS tag prediction task to the LM improves perplexity • Soto and Hirschberg (2019) Improving Code-Switched Language Modeling Performance Using Cognate Features use features of cognates, i.e. words with a similar origin in both languages
  • 85. Other Work – Sharing context between the RNNs • Not as simple a task as it sounds • No current model is capable of this • On a related note • Multilingual deep contextual embeddings model multiple languages at once • Can these be made to code mix? • Artetxe and Schwenk (2019) Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond has an explicit language id as input to the decoder during training
  • 87. Work on Embeddings for code mixing • Most work - Bilingual Embeddings adapted for some tasks • What about learning embeddings from CM data?
  • 88. Bilingual Embeddings • Summarized in Upadhyay et al. (2016) • 4 methods • Bilingual Skip-Gram Model (BiSkip) - Luong et al. (2015) • Bilingual Compositional Model (BiCVM) - Hermann and Blunsom (2014) • Bilingual Correlation Based Embeddings (BiCCA) - Faruqui and Dyer (2014) • Bilingual Vectors from Comparable Data (BiVCD) - Vulić and Moens (2015) • Take monolingual embeddings and a corpus aligned at a certain level • Project those embeddings into a common space using the alignment
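The shared "project into a common space using the alignment" step can be sketched with a least-squares linear map learned from a seed dictionary of aligned vectors. This is the common idea behind offline projection methods, not an implementation of any of the four methods above; the 2-d embeddings are hypothetical toys and the normal-equations solver is hand-rolled for the 2x2 case.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def fit_projection(X, Y):
    """Least-squares map W minimizing ||XW - Y||_F via the normal
    equations: W = (X^T X)^{-1} X^T Y."""
    Xt = transpose(X)
    return matmul(inv2(matmul(Xt, X)), matmul(Xt, Y))

# Seed dictionary: aligned (source, target) embedding pairs (toy data).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # source-language vectors
Y = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]   # target-language vectors

W = fit_projection(X, Y)

def to_common_space(v):
    """Map a source-language vector into the target (common) space."""
    return matmul([v], W)[0]
```

Once W is learned, any source-language word vector can be compared directly against target-language vectors, which is what makes these embeddings usable for code-mixed downstream tasks.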
  • 89. Bilingual Embeddings for CM • Pratapa, et al. (2018) evaluated these for CM POS tagging and sentiment analysis • Pre-trained embeddings performed better than no embeddings • Embeddings learnt on synthetic code mixed data performed better
  • 90. Other work on Embeddings • Winata, et al. (2019) Hierarchical Meta-Embeddings for Code-Switching Named Entity Recognition show that combining multiple embeddings (word, subword, and character level) improves downstream tasks • Lee and Li (2019) Word and Class Common Space Embedding for Code-switch Language Modelling show that when using auxiliary features as input to an LM, constraining the embedding space of the words and these features improves LM perplexity • [Figure: Winata, et al. (2019)]
  • 91. Takeaways • So much is possible using linguistic theories • Solving LM for CM – solving CM • Direction of Future Work • Deep contextual embeddings ? • Zero shot transfer ?
  • 93. Language Preference: When and why do bilinguals prefer a certain language? • Topic change • Puns • Emphasis • Emotion • Reported speech • But it’s unpredictable!
  • 94. Linguistic Studies Until recently, no large-scale, data-driven validation of the hypotheses. Fishman (1971): - Use of English for professional settings, Spanish for informal chat Dewaele (2004, 2010): - The native language elicits stronger emotion - Preferred for emotion expression and swearing Nguyen (2014): - Code-choice as a social identity marker
  • 95. A great place to start your exploration! Computational Linguistics, 2016
  • 96. Initial quantitative studies • Jurgens, Dimitrov, and Ruths (2014) studied tweets written in one language but containing hashtags in another language • Nguyen, Trieschnigg, and Cornips (2015) studied users in the Netherlands who tweeted in a minority language (Limburgish or Frisian) as well as in Dutch. Most tweets were written in Dutch, but during conversations users often switched to the minority language (i.e., Limburgish or Frisian).
  • 97. We might praise you in English, but gaali to Hindi me hi denge! (Rudra et al., EMNLP 2016) • Study of 830K tweets from Hi-En bilinguals • 1. The native language, Hindi, is strongly preferred (10 times more) for negativity and swearing • 2. English is used far more for positive sentiment than negative • 3. Language change often corresponds with changing sentiment • [Figure: fraction of tweets with swear words, Hindi vs. English]
  • 98. Predicting Naijá-English code switching • Innocent Ndubuisi-Obi, Sayan Ghosh and David Jurgens (2019) Wétin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior in Nigerian Online Discussions. ACL. • 330K articles and accompanying 389K comments labeled for code switching behavior
  • 99. Predictive Factors of Naijá usage • Article topic • Social setting: number of prior comments, depth of thread • Social status: number of followers • Emotion • Tribal affiliation: Yoruba, Hausa-Fulani, Igbo, etc. (automatically labeled)
  • 100. Predictive Factors of Naijá usage (Findings) • Comments deeper in a reply thread are more likely to be in Naijá • Comments made in the evening are likely to be conversational, directed at a particular person • High status → more English, but with potential confounds • Strong sentiment → more Naijá
  • 101. Worldwide language distribution of monolingual and code-switched tweets computed over 50M Tweets (restricted to the 7 languages) 3.5% tweets are code-switched Rijhwani et al. ACL 2017
  • 102. Geographical Distribution of Code-switching on 8M Tweets from 24 cities
  • 103. The fraction of monolingual English tweets is strongly negatively correlated (-0.85) with the fraction of code-switched tweets • This is surprising, especially for extremely multilingual US cities (e.g., Houston) • (?) ACCULTURATION takes place much faster in the US
  • 104. EPILOGUE • If bots could code-mix…
  • 105. Code-choice as a Style dimension VIJAY: ek minute ke liye thoda practical socho. VIJAY: Main tumharey angle se hi soch raha hoon... Tum hi uncomfortable feel karogi... bahut time ho gaya hai... bahut fark aa gaya hai RANI: Kismein? Mujhmein koyi change nahin hai VIJAY: Vohi to baat hai... mujhmein hai... Meri duniya ... bilkul alag hai... ab... you’ll not fit in RANI: Matlab? ek dum se main tumharey jitni fancy nahin hoon...
  • 106. Code-choice accommodation in human-human conversations • Base rate of a style: how frequently a style (code) is used by a user • Style accommodation: how frequently a style (code) is used by a user when the preceding utterance contains that style (code) • Bawa et al. Workshop on Computational Approaches to Code-Switching, EMNLP 2018
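The two quantities can be sketched directly from a turn-annotated conversation. This is a simplified reading of the definitions above, not Bawa et al.'s actual pipeline: the `turns` format and the boolean "uses the marked code" flag are assumptions for illustration.

```python
def style_stats(turns, user):
    """Base rate: fraction of `user`'s utterances that use the marked
    code. Accommodation: fraction of `user`'s utterances that use the
    code when the immediately preceding utterance (by someone else)
    used it. `turns` is a list of (speaker, uses_code) pairs."""
    own = [flag for spk, flag in turns if spk == user]
    base_rate = sum(own) / len(own) if own else 0.0

    opportunities, followed = 0, 0
    for (prev_spk, prev_flag), (spk, flag) in zip(turns, turns[1:]):
        if spk == user and prev_spk != user and prev_flag:
            opportunities += 1
            followed += int(flag)
    accommodation = followed / opportunities if opportunities else 0.0
    return base_rate, accommodation
```

Comparing accommodation against the base rate is what distinguishes genuine style matching from a user who simply code-mixes often regardless of their interlocutor.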
  • 107. Nudge, don’t push or assume... • Human-human conversations show “positive accommodation” for the choice of the marked code • In a wizard-mediated-bot experiment, most users show a very strong preference for a bot that can code-mix • A small fraction of users have a negative opinion of a code-mixing bot, so it is important to nudge before mixing.
  • 108. Understanding code-mixing is not a luxury but a necessity for chatbots for multilingual societies
  • 109. Resources • Sitaram et al. (2019) A Survey of Code-switched Speech and Language Processing. arXiv. https://arxiv.org/abs/1904.00784 • https://github.com/gentaiscool/code-switching-papers • Project Melange: https://www.microsoft.com/en-us/research/project/melange • Please get in touch with us for a comprehensive list of datasets and resources covered in this tutorial.
  • 110. Tutorial me ane ke lie thank you! (Thank you for coming to the tutorial!) https://www.microsoft.com/en-us/research/project/melange/

Editor's Notes

  1. Data / evaluation: accuracy at switch points is important (Utsab, Yoav)
  2. Because in multilingual societies, if there is a language preference and we analyze text in only one language, the conclusions are likely to be wrong. For example, looking only at English tweets in the Indian context gives a much more positive picture than reality.
  3. Data is a problem
  4. Dependency between fragments
  5. But will it work? Human babies do so.
  6. Bilinguals have the choice between speaking in a single language as well as alternating languages in a conversation. when and why do bilinguals prefer a certain language? Several possible reasons have been observed in linguistics, for instance … Identifying language preference is a challenging problem, because, in general, it’s rather unpredictable.
  7. There have been linguistic studies on the subject of language preference since as early as 1971, when Fishman studied Spanish-English bilinguals in Puerto Rico. It was observed that English primarily featured in professional settings, while Spanish dominated informal conversation. A few decades later, Dewaele hypothesized that the native (or the primary) language elicits stronger emotion and is therefore used to express sentiment and for swearing. So, we have these studies that make certain claims about language preference – what’s missing? Unfortunately, …