Emotion recognition in software engineering texts is critical for understanding developer expressions and improving collaboration. This paper presents a comparative analysis of state-of-the-art Pre-trained Language Models (PTMs) for fine-grained emotion classification on two benchmark datasets from GitHub and Stack Overflow. We evaluate six transformer models (BERT, RoBERTa, ALBERT, DeBERTa, CodeBERT, and GraphCodeBERT) against the current best-performing tool, SEntiMoji. Our analysis reveals consistent improvements ranging from 1.17\% to 16.79\% in macro-averaged and micro-averaged F1 scores, with general-domain models outperforming specialized ones. To further enhance PTMs, we incorporate polarity features into the attention layer during training, yielding additional average gains of 1.0\% to 10.23\% over the baseline PTM approaches. Our work provides strong evidence for the advancements afforded by PTMs in recognizing nuanced emotions such as Anger, Love, Fear, Joy, Sadness, and Surprise in software engineering contexts. Through comprehensive benchmarking and error analysis, we also outline the scope for improvements to address contextual gaps.
Emotion Classification In Software Engineering Texts: A Comparative Analysis of Pre-trained Transformers Language Models
1. Mia Mohammad Imran
Virginia Commonwealth University
2. What are Software Engineering Texts?
● Chats
● PR comments
● Issue comments
● Commit messages
● GitHub discussions
● Stack Overflow
● Mailing list
3. “Programmers Have Feelings Too!”
Appreciation 🙏
@[USER] Thank you, Stephen. I hope in the future Angular will become even better and easier to understand. However, first of all, I am grateful to Angular for making me grow as a developer.
Anger 🤬
Soooooooooooo you’re setting Angular on fire and saying bold sh*t in bold like the Angular team don’t care about you cause you found relative pathing has an issue is an odd area
5. Benefits of Emotional Intelligence
● 01 Awareness: Self-reflect and seek feedback
● 02 Empathy: Understand and respect diverse perspectives
● 03 Regulation: Manage emotions to maintain focus
● 04 Social Skills: Enhance communication and teamwork
● 05 Motivation: Drive innovation and consistent contribution
6. Study Design and Goals
● Purpose: To investigate how PTMs perform in Emotion
Classification task in software engineering text
● Establish a Benchmark against state-of-the-art tool
● Identify strengths, limitations, and error patterns of PTMs in
this domain
● Propose techniques to improve classifications
7. Research Questions
● RQ1: How accurately can PTMs classify emotions compared to
the state-of-the-art model?
● RQ2: Can integrating polarity features during training improve
PTMs' emotion classification ability?
9. Emotion Models
● Theoretical frameworks to represent emotions
● Shaver’s tree-structured model is most commonly used in
Software Engineering Research
○ 6 primary categories, 25 secondary categories and over 100
tertiary categories
10. Emotion Models: Shaver’s Taxonomy
● 6 primary categories:
○ Anger 😡
○ Love ❤️
○ Fear 😨
○ Joy 😊
○ Sadness 😥
○ Surprise 😲
11. Shaver’s Taxonomy: Mapping Example
Excitement → Joy
“Every time you comment I realize something new about JS or TS. This is very exciting. 😊”
Worry → Fear
“Feel free to file a bug for that - that code has a history of breaking :”
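The mapping above can be sketched as a simple lookup from secondary to primary categories in Shaver's taxonomy. Only a handful of the 25 secondary categories are shown here for illustration; a full implementation would enumerate all of them.

```python
# Sketch of a Shaver-style mapping from secondary to primary emotion
# categories, following the slide's examples (Excitement -> Joy,
# Worry -> Fear). Entries beyond those two are illustrative.

SECONDARY_TO_PRIMARY = {
    "Excitement": "Joy",
    "Worry": "Fear",
    "Rage": "Anger",
    "Affection": "Love",
}

def primary_emotion(secondary: str) -> str:
    """Collapse a fine-grained emotion label to its primary category."""
    return SECONDARY_TO_PRIMARY.get(secondary, "Unknown")

print(primary_emotion("Excitement"))  # Joy
print(primary_emotion("Worry"))       # Fear
```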
12. RQ1: How accurately can PTMs
classify emotions compared to
the state-of-the-art model?
13. State-of-the-Art Models
SEntiMoji [1]: Transfer learning neural network
[1] Chen et al. “Emoji-powered sentiment and emotion detection from software developers' communication data.” TOSEM, 2021
● Studies show that general purpose tools perform poorly in
SE text
● All tools perform one-vs-all predictions for all 6 basic
emotions (Anger, Love, Fear, Joy, Sadness, and Surprise)
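The one-vs-all setup described above can be sketched as six independent binary predictions, one per primary emotion. The keyword cues below are hypothetical stand-ins for trained classifiers, used only to show the prediction structure.

```python
# Minimal sketch of one-vs-all emotion prediction: one independent
# binary decision per primary emotion, so an utterance can carry
# several emotions at once. The cue-word sets are illustrative only;
# a real tool like SEntiMoji learns these signals from data.

EMOTIONS = ["Anger", "Love", "Fear", "Joy", "Sadness", "Surprise"]

CUES = {  # hypothetical per-emotion cue words
    "Anger": {"furious", "annoying", "hate"},
    "Love": {"thank", "grateful", "appreciate"},
    "Fear": {"worried", "afraid", "breaking"},
    "Joy": {"exciting", "glad", "nice"},
    "Sadness": {"unfortunately", "sad", "sorry"},
    "Surprise": {"unexpected", "wow", "surprisingly"},
}

def predict_one_vs_all(text: str) -> dict:
    """Run six independent binary predictions over one utterance."""
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    return {emotion: bool(tokens & CUES[emotion]) for emotion in EMOTIONS}

print(predict_one_vs_all("Thank you, this is very exciting!"))
```

Note that, unlike single-label classification, this scheme naturally handles utterances expressing multiple emotions (here both Love and Joy).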
14. Compared Pre-trained Language Models
● BERT: First major transformer model applied to NLP
● RoBERTa: An optimized version of BERT
● ALBERT: Lighter, faster BERT with shared layers
● DeBERTa: Enhanced BERT with disentangled attention
● CodeBERT: BERT model specialized for code
● GraphCodeBERT: CodeBERT enhanced with graph data
15. Evaluating the Models
● Goal: Assess effectiveness of PTMs against SotA model
● On 2 datasets
○ Stack Overflow Dataset
○ GitHub Dataset
● 80% train set, 20% test set with stratified sampling
[1] Novielli et al., “A gold standard for emotion annotation in stack overflow.” MSR 2018
[2] Imran et al., “Data augmentation for improving emotion recognition in software engineering communication.” ASE 2022
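The 80/20 stratified split mentioned above can be sketched as sampling within each label group, so train and test sets keep the same label proportions. The dataset below is a toy placeholder, not the actual benchmark data.

```python
# Sketch of an 80/20 stratified split: shuffle indices within each
# label group and carve off the test fraction per group, preserving
# label proportions in both splits. Toy data, illustrative only.
import random
from collections import defaultdict

def stratified_split(examples, labels, test_frac=0.2, seed=42):
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)  # fixed seed for reproducibility
    train, test = [], []
    for label, idxs in by_label.items():
        rng.shuffle(idxs)
        cut = int(len(idxs) * test_frac)
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return [examples[i] for i in train], [examples[i] for i in test]

texts = [f"utterance {i}" for i in range(100)]
labels = ["Joy"] * 80 + ["Anger"] * 20  # imbalanced, like real emotion data
train, test = stratified_split(texts, labels)
print(len(train), len(test))  # 80 20
```

Stratification matters here because emotion labels are heavily imbalanced; a plain random split could leave a rare emotion absent from the test set.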
16. Evaluation Metric
● F1-score: Harmonic mean of precision and recall
○ For overall performance: micro-averaged and macro-averaged
F1-score
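The difference between the two averages can be shown with a small worked example: macro averages the per-class F1 scores, while micro pools the raw counts first. The counts below are made up for illustration.

```python
# Worked sketch of micro- vs macro-averaged F1 from per-class counts.
# (tp, fp, fn) values per emotion class are hypothetical.

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

counts = {"Joy": (30, 10, 10), "Anger": (5, 5, 15)}

# Macro: average the per-class F1 scores (each class weighted equally).
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, then compute one global F1
# (dominated by the more frequent classes).
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro = f1(tp, fp, fn)

print(round(macro, 3), round(micro, 3))  # 0.542 0.636
```

Macro-averaged F1 is the stricter metric here: the poorly classified minority class (Anger) drags it down, while micro-averaging hides that weakness behind the majority class.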
17. Results (Average F1-score)

GitHub
Model          Micro Avg.  Macro Avg.
SEntiMoji      0.530       0.521
BERT           0.585       0.591
RoBERTa        0.575       0.590
ALBERT         0.538       0.539
DeBERTa        0.610       0.608
CodeBERT       0.545       0.555
GraphCodeBERT  0.549       0.549

Stack Overflow
Model          Micro Avg.  Macro Avg.
SEntiMoji      0.714       0.530
BERT           0.754       0.588
RoBERTa        0.758       0.599
ALBERT         0.747       0.584
DeBERTa        0.756       0.607
CodeBERT       0.728       0.567
GraphCodeBERT  0.722       0.552
18. Error Analysis
● Error Categorization by Novielli et al. [1]
[1] Novielli, Nicole et al. "A benchmark study on sentiment analysis for software engineering research." 2018 MSR.
General Error
Implicit Sentiment Polarity
Pragmatics
Figurative Language
Politeness
Polar Facts
Subjectivity in Annotation
19. Error Analysis on GitHub Dataset
General Error: the inability to recognize lexical cues that occur in the text
Nice, this is more slick 👍
Implicit Sentiment Polarity: humans use common knowledge to recognize
emotions that the models miss
Patiently waiting for any updates. […]
Surprisingly, misclassifications persist even in the presence of emojis:
And yes, there should be tests 😱😱😱
20. RQ2: Can integrating polarity
features during training improve
PTMs' emotion classification
ability?
21. RQ2 Methodology
● Integrate polarity features through token-level attention
adjustment
● Assign greater significance to tokens linked with polarity words
during fine-tuning
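A minimal sketch of this token-level adjustment, assuming an additive bias applied to raw attention scores before the softmax: tokens found in a polarity lexicon get a boost so the model attends to them more. The lexicon, scores, and boost value are hypothetical, not the paper's actual parameters.

```python
# Illustrative sketch of polarity-aware attention adjustment: tokens
# matching a polarity lexicon receive an additive bias before the
# softmax, raising their attention weight during fine-tuning.
# Lexicon and BOOST value are hypothetical.
import math

POLARITY_LEXICON = {"exciting", "broken", "grateful", "terrible"}
BOOST = 1.0  # hypothetical additive bias for polarity tokens

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def polarity_adjusted_attention(tokens, raw_scores):
    """Boost raw attention scores for tokens with known polarity."""
    biased = [
        score + (BOOST if tok.lower() in POLARITY_LEXICON else 0.0)
        for tok, score in zip(tokens, raw_scores)
    ]
    return softmax(biased)

tokens = ["this", "is", "very", "exciting"]
weights = polarity_adjusted_attention(tokens, [0.1, 0.1, 0.1, 0.1])
print(max(weights) == weights[-1])  # "exciting" gets the most attention
```

In a real transformer this bias would be applied inside the attention layers over subword tokens; the sketch only shows the weighting idea in isolation.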
25. Error Analysis on GitHub Dataset
● In RQ1, there were 67 cases that all models misclassified
○ After polarity enhancement, at least one model made a correct prediction in 27/67 cases
● Most improved categories:
○ General error (13/29 cases resolved)
○ Implicit polarity (9/18 cases resolved)
○ Politeness (2/3 cases resolved)
● Least improved categories:
○ Pragmatics (6/7 cases remain unresolved)
○ Figurative Language (6/9 cases remain unresolved)
● A considerable number of misclassified utterances still contain emojis
26. Key Takeaways
● General PTMs excel in emotion classification within SE texts
compared to SE-specific models
● Polarity features enhance performance consistently
○ Challenges persist especially with negative emotions
● No single model excels across all emotions and metrics
● Common error categories are usually context dependent:
implicit polarity, figurative language, pragmatics
● Challenges in handling emojis
27. Future Directions
● Establish more benchmark datasets
● Investigate hierarchical emotion classification (two-step)
○ Enhance performance by identifying broad emotional valence before
specific categories
● Investigate aspect-based sentiment analysis (ABSA)-enhanced PTMs
● Fusion of text and emoji cues during pre-training/fine-tuning
● Explore generative language models for emotion detection
○ Utilize zero-shot and few-shot learning for data augmentation and
prompting techniques
● Focus on detecting emotions that may harm productivity (e.g.,
Frustration)