Emotion recognition in software engineering texts is critical for understanding developer expressions and improving collaboration. This paper presents a comparative analysis of state-of-the-art Pre-trained Language Models (PTMs) for fine-grained emotion classification on two benchmark datasets from GitHub and Stack Overflow. We evaluate six transformer models (BERT, RoBERTa, ALBERT, DeBERTa, CodeBERT, and GraphCodeBERT) against the current best-performing tool, SEntiMoji. Our analysis reveals consistent improvements ranging from 1.17\% to 16.79\% in macro-averaged and micro-averaged F1 scores, with general-domain models outperforming specialized ones. To further enhance PTMs, we incorporate polarity features into the attention layer during training, yielding additional average gains of 1.0\% to 10.23\% over the baseline PTM approaches. Our work provides strong evidence for the advancements afforded by PTMs in recognizing nuanced emotions such as Anger, Love, Fear, Joy, Sadness, and Surprise in software engineering contexts. Through comprehensive benchmarking and error analysis, we also outline the scope for improvements to address contextual gaps.
Emotion Classification In Software Engineering Texts: A Comparative Analysis of Pre-trained Transformers Language Models
1. Mia Mohammad Imran
Virginia Commonwealth University
2. What are Software Engineering Texts?
● Chats
● PR comments
● Issue comments
● Commit messages
● GitHub discussions
● Stack Overflow
● Mailing list
3. “Programmers Have Feelings Too!”
Appreciation 🙏
@[USER] Thank you, Stephen. I hope in the future Angular will become even better and easier to understand. However, first of all, I am grateful to Angular for making me grow as a developer.
Anger 🤬
Soooooooooooo you’re setting Angular on fire and saying bold sh*t in bold like the Angular team don’t care about you cause you found relative pathing has an issue is an odd area
5. Benefits of Emotional Intelligence
● 01 Awareness: Self-reflect and seek feedback
● 02 Empathy: Understand and respect diverse perspectives
● 03 Regulation: Manage emotions to maintain focus
● 04 Social Skills: Enhance communication and teamwork
● 05 Motivation: Drive innovation and consistent contribution
6. Study Design and Goals
● Purpose: To investigate how PTMs perform in Emotion
Classification task in software engineering text
● Establish a Benchmark against state-of-the-art tool
● Identify strengths, limitations, and error patterns of PTMs in
this domain
● Propose techniques to improve classifications
7. Research Questions
● RQ1: How accurately can PTMs classify emotions compared to
the state-of-the-art model?
● RQ2: Can integrating polarity features during training improve
PTMs' emotion classification ability?
9. Emotion Models
● Theoretical frameworks to represent emotions
● Shaver’s tree-structured model is most commonly used in
Software Engineering Research
○ 6 primary categories, 25 secondary categories and over 100
tertiary categories
10. Emotion Models: Shaver’s Taxonomy
● 6 primary categories:
○ Anger 😡
○ Love ❤️
○ Fear 😨
○ Joy 😊
○ Sadness 😥
○ Surprise 😲
11. Shaver’s Taxonomy: Mapping Example
Excitement → Joy
“Every time you comment I realize something new about JS or TS. This is very exciting. 😊”
Worry → Fear
“Feel free to file a bug for that - that code has a history of breaking :”
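The mapping above can be sketched as a simple lookup from secondary to primary categories in Shaver's taxonomy. Only a handful of the 25 secondary categories are shown here for illustration; a full implementation would enumerate all of them.

```python
# Sketch of a Shaver-style mapping from secondary to primary emotion
# categories, following the slide's examples (Excitement -> Joy,
# Worry -> Fear). Entries beyond those two are illustrative.

SECONDARY_TO_PRIMARY = {
    "Excitement": "Joy",
    "Worry": "Fear",
    "Rage": "Anger",
    "Affection": "Love",
}

def primary_emotion(secondary: str) -> str:
    """Collapse a fine-grained emotion label to its primary category."""
    return SECONDARY_TO_PRIMARY.get(secondary, "Unknown")

print(primary_emotion("Excitement"))  # Joy
print(primary_emotion("Worry"))       # Fear
```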
12. RQ1: How accurately can PTMs
classify emotions compared to
the state-of-the-art model?
13. State-of-the-Art Models
SEntiMoji [1]: Transfer learning neural network
[1] Chen et al. “Emoji-powered sentiment and emotion detection from software developers' communication data.” TOSEM, 2021
● Studies show that general purpose tools perform poorly in
SE text
● All tools perform one-vs-all predictions for all 6 basic
emotions (Anger, Love, Fear, Joy, Sadness, and Surprise)
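The one-vs-all setup described above can be sketched as six independent binary predictions, one per primary emotion. The keyword cues below are hypothetical stand-ins for trained classifiers, used only to show the prediction structure.

```python
# Minimal sketch of one-vs-all emotion prediction: one independent
# binary decision per primary emotion, so an utterance can carry
# several emotions at once. The cue-word sets are illustrative only;
# a real tool like SEntiMoji learns these signals from data.

EMOTIONS = ["Anger", "Love", "Fear", "Joy", "Sadness", "Surprise"]

CUES = {  # hypothetical per-emotion cue words
    "Anger": {"furious", "annoying", "hate"},
    "Love": {"thank", "grateful", "appreciate"},
    "Fear": {"worried", "afraid", "breaking"},
    "Joy": {"exciting", "glad", "nice"},
    "Sadness": {"unfortunately", "sad", "sorry"},
    "Surprise": {"unexpected", "wow", "surprisingly"},
}

def predict_one_vs_all(text: str) -> dict:
    """Run six independent binary predictions over one utterance."""
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    return {emotion: bool(tokens & CUES[emotion]) for emotion in EMOTIONS}

print(predict_one_vs_all("Thank you, this is very exciting!"))
```

Note that, unlike single-label classification, this scheme naturally handles utterances expressing multiple emotions (here both Love and Joy).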
14. Compared Pre-trained Language Models
● BERT: First major transformer model applied to NLP
● RoBERTa: An optimized version of BERT
● ALBERT: Lighter, faster BERT with shared layers
● DeBERTa: Enhanced BERT with disentangled attention
● CodeBERT: BERT model specialized for code
● GraphCodeBERT: CodeBERT enhanced with graph data
15. Evaluating the Models
● Goal: Assess effectiveness of PTMs against SotA model
● On 2 datasets
○ Stack Overflow Dataset
○ GitHub Dataset
● 80% train set, 20% test set with stratified sampling
[1] Novielli et al., “A gold standard for emotion annotation in stack overflow.” MSR 2018
[2] Imran et al., “Data augmentation for improving emotion recognition in software engineering communication.” ASE 2022
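The 80/20 stratified split mentioned above can be sketched as sampling within each label group, so train and test sets keep the same label proportions. The dataset below is a toy placeholder, not the actual benchmark data.

```python
# Sketch of an 80/20 stratified split: shuffle indices within each
# label group and carve off the test fraction per group, preserving
# label proportions in both splits. Toy data, illustrative only.
import random
from collections import defaultdict

def stratified_split(examples, labels, test_frac=0.2, seed=42):
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)  # fixed seed for reproducibility
    train, test = [], []
    for label, idxs in by_label.items():
        rng.shuffle(idxs)
        cut = int(len(idxs) * test_frac)
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return [examples[i] for i in train], [examples[i] for i in test]

texts = [f"utterance {i}" for i in range(100)]
labels = ["Joy"] * 80 + ["Anger"] * 20  # imbalanced, like real emotion data
train, test = stratified_split(texts, labels)
print(len(train), len(test))  # 80 20
```

Stratification matters here because emotion labels are heavily imbalanced; a plain random split could leave a rare emotion absent from the test set.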
16. Evaluation Metric
● F1-score: Harmonic mean of precision and recall
○ For overall performance: micro-averaged and macro-averaged
F1-score
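The difference between the two averages can be shown with a small worked example: macro averages the per-class F1 scores, while micro pools the raw counts first. The counts below are made up for illustration.

```python
# Worked sketch of micro- vs macro-averaged F1 from per-class counts.
# (tp, fp, fn) values per emotion class are hypothetical.

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

counts = {"Joy": (30, 10, 10), "Anger": (5, 5, 15)}

# Macro: average the per-class F1 scores (each class weighted equally).
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, then compute one global F1
# (dominated by the more frequent classes).
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro = f1(tp, fp, fn)

print(round(macro, 3), round(micro, 3))  # 0.542 0.636
```

Macro-averaged F1 is the stricter metric here: the poorly classified minority class (Anger) drags it down, while micro-averaging hides that weakness behind the majority class.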
17. Results (Average F1-score)

GitHub
Model          Micro Avg.  Macro Avg.
SEntiMoji      0.530       0.521
BERT           0.585       0.591
RoBERTa        0.575       0.590
ALBERT         0.538       0.539
DeBERTa        0.610       0.608
CodeBERT       0.545       0.555
GraphCodeBERT  0.549       0.549

Stack Overflow
Model          Micro Avg.  Macro Avg.
SEntiMoji      0.714       0.530
BERT           0.754       0.588
RoBERTa        0.758       0.599
ALBERT         0.747       0.584
DeBERTa        0.756       0.607
CodeBERT       0.728       0.567
GraphCodeBERT  0.722       0.552
18. Error Analysis
● Error Categorization by Novielli et al. [1]
[1] Novielli, Nicole et al. "A benchmark study on sentiment analysis for software engineering research." 2018 MSR.
General Error
Implicit Sentiment Polarity
Pragmatics
Figurative Language
Politeness
Polar Facts
Subjectivity in Annotation
19. Error Analysis on GitHub Dataset
General Error: the inability to recognize lexical cues that occur in the text
Nice, this is more slick 👍
Implicit Sentiment Polarity: humans use common knowledge to recognize
emotions that the models miss
Patiently waiting for any updates. […]
Surprisingly, misclassifications persist even in the presence of emojis:
And yes, there should be tests 😱😱😱
20. RQ2: Can integrating polarity
features during training improve
PTMs' emotion classification
ability?
21. RQ2 Methodology
● Integrate polarity features through token-level attention
adjustment
● Assign greater significance to tokens linked with polarity words
during fine-tuning
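A minimal sketch of this token-level adjustment, assuming an additive bias applied to raw attention scores before the softmax: tokens found in a polarity lexicon get a boost so the model attends to them more. The lexicon, scores, and boost value are hypothetical, not the paper's actual parameters.

```python
# Illustrative sketch of polarity-aware attention adjustment: tokens
# matching a polarity lexicon receive an additive bias before the
# softmax, raising their attention weight during fine-tuning.
# Lexicon and BOOST value are hypothetical.
import math

POLARITY_LEXICON = {"exciting", "broken", "grateful", "terrible"}
BOOST = 1.0  # hypothetical additive bias for polarity tokens

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def polarity_adjusted_attention(tokens, raw_scores):
    """Boost raw attention scores for tokens with known polarity."""
    biased = [
        score + (BOOST if tok.lower() in POLARITY_LEXICON else 0.0)
        for tok, score in zip(tokens, raw_scores)
    ]
    return softmax(biased)

tokens = ["this", "is", "very", "exciting"]
weights = polarity_adjusted_attention(tokens, [0.1, 0.1, 0.1, 0.1])
print(max(weights) == weights[-1])  # "exciting" gets the most attention
```

In a real transformer this bias would be applied inside the attention layers over subword tokens; the sketch only shows the weighting idea in isolation.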
25. Error Analysis on GitHub Dataset
● In RQ1, there were 67 cases that all models misclassified
○ After polarity enhancement, at least one model made a correct prediction in 27/67 cases
● Most improved categories:
○ General error (13/29 cases resolved)
○ Implicit polarity (9/18 cases resolved)
○ Politeness (2/3 cases resolved)
● Least improved categories:
○ Pragmatics (6/7 cases remain unresolved)
○ Figurative Language (6/9 cases remain unresolved)
● A considerable number of misclassified utterances still contain emojis
26. Key Takeaways
● General PTMs excel in emotion classification within SE texts
compared to SE-specific models
● Polarity features enhance performance consistently
○ Challenges persist especially with negative emotions
● No single model excels across all emotions and metrics
● Common error categories are usually context dependent:
implicit polarity, figurative language, pragmatics
● Challenges in handling emojis
27. Future Directions
● Establish more benchmark datasets
● Investigate hierarchical emotion classification (two-step)
○ Enhance performance by identifying broad emotional valence before
specific categories
● Investigate aspect-based sentiment analysis (ABSA)-enhanced PTMs
● Fusion of text and emoji cues during pre-training/fine-tuning
● Explore generative language models for emotion detection
○ Utilize zero-shot and few-shot learning for data augmentation and
prompting techniques
● Focus on detecting emotions that may harm productivity (e.g.,
Frustration)