Understanding and identifying the causes behind developers' emotions (e.g., Frustration caused by 'delays in merging pull requests') can be crucial to finding solutions to problems and fostering collaboration in open-source communities. Effectively identifying such information in the high volume of communications across the different project channels, such as chats, emails, and issue comments, requires automated recognition of emotions and their causes. To enable this automation, large-scale software engineering-specific datasets that can be used to train accurate machine learning models are required. However, such datasets are expensive to create, given the variety and informal nature of software projects' communication channels.
In this paper, we explore zero-shot LLMs that are pre-trained on massive datasets but not fine-tuned specifically for the task of detecting emotion causes in software engineering: ChatGPT, GPT-4, and flan-alpaca. Our evaluation indicates that these recently available models can identify emotion categories when prompted with detailed emotion taxonomies, although they perform worse than the top fine-tuned models. For emotion cause identification, our results indicate that zero-shot LLMs are effective at recognizing the correct emotion cause, with a BLEU-2 score of 0.598. To highlight the potential use of these techniques, we conduct a case study of the causes of Frustration in the last year of development of a popular open-source project, revealing several interesting insights.
Uncovering the Causes of Emotions in Software Developer Communication Using Zero-shot LLMs
1. Uncovering the Causes of Emotions in
Software Developer Communication
Using Zero-shot LLMs
Mia Mohammad Imran, Preetha Chatterjee, Kostadin Damevski
Virginia Commonwealth University, Drexel University
2. Understanding Emotion Cause in OSS
Emotion cause identification involves identifying the text span within an utterance that triggers a particular emotion.

Utterance: "I'm feeling frustrated because the code isn't compiling no matter what I try."
Emotion: Frustration
Cause: "the code isn't compiling no matter what I try"
5. Emotion Models
● Theoretical frameworks to represent emotions
● Shaver’s tree-structured model is the most commonly used in software engineering research
○ 6 primary categories, 25 secondary categories, and over 100 tertiary categories
● GoEmotions is a model recently developed by Google for emotion recognition in text
6. Emotion Models: Shaver’s Taxonomy
● 6 primary categories:
○ Anger 😡
○ Love ❤️
○ Fear 😨
○ Joy 😊
○ Sadness 😥
○ Surprise 😲
7. Shaver’s Taxonomy Is Not Complete
● “I’m curious about this - can you give more context on
what exactly goes wrong? Perhaps if that causes bugs this
should be prohibited instead?"
○ Expresses Curiosity 🤔
● “And, I am a little confused, if there is not any special
folder, according to the module resolution [URL] How
could file find the correct modules? Did I miss something?”
○ Expresses Confusion 😕
8. Extended Shaver’s Taxonomy
● Imran et al. [1] proposed an extended Shaver’s taxonomy by incorporating GoEmotions’ categories
● Provides mapping between GoEmotions’ categories and
primary emotions:
○ 👍 Approval to 😊 Joy
○ 👎 Disapproval to 😡 Anger
○ 🤔 Curiosity to 😲 Surprise
○ 🙌 Gratitude to ❤️ Love
[1] Imran et al., “Data augmentation for improving emotion recognition in software engineering communication.” ASE 2022
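As a sketch, the slide's four example mappings can be encoded as a lookup table. The snippet below covers only the subset shown here, not the full extended taxonomy from [1], and the function name is an illustrative assumption:

```python
# Subset of the extended Shaver taxonomy mapping shown on the slide;
# the full mapping in Imran et al. (ASE 2022) covers all GoEmotions categories.
GOEMOTIONS_TO_BASIC = {
    "approval": "Joy",
    "disapproval": "Anger",
    "curiosity": "Surprise",
    "gratitude": "Love",
}

def to_basic_emotion(label: str) -> str:
    """Map a GoEmotions category to a basic Shaver emotion,
    returning the label unchanged if it is already a basic emotion."""
    return GOEMOTIONS_TO_BASIC.get(label.lower(), label)
```

This lets granular model outputs (e.g., Curiosity) be folded back into the six basic categories for evaluation.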
10. State-of-the-Art Models

Model          Approach            Features
ESEM-E [1]     SVM                 Unigram, bigram
EMTk [2]       SVM                 Unigram, bigram, lexicon, polarity, mood
SEntiMoji [3]  Transfer learning   Neural network
[1] Murgia et al., “An exploratory qualitative and quantitative analysis of emotions in issue report comments of open source systems.”, ESEM, 2018
[2] Calefato et al., “Emtk-the emotion mining toolkit.” SEmotion, 2019
[3] Chen et al. “Emoji-powered sentiment and emotion detection from software developers' communication data.” TOSEM, 2021
● Studies show that general-purpose tools perform poorly on software engineering text
● All tools perform one-vs-all predictions for all 6 basic
emotions (Anger, Love, Fear, Joy, Sadness, and Surprise)
11. Compared Fine-tuned LLMs

Fine-tuned LLMs: LLMs that are further trained (fine-tuned) with task-specific data

● BERT: First major transformer model applied to NLP
● RoBERTa: An optimized version of BERT
12. Compared Zero-shot LLMs

Zero-shot reasoning: when LLMs can make decisions on unseen tasks without prior task-specific training

● ChatGPT (GPT-3.5): Proprietary model by OpenAI
● GPT-4: Updated version of GPT-3.5
● flan-alpaca: open-source
○ variation of Meta’s LLaMA model
○ instruction-tuned with Google’s Flan-T5 model
13. Evaluating the Models
● Goal: Assess the effectiveness of LLMs against SotA models
● Compared against three existing datasets from GitHub [1], JIRA [2], and Stack Overflow [3]
● 80% train set, 20% test set with stratified sampling
● Metric: micro-averaged F1-score
[1] Imran et al., “Data augmentation for improving emotion recognition in software engineering communication.” ASE 2022
[2] Murgia et al., “An exploratory qualitative and quantitative analysis of emotions in issue report comments of open source systems.” ESEM, 2018
[3] Calefato et al., “Emtk-the emotion mining toolkit.” SEmotion, 2019
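Because every compared tool makes one-vs-all predictions per emotion, the micro-averaged F1 pools true positives, false positives, and false negatives across all emotions before computing F1. A minimal sketch (the set-of-labels representation and function name are assumptions for illustration, not from the paper):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over one-vs-all emotion decisions.
    gold/pred: lists of sets of emotion labels, one set per utterance."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels correctly predicted
        fp += len(p - g)   # labels predicted but not annotated
        fn += len(g - p)   # labels annotated but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

For example, one correct label and one wrong label over two utterances gives TP=1, FP=1, FN=1, i.e., micro-F1 of 0.5.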
14. Prompt Design for Zero-shot LLM Reasoners
You are a [GitHub/Stack Overflow/JIRA] user. You are reading
comments from [GitHub/Stack Overflow/JIRA]. Your task is to
detect whether there is one of the following emotions aroused in
you while reading the utterance.
Emotions List: Anger, Fear, Love, Joy, Sadness, Surprise.
Utterance: <insert utterance>.
If there is no emotion in the text, write Neutral. Otherwise write
exactly one word, the exact emotion from the emotions list.
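Filling this template programmatically might look like the sketch below; only the template wording comes from the slide, while the function name and parameterization are assumptions:

```python
EMOTIONS = "Anger, Fear, Love, Joy, Sadness, Surprise"

def build_emotion_prompt(platform: str, utterance: str) -> str:
    """Fill the zero-shot emotion-detection prompt template from the slide.
    platform is one of "GitHub", "Stack Overflow", or "JIRA"."""
    return (
        f"You are a {platform} user. You are reading comments from {platform}. "
        "Your task is to detect whether there is one of the following emotions "
        "aroused in you while reading the utterance.\n"
        f"Emotions List: {EMOTIONS}.\n"
        f"Utterance: {utterance}.\n"
        "If there is no emotion in the text, write Neutral. Otherwise write "
        "exactly one word, the exact emotion from the emotions list."
    )
```

The resulting string would then be sent to the model's chat/completion endpoint; constraining the output to "exactly one word" simplifies parsing the response.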
15. Results (Micro-average F1-score)
Model GitHub Stack Overflow JIRA
ESEM-E 0.440 0.674 0.744
EMTk 0.434 0.651 0.734
SEntiMoji 0.529 0.721 0.793
BERT 0.588 0.716 0.817
RoBERTa 0.592 0.735 0.818
ChatGPT 0.234 0.339 0.276
GPT-4 0.424 0.293 0.432
flan-alpaca 0.355 0.444 0.256
Fine-tuned BERT and RoBERTa outperform the other models.
Zero-shot LLMs perform poorly!
We conduct an error analysis to understand why.
16. Error Analysis
● Misclassifying one emotion as another, e.g., Love as Joy
● Predicting Neutral for emotional utterances, e.g.:
"My concern is that more new atributes may appear [...] it may break their behavior."
● Hallucinations: generated responses outside of what was asked, e.g.:
Apology: "Doh. Sorry for wasting your time."
17. Zero-shot LLMs: Granular-Level Prompting
● From the GitHub dataset, sampled 400 utterances from the training set and performed prompting
● Designed prompts based on various emotion taxonomies:
○ Basic and secondary emotions (36 emotions total)
○ Secondary layer only (25 emotions total)
○ All layers of emotions (141 emotions total)
○ GoEmotions taxonomy (27 emotions total)
● Output emotions are mapped back to the basic emotions
● The GoEmotions taxonomy performed best in F1-score
18. How the Zero-shot LLMs Perform Now
● Output on GitHub Dataset
● Open-source flan-alpaca achieved the best zero-shot performance, outperforming GPT-4!
Model Anger Love Fear Joy Sadness Surprise Micro avg.
BERT 0.506 0.712 0.536 0.579 0.636 0.594 0.588
RoBERTa 0.525 0.683 0.492 0.500 0.613 0.673 0.592
ChatGPT 0.337 0.490 0.182 0.458 0.412 0.511 0.423
flan-alpaca 0.447 0.543 0.140 0.446 0.451 0.740 0.507
GPT-4 0.437 0.698 0.0 0.446 0.487 0.517 0.481
For comparison, SotA models on the GitHub dataset (micro-avg. F1):
ESEM-E    0.440
EMTk      0.434
SEntiMoji 0.529
20. Emotion Cause Extraction
● Emotion cause extraction involves extracting the text span within an utterance that triggers a particular emotion

Utterance: "I'm feeling frustrated because the code isn't compiling no matter what I try."
Emotion: Frustration
Cause: "the code isn't compiling no matter what I try"
21. Emotion Cause Extraction - Challenges
Annotation:
● Requires understanding nuances in textual communication
● Causes can be implicit
● There can be multiple causes
Automatic cause extraction:
● Requires large amounts of training data, which we lack
22. Zero-shot LLMs for Cause Extraction
● Requires no training to extract causes
● Prompt design is critical
● Use same three models:
○ ChatGPT
○ GPT-4
○ flan-alpaca
23. Emotion Cause Extraction: Prompt
You are a GitHub user. You are reading utterances from
GitHub issues and pull requests. Your task is to extract the
span that is causing the emotion <insert emotion> in the
following GitHub utterance: <insert utterance>.
Write the cause of the span within a double quote.
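As with emotion detection, the cause-extraction prompt can be templated; a sketch where only the wording comes from the slide and the function name is an assumption:

```python
def build_cause_prompt(emotion: str, utterance: str) -> str:
    """Fill the zero-shot cause-extraction prompt template from the slide."""
    return (
        "You are a GitHub user. You are reading utterances from GitHub issues "
        "and pull requests. Your task is to extract the span that is causing "
        f"the emotion {emotion} in the following GitHub utterance: {utterance}.\n"
        "Write the cause of the span within a double quote."
    )
```

Asking for the cause inside double quotes makes the span straightforward to extract from the model's free-text response.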
24. Experiment Setup: Annotation
● Manually annotated 450 utterances
○ 75 utterances for each of 6 basic emotions
● Instructions:
○ Extract cause span to associated emotion
○ Allow multiple causes
25. Experiment Setup: Metric
● We use BLEU score as a metric
○ Compares machine-generated text to human references
○ Measures precision of n-gram overlap
● BLEU-2 (bigram) suitable for comparing short texts
● Interpretation:
○ 0.5 - Good fluency and correctness
○ 0.3-0.5 - Comprehensible
○ < 0.3 - Disfluent or incorrect
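The metric can be made concrete with a simplified sentence-level BLEU-2: the geometric mean of clipped unigram and bigram precision, multiplied by a brevity penalty. This sketch omits the smoothing that evaluation toolkits typically apply, so it is illustrative rather than a reimplementation of the paper's exact setup:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu2(candidate: str, reference: str) -> float:
    """Unsmoothed sentence-level BLEU-2 against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    precisions = []
    for n in (1, 2):
        c_counts = Counter(ngrams(cand, n))
        r_counts = Counter(ngrams(ref, n))
        # Clipped n-gram precision: each candidate n-gram counts at most
        # as often as it appears in the reference.
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(sum(c_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 2)
```

For example, a candidate span that matches a prefix of the reference exactly still scores below 0.3 here because the brevity penalty punishes the missing words.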
26. Results
● GPT-4 outperforms the other models in all cases
● BLEU-2 scores for GPT-4 and flan-alpaca are > 0.5, indicating they perform reasonably well in correctness
Model BLEU-1 BLEU-2 BLEU-3 BLEU-4
ChatGPT 0.522 0.489 0.467 0.450
GPT-4 0.637 0.598 0.571 0.554
flan-alpaca 0.571 0.543 0.525 0.508
27. Error Analysis
● 41 cases where all three models’ BLEU-2 scores < 0.3
● Two categories of error:
○ Incorrect emotion detection
○ Identifying the wrong cause span

Incorrect emotion:
“Oh right 🙃! This started as a Mac issue, I forgot to add the rest.”
Annotation: Neglect (2nd level of Sadness)
GPT-4 detected emotion: Amusement
GPT-4 detected cause span: Oh right 🙃

Wrong cause span:
“[USER] yep, it is bug, we will fix it, so we have it in 'experiments' :+1:”
Annotation: Agreement (2nd level of Joy)
GPT-4 detected emotion: Agreement
GPT-4 detected cause span: we will fix it
29. A Case Study on Emotion Cause
Methodology:
● Frustration on the TensorFlow repository using flan-alpaca
● Collected all comments made by developers over a 1-year period
● Extracted causes when the emotion is Frustration
● Resulted in a total of 1,275 comments
● Applied DBSCAN clustering on the causes
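The clustering step can be illustrated with a minimal pure-Python DBSCAN on toy 1-D data. In the study the inputs would presumably be vector representations of the extracted cause spans, and the eps/min_pts values below are purely illustrative, not the paper's settings:

```python
def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN: a core point has >= min_pts neighbors within eps
    (counting itself); clusters grow outward from core points; -1 = noise."""
    labels = {}   # point index -> cluster id (-1 marks noise)
    cid = 0       # next cluster id to assign

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1              # noise (may later become a border point)
            continue
        labels[i] = cid
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cid         # noise reachable from a core point -> border
            if j in labels:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:      # j is itself a core point: keep expanding
                queue.extend(jn)
        cid += 1
    return [labels[i] for i in range(len(points))]
```

On `[1, 2, 3, 10, 11, 12, 100]` with `eps=1.5`, `min_pts=2`, and absolute difference as the distance, the two dense runs form clusters 0 and 1 while the outlier 100 is flagged as noise; in the case study, each dense cluster of cause spans corresponds to one recurring cause of Frustration.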
30. Causes of Frustration
● TensorFlow Version and Dependency Issues
● Pull Request Delays and Merge Conflicts
● Failing Tests
● Too Fine-Grained Commits
● CI Flakiness
● CUDA/CuDNN Compatibility Issues
31. Summary of Contributions
● Utilization of zero-shot LLMs: Employed zero-shot models (GPT-3.5, GPT-4, and flan-alpaca) for detecting emotions and their causes in SE
● Annotated Data: 450 GitHub utterances with Emotion and Causes
● Resource Sharing: Publicly released source code, annotation
guidelines, and dataset
● Novel Research: Among the first to explore Emotion Causes in SE
● Open-source Case Study: Demonstrated practical benefits of emotion
cause extraction through a case study on a major open-source project
Questions/Thoughts/Collaboration Ideas to:
Mia Mohammad Imran, imranm3@vcu.edu