Understanding and identifying the causes behind developers' emotions (e.g., Frustration caused by 'delays in merging pull requests') can be crucial to finding solutions to problems and fostering collaboration in open-source communities. Effectively identifying such information in the high volume of communications across the different project channels, such as chats, emails, and issue comments, requires automated recognition of emotions and their causes. To enable this automation, large-scale software engineering-specific datasets that can be used to train accurate machine learning models are required. However, such datasets are expensive to create, given the variety and informal nature of software projects' communication channels.
In this paper, we explore zero-shot LLMs that are pre-trained on massive datasets but not fine-tuned specifically for the task of detecting emotion causes in software engineering: ChatGPT, GPT-4, and flan-alpaca. Our evaluation indicates that these recently available models can identify emotion categories when prompted with detailed emotion taxonomies, although they perform worse than the top fine-tuned models. For emotion cause identification, our results indicate that zero-shot LLMs are effective at recognizing the correct emotion cause, with a BLEU-2 score of 0.598. To highlight the potential use of these techniques, we conduct a case study of the causes of Frustration in the last year of development of a popular open-source project, revealing several interesting insights.
Uncovering the Causes of Emotions in Software Developer Communication Using Zero-shot LLMs
1. Uncovering the Causes of Emotions in
Software Developer Communication
Using Zero-shot LLMs
Mia Mohammad Imran, Preetha Chatterjee, Kostadin Damevski
Virginia Commonwealth University, Drexel University
2. Understanding Emotion Cause in OSS
Emotion cause identification involves identifying the text span within an utterance that triggers a particular emotion.

Utterance: "I'm feeling frustrated because the code isn't compiling no matter what I try."
Emotion: Frustration
Cause: "the code isn't compiling no matter what I try"
5. Emotion Models
● Theoretical frameworks to represent emotions
● Shaver’s tree-structured model is the most commonly used in software engineering research
○ 6 primary categories, 25 secondary categories, and over 100 tertiary categories
● GoEmotions is a model recently developed by Google for emotion recognition in text
6. Emotion Models: Shaver’s Taxonomy
● 6 primary categories:
○ Anger 😡
○ Love ❤️
○ Fear 😨
○ Joy 😊
○ Sadness 😥
○ Surprise 😲
7. Shaver’s Taxonomy Is Not Complete
● “I’m curious about this - can you give more context on
what exactly goes wrong? Perhaps if that causes bugs this
should be prohibited instead?"
○ Expresses Curiosity 🤔
● “And, I am a little confused, if there is not any special
folder, according to the module resolution [URL] How
could file find the correct modules? Did I miss something?”
○ Expresses Confusion 😕
8. Extended Shaver’s Taxonomy
● Imran et al. [1] proposed an extended Shaver’s taxonomy by incorporating GoEmotions’ categories
● Provides mapping between GoEmotions’ categories and
primary emotions:
○ 👍 Approval to 😊 Joy
○ 👎 Disapproval to 😡 Anger
○ 🤔 Curiosity to 😲 Surprise
○ 🙌 Gratitude to ❤️ Love
[1] Imran et al., “Data augmentation for improving emotion recognition in software engineering communication.” ASE 2022
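As a sketch, the slide's four example mappings can be encoded as a lookup table. The snippet below covers only the subset shown here, not the full extended taxonomy from [1], and the function name is an illustrative assumption:

```python
# Subset of the extended Shaver taxonomy mapping shown on the slide;
# the full mapping in Imran et al. (ASE 2022) covers all GoEmotions categories.
GOEMOTIONS_TO_BASIC = {
    "approval": "Joy",
    "disapproval": "Anger",
    "curiosity": "Surprise",
    "gratitude": "Love",
}

def to_basic_emotion(label: str) -> str:
    """Map a GoEmotions category to a basic Shaver emotion,
    returning the label unchanged if it is already a basic emotion."""
    return GOEMOTIONS_TO_BASIC.get(label.lower(), label)
```

This lets granular model outputs (e.g., Curiosity) be folded back into the six basic categories for evaluation.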
10. State-of-the-Art Models

Model          Approach            Features
ESEM-E [1]     SVM                 Unigram, bigram
EMTk [2]       SVM                 Unigram, bigram, lexicon, polarity, mood
SEntiMoji [3]  Transfer learning   Neural network
[1] Murgia et al., “An exploratory qualitative and quantitative analysis of emotions in issue report comments of open source systems.”, ESEM, 2018
[2] Calefato et al., “Emtk-the emotion mining toolkit.” SEmotion, 2019
[3] Chen et al. “Emoji-powered sentiment and emotion detection from software developers' communication data.” TOSEM, 2021
● Studies show that general-purpose tools perform poorly on software engineering text
● All tools perform one-vs-all predictions for all 6 basic
emotions (Anger, Love, Fear, Joy, Sadness, and Surprise)
11. Compared Fine-tuned LLMs

Fine-tuned LLMs: LLMs that are further trained (fine-tuned) with task-specific data

● BERT: First major transformer model applied to NLP
● RoBERTa: An optimized version of BERT
12. Compared Zero-shot LLMs

Zero-shot reasoning: when LLMs can make decisions on unseen tasks without prior task-specific training

● ChatGPT (GPT-3.5): Proprietary model by OpenAI
● GPT-4: Updated version of GPT-3.5
● flan-alpaca: open-source
○ variation of Meta’s LLaMA model
○ instruction-tuned with Google’s Flan-T5 model
13. Evaluating the Models
● Goal: Assess the effectiveness of LLMs against SotA models
● Compared against three existing datasets from GitHub [1], JIRA [2], and Stack Overflow [3]
● 80% train set, 20% test set with stratified sampling
● Metric: micro-averaged F1-score
[1] Imran et al., “Data augmentation for improving emotion recognition in software engineering communication.” ASE 2022
[2] Murgia et al., “An exploratory qualitative and quantitative analysis of emotions in issue report comments of open source systems.” ESEM, 2018
[3] Calefato et al., “Emtk-the emotion mining toolkit.” SEmotion, 2019
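Because every compared tool makes one-vs-all predictions per emotion, the micro-averaged F1 pools true positives, false positives, and false negatives across all emotions before computing F1. A minimal sketch (the set-of-labels representation and function name are assumptions for illustration, not from the paper):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over one-vs-all emotion decisions.
    gold/pred: lists of sets of emotion labels, one set per utterance."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels correctly predicted
        fp += len(p - g)   # labels predicted but not annotated
        fn += len(g - p)   # labels annotated but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

For example, one correct label and one wrong label over two utterances gives TP=1, FP=1, FN=1, i.e., micro-F1 of 0.5.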
14. Prompt Design for Zero-shot LLM Reasoners
You are a [GitHub/Stack Overflow/JIRA] user. You are reading
comments from [GitHub/Stack Overflow/JIRA]. Your task is to
detect whether there is one of the following emotions aroused in
you while reading the utterance.
Emotions List: Anger, Fear, Love, Joy, Sadness, Surprise.
Utterance: <insert utterance>.
If there is no emotion in the text, write Neutral. Otherwise write
exactly one word, the exact emotion from the emotions list.
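Filling this template programmatically might look like the sketch below; only the template wording comes from the slide, while the function name and parameterization are assumptions:

```python
EMOTIONS = "Anger, Fear, Love, Joy, Sadness, Surprise"

def build_emotion_prompt(platform: str, utterance: str) -> str:
    """Fill the zero-shot emotion-detection prompt template from the slide.
    platform is one of "GitHub", "Stack Overflow", or "JIRA"."""
    return (
        f"You are a {platform} user. You are reading comments from {platform}. "
        "Your task is to detect whether there is one of the following emotions "
        "aroused in you while reading the utterance.\n"
        f"Emotions List: {EMOTIONS}.\n"
        f"Utterance: {utterance}.\n"
        "If there is no emotion in the text, write Neutral. Otherwise write "
        "exactly one word, the exact emotion from the emotions list."
    )
```

The resulting string would then be sent to the model's chat/completion endpoint; constraining the output to "exactly one word" simplifies parsing the response.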
15. Results (Micro-average F1-score)
Model GitHub Stack Overflow JIRA
ESEM-E 0.440 0.674 0.744
EMTk 0.434 0.651 0.734
SEntiMoji 0.529 0.721 0.793
BERT 0.588 0.716 0.817
RoBERTa 0.592 0.735 0.818
ChatGPT 0.234 0.339 0.276
GPT-4 0.424 0.293 0.432
flan-alpaca 0.355 0.444 0.256
Fine-tuned BERT and RoBERTa outperform the other models.
Zero-shot LLMs perform poorly!
We conduct an error analysis to understand why.
16. Error Analysis
● Misclassifying one emotion as another, e.g., Love as Joy
● Predicting Neutral for emotional utterances, e.g.:
"My concern is that more new atributes may appear [...] it may break their behavior."
● Hallucinations: generated responses outside of what was asked, e.g.:
Apology: "Doh. Sorry for wasting your time."
17. Zero-shot LLMs: Granular-Level Prompting
● From the GitHub dataset, sampled 400 utterances from the training set and performed prompting
● Designed prompts based on various emotion taxonomies:
○ Basic and secondary emotions (36 emotions total)
○ Secondary layer only (25 emotions total)
○ All layers of emotions (141 emotions total)
○ GoEmotions taxonomy (27 emotions total)
● Output emotions are mapped back to the basic emotions
● The GoEmotions taxonomy performed best in F1-score
18. How the Zero-shot LLMs Perform Now
● Output on GitHub Dataset
● Open-source flan-alpaca achieved the best zero-shot performance, outperforming GPT-4!
Model Anger Love Fear Joy Sadness Surprise Micro avg.
BERT 0.506 0.712 0.536 0.579 0.636 0.594 0.588
RoBERTa 0.525 0.683 0.492 0.500 0.613 0.673 0.592
ChatGPT 0.337 0.490 0.182 0.458 0.412 0.511 0.423
flan-alpaca 0.447 0.543 0.140 0.446 0.451 0.740 0.507
GPT-4 0.437 0.698 0.0 0.446 0.487 0.517 0.481
For comparison, SotA models on the GitHub dataset (micro-avg. F1):
ESEM-E    0.440
EMTk      0.434
SEntiMoji 0.529
20. Emotion Cause Extraction
● Emotion cause extraction involves extracting the text span within an utterance that triggers a particular emotion

Utterance: "I'm feeling frustrated because the code isn't compiling no matter what I try."
Emotion: Frustration
Cause: "the code isn't compiling no matter what I try"
21. Emotion Cause Extraction - Challenges
Annotation:
● Requires understanding nuances in textual communication
● Causes can be implicit
● There can be multiple causes
Automatic cause extraction:
● Requires large amounts of training data, which we lack
22. Zero-shot LLMs for Cause Extraction
● Requires no training to extract causes
● Prompt design is critical
● Use same three models:
○ ChatGPT
○ GPT-4
○ flan-alpaca
23. Emotion Cause Extraction: Prompt
You are a GitHub user. You are reading utterances from
GitHub issues and pull requests. Your task is to extract the
span that is causing the emotion <insert emotion> in the
following GitHub utterance: <insert utterance>.
Write the cause of the span within a double quote.
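As with emotion detection, the cause-extraction prompt can be templated; a sketch where only the wording comes from the slide and the function name is an assumption:

```python
def build_cause_prompt(emotion: str, utterance: str) -> str:
    """Fill the zero-shot cause-extraction prompt template from the slide."""
    return (
        "You are a GitHub user. You are reading utterances from GitHub issues "
        "and pull requests. Your task is to extract the span that is causing "
        f"the emotion {emotion} in the following GitHub utterance: {utterance}.\n"
        "Write the cause of the span within a double quote."
    )
```

Asking for the cause inside double quotes makes the span straightforward to extract from the model's free-text response.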
24. Experiment Setup: Annotation
● Manually annotated 450 utterances
○ 75 utterances for each of 6 basic emotions
● Instructions:
○ Extract cause span to associated emotion
○ Allow multiple causes
25. Experiment Setup: Metric
● We use BLEU score as a metric
○ Compares machine-generated text to human references
○ Measures precision of n-gram overlap
● BLEU-2 (bigram) suitable for comparing short texts
● Interpretation:
○ 0.5 - Good fluency and correctness
○ 0.3-0.5 - Comprehensible
○ < 0.3 - Disfluent or incorrect
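The metric can be made concrete with a simplified sentence-level BLEU-2: the geometric mean of clipped unigram and bigram precision, multiplied by a brevity penalty. This sketch omits the smoothing that evaluation toolkits typically apply, so it is illustrative rather than a reimplementation of the paper's exact setup:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu2(candidate: str, reference: str) -> float:
    """Unsmoothed sentence-level BLEU-2 against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    precisions = []
    for n in (1, 2):
        c_counts = Counter(ngrams(cand, n))
        r_counts = Counter(ngrams(ref, n))
        # Clipped n-gram precision: each candidate n-gram counts at most
        # as often as it appears in the reference.
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(sum(c_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 2)
```

For example, a candidate span that matches a prefix of the reference exactly still scores below 0.3 here because the brevity penalty punishes the missing words.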
26. Results
● GPT-4 outperforms the other models in all cases
● BLEU-2 scores for GPT-4 and flan-alpaca are > 0.5, indicating they perform reasonably well in correctness
Model BLEU-1 BLEU-2 BLEU-3 BLEU-4
ChatGPT 0.522 0.489 0.467 0.450
GPT-4 0.637 0.598 0.571 0.554
flan-alpaca 0.571 0.543 0.525 0.508
27. Error Analysis
● 41 cases where all three models’ BLEU-2 scores < 0.3
● Two categories of error:
○ Incorrect emotion detection
○ Identifying the wrong cause span

Incorrect emotion:
“Oh right 🙃! This started as a Mac issue, I forgot to add the rest.”
Annotation: Neglect (2nd level of Sadness)
GPT-4 detected emotion: Amusement
GPT-4 detected cause span: Oh right 🙃

Wrong cause span:
“[USER] yep, it is bug, we will fix it, so we have it in 'experiments' :+1:”
Annotation: Agreement (2nd level of Joy)
GPT-4 detected emotion: Agreement
GPT-4 detected cause span: we will fix it
29. A Case Study on Emotion Cause
Methodology:
● Frustration on the TensorFlow repository using flan-alpaca
● Collected all comments made by developers over a 1-year period
● Extracted causes when the emotion is Frustration
● Resulted in a total of 1,275 comments
● Applied DBSCAN clustering on the causes
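The clustering step can be illustrated with a minimal pure-Python DBSCAN on toy 1-D data. In the study the inputs would presumably be vector representations of the extracted cause spans, and the eps/min_pts values below are purely illustrative, not the paper's settings:

```python
def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN: a core point has >= min_pts neighbors within eps
    (counting itself); clusters grow outward from core points; -1 = noise."""
    labels = {}   # point index -> cluster id (-1 marks noise)
    cid = 0       # next cluster id to assign

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1              # noise (may later become a border point)
            continue
        labels[i] = cid
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cid         # noise reachable from a core point -> border
            if j in labels:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:      # j is itself a core point: keep expanding
                queue.extend(jn)
        cid += 1
    return [labels[i] for i in range(len(points))]
```

On `[1, 2, 3, 10, 11, 12, 100]` with `eps=1.5`, `min_pts=2`, and absolute difference as the distance, the two dense runs form clusters 0 and 1 while the outlier 100 is flagged as noise; in the case study, each dense cluster of cause spans corresponds to one recurring cause of Frustration.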
30. Causes of Frustration
● TensorFlow Version and Dependency Issues
● Pull Request Delays and Merge Conflicts
● Failing Tests
● Too Fine-Grained Commits
● CI Flakiness
● CUDA/CuDNN Compatibility Issues
31. Summary of Contributions
● Utilization of zero-shot LLMs: Employed zero-shot models (GPT-3.5, GPT-4, and flan-alpaca) for detecting emotions and their causes in SE
● Annotated Data: 450 GitHub utterances with Emotion and Causes
● Resource Sharing: Publicly released source code, annotation
guidelines, and dataset
● Novel Research: Among the first to explore Emotion Causes in SE
● Open-source Case Study: Demonstrated practical benefits of emotion
cause extraction through a case study on a major open-source project
Questions/Thoughts/Collaboration Ideas to:
Mia Mohammad Imran, imranm3@vcu.edu