SlideShare a Scribd company logo
1 of 19
Beyond Accuracy: Evaluating the Reasoning Behavior of
Large Language Models - A Survey
Short Story Review
By: Dhruval Shah
Introduction
Exploring beyond task accuracy, this survey investigates if Large Language Models
(LLMs) can reason like humans, highlighting the gap in our understanding of LLMs'
reasoning processes and the over reliance on shallow patterns over deeper insights
Understanding Reasoning
Reasoning, both in humans and AI, is the process of deriving conclusions from premises
given. While humans use implicit reasoning in everyday decisions, LLMs' reasoning is
analyzed based on their response to specific tasks they were already trained on, focusing on
their actions and the mechanisms they employ.
Categorizing Reasoning Tasks
This survey categorizes reasoning tasks into two types: Core and Integrated. Core
Reasoning Tasks test LLMs on fundamental reasoning skills—logical, mathematical,
and causal. Integrated tasks combine multiple core skills, but the focus here in this
survey paper is solely on core tasks
Logical Reasoning
LLMs show varied success in logical reasoning, capable of engaging in deductive, inductive,
and abductive reasoning. However, they often rely on identifying patterns, leading to
challenges when faced with novel or complex scenarios outside their training.
Mathematical Reasoning
In mathematical reasoning, LLMs excel in familiar problems but struggle when presented
with slightly altered problem statements. This indicates a reliance on memorization over
genuine understanding, highlighting a key limitation in their reasoning capabilities.
Causal Reasoning
Causal reasoning in LLMs is tested through their ability to understand cause-and-effect
relationships. LLMs can mimic this reasoning for scenarios they've been trained on but
struggle with novel tasks, showing limitations in their ability to truly 'understand' causality.
Evaluation Methodologies Overview
The survey introduces four methodologies for evaluating LLMs' reasoning: Conclusion-
based, Rationale-based, Interactive, and Mechanistic evaluations, each offering unique
insights into the reasoning behavior of LLMs.
Conclusion-Based Evaluation
This methodology assesses LLMs based on the accuracy and relevance of their
conclusions, using error analysis to identify patterns in mistakes, thus testing the
models' adaptability and generalization capabilities.
Rationale-Based Evaluation
Rationale-based evaluation delves into the 'thought process' of LLMs, analyzing the logical
steps they take to reach conclusions. It assesses the coherence, validity, and consistency
of their rationales, aiming to understand if LLMs mimic human thought processes.
Interactive Evaluation
Interactive evaluation involves engaging LLMs in dynamic dialogues, mimicking human
learning environments. It tests the models' ability to adapt their reasoning in light of new
information, assessing their conversational reasoning skills
Mechanistic Evaluation
This method aims to understand the internal workings of LLMs during reasoning, analyzing
models' attention patterns, layer functions, and activation flows to uncover the
computational bases of their reasoning behavior.
Key Insights
LLMs demonstrate an impressive yet limited ability to mimic human reasoning, heavily
relying on pattern recognition. The survey highlights the need for more nuanced evaluation
methods to fully capture the complexity of LLM reasoning processes.
Future Research Directions
The survey calls for research focused on enhancing LLMs' generalization capabilities
and developing sophisticated evaluation methodologies that go beyond traditional task
performance metrics, aiming to bridge the gap between human and AI reasoning.
The Journey Ahead
Understanding LLMs' reasoning abilities is an ongoing journey, with the ultimate goal of
developing models that not only compute but truly comprehend and reason, mirroring the
depth of human thought
Closing Thoughts
This survey underscores the importance of advancing our understanding and evaluation
of LLMs' reasoning abilities, paving the way for future AI systems that are more aligned
with human-like reasoning.
THANK YOU

More Related Content

Similar to Dhruval_Shah_CMPE258_ShortStory_PPT.pptx

Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Cognitive Approaches to Second Language Acquisition
Cognitive Approaches to Second Language AcquisitionCognitive Approaches to Second Language Acquisition
Cognitive Approaches to Second Language AcquisitionOla Sayed Ahmed
 
psychometrics chapter one.pptx
psychometrics chapter one.pptxpsychometrics chapter one.pptx
psychometrics chapter one.pptxDejeneDay
 
Silico-centric Theory of Mind arXiv:2403.09289v1 [cs.AI] 14 Mar 2024
Silico-centric Theory of Mind arXiv:2403.09289v1 [cs.AI] 14 Mar 2024Silico-centric Theory of Mind arXiv:2403.09289v1 [cs.AI] 14 Mar 2024
Silico-centric Theory of Mind arXiv:2403.09289v1 [cs.AI] 14 Mar 2024vs5qkn48td
 
Competency
CompetencyCompetency
CompetencySindhu .
 
Interpretable Machine Learning_ Techniques for Model Explainability.
Interpretable Machine Learning_ Techniques for Model Explainability.Interpretable Machine Learning_ Techniques for Model Explainability.
Interpretable Machine Learning_ Techniques for Model Explainability.Tyrion Lannister
 
old.ipto-2004.ppt
old.ipto-2004.pptold.ipto-2004.ppt
old.ipto-2004.pptbutest
 
Individual behavior-in-organization
Individual behavior-in-organizationIndividual behavior-in-organization
Individual behavior-in-organizationKunal Sheth
 
Critical thinking,coaching and performance management
Critical thinking,coaching and performance managementCritical thinking,coaching and performance management
Critical thinking,coaching and performance managementDr. Renalde Huysamen
 
Alva labs ab 2 - Logical ability
Alva labs ab 2 - Logical abilityAlva labs ab 2 - Logical ability
Alva labs ab 2 - Logical abilityKhanhLinhNguyen23
 
Competency mapping (2)
Competency mapping (2)Competency mapping (2)
Competency mapping (2)Bhumika Garg
 
Motivation, affective aspects, and knowledge maturing
Motivation, affective aspects, and knowledge maturingMotivation, affective aspects, and knowledge maturing
Motivation, affective aspects, and knowledge maturingAndreas Schmidt
 
Rsqrd AI: Recent Advances in Explainable Machine Learning Research
Rsqrd AI: Recent Advances in Explainable Machine Learning ResearchRsqrd AI: Recent Advances in Explainable Machine Learning Research
Rsqrd AI: Recent Advances in Explainable Machine Learning ResearchSanjana Chowdhury
 
types of analyses of research
types of analyses of researchtypes of analyses of research
types of analyses of researchssuser1ee781
 

Similar to Dhruval_Shah_CMPE258_ShortStory_PPT.pptx (20)

Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Cognitive Approaches to Second Language Acquisition
Cognitive Approaches to Second Language AcquisitionCognitive Approaches to Second Language Acquisition
Cognitive Approaches to Second Language Acquisition
 
Basics of Soft Computing
Basics of Soft  Computing Basics of Soft  Computing
Basics of Soft Computing
 
NLP Ecosystem
NLP EcosystemNLP Ecosystem
NLP Ecosystem
 
psychometrics chapter one.pptx
psychometrics chapter one.pptxpsychometrics chapter one.pptx
psychometrics chapter one.pptx
 
Silico-centric Theory of Mind arXiv:2403.09289v1 [cs.AI] 14 Mar 2024
Silico-centric Theory of Mind arXiv:2403.09289v1 [cs.AI] 14 Mar 2024Silico-centric Theory of Mind arXiv:2403.09289v1 [cs.AI] 14 Mar 2024
Silico-centric Theory of Mind arXiv:2403.09289v1 [cs.AI] 14 Mar 2024
 
Competency
CompetencyCompetency
Competency
 
Logical Thinking
Logical ThinkingLogical Thinking
Logical Thinking
 
Interpretable Machine Learning_ Techniques for Model Explainability.
Interpretable Machine Learning_ Techniques for Model Explainability.Interpretable Machine Learning_ Techniques for Model Explainability.
Interpretable Machine Learning_ Techniques for Model Explainability.
 
Presentation Brain Perform
Presentation Brain PerformPresentation Brain Perform
Presentation Brain Perform
 
Intelligence
IntelligenceIntelligence
Intelligence
 
Ability testing 5
Ability testing 5Ability testing 5
Ability testing 5
 
old.ipto-2004.ppt
old.ipto-2004.pptold.ipto-2004.ppt
old.ipto-2004.ppt
 
Individual behavior-in-organization
Individual behavior-in-organizationIndividual behavior-in-organization
Individual behavior-in-organization
 
Critical thinking,coaching and performance management
Critical thinking,coaching and performance managementCritical thinking,coaching and performance management
Critical thinking,coaching and performance management
 
Alva labs ab 2 - Logical ability
Alva labs ab 2 - Logical abilityAlva labs ab 2 - Logical ability
Alva labs ab 2 - Logical ability
 
Competency mapping (2)
Competency mapping (2)Competency mapping (2)
Competency mapping (2)
 
Motivation, affective aspects, and knowledge maturing
Motivation, affective aspects, and knowledge maturingMotivation, affective aspects, and knowledge maturing
Motivation, affective aspects, and knowledge maturing
 
Rsqrd AI: Recent Advances in Explainable Machine Learning Research
Rsqrd AI: Recent Advances in Explainable Machine Learning ResearchRsqrd AI: Recent Advances in Explainable Machine Learning Research
Rsqrd AI: Recent Advances in Explainable Machine Learning Research
 
types of analyses of research
types of analyses of researchtypes of analyses of research
types of analyses of research
 

Recently uploaded

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Recently uploaded (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Dhruval_Shah_CMPE258_ShortStory_PPT.pptx

  • 1. Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models - A Survey Short Story Review By: Dhruval Shah
  • 2. Introduction Exploring beyond task accuracy, this survey investigates if Large Language Models (LLMs) can reason like humans, highlighting the gap in our understanding of LLMs' reasoning processes and the over reliance on shallow patterns over deeper insights
  • 3. Understanding Reasoning Reasoning, both in humans and AI, is the process of deriving conclusions from premises given. While humans use implicit reasoning in everyday decisions, LLMs' reasoning is analyzed based on their response to specific tasks they were already trained on, focusing on their actions and the mechanisms they employ.
  • 4. Categorizing Reasoning Tasks This survey categorizes reasoning tasks into two types: Core and Integrated. Core Reasoning Tasks test LLMs on fundamental reasoning skills—logical, mathematical, and causal. Integrated tasks combine multiple core skills, but the focus here in this survey paper is solely on core tasks
  • 5.
  • 6. Logical Reasoning LLMs show varied success in logical reasoning, capable of engaging in deductive, inductive, and abductive reasoning. However, they often rely on identifying patterns, leading to challenges when faced with novel or complex scenarios outside their training.
  • 7. Mathematical Reasoning In mathematical reasoning, LLMs excel in familiar problems but struggle when presented with slightly altered problem statements. This indicates a reliance on memorization over genuine understanding, highlighting a key limitation in their reasoning capabilities.
  • 8. Causal Reasoning Causal reasoning in LLMs is tested through their ability to understand cause-and-effect relationships. LLMs can mimic this reasoning for scenarios they've been trained on but struggle with novel tasks, showing limitations in their ability to truly 'understand' causality.
  • 9. Evaluation Methodologies Overview The survey introduces four methodologies for evaluating LLMs' reasoning: Conclusion- based, Rationale-based, Interactive, and Mechanistic evaluations, each offering unique insights into the reasoning behavior of LLMs.
  • 10.
  • 11. Conclusion-Based Evaluation This methodology assesses LLMs based on the accuracy and relevance of their conclusions, using error analysis to identify patterns in mistakes, thus testing the models' adaptability and generalization capabilities.
  • 12. Rationale-Based Evaluation Rationale-based evaluation delves into the 'thought process' of LLMs, analyzing the logical steps they take to reach conclusions. It assesses the coherence, validity, and consistency of their rationales, aiming to understand if LLMs mimic human thought processes.
  • 13. Interactive Evaluation Interactive evaluation involves engaging LLMs in dynamic dialogues, mimicking human learning environments. It tests the models' ability to adapt their reasoning in light of new information, assessing their conversational reasoning skills
  • 14. Mechanistic Evaluation This method aims to understand the internal workings of LLMs during reasoning, analyzing models' attention patterns, layer functions, and activation flows to uncover the computational bases of their reasoning behavior.
  • 15. Key Insights LLMs demonstrate an impressive yet limited ability to mimic human reasoning, heavily relying on pattern recognition. The survey highlights the need for more nuanced evaluation methods to fully capture the complexity of LLM reasoning processes.
  • 16. Future Research Directions The survey calls for research focused on enhancing LLMs' generalization capabilities and developing sophisticated evaluation methodologies that go beyond traditional task performance metrics, aiming to bridge the gap between human and AI reasoning.
  • 17. The Journey Ahead Understanding LLMs' reasoning abilities is an ongoing journey, with the ultimate goal of developing models that not only compute but truly comprehend and reason, mirroring the depth of human thought
  • 18. Closing Thoughts This survey underscores the importance of advancing our understanding and evaluation of LLMs' reasoning abilities, paving the way for future AI systems that are more aligned with human-like reasoning.