Debugging Intermittent Issues
Lloyd Moore, President
Lloyd@CyberData-Robotics.com
www.CyberData-Robotics.com
Northwest C++ Users Group, January 2017
Agenda
 The problem with intermittent issues
 Getting a baseline
 Basic statistics
 Making a change
 Is it fixed?
 Dealing with multi-causal issues
Problem with Intermittent Issues
 Fundamentally you do not know what is causing
the problem, therefore you cannot control for it
 You never know from one run to the next if the
issue will show up
 From a single run you cannot say anything, so
you will need multiple runs
 Once you have multiple runs you need statistics
to identify and describe any change in behavior
Getting a Baseline
 A baseline is a reference point or configuration
where you control any factor that might be
related to the issue
 Since you really don't know what is causing the
issue – you likely won't actually control for it, but
at least you minimize the number of variables
affecting future measurements
 Ideally the goal is to simply control as much of
the system as possible
Key Point
 Intermittent failures are NOT really
intermittent!!!
 They are caused by some varying condition that
you are not measuring, observing, and
controlling for
 If the varying condition happens in a particular
way – you get a failure – every time
 The debugging problem is to find the condition
that is varying, quantify how it varies, then
control it
Getting a Baseline
 What you need to control will vary based on
what you are debugging
 “Pure software” based issues will often be
easier to control than “network connected”
issues or “hardware related” issues
 Note that “pure software” issues are also far
less likely to be intermittent as there are far
fewer sources of randomness
 Most notable pure software randomness is system
loading leading to race conditions
Items to look at controlling
 Version of the software you are running
 Physical location of the target system
 Other processes running on the target system
 Network traffic to the target system – ideally
connect the system to a wired network
connection on an isolated network
 Other processes, and response time, of
remotely connected systems that the target
system depends on
More items to control
 Data being manipulated by the system
 State of the target system – does each run of
the system begin in the same state, including
existence and size of debugging files
 Depending on what is going on, ANYTHING
could affect the issue: lighting, time of day,
people in the room, temperature. Be creative.
 May not get everything on the first pass – for
really hard problems establishing a baseline
becomes iterative
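One way to make the starting state comparable between runs is to write it down before each run. The following is only a minimal sketch: the function name `baseline_snapshot`, its parameters, and the choice of recorded fields are assumptions made for illustration, not something taken from the slides.

```python
import json
import os
import platform
import tempfile


def baseline_snapshot(software_version, debug_files):
    """Capture a few controllable facts about the target system before a run.

    `software_version` and `debug_files` are hypothetical inputs; record
    whatever factors matter for the system being debugged.
    """
    return {
        "software_version": software_version,
        "hostname": platform.node(),
        "os": platform.platform(),
        # Existence and size of debugging files (None means the file is absent).
        "debug_files": {
            path: os.path.getsize(path) if os.path.exists(path) else None
            for path in debug_files
        },
    }


# Example with a throwaway directory standing in for the real log location.
with tempfile.TemporaryDirectory() as workdir:
    log_path = os.path.join(workdir, "debug.log")
    with open(log_path, "w") as handle:
        handle.write("left over from the previous run\n")
    print(json.dumps(baseline_snapshot("1.4.2", [log_path]), indent=2))
```

Comparing the snapshots of two runs shows quickly whether they really started from the same state.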
Also Control Yourself
 As you work the problem and repeatedly go
through the test case you will unconsciously
change your behavior
 Humans naturally learn to avoid failure
conditions, and this includes behavior patterns
that expose bugs
 Result can be VERY subtle, even something
like changing your keystroke rate slightly, and
this can affect your testing
Can't Control It – Measure it
 Many times you will not be able to control a
factor that may be important – time of day for
example
 If you cannot control a factor make every
attempt to measure and record the value during
your test runs
 Often, after hours of testing and looking at
something else, a pattern will emerge in the
collected data that strongly hints at the real
issue
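Recording uncontrolled factors can be as simple as appending one row per test run to a CSV file. This is a sketch under stated assumptions: the column names, the `record_run` helper, and the temperature readings are hypothetical, and an in-memory buffer stands in for a real log file.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical factors; replace with whatever cannot be controlled.
FIELDS = ["run", "timestamp_utc", "temperature_c", "failed"]


def record_run(writer, run_number, failed, temperature_c, now=None):
    """Append one row describing a single test run."""
    now = now or datetime.now(timezone.utc)
    writer.writerow({
        "run": run_number,
        "timestamp_utc": now.isoformat(timespec="seconds"),
        "temperature_c": temperature_c,
        "failed": failed,
    })


log = io.StringIO()  # stand-in for open("runs.csv", "a", newline="")
writer = csv.DictWriter(log, fieldnames=FIELDS)
writer.writeheader()
record_run(writer, 1, failed=False, temperature_c=21.5)
record_run(writer, 2, failed=True, temperature_c=24.0)
print(log.getvalue())
```

Once enough rows exist, sorting or plotting the log by each column is a cheap way to look for the pattern the slide describes.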
Baseline Failure Rate
 Part of the baseline configuration is the rate at
which you experience the failure condition
 Once you have the test environment as stable
and quantified as possible you need to run the
system multiple times and quantify how often
you see the failure
 Three basic outcomes here:
 The failure no longer occurs
 The failure now happens all the time
 The failure happens X of Y trials
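Measuring the baseline rate amounts to running the same test repeatedly and counting failures. A minimal sketch, assuming the test can be wrapped in a function that returns True on success; `flaky_test` is a hypothetical stand-in that fails at random, not a real test.

```python
import random
from typing import Callable, Tuple


def measure_failure_rate(test: Callable[[], bool], trials: int) -> Tuple[int, int]:
    """Run `test` `trials` times and return (failures, trials)."""
    failures = sum(1 for _ in range(trials) if not test())
    return failures, trials


def classify(failures: int, trials: int) -> str:
    """Map a result onto the three basic outcomes listed above."""
    if failures == 0:
        return "failure no longer occurs"
    if failures == trials:
        return "failure happens all the time"
    return f"failure happens {failures} of {trials} trials"


# Hypothetical stand-in for the real test: fails roughly 30% of the time.
rng = random.Random(2017)


def flaky_test() -> bool:
    return rng.random() >= 0.3


failures, trials = measure_failure_rate(flaky_test, 10)
print(classify(failures, trials))
```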
Baseline Failure Rate
 If you no longer have an intermittent failure (no
failure or constant failure), something you
changed in setting up the baseline is likely
related to the issue – this is good, you can now
systematically vary the baseline setup to debug
 Most likely you will still get some failures. For the
next set of slides we will assume that you have
run the system 10 times and seen 3 failures.
Basic Statistics
 3 failures in 10 runs is: 3/10 = 0.3 = 30% failure
rate
 Success rate is: 1.0 – 0.3 = 0.7 = 70%
 Note that this number is an APPROXIMATION –
you can never really know the actual rates
 What is the chance of testing 10 times and not
seeing a single failure:
 Chance of success on first try * chance of success
on second try * chance ….
 0.7 * 0.7.... = 0.7^10 = 0.028 = 2.8%
How many times to test?
 So if we CAN run the test 10 times and not see
a failure how many times do we really need to
test?
 Yes there is a formula to estimate this – have
never actually used this in practice as several of
the variables involved are pretty hard to
estimate – things like degrees of freedom
 Simply use a rule of thumb here – want to get
the chance of not seeing a failure when there
really is one below either 5% or 1%
How many times to test?
 If I want to get below 1% will 15 trials be
enough?
 0.70^15 = 0.0047 = 0.47% - Yep, and this approach
is good enough for day to day work
 Those who are more mathematically inclined could also
solve the equation: 0.01 = 0.70^X, then of course make
sure to take the next full integer! (Hint: you need
log_b(a^r) = r * log_b(a))
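Solving 0.01 = 0.70^X with the logarithm identity gives X = log(0.01) / log(0.70), which is then rounded up to a whole number of trials. The sketch below does exactly that; the function name `trials_needed` is an assumption for illustration, and independent runs are assumed as before.

```python
import math


def trials_needed(success_rate: float, target: float) -> int:
    """Smallest whole number of trials n with success_rate ** n below target."""
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    # success_rate ** x = target  =>  x = log(target) / log(success_rate)
    n = math.ceil(math.log(target) / math.log(success_rate))
    # Step past the boundary case where success_rate ** n equals the target.
    while success_rate ** n >= target:
        n += 1
    return n


print(trials_needed(0.70, 0.01))  # trials needed to get below 1%
print(trials_needed(0.70, 0.05))  # trials needed to get below 5%
```

For the 1% target the exact answer is smaller than the 15 trials tried on the slide, so 15 is a comfortable round number rather than the minimum.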
OK now what?
 We have a controlled test set up
 We have an approximation of the failure rate
 We may also have some clues on what may be
causing the problem
 Next step is to attempt a change and observe
the failure rate
Is it fixed?
 I made a change and now I don't see the failure
condition any longer – I'm done, right?
 Hmm – NO!
 From the Basic Statistics slide:
 What is the chance of testing 10 times and not
seeing a single failure:
 Chance of success on first try * chance of success
on second try * chance ….
 0.7 * 0.7.... = 0.7^10 = 0.028 = 2.8%
Is it fixed?
 One issue with problems that display
intermittently is that you can never really know
for sure if they are fixed
 The best you can do is estimate the probability
that they are fixed and decide if that is good
enough for your application
 You can always run more trials to increase the
certainty that you really fixed the problem
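A related way to look at "good enough" is to turn the same formula around: after n clean runs, how large could the remaining failure rate still be? This is a complementary calculation, not one from the slides. It solves (1 - p)^n = 1 - confidence for p, assumes independent runs, and the function name is hypothetical.

```python
def max_plausible_failure_rate(clean_runs: int, confidence: float = 0.95) -> float:
    """Largest failure rate p for which `clean_runs` straight passes would
    still happen with probability 1 - confidence, i.e. (1 - p)^n = 1 - confidence.
    """
    if clean_runs < 1:
        raise ValueError("clean_runs must be at least 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_runs)


for n in (10, 25, 100):
    bound = max_plausible_failure_rate(n)
    print(f"{n:>3} clean runs: failure rate could still be up to {bound:.1%}")
```

The bound shrinks as trials are added, which is the quantitative version of "run more trials to increase the certainty".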
Is it fixed?
 Assume our original 30% failure rate, 70%
success rate:
Trials   Formula   Chance of not seeing failure
10       0.7^10    2.82%
15       0.7^15    0.474%
20       0.7^20    0.079%
25       0.7^25    0.013%
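The table can be regenerated for any success rate and set of trial counts. A small sketch, assuming independent runs; the function name is chosen for illustration, and the printed values may differ from the slide in the last digit depending on how they are rounded.

```python
def clean_run_table(success_rate, trial_counts):
    """Return (trials, chance of seeing no failure) pairs."""
    return [(n, success_rate ** n) for n in trial_counts]


print("Trials   Formula   Chance of not seeing failure")
for trials, chance in clean_run_table(0.7, [10, 15, 20, 25]):
    formula = f"0.7^{trials}"
    print(f"{trials:<8} {formula:<9} {chance:.3%}")
```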
Multi-causal Issues
 I made a change and my failure rate decreased,
but not to zero, now what?
 Very good indication that you have either
affected the original problem or have actually
solved one problem but another remains
 The key to sorting this out is to look for details
on the actual failure
 Often multiple issues produce similar-looking
failures, but are really two different things
Multi-causal Issues
 Attempt to get more information on the exact
nature of the remaining failure, things to look at:
 Stack traces
 Timing values, how long into the run?
 Variation in results – cluster plots helpful here
 More than one trigger for the failure
 Many times you will stumble upon this
information while debugging the original issue
 Very common for intermittent issues to be
multi-causal, as they are ignored when first seen
Multi-causal Issues
 If you suspect you have more than one issue,
set up a separate debugging environment for
each suspected issue
 You may also be able to make a list of the
separate failure cases and failure rates,
decomposing the original numbers
 Helpful to remember the original failure rate is
simply the sum of the individual failure rates,
as long as each failed run is counted under one cause
 Work the issue with the highest failure rate first
 Keep working down the list until all are solved
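Decomposing the failure rate can be sketched by tagging each failed run with a signature (for example the top stack frame) and counting per signature. The data below is made up purely for illustration, and the function name and signature strings are hypothetical. The rates only add up to the overall rate because each failed run is counted under exactly one signature.

```python
from collections import Counter


def failure_rates_by_signature(signatures, total_runs):
    """Return (signature, failure rate) pairs, highest rate first.

    `signatures` holds one entry per FAILED run; `total_runs` counts all runs.
    """
    if total_runs < len(signatures):
        raise ValueError("total_runs cannot be smaller than the number of failures")
    counts = Counter(signatures)
    rates = [(sig, count / total_runs) for sig, count in counts.items()]
    return sorted(rates, key=lambda item: item[1], reverse=True)


# Made-up example: 20 runs, 6 of which failed in two distinguishable ways.
observed = ["timeout", "null deref", "timeout", "timeout", "null deref", "timeout"]
rates = failure_rates_by_signature(observed, total_runs=20)

for signature, rate in rates:
    print(f"{signature:<12} {rate:.0%}")
print(f"{'combined':<12} {sum(rate for _, rate in rates):.0%}")
```

The sorted output doubles as the work list: start with the first entry and work down.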
Summary
 Intermittent issues are not really intermittent
 Need to track down what unknown variable is
changing and handle that variation
 A baseline system configuration with failure
rates is key to telling if a change occurred
 Just because you don't see the failure any more
is NOT a guarantee that it is fixed
 Multi-causal issues can be separated and
worked as individual issues once you have
details on the failure
Questions?
