Debugging intermittent issues presents a unique set of challenges and requirements to the developer. Of particular interest is the question: “did this fix actually solve the problem, or am I just getting lucky today?” This talk will show you how to determine, with some level of confidence, whether the issue is really solved or just not showing up. In addition, issues which only show up rarely have a habit of accumulating until some issue seems to be happening all the time. Techniques for untangling the interaction of multiple issues will also be discussed.
Debugging Intermittent Issues - A How-To
1. Debugging Intermittent Issues
Lloyd Moore, President
Lloyd@CyberData-Robotics.com
www.CyberData-Robotics.com
Northwest C++ Users Group, January 2017
2. Agenda
The problem with intermittent issues
Getting a baseline
Basic statistics
Making a change
Is it fixed?
Dealing with multi-causal issues
3. Problem with Intermittent Issues
Fundamentally you do not know what is causing
the problem, so you cannot control for it
You never know from one run to the next if the
issue will show up
From a single run you cannot say anything, so
you will need multiple runs
Once you have multiple runs you need statistics
to identify and describe any change in behavior
4. Getting a Baseline
A baseline is a reference point or configuration
where you control any factor that might be
related to the issue
Since you really don't know what is causing the
issue, you likely won't actually control for it, but
at least you minimize the number of variables
affecting future measurements
Ideally the goal is to simply control as much of
the system as possible
5. Key Point
Intermittent failures are NOT really
intermittent!!!
They are caused by some varying condition that
you are not measuring, observing, or
controlling for
If the varying condition happens in a particular
way – you get a failure – every time
The debugging problem is to find the condition
that is varying, quantify how it varies, and then
control it
6. Getting a Baseline
What you need to control will vary based on
what you are debugging
“Pure software” based issues will often be
easier to control than “network connected”
issues or “hardware related” issues
Note that “pure software” issues are also far
less likely to be intermittent as there are far
fewer sources of randomness
The most notable source of pure software randomness
is system load leading to race conditions
7. Items to look at controlling
Version of the software you are running
Physical location of the target system
Other processes running on the target system
Network traffic to the target system – ideally
connect the system to a wired network
connection on an isolated network
Other processes on, and the response time of,
remotely connected systems that the target
system depends on
8. More items to control
Data being manipulated by the system
State of the target system – does each run of
the system begin in the same state, including
existence and size of debugging files
Depending on what is going on, ANYTHING
could affect the issue: lighting, time of day,
people in the room, temperature; be creative
You may not get everything on the first pass; for
really hard problems, establishing a baseline
becomes iterative
9. Also Control Yourself
As you work the problem and repeatedly go
through the test case, you will unconsciously
change your behavior
Humans naturally learn to avoid failure
conditions, and this includes the behavior
patterns that trigger bugs
The result can be VERY subtle, even something
like changing your keystroke rate slightly, and
this can affect your testing
10. Can't Control It – Measure It
Many times you will not be able to control a
factor that may be important – time of day for
example
If you cannot control a factor, make every
attempt to measure and record its value during
your test runs
Oftentimes, after hours of testing and looking at
something else, a pattern will emerge in the
collected data that strongly hints at the real
issue
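As a minimal sketch of this measure-and-record idea, the fragment below appends one timestamped record per run to a CSV file. The specific fields (trial number, ambient temperature, outcome) are hypothetical stand-ins for whatever factors matter in your setup:

```cpp
#include <chrono>
#include <fstream>
#include <string>

// Append one timestamped record per test run to a CSV log.
// The temperature and outcome fields are placeholders for whatever
// uncontrolled factors and results apply to your system.
void logRun(int trial, double temperatureC, const std::string& outcome) {
    std::ofstream log("runs.csv", std::ios::app);
    const auto now = std::chrono::system_clock::now().time_since_epoch();
    const auto secs = std::chrono::duration_cast<std::chrono::seconds>(now).count();
    log << secs << ',' << trial << ',' << temperatureC << ',' << outcome << '\n';
}
```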
11. Baseline Failure Rate
Part of the baseline configuration is the rate at
which you experience the failure condition
Once you have the test environment as stable
and quantified as possible, you need to run the
system multiple times and quantify how often
you see the failure
Three basic outcomes here:
The failure no longer occurs
The failure now happens all the time
The failure happens X of Y trials
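As a sketch, the baseline harness can be as simple as the loop below; runSystemOnce() is a hypothetical stand-in for one complete run of your system, simulated here with a fixed 30% failure rate so the example runs as-is:

```cpp
#include <cstdio>
#include <random>

// Hypothetical stand-in for one complete run of the system under test;
// returns true on success. Simulated here with a 30% failure rate.
bool runSystemOnce() {
    static std::mt19937 gen{std::random_device{}()};
    static std::bernoulli_distribution fails(0.30);
    return !fails(gen);
}

int main() {
    const int trials = 10;
    int failures = 0;
    for (int i = 0; i < trials; ++i)
        if (!runSystemOnce()) ++failures;
    std::printf("%d failures in %d runs: %.0f%% observed failure rate\n",
                failures, trials, 100.0 * failures / trials);
}
```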
12. Baseline Failure Rate
If you no longer have an intermittent failure (no
failure or constant failure), something you
changed in setting up the baseline is likely
related to the issue. This is good: you can now
systematically vary the baseline setup to debug
Most likely you will still get some failures; for the
next set of slides we will assume that you have
run the system 10 times and seen 3 failures...
13. Basic Statistics
3 failures in 10 runs is: 3/10 = 0.3 = 30% failure
rate
Success rate is: 1.0 - 0.3 = 0.7 = 70%
Note that this number is an APPROXIMATION –
you can really never know the actual rates
What is the chance of testing 10 times and not
seeing a single failure?
Chance of success on first try * chance of success
on second try * chance ….
0.7 * 0.7.... = 0.7^10 = 0.028 = 2.8%
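The same arithmetic expressed as code; chanceOfCleanRuns is a helper name introduced here for illustration:

```cpp
#include <cmath>
#include <cstdio>

// Probability of seeing zero failures in `trials` runs when the true
// per-run success rate is `successRate`: multiply the success rate by
// itself once per run.
double chanceOfCleanRuns(double successRate, int trials) {
    return std::pow(successRate, trials);
}

int main() {
    // 3 failures in 10 runs -> approximately a 70% success rate.
    std::printf("%.4f\n", chanceOfCleanRuns(0.70, 10));  // 0.0282, i.e. 2.8%
}
```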
14. How many times to test?
So if we CAN run the test 10 times and not see
a failure, how many times do we really need to
test?
Yes, there is a formula to estimate this, but I have
never actually used it in practice, as several of
the variables involved are pretty hard to
estimate (things like degrees of freedom)
Simply use a rule of thumb here: get the chance
of not seeing a failure, when there really is one,
below either 5% or 1%
15. How many times to test?
If I want to get below 1%, will 15 trials be
enough?
0.70^15 = 0.0047 = 0.47% - Yep, and this approach
is good enough for day to day work
Those who are more math inclined could also solve
the equation 0.01 = 0.70^X, then of course make
sure to take the next full integer! (Hint: you need
the identity log_b(a^r) = r * log_b(a))
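Working that out: X = log(0.01) / log(0.70) ≈ 12.9, which rounds up to 13 trials. A minimal sketch of the same calculation, with trialsNeeded as a name introduced here for illustration:

```cpp
#include <cmath>
#include <cstdio>

// Smallest n such that successRate^n drops below alpha: the number of
// consecutive clean trials needed before the chance of "no failure seen,
// but the bug is still there" is below the chosen threshold.
int trialsNeeded(double successRate, double alpha) {
    return static_cast<int>(std::ceil(std::log(alpha) / std::log(successRate)));
}

int main() {
    std::printf("below 5%%: %d trials\n", trialsNeeded(0.70, 0.05));  // 9
    std::printf("below 1%%: %d trials\n", trialsNeeded(0.70, 0.01));  // 13
}
```

So 13 clean trials already get below the 1% threshold, confirming that 15 are more than enough.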
16. OK now what?
We have a controlled test set up
We have an approximation of the failure rate
We may also have some clues on what may be
causing the problem
Next step is to attempt a change and observe
the failure rate
17. Is it fixed?
I made a change and now I don't see the failure
condition any longer, so I'm done, right?
Hmm, NO!
From the previous slide:
What is the chance of testing 10 times and not
seeing a single failure?
Chance of success on first try * chance of success
on second try * chance ….
0.7 * 0.7.... = 0.7^10 = 0.028 = 2.8%
18. Is it fixed?
One issue with problems that display
intermittently is that you can never really know
for sure if they are fixed
The best you can do is estimate the probability
that they are fixed and decide if that is good
enough for your application
You can always run more trials to increase the
certainty that you really fixed the problem
19. Is it fixed?
Assume our original 30% failure rate, 70%
success rate:
Trials   Formula   Chance of not seeing a failure
10       0.7^10    2.82%
15       0.7^15    0.474%
20       0.7^20    0.079%
25       0.7^25    0.013%
20. Multi-causal Issues
I made a change and my failure rate decreased,
but not to zero. Now what?
This is a very good indication that you have either
affected the original problem, or have actually
solved one problem while another remains
The key to sorting this out is to look for details
on the actual failure
Oftentimes multiple issues produce a similar
looking failure, but are really two different things
21. Multi-causal Issues
Attempt to get more information on the exact
nature of the remaining failure; things to look at:
Stack traces
Timing values, how long into the run?
Variation in results – cluster plots helpful here
More than one trigger for the failure
Many times you will stumble upon this
information while debugging the original issue
It is very common for intermittent issues to be
multi-causal, as they are often ignored when first seen
22. Multi-causal Issues
If you suspect you have more than one issue
setup a separate debugging environment for
each suspected issue
You may also be able to make a list of the
separate failure cases and failure rates,
decomposing the original numbers (a sketch
follows at the end of this list)
It is helpful to remember that the original failure
rate is simply the sum of the individual failure
rates, assuming a run never hits two failures at once
Work the issue with the highest failure rate first
Keep working down the list until all are solved
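As a sketch of that decomposition, assuming each observed failure has been tagged with a signature such as a stack-trace hash or a distinctive log line (the signatures and counts below are made up for illustration):

```cpp
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    const int totalRuns = 100;
    // Hypothetical failure signatures collected over 100 runs.
    std::vector<std::string> failures = {
        "crash_in_parser", "timeout", "crash_in_parser",
        "timeout", "timeout", "bad_checksum"};

    // Count how often each distinct failure signature occurred.
    std::map<std::string, int> counts;
    for (const auto& sig : failures) ++counts[sig];

    // Order the suspected causes by failure rate, highest first.
    std::vector<std::pair<std::string, int>> byRate(counts.begin(), counts.end());
    std::sort(byRate.begin(), byRate.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });

    for (const auto& [sig, n] : byRate)
        std::printf("%s: %.1f%%\n", sig.c_str(), 100.0 * n / totalRuns);
}
```

Here the per-cause rates (3% timeout, 2% parser crash, 1% bad checksum) sum back to the original 6% overall failure rate.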
23. Summary
Intermittent issues are not really intermittent
Need to track down what unknown variable is
changing and handle that variation
A baseline system configuration with failure
rates is key to telling if a change occurred
Just because you don't see the failure any more
is NOT a guarantee that it is fixed
Multi-causal issues can be separated and
worked as individual issues once you have
details on the failure