How to Investigate Intermittent Problems

The ability and the confidence to investigate an intermittent bug is one of the things that marks an excellent tester. The most engaging stories about testing I have heard have been about hunting a “white whale” sort of problem in an ocean of complexity. Recently, a thread on the SHAPE forum made me realize that I had not yet written about this fascinating aspect of software testing.

Unlike a mysterious non-intermittent bug, an intermittent bug is more of a testing problem than a development problem. A lot of programmers will not want to chase that white whale, when there’s other fishing to do.

Intermittent behavior itself is no big deal. It could be said that digital computing is all about the control of intermittent behavior. So, what are we really talking about?

We are not concerned about intermittence that is both desirable and non-mysterious, even if it isn’t exactly predictable. Think of a coin toss at the start of a football game, or a slot machine that comes up all 7’s once in a long while. We are not even concerned about mysterious intermittent behavior if we believe it can’t possibly cause a problem. For the things I test, I don’t care much about transient magnetic fields or minor random power spikes, even though they are happening all the time.

Many intermittent problems have not yet been observed at all, perhaps because they haven’t yet manifested, or perhaps because they have manifested but not yet been noticed. The only thing we can do about that is to get the best test coverage we can and keep at it. No algorithm can exist for automatically detecting or preventing all intermittent problems.

So, what we typically call an intermittent problem is: a mysterious and undesirable behavior of a system, observed at least once, that we cannot yet manifest on demand.

Our challenge is to transform the intermittent bug into a regular bug by resolving the mystery surrounding it. After that it’s the programmer’s headache.

Some Principles of Intermittent Problems:

  • Be comforted: the cause is probably not evil spirits.
  • If it happened once, it will probably happen again.
  • If a bug goes away without being fixed, it probably didn’t go away for good.
  • Be wary of any fix made to an intermittent bug. By definition, a fixed bug and an unfixed intermittent bug are indistinguishable over some period of time and/or input space.
  • Any software state that takes a long time to occur, under normal circumstances, can also be reached instantly, by unforeseen circumstances.
  • Complex and baffling behavior often has a simple underlying cause.
  • Complex and baffling behavior sometimes has a complex set of causes.
  • Intermittent problems often teach you something profound about your product.
  • It’s easy to fall in love with a theory of a problem that is sensible, clever, wise, and just happens to be wrong.
  • The key to your mystery might be resting in someone else’s common knowledge.
  • An intermittent problem in the lab might be easily reproducible in the field.
  • The Pentium Principle of 1994: an intermittent technical problem may pose a *sustained and expensive* public relations problem.
  • The problem may be intermittent, but the risk of that problem is ever present.
  • The more testability is designed into a product, the easier it is to investigate and solve intermittent problems.
  • When you have eliminated the impossible, whatever remains, however improbable, could have done a lot of damage by then! So, don’t wait until you’ve fully researched an intermittent problem before you report it.
  • If you ever get in trouble over an intermittent problem that you could not lock down before release, you will fare a lot better if you made a faithful, thoughtful, vigorous effort to find and fix it. The journey can be the reward, you might say.

Some General Suggestions for Investigating Intermittent Problems:

  • Recheck your most basic assumptions: are you using the computer you think you are using? are you testing what you think you are testing? are you observing what you think you are observing?
  • Eyewitness reports leave out a lot of potentially vital information. So listen, but DO NOT BECOME ATTACHED to the claims people make.
  • Invite more observers and minds into the investigation.
  • Create incentives for people to report intermittent problems.
  • If someone tells you what the problem can’t possibly be, consider putting extra attention into those possibilities.
  • Check tech support websites for each third party component you use. Maybe the problem is listed.
  • Seek tools that could help you observe and control the system.
  • Improve communication among observers (especially with observers who are users in the field).
  • Establish a central clearinghouse for mystery bugs, so that patterns among them might be easier to spot.
  • Look through the bug list for any other bug that seems like the intermittent problem.
  • Make more precise observations (consider using measuring instruments).
  • Improve testability: Add more logging and scriptable interfaces.
  • Control inputs more precisely (including sequences, timing, types, sizes, sources, iterations, combinations).
  • Control state more precisely (find ways to return to known states).
  • Systematically cover the input and state spaces.
  • Save all log files. Someday you’ll want to compare patterns in old logs to patterns in new ones.
  • If the problem happens more often in some situations than in others, consider doing a statistical analysis of the variance between input patterns in those situations.
  • Consider controlling things that you think probably don’t matter.
  • Simplify. Try changing only one variable at a time; try subdividing the system. (This helps you understand and isolate the problem when it occurs.)
  • Complexify. Try changing more variables at once; let the state get “dirty”. (This helps you make a lottery-type problem happen.)
  • Inject randomness into states and inputs (possibly by loosening controls) in order to reach states that may not fit your typical usage profile.
  • Create background stress (high loads; large data).
  • Set a trap for the problem, so that the next time it happens, you’ll learn much more about it.
  • Consider reviewing the code.
  • Look for interference among components created by different organizations.
  • Celebrate and preserve stories about intermittent problems and how they were resolved.
  • Systematically consider the conceivable causes of the problem (see below).
  • Beware of burning huge time on a small problem. Keep asking, is this problem worth it?
  • When all else fails, let the problem sit a while, do something else, and see if it spontaneously recurs.
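Two of the suggestions above (add more logging; set a trap for the problem) can be combined in a small sketch. This is a hypothetical Python decorator built only on the standard library; `apply_discount` and its “totals never go negative” invariant are made up for illustration, standing in for whatever rule your intermittent bug violates:

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(message)s")

def trap(invariant):
    """Wrap a function so every call is logged, and any call that breaks
    the invariant captures its full context for later analysis."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if invariant(result):
                logging.debug("%s args=%r kwargs=%r -> %r",
                              func.__name__, args, kwargs, result)
            else:
                # The trap has sprung: record everything we know at the moment
                # the mystery behavior occurs, instead of reconstructing it later.
                logging.error("INVARIANT VIOLATED in %s: args=%r kwargs=%r -> %r",
                              func.__name__, args, kwargs, result)
            return result
        return wrapper
    return decorator

@trap(invariant=lambda total: total >= 0)  # hypothetical rule for illustration
def apply_discount(price, discount):
    return price - discount

apply_discount(100, 30)    # normal call, logged at DEBUG
apply_discount(100, 120)   # springs the trap, logged at ERROR
```

The point of the design is that the trap costs almost nothing while you wait: the next time the problem happens, the log already contains the inputs and result that produced it.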

Considering the Causes of Intermittent Problems

When investigating an intermittent problem, it may be worth considering the kinds of things that cause such problems. The list of guideword heuristics below may help you do that analysis systematically. There is some redundancy among the items in the list, because causes can be viewed from different perspectives.

Possibility 1: The system is NOT behaving differently. The apparent intermittence is an artifact of the observation.

  • Bad observation: The observer may have made a poor observation. (e.g. “Inattentional blindness” is a phenomenon whereby an observer whose mind is occupied may not see things that are in plain view. When presented with the scene a second time, the observer may see new things in the scene and assume that they weren’t there before. Also, certain optical illusions cause apparently intermittent behavior in an unchanging scene. See “the scintillating grid.”)
  • Irrelevant observation: The observer may be looking at differences that don’t matter. The things that matter may not be intermittent. This can happen when an observation is too precise for its purpose.
  • Bad memory: The observer may have mis-remembered the observation, or records of the observation could have been corrupted. (There’s a lot to observe when we observe! Our minds immediately compact the data and relate it to other data. Important data may be edited out. Besides, a lot of system development and testing involves highly repetitive observations, and we sometimes get them mixed up.)
  • Misattribution: The observer may have mis-attributed the observation. (“Microsoft Word crashed” might mean that *Windows* crashed for a reason that had nothing whatsoever to do with Word. Word didn’t “do” anything. This is a phenomenon also known as “false correlation” and often occurs in the mind of an observer when one event follows hard on the heels of another event, making one appear to be caused by the other. False correlation is also chiefly responsible for many instances whereby an intermittent problem is mistakenly construed to be a non-intermittent problem with a very complex and unlikely set of causes.)
  • Misrepresentation: The observer may have misrepresented the observation. (There are various reasons for this. An innocent reason is that the observer is so confident in an inference that they have the honest impression that they did observe it and report it as such. I once asked my son if his malfunctioning Playstation was plugged in. “Yes!” he said impatiently. After some more troubleshooting, I had just concluded that the power supply was shot when I looked down and saw that it was obviously not plugged in.)
  • Unreliable oracle: The observer may be applying an intermittent standard for what constitutes a “problem.” (We may get the impression that a problem is intermittent only because some people, some of the time, don’t consider the behavior to be a problem, even if the behavior is itself predictable. Different observers may have different tolerances and sensitivities; and the same observer may vary in that way from one hour to the next.)
  • Unreliable communication: Communication with the observer may be inconsistent. (We may get the impression that a problem is intermittent simply because reports about it don’t consistently reach us, even if the problem is itself quite predictable. “I guess people aren’t seeing the problem anymore” may simply mean that people no longer bother to complain.)

Possibility 2: The system behaved differently because it was a different system.

  • Deus ex machina: A developer may have changed it on purpose, and then changed it back. (This can occur easily when multiple developers or teams are simultaneously building or servicing different parts of an operational server platform without coordinating with each other. Another possibility, of course, is that the system has been modified by a malicious hacker.)
  • Accidental change: A developer may be making accidental changes. (The changes may have unanticipated side effects, leading to the intermittent behavior. Also, a developer may be unwittingly changing a live server instead of a sandbox system.)
  • Platform change: A platform component may have been swapped or reconfigured. (An administrator or user may have changed, intentionally or not, a component on which the product depends. Common sources of these problems include Windows automatic updates and memory or disk space reconfigurations.)
  • Flaky hardware: A physical component may have transiently malfunctioned. (Transient malfunctions may be due to factors such as inherent natural variation, magnetic fields, excessive heat or cold, low-battery conditions, poor maintenance, or physical shock.)
  • Trespassing system: A foreign system may be intruding. (For instance, in web testing, I might get occasionally incorrect results due to a proxy server somewhere at my ISP that provides a cached version of pages when it shouldn’t. Other examples are background virus scans, automatic system updates, other programs, or other instances of the same program.)
  • Executable corruption: The object code may have become corrupted. (One of the worst bugs I ever created in my own code (in terms of how hard it was to find) involved machine code in a video game that occasionally wrote data over a completely unrelated part of the same program. Because of the nature of that data, the system didn’t crash, but rather the newly corrupted function passed control to the function that immediately followed it in memory. Took me days (and a chip emulator) to figure it out.)
  • Split personality: The “system” may actually be several different systems that perform as one. (For instance, I may get inconsistent results from Google depending on which Google server I happen to get; or I might not realize that different machines in the test lab have different versions of some key component; or I might mistype a URL and accidentally test on the wrong server some of the time.)
  • Human element: There may be a human in the system, making part of it run, and that human is behaving inconsistently.
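One way to rule out several of these possibilities (executable corruption, split personality, platform change) is to fingerprint the bits you are actually testing. A minimal sketch using only Python’s standard library; the idea is to record the digest of the binary or component alongside every bug report:

```python
import hashlib
import sys

def fingerprint(path, algo="sha256"):
    """Hash a file so two machines (or two test runs) can confirm
    they are exercising identical bits."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        # Read in chunks so large executables don't load into memory at once.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Example: fingerprint the interpreter running this script.
print(fingerprint(sys.executable))
```

If two reports of the “same” intermittent bug carry different digests, you were testing different systems, not observing intermittent behavior of one system.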

Possibility 3: The system behaved differently because it was in a different state.

  • Frozen conditional: A decision that is supposed to be based on the status of a condition may have stopped checking that condition. (It could be stuck in an “always yes” or “always no” state.)
  • Improper initialization: One or more variables may not have been initialized. (The starting state of a computation would therefore depend on the state of some previous computation of the same or other function.)
  • Resource denial: A critical file, stream, or other variable may not be available to the system. (This could happen because the object does not exist, has become corrupted, or is locked by another process.)
  • Progressive data corruption: A bad state may have slowly evolved from a good state by small errors propagating over time. (Examples include timing loops that are slightly off, or rounding errors in complicated or reflexive calculations.)
  • Progressive destabilization: There may be a classic multi-stage failure. (The first part of the bug creates an unstable state, such as a wild pointer, when a certain event occurs, but without any visible or obvious failure. The second part precipitates a visible failure at a later time, based on the unstable state in combination with some other condition that occurs down the line. The lag time between the destabilizing event and the precipitating event makes it difficult to associate the two events with the same bug.)
  • Overflow: Some container may have filled to beyond its capacity, triggering a failure or an exception handler. (In an era of large memories and mass storage, overflow testing is often shortchanged. Even if the condition is properly handled, the process of handling it may interact with other functions of the system to cause an emergent intermittent problem.)
  • Occasional functions: Some functions of a system may be invoked so infrequently that we forget about them. (These include exception handlers, internal garbage collection functions, auto-save, and periodic maintenance functions. These functions, when invoked, may interact in unexpected ways with other functions or conditions of the system. Be especially wary of silent and automatic functions.)
  • Different mode or option setting: The system can be run in a variety of modes and the user may have set a different mode. (The new mode may not be obviously different from the old one.)
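Improper initialization is easy to demonstrate in miniature. In Python, a mutable default argument is created once and then shared across calls, so the starting state of one call depends on what earlier calls did; this toy example (the function names are invented) shows exactly the kind of state leak that surfaces intermittently:

```python
def add_item(item, batch=[]):     # BUG: the default list is created once and shared
    batch.append(item)
    return batch

first = add_item("a")
second = add_item("b")            # leftover state from the first call leaks in
print(second)                     # ['a', 'b'], not the ['b'] a reader might expect

def add_item_fixed(item, batch=None):
    if batch is None:             # initialize fresh state on every call
        batch = []
    batch.append(item)
    return batch
```

The buggy version behaves correctly whenever callers happen to pass their own list, which is why the failure appears only some of the time.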

Possibility 4: The system behaved differently because it was given different input.

  • Accidental input: The user may have provided input, or changed the input, in a way that shouldn’t have mattered, yet did. (This might also be called the Clever Hans syndrome, after the mysteriously repeatable ability of Clever Hans, the horse, to perform math problems. It was eventually discovered by Oskar Pfungst that the horse was responding to subtle physical cues that its owner was unintentionally conveying. In the computing world, I once experienced an intermittent problem due to sunlight coming through my office window and hitting an optical sensor in my mouse. The weather conditions outside shouldn’t have constituted different input, but they did. Another more common example is the different behavior that may occur when using the keyboard instead of the mouse to enter commands. The accidental input might be invisible unless you use special tools or recorders. For instance, two identical texts, one saved in RTF format from Microsoft Word and one saved in RTF format from Wordpad, will be very similar on the disk but not exactly identical.)
  • Secret boundaries and conditions: The software may behave differently in some parts of the input space than it does in others. (There may be hidden boundaries, or regions of failure, that aren’t documented or anticipated in your mental model of the product. I once tested a search routine that invoked different logic when the total returned hits were =1000 and = 50,000. Only by accident did I discover these undocumented boundaries.)
  • Different profile: Some users may have different profiles of use than other users. (Different biases in input will lead to different experiences of output. Users with certain backgrounds, such as programmers, may be systematically more or less likely to experience, or notice, certain behaviors.)
  • Ghost input: Some other machine-based source than the user may have provided different input. (Such input is often invisible to the user. This includes variations due to different files, different signals from peripherals, or different data coming over the network.)
  • Deus Ex Machina: A third party may be interacting with the product at the same time as the user. (This may be a fellow tester, a friendly user, or a malicious hacker.)
  • Compromised input: Input may have been corrupted or intercepted on its way into the system. (Especially a concern in client-server systems.)
  • Time as input: Intermittence over time may be due to time itself. (Time is the one thing that constantly changes, no matter what else you control. Whenever time and date, or time and date intervals, are used as input, bugs in that functionality may appear at some times but not others.)
  • Timing lottery: Variations in input that normally don’t matter may matter at certain times or at certain loads. (The Mars Rover suffered from a problem like this, involving a three-microsecond window of vulnerability during which a write operation could write to a protected part of memory.)
  • Combination lottery: Variations in input that normally don’t matter may matter when combined in a certain way.
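When time itself is an input, the investigation gets easier if you can control the clock. In this sketch `is_expired` is a hypothetical function, but the seam it shows (accepting `now` as an optional parameter) is a general technique for turning intermittent, time-dependent behavior into a deterministic test:

```python
import datetime

def is_expired(expiry, now=None):
    """Hypothetical check; accepting `now` makes time a controllable input."""
    if now is None:
        now = datetime.datetime.now()   # production path: the real clock
    return now > expiry

# With the clock injectable, suspicious instants (midnight, year boundaries,
# the exact expiry second) can be probed on demand instead of waited for:
expiry = datetime.datetime(2024, 12, 31, 23, 59, 59)
print(is_expired(expiry, now=datetime.datetime(2024, 12, 31, 23, 59, 58)))  # False
print(is_expired(expiry, now=datetime.datetime(2025, 1, 1, 0, 0, 0)))       # True
```

Without the seam, the only way to reach the failing instant is to be testing at that instant, which is the definition of an intermittent repro.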

Possibility 5: The other possibilities are magnified because your mental model of the system and what influences it is incorrect or incomplete in some important way.

  • You may not be aware of each variable that influences the system.
  • You may not be aware of sources of distortion in your observations.
  • You may not be aware of available tools that might help you understand or observe the system.
  • You may not be aware of all the boundaries of the system and all the characteristics of those boundaries.
  • The system may not actually have a function that you think it has; or maybe it has extra functions.
  • A complex algorithm may behave in a surprising way, intermittently, that is entirely correct (e.g. mathematical chaos can look like random behavior).
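The last point is easy to see in a few lines. The logistic map is a standard textbook example of deterministic chaos: there is no randomness anywhere, yet two starting points that differ by one part in a billion diverge until their trajectories look unrelated:

```python
def logistic_map(x, r=4.0):
    """One step of a fully deterministic recurrence: x -> r * x * (1 - x)."""
    return r * x * (1 - x)

a, b = 0.2, 0.2 + 1e-9      # nearly identical starting states
max_gap = 0.0
for _ in range(60):
    a, b = logistic_map(a), logistic_map(b)
    max_gap = max(max_gap, abs(a - b))

# The tiny initial difference is amplified exponentially; within a few dozen
# steps the two trajectories bear no visible relationship to each other.
print(max_gap)
```

A tester watching two runs of such a system could easily conclude it was behaving randomly, or intermittently, when it is in fact entirely correct and entirely deterministic.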
