Minimizing Unreproducible Bugs
Monday, February 03, 2014
by Anthony Vallone
Unreproducible bugs are the bane of my existence. Far too often, I find a bug, report it, and hear back that it’s not a bug because it can’t be reproduced. Of course, the bug is still there, waiting to prey on its next victim. These types of bugs can be very expensive due to increased investigation time and overall lifetime. They can also have a damaging effect on product perception when users reporting these bugs are effectively ignored. We should be doing more to prevent them. In this article, I’ll go over some obvious, and maybe not so obvious, development/testing guidelines that can reduce the likelihood of these bugs occurring.
Avoid and test for race conditions, deadlocks, timing issues, memory corruption, uninitialized memory access, memory leaks, and resource issues
I am lumping together many bug types in this section, but they are all somewhat related by how we test for them and by how disproportionately hard they are to reproduce and debug. The root cause and effect can be separated by milliseconds or hours, and stack traces might be nonexistent or misleading. A system may fail in strange ways when exposed to unusual traffic spikes or insufficient resources. Race conditions and deadlocks may only be discovered during unique traffic patterns or resource configurations. Timing issues may only be noticed when many components are integrated and their performance parameters and failure/retry/timeout delays create a chaotic system. Memory corruption or uninitialized memory access may go unnoticed for a large percentage of calls but become fatal for rare states. Memory leaks may be negligible unless the system is exposed to load for an extended period of time.
Guidelines for development:
- Simplify your synchronization logic. If it’s too hard to understand, it will be difficult to reproduce and debug complex concurrency problems.
- Always obtain locks in the same order. This is a tried-and-true guideline to avoid deadlocks, but I still see code that breaks it periodically. Define an order for obtaining multiple locks and never change that order (see the sketch after this list).
- Don’t optimize by creating many fine-grained locks, unless you have verified that they are needed. Extra locks increase concurrency complexity.
- Avoid shared memory, unless you truly need it. Shared memory access is very easy to get wrong, and the bugs may be quite difficult to reproduce.
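For example, here is a minimal sketch of the lock-ordering guideline. The Account class and its members are hypothetical; the point is simply that every method acquires account_mutex_ before audit_mutex_, never the reverse:

#include <mutex>

class Account {
 public:
  void Deposit(double amount) {
    std::lock_guard<std::mutex> account_lock(account_mutex_);  // lock #1
    std::lock_guard<std::mutex> audit_lock(audit_mutex_);      // lock #2
    balance_ += amount;
    ++audit_entries_;
  }

  void Withdraw(double amount) {
    // Same order as Deposit: account_mutex_ first, then audit_mutex_.
    std::lock_guard<std::mutex> account_lock(account_mutex_);
    std::lock_guard<std::mutex> audit_lock(audit_mutex_);
    balance_ -= amount;
    ++audit_entries_;
  }

 private:
  std::mutex account_mutex_;  // Documented lock order: this one first,
  std::mutex audit_mutex_;    // then this one.
  double balance_ = 0.0;
  int audit_entries_ = 0;
};

If the order ever needs to change, change it everywhere at once and document it next to the mutex declarations.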
Guidelines for testing:
- Stress test your system regularly. You don't want to be surprised by unexpected failures when your system is under heavy load.
- Test timeouts. Create tests that mock/fake dependencies to test timeout code. If your timeout code does something bad, it may cause a bug that only occurs under certain system conditions (a sketch of such a test follows this list).
- Test with debug and optimized builds. You may find that a well-behaved debug build works fine, but the system fails in strange ways once optimized.
- Test under constrained resources. Try reducing the number of data centers, machines, processes, threads, available disk space, or available memory. Also try simulating reduced network bandwidth.
- Test for longevity. Some bugs require a long period of time to reveal themselves. For example, persistent data may become corrupt over time.
- Use dynamic analysis tools like memory debuggers, ASan, TSan, and MSan regularly. They can help identify many categories of unreproducible memory/threading issues.
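To illustrate the timeout guideline above, here is a minimal sketch of a test that fakes a timed-out dependency so the fallback path runs deterministically. QuoteService, FetchQuoteOrDefault, and TimingOutQuoteService are hypothetical names, and a real suite would use your test framework rather than a bare assert:

#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical dependency interface.
class QuoteService {
 public:
  virtual ~QuoteService() {}
  virtual std::string GetQuote() = 0;  // may throw on timeout
};

// Code under test: it must degrade gracefully when the dependency times out.
std::string FetchQuoteOrDefault(QuoteService& service) {
  try {
    return service.GetQuote();
  } catch (const std::runtime_error&) {
    return "quote unavailable";  // the timeout/fallback path we want to exercise
  }
}

// Fake that simulates a timed-out dependency, so the test hits the fallback
// path immediately instead of waiting for a real timeout.
class TimingOutQuoteService : public QuoteService {
 public:
  std::string GetQuote() override {
    throw std::runtime_error("deadline exceeded");
  }
};

int main() {
  TimingOutQuoteService fake;
  assert(FetchQuoteOrDefault(fake) == "quote unavailable");
  return 0;
}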
Enforce preconditions
I’ve seen many well-meaning functions with a high tolerance for bad input. For example, consider this function:
void ScheduleEvent(int timeDurationMilliseconds) {
  if (timeDurationMilliseconds <= 0) {
    timeDurationMilliseconds = 1;
  }
  ...
}
This function is trying to help the calling code by adjusting the input to an acceptable value, but it may be doing damage by masking a bug. The calling code may be experiencing any number of problems described in this article, and passing garbage to this function will always work fine. The more functions that are written with this level of tolerance, the harder it is to trace back to the root cause, and the more likely it becomes that the end user will see garbage. Enforcing preconditions, for instance by using asserts, may actually cause a higher number of failures for new systems, but as systems mature, and many minor/major problems are identified early on, these checks can help improve long-term reliability.
Guidelines for development:
- Enforce preconditions in your functions unless you have a good reason not to.
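For instance, the ScheduleEvent function above could fail loudly on bad input instead of silently repairing it. This is only a minimal sketch; a real codebase might prefer a CHECK-style macro or an exception over a plain assert:

#include <cassert>

void ScheduleEvent(int timeDurationMilliseconds) {
  // Enforce the precondition instead of masking the caller's bug.
  assert(timeDurationMilliseconds > 0 && "duration must be positive");
  // ... schedule the event ...
}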
Use defensive programming
Defensive programming is another tried-and-true technique that is great at minimizing unreproducible bugs. If your code calls a dependency to do something, and that dependency quietly fails or returns garbage, how does your code handle it? You could test for situations like this via mocking or faking, but it’s even better to have your production code do sanity checking on its dependencies. For example:
double GetMonthlyLoanPayment() {
  double rate = GetTodaysInterestRateFromExternalSystem();
  if (rate < 0.001 || rate > 0.5) {
    throw BadInterestRate(rate);
  }
  ...
}
Guidelines for development:
- When possible, use defensive programming to verify the work of dependencies with known failure risks, such as user-provided data, I/O operations, and RPC calls.
Guidelines for testing:
- Use fuzz testing to test your system's hardiness against bad data.
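As a minimal sketch, here is what a fuzz target might look like using libFuzzer (one option among several; AFL and other fuzzers follow a similar pattern). ParseInterestRateFeed is a hypothetical parser of untrusted input, stubbed out here for illustration:

#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical parser of untrusted input, stubbed out for illustration.
bool ParseInterestRateFeed(const std::string& payload) {
  return !payload.empty() && payload[0] == '{';
}

// libFuzzer entry point. The fuzzer calls this repeatedly with mutated
// inputs; combined with sanitizers or assertions, silent corruption becomes
// a visible, reportable crash.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  ParseInterestRateFeed(std::string(reinterpret_cast<const char*>(data), size));
  return 0;  // non-zero return values are reserved
}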
Don’t hide all errors from the user
There has been a trend in recent years toward hiding failures from users at all costs. In many cases, it makes perfect sense, but in some, we have gone overboard. Code that is very quiet and permissive during minor failures will allow an uninformed user to continue working in a failed state. The software may ultimately reach a fatal tipping point, and all the error conditions that led to failure have been ignored. If the user doesn’t know about the prior errors, they will not be able to report them, and you may not be able to reproduce them.
Guidelines for development:
- Only hide errors from the user when you are certain that there is no impact to system state or the user.
- Any error with impact to the user should be reported to the user with instructions for how to proceed. The information shown to the user, combined with data available to an engineer, should be enough to determine what went wrong.
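As a rough sketch of that reporting guideline (SaveDocument and the messages are hypothetical), the failure is logged with detail for an engineer and surfaced to the user with a concrete next step, rather than silently swallowed:

#include <iostream>
#include <string>

// Hypothetical save routine; stubbed to fail so the example is self-contained.
bool SaveDocument(const std::string& path) { return false; }

void SaveWithUserFeedback(const std::string& path) {
  if (!SaveDocument(path)) {
    // Engineer-facing log with the details needed to investigate.
    std::cerr << "ERROR: failed to save document to " << path << std::endl;
    // User-facing message with instructions for how to proceed.
    std::cout << "Your document could not be saved. Free up disk space or "
                 "choose another location, then try again." << std::endl;
  }
}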
Test error handling
The sections of code most commonly left untested are error handlers. Don’t skip test coverage here. Bad error handling code can cause unreproducible bugs and creates great risk if it does not handle fatal errors well.
Guidelines for testing:
- Always test your error handling code. This is usually best accomplished by mocking or faking the component triggering the error (a sketch follows this list).
- It’s also a good practice to examine your log quality for all types of error handling.
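Here is a minimal sketch of testing an error path by faking the failing component. Storage, SaveRecord, and FailingStorage are hypothetical names, and a real suite would use your test framework instead of a bare assert:

#include <cassert>
#include <stdexcept>

// Hypothetical storage dependency.
class Storage {
 public:
  virtual ~Storage() {}
  virtual void Write(int record) = 0;
};

enum class WriteResult { kOk, kStorageError };

// Code under test: its error handling should report failure, not crash.
WriteResult SaveRecord(Storage& storage, int record) {
  try {
    storage.Write(record);
    return WriteResult::kOk;
  } catch (const std::exception&) {
    // The error handling path that often goes untested.
    return WriteResult::kStorageError;
  }
}

// Fake that forces the error path.
class FailingStorage : public Storage {
 public:
  void Write(int) override { throw std::runtime_error("disk full"); }
};

int main() {
  FailingStorage failing;
  assert(SaveRecord(failing, 42) == WriteResult::kStorageError);
  return 0;
}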
Check for duplicate keys
If unique identifiers or data access keys are generated using random data or are not guaranteed to be globally unique, duplicate keys may cause data corruption or concurrency issues. Key duplication bugs are very difficult to reproduce.
Guidelines for development:
- Try to guarantee uniqueness of all keys.
- When not possible to guarantee unique keys, check if the recently generated key is already in use before using it.
- Watch out for potential race conditions here and avoid them with synchronization.
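A minimal sketch of the check-before-use guideline, assuming keys are random 64-bit integers and a single process owns the key space (a distributed system would need a different mechanism). The check and the reservation happen under one lock, so two threads cannot claim the same key:

#include <cstdint>
#include <mutex>
#include <random>
#include <unordered_set>

class KeyAllocator {
 public:
  uint64_t AllocateKey() {
    std::lock_guard<std::mutex> lock(mutex_);
    while (true) {
      uint64_t candidate = dist_(rng_);
      if (used_keys_.insert(candidate).second) {  // true only if the key was new
        return candidate;
      }
      // Collision: loop and generate another candidate.
    }
  }

 private:
  std::mutex mutex_;
  std::unordered_set<uint64_t> used_keys_;
  std::mt19937_64 rng_{std::random_device{}()};
  std::uniform_int_distribution<uint64_t> dist_;
};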
Test for concurrent data access
Some bugs only reveal themselves when multiple clients are reading/writing the same data. Your stress tests might be covering cases like these, but if they are not, you should have special tests for concurrent data access. Cases like these are often unreproducible. For example, a user may have two instances of your app running against the same account, and they may not realize this when reporting a bug.
Guidelines for testing:
- Always test for concurrent data access if it’s a feature of the system. Actually, even if it’s not a feature, verify that the system rejects it. Testing concurrency can be challenging. An approach that usually works for me is to create many worker threads that simultaneously attempt access and a master thread that monitors and verifies that some number of attempts were indeed concurrent, blocked or allowed as expected, and all were successful. Programmatic post-analysis of all attempts and changing system state may also be necessary to ensure that the system behaved well.
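Here is a rough sketch of that worker/master approach, with a mutex-protected counter standing in for the real shared data. The names and thresholds are illustrative only:

#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

int shared_value = 0;          // stand-in for the shared data under test
std::mutex value_mutex;

std::atomic<int> in_flight{0};
std::atomic<int> max_observed_concurrency{0};

void Worker(int iterations) {
  for (int i = 0; i < iterations; ++i) {
    int now = ++in_flight;  // track how many workers are active right now
    int seen = max_observed_concurrency.load();
    while (now > seen && !max_observed_concurrency.compare_exchange_weak(seen, now)) {}
    {
      std::lock_guard<std::mutex> lock(value_mutex);
      ++shared_value;  // the concurrent access under test
    }
    --in_flight;
  }
}

int main() {
  const int kThreads = 8, kIterations = 10000;
  std::vector<std::thread> workers;
  for (int i = 0; i < kThreads; ++i) workers.emplace_back(Worker, kIterations);
  for (auto& t : workers) t.join();

  // Post-analysis: every write landed, and at least some attempts overlapped.
  assert(shared_value == kThreads * kIterations);
  assert(max_observed_concurrency.load() > 1);  // may be flaky on a single-core machine
  return 0;
}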
Steer clear of undefined behavior and non-deterministic access to data
Some APIs and basic operations have warnings about undefined behavior when in certain states or provided with certain input. Similarly, some data structures do not guarantee an iteration order (example: Java’s Set). Code that ignores these warnings may work fine most of the time but fail in unusual ways that are hard to reproduce.
Guidelines for development:
- Understand when the APIs and operations you use might have undefined behavior and prevent those conditions.
- Do not depend on data structure iteration order unless it is guaranteed. It is a common mistake to depend on the ordering of sets or associative arrays.
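A small C++ sketch of the same pitfall: iteration order of an unordered container is unspecified and can differ between runs, builds, or library versions, so code (or a test's golden output) that depends on it should make the order explicit:

#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Order of the result may vary: do not depend on it.
std::vector<std::string> NamesInUnspecifiedOrder(const std::unordered_set<std::string>& names) {
  return std::vector<std::string>(names.begin(), names.end());
}

// If downstream code needs a stable order, make it explicit instead.
std::vector<std::string> NamesInStableOrder(const std::unordered_set<std::string>& names) {
  std::vector<std::string> sorted(names.begin(), names.end());
  std::sort(sorted.begin(), sorted.end());
  return sorted;
}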
Log the details for errors or test failures
Issues described in this article can be easier to reproduce and debug when the logs contain enough detail to understand the conditions that led to an error.
Guidelines for development:
- Follow good logging practices, especially in your error handling code (a brief sketch follows this list).
- If logs are stored on a user’s machine, create an easy way for them to provide you the logs.
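As a tiny sketch (the names are hypothetical), an error handler for the interest-rate example above might log the inputs and identifiers needed to reconstruct the failure later, not just the fact that it happened:

#include <iostream>
#include <string>

void LogBadInterestRate(double rate, const std::string& source,
                        const std::string& request_id) {
  // Include the values and context that led to the error, so the conditions
  // can be understood even if the failure cannot be reproduced on demand.
  std::cerr << "BadInterestRate: rate=" << rate
            << " source=" << source
            << " request_id=" << request_id << std::endl;
}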
Guidelines for testing:
- Save your test logs for potential analysis later.
Anything to add?
Have I missed any important guidelines for minimizing these bugs? What is your favorite hard-to-reproduce bug that you discovered and resolved?
Simply stated and enlightening. I am certainly going to incorporate these as much as possible. Thanks for sharing your thoughts.
I wrote code that enabled me to read SQL 2000 deadlocks from the log by converting it into text that showed the tables and indexes involved and the process that was running. Certainly was a bug we couldn't reproduce but with the log we were able to figure out the odd locking order that was causing it with a read and an update.
The update locked the table page and then tried to lock the index. The read had tried to use the index to read the table. The update couldn't lock the index, the read couldn't look at the data page.
Altered the read to retrieve just the clustered PK field values. (A covering index read) Then read the table using the PK values so it is delayed by the update, not deadlocked.
PS Your sign-in procedure exposed a bug. I wrote a comment, did not have a Google account, so signed up for it. Google sign-up procedure was successful, but then blew up when I tried to return to submit my comment. Had to rewrite the comment after going back in web history to your article.
So, did I just find an unreproducible bug in your blog?
Hi Ken, Thanks for letting us know. Actually, this might belong in the reproducible category :) I reported this bug to the blogger team.
Most "unreproducible bugs" are caused by using the execrable "C" language, with its inbuilt design flaws.
Anyone programming in "C" (for a safety critical application) should be indicted for crimes against humanity.
The total cost to the world for avoidable bugs directly as a result of employing "C" code or libraries possibly exceeds several billions of dollars.
C++ is a willing accessory to these crimes against humanity.
Good one Ken :D
Well written Anthony - thought provoking... I have been in testing for more than 6 years now and it happens all the time when a reported bug doesn't reproduce, especially when a developer updates you about it...
This is a great list of guidelines to follow... thanks for sharing...
Regards,
Aman
Great article, thanks for sharing it!
Interesting. We recently conducted an empirical study to characterize non-reproducible bugs. If you're interested, the paper is available here: http://salt.ece.ubc.ca/publications/docs/mona-msr14.pdf
I'm not going to agree with Kingsford Gray on the C language bias standpoint at all. That's not the topic. But it is the paradigm! Some parts of a tester's job do involve understanding the specific vulnerabilities the software under test has, based not only on historical knowledge of the kind of bugs your dev team injects, but also on the kind of bugs the software production process itself injects. As well as the compiler-toolchain and finally the language chosen to implement the software. So back to taking out locks and synchronization, when it comes to test-code I'm going to disagree with the author Mr Vallone.
Test code and instrumentation code introduce synchronisation windows in the software being tested a lot of the time. Normally the "window" is quite wide: when turning on logging, for example, all threads end up having to access one log file. The tester who knows the exact effect this has on the system, as well as all the other synchronization points in the software, will be able to exploit them to find timing bugs.
And on this point I am your keen follower, Anthony: being able to test timeouts properly, as if they are simply another axis of your test-driven approach or simply another table-driven test case, can uncover lots of bugs the devs might take years to find. Great guide, packed with nuggets :)