Saturday, March 13, 2010

Memory Corruption Makes Me Sad

Nothing makes programmers cower in fear more than memory corruption. These bugs are almost always (A) fatal, (B) ridiculously difficult to track down, and (C) hard to reproduce consistently. The combination of these three things can make you start thinking of your memory in surprisingly literal ways:

(and literal in more ways than one, since your memory is "trashed" hahahaha, oh that was bad...)


This particular blog posting is about my own struggle with a bug that I knew about for no less than four months, and for the last two weeks I worked on it 24/7. I fixed it, but it was very difficult to isolate.

Memory corruption happens when "the contents of a memory location are unintentionally modified due to programming errors. When the corrupted memory contents are used later in the computer program, it leads either to program crash or to strange and bizarre program behavior." What makes memory corruption particularly insidious is that your program doesn't crash or behave strangely at the time of modification--it happens later, when something attempts to use the memory that was sullied.

Causes of memory corruption include use of uninitialized memory, use of unowned memory, buffer overflows, faulty heap memory management and multithreading problems. In all cases, the defining characteristic is that the place where the program crashes isn't necessarily related to the place where things went wrong.
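
To make that gap between "where it broke" and "where it blew up" concrete, here is a minimal sketch. The `Record` struct and both functions are hypothetical names made up for illustration. The overrun is deliberately kept inside a single struct so the demo stays well-defined; a real overflow would scribble on whatever happens to live next door.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical record: a fixed-size name followed by a length field.
struct Record {
    char name[8];
    std::uint32_t length;
};
static_assert(sizeof(Record) == 12, "sketch assumes no padding between the fields");

// Buggy on purpose: trusts the caller's byte count instead of sizeof(name).
// Copying through the whole-record pointer keeps the demo well-defined while
// still letting bytes meant for `name` spill over into `length`.
inline void setName(Record& r, const char* src, std::size_t n) {
    assert(n <= sizeof(Record));  // stay inside the record for this demo
    std::memcpy(&r, src, n);      // nothing crashes here
}

// Called much later by unrelated code; this is where the damage shows up.
inline bool lengthLooksSane(const Record& r) {
    return r.length <= sizeof(r.name);
}
```

Calling `setName` with 12 bytes of input returns happily. It is only when something later calls `lengthLooksSane` (or worse, trusts `length` without checking) that anybody notices, and by then the guilty function is long gone from the call stack.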

Here's the best strategy I've found for dealing with memory corruption:
  1. If you're really, really lucky, you might be able to catch the problem with something like DevPartner or Valgrind, but typically these only show the problem as it's blowing up. In the case of my bug, DevPartner ended up being beneficial only because it made my error more repeatable. Also, have a look at Application Verifier.
  2. Get a static code analysis tool. If you're using VS.NET 2008, try using their Code Analysis tool. It is surprisingly effective at locating buffer overruns and errant pointer access. These tools may or may not help you, but they certainly can't hurt. If you are lucky, this may be all you need to find your problem. If it doesn't work, then you get to go on to the brute force methods (lucky you!)
  3. Go through all classes and make sure every member variable is initialized correctly. Value types should be set to sensible values, pointer types should be initialized to NULL.
  4. Search for all new/delete/malloc/free statements.  Ensure that all pointer values that are allocated begin their life as NULL. Ensure that the memory being released is immediately NULL'd. Ensure you have no new/free or malloc/delete mismatches. Ensure you do not free/delete memory twice. Ensure that any memory allocated by a third party library is also destroyed by that library (i.e. do not call "CrazyLibraryMemAlloc" and use free/delete to clean up unless you are positive that is the correct thing to do). Make sure your destructors and cleanup methods release all memory and NULL everything. Make sure you're using delete[] if the type was allocated with new []. In essence, everything should begin its life as NULL and end its life as NULL. This is probably the single best thing you can do to isolate memory tom-foolery.
  5. Review every memset, memcpy and mem-whatever in the program (ditto for Win32 variants like CopyMemory). If you are using raw buffer pointers (e.g. void*, int*, etc.), consider wrapping them in something like QByteArray. Review any and all string handling code (in particular, strcpy and the like). If you have any raw pointer string types, consider replacing them with a decent string class like CString or Qt's QString.
  6. Are you using threads? Does the crash happen in a shared object? If so, this strongly implicates your locking strategy (or lack thereof--even if you have locks, be absolutely certain they are working as you expect).
  7. Determine if you are experiencing corruption in the same location, or if it's more random. If it's random corruption, then it is more likely to be a buffer overflow. If it's localized corruption (i.e. let's say the crash always happens in a shared queue, or in the same place in code), then it is more likely that code touching the shared item is invalid. If it crashes in the exact same place always, then you are in luck--you should be able to watch that location in a debugger and break on any read/write. Whether or not you have crashes in the same place is a huge, huge clue about your problem. Track this information religiously.
  8. One method for determining if you have local/random corruption is to declare "no man's land" buffers directly above and below the item being corrupted. Like, nice, big 10k buffers that are initialized to "0xDEADBEEFDEADBEEFDEADBEEF..." When your program crashes, inspect those buffers. If those buffers contain invalid data, then it is not localized corruption. If they aren't corrupted, but the data structure they wrap is, then that strongly implies something that touches the sandwiched object is where the problem may lie.
  9. It is not likely to be the C-runtime, third party code, obscure linking issues, etc. Think about it: you're not the only person using these libraries and tools. They are generally more thoroughly vetted because of Linus' law. Is it possible? Yeah, sure, anything's possible. But is it likely? Not really.
  10. Unless evidence strongly implies otherwise, assume the issue is in your code. This is good, because it means it's something you can potentially fix. Otherwise, you may have to start a support case with whoever owns the code. If it's an open source project, you might get a quick response (or possibly no response at all...). If it's somebody like MSFT, it is going to take weeks at a minimum. Only as a last resort should you assume it's somewhere else, and be certain you have Real Information™ to back up your theory.
It may take a couple of people a day or five to go through the program and make all these changes depending on how big the application is, but it's generally the only real way to isolate problems. And it also gets you in the habit of being religiously fanatical about default values, pointer checking and correct deletion of objects, which is good.
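
The "no man's land" buffers from item 8 can be sketched roughly like this. All of the names here (`Guard`, `SuspectObject`, `Sandwich`, `inspect`, `checkpoint`) are hypothetical, and `SuspectObject` is just a stand-in for whatever keeps getting trashed in your program. I put the guards and the suspect in one struct because separate globals are not guaranteed to end up next to each other in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kGuardPattern = 0xDEADBEEF;
constexpr std::size_t kGuardWords = 2560;  // 2560 * 4 bytes = 10 KB per guard

// A block of memory filled with a recognizable pattern.
struct Guard {
    std::uint32_t words[kGuardWords];
    Guard() { for (std::uint32_t& w : words) w = kGuardPattern; }
    bool intact() const {
        for (std::uint32_t w : words) {
            if (w != kGuardPattern) return false;
        }
        return true;
    }
};

// Hypothetical stand-in for the object that keeps getting corrupted.
struct SuspectObject {
    int head = 0;
    int tail = 0;
};

// Members are laid out in declaration order, so the guards sit directly
// before and after the suspect.
struct Sandwich {
    Guard before;
    SuspectObject suspect;
    Guard after;
};

enum class Verdict {
    GuardsIntact,   // if the suspect is still corrupt, look at code that touches it
    GuardsTrashed   // a stray write ran through here: not localized corruption
};

inline Verdict inspect(const Sandwich& s) {
    return (s.before.intact() && s.after.intact()) ? Verdict::GuardsIntact
                                                   : Verdict::GuardsTrashed;
}

// Optional: sprinkle this through the program to notice a trashed guard
// closer to the moment it happens, instead of waiting for the crash.
inline void checkpoint(const Sandwich& s) {
    assert(inspect(s) == Verdict::GuardsIntact);
}
```

At 20 KB or so, a `Sandwich` is better off as a static or heap object than on a small thread stack. After a crash (or at any checkpoint), `inspect` tells you which of the two cases from item 8 you are in.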

For me, the issue ended up being extremely subtle (it evaded two separate code reviews by my peers), and after finding it, painfully obvious and somewhat embarrassing. I was having localized corruption around a shared queue that two threads accessed. The culprit was invalid locking code I'd written. Once the queue became corrupted, it would fail somewhere in the bowels of whatever queue object I'd been using (I tried CAtlList, QList, etc, but it didn't matter because none of them are thread-safe).
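
For what it's worth, here is a minimal sketch of the kind of wrapper that keeps this class of bug out: the queue and its lock live in the same object, so there is no way to touch one without the other. This is not the code from my project. `LockedQueue` and `usageExample` are hypothetical names, and it assumes a C++11 compiler for `std::mutex` and `std::lock_guard`.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>

// Every access to the underlying queue goes through the same mutex.
template <typename T>
class LockedQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push(std::move(value));
    }

    // Checks for emptiness and removes the front item under ONE lock, so
    // another thread cannot empty the queue between the check and the pop.
    bool tryPop(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;  // mutable so size() can lock in a const method
    std::queue<T> items_;
};

// Small single-threaded usage example.
inline void usageExample() {
    LockedQueue<int> q;
    q.push(1);
    q.push(2);
    int value = 0;
    bool ok = q.tryPop(value);
    assert(ok && value == 1);  // items come out in FIFO order
    (void)ok;
}
```

In real use, the producer thread would call `push` and the consumer thread would call `tryPop` on the same `LockedQueue`. The point of the design is that callers never see the raw container, so they cannot forget the lock or take it in the wrong place.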

Which brings me back to item #10 in the above list: it's always your fault. It was my fault. It can be very tempting to assume otherwise, but generally I don't find this to be the case. So keep an open mind, think analytically, write down what you know and what you don't know, and you'll be done with the bug sooner than you know it!
