A debugging war story

The bug has been around for most of the last year.  It’s intermittent and a pain to replicate.  We hadn’t heard anything about it from our customers, but our test system threw the errors up when we performed our nightly runs.  But the failures were all slightly different.  And different tests were failing each time.

All we knew were we were blue-screening, and that the problem was something to do with memory corruption.

The first thing to do was look at the crash dumps.  And the good news was that our drivers weren’t on the stack.  The bad news was that more and more testing showed the problem only occurred when our drivers were loaded.  You see, the problem was clearly a memory corruption – but the crash wasn’t happening at the time memory was being corrupted – no, the crash was happening when the memory got freed. You couldn’t tell who corrupted the memory, just that it got corrupted.

We tried driver verifier.  As soon as we put verifier to work on the driver we suspected, all the problems went away.

We did find that close to where the corruptions occured there was often a memory pool allocated with the tag ‘HAL’ – what was interesting about this pool, which looked like being some sort of mapping between addresses and page frame numbers, was that it seemed to have one entry too many – it had overflowed the space allocated for it.  The good news is it wasn’t one of our pools.  The bad news – I was beginning to suspect it was something like a double free of memory which caused this situation to arise.

Because we thought our driver might be causing this, we added all the instrumentation in the world to its memory allocations and frees.  But this didn’t show anything up.  The driver seemed to be working perfectly.

We were close to giving up.

One of our test engineers went through the test logs, and came up with a set of situations most likely to cause the problem.  With a bit of effort he made a reproduction of the bug that could happen in about an hour – much better than the six hour repro we had earlier.  One of the things he found was that the issue mainly happened on Windows 2008 Sp2 32 bit.

We then wen’t through ruling out any number of potential hypotheses.  Everythign from ‘Was a DVD in the drive at the time?’ to ‘Does it only happen at on machines with 2 CPUs’.  Once we had ruled out the impossible, whatever reamined, however unlikely was sure to be the cause.

Unfortunately, we ended up with the same suspicious driver.  And the same lack of clues.

Not knowing where else to look, I tried reproducing the reproduction on a checked build of 2008 sp2.  I didn’t hold out much hope.  We frequently use checked builds in developing our code, and this issue looked like being timing specific – the checked build was going to play havoc with the timing.

I installed the drivers, rebooted, and:

Assertion failed: RaidIsRegionInitialized

OK.  great.  What now?  Google was our friend.  Well, almost.  We found two results.  One was an MSDN page which didn’t mention anything about this.  The other wasn’t clear but had a few lines of hope

You may need to call GetUncachedExtension, even if you’re not going to use it. IIRC, on Win7 Storport would allocate the DMA adapter object during the GetUncachedExtension context. Your adapter likely doesn’t have any DMA restrictions, so Storport probably doesn’t really need the DMA adapter object, which is why everything works without the call.”

http://lists.openfabrics.org/pipermail/nvmewin/2012-March/000075.html

And, as it turned out, we did.  We did need to call GetUncachedExtension, even though there was no reason for us to do so.

One line fixed our storport driver, removed the bugs, fixed everything.

A year of irritating, intermittent, bluescreens gone.  And a good reason to help us understand what, roughly, was happening:  Microsoft Windows was freeing memory which we had never asked it to allocate.  More or less a double free.

Its astounding how often my job ultimately comes down to being a Google Monkey.  But there was a lot of work to lead us to Google.  And some bad luck too – we used checked builds a lot, but – it turns out – not the 2k8 checked build, which was the one that had the assertion.  We only used that this time because 2k8 was part of the repro we found.

But figuring this out is something that our team (and it was absolutely a team effort) can be proud of.

Today is a good day to code.

You must be logged in to post a comment.

© Ben.Cha.lmers.co.uk
CyberChimps