且构网


Inverse Heisenbug - unit test fails only when a debugger is attached

Updated: 2022-10-15 16:59:41


I recently fixed a defect in our product, the symptom of which was an access violation caused by accessing a dangling pointer.

As good practice, I added a unit test to ensure that the bug doesn't come back. Whenever I write such a test, I back out the defect fix and check that the test fails; otherwise I can't be sure it is doing its job.

After backing out the defect fix, I discovered that my unit test still passed (not good). When I attached a debugger to the unit test to see why it passed, the test failed (i.e. an exception was thrown), and I could break and observe that the call stack matched the one from the original defect.

I didn't modify the "Break on exception" settings in Visual Studio 2005, and this is indeed a critical Win32 exception which causes the test harness to terminate (i.e. there is no graceful exception handler).

The text of the exception is:

Unhandled exception at 0x0040fc59 in _testcase.exe: 0xC0000005:
Access violation reading location 0xcdcdcdcd.

Note: The location isn't always 0xcdcdcdcd (the pattern the Microsoft debug C runtime uses to fill newly allocated, uninitialised heap memory). Sometimes it is 0x00000000, and sometimes it is another address.

This seems like the inverse of a traditional Heisenbug, where a problem goes away when observing it via a debugger. In my case, observing it via the debugger makes the problem appear!

My initial thought was that this was a race condition exposed by the timing differences in the debugger. However, when I added tracing to the code and ran it outside the debugger, the values it printed suggested that the application should abort in the same way as it does under the debugger. But it does not!

Any suggestions as to what could be causing this?


Update: I am narrowing in on the cause of this problem. See this question for more details. Will update this question with the answer if I find it.

I have isolated the cause of this problem - see this question for details.

When running my test harness under the debugger, the memory consumed by the debugging environment meant that successive allocations of the same object, each following a deallocation, always landed in different parts of memory. As a result, when my test harness accessed a dangling pointer, the test crashed (technically this is undefined behaviour, but this is test code and it seems to do what I need it to do).

When running my test harness from the command line, successive allocations of the same object always re-used the same block of memory. This coincidental behaviour meant that when my test case accessed what was actually a dangling pointer, that pointer happened to still point to a valid object. That's why I didn't see a crash.
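
The address re-use described above can be observed without touching a dangling pointer. The following is only a minimal sketch, not the code from the product or test harness: `Widget`, `AllocationProbe` and `probe_reuse` are hypothetical names invented for illustration. It records the address of an object as an integer before deleting it, allocates a second object of the same type, and compares the two addresses.

```cpp
#include <cstdint>

// Hypothetical stand-in for the object under test (not from the original code).
struct Widget {
    int value;
    explicit Widget(int v) : value(v) {}
};

// Addresses of two allocations made one after the other.
struct AllocationProbe {
    std::uintptr_t first;
    std::uintptr_t second;

    // True when the allocator handed back the same block for the second object.
    bool reused() const { return first == second; }
};

// Allocates a Widget, frees it, then allocates another one.
// Each address is converted to an integer while its pointer is still valid,
// so no dangling pointer is ever read or dereferenced here.
AllocationProbe probe_reuse() {
    AllocationProbe probe;

    Widget* a = new Widget(1);
    probe.first = reinterpret_cast<std::uintptr_t>(a);
    delete a;

    Widget* b = new Widget(2);
    probe.second = reinterpret_cast<std::uintptr_t>(b);
    delete b;

    return probe;
}
```

Calling `probe_reuse().reused()` from a small `main`, once from the command line and once under the debugger, would show whether the two environments place the second object differently. The result depends entirely on the allocator and the environment, so neither outcome is guaranteed; that is exactly why a test that relies on a dangling access crashing can pass in one environment and fail in another.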