Debugging Memory Corruption: Who the hell writes "2" into my stack?!

Hi, my name is Tautvydas and I'm a software developer at Unity working in the Windows team. I'd like to share a story of debugging an elusive memory corruption bug.

Several weeks ago we received a bug report from a customer that said their game was crashing when using IL2CPP scripting backend. QA verified the bug and assigned it to me for fixing. The project was quite big (although far from the largest ones); it took 40 minutes to build on my machine. The instructions on the bug report said: “Play the game for 5-10 minutes until it crashes”. Sure enough, after following instructions, I observed a crash. I fired up WinDbg ready to nail it down. Unfortunately, the stack trace was bogus:


0:049> k
 # Child-SP          RetAddr           Call Site
00 00000022`e25feb10 00000000`00000010 0x00007ffa`00000102

0:050> u 0x00007ffa`00000102 L10
00007ffa`00000102 ??              ???
        ^ Memory access error in 'u 0x00007ffa`00000102 l10'

Clearly, it tried executing an invalid memory address. Although the stack trace had been corrupted, I was hoping that only part of the whole stack got corrupted, and that I would be able to reconstruct it by looking at the memory contents past the stack pointer register. Sure enough, that gave me an idea of where to look next:



...
00000022`e25febd8  00007ffa`b1fdc65c ucrtbased!heap_alloc_dbg+0x1c [d:\th\minkernel\crts\ucrt\src\appcrt\heap\debug_heap.cpp @ 447]
00000022`e25febe0  00000000`00000004
00000022`e25febe8  00000022`00000001
00000022`e25febf0  00000022`00000000
00000022`e25febf8  00000000`00000000
00000022`e25fec00  00000022`e25fec30
00000022`e25fec08  00007ffa`99b3d3ab UnityPlayer!std::_Vector_alloc<std::_Vec_base_types<il2cpp::os::PollRequest,std::allocator<il2cpp::os::PollRequest> > >::_Get_data+0x2b [c:\program files (x86)\microsoft visual studio 14.0\vc\include\vector @ 642]
00000022`e25fec10  00000022`e25ff458
00000022`e25fec18  cccccccc`cccccccc
00000022`e25fec20  cccccccc`cccccccc
00000022`e25fec28  00007ffa`b1fdf54c ucrtbased!_calloc_dbg+0x6c [d:\th\minkernel\crts\ucrt\src\appcrt\heap\debug_heap.cpp @ 511]
00000022`e25fec30  00000000`00000010
00000022`e25fec38  00007ffa`00000001
...
00000022`e25fec58  00000000`00000010
00000022`e25fec60  00000022`e25feca0
00000022`e25fec68  00007ffa`b1fdb69e ucrtbased!calloc+0x2e [d:\th\minkernel\crts\ucrt\src\appcrt\heap\calloc.cpp @ 25]
00000022`e25fec70  00000000`00000001
00000022`e25fec78  00000000`00000010
00000022`e25fec80  cccccccc`00000001
00000022`e25fec88  00000000`00000000
00000022`e25fec90  00000022`00000000
00000022`e25fec98  cccccccc`cccccccc
00000022`e25feca0  00000022`e25ff3f0
00000022`e25feca8  00007ffa`99b3b646 UnityPlayer!il2cpp::os::SocketImpl::Poll+0x66 [c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\os\win32\socketimpl.cpp @ 1429]
00000022`e25fecb0  00000000`00000001
00000022`e25fecb8  00000000`00000010
...
00000022`e25ff3f0  00000022`e25ff420
00000022`e25ff3f8  00007ffa`99c1caf4 UnityPlayer!il2cpp::os::Socket::Poll+0x44 [c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\os\socket.cpp @ 324]
00000022`e25ff400  00000022`e25ff458
00000022`e25ff408  cccccccc`ffffffff
00000022`e25ff410  00000022`e25ff5b4
00000022`e25ff418  00000022`e25ff594
00000022`e25ff420  00000022`e25ff7e0
00000022`e25ff428  00007ffa`99b585f8 UnityPlayer!il2cpp::vm::SocketPollingThread::RunLoop+0x268 [c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\vm\threadpool.cpp @ 452]
00000022`e25ff430  00000022`e25ff458
00000022`e25ff438  00000000`ffffffff
...
00000022`e25ff7d8  00000022`e25ff6b8
00000022`e25ff7e0  00000022`e25ff870
00000022`e25ff7e8  00007ffa`99b58d2c UnityPlayer!il2cpp::vm::SocketPollingThreadEntryPoint+0xec [c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\vm\threadpool.cpp @ 524]
00000022`e25ff7f0  00007ffa`9da83610 UnityPlayer!il2cpp::vm::g_SocketPollingThread
00000022`e25ff7f8  00007ffa`99b57700 UnityPlayer!il2cpp::vm::FreeThreadHandle [c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\vm\threadpool.cpp @ 488]
00000022`e25ff800  00000000`0000106c
00000022`e25ff808  cccccccc`cccccccc
00000022`e25ff810  00007ffa`9da83610 UnityPlayer!il2cpp::vm::g_SocketPollingThread
00000022`e25ff818  000001c4`1705f5c0
00000022`e25ff820  cccccccc`0000106c
...
00000022`e25ff860  00005eaa`e9a6af86
00000022`e25ff868  cccccccc`cccccccc
00000022`e25ff870  00000022`e25ff8d0
00000022`e25ff878  00007ffa`99c63b52 UnityPlayer!il2cpp::os::Thread::RunWrapper+0xd2 [c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\os\thread.cpp @ 106]
00000022`e25ff880  00007ffa`9da83610 UnityPlayer!il2cpp::vm::g_SocketPollingThread
00000022`e25ff888  00000000`00000018
00000022`e25ff890  cccccccc`cccccccc
...
00000022`e25ff8a8  000001c4`15508c90
00000022`e25ff8b0  cccccccc`00000002
00000022`e25ff8b8  00007ffa`99b58c40 UnityPlayer!il2cpp::vm::SocketPollingThreadEntryPoint [c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\vm\threadpool.cpp @ 494]
00000022`e25ff8c0  00007ffa`9da83610 UnityPlayer!il2cpp::vm::g_SocketPollingThread
00000022`e25ff8c8  000001c4`155a5890
00000022`e25ff8d0  00000022`e25ff920
00000022`e25ff8d8  00007ffa`99c19a14 UnityPlayer!il2cpp::os::ThreadStartWrapper+0x54 [c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\os\win32\threadimpl.cpp @ 31]
00000022`e25ff8e0  000001c4`155a5890
...
00000022`e25ff900  cccccccc`cccccccc
00000022`e25ff908  00007ffa`99c63a80 UnityPlayer!il2cpp::os::Thread::RunWrapper [c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\os\thread.cpp @ 80]
00000022`e25ff910  000001c4`155a5890
...
00000022`e25ff940  000001c4`1e0801b0
00000022`e25ff948  00007ffa`e6858102 KERNEL32!BaseThreadInitThunk+0x22
00000022`e25ff950  000001c4`1e0801b0
00000022`e25ff958  00000000`00000000
00000022`e25ff960  00000000`00000000
00000022`e25ff968  00000000`00000000
00000022`e25ff970  00007ffa`99c199c0 UnityPlayer!il2cpp::os::ThreadStartWrapper [c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\os\win32\threadimpl.cpp @ 26]
00000022`e25ff978  00007ffa`e926c5b4 ntdll!RtlUserThreadStart+0x34
00000022`e25ff980  00007ffa`e68580e0 KERNEL32!BaseThreadInitThunk

Here’s a rough reconstructed stacktrace:


00000022`e25febd8  00007ffa`b1fdc65c ucrtbased!heap_alloc_dbg+0x1c […\appcrt\heap\debug_heap.cpp @ 447]
00000022`e25fec28  00007ffa`b1fdf54c ucrtbased!_calloc_dbg+0x6c […\appcrt\heap\debug_heap.cpp @ 511]
00000022`e25fec68  00007ffa`b1fdb69e ucrtbased!calloc+0x2e […\appcrt\heap\calloc.cpp @ 25]
00000022`e25feca8  00007ffa`99b3b646 UnityPlayer!il2cpp::os::SocketImpl::Poll+0x66 […\libil2cpp\os\win32\socketimpl.cpp @ 1429]
00000022`e25ff3f8  00007ffa`99c1caf4 UnityPlayer!il2cpp::os::Socket::Poll+0x44 […\libil2cpp\os\socket.cpp @ 324]
00000022`e25ff428  00007ffa`99b585f8 UnityPlayer!il2cpp::vm::SocketPollingThread::RunLoop+0x268 […\libil2cpp\vm\threadpool.cpp @ 452]
00000022`e25ff7e8  00007ffa`99b58d2c UnityPlayer!il2cpp::vm::SocketPollingThreadEntryPoint+0xec […\libil2cpp\vm\threadpool.cpp @ 524]
00000022`e25ff878  00007ffa`99c63b52 UnityPlayer!il2cpp::os::Thread::RunWrapper+0xd2 […\libil2cpp\os\thread.cpp @ 106]
00000022`e25ff8d8  00007ffa`99c19a14 UnityPlayer!il2cpp::os::ThreadStartWrapper+0x54 […\libil2cpp\os\win32\threadimpl.cpp @ 31]
00000022`e25ff948  00007ffa`e6858102 KERNEL32!BaseThreadInitThunk+0x22
00000022`e25ff978  00007ffa`e926c5b4 ntdll!RtlUserThreadStart+0x34

Alright, so now I knew which thread was crashing: it was the IL2CPP runtime socket polling thread. Its responsibility is to tell other threads when their sockets are ready to send or receive data. It goes like this: there's a FIFO queue that other threads put socket poll requests into; the socket polling thread dequeues these requests one by one, calls the select() function, and when select() returns a result, it queues the callback from the original request onto the thread pool.
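Roughly, the loop works along the lines of the pseudocode below. This is an illustrative sketch only: the type and helper names are assumptions, not the actual threadpool.cpp code.

// Rough pseudocode of the socket polling thread's main loop (illustrative).
void SocketPollingThreadLoop()
{
    while (!terminationRequested)
    {
        // Other threads enqueue poll requests into a FIFO queue
        std::vector<PollRequest> requests = DequeuePendingRequests();

        // Blocks in select() until one of the sockets becomes ready
        il2cpp::os::Socket::Poll(requests, -1 /* no timeout */);

        // For every socket that is ready, hand the request's callback
        // over to the thread pool
        for (PollRequest& request : requests)
        {
            if (request.revents != 0)
                QueueCallbackOnThreadPool(request.callback);
        }
    }
}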


So somebody was corrupting the stack, and badly. To narrow the search, I decided to put "stack sentinels" on most stack frames in that thread. Here's how my stack sentinel was defined:


Stack sentinel
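The original post shows the class as an image; a minimal reconstruction of the idea looks something like this (buffer size, message text, and helpers here are assumptions rather than the actual libil2cpp code):

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <intrin.h>

class StackSentinel
{
public:
    static const size_t kSentinelSize = 16 * 1024;  // guard a sizable chunk of the stack

    StackSentinel()
    {
        // Fill the guard buffer with a known magic value on construction
        memset(m_Buffer, 0xDD, sizeof(m_Buffer));
    }

    ~StackSentinel()
    {
        // On destruction, verify that nobody touched the magic values
        for (size_t i = 0; i < sizeof(m_Buffer); i++)
        {
            if (m_Buffer[i] != 0xDD)
            {
                printf("Stack corruption detected at %p: somebody wrote %d\n",
                       &m_Buffer[i], m_Buffer[i]);
                __debugbreak();
            }
        }
    }

private:
    uint8_t m_Buffer[kSentinelSize];
};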

When constructed, it fills the buffer with 0xDD. When destructed, it checks whether those values have changed. This worked incredibly well: the game was no longer crashing! It was asserting instead:


Somebody wrote 2

Somebody had been touching my sentinel’s privates – and it definitely wasn’t a friend. I ran this a couple more times, and the result was the same: every time a value of “2” was written to the buffer first. Looking at the memory view, I noticed that what I saw was familiar:


Memory view

These are the exact same values that we’ve seen in the very first corrupted stack trace. I realized that whatever caused the crash earlier was also responsible for corrupting the stack sentinel. At first, I thought that this was some kind of a buffer overflow, and somebody was writing outside of their local variable bounds. So I started placing these stack sentinels much more aggressively: before almost every function call that the thread made. However, the corruptions seemed to happen at random times, and I wasn’t able to find what was causing them using this method.


I knew that memory always got corrupted while one of my sentinels was in scope. I somehow needed to catch the thing corrupting it red-handed. I figured I would make the stack sentinel memory read-only for the lifetime of the stack sentinel: I would call VirtualProtect() in the constructor to mark the pages read-only, and call it again in the destructor to make them writable:


Protected sentinel
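The protected variant would look roughly like this. This is a sketch under the assumption that the buffer covers whole pages (VirtualProtect works at page granularity); the real code isn't reproduced in the post.

#include <windows.h>

StackSentinel::StackSentinel()
{
    memset(m_Buffer, 0xDD, sizeof(m_Buffer));

    // Make the sentinel pages read-only for the sentinel's lifetime, so any
    // stray write should trigger an access violation
    DWORD oldProtect;
    VirtualProtect(m_Buffer, sizeof(m_Buffer), PAGE_READONLY, &oldProtect);
}

StackSentinel::~StackSentinel()
{
    // Restore writability, then verify the 0xDD pattern as before
    DWORD oldProtect;
    VirtualProtect(m_Buffer, sizeof(m_Buffer), PAGE_READWRITE, &oldProtect);
    // ... check the buffer for the 0xDD pattern ...
}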

To my surprise, it was still being corrupted! And the message in the debug log was:



CrashingGame.exe has triggered a breakpoint.

This was a red flag to me. Somebody had been corrupting the memory either while it was read-only, or just before I set it to read-only. Since I got no access violations, I assumed it was the latter, so I changed the code to check whether the memory contents changed right after setting my magic values:


Checking right after setting
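In other words, the constructor now verified its own writes immediately, something along these lines (again an illustrative sketch, not the actual code):

StackSentinel::StackSentinel()
{
    memset(m_Buffer, 0xDD, sizeof(m_Buffer));

    // Check the magic values immediately after writing them: if this fires,
    // the corruption happens before the pages are ever made read-only
    for (size_t i = 0; i < sizeof(m_Buffer); i++)
    {
        if (m_Buffer[i] != 0xDD)
        {
            printf("Memory was corrupted at %p.\n", &m_Buffer[i]);
            __debugbreak();
        }
    }

    DWORD oldProtect;
    VirtualProtect(m_Buffer, sizeof(m_Buffer), PAGE_READONLY, &oldProtect);
}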

My theory checked out:



CrashingGame.exe has triggered a breakpoint.

At this point I was thinking: "Well, it must be another thread corrupting my stack. It MUST be. Right? RIGHT?". The only way I knew how to proceed with the investigation was to use data (memory) breakpoints to catch the offender. Unfortunately, on x86 you can watch only four memory locations at a time, which means I could monitor at most 32 bytes, while the area that had been getting corrupted was 16 KB. I somehow needed to figure out where to set the breakpoints. I started observing corruption patterns. At first, they seemed random, but that was merely an illusion caused by ASLR: every time I restarted the game, the stack would be placed at a random memory address, so the place of corruption naturally differed. Once I realized this, I stopped restarting the game every time memory became corrupted and just continued execution. This led me to discover that the corrupted memory address was always constant for a given debugging session. In other words, once it had been corrupted once, it would always get corrupted at the exact same memory address as long as I didn't terminate the program:



CrashingGame.exe has triggered a breakpoint.
Memory was corrupted at 0x90445febd8.
CrashingGame.exe has triggered a breakpoint.

I set a data breakpoint on that memory address and watched it keep breaking whenever the sentinel set it to the magic value of 0xDD. I figured this was going to take a while, but Visual Studio actually lets me put a condition on that breakpoint: only break if the value at that memory address is 2:


Conditional data breakpoint
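In Visual Studio terms, the data breakpoint watched the address from the log above and its condition was roughly an expression of the form below (the exact expression is reconstructed for illustration, not copied from the screenshot):

*(unsigned char*)0x90445febd8 == 2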

A minute later, the breakpoint finally hit. I arrived at this point three days into debugging this thing. This was going to be my triumph. "I finally pinned you down!", I proclaimed. Or so I optimistically thought:


Corrupted at assignment

I stared at the debugger in disbelief as my mind filled with more questions than answers: "What? How is this even possible? Am I going crazy?". I decided to look at the disassembly:


Corrupted at assignment disassembly

Sure enough, it was modifying that memory location. But it was writing 0xDD to it, not 0x02! And when I looked at the memory window, the whole region was already corrupted:


rax memory

As I was getting ready to bang my head against the wall, I called my coworker over and asked him to check whether I had missed something obvious. We reviewed the debugging code together and couldn't find anything that could even remotely cause such weirdness. I then took a step back and tried to imagine what could possibly make the debugger break and report that this code had set the value to "2". I came up with the following hypothetical chain of events:


1. mov byte ptr [rax], 0DDh modifies the memory location; the CPU breaks execution to let the debugger inspect the program state.
2. Memory gets corrupted by something.
3. The debugger inspects the memory address, finds "2" inside, and thinks that's what changed.


So… what can change memory contents while the program is frozen by a debugger? As far as I know, that's possible in two scenarios: either it's another process doing it, or it's the OS kernel. A conventional debugger won't help investigate either of these. Enter kernel debugging land.


Surprisingly, setting up kernel debugging on Windows is extremely easy. You'll need two machines: the one the debugger will run on, and the one you'll debug. Open up an elevated command prompt on the machine you're going to debug, and type this:


Enable kernel debugger
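The commands shown in the screenshot are the standard bcdedit setup for network kernel debugging; roughly the following, with the IP address and port below being placeholders for your own setup:

bcdedit /debug on
bcdedit /dbgsettings net hostip:192.168.1.10 port:50000

The second command is the one that prints the key mentioned below.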

Host IP is the IP address of the machine that has the debugger running. It will use the specified port for the debugger connection. It can be anywhere between 49152 and 65535. After hitting enter on the second command, it will tell you a secret key (truncated in the picture) which acts as a password when you connect the debugger. After completing these steps, reboot.


On the other computer, open up WinDbg, click File -> Kernel Debug and enter the port and key.


Attaching kernel debugger

If everything goes well, you'll be able to break execution by pressing Debug -> Break. If that works, the "debuggee" computer will freeze. Enter "g" to continue execution.


I started up the game and waited for it to break once so I could find out the address at which memory gets corrupted:



CrashingGame.exe has triggered a breakpoint.

Alright, now that I knew the address where to set a data breakpoint, I had to configure my kernel debugger to actually set it:



PROCESS ffffe00167228080
    SessionId: 1  Cid: 26b8    Peb: 49cceca000  ParentCid: 03d8
    DirBase: 1ae5e3000  ObjectTable: ffffc00186220d80  HandleCount:
    Image: CrashingGame.exe

kd> .process /i ffffe00167228080
You need to continue execution (press 'g' <enter>) for the context
to be switched. When the debugger breaks in again, you will be in
the new process context.
kd> g
Break instruction exception - code 80000003 (first chance)
nt!DbgBreakPointWithStatus:
fffff801`7534beb0 cc              int     3
kd> .process
Implicit process is now ffffe001`66e9e080
kd> .reload /f
kd> ba w 1 0x00000049D05FEDD8 ".if (@@c++(*(char*)0x00000049D05FEDD8 == 2)) { k } .else { gc }"

After some time, the breakpoint actually hit…



00 ffffd000`23c1e980 fffff801`7527dc64 nt!IopCompleteRequest+0xef
01 ffffd000`23c1ea70 fffff801`75349953 nt!KiDeliverApc+0x134
02 ffffd000`23c1eb00 00007ffd`7e08b4bd nt!KiApcInterrupt+0xc3
03 00000049`d05fad50 cccccccc`cccccccc UnityPlayer!StackSentinel::StackSentinel+0x4d […\libil2cpp\utils\memory.cpp @ 21]

Alright, so what's going on here?! The sentinel is happily setting its magic values, then there's a hardware interrupt, which then calls some completion routine, and that writes "2" into my stack. Wow. Okay, so for some reason the Windows kernel is corrupting my memory. But why?


At first, I thought that this had to be us calling some Windows API and passing it invalid arguments. So I went through all the socket polling thread code again, and found that the only system call we were making there was the select() function. I went to MSDN and spent an hour rereading the docs on select() and rechecking whether we were doing everything correctly. As far as I could tell, there wasn't really much you could do wrong with it, and there definitely wasn't a place in the docs that said "if you pass it this parameter, we'll write 2 into your stack". It seemed like we were doing everything right.


After running out of things to try, I decided to step into the select() function with a debugger, step through its disassembly, and figure out how it works. It took me a few hours, but I managed to do it. It turns out that select() is a wrapper for WSPSelect(), which roughly looks like this:



/* setting up some state */

IO_STATUS_BLOCK statusBlock;
auto result = NtDeviceIoControlFile(networkDeviceHandle, completionEvent, nullptr, nullptr, &statusBlock, 0x12024,
                                    buffer, bufferLength, buffer, bufferLength);

if (result == STATUS_PENDING)
    WaitForSingleObjectEx(completionEvent, INFINITE, TRUE);

/* convert result and return it */

The important parts here are the call to NtDeviceIoControlFile(), the fact that it passes its local variable statusBlock as an out parameter, and finally the fact that it waits for the event to be signalled using an alertable wait. So far so good: it calls a kernel function, which returns STATUS_PENDING if it cannot complete the request immediately. In that case, WSPSelect() waits until the event is set. Once NtDeviceIoControlFile() is done, it writes the result to the statusBlock variable and then sets the event. The wait completes and WSPSelect() returns.


The IO_STATUS_BLOCK struct looks like this:



typedef struct _IO_STATUS_BLOCK
{
    union
    {
        NTSTATUS Status;
        PVOID Pointer;
    };

    ULONG_PTR Information;
} IO_STATUS_BLOCK, *PIO_STATUS_BLOCK;

On 64-bit, that struct is 16 bytes long. It caught my attention that this struct seemed to match my memory corruption pattern: the first 4 bytes get corrupted (NTSTATUS is 4 bytes long), then 4 bytes get skipped (padding/space for the PVOID), and finally 8 more get corrupted. If that was indeed what was being written to my memory, then the first four bytes would contain the result status. The first 4 corruption bytes were always 0x00000102, and that happens to be the code for… STATUS_TIMEOUT! That would have been a sound theory, if only WSPSelect() didn't wait for NtDeviceIoControlFile() to complete. But it did.
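To spell that pattern out against the 64-bit layout of the struct (offsets follow the usual alignment rules, since the union is sized and aligned by the 8-byte PVOID):

Offset  0-3   NTSTATUS Status        -> 4 bytes overwritten, always 0x00000102 (STATUS_TIMEOUT)
Offset  4-7   padding inside the union -> untouched
Offset  8-15  ULONG_PTR Information  -> 8 bytes overwritten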


After figuring out how the select() function worked, I decided to look at the big picture on how socket polling thread worked. And then it hit me like a ton of bricks.


When another thread pushes a socket to be processed by the socket polling thread, the socket polling thread calls select() on that socket. Since select() is a blocking call, when another socket is pushed into the socket polling thread's queue, it has to somehow interrupt select() so the new socket gets processed ASAP. How do you interrupt the select() function? Apparently, we used QueueUserAPC() to execute an asynchronous procedure while select() was blocked… and threw an exception out of it! That unwound the stack, had us execute some more code, and then at some point in the future the kernel would complete the work and write the result to the statusBlock local variable (which no longer existed at that point in time). If it happened to hit a return address on the stack, we'd crash.
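In simplified form, the problematic pattern looked something like the sketch below. The function and exception names are made up for illustration; the real interruption code isn't shown in the post.

#include <windows.h>

struct ThreadInterruptedException {};

// The APC that gets queued to the polling thread. Throwing from here unwinds
// the stack right out of WSPSelect() while the kernel still holds a pointer
// to its (now dead) IO_STATUS_BLOCK local variable.
static void CALLBACK InterruptSelectAPC(ULONG_PTR /*context*/)
{
    throw ThreadInterruptedException();
}

// On the thread that enqueues a new poll request:
void InterruptPollingThread(HANDLE pollingThreadHandle)
{
    // Delivered the next time the polling thread enters an alertable wait,
    // which is exactly where WSPSelect() is blocked
    QueueUserAPC(InterruptSelectAPC, pollingThreadHandle, 0);
}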


The fix was pretty straightforward: instead of using QueueUserAPC(), we now create a loopback socket to which we send a byte any time we need to interrupt select(). This path has been used on POSIX platforms for quite a while, and is now used on Windows too. The fix for this bug shipped in Unity 5.3.4p1.
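A rough sketch of what the loopback approach looks like follows; the socket names and helpers here are illustrative, not the actual shipped code.

#include <winsock2.h>

// The polling thread adds the receive end of a connected localhost socket
// pair to its select() read set. Interrupting select() is then just:
void InterruptPollingThread(SOCKET wakeupSendSocket)
{
    const char one = 1;
    send(wakeupSendSocket, &one, 1, 0);  // wakes up select() on the polling thread
}

// In the polling thread, after select() returns:
//     if (FD_ISSET(wakeupRecvSocket, &readSet))
//     {
//         char buffer[64];
//         recv(wakeupRecvSocket, buffer, sizeof(buffer), 0);  // drain the wake-up byte(s)
//         // pick up any newly queued poll requests and rebuild the fd sets
//     }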


This is one of those bugs that keep you up at night. It took me 5 days to solve, and it’s probably one of the hardest bugs I ever had to look into and fix. Lesson learnt, folks: do not throw exceptions out of asynchronous procedures if you’re inside a system call!


Source: https://blogs.unity3d.com/2016/04/25/debugging-memory-corruption-who-the-hell-writes-2-into-my-stack-2/