I am currently working on an emulation environment similar to Qiling. Unlike Qiling, it emulates the entire user-space, not just the target application.

As Qiling reimplements all APIs (kernel32, vcruntime, …) outside the emulator, it gains a lot of speed (e.g. by not needing to run all the ntdll code during startup), while sacrificing stability (reimplementing all APIs can be error-prone) and introducing a whole lot of work.

My emulator draws the line at the syscall level. So instead of reimplementing all APIs, it loads all Windows DLLs and simply provides syscall implementations outside the emulator. This might be slower, but drastically reduces the amount of work. By using C++ instead of Python, I hope I can make up for the speed loss (I will do some performance measurements soon to see whether that is really the case :D).

However, apart from syscalls, one thing I have to deal with is exceptions.

When e.g. accessing bad memory or executing an invalid instruction, applications get notified and have a chance to handle these errors. The kernel forwards these exceptions to the application by invoking ntdll’s KiUserExceptionDispatcher.

That’s pretty much all I knew before this journey.

The setup

To be able to test if exceptions work, I created a sample application to run in my emulator:

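#include <windows.h>  // EXCEPTION_EXECUTE_HANDLER
#include <cstdio>     // puts

void test1()
{
    __try
    {
        // Write to the unmapped address 0x1 to raise an access violation.
        *(int*)1 = 1;
    }
    __except (EXCEPTION_EXECUTE_HANDLER)
    {
        // Reached only if the exception was dispatched to this SEH handler.
        puts("Executing handler");
    }
}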

It accesses invalid memory (address 0x1), handles the exception through SEH and prints Executing handler.

So the goal is clear: I want my emulator to print Executing handler.

A dive into KiUserExceptionDispatcher

To implement exception support, I first had a look at the signature of KiUserExceptionDispatcher in IDA:

It takes a pointer to an EXCEPTION_RECORD, a structure that holds information about the exception, and a pointer to a CONTEXT, which stores the CPU context at the time of the exception.
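To make the first of these two structures more concrete, here is a rough sketch of how the 64-bit EXCEPTION_RECORD could be mirrored outside the guest and filled for an access violation like the one in the sample. ExceptionRecord64, make_access_violation and the snake_case field names are hypothetical and chosen for illustration only; the field order follows the public Windows SDK definition, and guest pointers are stored as plain 64-bit integers.

```cpp
#include <cstdint>

// Hypothetical mirror of the 64-bit EXCEPTION_RECORD (illustration only).
// Guest pointers are kept as 64-bit integers instead of host pointers.
struct ExceptionRecord64 {
    std::uint32_t exception_code;
    std::uint32_t exception_flags;
    std::uint64_t exception_record;    // guest pointer to a chained record, 0 if none
    std::uint64_t exception_address;   // guest address of the faulting instruction
    std::uint32_t number_parameters;   // the compiler adds 4 bytes of padding after this
    std::uint64_t exception_information[15];
};

static_assert(sizeof(ExceptionRecord64) == 0x98, "record should occupy 0x98 bytes");

constexpr std::uint32_t kStatusAccessViolation = 0xC0000005u;

// Describes a failed memory access: parameter 0 is the access type
// (0 = read, 1 = write), parameter 1 is the inaccessible address.
constexpr ExceptionRecord64 make_access_violation(std::uint64_t fault_rip,
                                                  std::uint64_t bad_address,
                                                  bool is_write) {
    ExceptionRecord64 record{};
    record.exception_code = kStatusAccessViolation;
    record.exception_address = fault_rip;
    record.number_parameters = 2;
    record.exception_information[0] = is_write ? 1 : 0;
    record.exception_information[1] = bad_address;
    return record;
}
```

Called with the address of the faulting instruction, 0x1 and true, this would describe the failed write from the sample application.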

So I implemented just that:

Upon handling an exception in my emulator, I reserved some big space on the stack, created an EXCEPTION_RECORD, stored the CPU state inside a CONTEXT and passed both pointers in rcx and rdx to KiUserExceptionDispatcher.

This did not work. The application dies.

Looking at the broader picture, it becomes clear why:

RtlDispatchException also takes pointers to an EXCEPTION_RECORD and a CONTEXT as arguments.
However, it is obvious that KiUserExceptionDispatcher does not. The signature in IDA, generated by Lumina, is wrong.

As one can see, the first argument to RtlDispatchException, the EXCEPTION_RECORD, is essentially rsp + 4F0h and the second argument, the CONTEXT, is just rsp.

That means that arguments to KiUserExceptionDispatcher are passed by creating a specific stack layout, not through registers.

So I went back, adjusted my code and built that stack layout: The CONTEXT, with a size of 0x4F0 (*) first, followed by the EXCEPTION_RECORD with a size of 0x98.

(*) The size of the CONTEXT is actually 0x4D0. It is followed by a CONTEXT_EX structure with a size of 0x18 and 0x8 bytes of alignment, which makes up the total of 0x4F0, but to keep it simple, I’ll just refer to everything as CONTEXT.
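The layout described here boils down to two fixed offsets from rsp. A minimal sketch of that arithmetic, using only the sizes quoted in the text; the constant names, DispatcherArgs and locate_dispatcher_args are hypothetical:

```cpp
#include <cstdint>

// Sizes quoted in the text (x64 Windows); all names are illustrative only.
constexpr std::uint64_t kContextSize         = 0x4D0;  // CONTEXT proper
constexpr std::uint64_t kContextExSize       = 0x18;   // CONTEXT_EX
constexpr std::uint64_t kContextPadding      = 0x8;    // alignment
constexpr std::uint64_t kContextAreaSize     = kContextSize + kContextExSize + kContextPadding;
constexpr std::uint64_t kExceptionRecordSize = 0x98;

static_assert(kContextAreaSize == 0x4F0, "CONTEXT area should total 0x4F0 bytes");

// Guest addresses of the two structures KiUserExceptionDispatcher expects.
struct DispatcherArgs {
    std::uint64_t context_address;
    std::uint64_t exception_record_address;
};

// The CONTEXT starts at rsp, the EXCEPTION_RECORD follows at rsp + 0x4F0.
constexpr DispatcherArgs locate_dispatcher_args(std::uint64_t rsp) {
    return DispatcherArgs{rsp, rsp + kContextAreaSize};
}
```

With this, an emulator would write the CONTEXT to context_address and the EXCEPTION_RECORD to exception_record_address before jumping to KiUserExceptionDispatcher.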

With the new layout at hand, I tried again … without luck. The application still dies without printing anything.

Diving deeper…

My expectation of what is supposed to happen was the following:

The internal exception handling mechanism, within RtlDispatchException, is supposed to call all exception handlers and perform the unwinding using the provided CONTEXT, starting from the point where the exception happened, until it reaches a point where the exception is handled, namely the handler that prints Executing handler.

Apparently, that does not seem to be the case.

I needed to dive deeper, so I started debugging the internals, within RtlDispatchException. Deep inside the exception handling infrastructure, I found an interesting point:

A call to RtlLookupFunctionEntry within RtlUnwindEx:

RtlLookupFunctionEntry returns a pointer to the function entry of the function that contains the address passed as its first argument: rcx points to a location somewhere within that function.

The resulting function entry is a pointer to a RUNTIME_FUNCTION object. It essentially holds information on how to unwind the stack frame of that function and more.

My guess was that the function entries being looked up here belong to the functions whose frames need unwinding. So I started tracking which functions these were:

  • RtlUnwindEx+A0A
  • vcruntime140___C_specific_handler+122
  • RtlpExecuteHandlerForException+F
  • RtlDispatchException+2C8
  • KiUserExceptionDispatcher+2E
  • ConsoleApplication6.exe:00007FF7F7F61074

The last lookup, ConsoleApplication6.exe:00007FF7F7F61074 is the location within my sample application that triggers the exception. However, to my surprise, it seems that all active stack frames are being unwound, including those created after KiUserExceptionDispatcher.

This means my stack layout, when invoking KiUserExceptionDispatcher, needs to be correct.

Analyzing UNWIND_CODEs

To figure out what was wrong, I decided to look at the RUNTIME_FUNCTION entry for KiUserExceptionDispatcher:

It is a structure containing 3 elements:

  • an RVA to the start of the corresponding function
  • an RVA to the end of the function
  • an RVA to the data relevant for unwinding
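These three RVAs map onto a small struct, and a lookup similar in spirit to RtlLookupFunctionEntry can be approximated by a binary search over a table sorted by start RVA. This is a simplified sketch and not ntdll's actual implementation; the real function also covers cases that are ignored here, such as dynamically registered function tables. RuntimeFunction, find_entry and the values in kSampleTable are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative mirror of a RUNTIME_FUNCTION entry: three RVAs.
struct RuntimeFunction {
    std::uint32_t begin_rva;   // start of the function
    std::uint32_t end_rva;     // end of the function (exclusive)
    std::uint32_t unwind_rva;  // location of the unwind data
};

// Binary search for the entry whose [begin_rva, end_rva) range contains rva.
// Assumes the table is sorted by begin_rva and ranges do not overlap.
constexpr const RuntimeFunction* find_entry(const RuntimeFunction* table,
                                            std::size_t count,
                                            std::uint32_t rva) {
    std::size_t lo = 0;
    std::size_t hi = count;
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (rva < table[mid].begin_rva) {
            hi = mid;        // target lies before this function
        } else if (rva >= table[mid].end_rva) {
            lo = mid + 1;    // target lies after this function
        } else {
            return &table[mid];
        }
    }
    return nullptr;          // no function covers this address
}

// Made-up table for demonstration purposes.
constexpr RuntimeFunction kSampleTable[] = {
    {0x1000, 0x1040, 0x3000},
    {0x1040, 0x10A0, 0x3010},
    {0x2000, 0x2100, 0x3020},
};
```

A caller would subtract the module's image base from the instruction pointer to get the RVA before searching.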

Looking at the data relevant for unwinding for KiUserExceptionDispatcher, we can see this:

It contains an UNWIND_INFO_HDR and a list of UNWIND_CODE entries. The UNWIND_CODEs are instructions for the unwinder. They essentially describe how to ‘undo’ all stack-related operations the corresponding function performed. Luckily, this is very well documented by Microsoft.

Again, the codes describe how to undo operations. That means following these in an inverse order should describe how to construct the stack.

The last 2 codes are especially important: UWOP_ALLOC_LARGE and UWOP_PUSH_MACHFRAME, as they describe how the stack was set up.

First, let’s have a look at UWOP_ALLOC_LARGE. The documentation says:

UWOP_ALLOC_LARGE (1) 2 or 3 nodes

Allocate a large-sized area on the stack. There are two forms. If the operation info equals 0, then the size of the allocation divided by 8 is recorded in the next slot, allowing an allocation up to 512K - 8. If the operation info equals 1, then the unscaled size of the allocation is recorded in the next two slots in little-endian format, allowing allocations up to 4GB - 8.

That means, if operation info (the first value of the UNWIND_CODE tuple) is 0, then the next two bytes denote the allocated size divided by 8. As we can see, in our case, it’s 0:

So the value of the following 2 bytes (0xB2), multiplied by 8, is the size of the stack allocation the function performs: 0x590.
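The decoding rule from the documentation can be written down in a few lines. This is a sketch under the assumption that each 16-bit slot holds the prolog offset in its low byte, the operation code in the next 4 bits and the operation info in the top 4 bits, as in the documented UNWIND_CODE layout; UnwindCode, decode_slot and alloc_large_size are hypothetical names.

```cpp
#include <cstdint>

// Decoded form of one 16-bit UNWIND_CODE slot (illustrative names).
struct UnwindCode {
    std::uint8_t code_offset;  // offset within the prolog
    std::uint8_t unwind_op;    // operation code, e.g. 1 = UWOP_ALLOC_LARGE
    std::uint8_t op_info;      // operation info
};

constexpr std::uint8_t kUwopAllocLarge = 1;

// Splits a little-endian slot into its three fields.
constexpr UnwindCode decode_slot(std::uint16_t slot) {
    return UnwindCode{
        static_cast<std::uint8_t>(slot & 0xFF),
        static_cast<std::uint8_t>((slot >> 8) & 0x0F),
        static_cast<std::uint8_t>(slot >> 12)};
}

// Size described by UWOP_ALLOC_LARGE:
//   op_info == 0: the next slot holds size / 8
//   op_info == 1: the next two slots hold the unscaled 32-bit size
constexpr std::uint32_t alloc_large_size(std::uint8_t op_info,
                                         std::uint16_t next_slot,
                                         std::uint16_t slot_after) {
    return op_info == 0
        ? static_cast<std::uint32_t>(next_slot) * 8u
        : (static_cast<std::uint32_t>(slot_after) << 16) | next_slot;
}

// The case from the text: operation info 0, next slot 0xB2.
static_assert(alloc_large_size(0, 0xB2, 0) == 0x590, "0xB2 * 8 should be 0x590");
```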

If we combine the size of CONTEXT (0x4F0) with the size of our EXCEPTION_RECORD (0x98) it equals 0x588. If we slap on some alignment, we’re at 0x590. That means this unwind code describes the allocation of the two structures.
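As a quick check of this arithmetic, here is a small sketch. It assumes the alignment in question is the usual 16-byte stack alignment on x64, which the text does not state explicitly; align_up and the constant names are hypothetical.

```cpp
#include <cstdint>

// Rounds value up to the next multiple of alignment (a power of two).
constexpr std::uint64_t align_up(std::uint64_t value, std::uint64_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

constexpr std::uint64_t kContextArea     = 0x4F0;  // CONTEXT incl. CONTEXT_EX, from the text
constexpr std::uint64_t kExceptionRecord = 0x98;   // EXCEPTION_RECORD, from the text
constexpr std::uint64_t kStackAlignment  = 16;     // assumption: x64 stack alignment

static_assert(kContextArea + kExceptionRecord == 0x588, "raw size of both structures");
static_assert(align_up(kContextArea + kExceptionRecord, kStackAlignment) == 0x590,
              "matches the UWOP_ALLOC_LARGE allocation");
```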

Now let’s look at UWOP_PUSH_MACHFRAME:

UWOP_PUSH_MACHFRAME (10) 1 node

Push a machine frame. This unwind code is used to record the effect of a hardware interrupt or exception. There are two forms. If the operation info equals 0, one of these frames has been pushed on the stack:

Location   Value
RSP+32     SS
RSP+24     Old RSP
RSP+16     EFLAGS
RSP+8      CS
RSP        RIP

[…]

The simulated UWOP_PUSH_MACHFRAME operation decrements RSP by 40 (op info equals 0) […]

There are again two modes, but as operation info is zero again, the second mode is irrelevant:

That means, right after the two structures on the stack, there needs to be a machine frame of 40 (0x28) bytes: five 8-byte values.

So I went ahead, constructed the machine frame and filled it with the corresponding values from the CONTEXT object. And who would have guessed: The sample runs and the exception handler is called:

As it turns out, only the Old RSP value seems to be relevant, at least in my case; the rest can be garbage.
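Putting the pieces together, the machine frame and its position could be sketched as follows. The frame layout follows the documentation excerpt (five 8-byte slots, 40 bytes in total) and the 0x590 offset follows the UWOP_ALLOC_LARGE size from above; MachineFrame, FaultState and both helper functions are hypothetical names, not taken from the emulator.

```cpp
#include <cstddef>
#include <cstdint>

// Machine frame as listed in the documentation (operation info 0).
struct MachineFrame {
    std::uint64_t rip;     // RSP + 0
    std::uint64_t cs;      // RSP + 8
    std::uint64_t eflags;  // RSP + 16
    std::uint64_t rsp;     // RSP + 24, the "Old RSP" slot
    std::uint64_t ss;      // RSP + 32
};

static_assert(sizeof(MachineFrame) == 40, "five 8-byte slots");
static_assert(offsetof(MachineFrame, rsp) == 24, "Old RSP sits at RSP + 24");

// Hypothetical subset of the CPU state captured when the exception occurred.
struct FaultState {
    std::uint64_t rip;
    std::uint64_t rsp;
    std::uint32_t eflags;
    std::uint16_t seg_cs;
    std::uint16_t seg_ss;
};

// CONTEXT and EXCEPTION_RECORD occupy 0x590 bytes above the dispatcher's rsp.
constexpr std::uint64_t kStructAreaSize = 0x590;

// The machine frame follows directly after the two structures.
constexpr std::uint64_t machine_frame_address(std::uint64_t dispatcher_rsp) {
    return dispatcher_rsp + kStructAreaSize;
}

// Fills the frame from the captured state. Only rsp seemed to matter in the
// author's tests, but filling every slot costs nothing.
constexpr MachineFrame make_machine_frame(const FaultState& state) {
    return MachineFrame{state.rip, state.seg_cs, state.eflags, state.rsp, state.seg_ss};
}
```

The resulting frame would be written to machine_frame_address(rsp), where rsp is the stack pointer handed to KiUserExceptionDispatcher.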

A few final words

I have never really dealt with internals of exception handling on Windows before. The only thing I ever did was to debug some FrameHandler4 peculiarities, but other than that, this was more or less completely new to me. However, it was a really refreshing journey. Most work on the emulator consists of implementing one syscall after another, so this was quite a fun thing to investigate.

As it turns out, the entire stack layout has already been documented by mrexodia in his dumpulator project. A more thorough Google search might therefore have saved me an hour or two, but that would have been half the fun :D