2024-09-18 08:00:00
This post is part of a series on retrowin32.
The "Windows emulator" half of retrowin32 is an implementation of the Windows API, which means it provides implementations of functions exposed by Windows. I've recently changed how calls from emulated code into these native implementations works and this post gives background on how and why.
Consider a function found in kernel32.dll, as taken from MSDN (with some arguments omitted for brevity):
BOOL WriteFile(
  [in] HANDLE hFile,
  [in] LPCVOID lpBuffer,
  [in] DWORD nNumberOfBytesToWrite,
);
In retrowin32 we write a Rust implementation of it like this:
#[win32_derive::dllexport]
pub fn WriteFile(
    machine: &mut Machine,
    hFile: HFILE,
    lpBuffer: &[u8],
) -> bool {
    ...
}
Suppose we have some 32-bit x86 executable that thinks it's calling the former. retrowin32's machinery connects it through to the latter — even though the implementation might be running on 64-bit x86, a different CPU architecture, or even the web platform.
Before we get into the lower-level x86 details, you might first notice that those function prototypes don't even take the same arguments — nNumberOfBytesToWrite disappeared.
On the caller's side, a function call like WriteFile(hFile, "hello", 5); creates a stack that looks like:
return address ← stack pointer points here
hFile
pointer to "hello" buffer
5
The dllexport annotation in the Rust code above causes a code generator to generate an interop function that knows how to read various Rust types from the stack. For example, for WriteFile the code it generates looks something like the following, which maps the pointer+length pair into a Rust slice:
mod impls {
    pub fn WriteFile(machine: &mut Machine, stack_pointer: u32) -> u32 {
        let hFile = machine.mem.read(stack_pointer + 4);     // hFile
        let lpBuffer = machine.mem.read(stack_pointer + 8);  // pointer to "hello"
        let len = machine.mem.read(stack_pointer + 12);      // 5
        let slice = machine.mem.slice(lpBuffer, len);
        super::WriteFile(machine, hFile, slice)  // the human-authored function
    }
}
There's a bit more detail around how the return value makes it into the eax register and how the stack gets cleaned up (see the sketch below), but that is most of it. With this machinery in place, the remaining question is how to intercept the executable when it makes a DLL call and pass it through to this implementation.
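Roughly, that return-value and stack-cleanup step looks like this sketch (illustrative names, not retrowin32's actual types):

// Sketch: stdcall means the callee pops its own arguments, so after the
// builtin returns, put the result in eax and unwind the stack ourselves.
fn finish_stdcall(machine: &mut Machine, result: u32, args_size: u32) {
    let return_addr = machine.mem.read(machine.cpu.regs.esp);
    machine.cpu.regs.eax = result;          // return value travels in eax
    machine.cpu.regs.esp += 4 + args_size;  // pop return address + arguments
    machine.cpu.regs.eip = return_addr;     // resume the caller
}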
Emulation aside, when an executable calls an internal function it knows where that function is in memory, so it can generate a call instruction targeting the direct address. When an executable wants to call a function in a DLL, it doesn't know where in memory the DLL is, so how does it know where to call?
In Windows this is handled by the "import address table" (IAT), which is an array of pointers at a known location within the executable. Each call site calls by indirecting through the IAT: instead of calling WriteFile at a specific address, it calls whatever address is found in the WriteFile entry of the IAT. At load time, the address of each needed DLL function is written into the IAT.
(Concretely, the call site encodes a fixed address, the location of the WriteFile entry in the IAT, while the value stored at that entry, filled in at load time, points at the implementation of WriteFile.)
Advanced readers might ask: given that we're writing addresses into the IAT at load time, why is there this indirection at all, instead of just writing the correct addresses into the call sites? The answer is that there's a balance between making things fast in the common case and being able to share memory across processes. Windows even has an optimization that attempts to make the (mutated after loading, and therefore typically copy-on-write) IAT pages amenable to being shared across process instances.
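To make the load-time step concrete, here is a minimal sketch of what a loader does with the IAT (illustrative Rust with hypothetical field and helper names, not retrowin32's exact code):

// Sketch: for each imported symbol, resolve it and patch its IAT slot.
// `imports` describes, per IAT slot, which DLL and symbol that slot names.
for import in &imports {
    // e.g. resolve("kernel32.dll", "WriteFile") -> address of the loaded code
    let target: u32 = resolve(&import.dll, &import.symbol);
    mem.write_u32(import.iat_slot_addr, target);
}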
In retrowin32's case we just needed some easy way to identify these calls, so retrowin32 poked easy-to-spot magic numbers into the IAT.
The emulator's code then looks like:
const builtins = [..., WriteFile, ...]

fn emulate() {
    if instruction_pointer looks like 0xFIAT_xxxx {
        builtins[xxxx].call()
    } else {
        ...emulate code found at instruction_pointer...
    }
}
This worked fine for a long time! But it had a few important problems that became apparent over time:

- A LoadLibrary() call gives you a pointer to the real base address where a DLL was loaded. An executable I was trying to get to run was calling LoadLibrary() and attempting to traverse the various DLL file headers found at that pointer. Because retrowin32 didn't actually load real libraries, those headers didn't exist.
- Not every DLL export is a function; msvcrt.dll exports plain variables like ints.
- Executables can also interact with imported functions like WriteFile() in ways beyond calling them through the IAT.

The shared root cause of some of the above issues is that executables just expect real DLLs to be found in memory. (Pretty much every problem bottoms out as Hyrum's Law.) I could make retrowin32 attempt to construct all the appropriate layout at runtime, but ultimately we can match DLL semantics best by creating real DLLs for system libraries and loading them like any other DLL.
However, all these DLLs need to do is hand control off to handlers within retrowin32. For this I changed retrowin32's code generator to generate trivial snippets of x86 assembly which compile into real win32 DLLs as part of the retrowin32 build. (My various detours through cross-compiling finally paying off!) retrowin32 needs a DLL loader anyway for DLLs found out in the wild, and by using a real DLL toolchain to construct these system DLLs I guarantee they have all of the expected layout.
I can then bundle the literal bytes of the DLLs into the retrowin32 executable such that they can be loaded when needed without needing to shuffle files around; they are a few kilobytes each.
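A minimal sketch of that bundling (the file path and helper here are hypothetical; the real build wires this up differently):

// Sketch: embed the generated DLL bytes directly into the retrowin32 binary.
static KERNEL32_DLL: &[u8] = include_bytes!("generated/kernel32.dll");

fn builtin_dll_bytes(name: &str) -> Option<&'static [u8]> {
    match name {
        "kernel32.dll" => Some(KERNEL32_DLL),
        _ => None, // other bundled DLLs elided
    }
}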
(Coincidentally, the DREAMM emulator, a much more competent Windows emulator, recently made a similar change for pretty similar reasons, which makes me more confident this is the right track.)
Each stub function needs to trigger the retrowin32 code that handles builtin functions. On x86-64 that code is especially tricky. To make this swappable, these stub functions all call out to a known shared function, retrowin32_syscall, yet another symbol that retrowin32 can swap out for the different use cases. They can find a reference to that symbol using exactly the same DLL import machinery, via a magic ambient retrowin32.dll.
Under emulation retrowin32_syscall can point at some code that invokes the x86 sysenter instruction, which is arbitrary but easy for an emulator to hook. On native x86-64 we can instead implement it with the stack-switching far-calling goop.
Finally, in this picture, how does retrowin32_syscall know which builtin was invoked? At the time the sysenter is hit, the return address is at the top of the stack, so with some bookkeeping we can derive that it was WriteFile from that address.
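A rough sketch of that bookkeeping (the table and field names here are made up for illustration):

// Sketch: when the sysenter hook fires, the stub's `call` left its return
// address at the top of the emulated stack; a table recorded at DLL load
// time maps that address back to the builtin shim.
fn handle_syscall(machine: &mut Machine) {
    let esp = machine.cpu.regs.esp;
    let return_addr = machine.mem.read(esp);
    let builtin = machine.builtins_by_return_addr[&return_addr]; // e.g. impls::WriteFile
    builtin(machine, esp);
}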
COM libraries like DirectDraw expose vtables to executables: when you call e.g. DirectDrawCreate you get back a pointer to (a pointer to) a table of functions within DirectDraw. For the executable to be able to read that pointer, it has to exist within the emulated memory, which meant that previously I had some macro soup that knew to allocate a bit of emulated memory and shove some function pointers into it when the DLLs were used.
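That old approach amounted to something like this sketch (illustrative helper names, not the actual macro output): allocate a chunk of emulated memory and write the method addresses into it by hand.

// Sketch: build a COM vtable in emulated memory by hand.
let methods: [u32; 5] = [
    query_interface_addr, // addresses of the builtin shims, however we know them
    add_ref_addr,
    release_addr,
    get_clip_list_addr,
    get_hwnd_addr,
];
let vtable_addr = machine.mem.alloc(4 * methods.len() as u32);
for (i, addr) in methods.iter().enumerate() {
    machine.mem.write_u32(vtable_addr + 4 * i as u32, *addr);
}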
After this change, I can just expose the vtables as static data from the DLLs, like this:
.globl _IDirectDrawClipper
_IDirectDrawClipper:
    .long _IDirectDrawClipper_QueryInterface
    .long _IDirectDrawClipper_AddRef
    .long _IDirectDrawClipper_Release
    .long _IDirectDrawClipper_GetClipList
    .long _IDirectDrawClipper_GetHWnd
    ...
To generate these stub DLLs, I currently generate assembly and run it through the clang-cl toolchain, which seamlessly cross-compiles C/asm to win32. It feels especially handy to be able to personally bug one of its authors for tech support; thanks again Nico!
It feels a bit disappointing to need to pull in a whole separate compiler toolchain for this, especially given that Rust contains the same LLVM bits. I spent a disproportionate amount of time tinkering with alternatives, like perhaps using one of the other assemblers coupled with the lld-link embedded in Rust, or even using Rust itself, but I ran into a variety of small problems with that. Probably something to revisit later.
Any time I tinker in this area (and also when looking at Raymond Chen's post, linked above) I always find myself thinking about how much dang machinery there is to make dynamic linking work and to try to fix up its performance.
It makes me think of a mini-rant I heard Rob Pike give about dynamic linking (it was clearly a canned speech for him, surely a pet peeve) and how much Go just attempts to not do it. I know the static/dynamic linking thing is a tired argument, I understand the two sides of it, and I understand that engineering is about choosing context-specific tradeoffs, but much like with minimal version selection it is funny how clear it is to me that dynamic linking is just not the technology to choose unless forced. Introspecting, I guess this just reveals something to me about my own engineering values.
The above omitted a lot of details, sorry pedants! As penance here is one small detail I ratholed on that might be interesting for you; non-pedants, feel free to skip this section.
In the stub DLLs I first generated code like:
WriteFile:
    call _retrowin32_syscall
    ret 12
To make it all link, at link time I provided a retrowin32.lib file that lets the linker know that _retrowin32_syscall comes from retrowin32.dll.
I expected that to generate a call via the IAT, but the code it generated went through one additional indirection. That is, the resulting machine code in the DLL looked like:
WriteFile:
    call _retrowin32_syscall                  ; e8 rel32
    ...

_retrowin32_syscall:
    jmp [IAT entry for retrowin32_syscall]    ; ff25 addr
Why did it do this? At least in this blog's explanation, when compiling the initial assembly into an object file, the assembler doesn't know whether retrowin32_syscall is a local symbol that will be resolved at link time or if it's a symbol coming from a DLL. So it assumes the former and generates a rel32 call to a local function.
Then, at link time, the linker sees that it actually needs this call to resolve to an IAT reference, and so it generates this extra bit of code containing the jmp (which Ghidra labels a "thunk").
To eliminate this indirection I instead generate assembly that looks like:
WriteFile:
    call [__imp__retrowin32_syscall]
where that __imp_-prefixed symbol is somehow understood to refer directly to the IAT entry. At the level of C code this is related to the dllimport attribute, but I am not sure whether that attribute exists at the assembly level. Peeking at the LLVM source I see it actually has some special-casing around symbols beginning with the literal string "__imp_", yikes.
2024-09-12 08:00:00
This post is part of a series on retrowin32.
It's been just about two years since I left my job. The Saturday after my last day I opened up my laptop and, without ever having considered the idea before that point, spontaneously decided to try writing a Windows emulator to run an old program I loved.
I didn't know much of anything about how emulators work. I could read the occasional bit of x86 assembly when debugging things but I had never written much of it. I had done a bit of Windows programming a few times over my professional career, but I am far from an expert. But overall, how hard could it be, right?
The truth is that there is kind of a lot of detail to all of it. But also, detail ultimately just means it is a slog. x86 has a scrillion opcodes to implement, win32 has a scrillion APIs, but the path from zero to a scrillion starts with a step like any other.
I picked up this project with the explicit intent to just follow my interest wherever it led. I was inspired in part by this thread: "Good things happen when I try hard to chase my sense of excitement, ignoring impulses to produce legible outcomes." I think that observation about legibility really reached me. I went through a period in the past where I found I was only reading books that I felt like I ought to be reading, which had ultimately been killing my enjoyment of reading, and I was trying to recover that feeling about programming.
I haven't been employed since, in part due to having a hard time as a parent. This is a larger topic for another post, but my son has been a lot; there was even a period where his preschool was having me come in for two hours a day to help manage him. I remain mystified how parents manage careers and children simultaneously, even with easy children. I also try to rationalize that, to the extent I can afford it, if there was ever a time to be there for him it is now.
Does it really take two years to make a not particularly complete emulator? Looking over my commit logs it seems like I only even made commits for about 85% of the weeks I've been working on it, and many other weeks there's only been a few random touches. I have had plenty of other distractions, including video games (I beat Factorio's Space Exploration with some friends, which either means nothing to you or which will impress you a lot), other projects on the side (like my build system experiment, which has since been forked by Android!), vacations, and so on.
Two years in, one thing I have been reflecting on is where I am going with this project. At some point I ought to call it complete, or abandon it. As long as it remains interesting to tinker with I intend to tinker, but I am also well aware of the feeling of an unending project that starts to feel like a burden. I am not there yet but I am watching for the signs.
When I first started I imagined the project would be somehow tying my WebAssembly experience into x86 emulation. It turns out that for various reasons JITting into Wasm isn't the most important piece (though the cheerpx people do it) and I never got there. Similarly, my memory of the demoscene — the above program that spawned the whole project is from that world — is from the 90s and I had kind of imagined there would be a rich collection of interesting 2d Windows programs from there. This turned out to not be true. I discovered there are a lot of 2d DOS demos but once Windows came around the demoscene started adopting 3d APIs relatively quickly, and I am not particularly interested in 3d graphics, in part because the look of 90s 3d demos hasn't aged well to my eyes. In other words, the demoscene I had thought I started this project for may have existed only in my imagination.
But at least for now I can say I accomplished something — I got at least the first scene of that original demo working! (Looking now I haven't updated the website to show this; maybe something to do after this post.) The very last bug fix was that I had flipped some math backwards in my MMX implementation. Also, had I known I would need to implement some of MMX, would I have even started this project? Not even sure. I have seen it observed that sometimes not knowing how hard something will be is an important help to actually just starting to try.
The other surprising accomplishment is that after nearly two years of working on it solo (though to properly credit, I've had some really great advice from Dean and Nico!) recently I have had a few contributors show up. In particular somehow the group behind this project found me because they want to run old Windows compilers on Linux cloud machines, but because of the particular thing they do they are also comfortable with disassembly. I have sometimes thought about this: what are the chances of someone having both the low-level skill set needed to usefully contribute, and also the need to emulate an old Windows program? This is to me one of the best things about the internet, where even if such a person is a one in a billion chance, we have a few billion people around on here. And this includes someone else who for whatever reason got Win95's solitaire.exe to run.
2024-07-01 08:00:00
The ICFP programming contest is a weekend-long programming contest affiliated with a functional programming conference. (It's functionally-oriented, but I recall in past years a team from Google dominated using C++, which I think is a good reminder that tools matter less than talent...) In 2006 I participated, really loved it, and then forgot about it.
Last Friday afternoon Rado invited me to try this year. We were relatively casual about it — I had dinner reservations, we both had kid stuff — but together we placed 45th and had a great time. (I saw one person reflecting, "this was the best one since 2006", so I think I got very lucky in picking up this one!) Here is a brief review of the contest, biased towards the parts I worked on, with some highlights.
The center of the contest was the "ICFP language", a tiny functional language (lambda calculus plus a bit) that serialized in an ASCII encoding. An example from that page: the ICFP program

B$ B$ L# L$ v# B. SB%,,/ S}Q/2,$_ IK

represents the abstract program

((\v2 -> \v3 -> v2) ("Hello" . " World!")) 42

which evaluates to the string "Hello World!".
Contestants submitted programs in ICFP that evaluated to a string, e.g. solve problem1 1234, and the server responded with another program that evaluated to a response string.
I hacked up a pretty straightforward evaluator/encoder on Friday afternoon while Rado was at work. From the 2006 contest I recalled that performance mattered so I wrote it in Rust, and in practice it took a couple of hours and we didn't need to touch it much over the rest of the weekend. (Rado had never written Rust before but via Copilot managed to add some of the opcodes; more on this below!) For server communication we just used shell scripts driving curl, which was also nice because it made it easy to archive the request/response text.
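For flavor, here is a minimal sketch of the kind of AST such an evaluator works over (illustrative only; Var and Str line up with the code-generation snippet later in this post, and the real thing has more operators plus the ASCII encoder/decoder):

// Sketch of an AST for the contest's mini-language.
enum Exp {
    Str(String),                       // string literal, e.g. "Hello"
    Int(i64),                          // integer literal
    Var(u32),                          // variable reference, e.g. v2
    Lambda(u32, Box<Exp>),             // L#: lambda binding a variable
    Apply(Box<Exp>, Box<Exp>),         // B$: application
    Concat(Box<Exp>, Box<Exp>),        // B.: string concatenation
    If(Box<Exp>, Box<Exp>, Box<Exp>),  // ?: conditional
}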
Once we were able to evaluate the first strings we unlocked the real puzzles. Beyond the hello-world evaluator checks there were four classes of problems that were the bulk of the contest. (In the below I have links to the problem descriptions which were posted after the end of the contest; during the contest we were unlocking these as we went and each was a surprise.)
In the first, "lambdaman", the puzzles were simple mazes and the task was to submit strings that were a sequence of up/down/left/right moves that visited every open cell. Rado wrote the necessary graph code (TypeScript) to solve these; I never looked at it. For these puzzles your score was the total size of the ICFP program you submitted (e.g. a string like the B$ B$ ... string in the example code at the top), not the string it evaluated to; for example, the lambdaman6 solution was just to move right 199 times, which is the string "R" repeated 199 times, but a program that loops to generate that string is smaller than sending the actual string. But even looping in this mini functional language required paying the byte cost of the Y combinator; other solutions were smaller. I had a fun time building out my support for writing and measuring programs in the language.
I also built a string compressor, which took our solution strings, generated a recursive compression dictionary, then generated programs that were able to unpack them. This was super neat in that I was generating a custom program per target and the objective was minimizing the total size of the program plus its data, which meant instead of shipping an actual decompression dictionary I generated unpacking logic directly.
// Code to generate decompression dictionary.
// - Var(1) is the lambda of the recursive self call;
// - Var(3) is the first letter of the input string;
// - Var(4) is the remainder of the input string.

// Base case: concat(letter, decompress(rest of compressed data)).
let mut decoder = concat(Var(3), apply(Var(1), Var(4)));
for (c, exp) in dictionary {
    // If letter is in compression dictionary,
    decoder = if_(
        bin(Eq, Var(3), Str(c)),
        // decompress(expansion prepended to compressed data)
        apply(Var(1), concat(Str(exp), Var(4))),
        // Else chain to other if expressions and base case
        decoder,
    );
}
In the second, "spaceship", the puzzles were traces of a spaceship's position and the task was to submit thrust changes to match the trace. Again, Rado solved this puzzle with some code that I didn't look at. It seemed like kind of a neat traveling-salesman-esque problem where you controlled the derivative of the position rather than the position itself, and I wish I had time to look into it more. I thought to shovel his answers through the above compressor, but for this one I realized too late that the program size didn't matter, bummer.
The third was "3d" and it was pretty amazing. The tasks were to write programs in a two dimensional programming language where data flowed around a grid. The language was very weak so even implementing simple functions like "absolute value" was tricky, but it also had a "time travel" operator which let you transport data back to a previous grid state which could be used for looping. For this task I built a web UI that let us interactively write programs. (Demo link of Rado's solution for isPrime, put in a value in the A box to see it go; solution is decent but not contest-winning!)
This probably wasn't absolutely necessary — I saw others wrote solutions just directly in a text editor — but I love making these simple web UIs and like to think my ability to crank these out is one of my relative strengths as a programmer.
For this task the score was based on the dimensions of your solution (grid width x grid height x time used) and so it became a 2d code golf competition, and I like to think my editor let us iterate quickly on new ideas. I had an awesome time thinking through and experimenting with how to reduce our solutions. The hardest problems in this set were quite complex and I thought about building a custom assembler to let us generate these grids. It sounds like some other contestants did this, but it also appears people were solving even the hardest problems by hand. For golf reasons those hand-made solutions likely won, so we're probably lucky I didn't try.
The final problem set was named "efficiency" and the task was simply to evaluate progressively more complex programs. My trivial evaluator crashed even on relatively simple inputs though so we (which really means "just Rado") instead dumped the programs as s-expressions and looked at the code directly. This turned out to be exactly the right approach, as the most challenging problems here were designed to be impossible to evaluate; afterwards the organizers admitted the intent was for us to read the code.
Rado used a SAT solver to solve one. I myself have never used a SAT solver and it's a pretty impressive tool to be able to pull out of your programmer bag when necessary. Another highlight here was application of AI: given our fake-lisp s-expressions, it turns out Rado could hand them to Claude and ask it to translate them to scheme, and Claude was even clever enough to be able to say unprompted "by the way, this program is a sudoku solver"!
Coincidentally I recently tried out GitHub's Copilot for the first time and I have been completely blown away by it. I had arrogantly assumed that the kinds of code I was writing would be "too weird" for it, but it has no problems with following along and making useful and correct suggestions as I have been writing e.g. Rust procedural macros that are code generating x86 assembly. For the contest, Copilot was similarly incredible, even providing useful suggestions when writing the above snippet that is generating ASTs in an invented language!?
It is interesting to me to consider the fate of programming competitions in our new AI world. For the spaceship problems we realized late that you could take the input traces of coordinates and hand them to Claude with the instruction to plot the spaceship's trajectory. This kind of exploratory data analysis is useful professionally and in decades past I have intentionally cultivated my ability to wield R for it, but AI systems make it so much easier to explore new data that I'm still trying to internalize using them in the way I approach problems.
The (only two) ICFP contests I've participated in are fascinating in that you aren't sure up front where the difficulty of the problem will lie. For the 2006 one my team lost time by initially trying to build infrastructure to debug the programs in the minilanguage, only to discover that for that problem set (at least in my recollection) the minilanguage only mattered as a way of transporting tasks, and once you could evaluate you didn't need any of it. For this year's problem set, in retrospect, it again is a balance between diving into solving exactly what's in front of you vs trying to feel out what will eventually be necessary.
Relative to my 2006 experience we did a better job this time of trying to make forward progress quickly, but I think we again got ahead of ourselves in the solution complexity. For example, for the spaceship problems, had we visualized the data in Claude we would have realized that some of the initial problems were trivial to solve without doing any sort of graph trick. But then also, similarly, by looking at the data from the advanced problems we could've better predicted what sorts of graph tricks we'd need. From that perspective, looking back, one thing I would do differently is invest a bit more up-front time peering at the shape of the input data, probably using an AI to assist.
In all, I had a really great time this year working on this. I am very grateful to the organizers for setting it up, to my family for covering most of the child care, and to Rado for pulling me along!