2025-05-25 08:00:00
This post is part of a series on retrowin32.
The Rust compiler compiles code in parallel. But the unit of caching is the crate — a concept larger than a module, which corresponds maybe to a library in the C world or package in the JS world. A typical program is a single crate. This means every time you run the compiler, it compiles all the code from scratch. To improve build performance, you can split a program into multiple crates, under the hope that with each compile you can reuse the crates you didn't modify.
retrowin32 was already arranged as a few crates along some obvious boundaries. The x86 emulator, the win32 implementation, the native and web targets were each separate. But the win32 implementation and the underlying system were necessarily pretty tangled, because (among other things) x86 code calls win32 functions which might need to call back into x86 code.
This meant any change to the win32 implementation recompiled a significant quantity of code. This post is about how I managed to split things up further, with one crate per Windows library. retrowin32 now has crates like builtin-gdi32 and builtin-ddraw that implement those pieces of Windows, and they can now compile and cache in parallel (mostly).
Going in, there was a god object Machine that held both the CPU emulator (e.g. the state of the registers) and the rest of the system (e.g. memory and kernel state). When the Machine emulated its way to a win32 function call (as described in the syscalls post), it passed itself to the target, which allowed the implementation to poke at system state and potentially call back into further emulation.
For example, the Windows CreateWindow API creates a window, and as part of that process it synchronously "sends" the WM_CREATE message, which concretely means that within CreateWindow we invoke the window procedure and hand control back to the emulated code.
You cannot have cycles between crates, so this cycle meant we had to put Machine and all of the win32 implementation in one single crate. The fix, as with most computer science problems, is adding a layer of abstraction.
A new shared crate defines a System trait, which is the interface expressing "things from the underlying system that a win32 function implementation might need to call". This is then passed to win32 APIs and implemented by Machine, allowing us to compile win32 functions as separate crates, each depending only on the definition of System.
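A minimal sketch of the shape (the method and type names here are hypothetical, for illustration; the real retrowin32 trait exposes much more):

```rust
// Hypothetical minimal version of the shared interface crate;
// the real System trait has many more methods.
trait System {
    fn memory(&mut self) -> &mut Vec<u8>;
    // ...plus e.g. some way to call back into emulated user code.
}

// A win32 function implementation depends only on the trait...
fn write_byte(sys: &mut dyn System, addr: usize, val: u8) {
    sys.memory()[addr] = val;
}

// ...while the concrete Machine, in its own crate, implements it.
struct Machine {
    mem: Vec<u8>,
}

impl System for Machine {
    fn memory(&mut self) -> &mut Vec<u8> {
        &mut self.mem
    }
}
```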
One interesting consequence of this layout is that the win32 implementation no longer directly depends on any emulator at all, as long as the System interface exposes some way to invoke user code. You could imagine a hypothetical retrowin32 that runs on native 32-bit x86, or alternatively one that lets you port a Windows program you have the source for to a non-x86 platform, like winelib.
I mentioned above that Machine also holds system state. For example, gdi32 implements the drawing API, which provides functions that vend handles to device contexts. The new gdi32 library enabled by the System interface can declare what state it needs, but we must store that state somewhere.
Further, there are interdependencies between these various Windows libraries. user32, which handles windows and messaging, needs to use code from gdi32 to implement drawing upon windows. But the winmm crate, which implements audio, is independent of those.
One obvious way — the way I imagine it might work in real Windows — is for this state to be held in per-library static globals. I came up with a different solution that is a little strange so I thought I would write it down and see if any reader has a name for it or a better way.
To restate the problem: there's a core Machine type that depends on all the libraries and holds all the program state. But we want to be able to build each library independently, possibly with interdependencies between the libraries, without any of them depending on Machine itself.
The answer is for the ouroboros-breaking System trait to expose a dynamically-typed "get my state by its type" function:
fn state(&self, id: &std::any::TypeId) -> &dyn std::any::Any;
Each library, e.g. gdi32, can register its state (a gdi32::State, perhaps) and fetch it when needed from the system. This way a library like user32 can call gdi32, and both of them can access their own internal state off of the shared state object.
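As a sketch of what might sit behind that method (the StateStore name and details are my own, not retrowin32's actual code), the store is a map from each state's TypeId to a boxed instance:

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

// Hypothetical store behind the state() method: a map from each
// library's state type to a boxed, type-erased instance of it.
#[derive(Default)]
struct StateStore {
    map: HashMap<TypeId, Box<dyn Any>>,
}

impl StateStore {
    // Each library registers its state once, keyed by its type.
    fn register<T: Any>(&mut self, state: T) {
        self.map.insert(TypeId::of::<T>(), Box::new(state));
    }

    // Same shape as the trait method above; panics if unregistered.
    fn state(&self, id: &TypeId) -> &dyn Any {
        self.map[id].as_ref()
    }
}

// e.g. gdi32's private state, looked up by its type:
struct Gdi32State {
    next_handle: u32,
}
```

A caller then downcasts the returned &dyn Any back to the concrete type it asked for.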
It's maybe just a static with extra steps. I'm not sure yet if I like it.
Most of the win32 API is now in separate crates. (The remaining piece is kernel32, which is the lowest level and will need some more work to pull apart.)
Here's a waterfall of the part of the build that involves these separate crates:
Per xkcd this probably won't save me time overall, but at least I don't have to wait as long when I'm repeatedly cycling.
At the bottom you see the final win32 crate that ties everything together. This one is still too slow (possibly due to kernel32), but it's better than before!
2025-04-26 08:00:00
This post is part of a series on retrowin32.
Demoscene programs sometimes are packed with an executable compressor like UPX. An EXE packer ingests an EXE and emits a smaller one that does the same thing. This is useful for meeting demo competition file size limits. (Here's a few 4kb demos from last week. 4kb, including the music! Just the text of this blog post is nearly twice that!)
At a high level, a packed executable is a tiny program that, when run, uncompresses the real program into memory and then runs it. From the perspective of an emulator, packed executables are the same as regular executables; they are all just programs that eventually make some system calls.
But from the perspective of debugging what has gone wrong when running a packed executable, all of the interesting executable state — including even which system functions it calls — is only something that shows up after the executable has unpacked itself. For this reason it is useful to be able to unpack a packed executable, reconstructing the original program, which you can then feed into a program analyzer like Ghidra. (Tools like UPX even provide unpacking themselves, much like a zip utility also provides unzip.)
But what if your program isn't packed with UPX? Or (what I have encountered more) what if it's packed with an old enough version of UPX that current UPX will no longer unpack it?
An emulator can treat the unpacking logic as a black box and just execute it. If you pause the program right at the point after that, you could grab the in-memory unpacked state and dump it back into a new exe. This exe then looks like an ordinary program, ready for analysis.
I recently implemented just this idea in retrowin32 and successfully dumped at least one packed exe. This post talks about how it works.
The first step is figuring out the execution point to dump from. Ideally you'd grab the memory state after the executable has unpacked but before any of its main code runs, but where exactly is that?
A packed exe looks something like:
entry_point:
; ...some loop to unpack things here
jmp some_address ; jmp to the unpacked code
If you load a packed exe into a tool like Ghidra, the jmp looks like it's jumping to a garbage address, because the code on the other side of that jump only exists after the unpacking logic runs. I think you might be able to automatically spot it because of that.
For now, since I'm deep into analyzing these executables in the first place, suppose I just manually identify the some_address address that we want to break on.
The next problem is that you can't just set an ordinary breakpoint at that point. A software breakpoint works by patching over the target with an int3 instruction, but if we set one of those at startup it gets overwritten by the unpacking loop. So instead, I set a breakpoint at the last instruction of the unpacking code (the jmp in the above example) and then single-step once (to follow the jmp).
This also will work with another type of packer I've seen, which generates code like:
entry_point:
; ...some loop to unpack things here,
; including:
push some_address
; ...more asm
ret ; jmps to some_address
It's easier to set a breakpoint on the ret than it is to find which address it happens to jump to.
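In toy form, assuming a flat memory array (the real emulator's memory access goes through more machinery than this), the software-breakpoint mechanics look like:

```rust
// Toy model of a software breakpoint: patch int3 (0xCC) over the
// target, remembering the original byte so it can be restored
// when the breakpoint is hit.
struct Breakpoint {
    addr: usize,
    saved: u8,
}

fn set_breakpoint(mem: &mut [u8], addr: usize) -> Breakpoint {
    let saved = mem[addr];
    mem[addr] = 0xCC; // the int3 opcode
    Breakpoint { addr, saved }
}

fn clear_breakpoint(mem: &mut [u8], bp: &Breakpoint) {
    mem[bp.addr] = bp.saved;
}
```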
From that state I can then create a .exe file by dumping the current memory state, with the exe's entry_point properly set, and everything works... except for one important piece.
To minimize the file size, a packed executable only declares dependencies on a minimal set of system functions. The unpacking process decompresses a list of all the real dependencies of the underlying executable, and it dynamically loads those dependencies so that they can be called once the underlying executable starts.
For the purposes of offline static analysis, we need these dynamic dependencies resolved. We want to load our executable into Ghidra and have it understand which system functions it calls.
To understand how to manage this you need to understand a little bit about how reasonable programs resolve imports. I wrote earlier about this, including some pictures. I'll resummarize the broad description that's relevant here.
A regular program that calls system functions like GetStartupInfoA() will call via the Import Address Table ("IAT"). In assembly this looks like:
call [imp_GetStartupInfoA] ; where imp_GetStartupInfoA is a fixed address
That is, each call site says "call the address found at this fixed address, a point within the IAT". The executable loader reads a list of named imports (called the Import Directory Table ("IDT")) and fills in the IAT with the resolved addresses.
A packed executable's IDT is nearly empty because all the interesting work happens at unpack time. But the underlying executable that the packer started with had its own IAT; the packer fills it in as part of its unpacking. To reverse this, in our unpacked executable we need to reconstruct our own IDT that points at the IAT such that Ghidra believes it.
Where is that IAT? We don't know. From the emulator's perspective, the unpacking code does a bunch of grinding, eventually with a series of calls like:
hmodule = LoadLibrary("ddraw.dll")
func = GetProcAddress(hmodule, "DirectDrawCreate")
... stash func in the IAT of the in-memory unpacked executable
Unfortunately it's not obvious how to follow that "stash" operation. (In writing this post, I had the idea that maybe we could observe all memory writes?) But we can observe these calls to GetProcAddress and record which addresses we vended out, and then at dumping time we can scan the program's memory to find where those values ended up. (This is similar to how Scylla, an exe dumping tool, attempts to solve this problem. But Scylla doesn't get to cheat by using an emulator.)
We expect each value to show up in memory exactly once, in the unpacked executable's IAT. (What if a value happens to be found in memory more than once, due to accidentally colliding with some unrelated sequence of bytes? This hasn't come up yet, but one idea I had is that I could emulate the unpack sequence twice with different memory layouts, such that the GetProcAddress calls return different values, and then compare the two for overlap.)
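The scan itself is straightforward: look for the 4-byte little-endian encoding of each pointer value we vended out. A sketch (names mine, not retrowin32's actual code):

```rust
// Find every offset where a given 32-bit value appears in memory,
// interpreted little-endian (how x86 stores pointers). Each address
// vended by GetProcAddress should turn up exactly once, in the IAT.
fn find_value(mem: &[u8], value: u32) -> Vec<usize> {
    let needle = value.to_le_bytes();
    (0..mem.len().saturating_sub(3))
        .filter(|&i| mem[i..i + 4] == needle)
        .collect()
}
```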
Once we have the IAT address, we can construct an IDT that tells the loader which functions correspond to which addresses, and stuff that new IDT into our executable as additional data. With a few minor adjustments to the exe headers I now have an exe unpacker that llvm-objdump can understand, and which Ghidra renders complete with system function calls like:
00428c37 6a 05 PUSH 0x5
00428c39 50 PUSH EAX
00428c3a ff 15 b0 CALL dword ptr [->USER32.DLL::ShowWindow]
20 43 00
00428c40 ff 76 10 PUSH dword ptr [ESI + 0x10]
00428c43 ff 15 ac CALL dword ptr [->USER32.DLL::UpdateWindow]
20 43 00
2025-03-11 08:00:00
Rust makes a distinction between values and references that you generally learn to work with while using the language. This week I learned an interesting new corner around how that distinction applies to trait objects, despite using the language for quite a long time. Maybe it will surprise you too!
When you have some x: usize you can say x is the usize itself, while some y: &usize = &x is a reference to that value. y's concrete value is a pointer.
Similarly, the type [u8; 40] means 40 bytes itself. If you put it in a struct or pass it to a function, you're moving around 40 bytes.
Finally, the type [u8] means a series of bytes of (compile-time) unknown size. You don't interact with these often: because their size is unknown, you can't usually put them in a struct or pass them to a function. Instead you typically work with references to them as &[u8], which concretely are a pointer and a length.
(cheats.rs has nice pictures of this.)
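You can also see these representations directly by asking for the size of each type (the two-word slice reference holds on all current rustc targets, though the exact layout isn't formally guaranteed):

```rust
use std::mem::size_of;

// [u8; 40] is the bytes themselves; &[u8; 40] is a plain pointer;
// &[u8] is a "fat" pointer: a pointer plus a length.
fn sizes() -> (usize, usize, usize) {
    (
        size_of::<[u8; 40]>(),
        size_of::<&[u8; 40]>(),
        size_of::<&[u8]>(),
    )
}
```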
But you still do sometimes see [u8] as a type without a reference, in types like Box<[u8]>. And further, you can wrap a dynamically sized type in a struct as the last member, making that struct dynamically sized as well:
struct Test<X: ?Sized> {
inner: X,
}
// the type Test<[u8]> is now also dynamically sized
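As a concrete (if contrived) illustration, a Box of the sized version coerces to the dynamically sized one:

```rust
// Same struct as above, repeated so the snippet stands alone.
struct Test<X: ?Sized> {
    inner: X,
}

// A Box<Test<[u8; 3]>> coerces to Box<Test<[u8]>> at the return
// site, because [u8; 3] unsizes to [u8] in the last-field position.
fn make() -> Box<Test<[u8]>> {
    Box::new(Test { inner: [1u8, 2, 3] })
}
```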
The type File represents an open file. Concretely, it's a struct with some private fields.
The trait Read represents things that can be read from, with a .read() method. The trait implies a trait object type dyn Read, which is the type of things that implement the trait. File implements Read, so I will use File and Read for the following examples.
Concretely, the layout of a trait object dyn Read is the same as the underlying value it's wrapping, e.g. a File (spec ref). (This is possibly only useful to know as compiler trivia; even cheats.rs doesn't document this!) Much like [u8], because their concrete size is not known at compile time, you don't typically interact with these directly but instead via references.
The type &dyn Read is a reference to a trait object. Concretely it's a pointer to the object and a pointer to a static vtable of the methods for that type that implement the trait. (More pictures from cheats.rs.) Also like [u8], you might more often use Box<dyn Read>, which holds ownership over the underlying Read-able type.
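The two-pointer representation is visible with size_of, here using a stand-in trait so the snippet is self-contained:

```rust
use std::mem::size_of;

// Stand-in for Read/File, to keep the example dependency-free.
trait Readable {}
struct MyFile;
impl Readable for MyFile {}

fn ref_sizes() -> (usize, usize) {
    // &MyFile is one pointer; &dyn Readable is a value pointer
    // plus a vtable pointer.
    (size_of::<&MyFile>(), size_of::<&dyn Readable>())
}
```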
(It was useful for my understanding to contrast these with C++ objects and vtables. In C++, the vtable pointer is always embedded in the struct itself. In Rust, the struct never contains a vtable pointer. Instead the reference to the trait object is two pointers, to the value and the vtable.)
Though it's relatively rare in Rust, there are a few places where one type will silently convert to another. One you may have used without thinking is using a &mut T in a place that needs a &T.
When using trait objects there is another coercion:
let f: File = ...;
let r: &dyn Read = &f;
Here, &f is a &File, but the compiler converts it to &dyn Read.
Did you know that there is another trait-related coercion involving generics? Consider:
let f: File = ...;
let b: BufReader<File> = BufReader::new(f);
let r: &BufReader<dyn Read> = &b; // !!! legal
Here, the File inside the BufReader<> was able to coerce to a trait object. Concretely, r here is like a reference to a trait object, in that it is a pair of a pointer to the BufReader along with a pointer to a Read vtable. (Poking at the compiler, it uses the same Read vtable as you would get from a plain File.)
The underlying spec reference for why this is allowed, [coerce.unsized.composite], is pretty involved! But at a hand-wavy level it's allowed because BufReader<dyn Read> is a dynamically sized type where the last field is the place where the dyn Read is used. Note that, for example, you cannot have a similar coercion to two traits &SomeType<dyn A, dyn B>, because the reference can only carry a single vtable.
(How is this useful? It came up in some code I was reviewing from someone new to Rust; I'm not sure.)
(Bonus question: The above doesn't compile if you substitute Box or Rc for BufReader. Why not? Something about the CoerceUnsized impls on those? I don't know the answer.)
This does compile though:
let f: File = ...;
let b: Box<File> = Box::new(f);
let r: Box<dyn Read> = b; // consumes b
This is due to some magic traits implemented by Box (and also Rc, etc.).
I believe this behavior is the ultimate reason for these corners of support in the compiler. You sometimes want to be able to implicitly convert between a struct and a trait object, and you also sometimes want to be able to wrap things of unknown size (in e.g. Box or Rc), and then you want those two features to combine.
PS: Playground link, if you'd like to poke at it yourself.