2025-03-06 14:42:25
Small experiments in the use of libliftoff to try out the modern Linux graphics stack drove home quite how slow DRM “dumb buffers” can be, but also that it’s reading that’s slow, not writing.
Reading from a “dumb buffer” on my AMD GPU is orders of magnitude slower than reading from RAM. It can take seconds to read out a full 4k frame. It’s roughly a thousand times slower than reading RAM.[1][2]
Writing, by contrast, is quick.
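For a sense of the scale involved, here is a rough sketch of the kind of measurement I mean (not the exact program I used; the device path and frame size are assumptions, error handling is minimal, and whether the mapping ends up write-combined is up to the driver): create a dumb buffer with the DRM ioctls, map it, and time a bulk memcpy in each direction.
/* Rough sketch: create a DRM "dumb buffer", map it, and time a bulk
 * memcpy in each direction. /dev/dri/card0 and 3840x2160x32bpp are
 * assumptions. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <drm/drm.h>

static double now(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
  int fd = open("/dev/dri/card0", O_RDWR);
  if (fd < 0) { perror("open"); return 1; }

  struct drm_mode_create_dumb create = { .width = 3840, .height = 2160, .bpp = 32 };
  if (ioctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &create) < 0) { perror("create"); return 1; }

  struct drm_mode_map_dumb map = { .handle = create.handle };
  if (ioctl(fd, DRM_IOCTL_MODE_MAP_DUMB, &map) < 0) { perror("map"); return 1; }

  uint8_t *dumb = mmap(NULL, create.size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, map.offset);
  if (dumb == MAP_FAILED) { perror("mmap"); return 1; }

  uint8_t *shadow = malloc(create.size);   /* ordinary RAM to copy to/from */

  double t0 = now();
  memcpy(dumb, shadow, create.size);       /* write: RAM -> dumb buffer */
  double t1 = now();
  memcpy(shadow, dumb, create.size);       /* read: dumb buffer -> RAM */
  double t2 = now();

  printf("write %.1f ms, read %.1f ms\n", (t1 - t0) * 1e3, (t2 - t1) * 1e3);
  return 0;
}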
While it is folklore that “dumb buffers are slow”, I found it challenging to find any authoritative source on the matter. However, I did find something. In /usr/include/drm/drm.h, we see the following comment, which sort of hints at the wider situation:
/**
* DRM_CAP_DUMB_PREFER_SHADOW
*
* If set to 1, the driver prefers userspace to render to a shadow buffer
* instead of directly rendering to a dumb buffer. For best speed, userspace
* should do streaming ordered memory copies into the dumb buffer and never
* read from it.
*
* Note that this preference only applies to dumb buffers, it's irrelevant for
* other types of buffers.
*/
#define DRM_CAP_DUMB_PREFER_SHADOW 0x4
Indeed, “for best speed […] never read from it.”
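Userspace can ask whether the driver would rather it rendered to a shadow buffer by querying this capability, along these lines (a sketch using libdrm’s drmGetCap; the device path is an assumption):
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <xf86drm.h>

int main(void) {
  int fd = open("/dev/dri/card0", O_RDWR);
  uint64_t prefer_shadow = 0;
  /* Non-zero means: render into a shadow buffer, then copy into the dumb buffer. */
  if (fd >= 0 && drmGetCap(fd, DRM_CAP_DUMB_PREFER_SHADOW, &prefer_shadow) == 0)
    printf("DRM_CAP_DUMB_PREFER_SHADOW = %llu\n", (unsigned long long)prefer_shadow);
  return 0;
}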
Update: Subsequent experimentation using gbm to allocate buffer objects shows that it doesn’t help if you need to read or write pixel data to them (as opposed to, presumably, using the GPU to render into them). Setting the GBM_BO_USE_WRITE flag when allocating a buffer object, to allow subsequent writing of pixel data, causes the dri backend of gbm to simply allocate a “dumb buffer”!
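The allocation path in question looks roughly like this (a sketch; the device node, size and format are arbitrary, and error handling is omitted):
#include <fcntl.h>
#include <gbm.h>

int main(void) {
  int fd = open("/dev/dri/card0", O_RDWR);
  struct gbm_device *dev = gbm_create_device(fd);

  /* GBM_BO_USE_WRITE permits uploading pixels with gbm_bo_write(); with
   * Mesa's dri backend this buffer object ends up backed by a dumb buffer. */
  struct gbm_bo *bo = gbm_bo_create(dev, 3840, 2160, GBM_FORMAT_XRGB8888,
                                    GBM_BO_USE_SCANOUT | GBM_BO_USE_WRITE);

  static unsigned char frame[3840 * 2160 * 4];  /* pixel data to upload */
  gbm_bo_write(bo, frame, sizeof frame);

  gbm_bo_destroy(bo);
  gbm_device_destroy(dev);
  return 0;
}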
[1] Quick-and-dirty C experimentation shows speeds of ~2ms to read a full 3840×2160×32bit frame out of normal RAM. That’s about 16GB/s. Eyeballing the slow “dumb buffer” read times suggests perhaps about 16MB/s for that!
[2] As a corollary to this realisation, I learned that attempting to use surfaces backed solely by “dumb buffers” to do fallback software composition is a losing proposition. Hence the whole idea of “shadow” buffers, presumably!
2024-10-03 02:12:02
“Erlang supports change of code in a running system.”
However, the details are a bit fiddly. Here’s a cheat-sheet I used recently for a simple TCP service written using Erlang.
My program was a single module, running outside of any OTP application context. The instructions here need minor emendation to either explicitly list modules to purge and reload or to discover all modules within a single application; see the places in server-reload below mentioning the atom my_server.
I did not use the -on_load() directive, because I wanted to be able to use multiple nodes rather than controlling reloads from a single node’s shell REPL, and I couldn’t figure out how to make the two play nicely together.
I exported a code_change/0 from my module, to be called after loading a new version of the module into a node. It sends a message code_change to each “global” actor in my program (in this case, there was only one).
-export([code_change/0]).

code_change() ->
    io:format("+ code_change~n"),
    %% name registered previously with `global:register_name/2`:
    global:send(name_of_my_global_actor, code_change),
    ok.
That actor distributes the notification on to any inferior actors it is managing, and then does an “MFA” self-call to upgrade its own codebase.
index(Connected) ->
    receive
        code_change ->
            [P ! code_change || {_Peer, P} <- Connected],
            ?MODULE:index(Connected);
        ...
    end.
Similarly, all other notified actors perform “MFA” self-calls.
connection(Sock, Username, IndexPid) ->
    receive
        code_change ->
            ?MODULE:connection(Sock, Username, IndexPid);
        ...
    end.
Actors need to take care to manage upgrades of their state at the same time as they do the “MFA” self-calls.
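For instance (a hypothetical sketch, not code from this program): if the new version of a module expects a richer state than the old one carried, the old code can translate the state at the point of the self-call.
%% Old version of the module: state is a bare counter.
loop(Count) when is_integer(Count) ->
    receive
        code_change ->
            %% The new module version expects a map, so convert the state
            %% as part of the fully-qualified ("MFA") self-call that picks
            %% up the freshly loaded code.
            ?MODULE:loop(#{count => Count, upgraded_at => erlang:monotonic_time()});
        {incr, From} ->
            From ! {ok, Count + 1},
            loop(Count + 1)
    end.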
I wanted it to be run by daemontools, so I created the following shell script called run, which daemontools will pick up to start a service:
#!/bin/sh
set -e
erlc -o ebin my_server.erl
exec erl \
    -noshell \
    -pa ebin \
    -sname mainnode \
    -setcookie f98b3a1e-80ec-11ef-b752-0b638e4de31c \
    -s my_server
Pick a fresh random cookie for the -setcookie argument. I used uuid(1).
Then, I created this script, server-reload:
#!/bin/sh
set -e
erlc -o ebin my_server.erl
exec erl \
    -noshell \
    -pa ebin \
    -setcookie f98b3a1e-80ec-11ef-b752-0b638e4de31c \
    -sname undefined \
    -eval "
      ServerNode = mainnode@$(hostname -s),
      io:format(\"ServerNode: ~p~n\", [ServerNode]),
      true = net_kernel:connect_node(ServerNode),
      spawn(ServerNode, fun () ->
          code:purge(my_server),
          code:load_file(my_server),
          ok = my_server:code_change()
      end),
      init:stop()"
Running server-reload causes the source code to be compiled and hot-loaded into the running server.
Then, I used a git post-receive hook to automatically recompile and reload the code on push to live:
#!/bin/sh
set -e
unset GIT_DIR
cd $HOME/location-of-checkout-of-server-repository
git pull --ff-only
./server-reload
That’s all. The end result worked well: I used it to run a hotfix to my TCP service with many tens of live, active connections, and not one of them noticed a thing.
2024-07-22 19:21:17
Back in June, I made a quick-and-dirty attempt to get the big-bang model of functional UI running in Processing 4.
Unfortunately Processing uses a dialect of Java predating the introduction of Java Records (JEP 395), so I, er, creatively broke out m4 as a preprocessor.
The resulting macros turn this:
_record(Rect extends Pict, {{float x, float y, float w, float h}}, {{
  public void render() {
    rectMode(CORNER);
    rect(this.x, this.y, this.w, this.h);
  }
}});
into this:
class Rect extends Pict {
  public final float x;
  public final float y;
  public final float w;
  public final float h;
  public Rect(float x, float y, float w, float h) {
    this.x = x;
    this.y = y;
    this.w = w;
    this.h = h;
  }
  public void render() {
    rectMode(CORNER);
    rect(this.x, this.y, this.w, this.h);
  }
};
Not yet properly factored out into a utility library or anything, just pasted straight at the top of the file. Shield your eyes!
/* -*- mode: java; c-basic-offset: 2 -*- */
changecom(`//')dnl
changequote(`{{',`}}')dnl
dnl);
define({{_record}}, {{class $1 {_record_fields($2,)
public _record_classname($1)($2) {_record_inits($2,)
}
$3dnl;
}{{}}}})dnl;
define({{_record_fields}}, {{ifelse({{$#}}, {{1}},, {{
public final $1;$0(shift($@))}})}})dnl;
define({{_record_inits}}, {{ifelse({{$#}}, {{1}},, {{
this._record_fieldname({{$1}}) = _record_fieldname({{$1}});$0(shift($@))}})}})dnl;
define({{_record_classname}}, {{regexp({{$1}}, {{^\(\w+\).*$}}, {{\1}})}})dnl;
define({{_record_fieldname}}, {{regexp({{$1}}, {{^.+\s\(\w+\)$}}, {{\1}})}})dnl;
dnl;//---------------------------------------------------------------------------
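To use it, the macro-laden source gets run through m4 before Processing sees the result; something along these lines (the filenames are hypothetical, since the build wiring isn’t described here):
# Hypothetical: expand the _record macros into plain Java/Processing source.
m4 MySketch.pde.m4 > MySketch.pde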
2024-07-21 21:16:19
I had a small insight yesterday while building a component for a small web app: the user interface for editing an incomplete value of sum type A+B needs to remember a product of inputs, 2×A×B, from the user:
A + B ⟿ 2 × A × B
This allows the user to ergonomically change their mind about whether they’re building an A or a B without losing partially constructed values.
More precisely, the UI for a value of type A+B needs in general to be able to remember and manipulate 2×(A+1)×(B+1):
A + B ⟿ 2 × (A+1) × (B+1)
The extra 1s allow for nulls, for temporarily missing but required values. You could similarly generalise to allow for temporarily invalid or unparseable values.
Consider UI for creating a new project in an IDE, with two available options: create a new local project, by simply creating a new directory, or clone an existing git repository.
data NewProject =
    Local { projectName :: String }
  | Clone { gitUrl :: String,
            credential :: String,
            projectName :: String }
Abstractly, this is roughly Str + Str×Str×Str.
The user interface for this will look something like this: [screenshot of a “new project” form offering the “local” and “clone” alternatives]
Here we see that while a value of type NewProject is being built, we need to remember four strings (abstractly, Str×Str×Str×Str), plus a boolean indicating whether we ultimately want a “local” or “clone” project type (abstractly, 2).
All told, that’s
Str + Str×Str×Str ⟿ 2 × Str×Str×Str×Str
which exactly fits the pattern of
A + B ⟿ 2 × A × B
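Concretely, the transient state behind the form might be represented like this (a sketch with invented field names; the original component isn’t shown here):
-- The "2": which alternative the user currently has selected.
data ProjectKind = LocalKind | CloneKind

-- The 2 × Str × Str × Str × Str of transient form state.
data NewProjectDraft = NewProjectDraft
  { draftKind       :: ProjectKind
  , draftLocalName  :: String       -- input for the Local alternative
  , draftGitUrl     :: String       -- inputs for the Clone alternative...
  , draftCredential :: String
  , draftCloneName  :: String
  }

-- Only when the user commits is the product collapsed back to the sum,
-- discarding whichever half of the input is no longer relevant.
finish :: NewProjectDraft -> NewProject
finish d =
  case draftKind d of
    LocalKind -> Local (draftLocalName d)
    CloneKind -> Clone (draftGitUrl d) (draftCredential d) (draftCloneName d)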
The translation can be applied recursively, but it (harmlessly) remembers slightly too much transient UI state,
A+(B+C) ⟿ 2 × A × (2 × B × C)
so perhaps it’s better to think about it applying directly to n-ary sums:
A+B+C ⟿ 3 × A × B × C
A+B+C+D ⟿ 4 × A × B × C × D
and so on.
2024-04-22 18:00:18
I noticed a bug in Guile 3.0.9’s aarch64 atomics handling, and found a couple of apparent solutions (1, 2), but one of them is weird enough for me to write this post.
(ETA: Nonstory. The problem was that the mov instruction isn’t idempotent! Hat tip to Andy Wingo for figuring out what the issue was. I’ve updated the rest of the article, and I’ll leave it here for posterity.)
Long story short, the problem was with the equivalent of C’s atomic_exchange. Here’s the code that Guile’s JIT was generating:
1:
        mov   x16, x0
        ldaxr x0, [x1]
        stlxr w17, x16, [x1]
        cbnz  w17, 1b
This code appears to occasionally lose writes (!). ETA: This code definitely loses
writes when interference means it has to go around the loop.
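For reference, the operation being open-coded is the equivalent of this C (a sketch; compilers targeting aarch64 without LSE typically expand atomic_exchange into an ldaxr/stlxr retry loop much like the corrected sequence shown below):
#include <stdatomic.h>
#include <stdint.h>

/* Atomically store new_value into *slot and return the previous contents:
 * the operation the JIT-generated sequence is trying to implement. */
uint64_t exchange(_Atomic uint64_t *slot, uint64_t new_value) {
  return atomic_exchange(slot, new_value);
}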
The first patch I wrote boringly replaced the lot with a single
swpal x0, x0, [x1]
which is fine if you have an ARM v8.1 device to hand, but not if you don’t have a machine with Large System Extensions. So I tried, on a hunch, the second patch, which just changed the target of the cbnz, producing code like this:
        mov   x16, x0
1:
        ldaxr x0, [x1]
        stlxr w17, x16, [x1]
        cbnz  w17, 1b
… and the issue disappeared! What! This shouldn’t have made a difference! Should it?
ETA: And fair enough, too! If the branch targets the mov instruction, the value of x0 that ldaxr set is used, meaning that the whole operation simply becomes a no-op assignment.
Are aarch64 atomics really this sensitive? Is there only One True Instruction Sequence that should be used to implement atomic_exchange? Why does making this seemingly-insignificant change produce such a noticeable effect? (ETA: Nothing to see here :-))
2024-01-24 22:09:24
As is well-known, JavaScript’s Promise is not a monad. It will happily treat Promise<Promise<T>> as if it was Promise<T>:
> [123, await Promise.resolve(123), await Promise.resolve(Promise.resolve(123))]
[ 123, 123, 123 ]
This can bite you in unexpected ways. Imagine you have a CSP-like Channel<T> class for sending Ts back and forth. Channel<T> might have a method like this:
async pop(): Promise<T | undefined> { ... }
There’s an obvious problem here: what if undefined ∈ T? So you make sure to note, in the comment attached to Channel<T>, that T is not allowed to include undefined.
But the less obvious problem is that T is not allowed to contain Promise<undefined> either, even though in other contexts a promise of undefined cannot be confused with undefined:
> typeof undefined
'undefined'
> typeof Promise.resolve(undefined)
'object'
To see why this is a problem, instantiate T with Promise<undefined>, and look at the type of pop():
Promise<Promise<undefined> | undefined>
Because JavaScript collapses promises-of-promises to just promises, this is equivalent to just
Promise<undefined>
and you’ve lost the ability to tell whether pop() yielded a T or an undefined.
TypeScript does not warn you about this, incidentally. (Ask me how I know.)
Instead of accepting this loss of structure and adding another caveat to Channel<T> to work around JavaScript’s broken design (“T must not include either undefined or Promise<undefined> or Promise<Promise<undefined>> etc.”), I decided to change the signature of pop():
async pop(): Promise<Maybe<T>> { ... }
type Maybe<T> = Just<T> | undefined;
type Just<T> = { item: T };
Now both Channel<undefined> and Channel<Promise<undefined>> are sensible and work as expected. No more exceptions regarding what Ts a Channel may carry.
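A small usage sketch of the new signature (the Channel internals aren’t shown here, so only its pop interface is assumed):
type Just<T> = { item: T };
type Maybe<T> = Just<T> | undefined;

interface Channel<T> {
  pop(): Promise<Maybe<T>>;
}

// Keep popping until pop() signals that there is no item. Even when T is
// Promise<undefined>, the Just wrapper keeps "an item" and "no item"
// distinguishable, because the promises are never immediately nested.
async function drain<T>(ch: Channel<T>): Promise<T[]> {
  const items: T[] = [];
  for (let m = await ch.pop(); m !== undefined; m = await ch.pop()) {
    items.push(m.item);
  }
  return items;
}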
When T is Promise<undefined>, in particular, we see that the type of pop() is
Promise<{ item: Promise<undefined> } | undefined>
Because the Promises aren’t immediately nested, JavaScript won’t erase our structure.
(Ironically, we’ve introduced a monad (Maybe<T>) to fix the bad behaviour of something that should have been a monad…)