2025-04-25 21:00:00
At work we've recently been using zstd as a better-compressing alternative to gzip, and overall I've been pretty happy with it. A minor documentation gripe, though, is that the behavior around multithreaded compression is a bit unclear. I understand it chunks the work and sends the chunks to different threads to parallelize compression, which means I should expect better use of threads on larger files because there are more chunks to spread around. But what exactly is the relationship?
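For concreteness, the invocations in question look something like this (`big_file` is a stand-in; `-T0` asks for one worker thread per detected core):

```sh
# Multithreaded compression at level 15, one worker per core.
zstd -T0 -15 big_file
```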
When I look in `man zstd` I see that you can set `-B<num>` to specify the size of the chunks, and it's documented as "generally `4 * windowSize`". Except the documentation doesn't say how `windowSize` is set.
From a bit of poking at the source, it looks to me like the way this works is that `windowSize` is `2**windowLog`, and `windowLog` depends on your compression level. If I know I'm doing `zstd -15`, though, how does `compressionLevel=15` translate into a value for `windowLog`? There's a table in `lib/compress/clevels.h` which covers inputs >256KB:
Level | windowLog | chainLog | hashLog | searchLog | minMatch | targetLength | strategy |
---|---|---|---|---|---|---|---|
<1 | 19 | 12 | 13 | 1 | 6 | 1 | fast |
1 | 19 | 13 | 14 | 1 | 7 | 0 | fast |
2 | 20 | 15 | 16 | 1 | 6 | 0 | fast |
3 | 21 | 16 | 17 | 1 | 5 | 0 | dfast |
4 | 21 | 18 | 18 | 1 | 5 | 0 | dfast |
5 | 21 | 18 | 19 | 3 | 5 | 2 | greedy |
6 | 21 | 18 | 19 | 3 | 5 | 4 | lazy |
7 | 21 | 19 | 20 | 4 | 5 | 8 | lazy |
8 | 21 | 19 | 20 | 4 | 5 | 16 | lazy2 |
9 | 22 | 20 | 21 | 4 | 5 | 16 | lazy2 |
10 | 22 | 21 | 22 | 5 | 5 | 16 | lazy2 |
11 | 22 | 21 | 22 | 6 | 5 | 16 | lazy2 |
12 | 22 | 22 | 23 | 6 | 5 | 32 | lazy2 |
13 | 22 | 22 | 22 | 4 | 5 | 32 | btlazy2 |
14 | 22 | 22 | 23 | 5 | 5 | 32 | btlazy2 |
15 | 22 | 23 | 23 | 6 | 5 | 32 | btlazy2 |
16 | 22 | 22 | 22 | 5 | 5 | 48 | btopt |
17 | 23 | 23 | 22 | 5 | 4 | 64 | btopt |
18 | 23 | 23 | 22 | 6 | 3 | 64 | btultra |
19 | 23 | 24 | 22 | 7 | 3 | 256 | btultra2 |
20 | 25 | 25 | 23 | 7 | 3 | 256 | btultra2 |
21 | 26 | 26 | 24 | 7 | 3 | 512 | btultra2 |
22 | 27 | 27 | 25 | 9 | 3 | 999 | btultra2 |
See the source if you're interested in other sizes.
So it looks like `windowSize` is:

- ≤1: 524k
- 2: 1M
- 3-8 (default): 2M
- 9-16: 4M
- 17-19: 8M
- 20: 32M
- 21: 64M
- 22: 128M
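Combined with the documented default job size of `4 * windowSize`, that gives the relationship I was originally after: higher levels mean bigger windows, bigger jobs, and therefore fewer jobs to spread across threads. A quick back-of-envelope check for a hypothetical 1 GiB input:

```sh
# zstd -15: windowLog 22 => 4 MiB window => 16 MiB default jobs
echo $(( (1 << 30) / (4 << 22) ))   # => 64 jobs
# zstd -19: windowLog 23 => 8 MiB window => 32 MiB default jobs
echo $(( (1 << 30) / (4 << 23) ))   # => 32 jobs
```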
Probably best not to rely on any of this, but it's good to know what `zstd -<level>` is doing by default!
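And if you don't want to rely on the defaults, you can pin these parameters explicitly. A sketch (`big_file` is again a stand-in; check `man zstd` for the flags on your version):

```sh
# Explicit job size: -B takes a byte count (16777216 = 16 MiB here),
# decoupling job size from the level's default of 4 * windowSize.
zstd -15 -T4 -B16777216 big_file

# Or set windowLog directly via the advanced --zstd= syntax
# (wlog=23 means windowSize = 2^23 bytes = 8 MiB).
zstd -15 -T4 --zstd=wlog=23 big_file
```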
2025-04-20 21:00:00
The Effective Altruism community has encouraged a range of different approaches to doing good over time. Initially there was more focus on frugality as a way to increase how much you could donate, which was mostly supplanted by emphasis on earning more. In late 2015 this started to shift towards doing things that are directly useful, which accelerated in 2021. Then the market fell in 2022, FTX turned out to be a fraud, and there haven't been new donors near the scale of Open Phil / Good Ventures. Among many changes, people are thinking more about frugality again: the less you can live on, the more you can stretch a given amount of funding. [1]
To encourage myself to live more frugally and to give an example of what I thought was a pretty fulfilling life at relatively low cost for the US, I used to calculate numbers for how much we spent on ourselves. This included housing, food, transportation, medical, etc., but not donations, taxes, or savings. At one point there were some news stories comparing our spending to our income, and it was nice to have a simple number to point at.
I was thinking it might be nice to start calculating these numbers again, but when I looked back at why I stopped it's mostly that it's actually a pretty tricky accounting question and I'm not sure there are ways to draw the lines that make much sense. For example:
One of the main things I do for fun is play music. This costs some money (instruments, kids coming with me to gigs, fun things while traveling) and also earns some money. How should I account for this? At one extreme I could say that income is income and expenses are "spending on ourselves", but this doesn't match reality well: there shouldn't be a difference between playing a dance weekend that pays $1,000 with reimbursed travel and one that pays $1,500 but requires me to spend $500 on flights. At the other extreme I could look at the whole activity on net, and subtract expenses from income, but should what's essentially a family vacation tacked onto a gig really not be "spending on ourselves"? In between I could count this the way the IRS does (an approach I think is a good fit for determining income for pledging purposes) but this is also not great. For example, some portion of my new keyboard should probably be "spending on ourselves" since I was motivated in part by a desire to enjoy playing a nicer instrument and have less hassle in gig packing. And in the other direction, if I pay $200 in childcare to take a $125 gig, the IRS doesn't count the childcare against income at all, but I think $75 would be closer to what most people would consider "spending on ourselves".
When I last calculated these I didn't include expenses paid by our employers: as someone earning to give my employer gave us much nicer health insurance (and meals, and working conditions) than we would have bought for ourselves, and at least that excess portion doesn't seem like it's "spending on ourselves". Now that I'm doing directly valuable work and my frugality is more driven by a desire to extend runway for my project, however, if my employer is paying a lot for my health insurance that affects runway same as any other expense.
I would live in Boston regardless, but if someone who would otherwise work remotely in a low cost of living area decided their highest-impact option was to move here to work at the NAO, I wouldn't want to count at least some portion of their increased living expenses. Similarly, if it made sense for us to move to the Bay Area (please no) for our work, I'm not sure how I would want to count the increased housing (and other) costs.
If things started going poorly with childcare or school and one of us went down to part time, perhaps this is part-time childcare paid in kind, imputing both income and expense? But you get weird results either way: if you don't do this, foregone childcare means "spending on ourselves" goes down in a somewhat misleading way, while if you count all the time we spend taking care of the kids as implied income+expense you get very large numbers. And in between I don't really see a principled reason to count this only for the delta between a normal work week and the reduced hours.
I think part of why this doesn't feel very coherent is that I'm trying to get "spending on ourselves" to do too much. It can't both be what people naturally understand the term to mean (even ignoring that this isn't all that consistent) and a good number to optimize for maximizing altruistic impact.
So I don't think I'm going to try to go back to calculating a number here, and instead I'll stick with sharing spending updates every couple years.
[1] Prompted by some observations a friend recently posted, but not linking since it was friends-only.
2025-04-19 21:00:00
Cross-posted from my NAO Notebook.
This is an internal strategy note I wrote in November 2024 that I'm making public with some light editing.
In my work at the NAO I've been thinking about what I expect to see as LLMs continue to become more capable and get closer to where they can significantly accelerate their own development. I think we may see very large advances in the power of these systems over the next few years.
I'd previously thought that the main impact of AI on the NAO was through accelerating potential adversaries, and so shorter timelines primarily meant more urgency: we needed to get a comprehensive detection system in place quickly.
I now think, however, that this also means the best response involves some reprioritization. Specifically, AI will likely speed up some aspects of the creation of a detection system more than others, and so to the extent that we expect rapid advances in AI we should prioritize the work that we expect to bottleneck our future AI-accelerated work.
One way to plan for this is to imagine what would be the main bottlenecks if we had a far larger staff. Imagine if each senior person had AI support equivalent to all the smart junior people they could effectively manage. Or even (but my argument doesn't depend on this) AI systems that are as capable as today's experienced researchers. I think if in a year or two we found ourselves in this situation we would wish that:
- We had collected a lot more data, because with a very large virtual computational staff future-AI-assisted-NAO can wring insights out of data far more efficiently than present-NAO.
- We had started large-scale collection sooner, because, even if AI accelerates sequencing's price decreases and we can collect a lot more data in the future, it can't give us historical data.
- We had a lot more partnerships for bringing in samples and data, because these take real-human time to scale up.
While I don't think this is the only way things could play out, I think it's likely enough that we should be taking these considerations very seriously in our planning.
April 2025: since initially drafting this we've started an ambitious effort to scale up our pilot system.
2025-04-18 21:00:00
As an American who works with some people who speak British English, the language differences are usually not a problem. Most words mean the same thing, and those that don't are usually concrete enough not to cause confusion (ex: lift, flat, chips). The tricky ones, though, are the ones that differ primarily in connotations. For example:
In American English (AE), "quite" is an intensifier, while in British English (BE) it's a mild deintensifier. So "quite good" is "very good" in AE but "somewhat good" in BE. I think "rather" works similarly, though it's less common in AE and I don't have a great sense for it.
"Scheme" has connotations of deviousness in AE, but is neutral in BE. Describing a plans or system as a "scheme" is common in BE and negative in AE.
"Graft" implies corruption in AE but hard work in BE.
These can cause silent misunderstandings where two people have very different ideas about the other's view:
A: "I can't believe how much graft there was in the procurement process!"
B: "Yes, quite impressive. Rather keen on going above and beyond, aren't they?"
A: "And did you see the pension scheme they set up?"
B: "Sounds like they'll be quite well off when they'll leave office."
In this example A leaves thinking B approves of the corruption, while B doesn't realize there was any. It could be a long time, if ever, before they realize they misunderstood each other.
Are there other words people have run into that differ like this?
2025-04-17 21:00:00
I do a lot of work on EC2, where I ssh into a few instances I use for specific purposes. Each time I did this I'd get a prompt like:
```
$ ssh_ec2nf
The authenticity of host 'ec2-54-224-39-217.compute-1.amazonaws.com (54.224.39.217)' can't be established.
ED25519 key fingerprint is SHA256:...
This host key is known by the following other names/addresses:
    ~/.ssh/known_hosts:591: ec2-18-208-226-191.compute-1.amazonaws.com
    ~/.ssh/known_hosts:594: ec2-54-162-24-54.compute-1.amazonaws.com
    ~/.ssh/known_hosts:595: ec2-54-92-171-153.compute-1.amazonaws.com
    ~/.ssh/known_hosts:596: ec2-3-88-72-156.compute-1.amazonaws.com
    ~/.ssh/known_hosts:598: ec2-3-82-12-101.compute-1.amazonaws.com
    ~/.ssh/known_hosts:600: ec2-3-94-81-150.compute-1.amazonaws.com
    ~/.ssh/known_hosts:601: ec2-18-234-179-96.compute-1.amazonaws.com
    ~/.ssh/known_hosts:602: ec2-18-232-154-156.compute-1.amazonaws.com
    (185 additional names omitted)
Are you sure you want to continue connecting (yes/no/[fingerprint])?
```
The issue is that each time I start my instance it gets a new hostname (which is just derived from the IP) and so SSH's trust on first use doesn't work properly.
Checking that "185 additional names omitted" is about the number I'd expect to see is ok, but not great. And it delays login.
I figured out how to fix this today:

1. Edit `~/.ssh/known_hosts` to add an entry for each EC2 host I use under my alias for it. So I have `ec2-44-222-215-215.compute-1.amazonaws.com ssh-ed25519 AAAA...` and I duplicate that to add `ec2nf ssh-ed25519 AAAA...` etc. (One way to script this is sketched below.)

2. Modify my `ec2` ssh script to set `HostKeyAlias`:

   ```
   ssh -o "StrictHostKeyChecking=yes" -o "HostKeyAlias=ec2nf" ...
   ```
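For the first step, here's a sketch of scripting the alias entry (the hostname is whatever your instance currently has; note `ssh-keyscan` records whatever key the host presents, so it carries the same trust-on-first-use caveat as answering `yes` at the prompt):

```sh
# Fetch the instance's current ed25519 host key and record it in
# known_hosts under the stable alias rather than the throwaway hostname.
ssh-keyscan -t ed25519 ec2-54-224-39-217.compute-1.amazonaws.com \
  | sed 's/^[^ ]*/ec2nf/' >> ~/.ssh/known_hosts
```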
More secure and more convenient!
(What got me to fix this was an interaction with my auto-shutdown script, where if I did `start_ec2nf && sleep 20 && ssh_ec2nf` but then went and did something else for a minute or two, the machine would often turn itself off before I came back and got around to saying `yes`.)
2025-04-09 21:00:00
People are often a lot more interested in hot meals, and my kids are no exception. I've tried a bunch of options here including putting rocks in thermoses (turns out kindergarteners worry more than you might think about whether they'll accidentally eat rocks that are bigger than their mouths), bringing a microwave and toaster (good, but too bulky for school especially when you count the battery), and ramen (great, but Lily only likes one kind and I'm worried she'll get sick of it). We recently got an electric lunchbox (this one because it was on sale, but there are a bunch) and it's pretty great!
It's insulated, and we prepare it the night before and put it in the fridge.
In the morning I set the timer. It only goes up to 4hr, perhaps for danger zone reasons, but since the kids leave at 8am and lunch is at 12:30 that's just right (4hr timer, 30min heating).
When the timer gets to zero it starts heating, and counts down from 30min (by default; adjustable).
Tastes marginally better than microwave food, and loads better than cold food.
I haven't yet tested if it's powerful enough to handle frozen food. I hope it is, since then you could prepare it whenever and leave it in the freezer until ready to eat.
I also wonder if it could bake bread. It only goes up to 90C and it would be a bit like steamed bread, but freshness might outweigh texture here.