Xe Iaso

Senior Technophilosopher, Ottawa, CAN, a speaker, writer, chaos magician, and committed technologist.

Blog of Xe Iaso

Life pro tip: get your flu shot

2025-02-17 08:00:00

I had put off getting my flu shot. I seriously regret this. Over the weekend, I've been in a blur of sleep and wakefulness as I suffered through the flu. Get your flu shot. This year's flu is brutal.

"No way to prevent this" say users of only language where this regularly happens

2025-02-05 08:00:00

In the hours following the release of CVE-2025-0725 for the project curl, site reliability workers and systems administrators scrambled to desperately rebuild and patch all their systems to fix a buffer overflow involving malformed gzip streams when you are using a 21 year old version of zlib. This is due to the affected components being written in C, the only programming language where these vulnerabilities regularly happen. "This was a terrible tragedy, but sometimes these things just happen and there's nothing anyone can do to stop them," said programmer Miss Josianne Wisozk, echoing statements expressed by hundreds of thousands of programmers who use the only language where 90% of the world's memory safety vulnerabilities have occurred in the last 50 years, and whose projects are 20 times more likely to have security vulnerabilities. "It's a shame, but what can we do? There really isn't anything we can do to prevent memory safety vulnerabilities from happening if the programmer doesn't want to write their code in a robust manner." At press time, users of the only programming language in the world where these vulnerabilities regularly happen once or twice per quarter for the last eight years were referring to themselves and their situation as "helpless."

Life pro tip: How to run a GUI application in the foreground with PowerShell

2025-02-04 08:00:00

Among other things, I am the systems administrator for the home network and all of its services. My husband wanted to run the dedicated server for a game on the homelab but was having some trouble setting it up. Eventually we gave up and rented a Windows VPS in the cloud.

I wanted a script that I could leave running that updates the dedicated server program, starts the server, and then updates and restarts it whenever it crashes. In order to do this in PowerShell, you need a script like this:

Set-Location -Path C:\Games\Motortown

while ($true) {
  # Update the dedicated server files first.
  & C:\Users\Administrator\Software\steamcmd.exe blah blah

  # Start the GUI server process and block until it exits (-Wait), so the
  # loop re-runs steamcmd and restarts the server after a crash.
  Start-Process `
    -FilePath C:\Games\Motortown\server.exe `
    -Wait `
    -NoNewWindow `
    -ArgumentList "Jeju_World?listen?","-server","-log","-useperfthreads","-Port=7777","-QueryPort=27015"
}

The important part is the Start-Process cmdlet with the -Wait flag; this is what lets you start a GUI program and block execution until the program exits. If you don't do this, the script continues to the next line while the program starts in the background. This makes steamcmd very unhappy.

In order to pass command-line arguments to the process, you need to use the -ArgumentList flag with the arguments for the program. This is similar to how you would pass arguments to a program in a shell script:

some-program arg1 arg2 arg3
        

Except in PowerShell it looks worse:

Start-Process -FilePath some-program.exe -ArgumentList "arg1","arg2","arg3"
        

I hope this helps you.

The gist of s1: simple test time scaling

2025-02-03 08:00:00

Recently a paper came across my desk: s1: Simple test-time scaling. The paper is about a technique to improve the output of reasoning models that is so hilariously simple that when I read it, I had to laugh, try it out, and then write it up here. In order to understand what's going on, let's talk about the problem that the paper is trying to solve.

The problem

There are two kinds of AI models: the kind that starts replying instantly when you ask it why the sky is blue, and the kind that takes a moment to consider what you send it before it starts replying "out loud". The latter kind of model is more accurate and better able to handle complex questions or math problems. The problem is that it's slower and can sometimes stop reasoning too early.

Cadey: In AI jargon, "test-time" refers to when the model is running and you are asking it questions. The model is being "tested", so to speak.

So you can get a better idea of what's going on, here's what the output of a reasoning model looks like:

Aoi: How many r's are in the word raspberry?

Mimi (thought for 9 seconds):

Okay, let's see. The question is asking how many times the letter "r" appears in the word "raspberry." Hmm, I need to make sure I spell the word correctly first. Raspberry. R-A-S-P-B-E-R-R-Y. Wait, let me break it down letter by letter.

Starting with the first letter: R. That's an "r" right there. Then A comes next. S follows after that. Then P. After P is B, then E, then R, another R, and finally Y. So let me count the "r"s. The first letter is R, that's one. Then towards the end, after B and E, there are two R's in a row. So that's R at the beginning and then two R's towards the end. Wait, but let me make sure I'm not missing any in between. The letters are R, A, S, P, B, E, R, R, Y. So yes, the first letter is R, then after B and E, there are two R's. So total of three R's? Wait, no. Wait, let me check again. Let's write out the letters one by one:

  1. R
  2. A
  3. S
  4. P
  5. B
  6. E
  7. R
  8. R
  9. Y

So positions 1, 7, and 8 are R's. That's three R's in total. But wait, when I say the word "raspberry," sometimes people might misspell it or mispronounce it. Like, maybe there's a common mistake where people think there's only one R in the middle. But according to the correct spelling, it's R-A-S-P-B-E-R-R-Y. So yes, two R's after the E. So that would make three R's in total. Let me confirm by writing the word out: raspberry. R (1), A, S, P, B, E, R (2), R (3), Y. Yep, three R's. So the answer should be 3.

The word "raspberry" contains 3 instances of the letter "r". Breaking it down:

  1. R (first letter)
  2. R (7th letter)
  3. R (8th letter)

Thus, the total number of "r"s is 3.

Generated by DeepSeek R1

Yes, this kind of question really is the kind of thing that people use as a benchmark for state of the art AI systems. Our industry is deeply silly.

The token-level output

Under the hood, the output of the model looks kinda like this:

<|Assistant|>
<think>
Okay, let's see. The question is asking how many times the letter "r"
appears in the word "raspberry." Hmm, I need to make sure I spell the word
correctly first. Raspberry. R-A-S-P-B-E-R-R-Y. Wait, let me break it down
letter by letter.
[...]
</think>

The word "raspberry" contains **3** instances of the letter "r".
Breaking it down:

1. **R** (first letter)
2. **R** (7th letter)
3. **R** (8th letter)

Thus, the total number of "r"s is **3**.
<|end▁of▁sentence|>

Everything inside those <think> and </think> tags is the model "thinking to itself" about the question. My friends and I think that this helps the model get a better answer because it breaks down the question from complicated sentences that the model may not be trained on into axioms that it is trained on. I don't have any proof for this other than vibes, but I think it's at least tangentially related to the truth.

Either way, the key insight here is that in many cases the model stops thinking "too soon", and with a little more effort it could get the right answer. The researchers proposed a flow like this:

  1. Configure the runtime to consider </think> tags as "stop tokens". Stop tokens are tokens that tell the runtime to stop executing and return the output up to that point.
  2. Remove the </think> tags from the output if they're present.
  3. Append a newline and the word "Wait" to the end of the output.
  4. Run the model again on the output from step 3.
  5. Repeat steps 2-4 for a few iterations. This is called the "reasoning effort".

This process allows a model to double-check its answer, which can usually make it realize errors in its reasoning and correct them. The researchers found that this caused a significant score increase on the MATH500 and AIME24 benchmarks, which contain competition-level math problems.
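
To make that loop concrete, here is a minimal sketch of the idea in Go, written against a generic OpenAI-compatible /v1/completions endpoint. The endpoint URL, the model name, the token budgets, and the DeepSeek-R1-style prompt template are all placeholder assumptions for illustration; this is the shape of the technique, not the paper's actual code.

// budgetforce.go: a minimal sketch of the s1-style "keep thinking" loop.
// Assumptions: an OpenAI-compatible /v1/completions endpoint at baseURL,
// a prompt template like the token-level output shown above, and a
// placeholder model name.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

type completionRequest struct {
	Model     string   `json:"model"`
	Prompt    string   `json:"prompt"`
	MaxTokens int      `json:"max_tokens"`
	Stop      []string `json:"stop,omitempty"` // step 1: treat </think> as a stop token
}

type completionResponse struct {
	Choices []struct {
		Text string `json:"text"`
	} `json:"choices"`
}

func complete(baseURL string, req completionRequest) (string, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return "", err
	}
	resp, err := http.Post(baseURL+"/v1/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var cr completionResponse
	if err := json.NewDecoder(resp.Body).Decode(&cr); err != nil {
		return "", err
	}
	if len(cr.Choices) == 0 {
		return "", fmt.Errorf("no completion choices returned")
	}
	return cr.Choices[0].Text, nil
}

func main() {
	const baseURL = "http://localhost:8000" // assumed local inference runtime
	const model = "some-reasoning-model"    // placeholder model name
	const reasoningEffort = 3               // how many "Wait" continuations to force

	prompt := "<|User|>How many r's are in the word raspberry?<|Assistant|><think>\n"

	for i := 0; i < reasoningEffort; i++ {
		out, err := complete(baseURL, completionRequest{
			Model:     model,
			Prompt:    prompt,
			MaxTokens: 1024,
			Stop:      []string{"</think>"},
		})
		if err != nil {
			panic(err)
		}
		// Steps 2-4: strip any </think> that slipped through, append "Wait",
		// and feed the whole thing back in so the model keeps reasoning.
		prompt += strings.TrimSuffix(strings.TrimSpace(out), "</think>") + "\nWait"
	}

	// Close the thinking block ourselves and let the model write its final answer.
	answer, err := complete(baseURL, completionRequest{
		Model:     model,
		Prompt:    prompt + "\n</think>\n",
		MaxTokens: 512,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(answer)
}

Each extra iteration is one more full model call, which is exactly where the latency-versus-accuracy trade-off discussed below comes from.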

My thoughts

In my opinion, this is really neat because it allows you to have a "reasoning effort" parameter that you can change to influence the results. This allows you to trade off between latency and accuracy. I suspect that there's a "Ballmer Peak" of reasoning effort where more reasoning effort doesn't actually improve the results, but I don't have any data to back that up yet.

The other cool part is that you can use this to make a reasoning model support structured outputs when you're using a runtime that doesn't have custom grammar support. Have the model yap for your desired reasoning effort score, and then in the final output you can manually add the </think> token to the end of the reasoning output and have the model continue on from there in JSON.

I did some testing with the Ollama structured outputs API and managed to get the model to output more detailed information about Canada. My code is in Xe/structured-reasoning on GitHub.

Here's the output my naïve implementation got me:

{
  "countryName": "Canada",
  "isoCode": "CAN",
  "capital": "Ottawa",
  "languages": ["English", "French", "Inuktitut", "Cree", "Ojibwe"],
  "historicalEvents": [
    {
      "name": "Confederation of Canada",
      "date": "1867-07-01",
      "description": "The British North America Act establishes the Dominion of Canada, uniting three colonies into a single country."
    },
    {
      "name": "World War II Mobilization",
      "date": "1939-09-10",
      "description": "Canada declares war on Nazi Germany, contributing significantly to Allied efforts in Europe and elsewhere."
    },
    {
      "name": "Statute of Westminster",
      "date": "1931-12-11",
      "description": "This act grants Canada and other Dominions full autonomy within the British Commonwealth, solidifying its independent status."
    },
    {
      "name": "Battle of Vimy Ridge",
      "date": "1917-04-09",
      "description": "A pivotal World War I battle where Canadian forces captured a heavily fortified German position, marking a defining moment in Canadian military history."
    },
    {
      "name": "Discovery of Insulin",
      "date": "1921-05-01",
      "description": "Frederick Banting and Charles Best at the University of Toronto discover insulin, revolutionizing diabetes treatment worldwide."
    }
  ]
}

This is really cool, and I'm going to be using it in my upcoming research projects. I kinda wish I had access to more chonky GPUs locally so that I could iterate on this faster, but I wasn't able to get a 5090, and those have high failure rates, so maybe it's for the best.

Nomadic Infrastructure Design for AI workloads

2025-01-27 08:00:00

How do you design a production-ready AI system to maximize effectiveness per dollar? How do you manage and reduce dependency lock-in? Moreover, how do you separate concerns between your compute, network, and storage? In this talk I'll be covering all of that and showing you how to design a production-worthy AI setup that lets you be nomadic between providers, hunting down deals as easily as possible.

Video

Want to watch this in your video player of choice? Take this:
https://cdn.xeiaso.net/file/christine-static/talks/2025/nomadic-compute/index.m3u8

Transcript

Cadey: This is spoken word. It is not written like I write blogposts. It is reproduced here for your convenience.

The title slide of the talk. It shows the speaker name and the title.

Hi, I'm Xe. I work at Tigris Data, and I'm going to talk about the concept of nomadic infrastructure design for your AI workloads.

This is not a product demo.

But disclaimer, this is not a product demo.

(Audience cheers)

This is thought leadership, which is a kind of product, I guess.

The three parts of a workload: compute, network, and storage.

A workload has three basic parts. Compute, network, and storage. Compute is the part that does the number crunching or the linear algebra. The network is what connects all our computers together. It's why we have to update everything every fifth femtosecond. And storage is what remembers things for next time.

This is what you're billed on over time.

As I've been messing with new providers and trying to find cheap hacks to get my AI stuff working at absurdly low prices, I found a really weird thing.

Compute time is cheaper than storage time.

Compute time is cheaper than storage time.

I don't know why this is the case. With Vast.ai, RunPod, and all these bid-acquired GPU markets, spending time downloading things is cheaper than storing them for the next run.

Pricing details for a random 4090 in South Carolina.

Like, look at this. I selected a 4090 in South Carolina at random. It costs two pennies per hour to run with 50 GB of local storage. Keeping that data around is one penny per hour. That's half of the price of the instance. Sure, there's probably some...creative financial decisions that go into pricing things like this.

But if it takes 30 seconds to boot it and it costs like two cents an hour, it costs more to store things than it does to not store things. Really weird thing to think about.

How to cheat at infrastructure design.

So let's learn how to cheat at infrastructure design and find out why I am not allowed to be an SRE anymore. Asterisk.

A graph of Bluesky user activity.

So, the first thing that you can do is scale to zero, because people don't use workloads when they're asleep. This graph has a sinusoidal wave, and it's from Bluesky when they blew up late last year. There's a peak in the middle of the American daytime, and then it all goes down to very low as the Americans go to sleep.

If you've ever worked in SRE stuff, you see this all the time. This is what your request rate looks like. This is what your active user count looks like. This is what healthy products look like. So if you just make your service turn off when nobody's using it, you already save 12 hours of runtime per day.

A green-haired anime woman immolating money and laughing.

Like, remember, it may not be your money, but money is expensive now. The free tier is going to end. At some point, the hype will die out and the price of compute will reflect the price of acquiring the hardware.

Your AI workloads are dependencies. Without those workloads, your product is doomed. Those who control the infrastructure spice, control the infrastructure universe or whatever Frank Herbert said in Dune.

Tradeoffs

The tradeoffs.

So when you're cheating, it's all about making trade-offs. There are several factors that come into mind, but in my view, the biggest one is time because that's what you're billed on.

A list of the steps involved in a cold start of an AI workload.

Specifically, cold start time or the time that it takes to go from the service not running to the service running. Here's an example of all of the steps involved in running a workload on some cloud provider somewhere.

Statistically, Docker is the universal package format of the internet. It's going to be in a Docker image that has to be pulled and video stuff is like gigabytes of random C++ libraries and a whole bunch of bytecode for GPUs that you don't have, but has to ship around anyway because who knows, you might run it on a 2060.

That gets pulled, extracted, it gets started. Your app boots up, realizes, "Oh, I don't have any models. I need to pull them down."

And then that time that it takes from pulling the models to loading the models is time where you're on the clock doing nothing useful. But once you get to the point where the models are loaded, you can inference them, do whatever it is and somehow make profit. But everything above that inference model step is effectively wasted time.

Depending on the platform you're using, this can cost you money doing nothing.

A perfectly normal drawing of Sonic the Hedgehog.

How can we make it fast? How can we give our infrastructure Sanic speed? Users don't care if you're trying to cheap out. They care about responsiveness. There's two ways to handle this and both are different ways of cheating.

Batch operations.

One of the biggest ways to cheat is to make your workloads happen on a regular basis where you can do a whole bunch of stuff en masse. This is called batch operations. This is how the US financial system works. This is a horrifying thing. You bundle everything up into big batches and do them every 6, 12, 24 hours, whatever father time says you should do.

This is great. Let's say you have a wallpaper-of-the-day app and you want every wallpaper generated by AI for some reason. Statistically, if it's the wallpaper of the day, you don't need to run it more than once a day. So you can just have a cron job start it up, generate the wallpaper, and put it into storage somewhere. Mark it as ready for the world after it passes some basic filtering. Bob's your uncle, you're good.

This lets you run the most expensive part of your app for pennies on the dollar using any model that you want that you have the bytes for, so your upstream infrastructure provider can't say, "Oh, we're going to turn off the model you're using. Good luck!"

Speed up downloads.

But the other way to cheat is to speed up the cold start process. Let's look at that list again.

Another copy of the list of cold start operations.

Pulling models is the slowest part because that's usually done by your Python program and Python is still single threaded in Anno Dominium two thousand and twenty-five. Your app has to sit there doing nothing waiting for the model to pull and get ready. This can take minutes if you're unlucky and take tens of minutes if you're really unlucky.

What if you could cheat by doing it in a phase where you're not billed? You could just put it into the Docker image with the runtime, right? So I did this and to my horror, it worked kind of well.

There's just like many problems.

Docker hates this

Number one, Docker hates this. Docker absolutely despises this because the way that Docker works is that it's a bunch of tar balls in a trench coat, right? In order to pull a Docker image, you have to extract all the tar balls. It can only extract one of the tar balls at once because tar balls are weird.

And if you have a Flux dev, that's like a 12 billion parameter model. So we're talking about like 26 gigabytes of floating point numbers, including the model, the autoencoder and whatever else it has.

This isn't time you have to pay for, but it is time that users may notice. But we're cheating, so you could just do it for batch operations.

If you want to do this anyways, here's a trick I learned:

Model weights don't change often. So what you can do is you can make a separate Docker image that has all of the model weights and then link those model weights into your runtime image.

FROM anu-registry.fly.dev/models/waifuwave AS models

FROM anu-registry.fly.dev/runners/comfyui:latest

COPY --link --from=models /opt/comfyui/models/checkpoints /opt/comfyui/models/checkpoints
COPY --link --from=models /opt/comfyui/models/embeddings /opt/comfyui/models/embeddings
COPY --link --from=models /opt/comfyui/models/loras /opt/comfyui/models/loras
COPY --link --from=models /opt/comfyui/models/vae /opt/comfyui/models/vae
        
Aoi: This works. I'm horrified.

You get to reuse these models between images, because if you have a base Stable Diffusion checkpoint and each LoRA in a separate layer, you can just have those be there in the image by default. And if you need to download a separate LoRA, you can do that at runtime and only have to download like 150 megs instead of like 5 gigs. That's a lot faster.

And you can also reuse them between projects or workloads, which might be preferable depending on what you're doing.

The Docker Hub hates this

The other big problem when you're doing this is that the Docker Hub will not allow it. It has a maximum layer size of like 10 gigabytes and a maximum image size of 10 gigabytes, and my test image that uses Stable Diffusion 1.5 from 2023 is an 11 gigabyte image.

GitHub's container registry barely tolerated it. I had to use my own registry. It's not that hard. Registries are basically asset flipping S3, and I work for a company that is basically S3. So this is easy to do and I can tell you how to do it after the talk. I have stickers.

The upsides of doing this

But the biggest upside of doing this horrific, horrific crime is that your one deploy artifact has both your application code and your weights. This is something that doesn't sound like a big advantage until you've had your model get removed from Hugging Face or Civitai, and then you have a production incident that you can't easily resolve because nobody has the model cached.

Numa: Ask me how I know.

The two of them meme edited to be 'one of them'

And because there's just one of them, you don't have multiple artifacts to wrangle. You don't have to like have extra logic to download weights. It's amazing how much code you don't have to write when you don't have to write it.

The Nomadic Compute cover image having a robot hunting down deals

But this is the key idea in a nomadic compute setup. Your workload ships with everything it needs so that it can start up quickly, head out to hunt whatever deals it can, get the job done and then head back to the cave to slumber or something. The metaphor fell apart. I'm sorry.

You also don't need to be beholden to any cloud provider, because if you can execute AMD64 code, you have an Nvidia GPU, and there's a modern-ish version of CUDA, it doesn't matter. Everything else is fungible. The only way that you'd really be locked in is if you're using local storage, and remember, we're trying to save money. So we're not.

So you can just use tools like SkyPilot. It just works.

Live demo

Okay, so let's tempt God.

I am very good at web design, so this is an HTML 1.0 form. My demo is a button on a page, and if you click the button, you get anime women:

A profile shot of a brown-haired anime woman looking up to the sky, made with Counterfeit v3.0

See, that was hallucinated by a GPU that spun up on demand, and it'll shut down when we're done. I'm glad that worked.

List of special thanks

Special thanks to all these people. You know what you did if you're on this list. You know what you didn't if you're not.

Final slide with Xe's contact info

And with that, I've been Xe. If you have any questions, please ask. I don't bite.

GHSA-56w8-8ppj-2p4f: Bot protection bypass in Anubis

2025-01-26 08:00:00

Hey all. I screwed up with part of how I made Anubis, and as a result I have both fixed it and am filing this CVE to explain what went wrong and how it was fixed. This is GHSA-56w8-8ppj-2p4f.

This requires a sophisticated attacker to target a server running Anubis. I suspect that the only instances of this in the wild were the ones done by the reporter as a proof of concept and in my testing.

Vulnerability details

These details have been copied from GHSA-56w8-8ppj-2p4f.

CVSS score: 2.3 (CVSS:4.0/AV:N/AC:H/AT:N/PR:L/UI:N/VC:L/VI:N/VA:N/SC:N/SI:N/SA:N)

Weakness: CWE-807: Reliance on Untrusted Inputs in a Security Decision

Vulnerable version: anything older than v1.11.0-37-gd98d70a

Patched version: v1.11.0-37-gd98d70a and newer

Context

Anubis is a tool that allows administrators to protect websites against AI scrapers through bot-checking heuristics and a proof-of-work challenge to discourage scraping from multiple IP addresses. For more information about Anubis, see Anubis' README.md.

Impact

A sophisticated attacker (or scraper runner) that is targeting a website that uses Anubis can easily bypass the bot protection mechanisms.

This requires a targeted attack.

Patches

Pull the most recent Docker image in order to be sure you have upgraded past commit e09d0226a628f04b1d80fd83bee777894a45cd02.

Workarounds

There are no known workarounds at this time. Users must upgrade to fix this issue.

Details

Anubis works by having a client request a challenge value with a given difficulty; the client then performs proof-of-work to create a SHA-256 hash matching that difficulty. Before commit e09d0226a628f04b1d80fd83bee777894a45cd02, the client sent the difficulty it used back to the server, and the server used that untrusted value to make an allow/deny decision.

This has been fixed by using the difficulty value set by the administrator in Anubis' configuration flags when making said allow/deny decisions.
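
To make the fix concrete, here is a small, self-contained Go sketch of what the corrected check looks like when the difficulty comes from configuration instead of the request. This is illustrative only, not Anubis' actual code, and it assumes the common convention that difficulty means the number of leading zero hex digits in the hash.

// powcheck.go: an illustrative sketch of the corrected check, not Anubis' code.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// sha256sum hashes the challenge concatenated with the nonce, mirroring the
// shape of the sha256sum helper used in the regression test later in this post.
func sha256sum(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// validResponse verifies the proof of work against the difficulty from the
// administrator's configuration. The client may still send a difficulty field,
// but it is never consulted here; trusting it was the whole bug.
func validResponse(challenge string, nonce int, response string, configuredDifficulty int) bool {
	expected := sha256sum(fmt.Sprintf("%s%d", challenge, nonce))
	if expected != response {
		return false
	}
	// Assumed convention: difficulty is the number of leading zero hex digits.
	return strings.HasPrefix(expected, strings.Repeat("0", configuredDifficulty))
}

func main() {
	challenge := "example-challenge-value"
	nonce := 42069
	response := sha256sum(fmt.Sprintf("%s%d", challenge, nonce))

	// A client that claims Difficulty: 0 gains nothing: the server only
	// consults its own configured difficulty (4 here), so this is rejected
	// unless the hash really does start with four zeroes.
	fmt.Println(validResponse(challenge, nonce, response, 4))
}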

GReeTZ

Thank you Coral Pink for reporting this issue.

The Techaro security issue reporting policy

At Techaro, we believe in total honesty in how we handle security issues. We try our best to not make vulnerable code, but inevitably we will mess up and do it by accident. When we do, we will be transparent, honest, and high-signal, and we will handle the situation like professional adults. We will value the time of security researchers.

At times, we will fail at this mission. The real thing we are measuring is not the number of times that it happens, but how we react when it does happen. This is why we are openly and honestly reporting this issue.

When things do fail, we will create regression tests to ensure that those failures do not repeat themselves. The testing for Anubis is currently private, but in the interest of transparency here is the test that we added to that repo to handle this regression:

func TestFakeChallengeDifficulty(t *testing.T) {
	cli, err := anubis.New(*testServerURL)
	if err != nil {
		t.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	chall, err := cli.MakeChallenge(ctx)
	if err != nil {
		t.Fatal(err)
	}

	nonce := 42069

	response, err := sha256sum(fmt.Sprintf("%s%d", chall.Challenge, nonce))
	if err != nil {
		t.Fatal(err)
	}

	if err := cli.PassChallenge(ctx, anubis.PassChallengeRequest{
		Response:    response,
		Nonce:       nonce,
		Redir:       "https://xeiaso.net",
		ElapsedTime: 420,
		Difficulty:  0,
	}); err != nil {
		sce, ok := err.(*anubis.StatusCodeErr)
		if !ok {
			t.Fatal(err)
		}
		if sce.Got != http.StatusForbidden {
			t.Fatalf("wrong status code, should have forbidden auth bypass: want: %d, got: %d", sce.Want, sce.Got)
		}
	}
	return
}

Thank you for following the development of Anubis.