Xe Iaso

Senior Technophilosopher, Ottawa, CAN, a speaker, writer, chaos magician, and committed technologist.

Follow me on Bluesky!

2024-11-13 08:00:00

Hey all!

I'm not going to be posting as much on Twitter/X anymore. I've moved a lot of my online posting to Bluesky. If you want to follow me there, follow @xeiaso.net. You can also follow me on Bluesky via the Fediverse with Bridgy Fed at @[email protected].

I've locked my Twitter account and will not be posting anything there but reminders that I have left. Thank you for following me there all these years, but enough has become enough and I have no real reason to stay there. Bluesky is just a better place for me.

Stay safe out there and have fun on the internets!

Nomadic Infrastructure Design for AI Workloads

2024-11-12 08:00:00

Taco Bell is a miracle of food preparation. They manage to have a menu of dozens of items that all boil down to permutations of the same handful of basic ingredients: meat, cheese, beans, vegetables, bread, and sauces. Those basic fundamentals are combined in new and interesting ways to give you the crunchwrap, the chalupa, the doritos locos tacos, and more. Just add hot water and they’re ready to eat.

Even though the results are exciting, the ingredients for them are not. They’re all really simple things. The best designed production systems I’ve ever used take the same basic idea: build exciting things out of boring components that are well understood across all facets of the industry (eg: S3, Postgres, HTTP, JSON, YAML, etc.). Those boring pieces add up to the exciting product that your pitch deck says is going to disrupt the industry-disrupting industry.

A bunch of companies want to sell you inference time for your AI workloads or the results of them inferencing AI workloads for you, but nobody really tells you how to make this yourself. That’s the special Mexican Pizza sauce that you can’t replicate at home no matter how much you want to be able to.

Today, we’ll cover how you, a random nerd that likes reading architectural articles, should design a production-ready AI system so that you can maximize effectiveness per dollar, reduce dependency lock-in, and separate concerns down to their cores. Buckle up, it’s gonna be a ride.

<Mara>

The industry uses like a billion different terms for “unit of compute that has access to a network connection and the ability to store things for some amount of time” that all conflict in mutually incompatible ways. When you read “workload”, you should think of some program that has access to some network and some amount of storage through some means, running somewhere, probably in a container.

The fundamentals of any workload

At the core, any workload (computer games, iPadOS apps, REST APIs, Kubernetes, $5 Hetzner VPSen, etc.) is a combination of three basic factors:

  • Compute, or the part that executes code and does math
  • Network, or the part that lets you dial and accept sockets
  • Storage, or the part that remembers things for next time

In reality, these things will overlap a little (compute has storage in the form of RAM, some network cards run their own Linux kernel, and storage is frequently accessed over the network), but that still very cleanly maps to the basic things that you’re billed for in the cloud:

  • Gigabyte-core-seconds of compute
  • Gigabytes egressed over the network
  • Gigabytes stored in persistent storage

And of course, there’s a huge money premium for any of this being involved in AI anything because people will pay. However, let’s take a look at that second basic thing you’re billed for a bit closer:

  • Gigabytes egressed over the network

Note that it’s egress out of your compute, not ingress to your compute. Providers generally want you to make it easy to put your data into their platform and harder to get the data back out. This is usually combined with your storage layer, which can make it annoying and expensive to deal with data that is bigger than your local disk. Your local disk is frequently way too small to store everything, so you have to make compromises.

What if your storage layer didn’t charge you per gigabyte of data you fetched out of it? What classes of problems would that allow you to solve that were previously too expensive to execute on?

If you put your storage in a service that is low-latency, close to your servers, and has no egress fees, then it can actually be cheaper to pull things from object storage just-in-time to use them than it is to store them persistently.

Storage that is left idle is more expensive than compute time

In serverless (Lambda) scenarios, most of the time your application is turned off. This is good. This is what you want. You want it to turn on when it’s needed, and turn back off when it’s not. When you do a setup like this, you also usually assume that the time it takes to do a cold start of the service is fast enough that the user doesn’t mind.

Let’s say that your AI app requires 16 gigabytes of local disk space for your Docker image with the inference engine and the downloaded model weights. In some clouds (such as Vast.ai), this can cost you upwards of $4-10 per month to have the data sitting there doing nothing, even if the actual compute time is as low as $0.99 per hour. If you’re using Flux [dev] (12 billion parameters, 25 GB of weight bytes) and those weights take 5 minutes to download, this means that you are only spending $0.12 waiting for things to download. If you’re only doing inference in bulk scenarios where latency doesn’t matter as much, then it can be much, much cheaper to dynamically mint new instances, download the model weights from object storage, do all of the inference you need, and then slay those instances off when you’re done.

Most of the time, any production workload’s request rate is going to follow a sinusoidal curve where there’s peak usage for about 8 hours in the middle of the day and things will fall off overnight as everyone goes to bed. If you spin up AI inference servers on demand following this curve, this means that the first person of the day to use an AI feature could have it take a bit longer for the server to get its coffee, but it’ll be hot’n’ready for the next user when they use that feature.

You can even cheat further with optional features such that the first user doesn’t actually see them, but it triggers the AI inference backend to wake up for the next request.

It may not be your money, but the amounts add up

When you set up cloud compute, it’s really easy to fall prey to the siren song of the seemingly bottomless budget of the corporate card. At a certain point, we all need to build a sustainable business as the AI hype wears off and the free tier ends. However, thanks to the idea of Taco Bell infrastructure design, you can reduce the risk of lock-in and increase flexibility between providers so you can lower your burn rate.

In many platforms, data ingress is free. Data egress is where they get you. It’s such a problem for businesses that the EU has had to step in and tell providers that people need an easy way out. Every gigabyte of data you put into those platforms is another $0.05 that it’ll cost to move away should you need to.

This doesn’t sound like an issue, because the CTO negotiating dream is that they’ll be able to play the “we’re gonna move our stuff elsewhere” card and instantly win a discount and get a fantastic deal that will enable future growth or whatever.

This is a nice dream.

In reality, the sales representative has a number in big red letters in front of them. This number is the amount of money it would cost for you to move your 3 petabytes of data off of their cloud. You both know you’re stuck with each other, and you’ll happily take an additional measly 5% discount on top of the 10% discount you negotiated last year. We all know that the actual cost of running the service is 15% of even that cost; but the capitalism machine has to eat somehow, right?

On the nature of dependencies

Let’s be real, dependencies aren’t fundamentally bad things to have. All of us have a hard dependency on the Internet, amd64 CPUs, water, and storage. Everything’s a tradeoff. The potentially harmful part comes in when your dependency locks you in so you can’t switch away easily.

This is normally pretty bad with traditional compute setups, but can be extra insidious with AI workloads. AI workloads make cloud companies staggering amounts of money, so they want to make sure that you keep your AI workloads on their servers as much as possible so they can extract as much revenue out of you as possible. Combine this with the big red number disadvantage in negotiations, and you can find yourself backed into a corner.

Strategic dependency choice

This is why picking your dependencies is such a huge thing to consider. There’s a lot to be said about choosing dependencies to minimize vendor lock-in, and that’s where the Taco Bell infrastructure philosophy comes in:

  • Trigger compute with HTTP requests that use well-defined schemata.
  • Find your target using DNS.
  • Store things you want to keep in Postgres or object storage.
  • Fetch things out of storage when you need them.
  • Mint new workers when there is work to be done.
  • Slay those workers off when they’re not needed anymore.

If you follow these rules, you can easily make your compute nomadic between services. Capitalize on things like Kubernetes (the universal API for cloud compute, as much as I hate that it won), and you make the underlying clouds an implementation detail that can be swapped out as you find better strategic partnerships that can offer you more than a measly 5% discount.

Just add water.

How AI models become dependencies

There's an extra evil way that AI models can become production-critical dependencies. Most of the time when you implement an application that uses an AI model, you end up encoding "workarounds" for the model into the prompts you use. This happens because AI models are fundamentally unpredictable and unreliable tools that sometimes give you the output you want. As a result though, changing out models sounds like it's something that should be easy. You just change out the model and then you can take advantage of better accuracy, new features like tool use, or JSON schema prompting, right?

In many cases, changing out a model will result in a service that superficially looks and functions the same. You give it a meeting transcript, it tells you what the action items are. The problem comes in with the subtle nuances of the je ne sais quoi of the experience. Even subtle differences like the current date being in the month of December can drastically change the quality of output. A recent paper from Apple concluded that adding superficial details that wouldn't throw off a human can severely impact the performance of large language models. Heck, they even struggle or fall prey to fairly trivial questions that humans find easy, such as:

  • How many r's are in the word "strawberry"?
  • What's heavier: 2 pounds of bricks, one pound of heavy strawberries, or three pounds of air?

If changing the placement of a comma in a prompt can cause such huge impacts to the user experience, what would changing the model do? What about being forced to change the model because the provider is deprecating it so they can run newer models that don't do the job as well as the one you currently use? This is a really evil kind of dependency that you can only get when you rely on cloud-hosted models. By controlling the weights and inference setups for your machines, you have a better chance of being able to dictate the future of your product and control all parts of the stack as much as possible.

How it’s made prod-ready

Like I said earlier, the three basic needs of any workload are compute, network, and storage. Production architectures usually have three basic planes to support them:

  • The compute plane, which is almost certainly going to be either Docker or Kubernetes somehow.
  • The network plane, which will be a Virtual Private Cloud (VPC) or overlay network that knits clusters together.
  • The storage plane, which is usually the annoying exercise left to the reader, leading you to make yet another case for either using NFS or sparkly NFS like Longhorn.

Storage is the sticky bit; it hasn’t really changed since the beginning. You either use a POSIX-compatible key-value store or an S3-compatible key-value store. Both are used in practically the same ways that the framers intended in the late '80s and 2009 respectively. You chuck bytes into the system with a name, and you get the bytes back when you give the name.
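
To make the “same API shape” point concrete, here’s a minimal sketch (hypothetical names, not from the original post) of the one interface both kinds of storage boil down to, with a POSIX-backed implementation; an S3-backed one would satisfy the same interface by wrapping PutObject/GetObject calls from whatever client you use:

// Sketch only: both "kinds" of storage reduce to the same name -> bytes contract.
package storage

import (
	"os"
	"path/filepath"
)

// Store is the whole contract: put bytes under a name, get them back later.
type Store interface {
	Put(name string, data []byte) error
	Get(name string) ([]byte, error)
}

// DirStore is the POSIX flavor: names are paths under a root directory.
type DirStore struct{ Root string }

func (d DirStore) Put(name string, data []byte) error {
	p := filepath.Join(d.Root, name)
	if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
		return err
	}
	return os.WriteFile(p, data, 0o644)
}

func (d DirStore) Get(name string) ([]byte, error) {
	return os.ReadFile(filepath.Join(d.Root, name))
}

// An S3 flavor would satisfy the same Store interface with an object storage
// client's put/get calls; the shape of the API doesn't change.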

Storage is the really important part of your workloads. Your phone would not be as useful if it didn’t remember your list of text messages when you rebooted it. Many applications also (reasonably) assume that storage always works, is fast enough that it’s not an issue, and is durable enough that they don’t have to manually make backups.

What about latency? Human reaction time is about 250 milliseconds on average. It takes about 250 milliseconds for a TCP session to be established between Berlin and us-east-1. If you move your compute between providers, is your storage plane also going to move data around to compensate?

If your storage plane doesn’t have egress costs and stores your data close to where it’s used, this eliminates a lot of local storage complexity, at the cost of additional compute time spent waiting to pull things and the network throughput for them to arrive. Somehow compute is cheaper than storage in anno dominium two-thousand twenty-four. No, I don’t get how that happened either.

Pass-by-reference semantics for the cloud

Part of the secret for how people make these production platforms is that they cheat: they don’t pass around values as much as possible. They pass a reference to that value in the storage plane. When you upload an image to the ChatGPT API to see if it’s a picture of a horse, you do a file upload call and then an inference call with the ID of that upload. This makes it easier to sling bytes around and overall makes things a lot more efficient at the design level. This is a lot like pass-by-reference semantics in programming languages like Java or a pointer to a value in Go.
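
Here’s a minimal sketch of that pattern with nothing but the standard library (the route names and in-memory map are hypothetical stand-ins, not anyone’s real API): one handler accepts the upload and returns an ID, and the inference handler takes the ID instead of the bytes.

package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"sync"
)

// uploads stands in for the storage plane; in real life this is a bucket,
// not a map in RAM.
var (
	mu      sync.Mutex
	uploads = map[string][]byte{}
)

// handleUpload takes the bytes once and hands back a reference (an ID).
func handleUpload(w http.ResponseWriter, r *http.Request) {
	data, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	id := hex.EncodeToString(buf)

	mu.Lock()
	uploads[id] = data
	mu.Unlock()

	fmt.Fprintln(w, id) // the caller keeps the reference, not the bytes
}

// handleInfer takes the reference and only looks up the data when it needs it.
func handleInfer(w http.ResponseWriter, r *http.Request) {
	id := r.URL.Query().Get("upload_id")

	mu.Lock()
	data, ok := uploads[id]
	mu.Unlock()
	if !ok {
		http.Error(w, "no such upload", http.StatusNotFound)
		return
	}

	// ...run the model against data here...
	fmt.Fprintf(w, "ran inference on %d bytes\n", len(data))
}

func main() {
	http.HandleFunc("/uploads", handleUpload)
	http.HandleFunc("/infer", handleInfer)
	http.ListenAndServe(":8080", nil)
}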

The big queue

The other big secret is that there’s a layer on top of all of the compute: an orchestrator with a queue.

This is the rest of the owl that nobody talks about. Just having compute, network, and storage is not good enough; there needs to be a layer on top that spreads the load between workers, intelligently minting and slaying them off as reality demands.
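
The core of that layer can be small. Here’s a hedged sketch (hypothetical, not from the post) of a queue feeding a pool of workers that grows when there’s a backlog and lets workers that sit idle too long slay themselves off:

package main

import (
	"fmt"
	"sync"
	"time"
)

// Job is whatever unit of work the orchestrator hands out.
type Job struct{ ID int }

// Pool spreads jobs across workers, minting more when the queue backs up.
type Pool struct {
	jobs    chan Job
	mu      sync.Mutex
	workers int
	max     int
}

func NewPool(max int) *Pool {
	return &Pool{jobs: make(chan Job, 64), max: max}
}

// Submit queues a job and grows the pool if there's a backlog.
func (p *Pool) Submit(j Job) {
	p.jobs <- j
	p.maybeGrow()
}

func (p *Pool) maybeGrow() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.jobs) > 0 && p.workers < p.max {
		p.workers++
		go p.worker(p.workers)
	}
}

func (p *Pool) worker(id int) {
	for {
		select {
		case j := <-p.jobs:
			// in real life: proxy this to a freshly minted GPU instance
			fmt.Printf("worker %d handling job %d\n", id, j.ID)
		case <-time.After(5 * time.Minute):
			// idle too long: shrink the pool
			p.mu.Lock()
			p.workers--
			p.mu.Unlock()
			return
		}
	}
}

func main() {
	p := NewPool(4)
	for i := 0; i < 10; i++ {
		p.Submit(Job{ID: i})
	}
	time.Sleep(time.Second) // give the demo workers a moment to drain the queue
}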

Okay but where’s the code?

Yeah, yeah, I get it, you want to see this live and in action. I don’t have an example totally ready yet, but in lieu of drawing the owl right now, I can tell you what you’d need in order to make it a reality on the cheap.

Let’s imagine that this is all done in one app, let’s call it orodayagzou (c.f. Ôrödyagzou, Ithkuil for “synesthesia”). This app is both a HTTP API and an orchestrator. It manages a pool of worker nodes that do the actual AI inferencing.

So let’s say a user submits a request asking for a picture of a horse. That’ll come in to the right HTTP route and it has logic like this:

type ScaleToZeroProxy struct {
	cfg         Config
	ready       bool
	endpointURL string
	instanceID  int
	lock        sync.RWMutex
	lastUsed    time.Time
}

func (s *ScaleToZeroProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	s.lock.RLock()
	ready := s.ready
	s.lock.RUnlock()

	if !ready {
		// TODO: implement instance creation
	}

	s.lock.RLock()
	u, err := url.Parse(s.endpointURL)
	s.lock.RUnlock()
	if err != nil {
		panic(err)
	}

	// NewSingleHostReverseProxy already carries the request's path and query
	// over to the target, so the endpoint URL only needs scheme, host, and port.
	next := httputil.NewSingleHostReverseProxy(u)
	next.ServeHTTP(w, r)

	// Record the last time this instance served a request so the reaper
	// loop knows when it's safe to slay it off. The read lock is released
	// above before taking the write lock here.
	s.lock.Lock()
	s.lastUsed = time.Now()
	s.lock.Unlock()
}

This is a simple little HTTP proxy in Go, it has an endpoint URL and an instance ID in memory, some logic to check if the instance is “ready”, and if it’s not then to create it. Let’s mint an instance using the Vast.ai CLI. First, some configuration:

const (
	diskNeeded       = 36
	dockerImage      = "reg.xeiaso.net/runner/sdxl-tigris:latest"
	httpPort         = 5000
	modelBucketName  = "ciphanubakfu" // lojban: test-number-bag
	modelPath        = "glides/ponyxl"
	onStartCommand   = "python -m cog.server.http"
	publicBucketName = "xe-flux"

	searchCaveats = `verified=False cuda_max_good>=12.1 gpu_ram>=12 num_gpus=1 inet_down>=450`

	// assume awsAccessKeyID, awsSecretAccessKey, awsRegion, and awsEndpointURLS3 exist
)

type Config struct {
	diskNeeded     int // gigabytes
	dockerImage    string
	environment    map[string]string
	httpPort       int
	onStartCommand string
}

Then we can search for potential machines with some terrible wrappers to the CLI:

func runJSON[T any](ctx context.Context, args ...any) (T, error) {
	return trivial.andThusAnExerciseForTheReader[T](ctx, args)
}

func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	s.lock.Lock()
	defer s.lock.Unlock()
	candidates, err := runJSON[[]vastai.SearchResponse](
		ctx,
		"vastai", "search", "offers",
		searchCaveats,
		"-o", "dph+", // sort by price (dollars per hour) increasing, cheapest option is first
		"--raw",      // output JSON
	)
	if err != nil {
		return fmt.Errorf("can't search for instances: %w", err)
	}

	// grab the cheapest option
	candidate := candidates[0]

	contractID := candidate.AskContractID
	slog.Info("found candidate instance",
		"contractID", contractID,
		"gpuName", candidate.GPUName,
		"cost", candidate.Search.TotalHour,
	)
	// ...
}

Then you can try to create it:

func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	// ...
	instanceData, err := runJSON[vastai.NewInstance](
		ctx,
		"vastai", "create", "instance",
		contractID,
		"--image", s.cfg.dockerImage,
		// dump ports and envvars into format vast.ai wants
		"--env", s.cfg.FormatEnvString(),
		"--disk", s.cfg.diskNeeded,
		"--onstart-cmd", s.cfg.onStartCommand,
		"--raw",
	)
	if err != nil {
		return fmt.Errorf("can't create new instance: %w", err)
	}

	slog.Info("created new instance", "instanceID", instanceData.NewContract)
	s.instanceID = instanceData.NewContract
	// ...

Then collect the endpoint URL:

func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	// ...
	instance, err := runJSON[vastai.Instance](
		ctx,
		"vastai", "show", "instance",
		instanceData.NewContract,
		"--raw",
	)
	if err != nil {
		return fmt.Errorf("can't show instance %d: %w", instanceData.NewContract, err)
	}

	s.endpointURL = fmt.Sprintf(
		"http://%s:%d",
		instance.PublicIPAddr,
		instance.Ports[fmt.Sprintf("%d/tcp", s.cfg.httpPort)][0].HostPort,
	)

	return nil
}

And then finally wire it up and have it test if the instance is ready somehow:

func (s *ScaleToZeroProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// ...

	if !ready {
		if err := s.mintInstance(r.Context()); err != nil {
			slog.Error("can't mint new instance", "err", err)
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		t := time.NewTicker(5 * time.Second)
		defer t.Stop()
		for range t.C {
			if ok := s.testReady(r.Context()); ok {
				break
			}
		}
	}

	// ...

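The testReady method is hand-waved here, so this is a minimal sketch of one way it could work, assuming the inference server answers plain HTTP on its root path (the exact health route you probe depends on whatever is in the container):

// testReady reports whether the minted instance is answering HTTP yet.
// The path to probe ("/") is a placeholder; use whatever health route the
// inference server exposes.
func (s *ScaleToZeroProxy) testReady(ctx context.Context) bool {
	s.lock.RLock()
	endpoint := s.endpointURL
	s.lock.RUnlock()

	if endpoint == "" {
		return false
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint+"/", nil)
	if err != nil {
		return false
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()

	// treat any non-5xx answer as "the instance is up"
	ready := resp.StatusCode < http.StatusInternalServerError
	if ready {
		s.lock.Lock()
		s.ready = true
		s.lock.Unlock()
	}
	return ready
}
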
Then the rest of the logic will run through, the request will be passed to the GPU instance and then a response will be fired. All that’s left is to slay the instances off when they’re unused for about 5 minutes:

func (s *ScaleToZeroProxy) maybeSlayLoop(ctx context.Context) {
	t := time.NewTicker(5 * time.Minute)
	defer t.Stop()

	for {
		select {
		case <-t.C:
			s.lock.RLock()
			lastUsed := s.lastUsed
			s.lock.RUnlock()

			if lastUsed.Add(5 * time.Minute).Before(time.Now()) {
				if err := s.slay(ctx); err != nil {
					slog.Error("can't slay instance", "err", err)
				}
			}
		case <-ctx.Done():
			return
		}
	}
}

Et voila! Run maybeSlayLoop in the background and implement the slay() method to use the vastai destroy instance command, then you have yourself nomadic compute that makes and destroys itself on demand to the lowest bidder.
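
For completeness, here’s a hedged sketch of what that slay() method could look like, shelling out to the vastai destroy instance command with os/exec (the exact flag spelling is worth double-checking against the CLI’s help output):

// slay tears down the current instance with the vastai CLI and marks the
// proxy as not ready so the next request mints a fresh one.
func (s *ScaleToZeroProxy) slay(ctx context.Context) error {
	s.lock.Lock()
	defer s.lock.Unlock()

	cmd := exec.CommandContext(ctx, "vastai", "destroy", "instance", strconv.Itoa(s.instanceID))
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("can't destroy instance %d: %w: %s", s.instanceID, err, out)
	}

	s.ready = false
	s.instanceID = 0
	s.endpointURL = ""
	return nil
}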

Of course, any production-ready implementation would have limits like “don’t have more than 20 workers” and segment things into multiple work queues. This is all really hypothetical right now; I wish I had something you could kubectl apply and use right now, but I don’t.

I’m going to be working on this on my Friday streams on Twitch until it’s done. I’m going to implement it from an empty folder and then work on making it a Kubernetes operator to run any task you want. It’s going to involve generative AI, API reverse engineering, eternal torment, and hopefully not getting banned from the providers I’m going to be using. It should be a blast!

Conclusion

Every workload involves compute, network, and storage on top of production’s compute plane, network plane, and storage plane. Design your production clusters to take advantage of very well-understood fundamentals like HTTP, queues, and object storage so that you can reduce your dependencies to the bare minimum. Make your app an orchestrator of vast amounts of cheap compute so you don’t need to pay for compute or storage that nobody is using while everyone is asleep.

This basic pattern is applicable to just about anything on any platform, not just AI and not just with Tigris. We hope that by publishing this architectural design, you’ll take it to heart when building your production workloads of the future so that we can all use the cloud responsibly. Certain parts of the economics of this pattern work best when you have free (or basically free) egress, though.

We’re excited about building the best possible storage layer based on the lessons learned building the storage layer Uber uses to service millions of rides per month. If you try us and disagree, that’s fine, we won’t nickel and dime you on the way out because we don’t charge egress costs.

When all of these concerns are made easier, all that’s left for you is to draw the rest of the owl and get out there disrupting industries.

Hello again, Kubernetes

2024-11-09 08:00:00

Previously on Xesite:

I think I made a mistake when I decided to put my cards into Kubernetes for my personal setup. It made sense at the time (I was trying to learn Kubernetes and I am cursed into learning by doing), however I don't think it is really the best choice available for my needs.

[...]

My Kubernetes setup is a money pit. I want to prioritize cost reduction as much as possible.

So after a few years of switching between a Hetzner dedi running NixOS and Docker images on Fly.io, I'm crawling back to Kubernetes for hosting my website. I'm not gonna lie, it will look like massive overkill from the outset, but consider this: Kubernetes is standard at this point. It's the boring, pragmatic choice.

<Cadey>

Plus, every massive infrastructure crime and the inevitable ways they go horribly wrong only really serves to create more "how I thought I was doing something good but actually really fucked everything up" posts that y'all seem to like. Win/win. I get to play with fun things, you get to read about why I thought something would work, how it actually works, and how you make things meet in the middle.

I've had a really good experience with Kubernetes in my homelab, and I feel confident enough in my understanding of it to move my most important, most used, most valuable to me service over to a Kubernetes cluster. I changed it over a few days ago without telling anyone (and deploying anything, just in case). Nothing went wrong in the initial testing, so I feel comfortable enough to talk about it now.

Aeacus

Hi from the cluster Aeacus! My website is running on a managed k3s cluster via Civo. The cluster is named after one of the space elevators in an RPG where a guy found a monolith in Kenya, realized it was functionally an infinite battery, made a massive mistake, and then ended up making Welsh catgirls real (among other things).

If/when I end up making other Kubernetes clusters in the cloud, they'll probably be named Rhadamanthus and Minos (the names of the other space elevators in said world with Welsh catgirls).

Originally I was going to go with Vultr, but then I did some math on the egress of my website vs the amount of bandwidth I'd get for the cluster and started to raise some eyebrows. I don't do terrifying amounts of egress bandwidth, but sometimes I have months where I'm way more popular than other months and those "good" months would push me over the edge.

I also got a warning from a friend that Vultr vastly oversubscribes their CPU cores, so you get very, very high levels of CPU steal. Most of the time, my CPU cores are either idle or very close to idle; but when I do a build for my website in prod, the entire website blocks until it's done.

This is not good for availability.

<Cadey>

When I spun up a test cluster on Vultr, I did notice that the k3s nodes they were using were based on Ubuntu 22.04 instead of 24.04. I get that 24.04 is kinda new and they haven't moved things over yet, but it was kind of a smell that something might be up.

I'm gonna admit, I hadn't heard of Civo cloud until someone in the Kubernetes homelab Discord told me about them, but there's one key thing in their pricing that made me really consider them:

At Civo, data transfer is completely free and unlimited - we do not charge for egress or ingress at all. Allowing you to move data freely between Civo and other platforms without any costs or limitations. No caveats, No fineprint. No surprise bills.

This is basically the entire thing that sold me. I've been really happy with Civo. I haven't had a need to rely on their customer support yet, but I'll report back should I need to.

Worst case, it's all just Kubernetes, I can set up a new cluster and move everything over without too much risk.

That being said, here's a short list of things that in a perfect world I wish I could either control, influence, or otherwise have power over:

  • I wish I could change the default cluster DNS name to aeacus.xeserv.us so that way the DNS names can be globally unique, enabling me to cross-cluster interconnect it with my homelab and potentially other clusters as my cloud needs expand.
  • I wish I could change the CIDR ranges for the Pod and Service network ranges so that they don't collide with the CIDR ranges for my homelab cluster. Maybe this is what 4via6 style routing is for?
  • I tried their Talos cluster option first but wasn't able to get HTTPS routing working, changing over to the k3s cluster option fixed everything. I'm not sure what's going on, will need to work with their community Slack to try and diagnose it further.
  • Civo is IPv4 only. I get why this is (IPv6 kinda sucks from a user education and systems administration standpoint), but I wish I had native dual-stack support on my cluster.

And here's a few things I learned about my setup in particular that aren't related to Civo cloud, but worth pointing out:

  • I tried to set up a service to point to both my homelab and Civo via external-dns, but it turns out external-dns doesn't support this kind of round-robin DNS configuration with multiple clusters and the issue tracking it has been through four generations of stalebot autoclosing the issue. I get why things like stalebot exist, but good god is it a pox on the industry.
  • With my homelab, I have Flannel as the Container Networking Interface (CNI). Vultr had Calico. Civo has Cilium. I realize that as far as I care it shouldn't matter that each of these clusters has a different CNI implementation, but I'm probably gonna have to take some action towards standardizing them in my setup. Might move the homelab over to Cilium or something. I don't know.

Either way, I moved over pronouns.within.lgbt to proof-of-concept the cluster beyond a hello world test deployment. That worked fine.

To be sure that things worked, I employed the industry standard "scream test" procedure where you do something that could break, test it to hell on your end, and see if anyone screams about it being down. Coincidentally, a friend was looking through it during the breaking part of the migration (despite my efforts to minimize the breakage) and noticed the downtime. They let me know immediately. I was so close to pulling it off without a hitch.

xesite and its infrastructure consequences have been a disaster for my wildest dreams of digital minimalism

Like any good abomination, my website has a fair number of moving parts, most of them are things that you don't see. Here's what the infrastructure of my website looks like:

A diagram showing how Xesite, Mi, Mimi, patreon-saasproxy, and a bunch of web services work together.

This looks like a lot, and frankly, it is a lot. Most of this functionality is optional and degrades cleanly too. By default, when I change anything on GitHub (or someone subscribes/unsubscribes on Patreon), I get a webhook that triggers the site to rebuild. The rebuild will trigger fetching data from Patreon, which may trigger fetching an updated token from patreon-saasproxy. Once the build is done, a request to announce new posts will be made to Mi. Mi will syndicate any new posts out to Bluesky, Mastodon, Discord, and IRC.

<Mara>

The pattern of publishing on your own site and then announcing those posts out elsewhere is known as POSSE (Publish On your Site, Syndicate Elsewhere). It's a pretty neat pattern!

This, sadly, is an idealized diagram of the world I wish I could have. Here's what the real state of the world looks like:

A diagram showing how Xesite relies on patreon-saasproxy hosted on fly.io.

I have patreon-saasproxy still hosted on fly.io. I'm not sure why the version on Aeacus doesn't work, but trying to use it makes it throw an error that I really don't expect to see:

{
  "time": "2024-11-09T09:12:17.76177-05:00",
  "level": "ERROR",
  "source": {
    "function": "main.main",
    "file": "/app/cmd/xesite/main.go",
    "line": 54
  },
  "msg": "can't create patreon client",
  "err": "The server could not verify that you are authorized to access the URL requested. You either supplied the wrong credentials (e.g. a bad password), or your browser doesn't understand how to supply the credentials required."
}

I'm gonna need to figure out what's going on later, but I can live with this for now. I connect back to Fly.io using their WireGuard setup with a little sprinkle of userspace WireGuard. It works well enough for my needs.

Xesite over Tor

In the process of moving things over, I found out that there's a Tor hidden service operator for Kubernetes. This is really neat and lets me set up a mirror of this website on the darkweb. If you want or need to access my blog over Tor, you can use gi3bsuc5ci2dr4xbh5b3kja5c6p5zk226ymgszzx7ngmjpc25tmnhaqd.onion to do that. You'll be connected directly over Tor.

I configured this as a non-anonymous hidden service using a setup like this:

apiVersion: tor.k8s.torproject.org/v1alpha2
kind: OnionService
metadata:
  name: xesite
spec:
  version: 3
  extraConfig: |
    HiddenServiceNonAnonymousMode 1
    HiddenServiceSingleHopMode 1
  rules:
    - port:
        number: 80
      backend:
        service:
          name: xesite
          port:
            number: 80

This creates an OnionService set up to point directly to the backend that runs this website. Doing this bypasses the request logging that the nginx ingress controller does. I do not log requests made over Tor unless you somehow manage to get one of the things you're requesting to throw an error, and even then I'll only log details about the error so I can investigate them later.

If you're already connected with the Tor browser, you may have noticed the ".onion available" in your address bar. This is because I added a middleware for adding the Onion-Location header to every request. The Tor browser listens for this header and will alert you to it.
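
A middleware like that is only a few lines of Go with the standard library's net/http. This is a sketch of the idea rather than the actual xesite code, using the onion address from above:

// onionLocation tells Tor Browser that an onion mirror of this page exists
// by setting the Onion-Location header on every response.
func onionLocation(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set(
			"Onion-Location",
			"http://gi3bsuc5ci2dr4xbh5b3kja5c6p5zk226ymgszzx7ngmjpc25tmnhaqd.onion"+r.URL.RequestURI(),
		)
		next.ServeHTTP(w, r)
	})
}

Wrap the site's main handler with it and the header shows up on every response.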

I'm not sure how the Tor hidden service will mesh with the ads from Ethical Ads, but I'd imagine that looking at my website over Tor would functionally disable them.

I killed the zipfile

One of the most controversial things about my website's design is that everything was served out of a .zip file full of gzip streams. This was originally done so that I could implement a fastpath hack to serve gzip compressed streams to people directly. This would save a bunch of bandwidth, make things load faster, save christmas from the incoming elf army, etc.

<Cadey>

Guess what I never implemented.

This zipfile strategy worked, for the most part. One of the biggest ways this didn't pan out is that I didn't support HTTP Range requests. Normally this isn't an issue, but Slack, LinkedIn, and other web services use them when doing a request to a page to unfurl links posted by users.

This has been a known issue for a while, but I decided to just fix it forever by making the website serve itself from the generated directory instead of using the zipfile in the line of serving things. I still use the zipfile for the preview site (I'm okay with that thing's functionality being weird), but yeah, it's gone.
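
As a sketch of why this is the easy path (assuming a hypothetical ./public directory of generated files, not the actual xesite serving code): the standard library's file server already implements Range requests, so serving straight from a directory fixes the unfurler problem for free.

package main

import (
	"log"
	"net/http"
)

func main() {
	// http.FileServer handles Range and conditional requests out of the box,
	// so link unfurlers like Slack's get what they expect.
	files := http.FileServer(http.Dir("./public"))
	http.Handle("/", files)
	log.Fatal(http.ListenAndServe(":3000", nil))
}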

If I ever migrate my website to use CI to build the website instead of having prod build it on-demand, I'll likely use the zipfile as a way to ship around the website files.

Crimes with file storage

Like any good Xe project, I had to commit some crimes somewhere, right? This time I implemented them at the storage layer. My website works by maintaining a git clone of its own repository and then running builds out of it. This is how I'm able to push updates to GitHub and then have it go live in less than a minute.

The main problem with this is that it can make cold start times long. Very long. Long enough that Kubernetes will think that the website isn't in a cromulent state and then slay it off before it can run the first build. I fixed this by making the readiness check run every 5 seconds for 5 minutes, but I realized there was a way I could do it better: I can cache the website checkout on the underlying node's filesystem.

So I use a hostPath volume to do this:

- name: data
  hostPath:
    path: /data/xesite
    type: DirectoryOrCreate
<Aoi>

Isn't this a very bad idea?

Using the hostPath volume type presents many security risks. If you can avoid using a hostPath volume, you should. For example, define a local PersistentVolume, and use that instead.

Shouldn't you use a PersistentVolumeClaim instead?

Normally, yes. This is a bad idea. However, a PersistentVolumeClaim doesn't really work for this due to how the Civo native Container Storage Interface works. They only support the ReadWriteOnce access mode, which would mean that I can only have my website running on one Kubernetes node at once. I'd like my website to be more nomadic between nodes, so I need to make it a ReadWriteMany mount so that the same folder can be used on different nodes.

I'll figure out a better solution eventually, but for now I can get away with just stashing the data in /data/xesite on the raw node filesystems and it'll be fine. My website doesn't grow at a rate where this would be a practical issue, and should this turn out to actually be a problem I can always reprovision my nodes as needed.

Declaring success

I'm pretty sure that this is way more than good enough for now. This should be more than enough for the next few years of infrastructure needs. Worst case though, it's just Kubernetes. I can move it anywhere else that has Kubernetes without too much fuss.

I'd like to make the Deno cache mounted in Tigris or something using csi-s3, but that's not a priority right now. This would only help with cold start latency, and to be honest the cold start latency right now is fine. Not the most ideal, but fine.

Everything else is just a matter of implementation more than anything at this point.

Hope this look behind the scenes was interesting! I put this level of thought and care into things so that you don't have to care about how things work.

My first deploys for a new Kubernetes cluster

2024-11-03 08:00:00

I'm setting up some cloud Kubernetes clusters for a bit coming up on the blog. As a result, I need some documentation on what a "standard" cluster looks like. This is that documentation.

<Mara>

Every Kubernetes term is WrittenInGoPublicValueCase. If you aren't sure what one of those terms means, google "site:kubernetes.io KubernetesTerm".

I'm assuming that the cluster is named mechonis.

For the "core" of a cluster, I need these services set up:

  • The 1Password operator, for syncing secrets
  • cert-manager, for TLS certificates
  • ingress-nginx, for HTTP routing
  • metrics-server, for resource metrics
  • csi-s3, for cheap bulk object-backed storage
  • external-dns, for publishing DNS records

These all cover different aspects of the three core features of any cloud deployment: compute, network, and storage. Most of my data will be hosted in the default StorageClass implementation provided by the platform (or in the case of baremetal clusters, something like Longhorn), so the csi-s3 StorageClass is more of a "I need lots of data but am cheap" than anything.

Most of this will be managed with helmfile, but 1Password can't be.

1Password

The most important thing at the core of my k8s setups is the 1Password operator. This syncs 1Password secrets to my Kubernetes clusters, so I don't need to define them in Secrets manually or risk putting the secret values into my OSS repos. This is done separately as I'm not able to use helmfile for it.

After you have the op command set up, create a new server with access to the Kubernetes vault:

op connect server create mechonis --vaults Kubernetes
        

Then install the 1password connect Helm release with operator.create set to true:

helm repo add \
  1password https://1password.github.io/connect-helm-charts/
helm install \
  connect \
  1password/connect \
  --set-file connect.credentials=1password-credentials.json \
  --set operator.create=true \
  --set operator.token.value=$(op connect token create --server mechonis --vault Kubernetes)

Now you can deploy OnePasswordItem resources as normal:

apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
  name: falin
spec:
  itemPath: vaults/Kubernetes/items/Falin

cert-manager, ingress-nginx, metrics-server, and csi-s3

In the cluster folder, create a file called helmfile.yaml. Copy these contents:

helmfile.yaml
repositories:
  - name: jetstack
    url: https://charts.jetstack.io
  - name: csi-s3
    url: cr.yandex/yc-marketplace/yandex-cloud/csi-s3
    oci: true
  - name: ingress-nginx
    url: https://kubernetes.github.io/ingress-nginx
  - name: metrics-server
    url: https://kubernetes-sigs.github.io/metrics-server/

releases:
  - name: cert-manager
    kubeContext: mechonis
    chart: jetstack/cert-manager
    createNamespace: true
    namespace: cert-manager
    version: v1.16.1
    set:
      - name: installCRDs
        value: "true"
      - name: prometheus.enabled
        value: "false"
  - name: csi-s3
    kubeContext: mechonis
    chart: csi-s3/csi-s3
    namespace: kube-system
    set:
      - name: "storageClass.name"
        value: "tigris"
      - name: "secret.accessKey"
        value: ""
      - name: "secret.secretKey"
        value: ""
      - name: "secret.endpoint"
        value: "https://fly.storage.tigris.dev"
      - name: "secret.region"
        value: "auto"
  - name: ingress-nginx
    chart: ingress-nginx/ingress-nginx
    kubeContext: mechonis
    namespace: ingress-nginx
    createNamespace: true
  - name: metrics-server
    kubeContext: mechonis
    chart: metrics-server/metrics-server
    namespace: kube-system

Create a new admin access token in the Tigris console and copy its access key ID and secret access key into secret.accessKey and secret.secretKey respectively.

Run helmfile apply:

$ helmfile apply
        

This will take a second to think, and then everything should be set up. The LoadBalancer Service may take a minute or ten to get a public IP depending on which cloud you are setting things up on, but once it's done you can proceed to setting up DNS.

external-dns

The next kinda annoying part is getting external-dns set up. It's something that looks like it should be packageable with something like Helm, but realistically it's such a generic tool that you're really better off making your own manifests and deploying it by hand. In my setup, I use these features of external-dns:

  • The crd source, so I can create records with DNSEndpoint resources
  • The ingress source, so records get created automatically for Ingress resources
  • The AWS provider, pointed at Route 53
  • The DynamoDB registry, so external-dns can keep track of which records it owns

You will need two DynamoDB tables:

  • external-dns-crd-mechonis: for records created with DNSEndpoint resources
  • external-dns-ingress-mechonis: for records created with Ingress resources

Create a terraform configuration for setting up these DynamoDB configuration values:

main.tf
terraform {
  backend "s3" {
    bucket = "within-tf-state"
    key    = "k8s/mechonis/external-dns"
    region = "us-east-1"
  }
}

resource "aws_dynamodb_table" "external_dns_crd" {
  name           = "external-dns-crd-mechonis"
  billing_mode   = "PROVISIONED"
  read_capacity  = 1
  write_capacity = 1
  table_class    = "STANDARD"

  attribute {
    name = "k"
    type = "S"
  }

  hash_key = "k"
}

resource "aws_dynamodb_table" "external_dns_ingress" {
  name           = "external-dns-ingress-mechonis"
  billing_mode   = "PROVISIONED"
  read_capacity  = 1
  write_capacity = 1
  table_class    = "STANDARD"

  attribute {
    name = "k"
    type = "S"
  }

  hash_key = "k"
}

Create the tables with terraform apply:

terraform init
terraform apply --auto-approve # yolo!

While that cooks, head over to ~/Code/Xe/x/kube/rhadamanthus/core/external-dns and copy the contents to ~/Code/Xe/x/kube/mechonis/core/external-dns. Then open deployment-crd.yaml and replace the DynamoDB table in the crd container's args:

   args:
     - --source=crd
     - --crd-source-apiversion=externaldns.k8s.io/v1alpha1
     - --crd-source-kind=DNSEndpoint
     - --provider=aws
     - --registry=dynamodb
     - --dynamodb-region=ca-central-1
-    - --dynamodb-table=external-dns-crd-rhadamanthus
+    - --dynamodb-table=external-dns-crd-mechonis

And in deployment-ingress.yaml:

   args:
     - --source=ingress
-    - --default-targets=rhadamanthus.xeserv.us
+    - --default-targets=mechonis.xeserv.us
     - --provider=aws
     - --registry=dynamodb
     - --dynamodb-region=ca-central-1
-    - --dynamodb-table=external-dns-ingress-rhadamanthus
+    - --dynamodb-table=external-dns-ingress-mechonis

Apply these configs with kubectl apply:

kubectl apply -k .
        

Then write a DNSEndpoint pointing to the created LoadBalancer. You may have to look up the IP addresses in the admin console of the cloud platform in question.

load-balancer-dns.yaml
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: load-balancer-dns
spec:
  endpoints:
    - dnsName: mechonis.xeserv.us
      recordTTL: 3600
      recordType: A
      targets:
        - whatever.ipv4.goes.here
    - dnsName: mechonis.xeserv.us
      recordTTL: 3600
      recordType: AAAA
      targets:
        - 2000:something:goes:here:lol

Apply it with kubectl apply:

kubectl apply -f load-balancer-dns.yaml
        

This will point mechonis.xeserv.us to the LoadBalancer, which will point to ingress-nginx based on Ingress configurations, which will route to your Services and Deployments, using Certs from cert-manager.

cert-manager ACME issuers

Copy the contents of ~/Code/Xe/x/kube/rhadamanthus/core/cert-manager to ~/Code/Xe/x/kube/mechonis/core/cert-manager. Apply them as-is, no changes are needed:

kubectl apply -k .
        

This will create letsencrypt-prod and letsencrypt-staging ClusterIssuers, which will allow the creation of Let's Encrypt certificates in their production and staging environments. 9 times out of 10, you won't need the staging environment, but when you are doing high-churn things involving debugging the certificate issuing setup, the staging environment is very useful because it has a much higher rate limit than the production environment does.

Deploying a "hello, world" workload

<Mara>

Nearly every term for "unit of thing to do" is taken by different aspects of Kubernetes and its ecosystem. The only one that isn't taken is "workload". A workload is a unit of work deployed somewhere, in practice this boils down to a Deployment, its Service, any PersistentVolumeClaims, Ingresses, or other resources that it needs in order to run.

Now you can put everything into test by making a simple "hello, world" workload. This will include:

  • A ConfigMap to store HTML to show to the user
  • A Deployment to run nginx pointed at the contents of the ConfigMap
  • A Service to give an internal DNS name for that Deployment's Pods
  • An Ingress to route traffic to that Service from the public Internet

Make a folder called hello-world and put these files in it:

configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hello-world
data:
  index.html: |
    <html>
    <head>
      <title>Hello World!</title>
    </head>
    <body>Hello World!</body>
    </html>
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
spec:
  selector:
    matchLabels:
      app: hello-world
  replicas: 1
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
        - name: web
          image: nginx
          ports:
            - containerPort: 80
          volumeMounts:
            - name: html
              mountPath: /usr/share/nginx/html
      volumes:
        - name: html
          configMap:
            name: hello-world
service.yaml
apiVersion: v1
kind: Service
metadata:
  name: hello-world
spec:
  ports:
    - port: 80
      protocol: TCP
  selector:
    app: hello-world
ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-world
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - hello.mechonis.xeserv.us
      secretName: hello-mechonis-xeserv-us-tls
  rules:
    - host: hello.mechonis.xeserv.us
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hello-world
                port:
                  number: 80
kustomization.yaml
resources:
  - configmap.yaml
  - deployment.yaml
  - service.yaml
  - ingress.yaml

Then apply it with kubectl apply:

kubectl apply -k .
        

It will take a minute for it to work, but here are the things that will be done in order so you can validate them:

  • The Ingress object has the cert-manager.io/cluster-issuer: "letsencrypt-prod" annotation, which triggers cert-manager to create a Cert for the Ingress
  • The Cert notices that there's no data in the Secret hello-mechonis-xeserv-us-tls in the default Namespace, so it creates an Order for a new certificate from the letsencrypt-prod ClusterIssuer (set up in the cert-manager apply step earlier)
  • The Order creates a new Challenge for that certificate, setting a DNS record in Route 53 and then waiting until it can validate that the Challenge matches what it expects
  • cert-manager asks Let's Encrypt to check the Challenge
  • The Order succeeds and the certificate data is written to the Secret hello-mechonis-xeserv-us-tls in the default Namespace
  • ingress-nginx is informed that the Secret has been updated and rehashes its configuration accordingly
  • HTTPS routing is set up for the hello-world service so every request to hello.mechonis.xeserv.us points to the Pods managed by the hello-world Deployment
  • external-dns checks for the presence of newly created Ingress objects it doesn't know about, and creates Route 53 entries for them

This results in the hello-world workload going from nothing to fully working in about 5 minutes tops. Usually this can be less depending on how lucky you get with the response time of the Route 53 API. If it doesn't work, run through resources in this order in k9s:

  • The external-dns-ingress Pod logs
  • The cert-manager Pod logs
  • Look for the Cert, is it marked as Ready?
  • Look for that Cert's Order, does it show any errors in its list of events?
  • Look for that Order's Challenge, does it show any errors in its list of events?
<Mara>

By the way: k9s is fantastic. You should have it installed if you deal with Kubernetes. It should be baked into kubectl. It's a near perfect tool.

Conclusion

From here you can deploy anything else you want, as long as the workload configuration kinda looks like the hello-world configuration. Namely, you MUST have the following things set:

  • Ingress objects MUST have the cert-manager.io/cluster-issuer: "letsencrypt-prod" annotation, if they don't, then no TLS certificate will be minted
  • Workloads MUST have the nginx.ingress.kubernetes.io/ssl-redirect: "true" annotation to ensure that all plain HTTP traffic is upgraded to HTTPS
  • Sensitive data MUST be managed in 1Password via OnePasswordItem objects
<Cadey>

If you work at a cloud provider that offers managed Kubernetes, I'm looking for a new place to put my website, sponsorship would be greatly appreciated!

Happy kubeing all!

"No way to prevent this" say users of only language where this regularly happens

2024-10-29 08:00:00

In the hours following the release of CVE-2024-9632 for the project X.org, site reliability workers and systems administrators scrambled to desperately rebuild and patch all their systems to fix a buffer overflow that allows an attacker with access to raw X client calls to arbitrarily read and write memory, allowing for privilege escalation attacks. This is due to the affected components being written in C, the only programming language where these vulnerabilities regularly happen. "This was a terrible tragedy, but sometimes these things just happen and there's nothing anyone can do to stop them," said programmer Queen Annamarie Bayer, echoing statements expressed by hundreds of thousands of programmers who use the only language where 90% of the world's memory safety vulnerabilities have occurred in the last 50 years, and whose projects are 20 times more likely to have security vulnerabilities. "It's a shame, but what can we do? There really isn't anything we can do to prevent memory safety vulnerabilities from happening if the programmer doesn't want to write their code in a robust manner." At press time, users of the only programming language in the world where these vulnerabilities regularly happen once or twice per quarter for the last eight years were referring to themselves and their situation as "helpless."