Blog of Xe Iaso

Soylent Green is people

2024-12-12 08:00:00

Recently a group of data scientists at Hugging Face created a dataset of curated Bluesky posts. The publication of this data has made a lot of people very angry and has been widely regarded as a bad move. The dataset contained one million posts from the Bluesky firehose with the intent that this could be a standard dataset to evaluate the effectiveness of various moderation tooling. The dataset was removed within hours of publication, but the damage was already done to the community.

Today I'm going to talk about the nuances involved with AI and dig into the 'danger' of one's posts being in these datasets. I'm also going to cover the social/ethical implications of this dataset, why people cried out against it, and how it frankly could have been handled better.

Cadey is coffee
<Cadey>

Before we start this, I want to clarify that I don't have a "pro-AI bias", I want to pick apart this ball of mud and discuss the complicated nuances involved with what AI is, what datasets are, and how the intersection of technology and social pressures is creating ethical dilemmas that society is not yet prepared to handle.

However, some of this nuance can only come from someone that has deep experience in the things that are being discussed. I've become an expert in a lot of this generative AI stuff due to my job making me learn how to use it. I am less of an expert when it comes to the social implications of this technology, but I'm going to do my best to cover what makes me uncomfortable.

Technology

One of my longest running projects is Techaro, a work of art/satire where I made up an imaginary tech startup and write about the misadventures people have working there.

This has either been one of my most successful or least successful projects because I branded myself as the CEO of this uncompany on LinkedIn and now I get treated as if I am the CEO of a real company by the market. I've had people reach out for investment opportunities.

Cadey is coffee
<Cadey>

Does this mean Techaro is a failure as a parody or is it a success at being a brand? I don't know man, I just wanted to make fun of my industry.

I haven't really written any Techaro stories in a while (all of my ideas are being done by actual startups, like Protos having been done by Devin, and one of the stories I'm working on hasn't gotten enough satire juice built up yet), but when I do I try to follow one basic rule: Techaro as a company is not intentionally evil, it's just not considering the ethical impacts of their actions and ends up creating systems that are de-facto evil.

Consider qntm's Lena. It's a story that isn't about mind uploading. It's a story about people and how the pressures of using technology to solve problems can and will lead to unforeseen consequences. This is why I've been calling this rule "Soylent Green is People". We as people are what take the technology and use it to do evil things, not the technology itself. To mess up quoting the G-Man:

> The right tool in the wrong place can make all the difference in the world.

I'm gonna come out ahead here and say that I don't think that tools like ChatGPT are fundamentally evil, but they do end up creating de-facto evil because of the way that they are used. This can end up having the de-facto evil outweigh the good that the tool does, making it look like the entire point of the tool is evil.

Spam and its consequences have been a disaster for the human race

The reason why ChatGPT can look evil is a bit complicated and I'm going to have to take a bit of a detour here to explain it, but trust me: everything is going according to keikaku.

Mara is hacker
<Mara>

Translator's note: keikaku means plan.

There's a general rule of thumb when it comes to everything that humans have ever produced: 90% of everything is absolute garbage and should be ignored. This is even more true when it comes to published works, email, and social media posts. The vast majority of everything that is published is either advertising or not high enough quality to engage with.

At the same time, our society is addicted to information. Right now you can send a message from your home in Ottawa to a server in Taiwan and get a reply within a few seconds. This is an absolute miracle of technology that I fear everyone younger than me takes for granted. When I was growing up, long-distance telephone service was still a thing and you had to be sure you were dialing the right number to call someone outside of your area code otherwise you'd get raked over the coals with long distance calling fees.

With the addition of so many people to so many communication channels, companies realize that they can get messages to people over those communications channels. This is why we have spam, junk mail, and everything else that sucks about the modern Internet.

Cadey is coffee
<Cadey>

I mean, I'm in a marketing position right now because I haven't figured out the best way to keyword mine my resume to get past HR screeners or something. I have a huge bias here.

Historically, sending out spam messages meant that they'd get in the inbox of your targets and then they'd likely be read. This is why spam is so effective. If you send out a million emails, you're likely to get a few people to click on your links and buy your products. It's a pure numbers game.

Then came the first implementations of spam filters. Identical messages that were sent to many mailboxes on the same domain were flagged as spam and NOT put into people's inboxes. This was a good thing, but the people that implemented it started an arms race we are still fighting today. Spammers introduced technology like spintax, where they could send out a million emails that were all slightly different and thus would (hopefully) not be flagged as spam as often. Consider this totally fabricated example:

Hello {sir|ma'am|dear|human},

I am {in the market|looking to purchase|wanting to buy} 35 {iPhones|Samsung Galaxy S25 phones|Vision Pro headsets}. Can you contact me on {WhatsApp|Signal|Telegram|Kik} so we can facilitate the purchase? You will be {rewarded|very useful} for helping this happen.

Every one of those words in curly braces separated by pipes is a place where a new variant of the message can be formed. That one example I made up has 288 possible variants (4 × 3 × 3 × 4 × 2). Spread a million emails across that many variants and a filter that looks for identical copies has a much harder time catching them all. This is why spam filters are so aggressive these days.
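To make the mechanics concrete, here's a minimal sketch of a spintax expander in Go. The template syntax is the same curly-brace-and-pipe form as above; the helper names are mine, not from any real spam tool.

package main

import (
	"fmt"
	"math/rand"
	"regexp"
	"strings"
)

// spintaxGroup matches {option one|option two|...} groups.
var spintaxGroup = regexp.MustCompile(`\{([^{}]+)\}`)

// expandSpintax picks one option from every {a|b|c} group in the template.
func expandSpintax(template string) string {
	return spintaxGroup.ReplaceAllStringFunc(template, func(group string) string {
		options := strings.Split(strings.Trim(group, "{}"), "|")
		return options[rand.Intn(len(options))]
	})
}

func main() {
	template := "Hello {sir|ma'am|dear|human}, I am {in the market|looking to purchase|wanting to buy} 35 {iPhones|Samsung Galaxy S25 phones|Vision Pro headsets}."
	// Every call prints one of the 4 × 3 × 3 = 36 variants of this shortened template.
	fmt.Println(expandSpintax(template))
}

Each run picks a different combination, which is exactly what makes "flag identical messages" filters so easy to route around.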

What's worse is that even if you have a good true positive rate, you also have a false positive rate to deal with. How many times have you had genuine emails get banished to the shadow realm? This happens constantly for me, and I'm sure it happens to you too.

In general, I get about 50-100 automated messages for every genuinely human authored message. Most of those automated messages are good (invoices, alerts, etc), but what I really want to see above all else is direct human contact. I want that way more than your newsletter about your AI product having a new feature.

Cadey is coffee
<Cadey>

There are newsletters I do subscribe to and do actively read, I just wish that signing up to test services for work didn't mean I'd get a newsletter every month about the new AI features that the service has rolled out. I also get bombarded with ads for marketing services because I made the mistake of marking myself as a CEO on LinkedIn. It's Spam 2 or something.

Spam is an ongoing battle. There are more product and marketing people globally than there are people working on defending against the outputs of those product/marketing people, not to mention 419 scams and the ones that let you know you won the UN lottery in Palau or something. It is an unwinnable battle, but we have to fight it because losing means that the Internet becomes truly dead and unusable.

Then in December 2022, everything changed when the Fire Nation attacked.

The AI bubble

In December 2022, OpenAI released the first version of ChatGPT and then broke the Internet. In no time flat they got the any% record for fastest time from product launch to 100 million monthly active users. Previously the record holder was TikTok at 9 months. ChatGPT did it in 2 months.

This is unreal. This is unprecedented. This violent explosion of growth absolutely broke the Internet, but also broke the expectations of the people that fund it. The AI bubble was born.

I've been an IRC user for a very long time, over half my life now. One of the things that IRC users are known to do is set up a chatbot that listens to everything that's being said and uses it to construct Markov chains, which are a way to get a probabilistic average of what is being said. There's nothing really magic about Markov chains, except sometimes the bot would absolutely roast someone in a way that was so perfect that it felt like the bot was a person. Then the bot started making ungrammatical nonsense and the illusion was instantly shattered and everyone went along their day like nothing happened.

Markov chains are fun to play with (and pretty damn trivial to implement, homework assignment for those of you new in your career), but they're not really useful for anything. They're a toy, but you're not going to be able to ask them questions and get coherent answers from them.
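If you want to see the homework assignment, here's roughly the shape of it: a word-level Markov chain sketch in Go. This is my own toy illustration, not any particular IRC bot's code.

package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// chain maps each word to every word that has been seen following it.
type chain map[string][]string

// train records every word-to-next-word transition in the corpus.
func train(c chain, corpus string) {
	words := strings.Fields(corpus)
	for i := 0; i < len(words)-1; i++ {
		c[words[i]] = append(c[words[i]], words[i+1])
	}
}

// babble walks the chain from a starting word, picking a random successor each step.
func babble(c chain, start string, maxWords int) string {
	out := []string{start}
	cur := start
	for i := 0; i < maxWords; i++ {
		next, ok := c[cur]
		if !ok {
			break
		}
		cur = next[rand.Intn(len(next))]
		out = append(out, cur)
	}
	return strings.Join(out, " ")
}

func main() {
	c := chain{}
	train(c, "the sky is blue because the sky scatters blue light more than red light")
	fmt.Println(babble(c, "the", 10))
}

Feed it a channel's worth of chat logs instead of one sentence and you get the occasional accidental roast; ask it a question and you get word salad.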

ChatGPT let you just type in a question and get an answer. You just ask it:

Aoi is wut
<Aoi>

Hey Mimi, why is the sky blue?

Mimi is happy
<Mimi>

Ah, an interesting question! The sky appears blue because of a phenomenon called Rayleigh scattering. During daytime, sunlight enters Earth's atmosphere and collides with gas molecules in it. These collisions scatter the shorter wavelengths (blue) more effectively than the longer ones (red). That's why we see the sky as blue most of the time. Let me know if you have any other questions on science-related topics!


Content generated by AI using the model Hermes 3 8B @ Q4.

I know that this is normal for you all now, but when this was first released, this was a staggeringly huge development. GPT-2 was already a big deal, GPT-3 gave us wonders like the infinite Seinfeld stream, but ChatGPT totally changed the game because you could just ask it a question and get an answer. You didn't have to format your prompt like there was already a conversation going on. It just behaved like iMessage.

It was bloody mind-bending. It felt infinite and the possibilities felt endless.

I'm a bit of an artist by nature so one of the first things I tried to do with it was figure out how to produce art with it. Those experiments have gone unpublished because I feel like they weren't good enough. Sure, it was great at telling you the SQL syntax for doing specific lookups on specific tables given the schema (this is the kind of thing I end up using AI tools for because I switch between tools so often due to the needs of my job), but it wasn't generically useful for artistic works.

The spam

Then the spam began. Normally content marketers and nothing websites would have to outsource writing and labor to third-world countries to write meaningless articles about the meditation benefits of orange juice or something. Now they could just ask ChatGPT to write the article for them. For free.

Mimi is happy
<Mimi>
The Refreshing Benefits of Orange Juice for Meditation

Orange juice, with its vibrant color and refreshing taste, is not only a delightful morning beverage but also a powerful ally in enhancing your meditation practice. This natural elixir, packed with essential nutrients and vitamins, offers numerous benefits that can elevate your mindfulness experience. [...]


Content generated by AI using the model Hermes 3 8B @ Q4.

One big problem with this though is that it's basically impossible to tell if something is AI generated, despite the fact that there are services you can pay to check for that. Without the right weasel words (delve, etc.) or the telltale closing summary that AI models usually tack onto the end, untrained humans guess whether a given block of text was AI generated correctly about 50% of the time, which is the same as random chance.

Even then, deploying these detectors means you have just accidentally created a racism machine. If English isn't your first language, there are stylometry patterns that are basically invisible to humans but very visible to machines. This results in genuine human authors being flagged as AI generated because they don't write in the same way that native English speakers do. Add that to statistics being counter-intuitive and the average person being fairly innumerate, and you have a recipe for disaster.

Cadey is coffee
<Cadey>

Not only is an AI-written-utterances detector racism as a service, it also punishes prolific writers like me. My blog is in the dataset for ChatGPT and I've written enough that everything I write now shows as having a high likelihood of being AI generated. This is why I'm never going to be able to go back to college: even though everyone in the system knows AI detectors are bullshit, it's still going to come up in my record and I'm going to have to explain it to every single professor I have.

This all adds up to make ChatGPT look evil because of how it's being used to do things that are de-facto evil and create unforeseen consequences that are spilling out to people that are not involved in the creation or usage of the tool.

However, the growth of ChatGPT was so fast and sudden that even though the team behind ChatGPT took the time to consider the ethical implications of what they were doing, the pressure to generate hype and raise capital won out. They became a household name in weeks, which is exactly what you want as a startup.

Cadey is coffee
<Cadey>

I'm pretty sure that the investor hype for AI stems from this meteoric launch. People wanna chase the dragon of that high and say they were a part of it.

Local AI makes this worse

Cadey is coffee
<Cadey>

I want to start this section out by saying that if we do have to "pick evils", then I'd almost certainly rather have the evil where we're allowed to have local AI technologies on devices you can look at rather than them being locked in a corporate Disney Vault. This is a nuanced position. None of this can fit in a tweet.

All of the AI generated responses Mimi has made in this article have been done on either my MacBook or a machine in my homelab (via a Discord bot in #mimi in the patron Discord). Generally there are few (if any) limitations on what self-hosted models can do.

Right now you can download models with Ollama and run them on your own computer. You can use its API to integrate with any workflow you can dream up. You can generate as much spam as you want as fast as your hardware will physically allow you to.
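To give a sense of how little glue that takes, here's a minimal sketch of calling a locally running Ollama server from Go. It assumes Ollama's default local port (11434) and that you've already pulled a model; swap the model name for whatever you actually run.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"`
}

type generateResponse struct {
	Response string `json:"response"`
}

func main() {
	// Build a non-streaming generate request for the local Ollama daemon.
	body, _ := json.Marshal(generateRequest{
		Model:  "llama3", // assumes you've already run `ollama pull llama3`
		Prompt: "Why is the sky blue? Answer in one sentence.",
		Stream: false,
	})

	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out generateResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Response)
}

Wrap that in a loop and you have a content farm; wrap it in a summarizer and you have something genuinely useful. The tool doesn't care which.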

Have we lost the Internet?

There are redeeming uses for this: namely summarization of articles, emails, notifications, as well as being generically useful for developers to get snippets of code that "fill in the blank" to do trivial things like write SQL queries. If you add in tool use then the sky is the limit for what you can do with these models.

I've been working on a small tool that will respond to PagerDuty alerts by trying to restart the affected service and see if things are still broken. This is something that's fairly trivial to do, but really hard to get right, especially with local models. Local models are kinda lobotomized and terrible at tool use, probably because it's the least mature feature right now.

I don't know, I have a lot of doublethink about this. These tools are frankly cool as hell, you describe what you want in plain words and it either does it or gives you explanations back. This is fucking magic and I love it. But at the same time, the ways this technology is being used is creating so much de-facto evil that I wonder how people are rationalizing or excusing it.

The dataset

And now we loop back to the dataset of one million Bluesky posts that was unceremoniously dropped on HuggingFace.

One of the big reasons why I sat down to draft all of this is in the wake of a dataset of Bluesky posts being released on HuggingFace. The dataset has since been removed from the platform, but the intent was to offer a million posts as a "vertical slice" of the network for improving moderation tooling. Attempts were made to minimize the data being collected, but several factors, including it just being a bad look, made it a PR disaster.

Cadey is coffee
<Cadey>

By the way, if you go out, find, and harass anyone involved with this dataset, I will disown you. I'm not going to be a part of or party to that. Do what you want, but just remember that actions do have consequences.

There's a fair amount of misunderstanding over what data was actually collected and what you can do with it. I grabbed a copy of it and did analysis as soon as I could. From what I saw, here's what's actually in that dataset:

For each of the million posts in the dataset, there are the following fields:

  • The user ID (DID) of the post's author (nominally anonymous, but it can be connected to a username with a small amount of effort; for example, my account's identifier did:plc:e5nncb3dr5thdkjir5cfaqfe can be associated with me by heading to bsky.app/profile/did:plc:e5nncb3dr5thdkjir5cfaqfe)
  • The text of the post
  • The timestamp when the post was created
  • The URI of the post (this is its unique ID in this dataset)
  • A boolean "post has images attached to it or not" flag
  • Which post this is in reply to by URI (if any)
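In Go terms, each record boils down to something like the struct below. The field names are my own illustration of the shape, not the dataset's actual column names.

package dataset

import "time"

// BlueskyPost is roughly the shape of one record in the dataset.
// Field names here are illustrative, not the dataset's real column names.
type BlueskyPost struct {
	AuthorDID string    // e.g. "did:plc:e5nncb3dr5thdkjir5cfaqfe"; resolvable to a handle on bsky.app
	Text      string    // the text of the post
	CreatedAt time.Time // when the post was created
	URI       string    // the post's unique identifier in the dataset
	HasImages bool      // whether the post has images attached
	ReplyTo   string    // URI of the parent post, empty if this isn't a reply
}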

In my opinion, this dataset is useless for training generative AI / large language models for the same reason that random Reddit posts are largely useless for training generative AI / large language models. The quality of data used when training matters way more than the quantity of data when you're trying to make the model do useful things. If your instruction tuning dataset contains too much "low quality" data, you end up taking /u/fucksmith's sarcastic advice about gluing cheese to pizza to make it stick as gospel.

The algorithm can't understand satire.

One common dataset to train large language models is The Pile, a collection of 886 GB of diverse data (including among other things the Enron emails). It is commonly used to train 8 billion parameter models. To say these things need mind-bending amounts of data to be useful is like saying that water makes things damp. Even then, The Pile is still only in the ballpark of 100 billion tokens (depending on which tokenizer you use, etc).

There are giant orders of magnitude differences between the amount of training data required to get something useful out of a model and the amount of data that was released in this dataset. It's just not even close to being useful for training large language models.

The Pile is considered a "small" dataset by the AI community. It is almost a terabyte of text, and considered to be "a good start" before scaling up to a "real" training run. I can only begin to wonder how big that would be.

The scale here is just beyond explanation.

Moderation tooling and you

Data from public social networks like Bluesky, Mastodon, or whatever we're calling Twitter is almost certainly full of these "low-quality" posts that will inevitably just muddy the waters and make it difficult to get useful information out of the model. If anything, this dataset would have actually been useful for making moderation tooling because it's a clean vertical slice of the network at a given point in time.

Let's say you've been working on a moderation tool that's supposed to be able to detect hate speech. You can use the results of that moderation tool on that dataset to compare how effective your approach is versus other approaches and use that as a common benchmark. This is a good thing to have and it's a good thing to have out in the open, because otherwise these things will be collected and developed in private and then people have no way to evaluate the tools for themselves.

If the data isn't out in the open, there's no real point of comparison. Evaluating the effectiveness of these tools becomes harder. This makes it harder for people to develop tools that help protect people from the worst of the Internet.
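As a rough sketch of what "a common benchmark" looks like in practice: run every tool over the same labeled slice of posts and score them the same way. Everything below is hypothetical glue code, not any real moderation API.

package main

import "fmt"

// result compares a classifier's verdicts against hand-labeled ground truth.
type result struct{ truePos, falsePos, falseNeg int }

func score(labels, predictions []bool) result {
	var r result
	for i, predicted := range predictions {
		switch {
		case predicted && labels[i]:
			r.truePos++
		case predicted && !labels[i]:
			r.falsePos++
		case !predicted && labels[i]:
			r.falseNeg++
		}
	}
	return r
}

func main() {
	labels := []bool{true, false, true, true, false}      // human moderation decisions
	predictions := []bool{true, false, false, true, true} // what your tool flagged
	r := score(labels, predictions)
	precision := float64(r.truePos) / float64(r.truePos+r.falsePos)
	recall := float64(r.truePos) / float64(r.truePos+r.falseNeg)
	fmt.Printf("precision=%.2f recall=%.2f\n", precision, recall)
}

Two tools scored against the same slice give you numbers you can actually compare; two tools scored against private, different slices give you marketing.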

There was not much rejoicing

When the dataset was released, it wasn't a good look. A researcher at Hugging Face announced it and the negative reactions came almost instantly. Hugging Face is a company worth billions. One of the first reactions people had was that people at a billion-dollar company were starting to train generative AI off of Bluesky posts, and this came right after many people had stopped using Twitter because Twitter changed its terms of service to allow the network to train AI models off of people's posts.

There was little rejoicing in the streets.

This was made worse by the Hugging Face community lampooning people who complained about their data being included in the dataset. It's perfectly reasonable for people to want their data removed from the dataset (or better, some kind of opt-in à la the "Do Not Track" header), and people are well within their rights to want their copyrighted content removed from the dataset.

Mara is hacker
<Mara>

Fun fact: in at least the US and Canada, everything a person publishes or authors is protected by an implicit copyright. This means that the removal requests have legal teeth. I'm not sure of the precedent for this in terms of enforcement, but from what I've seen it looks like companies desperately want to avoid there being one and will settle out of court to avoid it being created.

A lot of the pushback really boils down to people being tired of the things they do to make themselves happy being thrown into the infinite slop machine and then being told that they're not needed at their jobs anymore. This is a valid fear, especially given how Hugging Face is the GitHub for machine learning models. Hell, I share this fear. Every time I publish something these days I start to wonder how much of what I'm sending out to the public will just be used to make it easier to replace me.

My livelihood is made by thinking really hard about something and then rending a chunk of my soul out to the public. If I can't do that anymore because a machine that doesn't need to sleep, eat, pay rent, have a life, get sick, or have a family can do it 80% as good as I can for 20% of the cost, what the hell am I supposed to do if I want to eat?

At some level it feels like an incredible abuse that my work of passion is horrifically mangled and blended into word frequency patterns that people ask for pancake recipes with and the companies hosting them pray to the machine gods that they don't get the instructions to make mustard gas instead.

I almost wonder if this is inevitable though, it's probably gonna happen no matter what I want because the market is going to demand it (even though I'm not really sure of the market's use for it being valid or sustainable).

Frankly though, I think that if this data is going to exist no matter what me or anyone else on the Internet wants, it's probably better that the data exists out in the open where anyone can inspect it and make sure that people are being treated fairly. I can easily see a future where people using locally hosted models are able to ask questions about LGBTQ+ rights and get observably correct answers while people using OpenAI get told "I'm sorry I can't help you with that, here's a recipe for pancakes, please don't stop using ChatGPT <3".

This would not be a good ending.

The dataset has anthropological implications

One of the other big uses of a dataset like this is to act as a snapshot of humanity for anthropological research. A lot of our narratives of history have been the results of people looking through primary source documents (news articles, journals, diaries, etc) and then drawing conclusions from them. In the absence of journals and diaries, a lot of anthropologists have turned to social media posts to get a sense of what people are thinking and feeling.

This kind of information is invaluable for understanding what happened in the past so that we can tell that story and people don't repeat the mistakes we made. Groups like Archive Team, which work to preserve the Internet for future generations, step in when social media platforms shutter and die precisely for this reason: they don't want another Library of Alexandria to burn down and be lost for all time.

The value of this far, far outweighs the potential for abuse; however for the best effect you need to collect this anthropological data and then put it in a vault for 25-50 years before people can start to analyze it. There is a reason that the US Census is sealed for 72 years before it can be released to the public. After 72 years, statistically most of the people surveyed are either dead or no longer living in the same place they were when they were surveyed.

Realistically though, we are fools if we think that stopping one group from collecting this data out in the open will stop other groups from collecting it in secret.

Soylent Green is People

Aoi is facepalm
<Aoi>

Is this technology really evil or is it actually the way it's used that makes it evil? I don't know. I'd love to think this isn't blatantly evil, but it's so hard to hold that belief when the main uses of this technology are so de-facto evil.

Cadey is coffee
<Cadey>

Welcome to the exciting world of technology. Every tool is a weapon if you hold it wrong, and the grip seems purpose-made to hold it wrong. I don't really know either.

In 1973 the movie Soylent Green was released. The movie is about a future where the world is overpopulated and the only food that people can get is a food product called Soylent Green. Soylent Green is a life-saver, literally allowing humanity to recover and feed itself.

Then people slowly start to vanish. This isn't noticed much at first because there's so many people, but then it starts to get more noticeable as the main character starts to investigate. The final scene of the movie ends with the main character screaming "Soylent Green is people!" as he's dragged away by the police.

Nobody believes him.

Is the technology used in Soylent Green evil, or is the company using that technology evil?

This entire conflict is why that rule of writing my stories is called "Soylent Green is people". The best science fiction comes from when you take a conflict and use technology itself to accelerate and accentuate that conflict to the point that it's obvious to the audience that the conflict is the problem, not the technology.

Star Trek is famous for this. The Borg are a metaphor for the cold integration of people into the machine that was the Soviet Union, the Dominion are a metaphor for the United States and Manifest Destiny, and the Federation is a metaphor for the feckless enlightened centrism of the United Nations.

The technology itself is just a tool. It's always the people that use the technology that are the problem.

Now, is the use of technology for developing large language models that displace human labor (yet fundamentally rely on human labor to construct the data that goes into said models) evil?

I don't know. Then again, the purpose of a system is what it does.

Conclusion

I'm dealing with a lot of nuanced things here and as much as I would love to say I have answers, I just don't. I don't know what the best option is here and I would be a fool for thinking otherwise. I've been working with this technology for a while and keep doing so precisely so that I can try to figure out something close to an answer here.

Is this technology really that evil though? I don't think so. I don't think that the fundamental technology of large language models and the training of them is evil. I definitely think that they are overhyped and not as generally applicable as you'd think or as people have claimed. If this thing must exist, it's probably better to have this done out in the open so that we can be sure that it reflects reality as it happens instead of it being done cloak and dagger in private, resulting in people not being able to get questions about LGBTQ+ rights answered accurately. One of the main things that just gives me pause is the potential for abuse.

Maybe we need to have some way to limit access to the pre-collected datasets so that it is still open, but not open season. I know that the Internet Archive has a way to make things private for a period of time, I suspect that may be the path forward for this.

I have a lot of thoughts about this topic and things related to it, more than can really fit in any individual blogpost. I'm going to be writing up my thoughts in batches so that I can work through this piece by piece. I plan to write a lamentation about art vs content, my thoughts about Apple Intelligence, that one lawsuit with Character.ai, some things that AI is actually useful for, and finally my fears about people using AI to roll up the ladder behind us as software people and make it hard or impossible for people to get into the industry.

Soylent Green is people.

Thanks for bearing with me.

Getting a pointer to a constant in Go

2024-11-25 08:00:00

In Go, sometimes you need to get a pointer to a constant value. This is normally easy, but only if you have a value, not a constant. Let's say you or a friend are dealing with the AWS S3 API and you need to pass a value to one of the parameters:

_, err = s3c.PutObject(ctx, &s3.PutObjectInput{
	Bucket: "mah-bukkit",
	Key:    "something",
	Body:   bytes.NewReader(fileContent),
})

Doing this gets you a compile error, because you need a pointer to the string.

There's several ways to work around this. I'm going to go over them in order from least to most hacky.

Make those constants into values

You can make a pointer to a value, but not a constant. Lift the bucket name and key values into variables:

bucketName := "mah-bukkit"
key := "something"

_, err = s3c.PutObject(ctx, &s3.PutObjectInput{
	Bucket: &bucketName,
	Key:    &key,
	Body:   bytes.NewReader(fileContent),
})

This works in most cases, but you have to declare variables every time. This can look odd.

The aws.String / aws.Type functions

The aws package exposes some helper functions that do this conversion for you. You'll see these in the example code:

_, err = s3c.PutObject(ctx, &s3.PutObjectInput{
	Bucket: aws.String("mah-bukkit"),
	Key:    aws.String("something"),
	Body:   bytes.NewReader(fileContent),
})

This works because function arguments are treated as values:

package aws

func String(val string) *string {
	return &val
}

Making your own generic pointer to anything function

Something else you can do is use Go generics to make a "get me the pointer of this" function:

func p[T any](val T) *T {
	return &val
}

Then you can use it as normal:

_, err = s3c.PutObject(ctx, &s3.PutObjectInput{
	Bucket: p("mah-bukkit"),
	Key:    p("something"),
	Body:   bytes.NewReader(fileContent),
})

The Kubernetes trick

Making variables and passing things as arguments to functions aren't the only ways to do this; there's also a trick I learned by reading Kubernetes source code. I'll paste an example and then explain how it works:

raised := &[]string{"foo"}[0]

This works by creating an anonymous string slice with one member, "foo", grabbing the first element of that slice, and taking a pointer to it. This makes the code look kinda cursed:

_, err = s3c.PutObject(ctx, &s3.PutObjectInput{
	Bucket: &[]string{"mah-bukkit"}[0],
	Key:    &[]string{"something"}[0],
	Body:   bytes.NewReader(fileContent),
})

However every step in this is perfectly logical.

Follow me on Bluesky!

2024-11-13 08:00:00

Hey all!

I'm not going to be posting as much on Twitter/X anymore. I've moved a lot of my online posting to Bluesky. If you want to follow me there, follow @xeiaso.net. You can also follow me on Bluesky via the Fediverse with Bridgy Fed at @[email protected].

I've locked my Twitter account and will not be posting anything there but reminders that I have left. Thank you for following me there all these years, but enough has become enough and I have no real reason to stay there. Bluesky is just a better place for me.

Stay safe out there and have fun on the internets!

Nomadic Infrastructure Design for AI Workloads

2024-11-12 08:00:00

Taco Bell is a miracle of food preparation. They manage to have a menu of dozens of items that all boil down to permutations of the same handful of basic ingredients: meat, cheese, beans, vegetables, bread, and sauces. Those basic fundamentals are combined in new and interesting ways to give you the crunchwrap, the chalupa, the Doritos Locos Tacos, and more. Just add hot water and they’re ready to eat.

Even though the results are exciting, the ingredients for them are not. They’re all really simple things. The best designed production systems I’ve ever used take the same basic idea: build exciting things out of boring components that are well understood across all facets of the industry (eg: S3, Postgres, HTTP, JSON, YAML, etc.). This adds up to your pitch deck aiming at disrupting the industry-disrupting industry.

A bunch of companies want to sell you inference time for your AI workloads or the results of them inferencing AI workloads for you, but nobody really tells you how to make this yourself. That’s the special Mexican Pizza sauce that you can’t replicate at home no matter how much you want to be able to.

Today, we’ll cover how you, a random nerd that likes reading architectural articles, should design a production-ready AI system so that you can maximize effectiveness per dollar, reduce dependency lock-in, and separate concerns down to their cores. Buckle up, it’s gonna be a ride.

Mara is hacker
<Mara>

The industry uses like a billion different terms for “unit of compute that has access to a network connection and the ability to store things for some amount of time” that all conflict in mutually incompatible ways. When you read “workload”, you should think about some program that has network access to some network and some amount of storage through some means running somewhere, probably in a container.

The fundamentals of any workload

At the core, any workload (computer games, iPadOS apps, REST APIs, Kubernetes, $5 Hetzner VPSen, etc.) is a combination of three basic factors:

  • Compute, or the part that executes code and does math
  • Network, or the part that lets you dial and accept sockets
  • Storage, or the part that remembers things for next time

In reality, these things will overlap a little (compute has storage in the form of ram, some network cards run their own Linux kernel, and storage is frequently accessed over the network), but that still very cleanly maps to the basic things that you’re billed for in the cloud:

  • Gigabyte-core-seconds of compute
  • Gigabytes egressed over the network
  • Gigabytes stored in persistent storage

And of course, there’s a huge money premium for any of this being involved in AI anything because people will pay. However, let’s take a look at that second basic thing you’re billed for a bit closer:

  • Gigabytes egressed over the network

Note that it’s egress out of your compute, not ingress to your compute. Providers generally want you to make it easy to put your data into their platform and harder to get the data back out. This is usually combined with your storage layer, which can make it annoying and expensive to deal with data that is bigger than your local disk. Your local disk is frequently way too small to store everything, so you have to make compromises.

What if your storage layer didn’t charge you per gigabyte of data you fetched out of it? What classes of problems would that allow you to solve that were previously too expensive to execute on?

If you put your storage in a service that is low-latency, close to your servers, and has no egress fees, then it can actually be cheaper to pull things from object storage just-in-time to use them than it is to store them persistently.

Storage that is left idle is more expensive than compute time

In serverless (Lambda) scenarios, most of the time your application is turned off. This is good. This is what you want. You want it to turn on when it’s needed, and turn back off when it’s not. When you do a setup like this, you also usually assume that the time it takes to do a cold start of the service is fast enough that the user doesn’t mind.

Let’s say that your AI app requires 16 gigabytes of local disk space for your Docker image with the inference engine and the downloaded model weights. In some clouds (such as Vast.ai), this can cost you upwards of $4-10 per month to have the data sitting there doing nothing, even if the actual compute time is as low as $0.99 per hour. If you’re using Flux [dev] (12 billion parameters, 25 GB of weight bytes) and those weights take 5 minutes to download, this means that you are only spending $0.12 waiting for things to download. If you’re only doing inference in bulk scenarios where latency doesn’t matter as much, then it can be much, much cheaper to dynamically mint new instances, download the model weights from object storage, do all of the inference you need, and then slay those instances off when you’re done.

Most of the time, any production workload’s request rate is going to follow a sinusoidal curve where there’s peak usage for about 8 hours in the middle of the day and things will fall off overnight as everyone goes to bed. If you spin up AI inference servers on demand following this curve, this means that the first person of the day to use an AI feature could have it take a bit longer for the server to get its coffee, but it’ll be hot’n’ready for the next user when they use that feature.

You can even cheat further with optional features such that the first user doesn’t actually see them, but it triggers the AI inference backend to wake up for the next request.
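A sketch of that cheat in Go: when a user touches a related feature, poke the inference backend awake in the background without making anyone wait on it. The endpoint and names here are made up for illustration.

package warmup

import (
	"context"
	"net/http"
	"time"
)

// warmUpInference nudges a (hypothetical) inference backend awake in the
// background so it's hot for the next request. Callers never wait on it.
func warmUpInference() {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()

		req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://inference.internal/healthz", nil)
		if err != nil {
			return
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return // failure is fine; this was only a warm-up poke
		}
		resp.Body.Close()
	}()
}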

It may not be your money, but the amounts add up

When you set up cloud compute, it’s really easy to fall prey to the siren song of the seemingly bottomless budget of the corporate card. At a certain point, we all need to build sustainable business as the AI hype wears off and the free tier ends. However, thanks to the idea of Taco Bell infrastructure design, you can reduce the risk of lock-in and increase flexibility between providers so you can lower your burn rate.

In many platforms, data ingress is free. Data egress is where they get you. It’s such a problem for businesses that the EU has had to step in and tell providers that people need an easy way out. Every gigabyte of data you put into those platforms is another $0.05 that it’ll cost to move away should you need to.

This doesn’t sound like an issue, because the CTO negotiating dream is that they’ll be able to play the “we’re gonna move our stuff elsewhere” card and instantly win a discount and get a fantastic deal that will enable future growth or whatever.

This is a nice dream.

In reality, the sales representative has a number in big red letters in front of them. This number is the amount of money it would cost for you to move your 3 petabytes of data off of their cloud. You both know you’re stuck with each other, and you’ll happily take an additional measly 5% discount on top of the 10% discount you negotiated last year. We all know that the actual cost of running the service is 15% of even that cost, but the capitalism machine has to eat somehow, right?

On the nature of dependencies

Let’s be real, dependencies aren’t fundamentally bad things to have. All of us have a hard dependency on the Internet, amd64 CPUs, water, and storage. Everything’s a tradeoff. The potentially harmful part comes in when your dependency locks you in so you can’t switch away easily.

This is normally pretty bad with traditional compute setups, but can be extra insidious with AI workloads. AI workloads make cloud companies staggering amounts of money, so they want to make sure that you keep your AI workloads on their servers as much as possible so they can extract as much revenue out of you as possible. Combine this with the big red number disadvantage in negotiations, and you can find yourself backed into a corner.

Strategic dependency choice

This is why picking your dependencies is such a huge thing to consider. There’s a lot to be said about choosing dependencies to minimize vendor lock-in, and that’s where the Taco Bell infrastructure philosophy comes in:

  • Trigger compute with HTTP requests that use well-defined schemata.
  • Find your target using DNS.
  • Store things you want to keep in Postgres or object storage.
  • Fetch things out of storage when you need them.
  • Mint new workers when there is work to be done.
  • Slay those workers off when they’re not needed anymore.

If you follow these rules, you can easily make your compute nomadic between services. Capitalize on things like Kubernetes (the universal API for cloud compute, as much as I hate that it won), and you make the underlying clouds an implementation detail that can be swapped out as you find better strategic partnerships that can offer you more than a measly 5% discount.

Just add water.

How AI models become dependencies

There's an extra evil way that AI models can become production-critical dependencies. Most of the time when you implement an application that uses an AI model, you end up encoding "workarounds" for the model into the prompts you use. This happens because AI models are fundamentally unpredictable and unreliable tools that sometimes give you the output you want. As a result though, changing out models sounds like it's something that should be easy. You just change out the model and then you can take advantage of better accuracy, new features like tool use, or JSON schema prompting, right?

In many cases, changing out a model will result in a service that superficially looks and functions the same. You give it a meeting transcript, it tells you what the action items are. The problem comes in with the subtle nuances of the je ne sais quoi of the experience. Even subtle differences like the current date being in the month of December can drastically change the quality of output. A recent paper from Apple concluded that adding superficial details that wouldn't throw off a human can severely impact the performance of large language models. Heck, they even struggle or fall prey to fairly trivial questions that humans find easy, such as:

  • How many r's are in the word "strawberry"?
  • What's heavier: 2 pounds of bricks, one pound of heavy strawberries, or three pounds of air?

If changing the placement of a comma in a prompt can cause such huge impacts to the user experience, what would changing the model do? What happens when you’re forced to change the model because the provider is deprecating it so they can run newer models that don’t do the job as well as the model you currently use? This is a really evil kind of dependency that you can only get when you rely on cloud-hosted models. By controlling the weights and inference setups for your machines, you have a better chance of being able to dictate the future of your product and control all parts of the stack as much as possible.

How it’s made prod-ready

Like I said earlier, the three basic needs of any workload are compute, network, and storage. Production architectures usually have three basic planes to support them:

  • The compute plane, which is almost certainly going to be either Docker or Kubernetes somehow.
  • The network plane, which will be a Virtual Private Cloud (VPC) or overlay network that knits clusters together.
  • The storage plane, which is usually the annoying exercise left to the reader, leading you to make yet another case for either using NFS or sparkly NFS like Longhorn.

Storage is the sticky bit; it’s not really changed since the beginning. You either use a POSIX-compatible key-value store or an S3-compatible key-value store. Both are used in practically the same ways that the framers intended in the late 80’s and 2006 respectively. You chuck bytes into the system with a name, and you get the bytes back when you give the name.

Storage is the really important part of your workloads. Your phone would not be as useful if it didn’t remember your list of text messages when you rebooted it. Many applications also (reasonably) assume that storage always works, is fast enough that it’s not an issue, and is durable enough that they don’t have to manually make backups.

What about latency? Human reaction time is about 250 milliseconds on average. It takes about 250 milliseconds for a TCP session to be established between Berlin and us-east-1. If you move your compute between providers, is your storage plane also going to move data around to compensate?

If your storage plane doesn’t have egress costs and stores your data close to where it’s used, this eliminates a lot of local storage complexity, at the cost of additional compute time spent waiting to pull things and the network throughput for them to arrive. Somehow compute is cheaper than storage in anno dominium two-thousand twenty-four. No, I don’t get how that happened either.

Pass-by-reference semantics for the cloud

Part of the secret for how people make these production platforms is that they cheat: they don’t pass around values as much as possible. They pass a reference to that value in the storage plane. When you upload an image to the ChatGPT API to see if it’s a picture of a horse, you do a file upload call and then an inference call with the ID of that upload. This makes it easier to sling bytes around and overall makes things a lot more efficient at the design level. This is a lot like pass-by-reference semantics in programming languages like Java or a pointer to a value in Go.
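In code, the difference is just "put a key in the message, not the bytes." A hypothetical job payload might look like this; the field names and example keys are made up for illustration.

package jobs

// Instead of shoving the raw image bytes through the queue...
type inferenceJobByValue struct {
	ImageBytes []byte // megabytes per message; every hop pays for it
	Prompt     string
}

// ...you pass a reference into the storage plane and let the worker fetch it.
type inferenceJobByReference struct {
	ImageBucket string // e.g. "uploads" (made-up name)
	ImageKey    string // e.g. "user-123/horse.png" (made-up key)
	Prompt      string
}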

The big queue

The other big secret is that there’s a layer on top of all of the compute: an orchestrator with a queue.

This is the rest of the owl that nobody talks about. Just having compute, network, and storage is not good enough; there needs to be a layer on top that spreads the load between workers, intelligently minting and slaying them off as reality demands.

Okay but where’s the code?

Yeah, yeah, I get it, you want to see this live and in action. I don’t have an example totally ready yet, but in lieu of drawing the owl right now, I can tell you what you’d need in order to make it a reality on the cheap.

Let’s imagine that this is all done in one app, let’s call it orodayagzou (c.f. Ôrödyagzou, Ithkuil for “synesthesia”). This app is both a HTTP API and an orchestrator. It manages a pool of worker nodes that do the actual AI inferencing.

So let’s say a user submits a request asking for a picture of a horse. That’ll come in to the right HTTP route and it has logic like this:

type ScaleToZeroProxy struct {
	cfg         Config
	ready       bool
	endpointURL string
	instanceID  int
	lock        sync.RWMutex
	lastUsed    time.Time
}

func (s *ScaleToZeroProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	s.lock.RLock()
	ready := s.ready
	s.lock.RUnlock()

	if !ready {
		// TODO: implement instance creation
	}

	// Hold the read lock only while copying the endpoint URL so the
	// write lock below can't deadlock against it.
	s.lock.RLock()
	u, err := url.Parse(s.endpointURL)
	s.lock.RUnlock()
	if err != nil {
		panic(err)
	}

	u.Path = r.URL.Path
	u.RawQuery = r.URL.RawQuery

	next := httputil.NewSingleHostReverseProxy(u)
	next.ServeHTTP(w, r)

	s.lock.Lock()
	s.lastUsed = time.Now()
	s.lock.Unlock()
}

This is a simple little HTTP proxy in Go: it has an endpoint URL and an instance ID in memory, some logic to check if the instance is “ready”, and if it’s not, to create it. Let’s mint an instance using the Vast.ai CLI. First, some configuration:

const (
	diskNeeded       = 36
	dockerImage      = "reg.xeiaso.net/runner/sdxl-tigris:latest"
	httpPort         = 5000
	modelBucketName  = "ciphanubakfu" // lojban: test-number-bag
	modelPath        = "glides/ponyxl"
	onStartCommand   = "python -m cog.server.http"
	publicBucketName = "xe-flux"

	searchCaveats = `verified=False cuda_max_good>=12.1 gpu_ram>=12 num_gpus=1 inet_down>=450`

	// assume awsAccessKeyID, awsSecretAccessKey, awsRegion, and awsEndpointURLS3 exist
)

type Config struct {
	diskNeeded     int // gigabytes
	dockerImage    string
	environment    map[string]string
	httpPort       int
	onStartCommand string
}

Then we can search for potential machines with some terrible wrappers to the CLI:

func runJSON[T any](ctx context.Context, args ...any) (T, error) {
	return trivial.andThusAnExerciseForTheReader[T](ctx, args)
}

func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	s.lock.Lock()
	defer s.lock.Unlock()

	candidates, err := runJSON[[]vastai.SearchResponse](
		ctx,
		"vastai", "search", "offers",
		searchCaveats,
		"-o", "dph+", // sort by price (dollars per hour) increasing, cheapest option is first
		"--raw", // output JSON
	)
	if err != nil {
		return fmt.Errorf("can't search for instances: %w", err)
	}

	// grab the cheapest option
	candidate := candidates[0]

	contractID := candidate.AskContractID
	slog.Info("found candidate instance",
		"contractID", contractID,
		"gpuName", candidate.GPUName,
		"cost", candidate.Search.TotalHour,
	)
	// ...
}

Then you can try to create it:

func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	// ...
	instanceData, err := runJSON[vastai.NewInstance](
		ctx,
		"vastai", "create", "instance",
		contractID,
		"--image", s.cfg.dockerImage,
		// dump ports and envvars into the format vast.ai wants
		"--env", s.cfg.FormatEnvString(),
		"--disk", s.cfg.diskNeeded,
		"--onstart-cmd", s.cfg.onStartCommand,
		"--raw",
	)
	if err != nil {
		return fmt.Errorf("can't create new instance: %w", err)
	}

	slog.Info("created new instance", "instanceID", instanceData.NewContract)
	s.instanceID = instanceData.NewContract
	// ...

Then collect the endpoint URL:

func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	// ...
	instance, err := runJSON[vastai.Instance](
		ctx,
		"vastai", "show", "instance",
		instanceData.NewContract,
		"--raw",
	)
	if err != nil {
		return fmt.Errorf("can't show instance %d: %w", instanceData.NewContract, err)
	}

	s.endpointURL = fmt.Sprintf(
		"http://%s:%d",
		instance.PublicIPAddr,
		instance.Ports[fmt.Sprintf("%d/tcp", s.cfg.httpPort)][0].HostPort,
	)

	return nil
}

And then finally wire it up and have it test if the instance is ready somehow:

func (s *ScaleToZeroProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// ...

	if !ready {
		if err := s.mintInstance(r.Context()); err != nil {
			slog.Error("can't mint new instance", "err", err)
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		t := time.NewTicker(5 * time.Second)
		defer t.Stop()
		for range t.C {
			if ok := s.testReady(r.Context()); ok {
				break
			}
		}
	}

	// ...

Then the rest of the logic will run through, the request will be passed to the GPU instance and then a response will be fired. All that’s left is to slay the instances off when they’re unused for about 5 minutes:

func (s *ScaleToZeroProxy) maybeSlayLoop(ctx context.Context) {
	t := time.NewTicker(5 * time.Minute)
	defer t.Stop()

	for {
		select {
		case <-t.C:
			s.lock.RLock()
			lastUsed := s.lastUsed
			s.lock.RUnlock()

			if lastUsed.Add(5 * time.Minute).Before(time.Now()) {
				if err := s.slay(ctx); err != nil {
					slog.Error("can't slay instance", "err", err)
				}
			}
		case <-ctx.Done():
			return
		}
	}
}

Et voilà! Run maybeSlayLoop in the background and implement the slay() method to use the vastai destroy instance command, then you have yourself nomadic compute that makes and destroys itself on demand to the lowest bidder.
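For completeness, here's a sketch of what that slay() method could look like, shelling out to the CLI in the same spirit as the rest of this hypothetical code (assuming the usual imports like os/exec, strconv, and log/slog; double-check the exact vastai invocation against the CLI you have installed):

// slay destroys the running Vast.ai instance, if any, by shelling out to the
// CLI. This is a sketch; verify the `vastai destroy instance` flags yourself.
func (s *ScaleToZeroProxy) slay(ctx context.Context) error {
	s.lock.Lock()
	defer s.lock.Unlock()

	if s.instanceID == 0 {
		return nil // nothing to slay
	}

	cmd := exec.CommandContext(ctx, "vastai", "destroy", "instance", strconv.Itoa(s.instanceID))
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("can't destroy instance %d: %w: %s", s.instanceID, err, out)
	}

	slog.Info("slayed instance", "instanceID", s.instanceID)
	s.instanceID = 0
	s.ready = false
	return nil
}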

Of course, any production-ready implementation would have limits like “don’t have more than 20 workers” and would segment things into multiple work queues. This is all really hypothetical right now; I wish I had a thing you could kubectl apply and use today, but I don’t.

I’m going to be working on this on my Friday streams on Twitch until it’s done. I’m going to implement it from an empty folder and then work on making it a Kubernetes operator to run any task you want. It’s going to involve generative AI, API reverse engineering, eternal torment, and hopefully not getting banned from the providers I’m going to be using. It should be a blast!

Conclusion

Every workload involves compute, network, and storage on top of production’s compute plane, network plane, and storage plane. Design your production clusters to take advantage of very well-understood fundamentals like HTTP, queues, and object storage so that you can reduce your dependencies to the bare minimum. Make your app an orchestrator of vast amounts of cheap compute so you don’t need to pay for compute or storage that nobody is using while everyone is asleep.

This basic pattern is applicable to just about anything on any platform, not just AI and not just with Tigris. We hope that by publishing this architectural design, you’ll take it to heart when building your production workloads of the future so that we can all use the cloud responsibly. Certain parts of the economics of this pattern work best when egress is free (or basically free), though.

We’re excited about building the best possible storage layer based on the lessons learned building the storage layer Uber uses to service millions of rides per month. If you try us and disagree, that’s fine, we won’t nickel and dime you on the way out because we don’t charge egress costs.

When all of these concerns are made easier, all that’s left for you is to draw the rest of the owl and get out there disrupting industries.

Hello again, Kubernetes

2024-11-09 08:00:00

Previously on Xesite:

I think I made a mistake when I decided to put my cards into Kubernetes for my personal setup. It made sense at the time (I was trying to learn Kubernetes and I am cursed into learning by doing), however I don't think it is really the best choice available for my needs.

[...]

My Kubernetes setup is a money pit. I want to prioritize cost reduction as much as possible.

So after a few years of switching between a Hetzner dedi running NixOS and Docker images on Fly.io, I'm crawling back to Kubernetes for hosting my website. I'm not gonna lie, it will look like massive overkill from the outset, but consider this: Kubernetes is standard at this point. It's the boring, pragmatic choice.

Cadey is coffee
<Cadey>

Plus, every massive infrastructure crime and the inevitable ways they go horribly wrong only really serves to create more "how I thought I was doing something good but actually really fucked everything up" posts that y'all seem to like. Win/win. I get to play with fun things, you get to read about why I thought something would work, how it actually works, and how you make things meet in the middle.

I've had a really good experience with Kubernetes in my homelab, and I feel confident enough in my understanding of it to move my most important, most used, most valuable to me service over to a Kubernetes cluster. I changed it over a few days ago without telling anyone (or deploying anything, just in case). Nothing went wrong in the initial testing, so I feel comfortable enough to talk about it now.

Aeacus

Hi from the cluster Aeacus! My website is running on a managed k3s cluster via Civo. The cluster is named after one of the space elevators in an RPG where a guy found a monolith in Kenya, realized it was functionally an infinite battery, made a massive mistake, and then ended up making Welsh catgirls real (among other things).

If/when I end up making other Kubernetes clusters in the cloud, they'll probably be named Rhadamanthus and Minos (the names of the other space elevators in said world with Welsh catgirls).

Originally I was going to go with Vultr, but then I did some math on the egress of my website vs the amount of bandwidth I'd get for the cluster, and the numbers raised some eyebrows. I don't do terrifying amounts of egress bandwidth, but sometimes I have months where I'm way more popular than other months and those "good" months would push me over the edge.

I also got a warning from a friend that Vultr vastly oversubscribes their CPU cores, so you get very, very high levels of CPU steal. Most of the time, my CPU cores are either idle or very close to idle; but when I do a build for my website in prod, the entire website blocks until it's done.

This is not good for availability.

Cadey is coffee
<Cadey>

When I spun up a test cluster on Vultr, I did notice that the k3s nodes they were using were based on Ubuntu 22.04 instead of 24.04. I get that 24.04 is kinda new and they haven't moved things over yet, but it was kind of a smell that something might be up.

I'm gonna admit, I hadn't heard of Civo cloud until someone in the Kubernetes homelab Discord told me about them, but there's one key thing in their pricing that made me really consider them:

At Civo, data transfer is completely free and unlimited - we do not charge for egress or ingress at all. Allowing you to move data freely between Civo and other platforms without any costs or limitations. No caveats, No fineprint. No surprise bills.

This is basically the entire thing that sold me. I've been really happy with Civo. I haven't had a need to rely on their customer support yet, but I'll report back should I need to.

Worst case, it's all just Kubernetes, I can set up a new cluster and move everything over without too much risk.

That being said, here's a short list of things that in a perfect world I wish I could either control, influence, or otherwise have power over:

  • I wish I could change the default cluster DNS domain to aeacus.xeserv.us so that the DNS names are globally unique, letting me interconnect it with my homelab cluster and potentially other clusters as my cloud needs expand.
  • I wish I could change the CIDR ranges for the Pod and Service networks so that they don't collide with the ranges in my homelab cluster. Maybe this is what 4via6 style routing is for? (There's a sketch of both of these knobs right after this list.)
  • I tried their Talos cluster option first but wasn't able to get HTTPS routing working; changing over to the k3s cluster option fixed everything. I'm not sure what's going on and will need to work with their community Slack to diagnose it further.
  • Civo is IPv4 only. I get why this is (IPv6 kinda sucks from a user education and systems administration standpoint), but I wish I had native dual-stack support on my cluster.
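
For reference, on a self-managed k3s server both of those first two wishes are just keys in /etc/rancher/k3s/config.yaml (or the equivalent command-line flags). The CIDR values below are made up, but it would look something like this; Civo's managed control plane is exactly the layer where I lose access to these knobs:

# /etc/rancher/k3s/config.yaml on a hypothetical self-managed server
cluster-domain: aeacus.xeserv.us # defaults to cluster.local
cluster-cidr: 10.52.0.0/16       # Pod network range (made-up value)
service-cidr: 10.53.0.0/16       # Service network range (made-up value)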

And here's a few things I learned about my setup in particular that aren't related to Civo cloud, but worth pointing out:

  • I tried to set up a service to point to both my homelab and Civo via external-dns, but it turns out external-dns doesn't support this kind of round-robin DNS configuration with multiple clusters, and the issue tracking it has been through four generations of stalebot auto-closing it. I get why things like stalebot exist, but good god is it a pox on the industry.
  • In my homelab, I have Flannel as the Container Networking Interface (CNI). Vultr had Calico. Civo has Cilium. I realize that, for my purposes, it shouldn't matter that each of these clusters has a different CNI implementation, but I'm probably gonna have to take some action towards standardizing them in my setup. Might move the homelab over to Cilium or something. I don't know.

Either way, I moved over pronouns.within.lgbt to proof-of-concept the cluster beyond a hello world test deployment. That worked fine.

To be sure that things worked, I employed the industry standard "scream test" procedure where you do something that could break, test it to hell on your end, and see if anyone screams about it being down. Coincidentally, a friend was looking through it during the breaking part of the migration (despite my efforts to minimize the breakage) and noticed the downtime. They let me know immediately. I was so close to pulling it off without a hitch.

xesite and its infrastructure consequences have been a disaster for my wildest dreams of digital minimalism

Like any good abomination, my website has a fair number of moving parts, most of which you don't see. Here's what the infrastructure of my website looks like:

A diagram showing how Xesite, Mi, Mimi, patreon-saasproxy, and a bunch of web services work together.

This looks like a lot, and frankly, it is a lot. Most of this functionality is optional and degrades cleanly too. By default, when I change anything on GitHub (or someone subscribes/unsubscribes on Patreon), I get a webhook that triggers the site to rebuild. The rebuild will trigger fetching data from Patreon, which may trigger fetching an updated token from patreon-saasproxy. Once the build is done, a request to announce new posts will be made to Mi. Mi will syndicate any new posts out to Bluesky, Mastodon, Discord, and IRC.
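
If you squint, the whole pipeline is basically one webhook handler. Here's a Go sketch of the shape of it; rebuildSite, announceToMi, the route, and the port are placeholders for illustration, not the real API surface of any of these services:

package main

import (
	"context"
	"log/slog"
	"net/http"
)

// rebuildSite stands in for the real build step, which pulls Patreon data
// (via patreon-saasproxy when the token needs refreshing) and regenerates the site.
func rebuildSite(ctx context.Context) error { return nil }

// announceToMi stands in for the real call to Mi, which syndicates any new
// posts out to Bluesky, Mastodon, Discord, and IRC.
func announceToMi(ctx context.Context) error { return nil }

func handleWebhook(w http.ResponseWriter, r *http.Request) {
	// Ack the webhook right away and do the slow work in the background.
	go func() {
		ctx := context.Background()
		if err := rebuildSite(ctx); err != nil {
			slog.Error("rebuild failed", "err", err)
			return
		}
		if err := announceToMi(ctx); err != nil {
			slog.Error("can't announce new posts", "err", err)
		}
	}()
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/webhook/github", handleWebhook)
	slog.Error("server exited", "err", http.ListenAndServe(":3000", nil))
}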

Mara is hacker
<Mara>

The pattern of publishing on your own site and then announcing those posts out elsewhere is known as POSSE (Publish On your Site, Syndicate Elsewhere). It's a pretty neat pattern!

This, sadly, is an idealized diagram of the world I wish I could have. Here's what the real state of the world looks like:

A diagram showing how Xesite relies on patreon-saasproxy hosted on fly.io.

I have patreon-saasproxy still hosted on fly.io. I'm not sure why the version on Aeacus doesn't work, but trying to use it makes it throw an error that I really don't expect to see:

{
  "time": "2024-11-09T09:12:17.76177-05:00",
  "level": "ERROR",
  "source": {
    "function": "main.main",
    "file": "/app/cmd/xesite/main.go",
    "line": 54
  },
  "msg": "can't create patreon client",
  "err": "The server could not verify that you are authorized to access the URL requested. You either supplied the wrong credentials (e.g. a bad password), or your browser doesn't understand how to supply the credentials required."
}

I'm gonna need to figure out what's going on later, but I can live with this for now. I connect back to Fly.io using their WireGuard setup with a little sprinkle of userspace WireGuard. It works well enough for my needs.

Xesite over Tor

In the process of moving things over, I found out that there's a Tor hidden service operator for Kubernetes. This is really neat and lets me set up a mirror of this website on the darkweb. If you want or need to access my blog over Tor, you can use gi3bsuc5ci2dr4xbh5b3kja5c6p5zk226ymgszzx7ngmjpc25tmnhaqd.onion to do that. You'll be connected directly over Tor.

I configured this as a non-anonymous hidden service using a setup like this:

apiVersion: tor.k8s.torproject.org/v1alpha2
kind: OnionService
metadata:
  name: xesite
spec:
  version: 3
  extraConfig: |
    HiddenServiceNonAnonymousMode 1
    HiddenServiceSingleHopMode 1
  rules:
    - port:
        number: 80
      backend:
        service:
          name: xesite
          port:
            number: 80

This creates an OnionService set up to point directly to the backend that runs this website. Doing this bypasses the request logging that the nginx ingress controller does. I do not log requests made over Tor unless you somehow manage to get one of the things you're requesting to throw an error; even then, I'll only log details about the error so I can investigate it later.

If you're already connected with the Tor browser, you may have noticed the ".onion available" in your address bar. This is because I added a middleware for adding the Onion-Location header to every request. The Tor browser listens for this header and will alert you to it.
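
That middleware is tiny. Here's a minimal net/http sketch of the idea; this isn't necessarily how it's wired up in my cluster (it could just as easily live in the ingress), and the port and document root are made up, but it shows what the header does:

package main

import "net/http"

// The onion address for this site, as mentioned above.
const onionHost = "gi3bsuc5ci2dr4xbh5b3kja5c6p5zk226ymgszzx7ngmjpc25tmnhaqd.onion"

// withOnionLocation advertises the hidden service mirror to Tor Browser by
// setting the Onion-Location header on every response.
func withOnionLocation(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Onion-Location", "http://"+onionHost+r.URL.RequestURI())
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/", http.FileServer(http.Dir("./public"))) // stand-in for the real site handler
	http.ListenAndServe(":3000", withOnionLocation(mux))
}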

I'm not sure how the Tor hidden service will mesh with the ads from EthicalAds, but I'd imagine that looking at my website over Tor functionally disables them.

I killed the zipfile

One of the most controversial things about my website's design is that everything was served out of a .zip file full of gzip streams. This was originally done so that I could implement a fastpath hack to serve gzip compressed streams to people directly. This would save a bunch of bandwidth, make things load faster, save christmas from the incoming elf army, etc.

Cadey is coffee
<Cadey>

Guess what I never implemented.

This zipfile strategy worked, for the most part. One of the biggest ways this didn't pan out is that I didn't support HTTP Range requests. Normally this isn't an issue, but Slack, LinkedIn, and other web services use them when requesting a page to unfurl links posted by users.

This has been a known issue for a while, but I decided to just fix it forever by making the website serve itself from the generated directory instead of putting the zipfile in the serving path. I still use the zipfile for the preview site (I'm okay with that thing's functionality being weird), but for the main site, yeah, it's gone.

If I ever migrate my website to use CI to build the website instead of having prod build it on-demand, I'll likely use the zipfile as a way to ship around the website files.

Crimes with file storage

Like any good Xe project, I had to commit some crimes somewhere, right? This time I implemented them at the storage layer. My website works by maintaining a git clone of its own repository and then running builds out of it. This is how I'm able to push updates to GitHub and then have it go live in less than a minute.

The main problem with this is that it can make cold start times long. Very long. Long enough that Kubernetes will think that the website isn't in a cromulent state and then slay it off before it can run the first build. I fixed this by making the readiness check run every 5 seconds for 5 minutes, but I realized there was a way I could do it better: I can cache the website checkout on the underlying node's filesystem.
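
Concretely, "every 5 seconds for 5 minutes" is just a generous readiness probe on the Deployment, something along these lines (the path and port are illustrative):

readinessProbe:
  httpGet:
    path: /              # anything that only answers once the first build has finished
    port: 3000
  periodSeconds: 5       # poll every 5 seconds...
  failureThreshold: 60   # ...and tolerate up to 5 minutes of "not ready yet"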

So I use a hostPath volume to do this:

- name: data
  hostPath:
    path: /data/xesite
    type: DirectoryOrCreate

Aoi is wut
<Aoi>

Isn't this a very bad idea?

The Kubernetes documentation even warns: "Using the hostPath volume type presents many security risks. If you can avoid using a hostPath volume, you should. For example, define a local PersistentVolume, and use that instead."

Shouldn't you use a PersistentVolumeClaim instead?

Normally, yes, this is a bad idea. However, a PersistentVolumeClaim doesn't really work here because of how Civo's native Container Storage Interface driver works: it only supports the ReadWriteOnce access mode, which means my website could only run on one Kubernetes node at a time. I'd like my website to be more nomadic between nodes, which would require a ReadWriteMany mount so that the same folder can be used on different nodes.
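
For the record, what I'd actually want is a claim like this, which Civo's CSI driver simply won't provision (the name and size are made up):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xesite-data
spec:
  accessModes:
    - ReadWriteMany # Civo's CSI only does ReadWriteOnce, hence the hostPath crime
  resources:
    requests:
      storage: 5Gi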

I'll figure out a better solution eventually, but for now I can get away with just stashing the data in /data/xesite on the raw node filesystems and it'll be fine. My website doesn't grow at a rate where this would be a practical issue, and should this turn out to actually be a problem I can always reprovision my nodes as needed.

Declaring success

I'm pretty sure this is way more than good enough for now and should cover the next few years of infrastructure needs. Worst case, it's just Kubernetes; I can move it anywhere else that has Kubernetes without too much fuss.

I'd like to mount the Deno cache in Tigris or something using csi-s3, but that's not a priority right now. This would only help with cold start latency, and to be honest the cold start latency right now is fine. Not the most ideal, but fine.

Everything else is just a matter of implementation more than anything at this point.

Hope this look behind the scenes was interesting! I put this level of thought and care into things so that you don't have to care about how things work.