
Bear Blog Trending Posts

Ranked according to the following algorithm: Score = log10(U) + (S / D * 8600), where U is the number of upvotes and S / D is a time term.
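The feed describes S / D only as "time", so the exact variables are underspecified. A minimal sketch of how such a score could be computed, assuming (purely as an illustration) that S is seconds elapsed since a fixed reference date and D is seconds per day — both of these are guesses, not the feed's actual definitions:

```python
import math
from datetime import datetime, timezone

# Hypothetical interpretation of the feed's formula:
#   Score = log10(U) + (S / D * 8600)
# U is upvotes (stated in the feed). S and D are only described as "time",
# so this sketch ASSUMES S = seconds since a fixed epoch and D = seconds/day.
EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)  # assumed reference point
D = 86_400  # assumed: seconds per day

def trending_score(upvotes: int, published: datetime) -> float:
    s = (published - EPOCH).total_seconds()
    return math.log10(max(upvotes, 1)) + (s / D * 8600)
```

Under this reading, newer posts get a larger time term (reddit-"hot"-style), while upvotes contribute logarithmically.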


Training Data is Still an Open Problem

2026-04-24 02:03:00

We spend most of our time working on training data. If someone told me this a year ago, it would have surprised me. The general thinking was that brute force scaling of AI models would begin to reach diminishing returns, and that data collection infrastructure like ours would be used as a tool during inference.

Teams are still scaling pretraining aggressively. Not in a brute force way, but in a much more targeted and deliberate manner. The way this is happening tends to vary between labs, but the consistent signal is a constant need for sources with high densities of “good” tokens.

There’s an over-simplified version of the AI stack that shows up in a lot of investor decks. It goes something like this:

chips -> infrastructure -> apps

Interestingly, data often ends up being a bit of a footnote in these conversations. Even some of the people closest to the largest labs grossly underestimate what “internet-scale” really means, and view data as a solved problem in the training process. A good way to think about this might be: if Google suddenly stopped scraping the whole internet every day, would they still be the best at pretraining in a year’s time? Unlikely.

Training data demand isn’t uniform, and it doesn’t fall into clean categories. The biggest AI labs have incredible amounts of compute, and although they still care about efficiency, in practice they tend to optimize more for coverage. The default behavior is to take as much data as they can within a certain domain and absorb the long tail.

Smaller labs that can’t afford to train on everything will focus on density instead of coverage. This means stricter filtering and getting as much signal as possible out of fewer tokens.

Teams looking to do the same thing might want completely different data, and at the same time, teams will do very different things with the same data.

Regardless of what teams are looking for, the workflow tends to look similar. They start broad, absorb a large dataset, and use it to figure out where the signal is coming from. From there, requests become more specific. More filtering is introduced, and sometimes this involves deploying machine annotations at scale to properly index datasets. This is especially true for multimodal data, where filtering is very expensive and is often deferred to later stages.
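The broad-then-narrow workflow above can be sketched roughly as follows. Every function, heuristic, and threshold here is hypothetical — a toy stand-in to make the stages concrete, not any lab's actual pipeline:

```python
# Illustrative sketch of the broad-to-narrow data workflow: absorb widely,
# filter for token density, then annotate so later requests can be indexed.
# All names and thresholds here are invented for illustration.

def quality_score(doc: str) -> float:
    """Toy 'density' heuristic: fraction of longish alphabetic words."""
    words = doc.split()
    if not words:
        return 0.0
    good = sum(1 for w in words if w.isalpha() and len(w) > 3)
    return good / len(words)

def annotate(doc: str) -> dict:
    """Stand-in for the machine annotation used to index data at scale."""
    return {"text": doc, "length": len(doc), "score": quality_score(doc)}

def pipeline(corpus: list[str], min_score: float = 0.5) -> list[dict]:
    # Stage 1: absorb everything (coverage).
    absorbed = list(corpus)
    # Stage 2: filter for density (signal per token).
    dense = [d for d in absorbed if quality_score(d) >= min_score]
    # Stage 3: annotate survivors so narrower future requests can be served.
    return [annotate(d) for d in dense]
```

The point of the sketch is the shape, not the heuristic: coverage first, then progressively stricter filtering, with annotation deferred to the smaller surviving set — which is why expensive steps (especially for multimodal data) come late.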

At the same time, teams don’t stop absorbing large amounts of data. Even as pipelines become more selective, coverage continues to expand in parallel. In a strange way, the data “system” never really converges. It becomes broader and narrower at the same time, all the time.

There are a few reasons why training data is still far from a solved problem (there’s probably a strong case to be made that because training data is the real alpha, it will never be “solved”). To get a great training dataset, it isn’t as simple as just crawling as many pages as possible. You have to decide what to crawl, where to crawl it, how often to revisit it, and how to deal with the fact that large parts of the web don’t want to be crawled at all. Even the largest labs have internal pipelines, but that doesn’t eliminate the problem. Coverage is always incomplete, and priorities shift quickly enough that the need for external data never really goes away.
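Two of the decisions listed above — respecting pages that don't want to be crawled, and deciding how often to revisit a source — can be made concrete with a small sketch. The robots.txt handling uses Python's standard `urllib.robotparser`; the revisit policy values are invented for illustration:

```python
from urllib import robotparser

# Sketch of two crawl decisions: honoring robots.txt and per-source
# revisit intervals. The policy numbers below are hypothetical.

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Assumed revisit policy: fast-moving sources get short intervals.
REVISIT_HOURS = {"news": 1, "forum": 6, "reference": 24 * 7}

def should_fetch(url: str, kind: str, hours_since_last: float) -> bool:
    if not rp.can_fetch("*", url):
        return False  # the site asked not to be crawled
    return hours_since_last >= REVISIT_HOURS.get(kind, 24)
```

Even this toy version shows why the problem never closes: the robots rules, the source categories, and the right intervals all drift over time and have to be re-decided continuously.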

None of this is “set and forget”. It requires constant iteration. Sources degrade, others emerge, and what might be considered high-quality data doesn’t necessarily stay that way.

At scale, even small inefficiencies compound quickly. Every day spent on data acquisition is a day not spent training, and this is time you can't get back in a race to ship better models. Most teams would rather focus on improving models than maintaining complex data pipelines.

A year ago, it was easy to think about training and inference data as separate domains. That distinction is beginning to break down. To build good web search, you need a comprehensive, high-signal crawl of the internet. The same is true for training data. The work that goes into identifying, collecting, and refining high-quality data ends up looking similar in both cases.

Good training data tends to look like good search data, and the infrastructure being built for pretraining data collection today is laying the foundation for what real-time systems will rely on in the future.

Most of this isn't visible from the outside. The scale and complexity of the training data market is easy to underestimate unless you're directly working with it.

ʕ•ᴥ•ʔノ゙ pinewind: a chill bearblog

2026-04-24 00:31:07

pinewind, a blog by kwist, gives us a calm space to enjoy the jots and thoughts of a highly interesting Japan-based hobbyist, enthusiast, and multipotentialite writer.

i can't decide if my favorite post is his thought-log on blogging or his post on personal colour palettes. both posts show the breadth of kwist's interests.

he considers his blog more of a garden. i have certainly appreciated wandering through his garden and have secretly (not anymore) taken some cuttings home with me to graft onto my own projects. please take a patient wander through pinewind, and you will experience a lovely visit:

Blackcap

2026-04-23 20:46:47

Today, I managed to get my first Blackcap on camera. He was buried deep in the bush, and it was hard to even get a clear shot of his face. I didn't even realise he was mid-song until I got home and saw the images on a bigger screen.

P4230366

Also, another couple of Robins. Just because I liked these images.

P4170076

P3260694

Epistemological disappointment

2026-04-23 18:07:00

I liked school. I was not the most diligent of students for the same reason I liked it: I just liked learning stuff. Not necessarily the stuff I was expected to learn, although I learned that too. I didn't care about the academics of it, the grades. I loved reading about things, I loved understanding things, and more than anything I loved not understanding things. I loved talking to my teachers, who wished I was a better student (that I was "más aplicada", more diligent), and having them praise my questions and direct me to other things they thought I would like learning and were out of scope for the class.

Some other kids didn't like school so much, for one or many silly or very valid reasons. And if I had to work together with them, on a presentation or some kind of research project, for example, what invariably happened was that they wanted to half-ass it and think about it as little as possible.

I am not against half-assing school projects. I think it's fine, and in fact I was very good at that myself. But I did like to think about them, often a lot, before ultimately deciding to half-ass them. So my disappointment in those team projects wasn't that of someone with an academic disposition, who wanted to work hard and get good grades, but the result of a missed opportunity. Most of the time I thought the tasks either were fun and interesting or had the potential to be made fun and interesting. But making a school task fun and interesting when it isn't out of the box requires an investment. Not "hard work", something far more nebulous. Good faith, maybe, curiosity; a willingness to let yourself be interested, intellectually engaged. You have to want to have fun (not a given) and you have to believe that this particular thing has the potential to be twisted into something that will make you have fun.

My disappointment was epistemological.

And this brings me to the present: I like learning and knowing. I like thinking about learning and knowing and thinking about thinking about learning and knowing. I am not a "result-oriented person" and find the mere existence of the phrase frankly terrifying.

When I ask you a question whose answer I could, with relative ease, find somewhere else, I am not asking it to get the answer, or at least not exclusively, but precisely to ask you the question. I care about my asking of the question, and I care about your answering the question far more than I care about your answer to it.

So when I ask you a question and you reply "let me just ask ch4tGPT" you are stabbing a knife into my heart.

People who hate thinking, and their fundamental, sneering disdain for anyone who doesn't, have been around ever since someone sat down for the first time and invented thinking. But I don't think it has ever been as ubiquitous and accepted as it is now to laugh at and quickly dismiss anyone who enjoys the distance between a question and its answer, a distance to be trodden deliberately and, yes, sometimes slowly.

I refuse to use "AI" because I refuse to be a part of the unadulterated epistemological disappointment that the world has become. I refuse to use "AI" even to automate away boring and repetitive tasks. If I ever want to automate away a boring and repetitive task I will find a way to automate it myself, maybe learning something in the process. Better yet, I will not automate it away, but think about what exactly it is that makes it boring and repetitive, or the words "repetitive" and "boring" be so often used together. I will look at the thing again, more charitably, and find something about it I can enjoy. And then I will go ahead and enjoy it.


The creative journey is not an inconvenience

2026-04-23 17:45:55

I was reading an article on Music Radar today about Suno Studio AI and the growing use of AI in music creation. Some music maker I've never heard of said that we can't beat AI, so we should either integrate it into music making or just drive Ubers for a living.

I'm pretty tired of dipshit tech bros and opportunists telling everyone that we gotta jump on the ai train now or get left behind! Apparently, in their view, making coin and amassing fame are the only things that matter. Let ai do the work while you do the art - that's another braindead winner I often hear.

So, we have AI, trained on a bunch of copyrighted music, now being sold as an easy and quick way to make the music that is currently overrunning services like Spotify and Deezer so that lazy executives and rent seekers can extract more money without paying royalties to the same music makers they owe a debt to for using their music as an AI training ground? Doesn’t that sound like the biggest scam?

Apparently, putting in the hours to make music, learning, failing, trying - that's just an inconvenience that AI solves. The journey to create art IS part of the art. It's not dispensable. It's not an inconvenience. It's the way we create meaning for ourselves. It's the method by which we grow and understand both ourselves and the world.

Reply by email: [email protected]. Subscribe to my blog via RSS feed. Find me on Mastodon.

[Badge: Written by a Human, Not by AI]
