Blog of Maggie Appleton

Designer, anthropologist, and mediocre developer. She/her.

May 2025

2025-05-25 08:00:00

In a wonderfully dramatic change to my life, I became a mother two months ago. My son was born at the end of March via an unplanned but otherwise uncomplicated c-section. Parenthood has been predictably overwhelming, exhausting, and existentially glorious.

My days are now spent holding a sleeping newborn on my chest, timing wake windows, picking up the dropped pacifier for the 19th time, trying to eat with 0.5 hands free, and watching an eternal stream of Gilmore Girls episodes on a precariously balanced iPhone while feeding/burping/soothing/rocking/patting this tiny human. It swings between hard, high-cortisol physical labour and extremely chill, serene joy a dozen times throughout the day and night.

I had doubts about becoming a mother when I was younger. Mostly related to systemic gender inequality, believing I would need to sacrifice my whole career for it, and thinking myself incapable of bearing the responsibility (which, to be fair, I was before age ~28). I spent a solid year in angst and turmoil trying to figure it out. All the parents around me only shared details of how stressful, sleep-deprived, expensive, and burdensome their new lives were. Perhaps because it felt too trite or vulnerable to put into words the love, joy, and purpose that comes with it.

Being on the other side, I now realise there was no calculation or algorithm or pro/con list or financial spreadsheet that could have helped me understand what it would feel like. Nothing that would do justice to the emotional weight of holding your sleeping baby that you made with your own body. Of watching them grin back at you with uncomplicated joy. Of realising you'll get to watch them grow into a full person; one that is – at least genetically – half you and half the person you love most in the world. Of watching them trip out as they realise they have hands.

I can now say with certainty I am evolutionarily wired for this. Perhaps not everyone is. But everything in me is designed to feel existential delight at each little fart, squeak, grunt, and sneeze that comes out of this child. Delight that is unrivalled by any successful day at work, fully shipped feature, long cathartic run, or Sunday morning buttery croissant – the banal highlights of my past life. When I think back to my pre-baby self, trying to calculate herself into a clear decision, I wish I could let her feel for one minute what it's like to hold him. And tell her I can't believe I ever considered depriving myself of this.

In other news, I've read no books (other than Your Baby Week by Week and Secrets of the Baby Whisperer), had few higher-order thoughts, and binge-watched all of Motherland. As this child learns to sleep in more predictable ways, I'm looking forward to being less of a zombie and engaging with the world again.

Statistically, When Will My Baby Be Born?

2025-03-24 08:00:00

A tiny tool to calculate when your baby might arrive
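The premise behind the tool: babies rarely arrive on their due date, but gestation lengths cluster tightly enough that you can estimate arrival odds from a distribution. Here's a minimal sketch of that idea in Python, assuming a normal approximation with commonly cited figures (mean of 280 days from the last period, standard deviation of roughly 10 days) – these are illustrative numbers, not necessarily the ones the actual tool uses.

```python
# A back-of-the-envelope due-date calculator using a normal approximation.
# MEAN_DAYS and SD_DAYS are commonly cited figures, assumed for illustration;
# real gestation data is slightly skewed rather than perfectly normal.
from math import erf, sqrt

MEAN_DAYS = 280  # average gestation length from last menstrual period
SD_DAYS = 10     # approximate spread

def chance_born_by(days: int) -> float:
    """P(baby has arrived by the given day of pregnancy), via the normal CDF."""
    z = (days - MEAN_DAYS) / SD_DAYS
    return 0.5 * (1 + erf(z / sqrt(2)))

for week in (38, 39, 40, 41, 42):
    p = chance_born_by(week * 7)
    print(f"By week {week}: {p:.0%} chance the baby has arrived")
```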

ChatGPT Would be a Decent Policy Advisor

2025-03-14 00:30:58

Revealed: How the UK tech secretary uses ChatGPT for policy advice

The New Scientist used freedom of information laws to get the ChatGPT records of the UK's technology secretary.

The headline hints at a damning exposé, but the piece ends up being a story about a politician making pretty reasonable, sensible use of language models to stay informed and make better policy decisions.

He asked it why small business owners are slow to adopt AI, which popular podcasts he should appear on, and to define terms like antimatter and digital inclusion.

This all seems extremely fine to me. Perhaps my standards for politicians are too low, but I assume they don't actually know much and rely heavily on advisors to define terms for them and decide on policy improvements. And I think ChatGPT connected to some grounded sources would be a decent policy advisor. Better than most human policy advisors. At least when it comes to consistency, rapidly searching and synthesising lots of documents, and avoiding personal bias. Models still carry the bias of their creators, but it all becomes a trade-off between human flaws and model flaws.

Claiming language models should have anything to do with national governance feels slightly insane. But we're also sitting in a moment where Trump and Musk are implementing policies that trigger trade wars and crash the U.S. economy. And I have to think "What if we just put Claude in charge?"

I joke. Kind of.

March 2025

2025-03-05 08:00:00

Well, I've had a dramatic start to the year. Normally – the design agency I joined a short eight months ago – unexpectedly closed down in January. Despite running for a decade and working with almost every major tech company, client work slowed down and the founders decided to close up shop.

It's been a sad time. Everyone I worked with there was exceptionally talented and kind. I'm thankful I got to build with them for a short while.

I was already due to start maternity leave in March, so Normally closing just moved that date up a bit sooner. But I managed to fit in a couple of months of work with Deep Mirror before taking my baby break. They're a London-based startup using machine learning to speed up the drug discovery process, specifically by helping medicinal chemists generate ideas for new molecules.

While I was completely new to the field of drug discovery, many of the design challenges echoed the ones I'd worked on with Elicit – complex research workflows, information-dense interfaces, and making the inner workings of models and their reasoning process visible to users. I've learned I like this shape of work; AI/ML tools designed to help scientific researchers who have high standards and need to thoroughly understand how models “reason” and how answers are generated. It's fertile ground for responsible AI interface design.

My baby break has now started. Only two weeks remain until the new human arrives. A terrifyingly short timeline. Luckily, the excitement of meeting our child and the physical discomfort of late pregnancy outweigh any fears about birth or the impending marathon of sleep deprivation. I'd happily start labour tomorrow if I had any say in the matter.

Given that I won't be in a 9-5 job for the next six months, I've stocked up on new books. Though it's naïve to think I'll have the mental capacity to read any of them in between baby feedings and waking up a dozen times a night. But one can hope. I've added the full pile to my Antilibrary, but these are the ones I'm most excited about:

Soldiers and Kings: Survival and Hope in the World of Human Smuggling by Jason De León (https://www.google.co.uk/books/edition/Soldiers_and_Kings/EzPBEAAAQBAJ)

This got my attention when it started popping up on all the “best of” ethnography lists in 2024, and then went on to win the National Book Award for Nonfiction. I expect it to be a slightly intense read, but well-researched ethnographies are my favourite genre.

Cue the Sun! The Invention of Reality TV by Emily Nussbaum (https://www.google.co.uk/books/edition/Cue_the_Sun/GObnEAAAQBAJ)

Like most of us, I have a love/hate/fascination/repulsion relationship with reality TV. I've watched my fair share of trash series, but will happily defend (most of) them as time well spent. They're always insightful windows into our collective value systems and cultural narratives, and I'm keen to read Nussbaum's critical take on the medium.

The Invention of Nature: The Adventures of Alexander von Humboldt by Andrea Wulf (https://www.google.co.uk/books/edition/The_Invention_of_Nature/w1WNBQAAQBAJ)

Given my long-standing preoccupation with how we try to define and divide “nature” from “culture”, it's about time I did a bit more historical reading into the origins of this cultural dichotomy.

I've been using a bit of my pre-baby time to build as well. I added a new section to this garden called Smidgeons. These are teeny tiny posts: links with a bit of commentary, research papers I enjoyed, or one-liners that would otherwise go on Bluesky.

I'm also quite deep into a new research project and set of prototypes I'm calling Lodestone. It's an exploration of how language models might be able to get us to think more, not less. Specifically, I'm interested in whether models can enable me to be a better critical thinker and rigorous writer. Not by writing for me, but by guiding me through a well-defined process of understanding what claims I'm making, what evidence I have to support them, and how my argument structure fits together. I'm tackling it from a few angles, but here are some previews from the latest prototype:

The code is all open source on GitHub, though it'll evolve a lot from here. I'll publish more about it soon, but the ideas still feel early and my thesis is unproven. I'll wait until it all gels together a bit more.
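To make that more concrete, here's a rough sketch of the flavour of interaction I mean – asking a model to surface claims and their supporting evidence rather than write prose for you. This is purely illustrative and not Lodestone's actual code; the OpenAI client and the gpt-4o model name are stand-in assumptions.

```python
# Illustrative sketch only – not Lodestone's implementation.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

PROMPT = """Read the draft below. List each distinct claim the author makes,
and for each claim, note what evidence (if any) the draft offers for it.
Do not rewrite the draft or add your own arguments.

Draft:
{draft}"""

def extract_claims(draft: str) -> str:
    """Ask the model to map claims to evidence, rather than generate prose."""
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in choice; any capable chat model works
        messages=[{"role": "user", "content": PROMPT.format(draft=draft)}],
    )
    return response.choices[0].message.content

print(extract_claims("Benchmarks are too easy, so new models all score above 90%."))
```

The design choice that matters here is the constraint in the prompt: the model critiques and maps structure but is explicitly told not to write, which keeps the thinking on the human's side.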

I should mention that starting this summer I'll be looking for a new role as a Design Engineer or technically-inclined Product Designer. I'm planning to be on maternity leave until early September, but I'm happy to start talking to companies, teams, and founders now if you think we could be a good fit. Just email hello at maggieappleton.com or DM me on Bluesky.

Humanity's Last Exam

2025-02-20 19:42:03

Humanity's Last Exam

We have a new(ish) benchmark, cutely named “Humanity's Last Exam.”

If you're not familiar with benchmarks, they're how we measure the capabilities of particular AI models like o1 or Claude Sonnet 3.5. Each one is a standardised test designed to check a specific skill set.

For example:

  • MMLU (Massive Multitask Language Understanding) measures understanding across 57 academic subjects including STEM, social science, and the humanities.
  • HumanEval measures code generation skills.
  • GPQA (Graduate-Level Google-Proof Q&A Benchmark) measures correctness on a set of questions written by PhD students and domain experts in biology, physics, and chemistry.

When you run a model on a benchmark it gets a score, which allows us to create leaderboards showing which model is currently the best for that test. To make scoring easy, the answers are usually formatted as multiple choice, true/false, or unit tests for programming tasks.
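Scoring itself is usually nothing fancier than comparing the model's answers against an answer key and reporting the fraction correct. A minimal sketch of that idea, with made-up question IDs and answers:

```python
# Minimal sketch of multiple-choice benchmark scoring.
# Question IDs and answers are hypothetical, not real benchmark items.

def score(model_answers: dict[str, str], answer_key: dict[str, str]) -> float:
    """Return accuracy: the fraction of questions answered correctly."""
    correct = sum(
        1 for qid, answer in model_answers.items()
        if answer_key.get(qid) == answer
    )
    return correct / len(answer_key)

answer_key = {"q1": "B", "q2": "D", "q3": "A"}
model_answers = {"q1": "B", "q2": "C", "q3": "A"}

print(f"Accuracy: {score(model_answers, answer_key):.0%}")  # Accuracy: 67%
```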

One of the many problems with using benchmarks as a stand-in for “intelligence” (beyond the fact that they're multiple-choice standardised tests – do you think that's a reasonable measure of human capabilities in the real world?) is that our current benchmarks aren't hard enough.

New models routinely achieve 90%+ on the best ones we have. So there's a clear need for harder benchmarks to measure model performance against.

Hence, Humanity's Last Exam.

Built by Scale AI and the Center for AI Safety, it crowdsources "the hardest and broadest set of questions ever" from experts across domains. There are 2,700 questions at the moment, some of which are being kept private to prevent future models from training on the dataset and memorising answers ahead of time. Questions like this:

[Image: samples of the diverse and challenging questions submitted to Humanity's Last Exam.]

So far, it's doing its job well – the highest-scoring model is OpenAI's Deep Research at 26.6%, with other common models like GPT-4o, Grok, and Claude only getting 3-4% correct. Maybe it'll last a year before we have to design the next “last exam.”

A quick note on benchmarks and sweeping generalisations

When people make sweeping statements like “language models are bullshit machines” or “ChatGPT lies,” it usually tells me they're not seriously engaged in any kind of AI/ML work or productive discourse in this space.

First, because saying a machine “lies” or “bullshits” implies motivated intent in a social context, which language models don't have. Models doing statistical pattern matching aren't purposefully trying to deceive or manipulate their users.

And second, broad generalisations about “AI”'s correctness, truthfulness, or usefulness are meaningless outside of a specific context. Or rather, a specific model measured on a specific benchmark or reproducible test.

So, next time you hear someone making grand statements about AI capabilities (both critical and overhyped), ask: which model are they talking about? On what benchmark? With what prompting techniques? With what supporting infrastructure around the model? Everything is in the details, and the only way to be a sensible thinker in this space is to learn about the details.

DeepSeek

2025-01-26 18:00:35

If you're not distressingly embedded in the torrent of AI news on Twixxer like I reluctantly am, you might not know what DeepSeek is yet. Bless you.

From what I've gathered:

  • On January 20th, a Chinese company named DeepSeek released a new reasoning model called R1.
  • A reasoning model is a large language model told to “think step-by-step” before it gives a final answer. This “chain of thought” technique dramatically improves the quality of its answers. These models are also fine-tuned to perform well on complex reasoning tasks.
  • R1 reaches equal or better performance on a number of major benchmarks compared to OpenAI's o1 (our current state-of-the-art reasoning model) and Anthropic's Claude Sonnet 3.5 but is significantly cheaper to use.
  • DeepSeek R1 is open-source, meaning you can download it and run it on your own machine.
  • They offer API access at a much lower cost than OpenAI or Anthropic. But given this is a Chinese model, and the current political climate is “complicated,” and they're almost certainly training on input data, don't put any sensitive or personal data through it.
  • You can use R1 online through the DeepSeek chat interface. You can turn on both reasoning and web search to inform its answers. Reasoning mode shows you the model “thinking out loud” before returning the final answer.

[Image: DeepSeek R1 showing its thinking.]

  • You can use Ollama to run R1 on your own machine (see the sketch after this list), but standard personal laptops won't be able to handle the larger, more capable versions of the model (32B+). You'll have to run the smaller 8B or 14B version, which will be slightly less capable. I have the 14B version running just fine on a MacBook Pro with an Apple M1 chip. Here's a Reddit guide on getting it running locally.
  • DeepSeek claims it only cost $5.5 million to train the model, compared to an estimated $41-78 million for GPT-4. If true, building state-of-the-art models is no longer just a billionaire's game.
  • The thoughtbois of Twixxer are winding themselves into knots trying to theorise what this means for the U.S.-China AI arms race. A few people have referred to this as a “sputnik moment.”
  • From my initial, unscientific, unsystematic explorations with it, it's really good. I'm using it as my default LM going forward (for tasks that don't involve sensitive data). Quirks include being way too verbose in its reasoning explanations and using lots of Chinese-language sources when it searches the web, which makes it challenging to validate whether claims match the source texts.
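For the curious, here's a minimal sketch of querying a locally running R1 through Ollama's HTTP API. It assumes you've already pulled a model (e.g. with "ollama pull deepseek-r1:14b") and that the Ollama server is running on its default port; the prompt is just an example.

```python
# Minimal sketch: query a local DeepSeek R1 via Ollama's HTTP API.
# Assumes the model has been pulled and the Ollama server is running
# on its default port (11434).
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:14b",
        "prompt": "How many prime numbers are there between 1 and 20?",
        "stream": False,
    },
)

# R1 wraps its chain of thought in <think>...</think> tags before
# giving the final answer, so you can see the reasoning inline.
print(response.json()["response"])
```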

Here's the announcement Tweet:

TLDR high-quality reasoning models are getting significantly cheaper and more open-source. This means companies like Google, OpenAI, and Anthropic won't be able to maintain a monopoly on access to fast, cheap, good quality reasoning. This is net good for everyone.