2025-03-09 22:51:55
Neat interactive visualization of Uber's H3 hexagonal geographical indexing mechanism.
Here's the source code.
Tags: geospatial, javascript
2025-03-09 07:19:51
I presented two sessions at the NICAR 2025 data journalism conference this year. The first was this one based on my review of LLMs in 2024, extended by several months to cover everything that's happened in 2025 so far. The second was a workshop on Cutting-edge web scraping techniques, which I've written up separately.
Here are the slides and detailed notes from my review of what's new in LLMs, with a focus on trends that are relative to data journalism.
I wrapped up with a Q&A and an invitation: if you work in a newsroom that is figuring this stuff out I would love to jump on a Zoom call and talk to your team. Contact me at swillison@
Google's webmail provider.
Tags: data-journalism, speaking, ai, generative-ai, llms, annotated-talks, nicar, vision-llms
2025-03-09 03:25:36
Cutting-edge web scraping techniques at NICAR
Here's the handout for a workshop I presented this morning at NICAR 2025 on web scraping, focusing on lesser know tips and tricks that became possible only with recent developments in LLMs.For workshops like this I like to work off an extremely detailed handout, so that people can move at their own pace or catch up later if they didn't get everything done.
The workshop consisted of four parts:
- Building a Git scraper - an automated scraper in GitHub Actions that records changes to a resource over time
- Using in-browser JavaScript and then shot-scraper to extract useful information
- Using LLM with both OpenAI and Google Gemini to extract structured data from unstructured websites
- Video scraping using Google AI Studio
I released several new tools in preparation for this workshop (I call this "NICAR Driven Development"):
I also came up with a fun way to distribute API keys for workshop participants: I had Claude build me a web page where I can create an encrypted message with a passphrase, then share a URL to that page with users and give them the passphrase to unlock the encrypted message. You can try that at tools.simonwillison.net/encrypt - or use this link and enter the passphrase "demo":
Tags: shot-scraper, gemini, nicar, openai, git-scraping, ai, speaking, llms, scraping, generative-ai, claude-artifacts, ai-assisted-programming, claude
2025-03-09 01:13:30
Politico: 5 Questions for Jack Clark
I tend to ignore statements with this much future-facing hype, especially when they come from AI labs who are both raising money and trying to influence US technical policy.Anthropic's Jack Clark has an excellent long-running newsletter which causes me to take him more seriously than many other sources.
Jack says:
In 2025 myself and @AnthropicAI will be more forthright about our views on AI, especially the speed with which powerful things are arriving.
In response to Politico's question "What’s one underrated big idea?" Jack replied:
People underrate how significant and fast-moving AI progress is. We have this notion that in late 2026, or early 2027, powerful AI systems will be built that will have intellectual capabilities that match or exceed Nobel Prize winners. They’ll have the ability to navigate all of the interfaces… they will have the ability to autonomously reason over kind of complex tasks for extended periods. They’ll also have the ability to interface with the physical world by operating drones or robots. Massive, powerful things are beginning to come into view, and we’re all underrating how significant that will be.
Via @jackclarksf
Tags: jack-clark, anthropic, ai
2025-03-08 13:39:25
Apple Is Delaying the ‘More Personalized Siri’ Apple Intelligence Features
Apple told John Gruber (and other Apple press) this about the new "personalized" Siri:It’s going to take us longer than we thought to deliver on these features and we anticipate rolling them out in the coming year.
I have a hunch that this delay might relate to security.
These new Apple Intelligence features involve Siri responding to requests to access information in applications and then perform actions on the user's behalf.
This is the worst possible combination for prompt injection attacks! Any time an LLM-based system has access to private data, tools it can call and exposure to potentially malicious instructions (like emails and text messages from untrusted strangers) there's a significant risk that an attacker might subvert those tools and use them to damage or exfiltration a user's data.
I published this piece about the risk of prompt injection to personal digital assistants back in November 2023, and nothing has changed since then to make me think this is any less of an open problem.
Tags: apple, ai, john-gruber, llms, prompt-injection, security, apple-intelligence, generative-ai
2025-03-08 07:19:47
State-of-the-art text embedding via the Gemini API
Gemini just released their new text embedding model, with the snappy namegemini-embedding-exp-03-07
. It supports 8,000 input tokens - up from 3,000 - and outputs vectors that are a lot larger than their previous text-embedding-004
model - that one output size 768 vectors, the new model outputs 3072.
Storing that many floating point numbers for each embedded record can use a lot of space. thankfully, the new model supports Matryoshka Representation Learning - this means you can simply truncate the vectors to trade accuracy for storage.
I added support for the new model in llm-gemini 0.14. LLM doesn't yet have direct support for Matryoshka truncation so I instead registered different truncated sizes of the model under different IDs: gemini-embedding-exp-03-07-2048
, gemini-embedding-exp-03-07-1024
, gemini-embedding-exp-03-07-512
, gemini-embedding-exp-03-07-256
, gemini-embedding-exp-03-07-128
.
The model is currently free while it is in preview, but comes with a strict rate limit - 5 requests per minute and just 100 requests a day. I quickly tripped those limits while testing out the new model - I hope they can bump those up soon.
Via @officiallogank
Tags: embeddings, gemini, ai, google, llm