I’m a software developer, writer, and hand crafter from the UK. I’m queer and trans.

RSS preview of Blog of Alex Wlchan

What I learnt about making websites by reading two thousand web pages

2025-05-26 18:01:21

Over the past year, I built a web archive of over two thousand web pages – my own copy of everything I’ve bookmarked in the last fifteen years. I saved each one by hand, reading and editing the HTML to build a self-contained, standalone copy of each web page.

These web pages were made by other people, many using tools and techniques I didn’t recognise. What started as an exercise in preservation became an unexpected lesson in coding: I was getting a crash course in how the web is made. Reading somebody else’s code is a great way to learn, and I was reading a lot of somebody else’s code.

In this post, I’ll show you some of what I learnt about making websites: how to write thoughtful HTML, new-to-me features of CSS, and some quirks and relics of the web.

This article is the third in a four part bookmarking mini-series:

  1. Creating a static site for all my bookmarks – why I bookmark, why I use a static site, and how it works.
  2. Building a personal archive of the web, the slow way – how I built a web archive by hand, the tradeoffs between manual and automated archiving, and what I learnt about preserving the web.
  3. Learning how to make websites by reading two thousand web pages (this article)
  4. Some cool websites from my bookmark collection (coming 2 June) – some websites which are doing especially fun or clever things with the web.

Interesting HTML tags

I know I’ve read lists of HTML tags in reference documents and blog posts, but there are some tags I’d forgotten, misunderstood, or never seen used in the wild. Reading thousands of real-world pages gave me a better sense of how these tags are actually used, and when they’re useful.

The <aside> element

MDN describes <aside> as “a portion of a document whose content is only indirectly related to the document’s main content”. That’s vague enough that I was never quite sure when to use it.

In the web pages I read, I saw <aside> used in the middle of larger articles, for things like ads, newsletter sign ups, pull quotes, or links to related articles. I don’t have any of those elements on my site, but now I have a stronger mental model of where to use <aside>. I find concrete examples more useful than abstract definitions.

I also saw a couple of sites using the <ins> (inserted text) element for ads, but I think <aside> is a better semantic fit.

The <mark> element

The <mark> element highlights text, typically with a yellow background. It’s useful for drawing visual attention to a phrase, and I suspect it’s helpful for screen readers and parsing tools as well.

I saw it used on Medium to show reader highlights, and I’ve started using it in code samples when I want to call out specific lines.
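
It doesn’t need any special markup – something like this:

<p>Reading other people’s code is a <mark>great way to learn</mark>.</p>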

The <section> element

The <section> tag is a useful way to group content on a page – more meaningful than a generic <div>. I’d forgotten about it, although I use similar tags like <article> and <main>. Seeing it used across different sites reminded me it exists, and I’ve since added it to a few projects.
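
As a sketch, using the section headings from this post:

<article>
  <section>
    <h2>Interesting HTML tags</h2>
    <p>…</p>
  </section>

  <section>
    <h2>Clever corners of CSS</h2>
    <p>…</p>
  </section>
</article>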

The <hgroup> (heading group) element

The <hgroup> tag is for grouping a heading with related metadata, like a title and a publication date:

<hgroup>
  <h1>All about web bookmarking</h1>
  <p>Posted 16 March 2025</p>
</hgroup>

This is another tag I’d forgotten, which I’ve started using for the headings on this site.

The <video> element

The <video> tag is used to embed videos in a web page. It’s a tag I’ve known about for a long time – I still remember reading Kroc Camen’s article Video is for Everybody in 2010, back when Flash was being replaced as the dominant way to watch video on the web.

While building my web archive, I replaced a lot of custom video players with <video> elements and local copies of the videos. This was my first time using the tag in anger, not just in examples.

One mistake I kept making was forgetting to close the tag, or trying to make it self-closing:

<!-- this is wrong -->
<video controls src="videos/Big_Buck_Bunny.mp4"/>

It looks like <img>, which is self-closing, but <video> can have child elements, so you have to explicitly close it with </video>.

The <progress> indicator element

The <progress> element shows a progress indicator. I saw it on a number of sites that publish longer articles – they used a progress bar to show you how far you’d read.

<label for="file">Progress:</label>
<progress id="file" max="100" value="70">70%</progress>

I don’t have a use for it right now, but I like the idea of getting OS-native progress bars in HTML – no custom JavaScript or CSS required.

The <base> element

The <base> element specifies the base URL to use for any relative URLs in a document. For example, in this document:

<base href="https://example.com/">

<img src="/pictures/cat.jpg">

the image will be loaded from https://example.com/pictures/cat.jpg.

It’s still not clear to me when you should use <base>, or what the benefits are (aside from making your URLs a bit shorter), but it’s something I’ll keep an eye out for in future projects.


Clever corners of CSS

The CSS @import rule

CSS has @import rules, which allow one stylesheet to load another:

@import "fonts.css";

I’ve used @import in Sass, but I only just realised it’s now a feature of vanilla CSS – and one that’s widely used. The CSS for this website is small enough that I bundle it into a single file for serving over HTTP (a mere 13KB), but I’ve started using @import for static websites I load from my local filesystem, and I can imagine it being useful for larger projects.

One feature I’d find useful is conditional imports based on selectors. You can already do conditional imports based on a media query (“only load these styles on a narrow screen”) and something similar for selectors would be useful too (for example, “only load these styles if a particular class is visible”). I have some longer rules that aren’t needed on every page, like styles for syntax highlighting, and it would be nice to load them only when required.
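
For comparison, the media query version looks something like this (the filename is made up):

/* Only load these styles on a narrow screen */
@import "narrow.css" screen and (max-width: 600px);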

[attr$=value] is a CSS selector for suffix values

While reading Home Sweet Homepage, I found a CSS selector I didn’t understand:

img[src$="page01/image2.png"] {
  left: 713px;
  top:  902px;
}

This $= syntax is a bit of CSS that selects elements whose src attribute ends with page01/image2.png. It’s one of several attribute selectors I hadn’t seen before – you can also match exact values, prefixes, or words in space-separated lists, and you can control whether the matching is case-sensitive or case-insensitive.
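
Here’s a quick sketch of the other variants, with made-up selectors and values:

/* exact value */
a[href="https://example.com/"] { color: green; }

/* prefix */
a[href^="https://"] { color: green; }

/* suffix */
a[href$=".pdf"] { color: green; }

/* word in a space-separated list */
a[class~="external"] { color: green; }

/* case-insensitive matching */
img[src$=".png" i] { border: 1px solid green; }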

You can create inner box shadows with inset

Here’s a link style from an old copy of the Entertainment Weekly website:

a { box-shadow: inset 0 -6px 0 #b0e3fb; }
A link on EW.com

The inset keyword was new to me: it draws the shadow inside the box, rather than outside. In this case, they’re setting offset-x=0, offset-y=-6px and blur-radius=0 to create a solid stripe that appears behind the link text – like a highlighter running underneath it.

If you want something that looks more shadow-like, here are two boxes that show the inner/outer shadow with a blur radius:

inner shadow
outer shadow

I don’t have an immediate use for this, but I like the effect, and the subtle sense of depth it creates. The box with the inner shadow looks like it’s sunk below the page, while the box with the outer shadow floats above it.
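
The two boxes use rules along these lines (the exact values are made up):

.inner-shadow { box-shadow: inset 0 0 8px rgba(0, 0, 0, 0.4); }
.outer-shadow { box-shadow:       0 0 8px rgba(0, 0, 0, 0.4); }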

For images that get bigger, cursor: zoom-in can show a magnifying glass

On gallery websites, I often saw this CSS rule used for images that link to a larger version:

cursor: zoom-in;

Instead of using cursor: pointer; (the typical hand icon for links), this shows a magnifying glass icon – a subtle cue that clicking will zoom or expand the image.

Here’s a quick comparison:

  • default – typically an arrow
  • pointer – typically a hand with a raised pointer finger, used to indicate links
  • zoom-in – a magnifying glass with a plus sign, suggesting “click to enlarge”

I knew about the cursor property, but I’d never thought to use it that way. It’s a nice touch, and I want to use it the next time I build a gallery.


Writing thoughtful HTML

The order of elements

My web pages have a simple one-column design: a header at the top, content in the middle, a footer at the bottom. I mirror that order in my HTML, because it feels like a natural structure.

I’d never thought about how to order the HTML elements in more complex layouts, when there isn’t such a clear direction. For example, many websites have a sidebar that sits alongside the main content. Which comes first in the HTML?

I don’t have a firm answer, but reading how other people structure their HTML got me thinking. I noticed several pages that put the sidebar at the very end of the HTML, then used CSS to position it visually alongside the content. That way, the main content appears earlier in the HTML file, which means it can load and become readable sooner.
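
A rough sketch of the pattern, with made-up class names – flexbox’s order property (or a grid layout) does the visual re-ordering:

<div class="layout">
  <main>Main content comes first in the HTML…</main>
  <aside class="sidebar">…but the sidebar is displayed on the left.</aside>
</div>

<style>
  .layout  { display: flex; }
  .sidebar { order: -1; width: 16em; }
</style>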

It’s something I want to consider next time I’m building a more complex page.

Comments to mark the end of large containers

I saw a lot of websites (mostly WordPress) that used HTML comments to mark the end of containers with a lot of content. For example:

<div id="primary">
  <main id="main"></main><!-- #main -->
</div><!-- #primary -->

These comments made the HTML much easier to read – I could see exactly where each component started and ended.

I like this idea, and I’m tempted to use it in my more complex projects. I can imagine this being especially helpful in template files, where HTML is mixed with template markup in a way that might confuse code folding, or make the structure harder to follow.

The data-href attribute in <style> tags

Here’s a similar idea: I saw a number of sites set a data-href attribute on their <style> tags, as a way to indicate the source of the CSS. Something like:

<style data-href="https://example.com/style.css">

I imagine this could be useful for developers working on that page, to help them find where they need to make changes to that <style> tag.

Translated pages with <link rel="alternate"> and hreflang

I saw a few web pages with translated versions, and they used <link> tags with rel="alternate" and an hreflang attribute to point to those translations. Here’s an example from a Panic article, which is available in both US English and Japanese:

<link rel="alternate" hreflang="en-us" href="https://blog.panic.com/firewatch-demo-day-at-gdc/">
<link rel="alternate" hreflang="ja"    href="https://blog.panic.com/ja/firewatch-demo-day-at-gdc-j/">

This seems to be for the benefit of search engines and other automated tools, not web browsers. If your web browser is configured to prefer Japanese, you’d see a link to the Japanese version in search results – but if you open the English URL directly, you won’t be redirected.

This makes sense to me – translations can differ in content, and some information might only be available in one language. It would be annoying if you couldn’t choose which version you wanted.

Panic’s article includes a third <link rel="alternate"> tag:

<link rel="alternate" hreflang="x-default" href="https://blog.panic.com/firewatch-demo-day-at-gdc/">

This x-default value is a fallback, used when there’s no better match for the user’s language. For example, if you used a French search engine, you’d be directed to this URL because there isn’t a French translation.

Almost every website I’ve worked on has been English-only, so internationalisation is a part of the web I know very little about.

Fetching resources faster with <link rel="preload">

I saw a lot of websites with <link rel="preload"> tags in their <head>. This tells the browser about resources that will be needed soon, so it should start fetching them immediately.

Here’s an example from this site:

<link rel="preload" href="https://alexwlchan.net/theme/white-waves-transparent.png" as="image" type="image/png"/>

That image is used as a background texture in my CSS file. Normally, the browser would have to download and parse the CSS before it even knows about the image – which means a delay before it starts loading it. By preloading the image, the browser can begin downloading the image in parallel with the CSS file, so it’s already in progress when the browser reads the CSS.

The difference is probably imperceptible on a fast connection, but it is a performance improvement – and as long as you scope the preloads correctly, there’s little downside. (Scoping means ensuring you don’t preload resources that aren’t used).

I saw some sites use DNS prefetching, which is a similar idea. A <link> with rel="dns-prefetch" tells the browser about domains it’ll fetch resources from soon, so it can begin DNS resolution early. The most common example was websites using Google Fonts:

<link rel="dns-prefetch" href="https://fonts.googleapis.com/" />

I only added preload tags to my site a few weeks ago. I’d seen them in other web pages, but I didn’t appreciate the value until I wrote one of my own.


Quirks and relics

There are still lots of <!--[if IE]> comments

Old versions of Internet Explorer supported conditional comments, which allowed developers to add IE-specific behaviour to their pages. Internet Explorer would render the contents of the comment as HTML, while other browsers ignored it. This was a common workaround for deficiencies in IE, when pages needed specific markup or styles to render correctly.

Here’s an example, where the developer adds an IE-specific style to fix a layout issue:

<!--[if IE]>
  <style>
    /* old IE unsupported flexbox fixes */
    .greedy-nav .site-title {
      padding-right: 3em;
    }
  </style>
<![endif]-->

Developers could also target specific versions of IE:

<!--[if lte IE 7]><link rel="stylesheet" href="/css/ie.css"><![endif]-->

Some websites even used conditional comments to display warnings and encourage users to upgrade, like this message, which is still present on the RedMonk website today:

<!--[if IE]>
  <div class="alert alert-warning">
    You are using an <strong>outdated</strong> browser. Please <a href="http://browsehappy.com/">upgrade your browser</a> to improve your experience.
  </div>
<![endif]-->

This syntax was already disappearing by the time I started building websites – support for conditional comments was removed in Internet Explorer 10, released in 2012, the same year that Google Chrome became the most-used browser worldwide. I never wrote one of these comments, but I saw lots of them in archived web pages.

These comments are a relic of an earlier web. Most websites have removed them, but they live on in web archives, and in the memories of web developers who remember the bad old days of IE6.

Templates in <script> tags with a non-standard type attribute

I came across a few pages using <script> tags with a type attribute that I didn’t recognise. Here’s a simple example:

<script type="text/x-handlebars-template" id="loading_animation">
  <div class="loading_animation pulsing <%= extra_class %> "><div></div></div>
</script>

Browsers ignore <script> tags with an unrecognised type – they don’t run them, and they don’t render their contents. Developers have used this as a way to include HTML templates in their pages, which JavaScript could extract and use later.

This trick was so widespread that HTML introduced a dedicated <template> element for the same purpose. It’s been in all the major browsers for years, but there are still instances of this old technique floating around the web.
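
For comparison, here’s a sketch of the same idea with <template>, reusing the names from the example above:

<template id="loading_animation">
  <div class="loading_animation pulsing"><div></div></div>
</template>

<script>
  // The browser parses the template but doesn’t render it; JavaScript
  // can clone its contents and insert them into the page later.
  const template = document.getElementById("loading_animation");
  document.body.appendChild(template.content.cloneNode(true));
</script>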

Browsers won’t load external file:// resources from file:// pages

Because my static archives are saved as plain HTML files on disk, I often open them directly using the file:// protocol, rather than serving them over HTTP. This mostly works fine – but I ran into a few cases where pages behave differently depending on how they’re loaded.

One example is the SVG <use> element. Some sites I saved use SVG sprite sheets for social media icons, with markup like:

<use href="sprite.svg#logo-icon"></use>

This works over http://, but when loaded via file://, it silently fails – the icons don’t show up.

This turns out to be a security restriction. When a file:// page tries to load another file:// resource, modern browsers treat it as a cross-origin request and block it. This is to prevent a malicious downloaded HTML file from snooping around your local disk.

It took me a while to figure this out. At first, all I got was a missing icon. I could see an error in my browser console, but it was a bit vague – it just said I couldn’t load the file for “security reasons”.

Then I dropped this snippet into my dev tools console:

fetch("sprite.svg")
  .then(response => console.log("Fetch succeeded:", response))
  .catch(error => console.error("Fetch failed:", error));

It gave me a different error message, one that explicitly mentioned cross-origin resource sharing: “CORS request not http”. This gave me something I could look up, and led me to the answer.

This is easy to work around – if I spin up a local web server (like Python’s http.server), I can open the page over HTTP and everything loads correctly.
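
For example, running this in the folder with the saved page, then opening http://localhost:8000/ in the browser:

python3 -m http.server 8000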

What does GPT stand for in attributes?

Thanks to the meteoric rise of ChatGPT, I’ve come to associate the acronym “GPT” with large language models (LLMs) – it stands for Generative Pre-trained Transformer.

That means I was quite surprised to see “GPT” crop up on web pages that predate the widespread use of generative AI. It showed up in HTML attributes like this:

<div id="div-gpt-ad-1481124643331-2">

I discovered that “GPT” also stands for Google Publisher Tag, part of Google’s ad infrastructure. I’m not sure exactly what these tags were doing – and since I stripped all the ads out of my web archive, they’re not doing anything now – but it was clearly ad-related.

What’s the instapaper_ignore class?

I found some pages that use the instapaper_ignore CSS class to hide certain content. Here’s an example from an Atlantic article I saved in 2017:

<aside class="pullquote instapaper_ignore">
  Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.
</aside>

Instapaper is a “read later” service – you save an article that looks interesting, and later you can read it in the Instapaper app. Part of the app is a text parser that tries to extract the article’s text, stripping away junk or clutter.

The instapaper_ignore class is a way for publishers to control what that parser includes. From a blog post in 2010:

Additionally, the Instapaper text parser will support some standard CSS class names to instruct it:

  • instapaper_body: This element is the body container.
  • instapaper_ignore: These elements, when inside the body container, should be removed from the text parser’s output.

In this example, the element is a pull quote – a repeated line from the article, styled to stand out. On the full web page, it works. But in the unstyled Instapaper view, it would just look like a duplicate sentence. It makes sense that the Atlantic wouldn’t want it to appear in that context.

Only a handful of pages I’ve saved ever used instapaper_ignore, and even fewer are still using it today. I don’t even know if Instapaper’s parser still looks for it.

This stood out to me because I was an avid Instapaper user for a long time. I deleted my account years ago, and I don’t hear much about “read later” apps these days – but then I stumble across a quiet little relic like this, buried in the HTML.

I found a bug in the WebKit developer tools

Safari is my regular browser, and I was using it to preview pages as I saved them to my archive. While I was archiving one of Jeffrey Zeldman’s posts, I was struggling to understand how some of his CSS worked. I could see the rule in my developer tools, but I couldn’t figure out why it was behaving the way it was.

Eventually, I discovered the problem: a bug in WebKit’s developer tools was introducing whitespace that changed the meaning of the CSS.

For example, suppose the server sends this minified CSS rule:

body>*:not(.black){color:green;}

WebKit’s dev tools prettify it like this:

body > * :not(.black) {
    color: green;
}

But these aren’t equivalent!

  • The original rule matches direct children of <body> that don’t have the black class.
  • The prettified version matches any descendant of <body> that doesn’t have the black class and that isn’t a direct child.

The CSS renders correctly on the page, but the bug means the Web Inspector can show something subtly wrong. It’s a formatting bug that sent me on a proper wild goose chase.

This bug remains unfixed – but interestingly, a year later, that particular CSS rule has disappeared from Zeldman’s site. I wonder if it caused any other problems?


Closing thoughts

The web is big and messy and bloated, and there are lots of reasons to be pessimistic about the state of modern web development – but there are also lots of people doing cool and interesting stuff with it. As I was reading this mass of HTML and CSS, I had so many moments where I thought “ooh, that’s clever!” or “neat!” or “I wonder how that works?”. I hope that as you’ve read this post, you’ve learnt something too.

I’ve always believed in the spirit of “view source”, the idea that you can look at the source code of any web page and see how it works. Although that’s become harder as more of the web is created by frameworks and machines, this exercise shows that it’s clinging on. We can still learn from reading other people’s source code.

When I set out to redo my bookmarks, I was only trying to get my personal data under control. Learning more about front-end web development has been a nice bonus. My knowledge is still a tiny tip of an iceberg, but now it’s a little bit bigger.

I know this post has been particularly dry and technical, so next week I’ll end this series on a lighter note. I’ll show you some of my favourite websites from my bookmarks – the fun, the whimsical, the joyous – the people who use the web as a creative canvas, and who inspire me to make my web presence better.


Building a personal archive of the web, the slow way

2025-05-19 14:51:21

I manage my bookmarks with a static website. I’ve bookmarked over 2000 pages, and I keep a local snapshot of every page. These snapshots are stored alongside the bookmark data, so I always have access, even if the original website goes offline or changes.

Screenshot of a web page showing a 504 Gateway Timeout error.
Screenshot of the same page, from a local snapshot, showing the headline “30 years on, what’s next for the web?”

I’ve worked on web archives in a professional setting, but this one is strictly personal. This gives me more freedom to make different decisions and trade-offs. I can focus on the pages I care about, spend more time on quality control, and delete parts of a page I don’t need – without worrying about institutional standards or long-term public access.

In this post, I’ll show you how I built this personal archive of the web: how I save pages, why I chose to do it by hand, and what I do to make sure every page is properly preserved.

This article is the second in a four part bookmarking mini-series:

  1. Creating a static site for all my bookmarks – why I bookmark, why I use a static site, and how it works.
  2. Creating a local archive of all my bookmarks (this article)
  3. What I learnt about making websites by reading two thousand web pages – how to write thoughtful HTML, new-to-me features of CSS, and some quirks and relics I found while building my personal web archive.
  4. Some cool websites from my bookmark collection (coming 2 June) – some websites which are doing especially fun or clever things with the web.

What do I want from a web archive?

I’m building a personal web archive – it’s just for me. I can be very picky about what it contains and how it works, because I’m the only person who’ll read it or save pages. It’s not a public resource, and nobody else will ever see it.

This means it’s quite different to what I’d do in a professional or institutional setting. There, the priorities are different: automation over manual effort, immutability over editability, and a strong preference for content that can be shared or made public.

I want a complete copy of every web page

I want my archive to have a copy of every page I’ve bookmarked, and for each copy to be a good substitute for the original. It should include everything I need to render the page – text, images, videos, styles, and so on.

If the original site changes or goes offline, I should still be able to see the page as I saved it.

I want the archive to live on my computer

I don’t want to rely on an online service which could break, change, or be shut down.

I learnt this the hard way with Pinboard. I was paying for an archiving account, which promised to save a copy of all my bookmarks. But in recent years it became unreliable – sometimes it would fail to archive a page, and sometimes I couldn’t retrieve a supposedly saved page.

It should be easy to save new pages

I save a couple of new bookmarks a week. I want to keep this archive up-to-date, and I don’t want adding pages to be a chore.

It should support private or paywalled pages

I read a lot of pages which aren’t on the public web, stuff behind paywalls or login screens. I want to include these in my web archive.

Many web archives only save public content – either because they can’t access private content to save, or because they couldn’t share it if they did. This makes it even more important that I keep my own copy of private pages, because I may not find another.

It should be possible to edit snapshots

This is both additive and subtractive.

Web pages can embed external resources, and sometimes I want those resources in my archive. For example, suppose somebody publishes a blog post about a conference talk, and embeds a YouTube video of them giving the talk. I want to download the video, not just the YouTube embed code.

Web pages also contain a lot of junk that I don’t care about saving – ads, tracking, pop-ups, and more. I want to cut all that stuff out, and just keep the useful parts. It’s like taking clippings from a magazine: I want the article, not the ads wrapped around it.


What does my web archive look like?

I treat my archived bookmarks like the bookmarks site itself: as static files, saved in folders on my local filesystem.

A static folder for every page

For every page, I have a folder with the HTML, stylesheets, images, and other linked files. Each folder is a self-contained “mini-website”. If I want to look at a saved page, I can just open the HTML file in my web browser.

Screenshot of a folder containing an HTML page, and two folders with linked resources: images and static.
The files for a single web page, saved in a folder in my archive. I flatten the structure of each website into top-level folders like images and static, which keeps things simple and readable. I don’t care about the exact URL paths from the original site.

Any time the HTML refers to an external file, I’ve changed it to fetch the file from the local folder rather than the original website. For example, the original HTML might have an <img> tag that loads an image from https://preshing.com/~img/poster-wall.jpg, but in my local copy I’d change the <img> tag to load from images/poster-wall.jpg.
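
Using that example, the edit is something like this:

<!-- original HTML -->
<img src="https://preshing.com/~img/poster-wall.jpg">

<!-- my local copy -->
<img src="images/poster-wall.jpg">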

I like this approach because it’s using open, standards-based web technology, and this structure is simple, durable, and easy to maintain. These folder-based snapshots will likely remain readable for the rest of my life.

Why not WARC or WACZ?

Many institutions store their web archives as WARC or WACZ, which are file formats specifically designed to store preserved web pages.

These files contain the saved page, as well as extra information about how the archive was created – useful context for future researchers. This could include the HTTP headers, the IP address, or the name of the software that created the archive.

You can only open WARC or WACZ files with specialist “playback” software, or by unpacking the files from the archive. Both file formats are open standards, so theoretically you could write your own software to read them – archives saved this way aren’t trapped in a proprietary format – but in practice, you’re picking from a small set of tools.

In my personal archive, I don’t need that extra context, and I don’t want to rely on a limited set of tools. It’s also difficult to edit WARC files, which is one of my requirements. I can’t easily open them up and delete all the ads, or add extra files.

I prefer the flexibility of files and folders – I can open HTML files in any web browser, make changes with ease, and use whatever tools I like.


How do I save a local copy of each web page?

I save every page by hand, then I check it looks good – that I’ve saved all the external resources like images and stylesheets.

This manual inspection gives me the peace of mind to know that I really have saved each web page, and that it’s a high-quality copy. I’m not going to open a snapshot in two years’ time, only to discover that I’m missing a key image or illustration.

Let’s go through that process in more detail.

Saving a single web page by hand

I start by saving the HTML file, using the “Save As” button in my web browser.

I open that file in my web browser and my text editor. Using the browser’s developer tools, I look for external files that I need to save locally – stylesheets, fonts, images, and so on. I download the missing files, edit the HTML in my text editor to point at the local copy, then reload the page in the browser to see the result. I keep going until I’ve downloaded everything, and I have a self-contained, offline copy of the page.

Most of my time in the developer tools is spent in two tabs.

I look at the Network tab to see what files the page is loading. Are they being served from my local disk, or fetched from the Internet? I want everything to come from disk.

The network tab in my browser developer tools, which has a list of files loaded by the page, and the domain they were loaded from. A lot of these files were loaded from remote servers.
This HTML file is making a lot of external network requests – I have more work to do!

I check the Console tab for any errors loading the page – some image that can’t be found, or a JavaScript file that didn’t load properly. I want to fix all these errors.

The console tab in my browser developer tools, which has a lot of messages highlighted in red about resources that weren't loaded properly.
So much red!

I spend a lot of time reading and editing HTML by hand. I’m fairly comfortable working with other people’s code, and it typically takes me a few minutes to save a page. This is fine for the handful of new pages I save every week, but it wouldn’t scale for a larger archive.

Once I’ve downloaded everything the page needs, eliminated external requests, and fixed the errors, I have my snapshot.

Deleting all the junk

As I’m saving a page, I cut away all the stuff I don’t want. This makes my snapshots smaller, and pages often shrank by 10–20×. The junk I deleted includes:

  • Ads. So many ads. I found one especially aggressive plugin that inserted an advertising <div> between every single paragraph.

  • Banners for time-sensitive events. News tickers, announcements, limited-time promotions, and in one case, a museum’s bank holiday opening hours.

  • Inline links to related content. There are many articles where, every few paragraphs, you get a promo for a different article. I find this quite distracting, especially as I’m already reading the site! I deleted all those, so my saved articles are just the text.

  • Cookie notices, analytics, tracking, and other services for gathering “consent”. I don’t care what tracking tools a web page was using when I saved it, and they’re a complete waste of space in my personal archive.

As I was editing the page in my text editor, I’d look for <script> and <iframe> elements. These are good indicators of the stuff I want to remove – for example, most ads are loaded in iframes. A lot of what I save is static content, where I don’t need the interactivity of JavaScript. I can remove it from the page and still have a useful snapshot.

In my personal archive, I think these deletions are a clear improvement. Snapshots load faster, they’re easier to read, and I’m not preserving megabytes of junk I’ll never use. But I’d be a lot more cautious doing this in a public context.

Institutional web archives try to preserve web pages exactly as they were. They want researchers to trust that they’re seeing an authentic representation of the original page, unchanged in content or meaning. Deleting anything from the page, however well-intentioned, might undermine that trust – who decides what gets deleted? What’s cruft to me might be a crucial clue to someone else.

Using templates for repeatedly-bookmarked sites

For big, complex websites that I bookmark often, I’ve created simple HTML templates.

When I want to save a new page, I discard the original HTML, and I just copy the text and images into the template. It’s a lot faster than unpicking the site’s HTML every time, and I’m saving the content of the article, which is what I really care about.

Screenshot of an article on the New York Times website. You can only see a headline – most of the page is taken up by an ad and a cookie banner.
Screenshot of the same article saved in my archive. You can see the main illustration, the headline, and two paragraphs of the article.
Here’s an example from the New York Times. You can tell which page is the real article, because you have to click through two dialogs and scroll past an ad before you see any text.

I was inspired by AO3 (the Archive of Our Own), a popular fanfiction site. You can download copies of every story in multiple formats, and they believe in it so strongly that everything published on their site can be downloaded. Authors don’t get to opt out.

An HTML download from AO3 looks different to the styled version you’d see browsing the web:

Screenshot of a story ‘The Jacket Bar’ on AO3. There are styles and colours, and a red AO3 site header at the top of the page.
Screenshot of the same story, as an HTML download. It’s an unstyled HTML page, with Times New Roman font and default blue links.

But the difference is only cosmetic – both files contain the full text of the story, which is what I really care about. I don’t care about saving a visual snapshot of what AO3 looks like.

Most sites don’t offer a plain HTML download of their content, but I know enough HTML and CSS to create my own templates. I have a dozen or so of these templates, which make it easy to create snapshots of sites I visit often – sites like Medium, Wired, and the New York Times.

Backfilling my existing bookmarks

When I decided to build a new web archive by hand, I already had partial collections from several sources – Pinboard, the Wayback Machine, and some personal scripts.

I gradually consolidated everything into my new archive, tackling a few bookmarks a day: fixing broken pages, downloading missing files, deleting ads and other junk. I had over 2000 bookmarks, and it took about a year to migrate all of them. Now I have a collection where I’ve checked everything by hand, and I know I have a complete set of local copies.

I wrote some Python scripts to automate common cleanup tasks, and I used regular expressions to help me clean up the mass of HTML. This code is too scrappy and specific to be worth sharing, but I wanted to acknowledge my use of automation, albeit at a lower level than most archiving tools. There was a lot of manual effort involved, but it wasn’t entirely by hand.

Now I’m done, there’s only one bookmark that seems conclusively lost – a review of Rogue One on Dreamwidth, where the only capture I can find is a content warning interstitial.

I consider this a big success, but it was also a reminder of how fragmented our internet archives are. Many of my snapshots are “franken-archives” – stitched together from multiple sources, combining files that were saved years apart.

Backing up the backups

Once I have a website saved as a folder, that folder gets backed up like all my other files.

I use Time Machine and Carbon Copy Cloner to back up to a pair of external SSDs, and Backblaze to create a cloud backup that lives outside my house.


Why not use automated tools?

I’m a big fan of automated tools for archiving the web, I think they’re an essential part of web preservation, and I’ve used many of them in the past.

Tools like ArchiveBox, Webrecorder, and the Wayback Machine have preserved enormous chunks of the web – pages that would otherwise be lost. I paid for a Pinboard archiving account for a decade, and I search the Internet Archive at least once a week. I’ve used command-line tools like wget, and last year I wrote my own tool to create Safari webarchives.

The size and scale of today’s web archives are only possible because of automation.

But automation isn’t a panacea, it’s a trade-off. You’re giving up accuracy for speed and volume. If nobody is reviewing pages as they’re archived, it’s more likely that they’ll contain mistakes or be missing essential files.

When I reviewed my old Pinboard archive, I found a lot of pages that weren’t archived properly – they had missing images, or broken styles, or relied on JavaScript from the original site. These were web pages I really care about, and I thought I had them saved, but that was a false sense of security. I’ve found issues like this whenever I’ve used automated tools to archive the web.

That’s why I decided to create my new archive manually – it’s much slower, but it gives me the comfort of knowing that I have a good copy of every page.


What I learnt about archiving the web

Lots of the web is built on now-defunct services

I found many pages that rely on third-party services that no longer exist, like:

  • Photo sharing sites – some I’d heard of (Twitpic, Yfrog), others that were new to me (phto.ch)
  • Link redirection services – URL shorteners and sponsored redirects
  • Social media sharing buttons and embeds

This means that if you load the live site, the main page loads, but key resources like images and scripts are broken or missing.

Just because the site is up, doesn’t mean it’s right

One particularly insidious form of breakage is when the page still exists, but the content has changed. Here’s an example: a screenshot from an iTunes tutorial on LiveJournal that’s been replaced with an “18+ warning”:

Screenshot of a text box in the iTunes UI, where you can enter a year.
A warning from LiveJournal that this image is “18+”.

This kind of failure is hard to detect automatically – the server is returning a valid response, just not the one you want. That’s why I wanted to look at every web page with my eyes, and not rely on a computer to tell me it was saved correctly.

Many sites do a poor job of redirects

I was surprised by how many web pages still exist, but the original URLs no longer work, especially on large news sites. Many of my old bookmarks now return a 404, but if you search for the headline, you can find the story at a completely different URL.

I find this frustrating and disappointing. Whenever I’ve restructured this site, I always set up redirects because I’m an old-school web nerd and I think URLs are cool – but redirects aren’t just about making me feel good. Keeping links alive makes it easier to find stuff in your back catalogue – without redirects, most people who encounter a broken link will assume the page was deleted, and won’t dig further.

Images are getting easier to serve, harder to preserve

When the web was young, images were simple. You wrote an <img src="…"> tag in your HTML, and that was that.

Today, images are more complicated. You can provide multiple versions of the same image, or control when images are loaded. This can make web pages more efficient and accessible, but harder to preserve.

There are two features that stood out to me:

  1. Lazy loading is a technique where a web page doesn’t load images or resources until they’re needed – for example, not loading an image at the bottom of an article until you scroll down.

    Modern lazy loading is easy with <img loading="lazy">, but there are lots of sites that were built before that attribute was widely-supported. They have their own code for lazy loading, and every site behaves a bit differently. For example, a page might load a low-res image first, then swap it out for a high-res version with JavaScript. But automated tools can’t always run that JavaScript, so they only capture the low-res image.

  2. The <picture> tag allows pages to specify multiple versions of an image (there’s a markup sketch after this list). For example:

    • A page could send a high-res image to laptops, and a low-res image to phones. This is more efficient; you’re not sending an unnecessarily large image to a small screen.
    • A page could send different images based on your colour scheme. You could see a graph on a white background if you use light mode, or on black if you use dark mode.

    If you preserve a page, which images should you get? All of them? Just one? If so, which one? For my personal archive, I always saved the highest resolution copy of each image, but I’m not sure that’s the best answer in every case.

    On the modern web, pages may not look the same for everyone – different people can see different things. When you’re preserving a page, you need to decide which version of it you want to save.
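
Here’s the sketch I mentioned above – a <picture> element that switches on colour scheme, with made-up filenames:

<picture>
  <source srcset="graph-dark.png" media="(prefers-color-scheme: dark)">
  <img src="graph-light.png" alt="A graph on a light background">
</picture>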

There’s no clearly-defined boundary of what to collect

Once you’ve saved the initial HTML page, what else do you save?

Some automated tools will aggressively follow every link on the page, and every link on those pages, and every link after that, and so on. Others will follow simple heuristics, like “save everything linked from the first page, but no further”, or “save everything up to a fixed size limit”.

I struggled to come up with a good set of heuristics for my own approach, and I was often making decisions on a case-by-case basis. Here are two examples:

  • I’ve bookmarked blog posts about conference talks, where authors embed a YouTube video of them giving the talk. I think the video is a key part of the page, so I want to download it – but “download all embeds and links” would be a very expensive rule.

  • I’ve bookmarked blog posts that comment on scientific papers. Usually the link to the original paper doesn’t go directly to the PDF, but to a landing page on a site like arXiv.

    I want to save the PDF because it’s important context for the blog post, but now I’m saving something two clicks away from the original post – which would be even more expensive if applied as a universal rule.

This is another reason why I’m really glad I built my archive by hand – I could make different decisions based on the content and context.


Should you do this?

I can recommend having a personal web archive. Just like I keep paper copies of my favourite books, I now keep local copies of my favourite web pages. I know that I’ll always be able to read them, even if the original website goes away.

It’s harder to recommend following my exact approach. Building my archive by hand took nearly a year, probably hundreds of hours of my free time. I’m very glad I did it, I enjoyed doing it, and I like the result – but it’s a big time commitment, and it was only possible because I have a lot of experience building websites.

Don’t let that discourage you – a web archive doesn’t have to be fancy or extreme.

If you take a few screenshots, save some PDFs, or download HTML copies of your favourite fic, that’s a useful backup. You’ll have something to look at if the original web page goes away.

If you want to scale up your archive, look at automated tools. For most people, they’re a better balance of cost and reward than saving and editing HTML by hand. But you don’t need to, and even a folder with just a few files is better than nothing.

When I was building my archive – and reading all those web pages – I learnt a lot about how the web is built. In part 3 of this series, I’ll share what that process taught me about making websites.

If you’d like to know when that article goes live, subscribe to my RSS feed or newsletter!


Creating a static website for all my bookmarks

2025-05-12 14:52:08

I’m storing more and more of my data as static websites, and about a year ago I switched to using a local, static site to manage my bookmarks. It’s replaced Pinboard as the way I track interesting links and cool web pages. I save articles I’ve read, talks I’ve watched, fanfiction I’ve enjoyed.

Screenshot of a web page titled ‘bookmarks’, and a list of three bookmarks below it. Each bookmark has a blue link as the title, a grey URL below the title, and some descriptive text below it.

It’s taken hundreds of hours to get all my bookmarks and saved web pages into this new site, and it’s taught me a lot about archiving and building the web. This post is the first of a four-part series about my bookmarks, and I’ll publish the rest of the series over the next three weeks.

Bookmarking mini-series

  1. Creating a static site for all my bookmarks (this article)
  2. Building a personal archive of the web, the slow way – how I built a web archive by hand, the tradeoffs between manual and automated archiving, and what I learnt about preserving the web.
  3. What I learnt about making websites by reading two thousand web pages – how to write thoughtful HTML, new-to-me features of CSS, and some quirks and relics I found while building my personal web archive.
  4. Some cool websites from my bookmark collection (coming 2 June) – my favourite sites I rediscovered while reviewing my bookmarks. Fun, creative corners of the web that make me smile.

Why do I bookmark?

I bookmark because I want to find links later

Keeping my own list of bookmarks means that I can always find old links. If I have a vague memory of a web page, I’m more likely to find it in my bookmarks than in the vast ocean of the web. Links rot, websites break, and a lot of what I read isn’t indexed by Google.

This is particularly true for fannish creations. A lot of fan artists deliberately publish in closed communities, so their work is only visible to like-minded fans, and not something that a casual Internet user might stumble upon. If I’m looking for an exact story I read five years ago, I’m far more likely to find it in my bookmarks than by scouring the Internet.

Finding old web pages has always been hard, and the rise of generative AI has made it even harder. Search results are full of AI slop, and people are trying to hide their content from scrapers. Locking the web behind paywalls and registration screens might protect it from scrapers, but it also makes it harder to find.

I bookmark to remember why I liked a link

I write notes on each bookmark I keep, so I remember why a particular link was fun or interesting.

When I save articles, I write a short summary of the key points, information, arguments. I can review a one paragraph summary much faster than I can reread the entire page.

When I save fanfiction, I write notes on the plot or key moments. Is this the story where they live happily ever after? Or does this have a gut wrenching ending that needs a box of tissues?

These summaries could be farmed out to generative AI, but I much prefer writing them myself. I can phrase things in my own words, write down connections to other ideas, and write a more personal summary than I get from a machine. And when I read those summaries back later, I remember writing them, and it revives my memories of the original article or story. It’s slower, but I find it much more useful.


Why use a static website?

I was a happy Pinboard customer for over a decade, but the site feels abandoned. I’ve not had catastrophic errors, but I kept running into broken features or rough edges – search timeouts, unreliable archiving, unexpected outages. There does seem to be some renewed activity in the last few months, but it was too late – I’d already moved away.

I briefly considered Pinboard alternatives like Raindrop, Pocket and Instapaper – but I’d still be trusting an online service. I’ve been burned too many times by services being abandoned, and so I’ve gradually been moving the data I care about to static websites on my local machine. It takes a bit of work to set up, but then I have more control and I’m not worried about it going away.

My needs are pretty simple – I just want a list of links, some basic filtering and search, and a saved copy of all my links. I don’t need social features or integrations, which made it easier to walk away from Pinboard.

I’ve been using static sites for all sorts of data, and I’m enjoying the flexibility and portability of vanilla HTML and JavaScript. They need minimal maintenance and there’s no lock-in, and I’ve made enough of them now that I can create new ones pretty quickly.


What does it look like?

The main page is a scrolling list of all my bookmarks. This is a single page that shows every link, because my collection is small enough that I don’t need pagination.

Screenshot of a web page titled ‘bookmarks’, and a list of three bookmarks below it. Each bookmark has a blue link as the title, a grey URL below the title, and some descriptive text below it.

Each bookmark has a title, a URL, my hand-written summary, and a list of tags.

If I’m looking for a specific bookmark, I can filter by tag or by author, or I can use my browser’s in-page search feature. I can sort by title, date saved, or “random” – this is a fun way to find links I might have forgotten about.

Let’s look at a single bookmark:

Close-up view of a single bookmark.

The main title is blue and underlined, and the URL of the original page is shown below it. Call me old-fashioned, but I still care about URL design, I think it’s cool to underline links, and I have nostalgia for #0000ff blue. I want to see those URLs.

If I click the URL in grey, I go to the page live on the web. But if I click the title, I go to a local snapshot of the page. Because these are the links I care about most, and links can rot, I’ve saved a local copy of every page as a mini-website – an HTML file and supporting assets. These local snapshots will always be present, and they work offline – that’s why I link to them from the title.

This is something I can only do because this is a personal tool. If a commercial bookmarking website tried to direct users to their self-hosted snapshots rather than the original site, they’d be accused of stealing traffic.

Creating these archival copies took months, and I’ll discuss it more in the rest of this series.


How does it work?

This is a static website built using the pattern I described in How I create static websites for tiny archives. I have my metadata in a JavaScript file, and a small viewer that renders the metadata as an HTML page I can view in my browser.

I’m not sharing the code because it’s deeply personalised and tied to my specific bookmarks, but if you’re interested, that tutorial is a good starting point.

Here’s an example of what a bookmark looks like in the metadata file:

"https://notebook.drmaciver.com/posts/2020-02-22-11:37.html": {
  "title": "You should try bad things",
  "authors": [
    "David R. MacIver"
  ],
  "description": "If you only do things you know are good, eventually some of them will fall out of favour and you'll have an ever-shrinking pool of things.\r\n\r\nSome of the things you think will be bad will end up being good \u2013 trying them helps expand your pool.",
  "date_saved": "2024-12-03T07:29:10Z",
  "tags": [
    "self-improvement"
  ],
  "archive": {
    "path": "archive/n/notebook.drmaciver.com/you-should-try-bad-things.html",
    "asset_paths": [
      "archive/n/notebook.drmaciver.com/static/drmnotes.css",
      "archive/n/notebook.drmaciver.com/static/latex.css",
      "archive/n/notebook.drmaciver.com/static/pandoc.css",
      "archive/n/notebook.drmaciver.com/static/pygments.css",
      "archive/n/notebook.drmaciver.com/static/tufte.css"
    ],
    "saved_at": "2024-12-03T07:30:21Z"
  },
  "publication_year": "2020",
  "type": "article"
}

I started with the data model from the Pinboard API, which in turn came from an older bookmarking site called Delicious. Over time, I’ve added my own fields. Previously I was putting everything in the title and the tags, but now I can have dedicated fields for things like authors, word count, and fannish tropes.

The archive object is new – that’s my local snapshot of a page. The path points to the main HTML file for the snapshot, and asset_paths is a list of any other files used by the web page (CSS files, images, fonts, and so on). I have a Python script that checks every archived file has been saved properly.

This is another advantage of writing my own bookmarking tool – I know exactly what data I want to store, and I can design the schema to fit.

When I want to add a bookmark or make changes, I open the JSON in my text editor and make changes by hand. I have a script that checks the file is properly formatted and the archive paths are all correct, and I track changes in Git.
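
As a minimal sketch of that kind of check – assuming the metadata is available as a plain JSON object keyed by URL, and with a made-up filename:

import json
import os

# Load the bookmark metadata: a JSON object keyed by URL, where each
# bookmark may have an "archive" object listing its local files.
with open("bookmarks.json") as f:
    bookmarks = json.load(f)

for url, bookmark in bookmarks.items():
    archive = bookmark.get("archive")
    if archive is None:
        continue

    # Every path recorded in the metadata should exist on disk.
    for path in [archive["path"], *archive.get("asset_paths", [])]:
        if not os.path.exists(path):
            print(f"Missing archive file for {url}: {path}")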

Now you know what my new bookmarking site looks like, and how it works. Over the next three weeks, I’ll publish the remaining parts in this series. In part 2, I’ll explain how I created local snapshots of every web page. In part 3, I’ll tell you what that process taught me about building web pages. In the final part, I’ll highlight some fun stuff I found as I went through my bookmarks.

If you’d like to know when those articles go live, subscribe to my RSS feed or newsletter!


Handling JSON objects with duplicate names in Python

2025-05-04 23:17:01

Consider the following JSON object:

{
  "sides":  4,
  "colour": "red",
  "sides":  5,
  "colour": "blue"
}

Notice that sides and colour both appear twice. This looks invalid, but I learnt recently that this is actually legal JSON syntax! It’s unusual and discouraged, but it’s not completely forbidden.

This was a big surprise to me. I think of JSON objects as key/value pairs, and I associate them with data structures like a dict in Python or a Hash in Ruby – both of which only allow unique keys. JSON has no such restriction, and I started thinking about how to handle it.

What does the JSON spec say about duplicate names?

JSON is described by several standards, which Wikipedia helpfully explains for us:

After RFC 4627 had been available as its “informational” specification since 2006, JSON was first standardized in 2013, as ECMA‑404.

RFC 8259, published in 2017, is the current version of the Internet Standard STD 90, and it remains consistent with ECMA‑404.

That same year, JSON was also standardized as ISO/IEC 21778:2017.

The ECMA and ISO/IEC standards describe only the allowed syntax, whereas the RFC covers some security and interoperability considerations.

All three of these standards explicitly allow the use of duplicate names in objects.

ECMA‑404 and ISO/IEC 21778:2017 have identical text to describe the syntax of JSON objects, and they say (emphasis mine):

An object structure is represented as a pair of curly bracket tokens surrounding zero or more name/value pairs. […] The JSON syntax does not impose any restrictions on the strings used as names, does not require that name strings be unique, and does not assign any significance to the ordering of name/value pairs. These are all semantic considerations that may be defined by JSON processors or in specifications defining specific uses of JSON for data interchange.

RFC 8259 goes further and strongly recommends against duplicate names, but the use of SHOULD means it isn’t completely forbidden:

The names within an object SHOULD be unique.

The same document warns about the consequences of ignoring this recommendation:

An object whose names are all unique is interoperable in the sense that all software implementations receiving that object will agree on the name-value mappings. When the names within an object are not unique, the behavior of software that receives such an object is unpredictable. Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates.

So it’s technically valid, but it’s unusual and discouraged.

I’ve never heard of a use case for JSON objects with duplicate names. I’m sure there was a good reason for it being allowed by the spec, but I can’t think of it.

Most JSON parsers – including jq, JavaScript, and Python – will silently discard all but the last instance of a duplicate name. Here’s an example in Python:

>>> import json
>>> json.loads('{"sides": 4, "colour": "red", "sides": 5, "colour": "blue"}')
{'colour': 'blue', 'sides': 5}

What if I wanted to decode the whole object, or throw an exception if I see duplicate names?

This happened to me recently. I was editing a JSON file by hand, and I’d copy/paste objects to update the data. I also had scripts which could update the file. I forgot to update the name on one of the JSON objects, so there were two name/value pairs with the same name. When I ran the script, it silently erased the first value.

I was able to recover the deleted value from the Git history, but I wondered how I could prevent this happening again. How could I make the script fail, rather than silently delete data?

Decoding duplicate names in Python

When Python decodes a JSON object, it first parses the object as a list of name/value pairs, then it turns that list of name/value pairs into a dictionary.

We can see this by looking at the JSONObject function in the CPython source code: it builds a list called pairs, and at the end of the function, it calls dict(pairs) to turn the list into a dictionary. This relies on the fact that dict() can take an iterable of key/value tuples and create a dictionary:

>>> dict([('sides', 4), ('colour', 'red')])
{'colour': 'red', 'sides': 4}

The docs for dict() tell us that it will discard duplicate keys: “if a key occurs more than once, the last value for that key becomes the corresponding value in the new dictionary”.

>>> dict([('sides', 4), ('colour', 'red'), ('sides', 5), ('colour', 'blue')])
{'colour': 'blue', 'sides': 5}

We can customise what Python does with the list of name/value pairs. Rather than calling dict(), we can pass our own function to the object_pairs_hook parameter of json.loads(), and Python will call that function on the list of pairs. This allows us to parse objects in a different way.

For example, we can just return the literal list of name/value pairs:

>>> import json
>>> json.loads(
...     '{"sides": 4, "colour": "red", "sides": 5, "colour": "blue"}',
...     object_pairs_hook=lambda pairs: pairs
... )
...
[('sides', 4), ('colour', 'red'), ('sides', 5), ('colour', 'blue')]

We could also use the multidict library to get a dict-like data structure which supports multiple values per key. This is based on HTTP headers and URL query strings, two environments where it’s common to have multiple values for a single key:

>>> from multidict import MultiDict
>>> md = json.loads(
...     '{"sides": 4, "colour": "red", "sides": 5, "colour": "blue"}',
...     object_pairs_hook=lambda pairs: MultiDict(pairs)
... )
...
>>> md
<MultiDict('sides': 4, 'colour': 'red', 'sides': 5, 'colour': 'blue')>
>>> md['sides']
4
>>> md.getall('sides')
[4, 5]

Preventing silent data loss

If we want to throw an exception when we see duplicate names, we need a longer function. Here’s the code I wrote:

import collections
import typing


def dict_with_unique_names(pairs: list[tuple[str, typing.Any]]) -> dict[str, typing.Any]:
    """
    Convert a list of name/value pairs to a dict, but only if the
    names are unique.

    If there are non-unique names, this function throws a ValueError.
    """
    # First try to parse the object as a dictionary; if it's the same
    # length as the pairs, then we know all the names were unique and
    # we can return immediately.
    pairs_as_dict = dict(pairs)

    if len(pairs_as_dict) == len(pairs):
        return pairs_as_dict

    # Otherwise, let's work out what the repeated name(s) were, so we
    # can throw an appropriate error message for the user.
    name_tally = collections.Counter(n for n, _ in pairs)

    repeated_names = [n for n, count in name_tally.items() if count > 1]
    assert len(repeated_names) > 0

    if len(repeated_names) == 1:
        raise ValueError(f"Found repeated name in JSON object: {repeated_names[0]}")
    else:
        raise ValueError(
            f"Found repeated names in JSON object: {', '.join(repeated_names)}"
        )

If I use this as my object_pairs_hook when parsing an object which has all unique names, it returns the normal dict I’d expect:

>>> json.loads(
...     '{"sides": 4, "colour": "red"}',
...     object_pairs_hook=dict_with_unique_names
... )
...
{'colour': 'red', 'sides': 4}

But if I’m parsing an object with one or more repeated names, the parsing fails and throws a ValueError:

>>> json.loads(
...     '{"sides": 4, "colour": "red", "sides": 5}',
...      object_pairs_hook=dict_with_unique_names
... )
Traceback (most recent call last):
[…]
ValueError: Found repeated name in JSON object: sides

>>> json.loads(
...     '{"sides": 4, "colour": "red", "sides": 5, "colour": "blue"}',
...     object_pairs_hook=dict_with_unique_names
... )
Traceback (most recent call last):
[…]
ValueError: Found repeated names in JSON object: sides, colour

This is precisely the behaviour I want – throwing an exception, not silently dropping data.
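If you want to use this hook whenever you read a JSON file, it's easy to wrap in a small helper. The name load_json_strict is just something I've made up for this sketch:

import json
import typing


def load_json_strict(path: str) -> typing.Any:
    """
    Read a JSON file, but throw a ValueError if any object in the
    file has duplicate names.

    (This relies on the dict_with_unique_names hook defined above.)
    """
    with open(path) as f:
        return json.load(f, object_pairs_hook=dict_with_unique_names)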

Encoding non-unique names in Python

It’s hard to think of a use case, but this post feels incomplete without at least a brief mention.

If you want to encode custom data structures with Python’s JSON library, you can subclass JSONEncoder and define how those structures should be serialised. Here’s a rudimentary attempt at doing that for a MultiDict:

class MultiDictEncoder(json.JSONEncoder):

    def encode(self, o: typing.Any) -> str:
        # If this is a MultiDict, we need to construct the JSON string
        # manually -- first encode each name/value pair, then construct
        # the JSON object literal.
        if isinstance(o, MultiDict):
            name_value_pairs = [
                f'{super().encode(str(name))}: {self.encode(value)}'
                for name, value in o.items()
            ]

            return '{' + ', '.join(name_value_pairs) + '}'

        return super().encode(o)

and here’s how you use it:

>>> md = MultiDict([('sides', 4), ('colour', 'red'), ('sides', 5), ('colour', 'blue')])
>>> json.dumps(md, cls=MultiDictEncoder)
{"sides": 4, "colour": "red", "sides": 5, "colour": "blue"}

This is rough code, and you shouldn’t use it – it’s only an example. I’m constructing the JSON string manually, so it doesn’t handle edge cases like indentation or special characters. There are almost certainly bugs, and you’d need to be more careful if you wanted to use this for real.

In practice, if I had to encode a multi-dict as JSON, I'd encode it as a list of objects which each have a key and a value field. For example:

[
  {"key": "sides",  "value": 4     },
  {"key": "colour", "value": "red" },
  {"key": "sides",  "value": 5     },
  {"key": "colour", "value": "blue"}
]

This is a pretty standard pattern, and it won’t trip up JSON parsers which aren’t expecting duplicate names.
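If I needed that conversion in code, it's only a couple of lines in each direction. This is a sketch using the MultiDict class from earlier, not code from any of my projects:

import json

from multidict import MultiDict


def multidict_to_pairs(md: MultiDict) -> list[dict]:
    """Encode a MultiDict as a list of {"key": ..., "value": ...} objects."""
    return [{"key": name, "value": value} for name, value in md.items()]


def multidict_from_pairs(pairs: list[dict]) -> MultiDict:
    """Rebuild a MultiDict from a list of {"key": ..., "value": ...} objects."""
    return MultiDict((entry["key"], entry["value"]) for entry in pairs)


md = MultiDict([('sides', 4), ('colour', 'red'), ('sides', 5), ('colour', 'blue')])
round_tripped = multidict_from_pairs(json.loads(json.dumps(multidict_to_pairs(md))))
assert list(round_tripped.items()) == list(md.items())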

Do you need to worry about this?

This isn’t a big deal. JSON objects with duplicate names are pretty unusual – this is the first time I’ve ever encountered one, and it was a mistake.

Trying to account for this edge case in every project that uses JSON would be overkill. It would add complexity to my code and probably never catch a single error.

This started when I made a copy/paste error that introduced the initial duplication, and then a script modified the JSON file and caused some data loss. That's a somewhat unusual workflow, because most JSON files are modified exclusively by computers, where this wouldn't be an issue.

I’ve added this error handling to my javascript-data-files library, but I don’t anticipate adding it to other projects. I use that library for my static website archives, which is where I had this issue.

Although I won’t use this code exactly, it’s been good practice at writing custom encoders/decoders in Python. That is something I do all the time – I’m often encoding native Python types as JSON, and I want to get the same type back when I decode later.

I’ve been writing my own subclasses of JSONEncoder and JSONDecoder for a while. Now I know a bit more about how Python decodes JSON, and object_pairs_hook is another tool I can consider using. This was a fun deep dive for me, and I hope you found it helpful too.


A faster way to copy SQLite databases between computers

2025-04-30 05:47:09

I store a lot of data in SQLite databases on remote servers, and I often want to copy them to my local machine for analysis or backup.

When I’m starting a new project and the database is near-empty, this is a simple rsync operation:

$ rsync --progress username@server:my_remote_database.db my_local_database.db

As the project matures and the database grows, this gets slower and less reliable. Downloading a 250MB database from my web server takes about a minute over my home Internet connection, and that’s pretty small – most of my databases are multiple gigabytes in size.

I’ve been trying to make these copies go faster, and I recently discovered a neat trick.

What really slows me down is my indexes. I have a lot of indexes in my SQLite databases, which dramatically speed up my queries, but also make the database file larger and slower to copy. (In one database, there’s an index which single-handedly accounts for half the size on disk!)

The indexes don’t store anything unique – they just duplicate data from other tables to make queries faster. Copying the indexes makes the transfer less efficient, because I’m copying the same data multiple times. I was thinking about ways to skip copying the indexes, and I realised that SQLite has built-in tools to make this easy.

Dumping a database as a text file

SQLite allows you to dump a database as a text file. If you use the .dump command, it prints the entire database as a series of SQL statements. This text file can often be significantly smaller than the original database.

Here’s the command:

$ sqlite3 my_database.db .dump > my_database.db.txt

And here’s what the beginning of that file looks like:

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE IF NOT EXISTS "tags" (
   [name] TEXT PRIMARY KEY,
   [count_uses] INTEGER NOT NULL
);
INSERT INTO tags VALUES('carving',260);
INSERT INTO tags VALUES('grass',743);

Crucially, this reduces each large, disk-heavy index to a single line of text – an instruction to create the index, not the index itself.

CREATE INDEX [idx_photo_locations]
    ON [photos] ([longitude], [latitude]);

This means that I’m only storing each value once, rather than the many times it may be stored across the original table and my indexes. This is how the text file can be smaller than the original database.

If you want to reconstruct the database, you pipe this text file back to SQLite:

$ cat my_database.db.txt | sqlite3 my_reconstructed_database.db

Because the SQL statements are very repetitive, this text responds well to compression:

$ sqlite3 my_database.db .dump | gzip -c > my_database.db.txt.gz

To give you an idea of the potential savings, here’s the relative disk size for one of my databases.

File                                                           Size on disk
original SQLite database                                       3.4 GB
text file (sqlite3 my_database.db .dump)                       1.3 GB
gzip-compressed text (sqlite3 my_database.db .dump | gzip -c)  240 MB

The gzip-compressed text file is 14× smaller than the original SQLite database – that makes downloading the database much faster.

My new ssh+rsync command

Rather than copying the database directly, now I create a gzip-compressed text file on the server, copy that to my local machine, and reconstruct the database. Like so:

# Create a gzip-compressed text file on the server
ssh username@server "sqlite3 my_remote_database.db .dump | gzip -c > my_remote_database.db.txt.gz"

# Copy the gzip-compressed text file to my local machine
rsync --progress username@server:my_remote_database.db.txt.gz my_local_database.db.txt.gz

# Remove the gzip-compressed text file from my server
ssh username@server "rm my_remote_database.db.txt.gz"

# Uncompress the text file
gunzip my_local_database.db.txt.gz

# Reconstruct the database from the text file
cat my_local_database.db.txt | sqlite3 my_local_database.db

# Remove the local text file
rm my_local_database.db.txt
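If you do this regularly, the whole dance is easy to wrap in a script. Here's a rough Python sketch using subprocess – it assumes ssh, rsync, sqlite3 and gunzip are all available, and it doesn't try to handle quoting of unusual filenames:

import subprocess


def copy_sqlite_db(host: str, remote_db: str, local_db: str) -> None:
    """Copy a remote SQLite database by dumping, compressing, and rebuilding it."""
    remote_dump = f"{remote_db}.txt.gz"
    local_dump = f"{local_db}.txt.gz"

    # Create a gzip-compressed text dump on the server
    subprocess.run(
        ["ssh", host, f"sqlite3 {remote_db} .dump | gzip -c > {remote_dump}"],
        check=True,
    )

    # Copy the compressed dump to the local machine, then clean up the server
    subprocess.run(
        ["rsync", "--progress", f"{host}:{remote_dump}", local_dump], check=True
    )
    subprocess.run(["ssh", host, f"rm {remote_dump}"], check=True)

    # Rebuild the local database from the dump, then remove the local dump
    subprocess.run(f"gunzip -c {local_dump} | sqlite3 {local_db}", shell=True, check=True)
    subprocess.run(["rm", local_dump], check=True)


copy_sqlite_db("username@server", "my_remote_database.db", "my_local_database.db")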

A database dump is a stable copy source

This approach fixes another issue I’ve had when copying SQLite databases.

If it takes a long time to copy a database and it gets updated midway through, rsync may give me an invalid database file. The first half of the file is pre-update, the second half is post-update, and they don't match. When I try to open the database locally, I get an error:

database disk image is malformed

By creating a text dump before I start the copy operation, I’m giving rsync a stable copy source. That text dump isn’t going to change midway through the copy, so I’ll always get a complete and consistent text file.


This approach has saved me hours when working with large databases, and made my downloads both faster and more reliable. If you have to copy around large SQLite databases, give it a try.


A flash of light in the darkness

2025-04-22 14:40:55

I support dark mode on this site, and as part of the dark theme, I have a colour-inverted copy of the default background texture. I like giving my website a subtle bit of texture, which I think makes it stand out from a web which is mostly solid-colour backgrounds. Both my textures are based on the “White Waves” pattern made by Stas Pimenov.

If you don’t switch between light and dark mode, you’ve probably only seen one of these background textures.

I was setting these images as my background with two CSS rules, using the prefers-color-scheme: dark media feature to use the alternate image in dark mode:

body {
  background: url('https://alexwlchan.net/theme/white-waves-transparent.png');
}

@media (prefers-color-scheme: dark) {
  body {
    background: url('https://alexwlchan.net/theme/black-waves-transparent.png');
  }
}

This works, mostly.

But I prefer light mode, so while I wrote this CSS and I do some brief testing whenever I make changes, I’m not using the site in dark mode. I know how dark mode works in my local development environment, not how it feels as a day-to-day user.

Late last night I was using my phone in dark mode to avoid waking the other people in the house, and I opened my site. I saw a brief flash of white, and then the dark background texture appeared. That flash of bright white is precisely what you don’t want when you’re using dark mode, but it happened anyway. I made a note to work it out in the morning, then I went to bed.

Now I’m fully awake, it’s obvious what happened. Because my only background is the image URL, there’s a brief gap between the CSS being parsed and the background image being loaded. In that time, the browser doesn’t have anything to put in the background, so you just get pure white.

This was briefly annoying in the moment, but it would be even worse if the background texture never loaded. I have light text on black in dark mode, but without the background image it's just light text on white, which is barely readable:

[Screenshot: light grey text on a white background – the text is barely readable because it has so little contrast with the background.]

I never noticed this in local development, because I’m usually working in a well-lit room where that white flash would be far less obvious. I’m also using a local version of the site, which loads near-instantly and where the background image is almost certainly saved in my browser cache.

I’ve made two changes to prevent this happening again.

  1. I’ve added a colour to use as a fallback until the image loads. The CSS background property supports adding a colour, which is used until the image loads, or as a fallback if it doesn’t. I already use this in a few places, and now I’ve added it to my body background.

    body {
      background: url('https://…/white-waves-transparent.png') #fafafa;
    }
    
    @media (prefers-color-scheme: dark) {
      body {
        background: url('https://…/black-waves-transparent.png') #0d0d0d;
      }
    }

    This avoids the flash of unstyled background before the image loads – the browser will use a solid dark background until it gets the texture.

  2. I’ve added rel="preload" elements to the head of the page, so the browser will start loading the background textures faster. These elements are a clue to the browser that these resources are going to be useful when it renders the page, so it should start loading them as soon as possible:

    <link
      rel="preload"
      href="https://alexwlchan.net/theme/white-waves-transparent.png"
      as="image"
      type="image/png"
      media="(prefers-color-scheme: light)"
    />
    <link
      rel="preload"
      href="https://alexwlchan.net/theme/black-waves-transparent.png"
      as="image"
      type="image/png"
      media="(prefers-color-scheme: dark)"
    />
    

    This means the browser is downloading the appropriate texture at the same time as it’s downloading the CSS file. Previously it had to download the CSS file, parse it, and only then would it know to start downloading the texture. With the preload, it’s a bit faster!

    The difference is probably imperceptible if you’re on a fast connection, but it’s a small win and I can’t see any downside (as long as I scope the preload correctly, and don’t preload resources I don’t end up using).

    I’ve seen a lot of sites using <link rel="preload"> and I’ve only half-understood what it is and why it’s useful – I’m glad to have a chance to use it myself, so I can understand it better.

This bug reminds me of a phenomenon called flash of unstyled text. Back when custom fonts were fairly new, you'd often see web pages appear briefly with the default font before custom fonts finished loading. There are well-understood techniques for preventing this, so it's unusual to see that brief unstyled text on modern web pages – but the same issue was affecting me in dark mode. I avoided using custom fonts on the web so I wouldn't have to deal with this problem, but it got me anyway!

In these dark times for the web, old bugs are new again.
