Last week, the Swift.org website got a redesign.
I don’t write much Swift at the moment, but I glanced at the new website to see what’s up and OOH COOL BIRD!
When you load the page, there’s a swooping animation as the bird appears:
I was curious how the animation worked.
I thought maybe it was an autoplaying video with no controls, but no, it’s much cooler than that!
The animation is implemented entirely in code – there are a few image assets, and then the motion uses JavaScript and the HTML5 canvas element.
I’ve never done anything with animation, so I started reading the code to understand how it works.
I’m not going to walk through it in detail, but I do want to show you what I learnt.
Most of the animation is made up of five “swoop” images, which look like strokes of a paintbrush.
These were clearly made by an artist in a design app like Photoshop.
These images are gradually revealed, so it looks like somebody is actually painting with a brush.
This is more complex than a simple horizontal wipe, the sort of animation you might do in PowerPoint.
Notice how, for example, the purple swoop doubles back on itself – if you did a simple left-to-right wipe, it would start as two separate swoops before joining into one.
It would look very strange!
Each swoop is animated in the same way, so let’s focus on the purple one, just because it’s the most visually interesting.
The animation is applying a mask to the underlying image, and the mask gradually expands to show more and more of the image.
The mask matches the general shape of the brush stroke, so as it expands, it reveals more of the image.
I wrote about masking with SVG four years ago, and the principle is similar here – but the Swift.org animation uses HTML5 canvas, not SVG.
The best way to explain this is with a quick demo: as you drag the slider back and forth, you can see the mask get longer and shorter, and that’s reflected in the final image.
[Interactive demo: original image + mask → final image, with an “animation progress” slider]
We can break this down into four steps:
Only draw part of a curved path (drawing the mask)
Combine the partially-drawn path with the original image (applying the mask)
Gradually increase the amount of the path that we draw (animating the path)
Start the animation when the page loads
Let’s go through each of these in turn.
Only draw part of a curved path with a dash pattern
Alongside the graphical image of a brush stroke, the artist supplied an SVG path for the mask:
If you’re not familiar with SVG path syntax, I really recommend Mathieu Dutour’s excellent SVG Path Visualizer tool.
You give it a path definition, and it gives you a step-by-step explanation of what it’s doing, and you can see where each part of the path appears in the final shape.
The way Swift.org draws a partial path is a really neat trick: they’re using a line dash pattern with a variable offset.
It took me a moment to figure out what their code was doing, but then it all clicked into place.
First they set a line dash pattern using setLineDash(), which specifies the alternating lengths of dashes and gaps used to draw the line.
Here’s a quick demo:
ctx.setLineDash([100])
The path starts in the lower left-hand corner, and notice how it always starts with a complete dash, not a gap.
You can change this by setting the lineDashOffset property, which causes the pattern to start on a gap, or halfway through a dash.
Here’s a demo where you can set both variables at once:
ctx.setLineDash([75])
ctx.lineDashOffset = 0;
I find the behaviour of lineDashOffset a bit counter-intuitive: as I increase the offset, it looks like the path is moving backward.
I was expecting that increasing the offset would push the start of the first dash further along, so the line would move in the other direction.
I’m sure it makes sense if you have the right mental model, but I’m not sure what it is.
If you play around with these two variables, you might start to see how you can animate the path as if it’s being drawn from the start.
Here are the steps:
Set the dash length to the exact length of the path.
This means every dash and every gap is the same length as the entire path.
(The length of the purple swoop path is 2776, a number I got from the Swift.org source code.
This must have been calculated with an external tool; I can’t find a way to calculate this length in a canvas.)
Set the dash offset to the exact length of the path.
This means the entire path is just a gap, which makes it look like there’s nothing there.
Gradually reduce the dash offset to zero.
A dash becomes visible at the beginning of the path, and the closer the offset gets to zero, the more of that dash is visible.
Eventually it fills the entire path.
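Putting those steps into code gives something like this – a sketch rather than the Swift.org source, assuming a canvas context ctx, a Path2D called path, and a progress value between 0 and 1:
const pathLength = 2776;                           // length of the purple swoop path
ctx.setLineDash([pathLength]);                     // each dash and gap is as long as the whole path
ctx.lineDashOffset = pathLength * (1 - progress);  // shrinks to 0 as progress approaches 1
ctx.stroke(path);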
Here’s one more demo, where I’ve set up the line dash pattern, and you can adjust the progress.
Notice how the line gradually appears:
Now we have a way to draw part of a path, and as we advance the progress, it looks like it’s being drawn with a brush.
The real code has a couple of extra styles – in particular, it sets a stroke width and a line cap – but it’s the way the animation uses the dash pattern that really stood out to me.
Once we have our path, how do we use it to mask an image?
Mask an image with a globalCompositeOperation
The masking uses a property of HTML5 canvas called globalCompositeOperation.
If you’ve already drawn some shapes on a canvas, you can control how new shapes will appear on top of them – for example, which one appears on top, or whether to clip one to fit inside the other.
I’m familiar with the basic idea – I wrote an article about clips and masks in SVG in 2021 that I still look back on fondly – but I find this feature a bit confusing, especially the terminology.
Rather than talking about clips or masks, this property is defined using sources (shapes you’re about to draw on the canvas) and destinations (shapes that are already on the canvas).
I’m sure that naming makes sense to somebody, but it’s not immediately obvious to me.
First we need to load the bitmap image which will be our “source”.
We can create a new img element with document.createElement("img"), then load the image by setting the src attribute:
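// Create the <img> element and start loading the bitmap
// (the filename here is a placeholder, not the real asset name)
const img = document.createElement("img");
img.src = "purple-swoop.png";

// Wait for it to finish loading before we draw it onto the canvas
img.addEventListener("load", () => {
  // ...draw the masked image...
});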
In the Swift.org animation, the value of globalCompositeOperation is source-in – the new shape is only drawn where the new shape and the old shape overlap, and the old shape becomes transparent.
Here’s the code:
// The thick black stroke is the "destination"
ctx.stroke(path)

// The "source-in" mode means only the part of the source that is
// inside the destination will be shown, and the destination will
// be transparent.
ctx.globalCompositeOperation = 'source-in'

// The bitmap image is the "source"
ctx.drawImage(img, 0, 0)
and here’s what the result looks like, when the animation is halfway complete:
[Demo: destination + source → final image]
There are many different composite operations, including ones that combine colours or blend pixels from both shapes.
If you’re interested, you can read the docs on MDN, which includes a demo of all the different blending modes.
This is a bit of code where I can definitely understand what it does when I read it, but I wouldn’t feel confident writing something like this myself.
It’s too complex a feature to wrap my head around with a single example, and the other examples I found are too simple and unmotivating.
(Many sites use the example of a solid red circle and a solid blue rectangle, which I find completely unhelpful because I can produce the final result in a dozen other ways.
What’s the real use case for this property?
What can I only do if I use globalCompositeOperation?)
Then again, perhaps I’m not the target audience for this feature.
I mostly do simple illustrations, and this is a more powerful graphics operation.
I’m glad to know it’s there, even if I’m not sure when I’ll use it.
Now that we can draw a partial stroke and use it as a mask, how do we animate it?
Animate the brush stroke with Anime.js
Before I started reading the code in detail, I tried to work out how I might create an animation like this myself.
I haven’t done much animation, so the only thing I could think of was JavaScript’s setTimeout() and setInterval() functions.
Using those repeatedly to update a progress value would gradually draw the stroke.
I tried it, and that does work!
But I can think of some good reasons why it’s not what’s used for the animation on Swift.org.
The timing of setTimeout() and setInterval() isn’t guaranteed – the browser may delay longer than expected if the system is under load or you’re updating too often.
That could make the animation jerky or stuttery.
Even if the delays fire correctly, it could still look a bit janky – you’re stepping between a series of discrete frames, rather than smoothly animating a shape.
If there’s too much of a change between each frame, it would ruin the illusion.
Swift.org is using Julian Garnier’s Anime.js animation library.
Under the hood, this library uses web technologies like requestAnimationFrame() and hardware acceleration – stuff I’ve heard of, but never used.
I assume these browser features are optimised for doing smooth and efficient animations – for example, they must sync to the screen refresh rate, only drawing frames as necessary, whereas using setInterval() might draw lots of unused frames and waste CPU.
Anime.js has a lot of different options, but the way Swift.org uses it is fairly straightforward.
First it creates an object to track the state of the animation:
const state = { progress: 0 };
Then there’s a function that redraws the swoop based on the current progress.
It clears the canvas, then redraws the partial path and the mask:
function updateSwoop() {
  // Clear canvas before next draw
  ctx.clearRect(0, 0, canvas.width, canvas.height);

  // Draw the part of the stroke that we want to display
  // at this point in the animation
  ctx.lineDashOffset = swoop.pathLength * (1 - state.progress);
  ctx.stroke(new Path2D(swoop.path));

  // Draw the image, using "source-in" to apply a mask
  ctx.globalCompositeOperation = 'source-in'
  ctx.drawImage(img, 0, 0);

  // Reset to default for our next stroke paint
  ctx.globalCompositeOperation = 'source-out';
}
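This update function gets hooked up to an Anime.js timeline, which animates state.progress from 0 to 1. Here’s a sketch of that setup, assuming the Anime.js v3 API – the duration and easing are placeholders, not the values from the real code:
// assuming `anime` is the Anime.js v3 global
const tl = anime.timeline({ autoplay: false });

tl.add({
  targets: state,          // the { progress: 0 } object from above
  progress: 1,             // animate state.progress up to 1
  duration: 1200,          // placeholder duration
  easing: 'easeInOutQuad', // placeholder easing
  update: updateSwoop,     // redraw the swoop on every tick
});

tl.play();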
You may have wondered why the state is an object, and not a single value like const progress = 0.
If we passed a numeric value to tl.add(), JavaScript would pass it by value, and any changes wouldn’t be visible to the updateSwoop() function.
By wrapping the progress value in an object, JavaScript will pass by reference instead, so changes made inside tl.add() will be visible when updateSwoop() is called.
Now we can animate our swoop, as if it was a brush stroke.
There’s one final piece: how do we start the animation?
Start the animation with a MutationObserver
If I want to do something when a page loads, I normally watch for the DOMContentLoaded event, for example:
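// A generic example (not the Swift.org code): run some setup code
// once the initial HTML has been parsed
document.addEventListener("DOMContentLoaded", () => {
  setUpMyPage();  // hypothetical setup function
});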
The Swift.org code does something different: it uses a MutationObserver to watch the entire page for changes, and starts the animation once it finds the animation’s wrapper <div>:
// Start animation when container is mounted
const observer = new MutationObserver(() => {
  const animContainer = document.querySelector('.animation-container')
  if (animContainer) {
    observer.disconnect()
    heroAnimation(animContainer)
  }
})

observer.observe(document.documentElement, {
  childList: true,
  subtree: true,
})
It achieves the same effect as watching for DOMContentLoaded, but in a different way.
I don’t think there’s much difference between DOMContentLoaded and MutationObserver in this particular case, but I can see that MutationObserver is more flexible for the general case.
You can target a more precise element than “the entire document”, and you can look for changes beyond just the initial load.
I suspect the MutationObserver approach may also be slightly faster – I added a bit of console logging, and if you don’t disconnect the observer, it gets called three times when loading the Swift.org homepage.
If the animation container exists on the first call, you can start the animation immediately, rather than waiting for the rest of the DOM to load.
I’m not sure if that’s a perceptible difference though, except for very large and complex web pages.
This step completes the animation.
When the page loads, we can start an animation that draws the brush stroke as a path.
As the animation continues, we draw more and more of that path, and the path is used as a mask for a bitmap image, gradually unveiling the purple swoop.
Skip the animation if you have (prefers-reduced-motion: reduce)
There’s one other aspect of the animation on Swift.org that I want to highlight.
At the beginning of the animation sequence, it checks to see if you have the “prefers reduced motion” preference.
This is an accessibility setting that allows somebody to minimise non-essential animations.
Further down, the code checks for this preference, and if it’s set, it skips the animation and just renders the final image.
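The check itself is a one-liner with window.matchMedia – something like this, where the variable name and drawFinalFrame() are my own stand-ins:
const prefersReducedMotion = window.matchMedia('(prefers-reduced-motion: reduce)').matches;

if (prefersReducedMotion) {
  drawFinalFrame();  // skip the animation and just render the finished artwork
}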
I’m already familiar with this preference and I use it on a number of my own sites, but it’s still cool to see.
Closing thoughts
Thanks again to the three people who wrote this animation code: Federico Bucchi, Jesse Borden, and Nicholas Krambousanos.
They wrote some very readable JavaScript, so I could understand how it worked.
The ability to “view source” and see how a page works is an amazing feature of the web, and being able to browse the open source commit history is the cherry on the cake.
I really enjoyed writing this post, and getting to understand how this animation works.
I don’t know that I could create something similar – in particular, I don’t have the graphics skills to create the bitmap images of brush strokes – but I’d feel a lot more confident trying than I would before.
I’ve learned a lot from reading this code, and I hope you’ve learned something as well.
I wanted to end this series on a lighter note, so here’s a handful of my favourite sites I rediscovered while reviewing my bookmarks – fun, creative corners of the web that make me smile.
This article is the final part of a four part bookmarking mini-series:
My favourite websites from my bookmark collection (this article) – websites that change randomly, that mirror the real world, or even follow the moon and the sun, plus my all-time favourite website design.
Jason Kottke’s website, kottke.org, has a sidebar that shows four coloured circles, different for every visitor.
There are nearly a trillion possible combinations, which means everyone gets their own unique version of the page.
I happened to capture a particularly aesthetically pleasing collection of reds and pinks in this preserved snapshot.
They add a pop of colour to the page, but they don’t overwhelm it.
I think this adds a dash of fun whimsy, and I’ve tried adding something similar to my own sites, but it’s easy to get wrong.
My experiments in randomness often failed because they lacked constraints – for example, I’d pick random tint colours, but some combinations were unreadable.
The Kottke “planets” strike a nice balance: the randomness stands out, but it’s reined in so the overall page will always look good.
There are thousands of snapshots of kottke.org in the Wayback Machine, many saved automatically.
That means there are unique combinations of circles already archived that have yet to be seen by a person – frozen moments that may only be seen by a future reader, long after this design has gone.
I rather like that: a tiny, quiet, time capsule on the web.
Physical meets digital on panic.com
The software company Panic has a circular logo: a stylised “P” on a two-tone blue background.
But for years, if you visited their website, you might see that logo in a different colour, like this:
Where did those colours come from?
The logo image was loaded from signserver.panic.com, which makes me think it reflected the current colours of the physical sign on their building.
They even had a website where anybody could change the colours of the sign (though it’s offline now – they took the sign down when they moved offices).
I love this detail: a tiny bit of the physical world seeping into the digital.
A Tumblr theme that follows the sun
Another instance of the physical world affecting the digital comes from one of my bookmarked Tumblr posts, which uses a remarkable theme called Circadium 2.0, made by Tumblr user Laighlin.
Forget a binary switch between light and dark mode: this is a theme that slowly changes its appearance throughout the day.
The background changes colour, stars fade in and out, and the moon and the sun gradually rise and set.
It cycles through noon, twilight, and dusk, before starting the cycle over again.
It’s hard to describe it in words, so here’s a screen recording of the demo site for a 24 hour cycle:
This effect is very subtle, because the appearance is set based on the time you loaded the page, and doesn’t change after that.
Unless you reload the same page repeatedly, you may not even notice the background is changing.
This is the sort of creativity I love about sites like Tumblr and LiveJournal, where users can really customise the appearance of their sites – not just pick a profile picture and a tint colour.
Subtle transitions at Campo Santo
The Campo Santo blog has a more restrained design, but still makes fun use of shifting colours – the tint colour of the page gradually switches from a reddish orange to brown, to green, to a dark yellow, and back to orange.
This tint colour affects multiple elements on the page: the header, the sidebar promo, headings and social media links.
Here’s what it looks like:
I so enjoyed Firewatch, and I’m still a little bitter that In the Valley of Gods got cancelled.
I would have loved another first-person exploration game from that team.
Sadly, this animation only lives on in web archives and in memory – something has broken in the JavaScript that means it no longer works on the live site.
The fragility of the web isn’t just entire pages or sites going offline, it’s also the gradual breaking of pages that remain online.
The hand-drawn aesthetic of Owltastic
My favourite website is the old design of Meagan Fisher Couldwell’s website.
It has a beautiful, hand-drawn aesthetic, and it’s full of subtle texture and round corners – no straight lines, no hard edges.
It has a soft and gentle appearance, and a friendly owl mascot to boot.
I bookmarked this particular page in 2013, before iOS 7, when loud textures and skeuomorphism were still in fashion – but unlike many designs from that era which now look dated, I think this site still looks good today.
I just know that owl and I would be friends.
Owltastic is the first site I remember seeing and thinking “wow”, and wanting to build something that looked that good.
Meagan has since redesigned her site, but I have a lot of nostalgia for that hand-drawn look.
Final thoughts
A lot of the creativity has been squeezed out of the web.
I’ve been working on a separate social media archiving project recently, and it’s depressing how many sites look essentially the same – black sans serif text on a white background.
(Many of them even use the default system font, because heaven forbid a site have any personality.)
Going through my bookmarks has been a fun reminder that the web is still a creative, colourful, and diverse space – the variety is there, even if it’s getting harder to find.
Lots of people are doing interesting stuff on the web, and my bookmarks are a way to remember and celebrate that.
Revamping the way I organise my bookmarks has taken a lot of work, but I’m so pleased with the result.
Now I have a list of my most important web pages in an open format, saved locally, with an archived copy of each page as well.
I can browse them in a simple web interface, and see every page as I remember it, even if the original website has disappeared.
I don’t like making predictions, but this feels like a system that should last a long time.
There are no third-party dependencies, nothing that will need upgrading, no external service that could be abandoned or enshittified.
I feel like I could manage my bookmarks this way for the rest of my life – stay tuned to see if that holds true!
Writing this four-part series has been the capstone to this year-long project.
I had a lot of time to think about bookmarks and web archiving, and I didn’t want those thoughts to disappear.
I hope you’ve enjoyed it, and that I’ve given you some new ideas.
Thank you for reading this far.
My favourite parts of the web are the spaces where people share interesting ideas.
This mini-series – and this entire blog – is my contribution to that collective work.
Over the past year, I built a web archive of over two thousand web pages – my own copy of everything I’ve bookmarked in the last fifteen years.
I saved each one by hand, reading and editing the HTML to build a self-contained, standalone copy of each web page.
These web pages were made by other people, many using tools and techniques I didn’t recognise.
What started as an exercise in preservation became an unexpected lesson in coding: I was getting a crash course in how the web is made.
Reading somebody else’s code is a great way to learn, and I was reading a lot of somebody else’s code.
In this post, I’ll show you some of what I learnt about making websites: how to write thoughtful HTML, new-to-me features of CSS, and some quirks and relics of the web.
This article is the third in a four part bookmarking mini-series:
I know I’ve read a list of HTML tags in reference documents and blog posts, but there are some tags I’d forgotten, misunderstood, or never seen used in the wild.
Reading thousands of real-world pages gave me a better sense of how these tags are actually used, and when they’re useful.
The <aside> element
MDN describes <aside> as “a portion of a document whose content is only indirectly related to the document’s main content”.
That’s vague enough that I was never quite sure when to use it.
In the web pages I read, I saw <aside> used in the middle of larger articles, for things like ads, newsletter sign ups, pull quotes, or links to related articles.
I don’t have any of those elements on my site, but now I have a stronger mental model of where to use <aside>.
I find concrete examples more useful than abstract definitions.
I also saw a couple of sites using the <ins> (inserted text) element for ads, but I think <aside> is a better semantic fit.
The <mark> element
The <mark> element highlights text, typically with a yellow background.
It’s useful for drawing visual attention to a phrase, and I suspect it’s helpful for screen readers and parsing tools as well.
I saw it used in Medium to show reader highlights, and I’ve started using it in code samples when I want to call out specific lines.
The <section> element
The <section> tag is a useful way to group content on a page – more meaningful than a generic <div>.
I’d forgotten about it, although I use similar tags like <article> and <main>.
Seeing it used across different sites reminded me it exists, and I’ve since added it to a few projects.
The <hgroup> (heading group) element
The <hgroup> tag is for grouping a heading with related metadata, like a title and a publication date:
<hgroup>
  <h1>All about web bookmarking</h1>
  <p>Posted 16 March 2025</p>
</hgroup>
This is another tag I’d forgotten, which I’ve started using for the headings on this site.
The <video> element
The <video> tag is used to embed videos in a web page.
It’s a tag I’ve known about for a long time – I still remember reading Kroc Camen’s article Video is for Everybody in 2010, back when Flash was being replaced as the dominant way to watch video on the web.
While building my web archive, I replaced a lot of custom video players with <video> elements and local copies of the videos.
This was my first time using the tag in anger, not just in examples.
One mistake I kept making was forgetting to close the tag, or trying to make it self-closing:
<!-- this is wrong -->
<video controls src="videos/Big_Buck_Bunny.mp4" />
It looks like <img>, which is self-closing, but <video> can have child elements, so you have to explicitly close it with </video>.
The <progress> indicator element
The <progress> element shows a progress indicator.
I saw it on a number of sites that publish longer articles – they used a progress bar to show you how far you’d read.
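A bare-bones version of that reading-progress bar might look like this – the numbers are made up, and you’d update value from JavaScript as the reader scrolls:
<progress max="100" value="40">40%</progress>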
The <base> element
The <base> element sets a base URL for every relative URL on a page. For example, if a page has <base href="https://example.com/"> in its <head>, and later includes <img src="pictures/cat.jpg">, the image will be loaded from https://example.com/pictures/cat.jpg.
It’s still not clear to me when you should use <base>, or what the benefits are (aside from making your URLs a bit shorter), but it’s something I’ll keep an eye out for in future projects.
Clever corners of CSS
The CSS @import rule
CSS has @import rules, which allow one stylesheet to load another:
@import"fonts.css";
I’ve used @import in Sass, but I only just realised it’s now a feature of vanilla CSS – and one that’s widely used.
The CSS for this website is small enough that I bundle it into a single file for serving over HTTP (a mere 13KB), but I’ve started using @import for static websites I load from my local filesystem, and I can imagine it being useful for larger projects.
One feature I’d find useful is conditional imports based on selectors.
You can already do conditional imports based on a media query (“only load these styles on a narrow screen”) and something similar for selectors would be useful too (for example, “only load these styles if a particular class is visible”).
I have some longer rules that aren’t needed on every page, like styles for syntax highlighting, and it would be nice to load them only when required.
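For comparison, here’s what the existing media-query form looks like, with made-up filenames:
/* Only load these styles on a narrow screen */
@import "narrow.css" screen and (max-width: 600px);

/* Load the syntax highlighting styles everywhere */
@import "syntax-highlighting.css";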
[attr$=value] is a CSS selector for suffix values
While reading Home Sweet Homepage, I found a CSS selector I didn’t understand:
This $= syntax is a bit of CSS that selects elements whose src attribute ends with page01/image2.png.
It’s one of several attribute selectors that I hadn’t seen before – you can also match exact values, prefixes, or words in space-separated lists.
You can also control whether you want case-sensitive or -insensitive matching.
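Here’s a quick tour of those variants, with selectors I’ve made up:
a[href="https://example.com/"] { }   /* exact match */
a[href^="https://"] { }              /* prefix match */
img[src$="page01/image2.png"] { }    /* suffix match */
div[class~="promo"] { }              /* matches one word in a space-separated list */
a[href$=".pdf" i] { }                /* case-insensitive suffix match */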
You can create inner box shadows with inset
Here’s a link style from an old copy of the Entertainment Weekly website:
a { box-shadow: inset 0 -6px 0 #b0e3fb; }
A link on EW.com
The inset keyword was new to me: it draws the shadow inside the box, rather than outside.
In this case, they’re setting offset-x=0, offset-y=-6px and blur-radius=0 to create a solid stripe that appears behind the link text – like a highlighter running underneath it.
If you want something that looks more shadow-like, here are two boxes that show the inner/outer shadow with a blur radius:
[Demo: two boxes, one with an inner shadow and one with an outer shadow]
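The CSS for a pair of boxes like that is roughly this – the size and colour of the shadow are my guesses:
.inner-shadow { box-shadow: inset 0 0 8px rgba(0, 0, 0, 0.4); }
.outer-shadow { box-shadow: 0 0 8px rgba(0, 0, 0, 0.4); }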
I don’t have an immediate use for this, but I like the effect, and the subtle sense of depth it creates.
The contents of the box with the inner shadow look like they’re below the page, while the box with the outer shadow floats above it.
For images that get bigger, cursor: zoom-in can show a magnifying glass
On gallery websites, I often saw this CSS rule used for images that link to a larger version:
cursor: zoom-in;
Instead of using cursor: pointer; (the typical hand icon for links), this shows a magnifying glass icon – a subtle cue that clicking will zoom or expand the image.
Here’s a quick comparison:
the default cursor is typically an arrow
the pointer cursor is typically a hand, used to indicate links
the zoom-in cursor is a magnifying glass with a plus sign, suggesting “click to enlarge”
I knew about the cursor property, but I’d never thought to use it that way.
It’s a nice touch, and I want to use it the next time I build a gallery.
Writing thoughtful HTML
The order of elements
My web pages have a simple one column design: a header at the top, content in the middle, a footer at the bottom.
I mirror that order in my HTML, because it feels a natural structure.
I’d never thought about how to order the HTML elements in more complex layouts, when there isn’t such a clear direction.
For example, many websites have a sidebar that sits alongside the main content.
Which comes first in the HTML?
I don’t have a firm answer, but reading how other people structure their HTML got me thinking.
I noticed several pages that put the sidebar at the very end of the HTML, then used CSS to position it visually alongside the content.
That way, the main content appears earlier in the HTML file, which means it can load and become readable sooner.
It’s something I want to consider next time I’m building a more complex page.
Comments to mark the end of large containers
I saw a lot of websites (mostly WordPress) that used HTML comments to mark the end of containers with a lot of content.
For example:
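<!-- (a sketch of the pattern; the class names are made up) -->
<div class="entry-content">
  …
</div>
<!-- end .entry-content -->

<div class="sidebar">
  …
</div>
<!-- end .sidebar -->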
These comments made the HTML much easier to read – I could see exactly where each component started and ended.
I like this idea, and I’m tempted to use it in my more complex projects.
I can imagine this being especially helpful in template files, where HTML is mixed with template markup in a way that might confuse code folding, or make the structure harder to follow.
The data-href attribute in <style> tags
Here’s a similar idea: I saw a number of sites set a data-href attribute on their <style> tags, as a way to indicate the source of the CSS.
Something like:
<style data-href="https://example.com/style.css">
I imagine this could be useful for developers working on that page, to help them find where they need to make changes to that <style> tag.
Translated pages with <link rel="alternate"> and hreflang
I saw a few web pages with translated versions, and they used <link> tags with rel="alternate" and an hreflang attribute to point to those translations.
Here’s an example from a Panic article, which is available in both US English and Japanese:
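<!-- (the URLs are placeholders, not the article's real addresses) -->
<link rel="alternate" hreflang="en-us" href="https://panic.com/blog/some-article/" />
<link rel="alternate" hreflang="ja" href="https://panic.com/jp/blog/some-article/" />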
This seems to be for the benefit of search engines and other automated tools, not web browsers.
If your web browser is configured to prefer Japanese, you’d see a link to the Japanese version in search results – but if you open the English URL directly, you won’t be redirected.
This makes sense to me – translations can differ in content, and some information might only be available in one language.
It would be annoying if you couldn’t choose which version you wanted.
Panic’s article includes a third <link rel="alternate"> tag:
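<!-- (again, a placeholder URL) -->
<link rel="alternate" hreflang="x-default" href="https://panic.com/blog/some-article/" />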
This x-default value is a fallback, used when there’s no better match for the user’s language.
For example, if you used a French search engine, you’d be directed to this URL because there isn’t a French translation.
Almost every website I’ve worked on has been English-only, so internationalisation is a part of the web I know very little about.
Fetching resources faster with <link rel="preload">
I saw a lot of websites with <link rel="preload"> tags in their <head>.
This tells the browser about resources that will be needed soon, so it should start fetching them immediately.
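On my own site, that looks something like this – the filename is a stand-in for the real one:
<link rel="preload" href="/static/texture.png" as="image">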
That image is used as a background texture in my CSS file.
Normally, the browser would have to download and parse the CSS before it even knows about the image – which means a delay before it starts loading it.
By preloading the image, the browser can begin downloading the image in parallel with the CSS file, so it’s already in progress when the browser reads the CSS.
The difference is probably imperceptible on a fast connection, but it is a performance improvement – and as long as you scope the preloads correctly, there’s little downside.
(Scoping means ensuring you don’t preload resources that aren’t used).
I saw some sites use DNS prefetching, which is a similar idea.
The rel="dns-prefetch" attribute tells the browser about domains it’ll fetch resources from soon, so it should begin DNS resolution early.
The most common example was websites using Google Fonts:
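<!-- Tell the browser it will soon be fetching from the Google Fonts domains -->
<link rel="dns-prefetch" href="https://fonts.googleapis.com/">
<link rel="dns-prefetch" href="https://fonts.gstatic.com/">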
I only added preload tags to my site a few weeks ago.
I’d seen them in other web pages, but I didn’t appreciate the value until I wrote one of my own.
Quirks and relics
There are still lots of <!--[if IE]> comments
Old versions of Internet Explorer supported conditional comments, which allowed developers to add IE-specific behaviour to their pages.
Internet Explorer would render the contents of the comment as HTML, while other browsers ignored it.
This was a common workaround for deficiencies in IE, when pages needed specific markup or styles to render correctly.
Here’s an example, where the developer adds an IE-specific style to fix a layout issue:
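<!-- (a reconstruction of the pattern – the selector and the fix are made up) -->
<!--[if IE]>
  <style>
    /* IE miscalculates the float width, so give the sidebar an explicit one */
    #sidebar { width: 300px; }
  </style>
<![endif]-->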
Some websites even used conditional comments to display warnings and encourage users to upgrade, like this message that’s still present on the RedMonk website today:
<!--[if IE]>
<div class="alert alert-warning">
You are using an <strong>outdated</strong> browser. Please <a href="http://browsehappy.com/">upgrade your browser</a> to improve your experience.
</div>
<![endif]-->
This syntax was already disappearing by the time I started building websites – support for conditional comments was removed in Internet Explorer 10, released in 2012, the same year that Google Chrome became the most-used browser worldwide.
I never wrote one of these comments, but I saw lots of them in archived web pages.
These comments are a relic of an earlier web.
Most websites have removed them, but they live on in web archives, and in the memories of web developers who remember the bad old days of IE6.
Templates in <script> tags with a non-standard type attribute
I came across a few pages using <script> tags with a type attribute that I didn’t recognise.
Here’s a simple example:
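<!-- (a made-up example of the pattern) -->
<script type="text/x-template" id="result-template">
  <li class="result">
    <a class="title" href=""></a>
    <p class="summary"></p>
  </li>
</script>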
Browsers ignore <script> tags with an unrecognised type – they don’t run them, and they don’t render their contents.
Developers have used this as a way to include HTML templates in their pages, which JavaScript could extract and use later.
This trick was so widespread that HTML introduced a dedicated <template> element for the same purpose.
It’s been in all the major browsers for years, but there are still instances of this old technique floating around the web.
Browsers won’t load external file:// resources from file:// pages
Because my static archives are saved as plain HTML files on disk, I often open them directly using the file:// protocol, rather than serving them over HTTP.
This mostly works fine – but I ran into a few cases where pages behave differently depending on how they’re loaded.
One example is the SVG <use> element.
Some sites I saved use SVG sprite sheets for social media icons, with markup like:
<usehref="sprite.svg#logo-icon"></use>
This works over http://, but when loaded via file://, it silently fails – the icons don’t show up.
This turns out to be a security restriction.
When a file:// page tries to load another file:// resource, modern browsers treat it as a cross-origin request and block it.
This is to prevent a malicious downloaded HTML file from snooping around your local disk.
It took me a while to figure this out.
At first, all I got was a missing icon.
I could see an error in my browser console, but it was a bit vague – it just said I couldn’t load the file for “security reasons”.
Then I dropped this snippet into my dev tools console:
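// Something like this – a plain fetch() of the sprite file:
fetch("sprite.svg")
  .then((response) => console.log(response))
  .catch((error) => console.error(error));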
It gave me a different error message, one that explicitly mentioned cross-origin resource sharing: “CORS request not http”.
This gave me something I could look up, and led me to the answer.
This is easy to work around – if I spin up a local web server (like Python’s http.server), I can open the page over HTTP and everything loads correctly.
What does GPT stand for in attributes?
Thanks to the meteoric rise of ChatGPT, I’ve come to associate the acronym “GPT” with large language models (LLMs) – it stands for Generative Pre-trained Transformer.
That means I was quite surprised to see “GPT” crop up on web pages that predate the widespread use of generative AI.
It showed up in HTML attributes like this:
<divid="div-gpt-ad-1481124643331-2">
I discovered that “GPT” also stands for Google Publisher Tag, part of Google’s ad infrastructure.
I’m not sure exactly what these tags were doing – and since I stripped all the ads out of my web archive, they’re not doing anything now – but it was clearly ad-related.
What’s the instapaper_ignore class?
I found some pages that use the instapaper_ignore CSS class to hide certain content.
Here’s an example from an Atlantic article I saved in 2017:
<asideclass="pullquote instapaper_ignore">
Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.
</aside>
Instapaper is a “read later” service – you save an article that looks interesting, and later you can read it in the Instapaper app.
Part of the app is a text parser that tries to extract the article’s text, stripping away junk or clutter.
The instapaper_ignore class is a way for publishers to control what that parser includes.
From a blog post in 2010:
Additionally, the Instapaper text parser will support some standard CSS class names to instruct it:
instapaper_body: This element is the body container.
instapaper_ignore: These elements, when inside the body container, should be removed from the text parser’s output.
In this example, the element is a pull quote – a repeated line from the article, styled to stand out.
On the full web page, it works.
But in the unstyled Instapaper view, it would just look like a duplicate sentence.
It makes sense that the Atlantic wouldn’t want it to appear in that context.
Only a handful of pages I’ve saved ever used instapaper_ignore, and even fewer are still using it today.
I don’t even know if Instapaper’s parser still looks for it.
This stood out to me because I was an avid Instapaper user for a long time.
I deleted my account years ago, and I don’t hear much about “read later” apps these days – but then I stumble across a quiet little relic like this, buried in the HTML.
I found a bug in the WebKit developer tools
Safari is my regular browser, and I was using it to preview pages as I saved them to my archive.
While I was archiving one of Jeffrey Zeldman’s posts, I was struggling to understand how some of his CSS worked.
I could see the rule in my developer tools, but I couldn’t figure out why it was behaving the way it was.
Eventually, I discovered the problem: a bug in WebKit’s developer tools was introducing whitespace that changed the meaning of the CSS.
For example, suppose the server sends this minified CSS rule:
body>*:not(.black){color:green;}
WebKit’s dev tools prettify it like this:
body > * :not(.black) {
    color: green;
}
But these aren’t equivalent!
The original rule matches direct children of <body> that don’t have the black class.
The prettified version matches any descendant of <body> that doesn’t have the black class and that isn’t a direct child.
The CSS renders correctly on the page, but the bug means the Web Inspector can show something subtly wrong.
It’s a formatting bug that sent me on a proper wild goose chase.
This bug remains unfixed – but interestingly, a year later, that particular CSS rule has disappeared from Zeldman’s site.
I wonder if it caused any other problems?
Closing thoughts
The web is big and messy and bloated, and there are lots of reasons to be pessimistic about the state of modern web development – but there are also lots of people doing cool and interesting stuff with it.
As I was reading this mass of HTML and CSS, I had so many moments where I thought “ooh, that’s clever!” or “neat!” or “I wonder how that works?”.
I hope that as you’ve read this post, you’ve learnt something too.
I’ve always believed in the spirit of “view source”, the idea that you can look at the source code of any web page and see how it works.
Although that’s become harder as more of the web is created by frameworks and machines, this exercise shows that it’s clinging on.
We can still learn from reading other people’s source code.
When I set out to redo my bookmarks, I was only trying to get my personal data under control.
Learning more about front-end web development has been a nice bonus.
My knowledge is still a tiny tip of an iceberg, but now it’s a little bit bigger.
I know this post has been particularly dry and technical, so next week I’ll end this series on a lighter note.
I’ll show you some of my favourite websites from my bookmarks – the fun, the whimsical, the joyous – the people who use the web as a creative canvas, and who inspire me to make my web presence better.
I manage my bookmarks with a static website.
I’ve bookmarked over 2000 pages, and I keep a local snapshot of every page.
These snapshots are stored alongside the bookmark data, so I always have access, even if the original website goes offline or changes.
I’ve worked on web archives in a professional setting, but this one is strictly personal.
This gives me more freedom to make different decisions and trade-offs.
I can focus on the pages I care about, spend more time on quality control, and delete parts of a page I don’t need – without worrying about institutional standards or long-term public access.
In this post, I’ll show you how I built this personal archive of the web: how I save pages, why I chose to do it by hand, and what I do to make sure every page is properly preserved.
This article is the second in a four part bookmarking mini-series:
I’m building a personal web archive – it’s just for me.
I can be very picky about what it contains and how it works, because I’m the only person who’ll read it or save pages.
It’s not a public resource, and nobody else will ever see it.
This means it’s quite different to what I’d do in a professional or institutional setting.
There, the priorities are different: automation over manual effort, immutability over editability, and a strong preference for content that can be shared or made public.
I want a complete copy of every web page
I want my archive to have a copy of every page I’ve bookmarked, and for each copy to be a good substitute for the original.
It should include everything I need to render the page – text, images, videos, styles, and so on.
If the original site changes or goes offline, I should still be able to see the page as I saved it.
I want the archive to live on my computer
I don’t want to rely on an online service which could break, change, or be shut down.
I learnt this the hard way with Pinboard.
I was paying for an archiving account, which promised to save a copy of all my bookmarks.
But in recent years it became unreliable – sometimes it would fail to archive a page, and sometimes I couldn’t retrieve a supposedly saved page.
It should be easy to save new pages
I save a couple of new bookmarks a week.
I want to keep this archive up-to-date, and I don’t want adding pages to be a chore.
It should support private or paywalled pages
I read a lot of pages which aren’t on the public web, stuff behind paywalls or login screens.
I want to include these in my web archive.
Many web archives only save public content – either because they can’t access private content to save, or they couldn’t share it if they did.
This makes it even more important that I keep my own copy of private pages, because I may not find another.
It should be possible to edit snapshots
This is both additive and subtractive.
Web pages can embed external resources, and sometimes I want those resources in my archive.
For example, suppose somebody publishes a blog post about a conference talk, and embeds a YouTube video of them giving the talk.
I want to download the video, not just the YouTube embed code.
Web pages also contain a lot of junk that I don’t care about saving – ads, tracking, pop-ups, and more.
I want to cut all that stuff out, and just keep the useful parts.
It’s like taking clippings from a magazine: I want the article, not the ads wrapped around it.
What does my web archive look like?
I treat my archived bookmarks like the bookmarks site itself: as static files, saved in folders on my local filesystem.
A static folder for every page
For every page, I have a folder with the HTML, stylesheets, images, and other linked files.
Each folder is a self-contained “mini-website”.
If I want to look at a saved page, I can just open the HTML file in my web browser.
The files for a single web page, saved in a folder in my archive.
I flatten the structure of each website into top-level folders like images and static, which keeps things simple and readable.
I don’t care about the exact URL paths from the original site.
Any time the HTML refers to an external file, I’ve changed it to fetch the file from the local folder rather than the original website.
For example, the original HTML might have an <img> tag that loads an image from https://preshing.com/~img/poster-wall.jpg, but in my local copy I’d change the <img> tag to load from images/poster-wall.jpg.
I like this approach because it’s using open, standards-based web technology, and this structure is simple, durable, and easy to maintain.
These folder-based snapshots will likely remain readable for the rest of my life.
Why not WARC or WACZ?
Many institutions store their web archives as WARC or WACZ, which are file formats specifically designed to store preserved web pages.
These files contain the saved page, as well as extra information about how the archive was created – useful context for future researchers.
This could include the HTTP headers, the IP address, or the name of the software that created the archive.
You can only open WARC or WACZ files with specialist “playback” software, or by unpacking the files from the archive.
Both file formats are open standards, so theoretically you could write your own software to read them – archives saved this way aren’t trapped in a proprietary format – but in practice, you’re picking from a small set of tools.
In my personal archive, I don’t need that extra context, and I don’t want to rely on a limited set of tools.
It’s also difficult to edit WARC files, which is one of my requirements.
I can’t easily open them up and delete all the ads, or add extra files.
I prefer the flexibility of files and folders – I can open HTML files in any web browser, make changes with ease, and use whatever tools I like.
How do I save a local copy of each web page?
I save every page by hand, then I check it looks good – that I’ve saved all the external resources like images and stylesheets.
This manual inspection gives me the peace of mind to know that I really have saved each web page, and that it’s a high quality copy.
I’m not going to open a snapshot in two years’ time only to discover that I’m missing a key image or illustration.
Let’s go through that process in more detail.
Saving a single web page by hand
I start by saving the HTML file, using the “Save As” button in my web browser.
I open that file in my web browser and my text editor.
Using the browser’s developer tools, I look for external files that I need to save locally – stylesheets, fonts, images, and so on.
I download the missing files, edit the HTML in my text editor to point at the local copy, then reload the page in the browser to see the result.
I keep going until I’ve downloaded everything, and I have a self-contained, offline copy of the page.
Most of my time in the developer tools is spent in two tabs.
I look at the Network tab to see what files the page is loading.
Are they being served from my local disk, or fetched from the Internet?
I want everything to come from disk.
This HTML file is making a lot of external network requests – I have more work to do!
I check the Console tab for any errors loading the page – some image that can’t be found, or a JavaScript file that didn’t load properly.
I want to fix all these errors.
So much red!
I spend a lot of time reading and editing HTML by hand.
I’m fairly comfortable working with other people’s code, and it typically takes me a few minutes to save a page.
This is fine for the handful of new pages I save every week, but it wouldn’t scale for a larger archive.
Once I’ve downloaded everything the page needs, eliminated external requests, and fixed the errors, I have my snapshot.
Deleting all the junk
As I’m saving a page, I cut away all the stuff I don’t want.
This makes my snapshots smaller, and pages often shrank by 10–20×.
The junk I deleted includes:
Ads.
So many ads.
I found one especially aggressive plugin that inserted an advertising <div> between every single paragraph.
Banners for time-sensitive events.
News tickers, announcements, limited-time promotions, and in one case, a museum’s bank holiday opening hours.
Inline links to related content.
There are many articles where, every few paragraphs, you get a promo for a different article.
I find this quite distracting, especially as I’m already reading the site!
I deleted all those, so my saved articles are just the text.
Cookie notices, analytics, tracking, and other services for gathering “consent”.
I don’t care what tracking tools a web page was using when I saved it, and they’re a complete waste of space in my personal archive.
As I was editing the page in my text editor, I’d look for <script> and <iframe> elements.
These are good indicators of the stuff I want to remove – for example, most ads are loaded in iframes.
A lot of what I save is static content, where I don’t need the interactivity of JavaScript.
I can remove it from the page and still have a useful snapshot.
In my personal archive, I think these deletions are a clear improvement.
Snapshots load faster, they’re easier to read, and I’m not preserving megabytes of junk I’ll never use.
But I’d be a lot more cautious doing this in a public context.
Institutional web archives try to preserve web pages exactly as they were.
They want researchers to trust that they’re seeing an authentic representation of the original page, unchanged in content or meaning.
Deleting anything from the page, however well-intentioned, might undermine that trust – who decides what gets deleted?
What’s cruft to me might be a crucial clue to someone else.
Using templates for repeatedly-bookmarked sites
For big, complex websites that I bookmark often, I’ve created simple HTML templates.
When I want to save a new page, I discard the original HTML, and I just copy the text and images into the template.
It’s a lot faster than unpicking the site’s HTML every time, and I’m saving the content of the article, which is what I really care about.
Here’s an example from the New York Times.
You can tell which page is the real article, because you have to click through two dialogs and scroll past an ad before you see any text.
I was inspired by AO3 (the Archive of Our Own), a popular fanfiction site.
You can download copies of every story in multiple formats, and they believe in this so strongly that it applies to everything published on the site.
Authors don’t get to opt out.
An HTML download from AO3 looks different to the styled version you’d see browsing the web:
But the difference is only cosmetic – both files contain the full text of the story, which is what I really care about.
I don’t care about saving a visual snapshot of what AO3 looks like.
Most sites don’t offer a plain HTML download of their content, but I know enough HTML and CSS to create my own templates.
I have a dozen or so of these templates, which make it easy to create snapshots of sites I visit often – sites like Medium, Wired, and the New York Times.
Backfilling my existing bookmarks
When I decided to build a new web archive by hand, I already had partial collections from several sources – Pinboard, the Wayback Machine, and some personal scripts.
I gradually consolidated everything into my new archive, tackling a few bookmarks a day: fixing broken pages, downloading missing files, deleting ads and other junk.
I had over 2000 bookmarks, and it took about a year to migrate all of them.
Now I have a collection where I’ve checked everything by hand, and I know I have a complete set of local copies.
I wrote some Python scripts to automate common cleanup tasks, and I used regular expressions to help me clean up the mass of HTML.
This code is too scrappy and specific to be worth sharing, but I wanted to acknowledge my use of automation, albeit at a lower level than most archiving tools.
There was a lot of manual effort involved, but it wasn’t entirely by hand.
Now that I’m done, there’s only one bookmark that seems conclusively lost – a review of Rogue One on Dreamwidth, where the only capture I can find is a content warning interstitial.
I consider this a big success, but it was also a reminder of how fragmented our internet archives are.
Many of my snapshots are “franken-archives” – stitched together from multiple sources, combining files that were saved years apart.
Backing up the backups
Once I have a website saved as a folder, that folder gets backed up like all my other files.
Why not use automated tools?
I’m a big fan of automated tools for archiving the web, I think they’re an essential part of web preservation, and I’ve used many of them in the past.
Tools like ArchiveBox, Webrecorder, and the Wayback Machine have preserved enormous chunks of the web – pages that would otherwise be lost.
I paid for a Pinboard archiving account for a decade, and I search the Internet Archive at least once a week.
I’ve used command-line tools like wget, and last year I wrote my own tool to create Safari webarchives.
The size and scale of today’s web archives are only possible because of automation.
But automation isn’t a panacea, it’s a trade-off.
You’re giving up accuracy for speed and volume.
If nobody is reviewing pages as they’re archived, it’s more likely that they’ll contain mistakes or be missing essential files.
When I reviewed my old Pinboard archive, I found a lot of pages that weren’t archived properly – they had missing images, or broken styles, or relied on JavaScript from the original site.
These were web pages I really care about, and I thought I had them saved, but that was a false sense of security.
I’ve found issues like this whenever I’ve used automated tools to archive the web.
That’s why I decided to create my new archive manually – it’s much slower, but it gives me the comfort of knowing that I have a good copy of every page.
What I learnt about archiving the web
Lots of the web is built on now-defunct services
I found many pages that rely on third-party services that no longer exist, like:
Photo sharing sites – some I’d heard of (Twitpic, Yfrog), others that were new to me (phto.ch)
Link redirection services – URL shorteners and sponsored redirects
Social media sharing buttons and embeds
This means that if you load the live site, the main page loads, but key resources like images and scripts are broken or missing.
Just because the site is up, doesn’t mean it’s right
One particularly insidious form of breakage is when the page still exists, but the content has changed.
Here’s an example: a screenshot from an iTunes tutorial on LiveJournal that’s been replaced with an “18+ warning”:
This kind of failure is hard to detect automatically – the server is returning a valid response, just not the one you want.
That’s why I wanted to look at every web page with my eyes, and not rely on a computer to tell me it was saved correctly.
Many sites do a poor job of redirects
I was surprised by how many web pages still exist, but the original URLs no longer work, especially on large news sites.
Many of my old bookmarks now return a 404, but if you search for the headline, you can find the story at a completely different URL.
I find this frustrating and disappointing.
Whenever I’ve restructured this site, I always set up redirects because I’m an old-school web nerd and I think URLs are cool – but redirects aren’t just about making me feel good.
Keeping links alive makes it easier to find stuff in your back catalogue – without redirects, most people who encounter a broken link will assume the page was deleted, and won’t dig further.
Images are getting easier to serve, harder to preserve
When the web was young, images were simple.
You wrote an <img src="…"> tag in your HTML, and that was that.
Today, images are more complicated.
You can provide multiple versions of the same image, or control when images are loaded.
This can make web pages more efficient and accessible, but harder to preserve.
There are two features that stood out to me:
Lazy loading is a technique where a web page doesn’t load images or resources until they’re needed – for example, not loading an image at the bottom of an article until you scroll down.
Modern lazy loading is easy with <img loading="lazy">, but there are lots of sites that were built before that attribute was widely-supported.
They have their own code for lazy loading, and every site behaves a bit differently.
For example, a page might load a low-res image first, then swap it out for a high-res version with JavaScript.
But automated tools can’t always run that JavaScript, so they only capture the low-res image.
The <picture> tag allows pages to specify multiple versions of an image.
For example:
A page could send a high-res image to laptops, and a low-res image to phones.
This is more efficient; you’re not sending an unnecessarily large image to a small screen.
A page could send different images based on your colour scheme.
You could see a graph on a white background if you use light mode, or on black if you use dark mode.
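Here’s a sketch of what that markup looks like – the filenames and breakpoint are made up:
<picture>
  <!-- a dark-mode variant of the graph -->
  <source srcset="graph-dark.png" media="(prefers-color-scheme: dark)">
  <!-- a high-res variant for large screens -->
  <source srcset="graph-large.png" media="(min-width: 1000px)">
  <!-- the fallback image -->
  <img src="graph-small.png" alt="A graph">
</picture>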
If you preserve a page, which images should you get?
All of them?
Just one?
If so, which one?
For my personal archive, I always saved the highest resolution copy of each image, but I’m not sure that’s the best answer in every case.
On the modern web, pages may not look the same for everyone – different people can see different things.
When you’re preserving a page, you need to decide which version of it you want to save.
There’s no clearly-defined boundary of what to collect
Once you’ve saved the initial HTML page, what else do you save?
Some automated tools will aggressively follow every link on the page, and every link on those pages, and every link after that, and so on.
Others will follow simple heuristics, like “save everything linked from the first page, but no further”, or “save everything up to a fixed size limit”.
I struggled to come up with a good set of heuristics for my own approach, and I was often making decisions on a case-by-case basis.
Here are two examples:
I’ve bookmarked blog posts about conference talks, where authors embed a YouTube video of them giving the talk.
I think the video is a key part of the page, so I want to download it – but “download all embeds and links” would be a very expensive rule.
I’ve bookmarked blog posts that comment on scientific papers.
Usually the link to the original paper doesn’t go directly to the PDF, but to a landing page on a site like arXiv.
I want to save the PDF because it’s important context for the blog post, but now I’m saving something two clicks away from the original post – which would be even more expensive if applied as a universal rule.
This is another reason why I’m really glad I build my archive by hand – I could make different decisions based on the content and context.
Should you do this?
I can recommend having a personal web archive.
Just like I keep paper copies of my favourite books, I now keep local copies of my favourite web pages.
I know that I’ll always be able to read them, even if the original website goes away.
It’s harder to recommend following my exact approach.
Building my archive by hand took nearly a year, probably hundreds of hours of my free time.
I’m very glad I did it, I enjoyed doing it, and I like the result – but it’s a big time commitment, and it was only possible because I have a lot of experience building websites.
Don’t let that discourage you – a web archive doesn’t have to be fancy or extreme.
If you take a few screenshots, save some PDFs, or download HTML copies of your favourite fic, that’s a useful backup.
You’ll have something to look at if the original web page goes away.
If you want to scale up your archive, look at automated tools.
For most people, they’re a better balance of cost and reward than saving and editing HTML by hand.
But you don’t need to, and even a folder with just a few files is better than nothing.
When I was building my archive – and reading all those web pages – I learnt a lot about how the web is built.
In part 3 of this series, I’ll share what that process taught me about making websites.
I’m storing more and more of my data as static websites, and about a year ago I switched to using a local, static site to manage my bookmarks.
It’s replaced Pinboard as the way I track interesting links and cool web pages.
I save articles I’ve read, talks I’ve watched, fanfiction I’ve enjoyed.
It’s taken hundreds of hours to get all my bookmarks and saved web pages into this new site, and it’s taught me a lot about archiving and building the web.
This post is the first of a four-part series about my bookmarks, and I’ll publish the rest of the series over the next three weeks.
Bookmarking mini-series
Creating a static site for all my bookmarks (this article)
Keeping my own list of bookmarks means that I can always find old links.
If I have a vague memory of a web page, I’m more likely to find it in my bookmarks than in the vast ocean of the web.
Links rot, websites break, and a lot of what I read isn’t indexed by Google.
This is particularly true for fannish creations.
A lot of fan artists deliberately publish in closed communities, so their work is only visible to like-minded fans, and not something that a casual Internet user might stumble upon.
If I’m looking for an exact story I read five years ago, I’m far more likely to find it in my bookmarks than by scouring the Internet.
Finding old web pages has always been hard, and the rise of generative AI has made it even harder.
Search results are full of AI slop, and people are trying to hide their content from scrapers.
Locking the web behind paywalls and registration screens might protect it from scrapers, but it also makes it harder to find.
I bookmark to remember why I liked a link
I write notes on each bookmark I keep, so I remember why a particular link was fun or interesting.
When I save articles, I write a short summary of the key points, information, arguments.
I can review a one paragraph summary much faster than I can reread the entire page.
When I save fanfiction, I write notes on the plot or key moments.
Is this the story where they live happily ever after?
Or does this have a gut-wrenching ending that needs a box of tissues?
These summaries could be farmed out to generative AI, but I much prefer writing them myself.
I can phrase things in my own words, write down connections to other ideas, and write a more personal summary than I get from a machine.
And when I read those summaries back later, I remember writing them, and it revives my memories of the original article or story.
It’s slower, but I find it much more useful.
Why use a static website?
I was a happy Pinboard customer for over a decade, but the site feels abandoned.
I’ve not had catastrophic errors, but I kept running into broken features or rough edges – search timeouts, unreliable archiving, unexpected outages.
There does seem to be some renewed activity in the last few months, but it was too late – I’d already moved away.
I briefly considered Pinboard alternatives like Raindrop, Pocket and Instapaper – but I’d still be trusting an online service.
I’ve been burned too many times by services being abandoned, and so I’ve gradually been moving the data I care about to static websites on my local machine.
It takes a bit of work to set up, but then I have more control and I’m not worried about it going away.
My needs are pretty simple – I just want a list of links, some basic filtering and search, and a saved copy of all my links.
I don’t need social features or integrations, which made it easier to walk away from Pinboard.
I’ve been using static sites for all sorts of data, and I’m enjoying the flexibility and portability of vanilla HTML and JavaScript.
They need minimal maintenance and there’s no lock-in, and I’ve made enough of them now that I can create new ones pretty quickly.
What does it look like?
The main page is a scrolling list of all my bookmarks.
This is a single page that shows every link, because my collection is small enough that I don’t need pagination.
Each bookmark has a title, a URL, my hand-written summary, and a list of tags.
If I’m looking for a specific bookmark, I can filter by tag or by author, or I can use my browser’s in-page search feature.
I can sort by title, date saved, or “random” – this is a fun way to find links I might have forgotten about.
Let’s look at a single bookmark:
The main title is blue and underlined, and the URL of the original page is shown below it.
Call me old-fashioned, but I still care about URL design, I think it’s cool to underline links, and I have nostalgia for #0000ff blue.
I want to see those URLs.
If I click the URL in grey, I go to the page live on the web.
But if I click the title, I go to a local snapshot of the page.
Because these are the links I care about most, and links can rot, I’ve saved a local copy of every page as a mini-website – an HTML file and supporting assets.
These local snapshots will always be present, and they work offline – that’s why I link to them from the title.
This is something I can only do because this is a personal tool.
If a commercial bookmarking website tried to direct users to their self-hosted snapshots rather than the original site, they’d be accused of stealing traffic.
Creating these archival copies took months, and I’ll discuss it more in the rest of this series.
How does it work?
This is a static website built using the pattern I described in How I create static websites for tiny archives.
I have my metadata in a JavaScript file, and a small viewer that renders the metadata as an HTML page I can view in my browser.
I’m not sharing the code because it’s deeply personalised and tied to my specific bookmarks, but if you’re interested, that tutorial is a good starting point.
Here’s an example of what a bookmark looks like in the metadata file:
"https://notebook.drmaciver.com/posts/2020-02-22-11:37.html": {
"title": "You should try bad things",
"authors": [
"David R. MacIver"
],
"description": "If you only do things you know are good, eventually some of them will fall out of favour and you'll have an ever-shrinking pool of things.\r\n\r\nSome of the things you think will be bad will end up being good \u2013 trying them helps expand your pool.",
"date_saved": "2024-12-03T07:29:10Z",
"tags": [
"self-improvement"
],
"archive": {
"path": "archive/n/notebook.drmaciver.com/you-should-try-bad-things.html",
"asset_paths": [
"archive/n/notebook.drmaciver.com/static/drmnotes.css",
"archive/n/notebook.drmaciver.com/static/latex.css",
"archive/n/notebook.drmaciver.com/static/pandoc.css",
"archive/n/notebook.drmaciver.com/static/pygments.css",
"archive/n/notebook.drmaciver.com/static/tufte.css"
],
"saved_at": "2024-12-03T07:30:21Z"
},
"publication_year": "2020",
"type": "article"
}
I started with the data model from the Pinboard API, which in turn came from an older bookmarking site called Delicious.
Over time, I’ve added my own fields.
Previously I was putting everything in the title and the tags, but now I can have dedicated fields for things like authors, word count, and fannish tropes.
The archive object is new – that’s my local snapshot of a page.
The path points to the main HTML file for the snapshot, and then asset_paths is a list of any other files that get used in the webpage (CSS files, images, fonts and so on).
I have a Python script that checks that every archived file has been saved properly.
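A minimal sketch of that check might look something like this – how the metadata gets loaded is hand-waved, but the path and asset_paths fields match the example above:

import os


def find_missing_archive_files(bookmarks: dict) -> list[str]:
    """
    Return the archive files that are referenced in the bookmark
    metadata but don't exist on disk.
    """
    missing = []

    for url, bookmark in bookmarks.items():
        archive = bookmark.get("archive")
        if archive is None:
            continue

        # Check the main HTML file and every supporting asset.
        for path in [archive["path"], *archive["asset_paths"]]:
            if not os.path.exists(path):
                missing.append(f"{url}: {path}")

    return missing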
This is another advantage of writing my own bookmarking tool – I know exactly what data I want to store, and I can design the schema to fit.
When I want to add a bookmark or change an existing one, I open the JSON in my text editor and edit it by hand.
I have a script that checks the file is properly formatted and the archive paths are all correct, and I track changes in Git.
Consider a JSON object like {"sides": 4, "colour": "red", "sides": 5, "colour": "blue"} – notice that sides and colour both appear twice.
This looks invalid, but I learnt recently that this is actually legal JSON syntax!
It’s unusual and discouraged, but it’s not completely forbidden.
This was a big surprise to me.
I think of JSON objects as key/value pairs, and I associate them with data structures like a dict in Python or a Hash in Ruby – both of which only allow unique keys.
JSON has no such restriction, and I started thinking about how to handle it.
What does the JSON spec say about duplicate names?
JSON is described by several standards, which Wikipedia helpfully explains for us:
After RFC 4627 had been available as its “informational” specification since 2006, JSON was first standardized in 2013, as ECMA‑404.
RFC 8259, published in 2017, is the current version of the Internet Standard STD 90, and it remains consistent with ECMA‑404.
The ECMA and ISO/IEC standards describe only the allowed syntax, whereas the RFC covers some security and interoperability considerations.
All three of these standards explicitly allow the use of duplicate names in objects.
ECMA‑404 and ISO/IEC 21778:2017 have identical text to describe the syntax of JSON objects, and they say (emphasis mine):
An object structure is represented as a pair of curly bracket tokens surrounding zero or more name/value pairs.
[…]
The JSON syntax does not impose any restrictions on the strings used as names, does not require that name strings be unique, and does not assign any significance to the ordering of name/value pairs.
These are all semantic considerations that may be defined by JSON processors or in specifications defining specific uses of JSON for data interchange.
RFC 8259 goes further and strongly recommends against duplicate names, but the use of SHOULD means it isn’t completely forbidden:
The names within an object SHOULD be unique.
The same document warns about the consequences of ignoring this recommendation:
An object whose names are all unique is interoperable in the sense that all software implementations receiving that object will agree on the name-value mappings.
When the names within an object are not unique, the behavior of software that receives such an object is unpredictable.
Many implementations report the last name/value pair only.
Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates.
So it’s technically valid, but it’s unusual and discouraged.
I’ve never heard of a use case for JSON objects with duplicate names.
I’m sure there was a good reason for it being allowed by the spec, but I can’t think of it.
Most JSON parsers – including jq, JavaScript, and Python – will silently discard all but the last instance of a duplicate name.
Here’s an example in Python:
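>>> import json
>>> json.loads('{"sides": 4, "colour": "red", "sides": 5}')
{'sides': 5, 'colour': 'red'}
The first sides value has been silently discarded, and only the last one survives.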
What if I wanted to decode the whole object, or throw an exception if I see duplicate names?
This happened to me recently.
I was editing a JSON file by hand, and I’d copy/paste objects to update the data.
I also had scripts which could update the file.
I forgot to update the name on one of the JSON objects, so there were two name/value pairs with the same name.
When I ran the script, it silently erased the first value.
I was able to recover the deleted value from the Git history, but I wondered how I could prevent this happening again.
How could I make the script fail, rather than silently delete data?
Decoding duplicate names in Python
When Python decodes a JSON object, it first parses the object as a list of name/value pairs, then it turns that list of name/value pairs into a dictionary.
We can see this by looking at the JSONObject function in the CPython source code: it builds a list called pairs, and at the end of the function, it calls dict(pairs) to turn that list into a dictionary.
This relies on the fact that dict() can take an iterable of key/value tuples and create a dictionary:
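>>> dict([('sides', 4), ('colour', 'red'), ('sides', 5)])
{'sides': 5, 'colour': 'red'}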
The docs for dict() tell us that it will discard duplicate keys: “if a key occurs more than once, the last value for that key becomes the corresponding value in the new dictionary”.
We can customise what Python does with the list of name/value pairs.
Rather than calling dict(), we can pass our own function to the object_pairs_hook parameter of json.loads(), and Python will call that function on the list of pairs.
This allows us to parse objects in a different way.
For example, we can just return the literal list of name/value pairs:
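>>> json.loads(
...     '{"sides": 4, "colour": "red", "sides": 5}',
...     object_pairs_hook=lambda pairs: pairs
... )
[('sides', 4), ('colour', 'red'), ('sides', 5)]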
We could also use the multidict library to get a dict-like data structure which supports multiple values per key.
This is based on HTTP headers and URL query strings, two environments where it’s common to have multiple values for a single key:
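>>> from multidict import MultiDict
>>> md = json.loads(
...     '{"sides": 4, "colour": "red", "sides": 5}',
...     object_pairs_hook=MultiDict
... )
>>> md.getall('sides')
[4, 5]
The getall() method returns every value stored under a name, so nothing is lost.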
If we want to throw an exception when we see duplicate names, we need a longer function.
Here’s the code I wrote:
import collections
import typing


def dict_with_unique_names(
    pairs: list[tuple[str, typing.Any]]
) -> dict[str, typing.Any]:
    """
    Convert a list of name/value pairs to a dict, but only if the
    names are unique.

    If there are non-unique names, this function throws a ValueError.
    """
    # First try to parse the object as a dictionary; if it's the same
    # length as the pairs, then we know all the names were unique and
    # we can return immediately.
    pairs_as_dict = dict(pairs)

    if len(pairs_as_dict) == len(pairs):
        return pairs_as_dict

    # Otherwise, let's work out what the repeated name(s) were, so we
    # can throw an appropriate error message for the user.
    name_tally = collections.Counter(n for n, _ in pairs)

    repeated_names = [n for n, count in name_tally.items() if count > 1]
    assert len(repeated_names) > 0

    if len(repeated_names) == 1:
        raise ValueError(
            f"Found repeated name in JSON object: {repeated_names[0]}"
        )
    else:
        raise ValueError(
            f"Found repeated names in JSON object: {', '.join(repeated_names)}"
        )
If I use this as my object_pairs_hook when parsing an object which has all unique names, it returns the normal dict I’d expect:
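>>> json.loads(
...     '{"sides": 4, "colour": "red"}',
...     object_pairs_hook=dict_with_unique_names
... )
{'sides': 4, 'colour': 'red'}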
But if I’m parsing an object with one or more repeated names, the parsing fails and throws a ValueError:
>>> json.loads(
...     '{"sides": 4, "colour": "red", "sides": 5}',
...     object_pairs_hook=dict_with_unique_names
... )
Traceback (most recent call last):
[…]
ValueError: Found repeated name in JSON object: sides

>>> json.loads(
...     '{"sides": 4, "colour": "red", "sides": 5, "colour": "blue"}',
...     object_pairs_hook=dict_with_unique_names
... )
Traceback (most recent call last):
[…]
ValueError: Found repeated names in JSON object: sides, colour
This is precisely the behaviour I want – throwing an exception, not silently dropping data.
Encoding non-unique names in Python
It’s hard to think of a use case, but this post feels incomplete without at least a brief mention.
If you want to encode custom data structures with Python’s JSON library, you can subclass JSONEncoder and define how those structures should be serialised.
Here’s a rudimentary attempt at doing that for a MultiDict:
class MultiDictEncoder(json.JSONEncoder):

    def encode(self, o: typing.Any) -> str:
        # If this is a MultiDict, we need to construct the JSON string
        # manually -- first encode each name/value pair, then construct
        # the JSON object literal.
        if isinstance(o, MultiDict):
            name_value_pairs = [
                f'{super().encode(str(name))}: {self.encode(value)}'
                for name, value in o.items()
            ]
            return '{' + ', '.join(name_value_pairs) + '}'

        return super().encode(o)
This is rough code, and you shouldn’t use it – it’s only an example.
I’m constructing the JSON string manually, so it doesn’t handle edge cases like indentation or special characters.
There are almost certainly bugs, and you’d need to be more careful if you wanted to use this for real.
In practice, if I had to encode a multi-dict as JSON, I’d encode it as a list of objects which each have a key and a value field.
For example:
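>>> md = MultiDict([('sides', 4), ('colour', 'red'), ('sides', 5)])
>>> json.dumps([{'key': name, 'value': value} for name, value in md.items()])
'[{"key": "sides", "value": 4}, {"key": "colour", "value": "red"}, {"key": "sides", "value": 5}]'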
This is a pretty standard pattern, and it won’t trip up JSON parsers which aren’t expecting duplicate names.
Do you need to worry about this?
This isn’t a big deal.
JSON objects with duplicate names are pretty unusual – this is the first time I’ve ever encountered one, and it was a mistake.
Trying to account for this edge case in every project that uses JSON would be overkill.
It would add complexity to my code and probably never catch a single error.
This started when I made a copy/paste error that introduced the initial duplication, and then a script modified the JSON file and caused some data loss.
That’s a somewhat unusual workflow – most JSON files are only ever modified by computers, so this sort of mistake wouldn’t occur.
Although I won’t use this code exactly, it’s been good practice at writing custom encoders/decoders in Python.
That is something I do all the time – I’m often encoding native Python types as JSON, and I want to get the same type back when I decode later.
I’ve been writing my own subclasses of JSONEncoder and JSONDecoder for a while.
Now I know a bit more about how Python decodes JSON, and object_pairs_hook is another tool I can consider using.
This was a fun deep dive for me, and I hope you found it helpful too.