
Resizing images in Rust, now with EXIF orientation support

2025-09-09 06:42:48

Resizing an image is one of those programming tasks that seems simple, but has some rough edges. One common mistake is forgetting to handle the EXIF orientation, which can make resized images look very different from the original.

Last year I wrote a tool called create_thumbnail to resize images, and today I released a small update. Now it’s aware of EXIF orientation, and it no longer mangles these images. This is possible thanks to a new version of the Rust image crate, which just improved its EXIF support.

What’s EXIF orientation?

Images can specify an orientation in their EXIF metadata, which can describe both rotation and reflection. This metadata is usually added by cameras and phones, which can detect how you’re holding them, and tell viewing software how to display the picture later.

For example, if you take a photo while holding your camera on its side, the camera can record that the image should be rotated 90 degrees when viewed. If you use a front-facing selfie camera, the camera could note that the picture needs to be mirrored.

There are eight different values for EXIF orientation – rotating in increments of 90°, and mirrored or not. The default value is “1” (display as-is), and here are the other seven values:

A diagram showing the eight different orientations of the word ‘FLY’: four rotations, four rotations with a mirror reflection.

You can see the EXIF orientation value with programs like Phil Harvey’s exiftool, which helpfully converts the numeric value into a human-readable description:

$ # exiftool's default output is human-readable
$ exiftool -orientation my_picture.jpg
Orientation                     : Rotate 270 CW

$ # or we can get the raw numeric value
$ exiftool -n -orientation my_picture.jpg
Orientation                     : 8
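
You can also read the tag programmatically. Here’s a minimal Python sketch using Pillow – not part of the original post, so treat the library choice and file name as assumptions:

from PIL import Image

with Image.open("my_picture.jpg") as im:
    # Tag 274 (0x0112) is Orientation; it's absent if the image
    # doesn't specify one, so fall back to the default value 1
    # ("display as-is").
    orientation = im.getexif().get(274, 1)

print(orientation)  # e.g. 8, which exiftool describes as "Rotate 270 CW"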

Resizing images in Rust

I use the image crate to resize images in Rust.

My old code for resizing images would open the image, resize it, then save it back to disk. Here’s a short example:

use image::imageops::FilterType;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Old method: doesn't know about EXIF orientation
    let img = image::open("original.jpg")?;
    img.resize(180, 120, FilterType::Lanczos3)
        .save("thumbnail.jpg")?;

    Ok(())
}

The thumbnail will keep the resized pixels in the same order as the original image, but the thumbnail doesn’t have the EXIF orientation metadata. This means that if the original image had an EXIF orientation, the thumbnail could look different, because it’s no longer being rotated/reflected properly.

When I wrote create_thumbnail, the image crate didn’t know anything about EXIF orientation – but last week’s v0.25.8 release added several functions related to EXIF orientation. In particular, I can now read the orientation and apply it to an image:

use image::imageops::FilterType;
use image::{DynamicImage, ImageDecoder, ImageReader};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // New methods in image v0.25.8 know about EXIF orientation,
    // and allow us to apply it to the image before resizing.
    let mut decoder = ImageReader::open("original.jpg")?.into_decoder()?;
    let orientation = decoder.orientation()?;
    let mut img = DynamicImage::from_decoder(decoder)?;
    img.apply_orientation(orientation);

    img.resize(180, 120, FilterType::Lanczos3)
        .save("thumbnail.jpg")?;

    Ok(())
}

The thumbnail still doesn’t have any EXIF orientation data, but the pixels have been rearranged so the resized image looks similar to the original. That’s what I want.

Here’s a visual comparison of the three images. Notice how the thumbnail from the old code looks upside down:

Side by side: the original image, the thumbnail from the old code, and the thumbnail from the new code.

This test image comes from Dave Perrett’s exif-orientation-examples repo, which has a collection of images that were very helpful for testing this code.

Is this important?

This is a small change, but it solves an annoyance I’ve hit in every project that deals with images. I’ve written this fix many times, but images with an EXIF orientation are rare enough that I always forgot about them when starting a new project – so I kept solving the same problem again and again.

By handling EXIF orientation in create_thumbnail, I won’t have to think about this again. That’s the beauty of a shared tool – I fix it once, and then it’s fixed for all my current and future projects.


Using vcrpy to test HTTP interactions in Python

2025-08-29 05:36:37

Testing code that makes HTTP requests can be difficult. Real requests are slow, flaky, and hard to control. That’s why I use a Python library called vcrpy, which does a one-off recording of real HTTP interactions, then replays them during future tests.

These recordings are saved to a “cassette” – a plaintext file that I keep alongside my tests and my code. The cassette ensures that all my tests get consistent HTTP responses, which makes them faster and more reliable, especially in CI. I only have to make one real network request, and then I can run my tests locally and offline.

In this post, I’ll show you how I use vcrpy in a production codebase – not just the basics, but also the patterns, pitfalls, and fixtures that make it work for a real team.


Three black video cassette tapes stacked on a wooden table. Photo credit: Anthony on Pexels.

Why not make real HTTP requests in tests?

There are several reasons why I avoid real HTTP requests in my tests:

It makes my tests slower

I want my tests to be fast, because then I’ll run them more often and catch mistakes sooner. An individual HTTP call might be quick, but stack up hundreds of them and tests really start to drag.

It makes my tests less reliable

Even if my code is correct, my tests could fail because of problems on the remote server. What if I’m offline? What if the server is having a temporary outage? What if the server starts rate limiting me for making too many HTTP requests?

It makes my tests more brittle

If my tests depend on the server having certain state, then the server state could change and break or degrade my test suite.

Sometimes this change is obvious. For example, suppose I’m testing a function to fetch photos from Flickr, and then the photo I’m using in my test gets deleted. My code works correctly for photos that still exist, but now my test starts failing.

Sometimes this change is more subtle. Suppose I’ve written a regression test for an edge case, and then the server state changes, so the example I’m checking is no longer an instance of the edge case. I could break the code and never realise, because the test would keep passing. My test suite would become less effective.

It means passing around more secrets

A lot of my HTTP calls require secrets, like API keys or OAuth tokens. If the tests made real HTTP calls, I’d need to copy those secrets to every environment where I’m running the tests. That increases the risk of the secret getting leaked.

It makes my tests harder to debug

If there are more reasons why a test could fail, then it takes longer to work out if the failure was caused by my mistake, or a change on the server.

Recording and replaying HTTP requests solves these problems

If my test suite is returning consistent responses for HTTP calls, and those responses are defined within the test suite itself, then my tests get faster and more reliable. I’m not making real network calls, I’m not dependent on the behaviour of a server, and I don’t need real secrets to run the tests.

There are a variety of ways to define this sort of test mock; I like to record real responses because it ensures I’m getting a high-fidelity mock, and it makes it fairly easy to add new tests.


Why do you like vcrpy?

I know two Python libraries that record real HTTP responses: vcrpy and betamax, both based on a Ruby library called vcr. I’ve used all three; they behave in a similar way, and they all work well.

I prefer vcrpy for Python because it supports a wide variety of HTTP libraries, whereas betamax only works with requests. I currently use a mixture of httpx and urllib3, and it’s convenient to test them both with the same library and test helpers.

I also like that vcrpy works without needing any changes to the code I’m testing. I can write HTTP code as I normally would, then I add a vcrpy decorator in my test and the responses get recorded. I don’t like test frameworks that require me to rewrite my code to fit – the tests should follow the code, not the other way round.


A basic example of using vcrpy

Here’s a test that uses vcrpy to fetch www.example.com, and look for some text in the response. I use vcr.use_cassette as a context manager around the code that makes an HTTP request:

import httpx
import vcr


def test_example_domain():
    with vcr.use_cassette("fixtures/vcr_cassettes/test_example_domain.yml"):
        resp = httpx.get("https://www.example.com/")
        assert "<h1>Example Domain</h1>" in resp.text

Alternatively, you can use vcr.use_cassette as a decorator:

@vcr.use_cassette("fixtures/vcr_cassettes/test_example_domain.yml")
def test_example_domain():
    resp = httpx.get("https://www.example.com/")
    assert "<h1>Example Domain</h1>" in resp.text

With the decorator, you can also omit the path to the cassette file, and vcrpy will name the cassette file after the function:

@vcr.use_cassette()
def test_example_domain():
    resp = httpx.get("https://www.example.com/")
    assert "<h1>Example Domain</h1>" in resp.text

When I run this test using pytest (python3 -m pytest test_example.py), vcrpy will check if the cassette file exists. If the file is missing, it makes a real HTTP call and saves it to the file. If the file exists, it replays the previously-recorded HTTP call.

By default, the cassette is a YAML file. Here’s what it looks like: test_example_domain.yml.

If a test makes more than one HTTP request, vcrpy records all of them in the same cassette file.
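
For example, this sketch (with illustrative URLs) records both interactions, in order, into a single cassette, and replays them in the same order on later runs:

import httpx
import vcr


@vcr.use_cassette("fixtures/vcr_cassettes/test_two_requests.yml")
def test_two_requests():
    # Both interactions end up in test_two_requests.yml.
    resp_com = httpx.get("https://www.example.com/")
    resp_org = httpx.get("https://www.example.org/")
    assert resp_com.status_code == 200
    assert resp_org.status_code == 200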


Using vcrpy in production

Keeping secrets out of my cassettes

The cassette files contain the complete HTTP request and response, which includes the URL, form data, and HTTP headers. If I’m testing an API that requires auth, the HTTP request could include secrets like an API key or OAuth token. I don’t want to save those secrets in the cassette file!

Fortunately, vcrpy can filter sensitive data before it’s saved to the cassette file – HTTP headers, URL query parameters, or form data.

Here’s an example where I’m using filter_query_parameters to redact an API key. I’m replacing the real value with the placeholder REDACTED_API_KEY.

import os

import httpx
import vcr


def test_flickr_api():
    with vcr.use_cassette(
        "fixtures/vcr_cassettes/test_flickr_api.yml",
        filter_query_parameters=[("api_key", "REDACTED_API_KEY")],
    ):
        api_key = os.environ.get("FLICKR_API_KEY", "API_KEY")

        resp = httpx.get(
            "https://api.flickr.com/services/rest/",
            params={
                "api_key": api_key,
                "method": "flickr.urls.lookupUser",
                "url": "https://www.flickr.com/photos/alexwlchan/",
            },
        )

        assert '<user id="199258389@N04">' in resp.text

When I run this test for the first time, I need to pass an env var FLICKR_API_KEY. This makes a real request and records a cassette, with the placeholder in place of my real key. When I run the test again, I don’t need to pass the env var, and the test still passes.

You can see the complete YAML file in test_flickr_api.yml. Notice how the api_key query parameter has been redacted in the recorded request:

interactions:
- request:
    
    uri: https://api.flickr.com/services/rest/?api_key=REDACTED_API_KEY&method=flickr.urls.lookupUser&url=https%3A%2F%2Fwww.flickr.com%2Fphotos%2Falexwlchan%2F
    

You can also tell vcrpy to omit the sensitive field entirely, but I like to insert a placeholder value. It’s useful for debugging later – you can see that a value was replaced, and easily search for the code that’s doing the redaction.
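
Here’s a sketch showing both styles with filter_headers – the header names and placeholder text are illustrative. A bare name omits the header from the cassette entirely, while a (name, value) tuple records a placeholder instead:

import vcr

with vcr.use_cassette(
    "fixtures/vcr_cassettes/test_auth_headers.yml",
    # "cookie" is dropped from the cassette entirely; the
    # Authorization header is recorded with a placeholder value.
    filter_headers=["cookie", ("authorization", "REDACTED_OAUTH_TOKEN")],
):
    ...  # make authenticated HTTP requests here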

Improving the human readability of cassettes

If you look at the first two cassette files, you’ll notice that the response body is stored as base64-encoded binary data:

response:
  body:
    string: !!binary |
      H4sIAAAAAAAAAH1UTXPbIBC9+1ds1UsyIyQnaRqPLWn6mWkPaQ9pDz0SsbKYCFAByfZ08t+7Qo4j
      N5makYFdeLvvsZC9Eqb0uxah9qopZtljh1wUM6Bf5qVvsPi85aptED4ZxaXO0tE6G5co9BzKmluH
      Po86X7FFBGkxcdbetwx/d7LPo49Ge9SeDWEjKMdZHnnc+nQIvzpAvYSkucI86iVuWmP9ZP9GCl/n

That’s because example.com and api.flickr.com both gzip compress their responses, and vcrpy is preserving that compression. But gzip compression is handled by the HTTP libraries – my code never needs to worry about compression; it just gets the uncompressed response.

Where possible, I prefer to store responses in their uncompressed form. It makes the cassettes easier to read, and you can see if secrets are included in the saved response data. I also find it useful to read cassettes as an example of what an API response looks like – and in particular, what it looked like when I wrote the test. Cassettes have helped me spot undocumented changes in APIs.

Here’s an example where I’m using decode_compressed_response=True to remove the gzip compression in the cassette:

def test_example_domain_with_decode():
    with vcr.use_cassette(
        "fixtures/vcr_cassettes/test_example_domain_with_decode.yml",
        decode_compressed_response=True,
    ):
        resp = httpx.get("https://www.example.com/")
        assert "<h1>Example Domain</h1>" in resp.text

You can see the complete cassette file in test_example_domain_with_decode.yml. Notice the response body now contains an HTML string:

response:
  body:
    string: "<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n
      \   <meta charset=\"utf-8\" />\n    <meta http-equiv=\"Content-type\" content=\"text/html;
      charset=utf-8\" />\n    <meta name=\"viewport\" content=\"width=device-width,

Naming my cassettes to make sense later

If you write a lot of tests that use vcrpy, you’ll end up with a fixtures directory that’s full of cassettes. I like cassette names to match my test functions, so they’re easy to match up later.

I could specify a cassette name explicitly in every test, but that’s extra work and prone to error. Alternatively, I could use the decorator and use the automatic cassette name – but vcrpy uses the name of the test function, which may not distinguish between tests. In particular, I often group tests into classes, or use parametrized tests to run the same test with different values.

Consider the following example:

import httpx
import pytest
import vcr


class TestExampleDotCom:
    @vcr.use_cassette()
    def test_status_code(self):
        resp = httpx.get("https://example.com")
        assert resp.status_code == 200


@vcr.use_cassette()
@pytest.mark.parametrize(
    "url, status_code",
    [
        ("https://httpbin.org/status/200", 200),
        ("https://httpbin.org/status/404", 404),
        ("https://httpbin.org/status/500", 500),
    ],
)
def test_status_code(url, status_code):
    resp = httpx.get(url)
    assert resp.status_code == status_code

This is four different tests, but vcrpy’s automatic cassette name is the same for each of them: test_status_code. The tests will fail if you try to run them – vcrpy will record a cassette for the first test that runs, then try to replay that cassette for the second test. The second test makes a different HTTP request, so vcrpy will throw an error because it can’t find a matching request.

Here’s what I do instead: I have a pytest fixture to choose cassette names, which includes the name of the test class (if any) and the ID of the parametrized test case. Because I sometimes use URLs in parametrized tests, I also check the test case ID doesn’t include slashes or colons – I don’t want those in my filenames!

Here’s the fixture:

@pytest.fixture
def cassette_name(request: pytest.FixtureRequest) -> str:
    """
    Returns the filename of a VCR cassette to use in tests.

    The name can be made up of (up to) three parts:

    -   the name of the test class
    -   the name of the test function
    -   the ID of the test case in @pytest.mark.parametrize

    """
    name = request.node.name

    # This is to catch cases where e.g. I try to include a complete
    # HTTP URL in a cassette name, which creates messy folders in
    # the fixtures directory.
    if ":" in name or "/" in name:
        raise ValueError(
            "Illegal characters in VCR cassette name - "
            "please set a test ID with pytest.param(…, id='')"
        )

    if request.cls is not None:
        return f"{request.cls.__name__}.{name}.yml"
    else:
        return f"{name}.yml"

Here’s my test rewritten to use that new fixture:

class TestExampleDotCom:
    def test_status_code(self, cassette_name):
        with vcr.use_cassette(cassette_name):
            resp = httpx.get("https://example.com")
            assert resp.status_code == 200


@pytest.mark.parametrize(
    "url, status_code",
    [
        pytest.param("https://httpbin.org/status/200", 200, id="ok"),
        pytest.param("https://httpbin.org/status/404", 404, id="not_found"),
        pytest.param("https://httpbin.org/status/500", 500, id="server_error"),
    ],
)
def test_status_code(url, status_code, cassette_name):
    with vcr.use_cassette(cassette_name):
        resp = httpx.get(url)
        assert resp.status_code == status_code

The four tests now get distinct cassette filenames:

  • TestExampleDotCom.test_status_code.yml
  • test_status_code[ok].yml
  • test_status_code[not_found].yml
  • test_status_code[server_error].yml

Explaining how to use cassettes with helpful errors

Most of the time, you don’t need to worry about how vcrpy works. If you’re running an existing test, then vcrpy is just a fancy test mock that happens to be reading its data from a YAML file. You don’t need to worry about the implementation details.

However, if you’re writing a new test, you need to record new cassettes. This can involve some non-obvious setup, especially if you’ve never done it before.

Let’s revisit an earlier example:

def test_flickr_api():
    with vcr.use_cassette(
        "fixtures/vcr_cassettes/test_flickr_api.yml",
        filter_query_parameters=[("api_key", "REDACTED_API_KEY")],
    ):
        api_key = os.environ.get("FLICKR_API_KEY", "API_KEY")

        resp = httpx.get(
            "https://api.flickr.com/services/rest/",
            params={
                "api_key": api_key,
                "method": "flickr.urls.lookupUser",
                "url": "https://www.flickr.com/photos/alexwlchan/",
            },
        )

        assert '<user id="199258389@N04">' in resp.text

If you run this test without passing a FLICKR_API_KEY environment variable, it will call the real Flickr API with the placeholder API key. Unsurprisingly, the Flickr API will return an error response, and your test will fail:

<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="fail">
  <err code="100" msg="Invalid API Key (Key has invalid format)" />
</rsp>

Worse still, vcrpy will record this error in the cassette file. Even if you work out you need to re-run the test with the env var, it will keep failing as it replays the recorded error.

Can we make this better? In this scenario, what I’d prefer is:

  1. The test fails if you don’t pass an env var
  2. The error explains how to run the test properly
  3. vcrpy doesn’t save a cassette file

I worked out how to get this nicer error handling. vcrpy has a before_record_response hook, which allows you to modify a response before it’s written to the cassette file. You could use this to redact secrets from responses, but I realised you could also use it to validate the response – and if you throw an exception, it prevents vcrpy from writing a cassette.

Here’s a hook I wrote, which checks if a vcrpy response is a Flickr API error telling us that we passed an invalid API key, and throws an exception if so:

def check_for_invalid_api_key(response):
    """
    Before we record a new response to a cassette, check if it's
    a Flickr API response telling us we're missing an API key -- that
    means we didn't set up the test correctly.

    If so, give the developer an instruction explaining what to do next.
    """
    try:
        body: bytes = response["body"]["string"]
    except KeyError:
        body = response["content"]

    is_error_response = (
        b'<err code="100" msg="Invalid API Key (Key has invalid format)" />' in body
    )

    if is_error_response:
        raise RuntimeError(
            "You tried to record a new call to the Flickr API, \n"
            "but the tests don't have an API key.\n"
            "\n"
            "Pass an API key as an env var FLICKR_API_KEY=ae84…,\n"
            "and re-run the test.\n"
        )

    return response

We can call this hook in our vcr.use_cassette call:

def test_flickr_api(cassette_name):
    with vcr.use_cassette(
        cassette_name,
        filter_query_parameters=[("api_key", "REDACTED_API_KEY")],
        decode_compressed_response=True,
        before_record_response=check_for_invalid_api_key,
    ):
        ...

Now, if you try to record a Flickr API call and don’t set the API key, you’ll get a helpful error explaining how to re-run the test correctly.

Wrapping everything in a fixture for convenience

This is all useful, but it’s a lot of boilerplate to add to every test. To make everything cleaner, I wrap vcrpy in a pytest fixture that returns an HTTP client I can use in my tests. This fixture allows me to configure vcrpy, and also do any other setup I need on the HTTP client – for example, adding authentication params or HTTP headers.

Here’s an example of such a fixture in a library for using the Flickr API:

@pytest.fixture
def flickr_api(cassette_name):
    with vcr.use_cassette(
        cassette_name,
        filter_query_parameters=[("api_key", "REDACTED_API_KEY")],
        decode_compressed_response=True,
        before_record_response=check_for_invalid_api_key,
    ):
        client = httpx.Client(
            params={"api_key": os.environ.get("FLICKR_API_KEY", "API_KEY")},
            headers={
                # Close the connection as soon as the API returns a
                # response, to fix pytest warnings about unclosed sockets.
                "Connection": "Close",
            },
        )

        yield client

This makes individual tests much shorter and simpler:

def test_flickr_api_without_boilerplate(flickr_api):
    resp = flickr_api.get(
        "https://api.flickr.com/services/rest/",
        params={
            "method": "flickr.urls.lookupUser",
            "url": "https://www.flickr.com/photos/alexwlchan/",
        },
    )

    assert '<user id="199258389@N04">' in resp.text

When somebody reads this test, they don’t need to think about the authentication or the mocking; they can just see the API call that we’re making.


When I don’t use vcrpy

Although vcrpy is useful, there are times when I prefer to test my HTTP code in a different way. Here are a few examples.

If I’m testing error handling

If I’m testing my error handling code – errors like timeouts, connection failures, or 5xx errors – it’s difficult to record a real response. Even if I could find a reliable error case today, it might be fixed tomorrow, which makes it difficult to reproduce if I ever need to re-record a cassette.

When I test error handling, I prefer a pure-Python mock where I can see exactly what error conditions I’m creating.
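
For example, httpx ships with a MockTransport that lets me simulate failures without touching the network. This is a sketch, not code from the original post – the URL is illustrative:

import httpx
import pytest


def test_handles_connect_timeout():
    def handler(request: httpx.Request) -> httpx.Response:
        # Simulate the network failing before any response arrives.
        raise httpx.ConnectTimeout("simulated timeout", request=request)

    client = httpx.Client(transport=httpx.MockTransport(handler))

    with pytest.raises(httpx.ConnectTimeout):
        client.get("https://api.example.com/")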

If I’m fetching lots of binary files

If my HTTP code is downloading images and video, storing them in a vcrpy cassette is pretty inefficient – they have to be encoded as base64. This makes the cassettes large, the extra decoding step slows my tests down, and the files are hard to inspect.

When I’m testing with binary files, I store them as standalone files in my fixtures directory (e.g. in tests/fixtures/images), and I write my own mock to read the file from disk. I can easily inspect or modify the fixture data, and I don’t have the overhead of using cassettes.
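
Here’s a sketch of that kind of mock, again using httpx’s MockTransport – the directory layout and handler are hypothetical:

from pathlib import Path

import httpx

FIXTURES_DIR = Path("tests/fixtures/images")


def fixture_transport(request: httpx.Request) -> httpx.Response:
    # Serve the fixture whose filename matches the final component of
    # the request path, e.g. /photos/cat.jpg -> tests/fixtures/images/cat.jpg
    body = (FIXTURES_DIR / Path(request.url.path).name).read_bytes()
    return httpx.Response(200, content=body)


client = httpx.Client(transport=httpx.MockTransport(fixture_transport))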

If I’m testing future or hypothetical changes in an API

A vcrpy cassette locks in the current behaviour. But suppose I know about an upcoming change, or I want to check my code would handle an unusual response – I can’t capture that in a vcrpy cassette, because the server isn’t returning responses like that (yet).

In those cases, I either construct a vcrpy cassette with the desired response by hand, or I use a code-based mock to return my unusual response.


Summary

Using vcrpy has allowed me to write more thorough tests, and it does all the hard work of intercepting HTTP calls and serialising them to disk. It gives me high-fidelity snapshots of HTTP responses, allowing me to mock HTTP calls and avoid network requests in my tests. This makes my tests faster, more consistent, and more reliable.

Here’s a quick reminder of what I do to run vcrpy in production:

  • I use filter_query_parameters and filter_headers to keep secrets out of cassette files
  • I set decode_compressed_response=True to make cassettes more readable
  • I name cassettes after the test function they’re associated with
  • I throw errors if an HTTP client isn’t set up correctly when you try to record a cassette
  • I wrap everything in a fixture, to keep individual tests simpler

If you make HTTP calls from your tests, I really recommend it: https://vcrpy.readthedocs.io/en/latest/


Create space-saving clones on macOS with Python

2025-08-03 22:49:06

The standard Mac filesystem, APFS, has a feature called space-saving clones. This allows you to create multiple copies of a file without using additional disk space – the filesystem only stores a single copy of the data.

Although cloned files share data, they’re independent – you can edit one copy without affecting the other (unlike symlinks or hard links). APFS uses a technique called copy-on-write to store the data efficiently on disk – the cloned files continue to share any pieces they have in common.
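
You can convince yourself of that independence with a few lines of Python, using the cp -c cloning command covered later in this post:

import subprocess
from pathlib import Path

Path("original.txt").write_text("hello world\n")
subprocess.check_call(["cp", "-c", "original.txt", "clone.txt"])

# Editing the clone rewrites only the changed blocks;
# the original file keeps its old contents.
Path("clone.txt").write_text("goodbye world\n")
assert Path("original.txt").read_text() == "hello world\n"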

Cloning files is both faster and uses less disk space than copying. If you’re working with large files – like photos, videos, or datasets – space-saving clones can be a big win.

Several filesystems support cloning, but in this post, I’m focusing on macOS and APFS.

For a recent project, I wanted to clone files using Python. There’s an open ticket to support file cloning in the Python standard library. In Python 3.14, there’s a new Path.copy() function which adds support for cloning on Linux – but there’s nothing yet for macOS.

In this post, I’ll show you two ways to clone files in APFS using Python.



What are the benefits of cloning?

There are two main benefits to using clones rather than copies.

Cloning files uses less disk space than copying

Because the filesystem only has to keep one copy of the data, cloning a file doesn’t use more space on disk. We can see this with an experiment. Let’s start by creating a random file with 1GB of data, and checking our free disk space:

$ dd if=/dev/urandom of=1GB.bin bs=64M count=16
16+0 records in
16+0 records out
1073741824 bytes transferred in 2.113280 secs (508092550 bytes/sec)

$ df -h -I /
Filesystem        Size    Used   Avail Capacity  Mounted on
/dev/disk3s1s1   460Gi    14Gi    43Gi    25%    /

My disk currently has 43GB available.

Let’s copy the file, and check the free disk space after it’s done. Notice that it decreases to 42GB, because the filesystem is now storing a second copy of this 1GB file:

$ # Copying
$ cp 1GB.bin copy.bin

$ df -h -I /
Filesystem        Size    Used   Avail Capacity  Mounted on
/dev/disk3s1s1   460Gi    14Gi    42Gi    25%    /

Now let’s clone the file by passing the -c flag to cp. Notice that the free disk space stays the same, because the filesystem is just keeping a single copy of the data between the original and the clone:

$ # Cloning
$ cp -c 1GB.bin clone.bin

$ df -h -I /
Filesystem        Size    Used   Avail Capacity  Mounted on
/dev/disk3s1s1   460Gi    14Gi    42Gi    25%    /

Cloning files is faster than copying

When you clone a file, the filesystem only has to write a small amount of metadata about the new clone. When you copy a file, it needs to write all the bytes of the entire file. This means that cloning a file is much faster than copying, which we can see by timing the two approaches:

$ # Copying
$ time cp 1GB.bin copy.bin
Executed in  260.07 millis

$ # Cloning
$ time cp -c 1GB.bin clone.bin
Executed in    6.90 millis

This roughly 38× difference is with my Mac’s internal SSD. In my experience, the speed difference is even more pronounced on slower disks, like external hard drives.

How do you clone files on macOS?

Using the “Duplicate” command in Finder

If you use the Duplicate command in Finder (File > Duplicate or ⌘D), it clones the file.

Using cp -c on the command line

If you use the cp (copy) command with the -c flag, and it’s possible to clone the file, you get a clone rather than a copy. If it’s not possible to clone the file – for example, if you’re on a non-APFS volume that doesn’t support cloning – you get a regular copy.

Here’s what that looks like:

$ cp -c src.txt dst.txt

Using the clonefile() function

There’s a macOS syscall clonefile() which creates space-saving clones. It was introduced alongside APFS.

Syscalls are quite low level, and they’re how programs are meant to interact with the operating system. I don’t think I’ve ever made a syscall directly – I’ve used wrappers like the Python os module, which make syscalls on my behalf, but I’ve never written my own code to call them.

Here’s a rudimentary C program that uses clonefile() to clone a file:

#include <stdio.h>
#include <stdlib.h>
#include <sys/clonefile.h>

int main(void) {
    const char *src = "1GB.bin";
    const char *dst = "clone.bin";

    /* clonefile(2) supports several options related to symlinks and
     * ownership information, but for this example we'll just use
     * the default behaviour */
    const int flags = 0;

    if (clonefile(src, dst, flags) != 0) {
        perror("clonefile failed");
        return EXIT_FAILURE;
    }

    printf("clonefile succeeded: %s ~> %s\n", src, dst);

    return EXIT_SUCCESS;
}

You can compile and run this program like so:

$ gcc clone.c

$ ./a.out
clonefile succeeded: 1GB.bin ~> clone.bin

$ ./a.out
clonefile failed: File exists

But I don’t use C in any of my projects – can I call this function from Python instead?

How do you clone files with Python?

Shelling out to cp -c using subprocess

The easiest way to clone a file in Python is by shelling out to cp -c with the subprocess module. Here’s a short example:

import subprocess

# Adding the `-c` flag means the file is cloned rather than copied,
# if possible.  See the man page for `cp`.
subprocess.check_call(["cp", "-c", "1GB.bin", "clone.bin"])

I think this snippet is pretty simple, and a new reader could understand what it’s doing. If they’re unfamiliar with file cloning on APFS, they might not immediately understand why this is different from shutil.copyfile, but they could work it out quickly.

This approach gets all the nice behaviour of the cp command – for example, if you try to clone on a volume that doesn’t support cloning, it falls back to a regular file copy instead. There’s a bit of overhead from spawning an external process, but the overall impact is negligible (and easily offset by the speed increase of cloning).

The problem with this approach is that error handling gets harder. The cp command fails with exit code 1 for every error, so you need to parse the stderr to distinguish different errors, or implement your own error handling.

In my project, I wrapped this cp call in a function which had some additional checks to spot common types of error, and throw them as more specific exceptions. Any remaining errors get thrown as a generic subprocess.CalledProcessError. Here’s an example:

from pathlib import Path
import subprocess


def clonefile(src: Path, dst: Path):
    """Clone a file on macOS by using the `cp` command."""
    # Check a couple of common error cases so we can get nice exceptions,
    # rather than relying on the `subprocess.CalledProcessError` from `cp`.
    if not src.exists():
        raise FileNotFoundError(src)

    if not dst.parent.exists():
        raise FileNotFoundError(dst.parent)

    # Adding the `-c` flag means the file is cloned rather than copied,
    # if possible.  See the man page for `cp`.
    subprocess.check_call(["cp", "-c", str(src), str(dst)])

    assert dst.exists()

For me, this code strikes a nice balance between being readable and returning good errors.

Calling the clonefile() function using ctypes

What if we want detailed error codes, and we don’t want the overhead of spawning an external process? Although I know it’s possible to make syscalls from Python using the ctypes library, I’ve never actually done it. This is my chance to learn!

Following the documentation for ctypes, these are the steps:

  1. Import ctypes and load a dynamic link library. This is the first thing we need to do – in this case, we’re loading the macOS link library that contains the clonefile() function.

    import ctypes
    
    libSystem = ctypes.CDLL("libSystem.B.dylib")
    

    I worked out that I need to load libSystem.B.dylib by looking at other examples of ctypes code on GitHub. I couldn’t find an explanation of it in Apple’s documentation.

    I later discovered that I can use otool to see the shared libraries that a compiled executable is linking to. For example, I can see that cp is linking to the same libSystem.B.dylib:

    $ otool -L /bin/cp
    /bin/cp:
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1351.0.0)
    

    This CDLL() call only works on macOS, which makes sense – it’s loading macOS libraries. If I run this code on my Debian web server, I get an error: OSError: libSystem.B.dylib: cannot open shared object file: No such file or directory.

  2. Tell ctypes about the function signature. If we look at the man page for clonefile(), we see the signature of the C function:

    int clonefile(const char * src, const char * dst, int flags);
    

    We need to tell ctypes to find this function inside libSystem.B.dylib, then describe the arguments and return type of the function:

    clonefile = libSystem.clonefile
    clonefile.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
    clonefile.restype = ctypes.c_int
    

    Although ctypes can call C functions without a declared signature, describing it is good practice and gives you some safety rails.

    For example, now ctypes knows that the clonefile() function takes three arguments. If I try to call the function with one or two arguments, I get a TypeError. If I didn’t specify the signature, I could call it with any number of arguments, but it might behave in weird or unexpected ways.

  3. Define the inputs for the function. This function needs three arguments.

    In the original C function, src and dst are char* – pointers to a null-terminated string of char values. In Python, this means the inputs need to be bytes values. Then flags is a regular Python int.

    # Source and destination files
    src = b"1GB.bin"
    dst = b"clone.bin"
    
    # clonefile(2) supports several options related to symlinks and
    # ownership information, but for this example we'll just use
    # the default behaviour
    flags = 0
    
  4. Call the function. Now we have the function available in Python, and the inputs in C-compatible types, we can call the function:

    import os
        
    if clonefile(src, dst, flags) != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
        
    print(f"clonefile succeeded: {src} ~> {dst}")
    

    If the clone succeeds, this program runs successfully. But if the clone fails, we get an unhelpful error: OSError: [Errno 0] Undefined error: 0.

    The point of calling the C function is to get useful error codes, but we need to opt-in to receiving them. In particular, we need to add the use_errno parameter to our CDLL call:

    libSystem = ctypes.CDLL("libSystem.B.dylib", use_errno=True)
    

    Now, when the clone fails, we get different errors depending on the type of failure. The exception includes the numeric error code, and Python will throw named subclasses of OSError like FileNotFoundError, FileExistsError, or PermissionError. This makes it easier to write try … except blocks for specific failures.

Here’s the complete script, which clones a single file:

import ctypes
import os

# Load the libSystem library
libSystem = ctypes.CDLL("libSystem.B.dylib", use_errno=True)

# Tell ctypes about the function signature
# int clonefile(const char * src, const char * dst, int flags);
clonefile = libSystem.clonefile
clonefile.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
clonefile.restype = ctypes.c_int

# Source and destination files
src = b"1GB.bin"
dst = b"clone.bin"

# clonefile(2) supports several options related to symlinks and
# ownership information, but for this example we'll just use
# the default behaviour
flags = 0

# Actually call the clonefile() function
if clonefile(src, dst, flags) != 0:
    errno = ctypes.get_errno()
    raise OSError(errno, os.strerror(errno))
    
print(f"clonefile succeeded: {src} ~> {dst}")

I wrote this code for my own learning, and it’s definitely not production-ready. It works in the happy case and helped me understand ctypes, but if you actually wanted to use this, you’d want proper error handling and testing.

In particular, there are cases where you’d want to fall back to shutil.copyfile or similar if the clone fails – say if you’re on an older version of macOS, or you’re copying files on a volume which doesn’t support cloning. Both those cases are handled by cp -c, but not the clonefile() syscall.
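
Here’s a sketch of what that fallback might look like, reusing the cp -c approach from earlier (the function name is mine, not from the original post):

import shutil
import subprocess
from pathlib import Path


def clone_or_copy(src: Path, dst: Path) -> None:
    """Clone a file if possible, falling back to a regular copy.

    `cp -c` already falls back to copying on volumes that don't
    support cloning, so this mainly guards against environments
    where the -c flag itself fails.
    """
    try:
        subprocess.check_call(["cp", "-c", str(src), str(dst)])
    except (subprocess.CalledProcessError, FileNotFoundError):
        shutil.copyfile(src, dst)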

In practice, how am I cloning files in Python?

In my project, I used cp -c with a wrapper like the one described above. It’s a short amount of code, pretty readable, and returns useful errors for common cases.

Calling clonefile() directly with ctypes might be slightly faster than shelling out to cp -c, but the difference is probably negligible. The downside is that it’s more fragile and harder for other people to understand – it would have been the only part of the codebase that was using ctypes.

File cloning made a noticeable difference. The project involved copying lots of files on an external USB hard drive, and cloning instead of copying full files made it much faster. Tasks that used to take over an hour were now completing in less than a minute. (The files were copied between folders on the same drive – cloned files have to be on the same APFS volume.)

I’m excited to see how file cloning works on Linux in Python 3.14 with Path.copy(), and I hope macOS support isn’t far behind.


Slipstitch, Queer Craft, and community spaces

2025-07-30 17:26:41

Two weeks ago, I was at Queer Craft – a fortnightly meet-up at Slipstitch, a haberdashery and yarn shop near Alexandra Palace. I was working on a cross stitch piece of a train in a snowy landscape, chatting to my friends, admiring their creations, and gently snacking on a variety of baked goods.

This week, I wasn’t there, because Slipstitch closed its doors on Saturday.

A photo of the shop's exterior. It has bright turquoise paint with the word ‘Slipstitch’ in gold lettering across the top, and large windows that show the shop interior. You can see shelves, balls of wool, and two large wooden knitting needles.
Slipstitch in sunnier times. Photo from the Slipstitch website.

I can’t remember exactly when I first came across Slipstitch, but I remember why. Slipstitch was sometimes the target of stickering campaigns from TERFs and anti-trans campaigners, who objected to Rosie’s vocal support of trans people. (Rosie is the shop’s owner and a dear friend.)

This discussion spilled onto Twitter, and at some point Rosie’s tweets appeared in my timeline. I saw the shop account calling out the stickers and re-affirming its support of trans people, and since I’m trans and I do lots of cross-stitch, I decided to check it out. I looked around the online store, thinking I might buy some thread – then I found an event called “Queer Craft”, and booked.

Turning up the next Monday, I was a bit nervous – would I be queer enough? Would I be crafty enough? For whatever reason, my mental image of a craft meetup is people doing knitting or crochet – does anybody bring cross-stitch to these things?

My nerves were quickly put at ease – Rosie welcomed me enthusiastically, and I settled in. I sat down at the table, put on a name badge, and took out my cross-stitch. As I stitched away, I started chatting to strangers who would soon become friends.

Queer Craft was every other Monday, and I began making regular trips to Muswell Hill for two hours of crafting and conversation. We’d admire each other’s work, and share tips and advice if somebody was struggling. The group was always generous with knowledge, equipment, and sympathy for unexpected snags – but the conversation often drifted away from our crafts.


As we got to know each other more, we developed in-jokes and stories and familiar topics. We’d talk about Taskmaster and the careers of cutlery salesmen. We’d discuss what it’s like to grow up in Cornwall. We’d chat about theatre shows and West End drama. Everyone made jokes about how I’m a spy. (I’m not a spy.) Rosie would tell us about the wrong way to make coffee. We passed around many, many photos of our pets.

I know that Rosie was always keen for Queer Craft to be welcoming to newcomers and not too “clique”-y – I suspect the rate of in-jokes made that difficult, but I admire the effort. (I wonder if that’s the fate of all groups designed to help strangers meet? Either it fizzles out, or a group of regulars gradually forms that makes it harder for new people to join.) I confess I was never too worried about this, because I was too busy having a nice time with my friends.

Craft is such a personal hobby, and we saw glimpses of each other’s lives through the things we were making, especially when they were gifts for somebody else. Somebody making a baby blanket for a friend, or a stuffed toy for a parent, or some socks for a partner. Everyone poured so much time and love into their work. It felt very intimate and special.

I’m always a bit nervous about how visibly trans to be, but that was never an issue at Queer Craft – something I should have known from the start, thanks to Rosie’s vocal support of her trans staff and customers. Everyone treated transness as so self-evidently normal, it didn’t bear comment.

Sometimes I was femme, and sometimes I was masc, and nobody batted an eyelid. (Apart from the time Rosie took one look at my officewear and described it as “corporate drag”, a burn so savage I may never recover.) That sense of casual, unconditional trans acceptance feels like it’s getting more common – but it’s still nice every time.

These friendships have spilled out of Slipstitch and into the world beyond. One Queer Craft regular is a published author, and some of us went to celebrate her at a book event. Several others are in choirs, and I’ve been to see them sing. Last year, I invited people round to my house for “Queer Crafternoon”, where we all crafted in my living room and ate scones with jam and cream.


Nine weeks ago, Rosie told us that she’d decided to close the physical shop. The world is tough for small businesses, and tougher for the person running them. I’m sad, but I could see how much stress it was putting on Rosie, and I respect her decision to put her own wellbeing first, and to close the doors on her own terms.

On Saturday, Rosie held a party to mark the closing of the space. The shelves were empty, the room anything but. There were Queer Craft friends, regulars from the other Meet & Make group, people who’d come to Rosie’s classes, regular customers, and other friends of the shop. The community around Slipstitch is much more than just the queers who met on Monday evenings. The shop was the busiest I’ve ever seen it, and it was lovely to see so many people there to celebrate and mourn.

A rendition of the shopfront in cross-stitch, mounted in a gold frame on a brick wall. The shop is a simple geometric design in turquoise thread, and each of the three windows shows a mini-display – a pair of barbies, a pair of jumpers on stands, six balls of wool in rainbow colours.
My closing gift for Rosie was her shopfront, rendered in cross-stitch. I had a lot of fun designing the little details – the barbies in the shop window, the jumpers on display, the balls of wool in a rainbow pattern. And of course, I bought all the thread from her, but fortunately she never thought to ask why I was buying so much turquoise thread.

I’m sure Queer Craft and our friendships will continue in some shape or form, but sitting here now, I can’t help but be a little upset about what we’ve lost. Of course, there are other haberdasheries, and there are other queer craft groups – it’s hardly a unique idea – but Slipstitch was the haberdashery where I shopped, it’s where our group met, and I’m sad to know it’s gone.

In her final newsletter before the closure, Rosie wrote “[Slipstitch] never wanted for community”. I think that’s a lovely sentiment, and one that rang true in my experience – it always felt like such a friendly, welcoming space, and I’m glad I found it. I hope the friendships forged in Queer Craft will survive a long time after the physical shop is gone. I know that Rosie wants Slipstitch to continue as an idea, if not a physical venue, and I’m excited to see what happens next.

Her words also made me reflect on the fragility of our community spaces – those places where we can meet strangers and bond over a common interest. They’re getting scarcer and scarcer. As every bit of land is forced into more and more commercialisation, we’re running out of places to just hang out and meet people. We often talk about how hard it is to make friends as an adult – and that’s in part because the spaces where we might do so are dwindling.

These community spaces are precious for queer people, yes, but for everyone else too. I’m sad that the shop has closed, and I’m sad that this iteration of Queer Craft is over, and I’m sad that this is a trend. These spaces are rare, and getting rarer – we shouldn’t take them for granted.


Today was my last day at the Flickr Foundation

2025-07-26 04:12:14

Today was my last day at the Flickr Foundation. At 5pm I closed my laptop, left the office for the last time, and took a quiet walk along Regent’s Canal. I saw an adorable family of baby coots, and a teenage coot who was still a bit fluffy and raggedy around the edges.

I’ve got another job lined up, but I’m taking a short break before I start.

My new role is still in software engineering, but in a completely different field. I’m stepping away from the world of libraries, archives, and photography. I’ve met some amazing people, and I’m very proud of everything we accomplished in digital preservation and cultural heritage. I’ll always treasure those memories, but I’m also excited to try something new.

For the last few years, I’ve been among the more senior engineers in my team. In my next role, I’ll be firmly middle of the pack, and I’m looking forward to learning from people who have more wisdom and experience than me.

But first: rest.


Minifying HTML on my Jekyll website

2025-07-25 05:59:10

I minify all the HTML on this website – removing unnecessary whitespace, tidying up attributes, optimising HTML entities, and so on. This makes each page smaller, and theoretically the website should be slightly faster.

I’m not going to pretend this step is justified by the numbers. My pages are already pretty small pre-minification, and it only reduces the average page size by about 4%. In June, minification probably saved only a few MiB of bandwidth.

But I do it anyway. I minify HTML because I like tinkering with the website, and I enjoy finding ways to make it that little bit faster or more efficient. I recently changed the way I’m minifying HTML, and I thought this would be a good time to compare the three approaches I’ve used and share a few things I learned about HTML along the way.

I build this website using Jekyll, so I’ve looked for Jekyll or Ruby-based solutions.


Approach #1: Compress HTML in Jekyll, by Anatol Broder

This is a Jekyll layout that compresses HTML. It’s a single HTML file written in pure Liquid (the templating language used by Jekyll).

First you save the HTML file to _layouts/compress.html, then reference it in your highest-level layout. For example, in _layouts/default.html you might write:

---
layout: compress
---

<html>
{{ content }}
</html>

Because it’s a single HTML file, it’s easy to install and doesn’t require any plugins. This is useful if you’re running in an environment where plugins are restricted or disallowed (which I think includes GitHub Pages, although I’m not 100% sure).

The downside is that the single HTML file can be tricky to debug, it only minifies HTML (not CSS or JavaScript), and there’s no easy way to cache the output.

Approach #2: The htmlcompressor gem, by Paolo Chiodi

The htmlcompressor gem is a Ruby port of Google’s Java-based HtmlCompressor. The README describes it as an “alpha version”, but in my usage it was very stable and it has a simple API.

I start by changing my compress.html layout to pass the page content to a compress_html filter:

---
---

{{ content | compress_html }}

This filter is defined as a custom plugin; I save the following code in _plugins/compress_html.rb:

def run_compress_html(html)
  require 'htmlcompressor'

  options = {
    remove_intertag_spaces: true
  }
  compressor = HtmlCompressor::Compressor.new(options)
  compressor.compress(html)
end

module Jekyll
  module CompressHtmlFilter
    def compress_html(html)
      cache = Jekyll::Cache.new('CompressHtml')

      cache.getset(html) do
        run_compress_html(html)
      end
    end
  end
end

Liquid::Template.register_filter(Jekyll::CompressHtmlFilter)

I mostly stick with the default options; the only extra rule I enabled was to remove inter-tag spaces. Consider the following example:

<p>hello world</p> <p>my name is Alex</p>

By default, htmlcompressor will leave the space between the closing </p> and the opening <p> as-is. Enabling remove_intertag_spaces makes it a bit more aggressive, and it removes that space.

I’m using the Jekyll cache to save the results of the compression – most pages don’t change from build-to-build, and it’s faster to cache the results than recompress the HTML each time.

The gem seems abandoned – the last push to GitHub was in 2017.

Approach #3: The minify-html library, by Wilson Lin

This is a Rust-based HTML minifier, with bindings for a variety of languages, including Ruby, Python, and Node. It’s very fast, and even more aggressive than other minifiers.

I use it in a very similar way to htmlcompressor. I call the same compress_html filter in _layouts/compress.html, and then my run_compress_html in _plugins/compress_html.rb is a bit different:

def run_compress_html(html)
  require 'minify_html'

  options = {
    keep_html_and_head_opening_tags: true,
    keep_closing_tags: true,
    minify_css: true,
    minify_js: true
  }

  minify_html(html, options)
end

This is a much more aggressive minifier. For example, it turns out that the <html> and <head> elements are optional in an HTML5 document, so this minifier removes them if it can. I’ve disabled this behaviour, because I’m old-fashioned and I like my pages to have <html> and <head> tags.

This library also allows minifying inline CSS and JavaScript, which is a nice bonus. That has some rough edges though: there’s an open issue with JS minification, and I had to tweak several of my if-else statements to work with the minifier. Activity on the GitHub repository is sporadic, so I don’t know if that will get fixed any time soon.

Minify, but verify

After I minify HTML, but before I publish the site, I run HTML-Proofer to validate my HTML.

I’m not sure this has ever caught an issue introduced by a minifier, but it gives me peace of mind that these tools aren’t mangling my HTML. (It has caught plenty of issues caused by my mistakes!)

Comparing the three approaches

There are two key metrics for HTML minifiers:

  • Speed: this is a dead heat. When I build the site with a warm cache, it takes about 2.5s whichever minifier I’m using. The htmlcompressor gem and minify-html library are much slower if I have a cold cache, but that’s only a few extra seconds and it’s rare for me to build the site that way.

  • File size: the Ruby and Rust-based minifiers achieve slightly better minification, because they’re more aggressive in what they trim. For example, they’re smarter about removing unnecessary spaces and quoting around attribute values.

    Here’s the average page size after minification:

    Approach                         Average HTML page size
    Without minification             14.9 KiB
    Compress HTML in Jekyll 3.2.0    14.3 KiB
    htmlcompressor 0.4.0             14.0 KiB
    minify-html 0.16.4               13.5 KiB

I’m currently using minify-html. This is partly because it gets slightly smaller page sizes, and partly because it has bindings in other languages. This website is my only major project that uses Ruby, and so I’m always keen to find things I can share in my other non-Ruby projects. If minify-html works for me (and it is so far), I can imagine using it elsewhere.
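
For example, the Python binding exposes a single minify() function whose keyword options mirror the ones I’m passing in Ruby – a sketch, assuming the minify-html package is installed:

import minify_html

html = "<html><head></head><body><p>hello      world</p></body></html>"

minified = minify_html.minify(
    html,
    keep_html_and_head_opening_tags=True,
    keep_closing_tags=True,
    minify_css=True,
    minify_js=True,
)

print(minified)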
