
Creating a personal wrapper around yt-dlp

2025-10-07 14:19:18

I download a lot of videos from YouTube, and yt-dlp is my tool of choice. Sometimes I download videos as a one-off, but more often I’m downloading videos in a project – my bookmarks, my collection of TV clips, or my social media scrapbook.

I’ve noticed myself writing similar logic in each project – finding the downloaded files, converting them to MP4, getting the channel information, and so on. When you write the same thing multiple times, it’s a sign you should extract it into a shared tool – so that’s what I’ve done.

yt-dlp_alexwlchan is a script that calls yt-dlp with my preferred options, in particular:

  • Download the highest-quality video, thumbnail, and subtitles
  • Save the video as MP4 and the thumbnail as a JPEG
  • Get some information about the video (like title and description) and the channel (like the name and avatar)

All this is presented in a CLI command which prints a JSON object that other projects can parse. Here’s an example:

$ yt-dlp_alexwlchan.py "https://www.youtube.com/watch?v=TUQaGhPdlxs"
{
  "id": "TUQaGhPdlxs",
  "url": "https://www.youtube.com/watch?v=TUQaGhPdlxs",
  "title": "\"new york city, manhattan, people\" - Free Public Domain Video",
  "description": "All videos uploaded to this channel are in the Public Domain: Free for use by anyone for any purpose without restriction. #PublicDomain",
  "date_uploaded": "2022-03-25T01:10:38Z",
  "video_path": "\uff02new york city, manhattan, people\uff02 - Free Public Domain Video [TUQaGhPdlxs].mp4",
  "thumbnail_path": "\uff02new york city, manhattan, people\uff02 - Free Public Domain Video [TUQaGhPdlxs].jpg",
  "subtitle_path": null,
  "channel": {
    "id": "UCDeqps8f3hoHm6DHJoseDlg",
    "name": "Public Domain Archive",
    "url": "https://www.youtube.com/channel/UCDeqps8f3hoHm6DHJoseDlg",
    "avatar_url": "https://yt3.googleusercontent.com/ytc/AIdro_kbeCfc5KrnLmdASZQ9u649IxrxEUXsUaxdSUR_jA_4SZQ=s0"
  },
  "site": "youtube"
}

Rather than using the yt-dlp CLI, I’m using the Python interface. I can import the YoutubeDL class and pass it some options, then pull out the important fields from the response. The library is very flexible, and the options are well-documented.
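
Here’s a minimal sketch of that approach – it’s not the exact code from my script, but the options come from yt-dlp’s documented settings and match the behaviour described above:

from yt_dlp import YoutubeDL

# A rough approximation of my preferred options, not the real script
options = {
    "format": "bestvideo*+bestaudio/best",  # highest-quality video and audio
    "merge_output_format": "mp4",           # save the video as MP4
    "writethumbnail": True,                 # download the thumbnail too
    "writesubtitles": True,                 # download subtitles, if any
}

with YoutubeDL(options) as ydl:
    info = ydl.extract_info(
        "https://www.youtube.com/watch?v=TUQaGhPdlxs", download=True
    )

# `info` is a dict with fields like "id", "title" and "description" --
# this is where the JSON output gets its values
print(info["id"], info["title"])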

This is similar to my create_thumbnail tool. I only have to define my preferred behaviour once, then other code can call it as an external script.

I have ideas for changes I might make in the future, like tidying up filenames or supporting more sites, but I’m pretty happy with this first pass. All the code is in my yt-dlp_alexwlchan GitHub repo.

This script is based on my preferences, so you probably don’t want to use it directly – but if you use yt-dlp a lot, it could be a helpful starting point for writing your own script.

Even if you don’t use yt-dlp, the idea still applies: when you find yourself copy-pasting configuration and options, turn it into a standalone tool. It keeps your projects cleaner and more consistent, and your future self will thank you for it.


Opening all the files that have been modified in a Git branch

2025-09-19 05:14:29

Today a colleague asked for a way to open all the files that have changed in a particular Git branch. They were reviewing a large pull request, and sometimes it’s easier to review files in your local editor than in GitHub’s code review interface. You can see the whole file, run tests or local builds, and get more context than the GitHub diffs.

This is the snippet I suggested:

git diff --name-only "$BRANCH_NAME" $(git merge-base origin/main "$BRANCH_NAME") \
  | xargs open -a "Visual Studio Code"

It uses a couple of nifty Git features, so let’s break it down.

How this works

There are three parts to this command:

  1. Work out where the dev branch diverges from main. We can use git-merge-base:

    $ git merge-base origin/main "$BRANCH_NAME"
    9ac371754d220fd4f8340dc0398d5448332676c3

    This command gives us the common ancestor of our main branch and our dev branch – this is the tip of main when the developer created their branch.

    In a small codebase, main might not have changed since the dev branch was created. But in a large codebase where lots of people are making changes, the main branch might have moved on since the dev branch was created.

    Here’s a quick picture:

    [Figure: a simple illustration of Git history, with the main branch, the dev branch, and their common ancestor labelled. There's a linear series of commits on the main branch, and then a development branch created later. The commit where the branch was created is highlighted in red.]

    This tells us which commits we’re reviewing – what are the changes in this branch?

  2. Get a list of files which have changed in the dev branch. We can use git-diff to see the difference between two commits. If we add the --name-only flag, it only prints a list of filenames with changes, not the full diffs.

    $ git diff --name-only "$BRANCH_NAME" $(git merge-base …)
    assets/2025/exif_orientation.py
    src/_drafts/create-thumbnail-is-exif-aware.md
    src/_images/2025/exif_orientation.svg

    Because we're diffing between the tip of our dev branch and the point where our dev branch diverged from main, this prints a list of files that have changed in the dev branch.

    (I originally suggested using git diff --name-only "$BRANCH_NAME" origin/main, but that's wrong. That prints all the files that differ between the two branches, which includes changes merged to main after the dev branch was created.)

  3. Open the files in our text editor. I suggested piping to xargs and open, but there are many ways to do this:

    $ git diff … | xargs open -a "Visual Studio Code"

    The xargs command is super useful for doing the same thing repeatedly – in this case, opening a bunch of files in VS Code. You feed it a whitespace-delimited string, it splits that string into separate arguments, and passes them to the command. The effect is the same as running:

    open -a "Visual Studio Code" "assets/2025/exif_orientation.py"
    open -a "Visual Studio Code" "src/_drafts/create-thumbnail-is-exif-aware.md"
    open -a "Visual Studio Code" "src/_images/2025/exif_orientation.svg"

    The open command opens files, and the -a flag tells it which application to use. We mostly use VS Code at work, but you could pass any text editor here.

    Reading the manpage for open, I'm reminded that you can open multiple files at once, so I could have done this without using xargs. I instinctively reached for xargs because I’m very familiar with it, and it’s a reliable way to take a command that takes a single input, and run it with many inputs.
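
If you wanted this as a script rather than a one-liner, here’s a rough Python equivalent – just a sketch, assuming you run it from inside the repo and pass the branch name as an argument:

import subprocess
import sys

branch_name = sys.argv[1]

# 1. Find where the dev branch diverged from main
merge_base = subprocess.check_output(
    ["git", "merge-base", "origin/main", branch_name], text=True
).strip()

# 2. List the files that changed on the dev branch
filenames = subprocess.check_output(
    ["git", "diff", "--name-only", branch_name, merge_base], text=True
).splitlines()

# 3. Open them all in VS Code -- `open` accepts multiple files,
#    so we don't need xargs here
subprocess.check_call(["open", "-a", "Visual Studio Code", *filenames])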


Linking to text fragments with a bookmarklet

2025-09-15 05:44:01

One of my favourite features added to web browsers in the last few years is text fragments.

Text fragments allow you to link directly to specific text on a web page, and some browsers will highlight the linked text – for example, by scrolling to it, or adding a coloured highlight. This is useful if I’m linking to a long page that doesn’t have linkable headings – I want it to be easy for readers to find the part of the page I was looking for.

Here’s an example of a URL with a text fragment:

https://example.com/#:~:text=illustrative%20examples

But I don’t find the syntax especially intuitive – I can never remember exactly what mix of colons and tildes to add to a URL.

To help me out, I’ve written a small bookmarklet to generate these URLs:

Create link to selected text

To install the bookmarklet, drag it to your bookmarks bar.

When I’m looking at a page and want to create a text fragment link, I select the text and click the bookmarklet. It works out the correct URL and shows it in a popup, ready to copy and paste. You can try it now – select some text on this page, then click the button to see the text fragment URL.

It’s a small tool, but it’s made my link sharing much easier.

Update, 16 September 2025: Smoljaguar on Mastodon pointed out that Firefox, Chrome, and Safari all have a “Copy Link with Highlight” menu item, which does something very similar. The reason I don’t use these is because I didn’t know they existed!

I use Safari as my main browser, and this item is only available in the right-click menu. One reason I like bookmarklets is that they become items in the Bookmarks menu, and then it’s easy for me to assign keyboard shortcuts.

Bookmarklet source code

This is the JavaScript that gets triggered when you run the bookmarklet:

const selectedText = window.getSelection().toString().trim();

if (!selectedText) {
  alert("You need to select some text!");
  return;
}

const url = new URL(window.location);
url.hash = `:~:text=${encodeURIComponent(selectedText)}`;

alert(url.toString());
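
If you want to build these links outside the browser, the same logic is easy to reproduce – here’s a sketch in Python, where the page URL and selected text are arbitrary examples:

from urllib.parse import quote

page_url = "https://example.com/"
selected_text = "illustrative examples"

# The fragment directive is `:~:text=` followed by the
# percent-encoded text to highlight
fragment_url = f"{page_url}#:~:text={quote(selected_text)}"

print(fragment_url)
# https://example.com/#:~:text=illustrative%20examples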


Resizing images in Rust, now with EXIF orientation support

2025-09-09 06:42:48

Resizing an image is one of those programming tasks that seems simple, but has some rough edges. One common mistake is forgetting to handle the EXIF orientation, which can make resized images look very different from the original.

Last year I wrote create_thumbnail, a tool for resizing images, and today I released a small update. Now it’s aware of EXIF orientation, and it no longer mangles these images. This is possible thanks to a new version of the Rust image crate, which just improved its EXIF support.

What’s EXIF orientation?

Images can specify an orientation in their EXIF metadata, which can describe both rotation and reflection. This metadata is usually added by cameras and phones, which can detect how you’re holding them, and tell viewing software how to display the picture later.

For example, if you take a photo while holding your camera on its side, the camera can record that the image should be rotated 90 degrees when viewed. If you use a front-facing selfie camera, the camera could note that the picture needs to be mirrored.

There are eight different values for EXIF orientation – rotating in increments of 90°, and mirrored or not. The default value is “1” (display as-is), and here are the other seven values:

[Figure: a diagram showing the eight different orientations of the word ‘FLY’: four rotations, and the same four rotations with a mirror reflection.]

You can see the EXIF orientation value with programs like Phil Harvey’s exiftool, which helpfully converts the numeric value into a human-readable description:

$ # exiftool's default output is human-readable
$ exiftool -orientation my_picture.jpg
Orientation                     : Rotate 270 CW

$ # or we can get the raw numeric value
$ exiftool -n -orientation my_picture.jpg
Orientation                     : 8

Resizing images in Rust

I use the image crate to resize images in Rust.

My old code for resizing images would open the image, resize it, then save it back to disk. Here’s a short example:

use image::imageops::FilterType;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Old method: doesn't know about EXIF orientation
    let img = image::open("original.jpg")?;
    img.resize(180, 120, FilterType::Lanczos3)
        .save("thumbnail.jpg")?;

    Ok(())
}

The thumbnail will keep the resized pixels in the same order as the original image, but the thumbnail doesn’t have the EXIF orientation metadata. This means that if the original image had an EXIF orientation, the thumbnail could look different, because it’s no longer being rotated/reflected properly.

When I wrote create_thumbnail, the image crate didn’t know anything about EXIF orientation – but last week’s v0.25.8 release added several functions related to EXIF orientation. In particular, I can now read the orientation and apply it to an image:

use image::imageops::FilterType;
use image::{DynamicImage, ImageDecoder, ImageReader};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // New methods in image v0.25.8 know about EXIF orientation,
    // and allow us to apply it to the image before resizing.
    let mut decoder = ImageReader::open("original.jpg")?.into_decoder()?;
    let orientation = decoder.orientation()?;
    let mut img = DynamicImage::from_decoder(decoder)?;
    img.apply_orientation(orientation);

    img.resize(180, 120, FilterType::Lanczos3)
        .save("thumbnail.jpg")?;

    Ok(())
}

The thumbnail still doesn’t have any EXIF orientation data, but the pixels have been rearranged so the resized image looks similar to the original. That’s what I want.

Here’s a visual comparison of the three images. Notice how the thumbnail from the old code looks upside down:

[Figure: the original image, the thumbnail from the old code, and the thumbnail from the new code.]

This test image comes from Dave Perrett’s exif-orientation-examples repo, which has a collection of images that were very helpful for testing this code.

Is this important?

This is a small change, but it solves an annoyance I’ve hit in every project that deals with images. I’ve written this fix before, but images with an EXIF orientation are rare enough that I always forget about them when I start a new project – so I used to solve the same problem again and again.

By handling EXIF orientation in create_thumbnail, I won’t have to think about this again. That’s the beauty of a shared tool – I fix it once, and then it’s fixed for all my current and future projects.


Using vcrpy to test HTTP interactions in Python

2025-08-29 05:36:37

Testing code that makes HTTP requests can be difficult. Real requests are slow, flaky, and hard to control. That’s why I use a Python library called vcrpy, which does a one-off recording of real HTTP interactions, then replays them during future tests.

These recordings are saved to a “cassette” – a plaintext file that I keep alongside my tests and my code. The cassette ensures that all my tests get consistent HTTP responses, which makes them faster and more reliable, especially in CI. I only have to make one real network request, and then I can run my tests locally and offline.

In this post, I’ll show you how I use vcrpy in a production codebase – not just the basics, but also the patterns, pitfalls, and fixtures that make it work for a real team.


[Photo: a pile of three black video cassette tapes stacked on a wooden table. Photo credit: Anthony on Pexels.]

Why not make real HTTP requests in tests?

There are several reasons why I avoid real HTTP requests in my tests:

It makes my tests slower

I want my tests to be fast, because then I’ll run them more often and catch mistakes sooner. An individual HTTP call might be quick, but stack up hundreds of them and tests really start to drag.

It makes my tests less reliable

Even if my code is correct, my tests could fail because of problems on the remote server. What if I’m offline? What if the server is having a temporary outage? What if the server starts rate limiting me for making too many HTTP requests?

It makes my tests more brittle

If my tests depend on the server having certain state, then the server state could change and break or degrade my test suite.

Sometimes this change is obvious. For example, suppose I’m testing a function to fetch photos from Flickr, and then the photo I’m using in my test gets deleted. My code works correctly for photos that still exist, but now my test starts failing.

Sometimes this change is more subtle. Suppose I’ve written a regression test for an edge case, and then the server state changes, so the example I’m checking is no longer an instance of the edge case. I could break the code and never realise, because the test would keep passing. My test suite would become less effective.

It means passing around more secrets

A lot of my HTTP calls require secrets, like API keys or OAuth tokens. If the tests made real HTTP calls, I’d need to copy those secrets to every environment where I’m running the tests. That increases the risk of the secret getting leaked.

It makes my tests harder to debug

If there are more reasons why a test could fail, then it takes longer to work out if the failure was caused by my mistake, or a change on the server.

Recording and replaying HTTP requests solves these problems

If my test suite is returning consistent responses for HTTP calls, and those responses are defined within the test suite itself, then my tests get faster and more reliable. I’m not making real network calls, I’m not dependent on the behaviour of a server, and I don’t need real secrets to run the tests.

There are a variety of ways to define this sort of test mock; I like to record real responses because it ensures I’m getting a high-fidelity mock, and it makes it fairly easy to add new tests.


Why do you like vcrpy?

I know two Python libraries that record real HTTP responses: vcrpy and betamax, both based on a Ruby library called vcr. I’ve used all three, they behave in a similar way, and they work well.

I prefer vcrpy for Python because it supports a wide variety of HTTP libraries, whereas betamax only works with requests. I currently use a mixture of httpx and urllib3, and it’s convenient to test them both with the same library and test helpers.

I also like that vcrpy works without needing any changes to the code I’m testing. I can write HTTP code as I normally would, then I add a vcrpy decorator in my test and the responses get recorded. I don’t like test frameworks that require me to rewrite my code to fit – the tests should follow the code, not the other way round.


A basic example of using vcrpy

Here’s a test that uses vcrpy to fetch www.example.com, and look for some text in the response. I use vcr.use_cassette as a context manager around the code that makes an HTTP request:

import httpx
import vcr


def test_example_domain():
    with vcr.use_cassette("fixtures/vcr_cassettes/test_example_domain.yml"):
        resp = httpx.get("https://www.example.com/")
        assert "<h1>Example Domain</h1>" in resp.text

Alternatively, you can use vcr.use_cassette as a decorator:

@vcr.use_cassette("fixtures/vcr_cassettes/test_example_domain.yml")
def test_example_domain():
    resp = httpx.get("https://www.example.com/")
    assert "<h1>Example Domain</h1>" in resp.text

With the decorator, you can also omit the path to the cassette file, and vcrpy will name the cassette file after the function:

@vcr.use_cassette()
def test_example_domain():
    resp = httpx.get("https://www.example.com/")
    assert "<h1>Example Domain</h1>" in resp.text

When I run this test using pytest (python3 -m pytest test_example.py), vcrpy will check if the cassette file exists. If the file is missing, it makes a real HTTP call and saves it to the file. If the file exists, it replays the previously-recorded HTTP call.

By default, the cassette is a YAML file. Here’s what it looks like: test_example_domain.yml.

If a test makes more than one HTTP request, vcrpy records all of them in the same cassette file.
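
For example, a test like this would record both interactions to one cassette on the first run, then replay them in order on later runs (a sketch, in the style of the earlier examples):

@vcr.use_cassette("fixtures/vcr_cassettes/test_two_requests.yml")
def test_two_requests():
    # Both of these requests are saved in the same cassette file
    resp1 = httpx.get("https://www.example.com/")
    resp2 = httpx.get("https://www.example.org/")
    assert "Example Domain" in resp1.text
    assert "Example Domain" in resp2.text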


Using vcrpy in production

Keeping secrets out of my cassettes

The cassette files contain the complete HTTP request and response, which includes the URL, form data, and HTTP headers. If I’m testing an API that requires auth, the HTTP request could include secrets like an API key or OAuth token. I don’t want to save those secrets in the cassette file!

Fortunately, vcrpy can filter sensitive data before it’s saved to the cassette file – HTTP headers, URL query parameters, or form data.

Here’s an example where I’m using filter_query_parameters to redact an API key. I’m replacing the real value with the placeholder REDACTED_API_KEY.

import os

import httpx
import vcr


def test_flickr_api():
    with vcr.use_cassette(
        "fixtures/vcr_cassettes/test_flickr_api.yml",
        filter_query_parameters=[("api_key", "REDACTED_API_KEY")],
    ):
        api_key = os.environ.get("FLICKR_API_KEY", "API_KEY")

        resp = httpx.get(
            "https://api.flickr.com/services/rest/",
            params={
                "api_key": api_key,
                "method": "flickr.urls.lookupUser",
                "url": "https://www.flickr.com/photos/alexwlchan/",
            },
        )

        assert '<user id="199258389@N04">' in resp.text

When I run this test the first time, I need to pass an env var FLICKR_API_KEY. This makes a real request and records a cassette, but with my redacted value. When I run the test again, I don’t need to pass the env var, but the test will still pass.

You can see the complete YAML file in test_flickr_api.yml. Notice how the api_key query parameter has been redacted in the recorded request:

interactions:
- request:
    
    uri: https://api.flickr.com/services/rest/?api_key=REDACTED_API_KEY&method=flickr.urls.lookupUser&url=https%3A%2F%2Fwww.flickr.com%2Fphotos%2Falexwlchan%2F
    

You can also tell vcrpy to omit the sensitive field entirely, but I like to insert a placeholder value. It’s useful for debugging later – you can see that a value was replaced, and easily search for the code that’s doing the redaction.
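
The same pattern works for HTTP headers. For example, if a token is sent in an Authorization header, filter_headers can swap it for a placeholder before the cassette is saved – a sketch:

with vcr.use_cassette(
    "fixtures/vcr_cassettes/test_authed_api.yml",
    # Replace the real Authorization header with a placeholder
    # before the request is written to the cassette
    filter_headers=[("Authorization", "REDACTED_AUTH_TOKEN")],
):
    ...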

Improving the human readability of cassettes

If you look at the first two cassette files, you’ll notice that the response body is stored as base64-encoded binary data:

response:
  body:
    string: !!binary |
      H4sIAAAAAAAAAH1UTXPbIBC9+1ds1UsyIyQnaRqPLWn6mWkPaQ9pDz0SsbKYCFAByfZ08t+7Qo4j
      N5makYFdeLvvsZC9Eqb0uxah9qopZtljh1wUM6Bf5qVvsPi85aptED4ZxaXO0tE6G5co9BzKmluH
      Po86X7FFBGkxcdbetwx/d7LPo49Ge9SeDWEjKMdZHnnc+nQIvzpAvYSkucI86iVuWmP9ZP9GCl/n

That’s because example.com and api.flickr.com both gzip compress their responses, and vcrpy is preserving that compression. But gzip compression is handled by the HTTP libraries – my code never needs to worry about compression; it just gets the uncompressed response.

Where possible, I prefer to store responses in their uncompressed form. It makes the cassettes easier to read, and you can see if secrets are included in the saved response data. I also find it useful to read cassettes as an example of what an API response looks like – and in particular, what it looked like when I wrote the test. Cassettes have helped me spot undocumented changes in APIs.

Here’s an example where I’m using decode_compressed_response=True to remove the gzip compression in the cassette:

def test_example_domain_with_decode():
    with vcr.use_cassette(
        "fixtures/vcr_cassettes/test_example_domain_with_decode.yml",
        decode_compressed_response=True,
    ):
        resp = httpx.get("https://www.example.com/")
        assert "<h1>Example Domain</h1>" in resp.text

You can see the complete cassette file in test_example_domain_with_decode.yml. Notice the response body now contains an HTML string:

response:
  body:
    string: "<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n
      \   <meta charset=\"utf-8\" />\n    <meta http-equiv=\"Content-type\" content=\"text/html;
      charset=utf-8\" />\n    <meta name=\"viewport\" content=\"width=device-width,

Naming my cassettes to make sense later

If you write a lot of tests that use vcrpy, you’ll end up with a fixtures directory that’s full of cassettes. I like cassette names to match my test functions, so they’re easy to match up later.

I could specify a cassette name explicitly in every test, but that’s extra work and prone to error. Alternatively, I could use the decorator and use the automatic cassette name – but vcrpy uses the name of the test function, which may not distinguish between tests. In particular, I often group tests into classes, or use parametrized tests to run the same test with different values.

Consider the following example:

import httpx
import pytest
import vcr


class TestExampleDotCom:
    def test_status_code(self):
        resp = httpx.get("https://example.com")
        assert resp.status_code == 200


@vcr.use_cassette()
@pytest.mark.parametrize(
    "url, status_code",
    [
        ("https://httpbin.org/status/200", 200),
        ("https://httpbin.org/status/404", 404),
        ("https://httpbin.org/status/500", 500),
    ],
)
def test_status_code(url, status_code):
    resp = httpx.get(url)
    assert resp.status_code == status_code

This is four different tests, but vcrpy’s automatic cassette name is the same for each of them: test_status_code. The tests will fail if you try to run them – vcrpy will record a cassette for the first test that runs, then try to replay that cassette for the second test. The second test makes a different HTTP request, so vcrpy will throw an error because it can’t find a matching request.

Here’s what I do instead: I have a pytest fixture to choose cassette names, which includes the name of the test class (if any) and the ID of the parametrized test case. Because I sometimes use URLs in parametrized tests, I also check the test case ID doesn’t include slashes or colons – I don’t want those in my filenames!

Here’s the fixture:

@pytest.fixture
def cassette_name(request: pytest.FixtureRequest) -> str:
    """
    Returns the filename of a VCR cassette to use in tests.

    The name can be made up of (up to) three parts:

    -   the name of the test class
    -   the name of the test function
    -   the ID of the test case in @pytest.mark.parametrize

    """
    name = request.node.name

    # This is to catch cases where e.g. I try to include a complete
    # HTTP URL in a cassette name, which creates messy folders in
    # the fixtures directory.
    if ":" in name or "/" in name:
        raise ValueError(
            "Illegal characters in VCR cassette name - "
            "please set a test ID with pytest.param(…, id='')"
        )

    if request.cls is not None:
        return f"{request.cls.__name__}.{name}.yml"
    else:
        return f"{name}.yml"

Here’s my test rewritten to use that fixture:

class TestExampleDotCom:
    def test_status_code(self, cassette_name):
        with vcr.use_cassette(cassette_name):
            resp = httpx.get("https://example.com")
            assert resp.status_code == 200


@pytest.mark.parametrize(
    "url, status_code",
    [
        pytest.param("https://httpbin.org/status/200", 200, id="ok"),
        pytest.param("https://httpbin.org/status/404", 404, id="not_found"),
        pytest.param("https://httpbin.org/status/500", 500, id="server_error"),
    ],
)
def test_status_code(url, status_code, cassette_name):
    with vcr.use_cassette(cassette_name):
        resp = httpx.get(url)
        assert resp.status_code == status_code

The four tests now get distinct cassette filenames:

  • TestExampleDotCom.test_status_code
  • test_status_code[ok]
  • test_status_code[not_found]
  • test_status_code[server_error]

Explaining how to use cassettes with helpful errors

Most of the time, you don’t need to worry about how vcrpy works. If you’re running an existing test, then vcrpy is just a fancy test mock that happens to be reading its data from a YAML file. You don’t need to worry about the implementation details.

However, if you’re writing a new test, you need to record new cassettes. This can involve some non-obvious setup, especially if you’ve never done it before.

Let’s revisit an earlier example:

def test_flickr_api():
    with vcr.use_cassette(
        "fixtures/vcr_cassettes/test_flickr_api.yml",
        filter_query_parameters=[("api_key", "REDACTED_API_KEY")],
    ):
        api_key = os.environ.get("FLICKR_API_KEY", "API_KEY")

        resp = httpx.get(
            "https://api.flickr.com/services/rest/",
            params={
                "api_key": api_key,
                "method": "flickr.urls.lookupUser",
                "url": "https://www.flickr.com/photos/alexwlchan/",
            },
        )

        assert '<user id="199258389@N04">' in resp.text

If you run this test without passing a FLICKR_API_KEY environment variable, it will call the real Flickr API with the placeholder API key. Unsurprisingly, the Flickr API will return an error response, and your test will fail:

<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="fail">
  <err code="100" msg="Invalid API Key (Key has invalid format)" />
</rsp>

Worse still, vcrpy will record this error in the cassette file. Even if you work out you need to re-run the test with the env var, it will keep failing as it replays the recorded error.

Can we make this better? In this scenario, what I’d prefer is:

  1. The test fails if you don’t pass an env var
  2. The error explains how to run the test properly
  3. vcrpy doesn’t save a cassette file

I worked out how to get this nicer error handling. vcrpy has a before_record_response hook, which allows you to modify a response before it’s written to the cassette file. You could use this to redact secrets from responses, but I realised you could also use it to validate the response – and if you throw an exception, it prevents vcrpy from writing a cassette.

Here’s a hook I wrote, which checks if a vcrpy response is a Flickr API error telling us that we passed an invalid API key, and throws an exception if so:

def check_for_invalid_api_key(response):
    """
    Before we record a new response to a cassette, check if it's
    a Flickr API response telling us we're missing an API key -- that
    means we didn't set up the test correctly.

    If so, give the developer an instruction explaining what to do next.
    """
    try:
        body: bytes = response["body"]["string"]
    except KeyError:
        body = response["content"]

    is_error_response = (
        b'<err code="100" msg="Invalid API Key (Key has invalid format)" />' in body
    )

    if is_error_response:
        raise RuntimeError(
            "You tried to record a new call to the Flickr API, \n"
            "but the tests don't have an API key.\n"
            "\n"
            "Pass an API key as an env var FLICKR_API_KEY=ae84…,\n"
            "and re-run the test.\n"
        )

    return response

We can call this hook in our vcr.use_cassette call:

def test_flickr_api(cassette_name):
    with vcr.use_cassette(
        cassette_name,
        filter_query_parameters=[("api_key", "REDACTED_API_KEY")],
        decode_compressed_response=True,
        before_record_response=check_for_invalid_api_key,
    ):
        ...

Now, if you try to record a Flickr API call and don’t set the API key, you’ll get a helpful error explaining how to re-run the test correctly.

Wrapping everything in a fixture for convenience

This is all useful, but it’s a lot of boilerplate to add to every test. To make everything cleaner, I wrap vcrpy in a pytest fixture that returns an HTTP client I can use in my tests. This fixture allows me to configure vcrpy, and also do any other setup I need on the HTTP client – for example, adding authentication params or HTTP headers.

Here’s an example of such a fixture in a library for using the Flickr API:

@pytest.fixture
def flickr_api(cassette_name):
    with vcr.use_cassette(
        cassette_name,
        filter_query_parameters=[("api_key", "REDACTED_API_KEY")],
        decode_compressed_response=True,
        before_record_response=check_for_invalid_api_key,
    ):
        client = httpx.Client(
            params={"api_key": os.environ.get("FLICKR_API_KEY", "API_KEY")},
            headers={
                # Close the connection as soon as the API returns a
                # response, to fix pytest warnings about unclosed sockets.
                "Connection": "Close",
            },
        )

        yield client

This makes individual tests much shorter and simpler:

def test_flickr_api_without_boilerplate(flickr_api):
    resp = flickr_api.get(
        "https://api.flickr.com/services/rest/",
        params={
            "method": "flickr.urls.lookupUser",
            "url": "https://www.flickr.com/photos/alexwlchan/",
        },
    )

    assert '<user id="199258389@N04">' in resp.text

When somebody reads this test, they don’t need to think about the authentication or mocking; they can just see the API call that we’re making.


When I don’t use vcrpy

Although vcrpy is useful, there are times when I prefer to test my HTTP code in a different way. Here are a few examples.

If I’m testing error handling

If I’m testing my error handling code – errors like timeouts, connection failures, or 5xx errors – it’s difficult to record a real response. Even if I could find a reliable error case today, it might be fixed tomorrow, which makes it difficult to reproduce if I ever need to re-record a cassette.

When I test error handling, I prefer a pure-Python mock where I can see exactly what error conditions I’m creating.
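
For example, httpx ships with a MockTransport class that lets me raise exactly the error I want to test against. Here’s a minimal sketch – the handler and test are hypothetical, but MockTransport is part of httpx:

import httpx
import pytest


def raise_timeout(request):
    # Simulate the server taking too long to respond
    raise httpx.ReadTimeout("simulated timeout", request=request)


def test_my_code_handles_timeouts():
    client = httpx.Client(transport=httpx.MockTransport(raise_timeout))

    with pytest.raises(httpx.ReadTimeout):
        client.get("https://api.flickr.com/services/rest/")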

If I’m fetching lots of binary files

If my HTTP code is downloading images and video, storing them in a vcrpy cassette is pretty inefficient – they have to be encoded as base64. This makes the cassettes large, the extra decoding step slows my tests down, and the files are hard to inspect.

When I’m testing with binary files, I store them as standalone files in my fixtures directory (e.g. in tests/fixtures/images), and I write my own mock to read the file from disk. I can easily inspect or modify the fixture data, and I don’t have the overhead of using cassettes.
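
Here’s the shape of that kind of mock, again using httpx’s MockTransport – the directory layout and filename are just examples:

import pathlib

import httpx

FIXTURES = pathlib.Path("tests/fixtures/images")


def serve_fixture_file(request):
    # Serve a local fixture file instead of fetching the real image
    data = (FIXTURES / "example.jpg").read_bytes()
    return httpx.Response(200, content=data)


client = httpx.Client(transport=httpx.MockTransport(serve_fixture_file))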

If I’m testing future or hypothetical changes in an API

A vcrpy cassette locks in the current behaviour. But suppose I know about an upcoming change, or I want to check my code would handle an unusual response – I can’t capture that in a vcrpy cassette, because the server isn’t returning responses like that (yet).

In those cases, I either construct a vcrpy cassette with the desired response by hand, or I use a code-based mock to return my unusual response.


Summary

Using vcrpy has allowed me to write more thorough tests, and it does all the hard work of intercepting HTTP calls and serialising them to disk. It gives me high-fidelity snapshots of HTTP responses, allowing me to mock HTTP calls and avoid network requests in my tests. This makes my tests faster, more consistent, and more reliable.

Here’s a quick reminder of what I do to run vcrpy in production:

  • I use filter_query_parameters and filter_headers to keep secrets out of cassette files
  • I set decode_compressed_response=True to make cassettes more readable
  • I name cassettes after the test function they’re associated with
  • I throw errors if an HTTP client isn’t set up correctly when you try to record a cassette
  • I wrap everything in a fixture, to keep individual tests simpler

If you make HTTP calls from your tests, I really recommend it: https://vcrpy.readthedocs.io/en/latest/


Create space-saving clones on macOS with Python

2025-08-03 22:49:06

The standard Mac filesystem, APFS, has a feature called space-saving clones. This allows you to create multiple copies of a file without using additional disk space – the filesystem only stores a single copy of the data.

Although cloned files share data, they’re independent – you can edit one copy without affecting the other (unlike symlinks or hard links). APFS uses a technique called copy-on-write to store the data efficiently on disk – the cloned files continue to share any pieces they have in common.

Cloning files is both faster and uses less disk space than copying. If you’re working with large files – like photos, videos, or datasets – space-saving clones can be a big win.

Several filesystems support cloning, but in this post, I’m focusing on macOS and APFS.

For a recent project, I wanted to clone files using Python. There’s an open ticket to support file cloning in the Python standard library. In Python 3.14, there’s a new Path.copy() function which adds support for cloning on Linux – but there’s nothing yet for macOS.

In this post, I’ll show you two ways to clone files in APFS using Python.



What are the benefits of cloning?

There are two main benefits to using clones rather than copies.

Cloning files uses less disk space than copying

Because the filesystem only has to keep one copy of the data, cloning a file doesn’t use more space on disk. We can see this with an experiment. Let’s start by creating a random file with 1GB of data, and checking our free disk size:

$ dd if=/dev/urandom of=1GB.bin bs=64M count=16
16+0 records in
16+0 records out
1073741824 bytes transferred in 2.113280 secs (508092550 bytes/sec)

$ df -h -I /
Filesystem        Size    Used   Avail Capacity  Mounted on
/dev/disk3s1s1   460Gi    14Gi    43Gi    25%    /

My disk currently has 43GB available.

Let’s copy the file, and check the free disk space after it’s done. Notice that it decreases to 42GB, because the filesystem is now storing a second copy of this 1GB file:

$ # Copying
$ cp 1GB.bin copy.bin

$ df -h -I /
Filesystem        Size    Used   Avail Capacity  Mounted on
/dev/disk3s1s1   460Gi    14Gi    42Gi    25%    /

Now let’s clone the file by passing the -c flag to cp. Notice that the free disk space stays the same, because the filesystem is just keeping a single copy of the data between the original and the clone:

$ # Cloning
$ cp -c 1GB.bin clone.bin

$ df -h -I /
Filesystem        Size    Used   Avail Capacity  Mounted on
/dev/disk3s1s1   460Gi    14Gi    42Gi    25%    /

Cloning files is faster than copying

When you clone a file, the filesystem only has to write a small amount of metadata about the new clone. When you copy a file, it needs to write all the bytes of the entire file. This means that cloning a file is much faster than copying, which we can see by timing the two approaches:

$ # Copying
$ time cp 1GB.bin copy.bin
Executed in  260.07 millis

$ # Cloning
$ time cp -c 1GB.bin clone.bin
Executed in    6.90 millis

This roughly 38× difference is with my Mac’s internal SSD. In my experience, the speed difference is even more pronounced on slower disks, like external hard drives.

How do you clone files on macOS?

Using the “Duplicate” command in Finder

If you use the Duplicate command in Finder (File > Duplicate or ⌘D), it clones the file.

Using cp -c on the command line

If you use the cp (copy) command with the -c flag, and it’s possible to clone the file, you get a clone rather than a copy. If it’s not possible to clone the file – for example, if you’re on a non-APFS volume that doesn’t support cloning – you get a regular copy.

Here’s what that looks like:

$ cp -c src.txt dst.txt

Using the clonefile() function

There’s a macOS syscall clonefile() which creates space-saving clones. It was introduced alongside APFS.

Syscalls are quite low level, and they’re how programs are meant to interact with the operating system. I don’t think I’ve ever made a syscall directly – I’ve used wrappers like the Python os module, which make syscalls on my behalf, but I’ve never written my own code to call them.

Here’s a rudimentary C program that uses clonefile() to clone a file:

#include <stdio.h>
#include <stdlib.h>
#include <sys/clonefile.h>

int main(void) {
    const char *src = "1GB.bin";
    const char *dst = "clone.bin";

    /* clonefile(2) supports several options related to symlinks and
     * ownership information, but for this example we'll just use
     * the default behaviour */
    const int flags = 0;

    if (clonefile(src, dst, flags) != 0) {
        perror("clonefile failed");
        return EXIT_FAILURE;
    }

    printf("clonefile succeeded: %s ~> %s\n", src, dst);

    return EXIT_SUCCESS;
}

You can compile and run this program like so:

$ gcc clone.c

$ ./a.out
clonefile succeeded: 1GB.bin ~> clone.bin

$ ./a.out
clonefile failed: File exists

But I don’t use C in any of my projects – can I call this function from Python instead?

How do you clone files with Python?

Shelling out to cp -c using subprocess

The easiest way to clone a file in Python is by shelling out to cp -c with the subprocess module. Here’s a short example:

import subprocess

# Adding the `-c` flag means the file is cloned rather than copied,
# if possible.  See the man page for `cp`.
subprocess.check_call(["cp", "-c", "1GB.bin", "clone.bin"])

I think this snippet is pretty simple, and a new reader could understand what it’s doing. If they’re unfamiliar with file cloning on APFS, they might not immediately understand why this is different from shutil.copyfile, but they could work it out quickly.

This approach gets all the nice behaviour of the cp command – for example, if you try to clone on a volume that doesn’t support cloning, it falls back to a regular file copy instead. There’s a bit of overhead from spawning an external process, but the overall impact is negligible (and easily offset by the speed increase of cloning).

The problem with this approach is that error handling gets harder. The cp command fails with exit code 1 for every error, so you need to parse the stderr to distinguish different errors, or implement your own error handling.

In my project, I wrapped this cp call in a function which had some additional checks to spot common types of error, and throw them as more specific exceptions. Any remaining errors get thrown as a generic subprocess.CalledProcessError. Here’s an example:

from pathlib import Path
import subprocess


def clonefile(src: Path, dst: Path):
    """Clone a file on macOS by using the `cp` command."""
    # Check a couple of common error cases so we can get nice exceptions,
    # rather than relying on the `subprocess.CalledProcessError` from `cp`.
    if not src.exists():
        raise FileNotFoundError(src)

    if not dst.parent.exists():
        raise FileNotFoundError(dst.parent)

    # Adding the `-c` flag means the file is cloned rather than copied,
    # if possible.  See the man page for `cp`.
    subprocess.check_call(["cp", "-c", str(src), str(dst)])

    assert dst.exists()

For me, this code strikes a nice balance between being readable and returning good errors.

Calling the clonefile() function using ctypes

What if we want detailed error codes, and we don’t want the overhead of spawning an external process? Although I know it’s possible to make syscalls from Python using the ctypes library, I’ve never actually done it. This is my chance to learn!

Following the documentation for ctypes, these are the steps:

  1. Import ctypes and load a dynamic link library. This is the first thing we need to do – in this case, we’re loading the macOS link library that contains the clonefile() function.

    import ctypes
    
    libSystem = ctypes.CDLL("libSystem.B.dylib")
    

    I worked out that I need to load libSystem.B.dylib by looking at other examples of ctypes code on GitHub. I couldn’t find an explanation of it in Apple’s documentation.

    I later discovered that I can use otool to see the shared libraries that a compiled executable is linking to. For example, I can see that cp is linking to the same libSystem.B.dylib:

    $ otool -L /bin/cp
    /bin/cp:
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1351.0.0)
    

    This CDLL() call only works on macOS, which makes sense – it’s loading macOS libraries. If I run this code on my Debian web server, I get an error: OSError: libSystem.B.dylib: cannot open shared object file: No such file or directory.

  2. Tell ctypes about the function signature. If we look at the man page for clonefile(), we see the signature of the C function:

    int clonefile(const char * src, const char * dst, int flags);
    

    We need to tell ctypes to find this function inside libSystem.B.dylib, then describe the arguments and return type of the function:

    clonefile = libSystem.clonefile
    clonefile.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
    clonefile.restype = ctypes.c_int
    

    Although ctypes can call C functions without a declared signature, describing the signature is good practice, and it gives you some safety rails.

    For example, now ctypes knows that the clonefile() function takes three arguments. If I try to call the function with one or two arguments, I get a TypeError. If I didn’t specify the signature, I could call it with any number of arguments, but it might behave in weird or unexpected ways.

  3. Define the inputs for the function. This function needs three arguments.

    In the original C function, src and dst are char* – pointers to a null-terminated string of char values. In Python, this means the inputs need to be bytes values. Then flags is a regular Python int.

    # Source and destination files
    src = b"1GB.bin"
    dst = b"clone.bin"
    
    # clonefile(2) supports several options related to symlinks and
    # ownership information, but for this example we'll just use
    # the default behaviour
    flags = 0
    
  4. Call the function. Now we have the function available in Python, and the inputs in C-compatible types, we can call the function:

    import os
        
    if clonefile(src, dst, flags) != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
        
    print(f"clonefile succeeded: {src} ~> {dst}")
    

    If the clone succeeds, this program runs successfully. But if the clone fails, we get an unhelpful error: OSError: [Errno 0] Undefined error: 0.

    The point of calling the C function is to get useful error codes, but we need to opt-in to receiving them. In particular, we need to add the use_errno parameter to our CDLL call:

    libSystem = ctypes.CDLL("libSystem.B.dylib", use_errno=True)
    

    Now, when the clone fails, we get different errors depending on the type of failure. The exception includes the numeric error code, and Python will throw named subclasses of OSError like FileNotFoundError, FileExistsError, or PermissionError. This makes it easier to write try … except blocks for specific failures.

Here’s the complete script, which clones a single file:

import ctypes
import os

# Load the libSystem library
libSystem = ctypes.CDLL("libSystem.B.dylib", use_errno=True)

# Tell ctypes about the function signature
# int clonefile(const char * src, const char * dst, int flags);
clonefile = libSystem.clonefile
clonefile.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
clonefile.restype = ctypes.c_int

# Source and destination files
src = b"1GB.bin"
dst = b"clone.bin"

# clonefile(2) supports several options related to symlinks and
# ownership information, but for this example we'll just use
# the default behaviour
flags = 0

# Actually call the clonefile() function
if clonefile(src, dst, flags) != 0:
    errno = ctypes.get_errno()
    raise OSError(errno, os.strerror(errno))
    
print(f"clonefile succeeded: {src} ~> {dst}")

I wrote this code for my own learning, and it’s definitely not production-ready. It works in the happy case and helped me understand ctypes, but if you actually wanted to use this, you’d want proper error handling and testing.

In particular, there are cases where you’d want to fall back to shutil.copyfile or similar if the clone fails – say if you’re on an older version of macOS, or you’re copying files on a volume which doesn’t support cloning. Both those cases are handled by cp -c, but not the clonefile() syscall.
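
Here’s a sketch of what that fallback might look like, reusing the clonefile binding from the script above (which sets use_errno=True). The choice of error codes is an assumption – ENOTSUP and EXDEV are the cases I’d expect for “cloning isn’t possible here”:

import ctypes
import errno
import os
import shutil


def clone_or_copy(src: str, dst: str) -> None:
    """Clone `src` to `dst`, falling back to a regular copy if cloning fails."""
    try:
        if clonefile(src.encode(), dst.encode(), 0) != 0:
            e = ctypes.get_errno()
            raise OSError(e, os.strerror(e))
    except OSError as err:
        # ENOTSUP = the volume doesn't support cloning;
        # EXDEV = src and dst are on different volumes
        if err.errno in (errno.ENOTSUP, errno.EXDEV):
            shutil.copyfile(src, dst)
        else:
            raise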

In practice, how am I cloning files in Python?

In my project, I used cp -c with a wrapper like the one described above. It’s a short amount of code, pretty readable, and returns useful errors for common cases.

Calling clonefile() directly with ctypes might be slightly faster than shelling out to cp -c, but the difference is probably negligible. The downside is that it’s more fragile and harder for other people to understand – it would have been the only part of the codebase that was using ctypes.

File cloning made a noticeable difference. The project involved copying lots of files on an external USB hard drive, and cloning instead of copying full files made it much faster. Tasks that used to take over an hour were now completing in less than a minute. (The files were copied between folders on the same drive – cloned files have to be on the same APFS volume.)

I’m excited to see how file cloning works on Linux in Python 3.14 with Path.copy(), and I hope macOS support isn’t far behind.
