2025-10-07 14:19:18
I download a lot of videos from YouTube, and yt-dlp is my tool of choice. Sometimes I download videos as a one-off, but more often I’m downloading videos in a project – my bookmarks, my collection of TV clips, or my social media scrapbook.
I’ve noticed myself writing similar logic in each project – finding the downloaded files, converting them to MP4, getting the channel information, and so on. When you write the same thing multiple times, it’s a sign you should extract it into a shared tool – so that’s what I’ve done.
yt-dlp_alexwlchan is a script that calls yt-dlp with my preferred options – in particular, downloading the video as an MP4, saving the thumbnail and any subtitles, and collecting the channel information.
All this is presented in a CLI command which prints a JSON object that other projects can parse. Here’s an example:
$ yt-dlp_alexwlchan.py "https://www.youtube.com/watch?v=TUQaGhPdlxs"
{
"id": "TUQaGhPdlxs",
"url": "https://www.youtube.com/watch?v=TUQaGhPdlxs",
"title": "\"new york city, manhattan, people\" - Free Public Domain Video",
"description": "All videos uploaded to this channel are in the Public Domain: Free for use by anyone for any purpose without restriction. #PublicDomain",
"date_uploaded": "2022-03-25T01:10:38Z",
"video_path": "\uff02new york city, manhattan, people\uff02 - Free Public Domain Video [TUQaGhPdlxs].mp4",
"thumbnail_path": "\uff02new york city, manhattan, people\uff02 - Free Public Domain Video [TUQaGhPdlxs].jpg",
"subtitle_path": null,
"channel": {
"id": "UCDeqps8f3hoHm6DHJoseDlg",
"name": "Public Domain Archive",
"url": "https://www.youtube.com/channel/UCDeqps8f3hoHm6DHJoseDlg",
"avatar_url": "https://yt3.googleusercontent.com/ytc/AIdro_kbeCfc5KrnLmdASZQ9u649IxrxEUXsUaxdSUR_jA_4SZQ=s0"
},
"site": "youtube"
}
Rather than using the yt-dlp CLI, I’m using the Python interface. I can import the YoutubeDL class and pass it some options, then pull out the important fields from the response. The library is very flexible, and the options are well-documented.
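Here’s a minimal sketch of what that looks like. The options shown are illustrative – they aren’t the exact set my script passes – but format, writethumbnail and writesubtitles are all real yt-dlp options:
import yt_dlp

ydl_opts = {
    "format": "mp4",          # prefer an MP4 download
    "writethumbnail": True,   # save the thumbnail alongside the video
    "writesubtitles": True,   # save subtitles, if the video has any
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info("https://www.youtube.com/watch?v=TUQaGhPdlxs", download=True)

print(info["id"], info["title"], info["channel"])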
This is similar to my create_thumbnail
tool.
I only have to define my preferred behaviour once, then other code can call it as an external script.
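Because the script prints a JSON object, calling it from another project is just a subprocess call and json.loads – a sketch, assuming the script is on your PATH:
import json
import subprocess

result = subprocess.run(
    ["yt-dlp_alexwlchan.py", "https://www.youtube.com/watch?v=TUQaGhPdlxs"],
    capture_output=True, check=True, text=True,
)
video = json.loads(result.stdout)
print(video["video_path"], video["channel"]["name"])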
I have ideas for changes I might make in the future, like tidying up filenames or supporting more sites, but I’m pretty happy with this first pass. All the code is in my yt-dlp_alexwlchan GitHub repo.
This script is based on my preferences, so you probably don’t want to use it directly – but if you use yt-dlp a lot, it could be a helpful starting point for writing your own script.
Even if you don’t use yt-dlp, the idea still applies: when you find yourself copy-pasting configuration and options, turn it into a standalone tool. It keeps your projects cleaner and more consistent, and your future self will thank you for it.
2025-09-19 05:14:29
Today a colleague asked for a way to open all the files that have changed in a particular Git branch. They were reviewing a large pull request, and sometimes it’s easier to review files in your local editor than in GitHub’s code review interface. You can see the whole file, run tests or local builds, and get more context than the GitHub diffs.
This is the snippet I suggested:
git diff --name-only "$BRANCH_NAME" $(git merge-base origin/main "$BRANCH_NAME") \
| xargs open -a "Visual Studio Code"
It uses a couple of nifty Git features, so let’s break it down.
There are three parts to this command:
Work out where the dev branch diverges from main.
We can use git-merge-base
:
$ git merge-base origin/main "$BRANCH_NAME"
9ac371754d220fd4f8340dc0398d5448332676c3
This command gives us the common ancestor of our main branch and our dev branch – this is the tip of main when the developer created their branch.
In a small codebase, main might not have changed since the dev branch was created. But in a large codebase where lots of people are making changes, the main branch might have moved on since the dev branch was created.
Here’s a quick picture:
This tells us which commits we’re reviewing – what are the changes in this branch?
Get a list of files which have changed in the dev branch.
We can use git-diff
to see the difference between two commits.
If we add the --name-only
flag, it only prints a list of filenames with changes, not the full diffs.
$ git diff --name-only "$BRANCH_NAME" $(git merge-base …)
assets/2025/exif_orientation.py
src/_drafts/create-thumbnail-is-exif-aware.md
src/_images/2025/exif_orientation.svg
Because we're diffing between the tip of our dev branch, and the point where our dev branch diverged from main, this prints a list of files that have changed in the dev branch.
(I originally suggested using git diff --name-only "$BRANCH_NAME" origin/main
, but that's wrong.
That prints all the files that differ between the two branches, which includes changes merged to main after the dev branch was created.)
Open the files in our text editor.
I suggested piping to xargs
and open
, but there are many ways to do this:
$ git diff … | xargs open -a "Visual Studio Code"
The xargs
command is super useful for doing the same thing repeatedly – in this case, opening a bunch of files in VS Code.
You feed it a whitespace-separated string – here, one filename per line – and it splits the string into pieces, then runs the same command on each of them, one by one.
It’s equivalent to running:
open -a "Visual Studio Code" "assets/2025/exif_orientation.py"
open -a "Visual Studio Code" "src/_drafts/create-thumbnail-is-exif-aware.md"
open -a "Visual Studio Code" "src/_images/2025/exif_orientation.svg"
The open
command opens files, and the -a
flag tells it which application to use.
We mostly use VS Code at work, but you could pass any text editor here.
Reading the manpage for open
, I'm reminded that you can open multiple files at once, so I could have done this without using xargs
.
I instinctively reached for xargs
because I’m very familiar with it, and it’s a reliable way to take a command that takes a single input, and run it with many inputs.
2025-09-15 05:44:01
One of my favourite features added to web browsers in the last few years is text fragments.
Text fragments allow you to link directly to specific text on a web page, and some browsers will highlight the linked text – for example, by scrolling to it, or adding a coloured highlight. This is useful if I’m linking to a long page that doesn’t have linkable headings – I want it to be easy for readers to find the part of the page I was looking for.
Here’s an example of a URL with a text fragment:
https://example.com/#:~:text=illustrative%20examples
But I don’t find the syntax especially intuitive – I can never remember exactly what mix of colons and tildes to add to a URL.
To help me out, I’ve written a small bookmarklet to generate these URLs:
To install the bookmarklet, drag it to your bookmarks bar.
When I’m looking at a page and want to create a text fragment link, I select the text and click the bookmarklet. It works out the correct URL and shows it in a popup, ready to copy and paste. You can try it now – select some text on this page, then click the button to see the text fragment URL.
It’s a small tool, but it’s made my link sharing much easier.
Update, 16 September 2025: Smoljaguar on Mastodon pointed out that Firefox, Chrome, and Safari all have menu items for “Copy Link with Highlight” which does something very similar. The reason I don’t use these is because I didn’t know they exist!
I use Safari as my main browser, and this item is only available in the right-click menu. One reason I like bookmarklets is that they become items in the Bookmarks menu, and then it’s easy for me to assign keyboard shortcuts.
This is the JavaScript that gets triggered when you run the bookmarklet:
(function () {
  const selectedText = window.getSelection().toString().trim();
  if (!selectedText) {
    alert("You need to select some text!");
    return;
  }
  const url = new URL(window.location);
  url.hash = `:~:text=${encodeURIComponent(selectedText)}`;
  alert(url.toString());
})();
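The same logic is easy to reproduce outside the browser, for example if you want to generate these links in a script. Here’s a rough Python equivalent – text_fragment_url is a hypothetical helper, and urllib.parse.quote stands in for encodeURIComponent:
from urllib.parse import quote

def text_fragment_url(page_url: str, selected_text: str) -> str:
    """Build a URL that links to `selected_text` on `page_url`."""
    return f"{page_url}#:~:text={quote(selected_text, safe='')}"

print(text_fragment_url("https://example.com/", "illustrative examples"))
# https://example.com/#:~:text=illustrative%20examples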
2025-09-09 06:42:48
Resizing an image is one of those programming tasks that seems simple, but has some rough edges. One common mistake is forgetting to handle the EXIF orientation, which can make resized images look very different from the original.
Last year I wrote a create_thumbnail
tool to resize images, and today I released a small update.
Now it’s aware of EXIF orientation, and it no longer mangles these images.
This is possible thanks to a new version of the Rust image
crate, which just improved its EXIF support.
Images can specify an orientation in their EXIF metadata, which can describe both rotation and reflection. This metadata is usually added by cameras and phones, which can detect how you’re holding them, and tell viewing software how to display the picture later.
For example, if you take a photo while holding your camera on its side, the camera can record that the image should be rotated 90 degrees when viewed. If you use a front-facing selfie camera, the camera could note that the picture needs to be mirrored.
There are eight different values for EXIF orientation – rotating in increments of 90°, and mirrored or not. The default value is “1” (display as-is); the other seven cover the remaining combinations: mirrored horizontally (2), rotated 180° (3), mirrored vertically (4), and the 90°/270° rotations with and without mirroring (5–8).
You can see the EXIF orientation value with programs like Phil Harvey’s exiftool, which helpfully converts the numeric value into a human-readable description:
$ # exiftool's default output is human-readable
$ exiftool -orientation my_picture.jpg
Orientation : Rotate 270 CW
$ # or we can get the raw numeric value
$ exiftool -n -orientation my_picture.jpg
Orientation : 8
I use the image
crate to resize images in Rust.
My old code for resizing images would open the image, resize it, then save it back to disk. Here’s a short example:
use image::imageops::FilterType;
use std::error::Error;
fn main() -> Result<(), Box<dyn Error>> {
// Old method: doesn't know about EXIF orientation
let img = image::open("original.jpg")?;
img.resize(180, 120, FilterType::Lanczos3)
.save("thumbnail.jpg")?;
Ok(())
}
The thumbnail will keep the resized pixels in the same order as the original image, but the thumbnail doesn’t have the EXIF orientation metadata. This means that if the original image had an EXIF orientation, the thumbnail could look different, because it’s no longer being rotated/reflected properly.
When I wrote create_thumbnail
, the image
crate didn’t know anything about EXIF orientation – but last week’s v0.25.8 release added several functions related to EXIF orientation.
In particular, I can now read the orientation and apply it to an image:
use image::imageops::FilterType;
use image::{DynamicImage, ImageDecoder, ImageReader};
use std::error::Error;
fn main() -> Result<(), Box<dyn Error>> {
// New methods in image v0.25.8 know about EXIF orientation,
// and allow us to apply it to the image before resizing.
let mut decoder = ImageReader::open("original.jpg")?.into_decoder()?;
let orientation = decoder.orientation()?;
let mut img = DynamicImage::from_decoder(decoder)?;
img.apply_orientation(orientation);
img.resize(180, 120, FilterType::Lanczos3)
.save("thumbnail.jpg")?;
Ok(())
}
The thumbnail still doesn’t have any EXIF orientation data, but the pixels have been rearranged so the resized image looks similar to the original. That’s what I want.
Here’s a visual comparison of the three images. Notice how the thumbnail from the old code looks upside down:
[Image comparison: the original photo, the thumbnail from the old code (upside down), and the thumbnail from the new code (correct orientation).]
This test image comes from Dave Perrett’s exif-orientation-examples repo, which has a collection of images that were very helpful for testing this code.
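As an aside, if you’re resizing images in Python rather than Rust, Pillow has an equivalent helper, ImageOps.exif_transpose, which applies the EXIF orientation before you resize. A minimal sketch, assuming a reasonably recent Pillow – the filenames and size are just examples:
from PIL import Image, ImageOps

img = Image.open("original.jpg")

# Rotate/reflect the pixels according to the EXIF orientation tag,
# so the thumbnail ends up the same way round as the original.
img = ImageOps.exif_transpose(img)

img.thumbnail((180, 120), Image.Resampling.LANCZOS)
img.save("thumbnail.jpg")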
This is a small change, but it solves an annoyance I’ve hit in every project that deals with images. I’ve written this fix before, but images with an EXIF orientation are rare enough that I always forget about them when I start a new project – so I used to solve the same problem again and again.
By handling EXIF orientation in create_thumbnail
, I won’t have to think about this again.
That’s the beauty of a shared tool – I fix it once, and then it’s fixed for all my current and future projects.
2025-08-29 05:36:37
Testing code that makes HTTP requests can be difficult. Real requests are slow, flaky, and hard to control. That’s why I use a Python library called vcrpy, which does a one-off recording of real HTTP interactions, then replays them during future tests.
These recordings are saved to a “cassette” – a plaintext file that I keep alongside my tests and my code. The cassette ensures that all my tests get consistent HTTP responses, which makes them faster and more reliable, especially in CI. I only have to make one real network request, and then I can run my tests locally and offline.
In this post, I’ll show you how I use vcrpy in a production codebase – not just the basics, but also the patterns, pitfalls, and fixtures that make it work for a real team.
There are several reasons why I avoid real HTTP requests in my tests:
I want my tests to be fast, because then I’ll run them more often and catch mistakes sooner. An individual HTTP call might be quick, but stack up hundreds of them and tests really start to drag.
Even if my code is correct, my tests could fail because of problems on the remote server. What if I’m offline? What if the server is having a temporary outage? What if the server starts rate limiting me for making too many HTTP requests?
If my tests depend on the server having certain state, then the server state could change and break or degrade my test suite.
Sometimes this change is obvious. For example, suppose I’m testing a function to fetch photos from Flickr, and then the photo I’m using in my test gets deleted. My code works correctly for photos that still exist, but now my test starts failing.
Sometimes this change is more subtle. Suppose I’ve written a regression test for an edge case, and then the server state changes, so the example I’m checking is no longer an instance of the edge case. I could break the code and never realise, because the test would keep passing. My test suite would become less effective.
A lot of my HTTP calls require secrets, like API keys or OAuth tokens. If the tests made real HTTP calls, I’d need to copy those secrets to every environment where I’m running the tests. That increases the risk of the secret getting leaked.
If there are more reasons why a test could fail, then it takes longer to work out if the failure was caused by my mistake, or a change on the server.
If my test suite is returning consistent responses for HTTP calls, and those responses are defined within the test suite itself, then my tests get faster and more reliable. I’m not making real network calls, I’m not dependent on the behaviour of a server, and I don’t need real secrets to run the tests.
There are a variety of ways to define this sort of test mock; I like to record real responses because it ensures I’m getting a high-fidelity mock, and it makes it fairly easy to add new tests.
I know two Python libraries that record real HTTP responses: vcrpy and betamax, both based on a Ruby library called vcr. I’ve used all three, they behave in a similar way, and they work well.
I prefer vcrpy for Python because it supports a wide variety of HTTP libraries, whereas betamax only works with requests. I currently use a mixture of httpx and urllib3, and it’s convenient to test them both with the same library and test helpers.
I also like that vcrpy works without needing any changes to the code I’m testing. I can write HTTP code as I normally would, then I add a vcrpy decorator in my test and the responses get recorded. I don’t like test frameworks that require me to rewrite my code to fit – the tests should follow the code, not the other way round.
Here’s a test that uses vcrpy to fetch www.example.com
, and look for some text in the response.
I use vcr.use_cassette
as a context manager around the code that makes an HTTP request:
import httpx
import vcr
def test_example_domain():
with vcr.use_cassette("fixtures/vcr_cassettes/test_example_domain.yml"):
resp = httpx.get("https://www.example.com/")
assert "<h1>Example Domain</h1>" in resp.text
Alternatively, you can use vcr.use_cassette
as a decorator:
@vcr.use_cassette("fixtures/vcr_cassettes/test_example_domain.yml")
def test_example_domain():
resp = httpx.get("https://www.example.com/")
assert "<h1>Example Domain</h1>" in resp.text
With the decorator, you can also omit the path to the cassette file, and vcrpy will name the cassette file after the function:
@vcr.use_cassette()
def test_example_domain():
resp = httpx.get("https://www.example.com/")
assert "<h1>Example Domain</h1>" in resp.text
When I run this test using pytest (python3 -m pytest test_example.py
), vcrpy will check if the cassette file exists.
If the file is missing, it makes a real HTTP call and saves it to the file.
If the file exists, it replays the previously-recorded HTTP call.
By default, the cassette is a YAML file. Here’s what it looks like: test_example_domain.yml.
If a test makes more than one HTTP request, vcrpy records all of them in the same cassette file.
The cassette files contain the complete HTTP request and response, which includes the URL, form data, and HTTP headers. If I’m testing an API that requires auth, the HTTP request could include secrets like an API key or OAuth token. I don’t want to save those secrets in the cassette file!
Fortunately, vcrpy can filter sensitive data before it’s saved to the cassette file – HTTP headers, URL query parameters, or form data.
Here’s an example where I’m using filter_query_parameters
to redact an API key.
I’m replacing the real value with the placeholder REDACTED_API_KEY
.
import os
import httpx
import vcr
def test_flickr_api():
with vcr.use_cassette(
"fixtures/vcr_cassettes/test_flickr_api.yml",
filter_query_parameters=[("api_key", "REDACTED_API_KEY")],
):
api_key = os.environ.get("FLICKR_API_KEY", "API_KEY")
resp = httpx.get(
"https://api.flickr.com/services/rest/",
params={
"api_key": api_key,
"method": "flickr.urls.lookupUser",
"url": "https://www.flickr.com/photos/alexwlchan/",
},
)
assert '<user id="199258389@N04">' in resp.text
When I run this test the first time, I need to pass an env var FLICKR_API_KEY
.
This makes a real request and records a cassette, but with my redacted value.
When I run the test again, I don’t need to pass the env var, but the test will still pass.
You can see the complete YAML file in test_flickr_api.yml.
Notice how the api_key
query parameter has been redacted in the recorded request:
interactions:
- request:
…
uri: https://api.flickr.com/services/rest/?api_key=REDACTED_API_KEY&method=flickr.urls.lookupUser&url=https%3A%2F%2Fwww.flickr.com%2Fphotos%2Falexwlchan%2F
…
You can also tell vcrpy to omit the sensitive field entirely, but I like to insert a placeholder value. It’s useful for debugging later – you can see that a value was replaced, and easily search for the code that’s doing the redaction.
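filter_headers works the same way as filter_query_parameters. Here’s a minimal sketch, assuming an API that authenticates with an Authorization header – the URL, cassette name and env var are made up, and this reuses the httpx/os/vcr imports from the earlier examples:
def test_authed_api():
    with vcr.use_cassette(
        "fixtures/vcr_cassettes/test_authed_api.yml",
        filter_headers=[("authorization", "REDACTED_AUTH_HEADER")],
    ):
        resp = httpx.get(
            "https://api.example.com/whoami",
            headers={"Authorization": f"Bearer {os.environ.get('API_TOKEN', 'TOKEN')}"},
        )
        assert resp.status_code == 200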
If you look at the first two cassette files, you’ll notice that the response body is stored as base64-encoded binary data:
response:
body:
string: !!binary |
H4sIAAAAAAAAAH1UTXPbIBC9+1ds1UsyIyQnaRqPLWn6mWkPaQ9pDz0SsbKYCFAByfZ08t+7Qo4j
N5makYFdeLvvsZC9Eqb0uxah9qopZtljh1wUM6Bf5qVvsPi85aptED4ZxaXO0tE6G5co9BzKmluH
Po86X7FFBGkxcdbetwx/d7LPo49Ge9SeDWEjKMdZHnnc+nQIvzpAvYSkucI86iVuWmP9ZP9GCl/n
That’s because example.com
and api.flickr.com
both gzip compress their responses, and vcrpy is preserving that compression.
But gzip compression is handled by the HTTP libraries – my code never needs to worry about compression; it just gets the uncompressed response.
Where possible, I prefer to store responses in their uncompressed form. It makes the cassettes easier to read, and you can see if secrets are included in the saved response data. I also find it useful to read cassettes as an example of what an API response looks like – and in particular, what it looked like when I wrote the test. Cassettes have helped me spot undocumented changes in APIs.
Here’s an example where I’m using decode_compressed_response=True
to remove the gzip compression in the cassette:
def test_example_domain_with_decode():
with vcr.use_cassette(
"fixtures/vcr_cassettes/test_example_domain_with_decode.yml",
decode_compressed_response=True,
):
resp = httpx.get("https://www.example.com/")
assert "<h1>Example Domain</h1>" in resp.text
You can see the complete cassette file in test_example_domain_with_decode.yml. Notice the response body now contains an HTML string:
response:
body:
string: "<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n\n
\ <meta charset=\"utf-8\" />\n <meta http-equiv=\"Content-type\" content=\"text/html;
charset=utf-8\" />\n <meta name=\"viewport\" content=\"width=device-width,
If you write a lot of tests that use vcrpy, you’ll end up with a fixtures directory that’s full of cassettes. I like cassette names to match my test functions, so they’re easy to match up later.
I could specify a cassette name explicitly in every test, but that’s extra work and prone to error. Alternatively, I could use the decorator with its automatic cassette name – but vcrpy uses the name of the test function, which may not distinguish between tests. In particular, I often group tests into classes, or use parametrized tests to run the same test with different values.
Consider the following example:
import httpx
import pytest
import vcr
class TestExampleDotCom:
def test_status_code(self):
resp = httpx.get("https://example.com")
assert resp.status_code == 200
@vcr.use_cassette()
@pytest.mark.parametrize(
"url, status_code",
[
("https://httpbin.org/status/200", 200),
("https://httpbin.org/status/404", 404),
("https://httpbin.org/status/500", 500),
],
)
def test_status_code(url, status_code):
resp = httpx.get(url)
assert resp.status_code == status_code
This is four different tests, but vcrpy’s automatic cassette name is the same for each of them: test_status_code
.
The tests will fail if you try to run them – vcrpy will record a cassette for the first test that runs, then try to replay that cassette for the second test.
The second test makes a different HTTP request, so vcrpy will throw an error because it can’t find a matching request.
Here’s what I do instead: I have a pytest fixture to choose cassette names, which includes the name of the test class (if any) and the ID of the parametrized test case. Because I sometimes use URLs in parametrized tests, I also check the test case ID doesn’t include slashes or colons – I don’t want those in my filenames!
Here’s the fixture:
@pytest.fixture
def cassette_name(request: pytest.FixtureRequest) -> str:
"""
Returns the filename of a VCR cassette to use in tests.
The name can be made up of (up to) three parts:
- the name of the test class
- the name of the test function
- the ID of the test case in @pytest.mark.parametrize
"""
name = request.node.name
# This is to catch cases where e.g. I try to include a complete
# HTTP URL in a cassette name, which creates messy folders in
# the fixtures directory.
if ":" in name or "/" in name:
raise ValueError(
"Illegal characters in VCR cassette name - "
"please set a test ID with pytest.param(…, id='…')"
)
if request.cls is not None:
return f"{request.cls.__name__}.{name}.yml"
else:
return f"{name}.yml"
Here’s my test rewritten to use that new fixture:
class TestExampleDotCom:
def test_status_code(self, cassette_name):
with vcr.use_cassette(cassette_name):
resp = httpx.get("https://example.com")
assert resp.status_code == 200
@pytest.mark.parametrize(
"url, status_code",
[
pytest.param("https://httpbin.org/status/200", 200, id="ok"),
pytest.param("https://httpbin.org/status/404", 404, id="not_found"),
pytest.param("https://httpbin.org/status/500", 500, id="server_error"),
],
)
def test_status_code(url, status_code, cassette_name):
with vcr.use_cassette(cassette_name):
resp = httpx.get(url)
assert resp.status_code == status_code
The four tests now get distinct cassette filenames:
TestExampleDotCom.test_status_code
test_status_code[ok]
test_status_code[not_found]
test_status_code[server_error]
Most of the time, you don’t need to worry about how vcrpy works. If you’re running an existing test, then vcrpy is just a fancy test mock that happens to be reading its data from a YAML file. You don’t need to worry about the implementation details.
However, if you’re writing a new test, you need to record new cassettes. This can involve some non-obvious setup, especially if you’ve never done it before.
Let’s revisit an earlier example:
def test_flickr_api():
with vcr.use_cassette(
"fixtures/vcr_cassettes/test_flickr_api.yml",
filter_query_parameters=[("api_key", "REDACTED_API_KEY")],
):
api_key = os.environ.get("FLICKR_API_KEY", "API_KEY")
resp = httpx.get(
"https://api.flickr.com/services/rest/",
params={
"api_key": api_key,
"method": "flickr.urls.lookupUser",
"url": "https://www.flickr.com/photos/alexwlchan/",
},
)
assert '<user id="199258389@N04">' in resp.text
If you run this test without passing a FLICKR_API_KEY
environment variable, it will call the real Flickr API with the placeholder API key.
Unsurprisingly, the Flickr API will return an error response, and your test will fail:
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="fail">
<err code="100" msg="Invalid API Key (Key has invalid format)" />
</rsp>
Worse still, vcrpy will record this error in the cassette file. Even if you work out you need to re-run the test with the env var, it will keep failing as it replays the recorded error.
Can we make this better? In this scenario, what I’d prefer is for the test to fail immediately with a clear message telling me to re-run it with a real API key – and for vcrpy not to record the error response in a cassette. I worked out how to get this nicer error handling.
vcrpy has a before_record_response
hook, that allows you to modify a response before writing it to the cassette file.
You could use this to redact secrets from responses, but I realised you could also use it to validate the response – and if you throw an exception, it prevents vcrpy from writing a cassette.
Here’s a hook I wrote, which checks if a vcrpy response is a Flickr API error telling us that we passed an invalid API key, and throws an exception if so:
def check_for_invalid_api_key(response):
"""
Before we record a new response to a cassette, check if it's
a Flickr API response telling us we're missing an API key -- that
means we didn't set up the test correctly.
If so, give the developer an instruction explaining what to do next.
"""
try:
body: bytes = response["body"]["string"]
except KeyError:
body = response["content"]
is_error_response = (
b'<err code="100" msg="Invalid API Key (Key has invalid format)" />' in body
)
if is_error_response:
raise RuntimeError(
"You tried to record a new call to the Flickr API, \n"
"but the tests don't have an API key.\n"
"\n"
"Pass an API key as an env var FLICKR_API_KEY=ae84…,\n"
"and re-run the test.\n"
)
return response
We can call this hook in our vcr.use_cassette
call:
def test_flickr_api(cassette_name):
with vcr.use_cassette(
cassette_name,
filter_query_parameters=[("api_key", "REDACTED_API_KEY")],
decode_compressed_response=True,
before_record_response=check_for_invalid_api_key,
):
...
Now, if you try to record a Flickr API call and don’t set the API key, you’ll get a helpful error explaining how to re-run the test correctly.
This is all useful, but it’s a lot of boilerplate to add to every test. To make everything cleaner, I wrap vcrpy in a pytest fixture that returns an HTTP client I can use in my tests. This fixture allows me to configure vcrpy, and also do any other setup I need on the HTTP client – for example, adding authentication params or HTTP headers.
Here’s an example of such a fixture in a library for using the Flickr API:
@pytest.fixture
def flickr_api(cassette_name):
with vcr.use_cassette(
cassette_name,
filter_query_parameters=[("api_key", "REDACTED_API_KEY")],
decode_compressed_response=True,
before_record_response=check_for_invalid_api_key,
):
client = httpx.Client(
params={"api_key": os.environ.get("FLICKR_API_KEY", "API_KEY")},
headers={
# Close the connection as soon as the API returns a
# response, to fix pytest warnings about unclosed sockets.
"Connection": "Close",
},
)
yield client
This makes individual tests much shorter and simpler:
def test_flickr_api_without_boilerplate(flickr_api):
resp = flickr_api.get(
"https://api.flickr.com/services/rest/",
params={
"method": "flickr.urls.lookupUser",
"url": "https://www.flickr.com/photos/alexwlchan/",
},
)
assert '<user id="199258389@N04">' in resp.text
When somebody reads this test, they don’t need to think about the authentication or mocking; they can just see the API call that we’re making.
Although vcrpy is useful, there are times when I prefer to test my HTTP code in a different way. Here are a few examples.
If I’m testing my error handling code – errors like timeouts, connection failures, or 5xx errors – it’s difficult to record a real response. Even if I could find a reliable error case today, it might be fixed tomorrow, which makes it difficult to reproduce if I ever need to re-record a cassette.
When I test error handling, I prefer a pure-Python mock where I can see exactly what error conditions I’m creating.
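Here’s the kind of pure-Python mock I mean, sketched with httpx’s MockTransport (the URL is made up) – the handler raises exactly the error I want to exercise, with no cassette and no real network involved:
import httpx

def raise_timeout(request: httpx.Request) -> httpx.Response:
    # Simulate the server never answering, so I can test my timeout handling.
    raise httpx.ConnectTimeout("simulated timeout", request=request)

client = httpx.Client(transport=httpx.MockTransport(raise_timeout))

try:
    client.get("https://api.example.com/slow-endpoint")
except httpx.ConnectTimeout:
    print("timed out, as expected")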
If my HTTP code is downloading images and video, storing them in a vcrpy cassette is pretty inefficient – they have to be encoded as base64. This makes the cassettes large, the extra decoding step slows my tests down, and the files are hard to inspect.
When I’m testing with binary files, I store them as standalone files in my fixtures
directory (e.g. in tests/fixtures/images
), and I write my own mock to read the file from disk.
I can easily inspect or modify the fixture data, and I don’t have the overhead of using cassettes.
A vcrpy cassette locks in the current behaviour. But suppose I know about an upcoming change, or I want to check my code would handle an unusual response – I can’t capture that in a vcrpy cassette, because the server isn’t returning responses like that (yet).
In those cases, I either construct a vcrpy cassette with the desired response by hand, or I use a code-based mock to return my unusual response.
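The code-based version of that is the same MockTransport idea as above, but with the handler returning the unusual response instead of raising an error – a sketch with a made-up URL and response body:
import httpx

def unusual_response(request: httpx.Request) -> httpx.Response:
    # Pretend the API has started returning a field my code doesn't know about yet.
    return httpx.Response(200, json={"stat": "ok", "new_field": "surprise"})

client = httpx.Client(transport=httpx.MockTransport(unusual_response))
resp = client.get("https://api.example.com/resource")
assert resp.json()["new_field"] == "surprise"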
Using vcrpy has allowed me to write more thorough tests, and it does all the hard work of intercepting HTTP calls and serialising them to disk. It gives me high-fidelity snapshots of HTTP responses, allowing me to mock HTTP calls and avoid network requests in my tests. This makes my tests faster, consistent, and reliable.
Here’s a quick reminder of what I do to run vcrpy in production:
filter_query_parameters and filter_headers to keep secrets out of cassette files
decode_compressed_response=True to make cassettes more readable
If you make HTTP calls from your tests, I really recommend it: https://vcrpy.readthedocs.io/en/latest/
2025-08-03 22:49:06
The standard Mac filesystem, APFS, has a feature called space-saving clones. This allows you to create multiple copies of a file without using additional disk space – the filesystem only stores a single copy of the data.
Although cloned files share data, they’re independent – you can edit one copy without affecting the other (unlike symlinks or hard links). APFS uses a technique called copy-on-write to store the data efficiently on disk – the cloned files continue to share any pieces they have in common.
Cloning files is faster than copying, and it uses less disk space. If you’re working with large files – like photos, videos, or datasets – space-saving clones can be a big win.
Several filesystems support cloning, but in this post, I’m focusing on macOS and APFS.
For a recent project, I wanted to clone files using Python.
There’s an open ticket to support file cloning in the Python standard library.
In Python 3.14, there’s a new Path.copy()
function which adds support for cloning on Linux – but there’s nothing yet for macOS.
In this post, I’ll show you two ways to clone files in APFS using Python.
There are two main benefits to using clones rather than copies.
Because the filesystem only has to keep one copy of the data, cloning a file doesn’t use more space on disk. We can see this with an experiment. Let’s start by creating a random file with 1GB of data, and checking our free disk size:
$ dd if=/dev/urandom of=1GB.bin bs=64M count=16
16+0 records in
16+0 records out
1073741824 bytes transferred in 2.113280 secs (508092550 bytes/sec)
$ df -h -I /
Filesystem Size Used Avail Capacity Mounted on
/dev/disk3s1s1 460Gi 14Gi 43Gi 25% /
My disk currently has 43GB available.
Let’s copy the file, and check the free disk space after it’s done. Notice that it decreases to 42GB, because the filesystem is now storing a second copy of this 1GB file:
$ # Copying
$ cp 1GB.bin copy.bin
$ df -h -I /
Filesystem Size Used Avail Capacity Mounted on
/dev/disk3s1s1 460Gi 14Gi 42Gi 25% /
Now let’s clone the file by passing the -c
flag to cp
.
Notice that the free disk space stays the same, because the filesystem is just keeping a single copy of the data between the original and the clone:
$ # Cloning
$ cp -c 1GB.bin clone.bin
$ df -h -I /
Filesystem Size Used Avail Capacity Mounted on
/dev/disk3s1s1 460Gi 14Gi 42Gi 25% /
When you clone a file, the filesystem only has to write a small amount of metadata about the new clone. When you copy a file, it needs to write all the bytes of the entire file. This means that cloning a file is much faster than copying, which we can see by timing the two approaches:
$ # Copying
$ time cp 1GB.bin copy.bin
Executed in 260.07 millis
$ # Cloning
$ time cp -c 1GB.bin clone.bin
Executed in 6.90 millis
This roughly 38× difference is with my Mac’s internal SSD. In my experience, the speed difference is even more pronounced on slower disks, like external hard drives.
If you use the Duplicate command in Finder (File > Duplicate or ⌘D), it clones the file.
cp -c on the command line
(copy) command with the -c
flag, and it’s possible to clone the file, you get a clone rather than a copy.
If it’s not possible to clone the file – for example, if you’re on a non-APFS volume that doesn’t support cloning – you get a regular copy.
Here’s what that looks like:
$ cp -c src.txt dst.txt
The clonefile() function
which creates space-saving clones.
It was introduced alongside APFS.
Syscalls are quite low level, and they’re how programs are meant to interact with the operating system.
I don’t think I’ve ever made a syscall directly – I’ve used wrappers like the Python os
module, which make syscalls on my behalf, but I’ve never written my own code to call them.
Here’s a rudimentary C program that uses clonefile()
to clone a file:
#include <stdio.h>
#include <stdlib.h>
#include <sys/clonefile.h>
int main(void) {
const char *src = "1GB.bin";
const char *dst = "clone.bin";
/* clonefile(2) supports several options related to symlinks and
* ownership information, but for this example we'll just use
* the default behaviour */
const int flags = 0;
if (clonefile(src, dst, flags) != 0) {
perror("clonefile failed");
return EXIT_FAILURE;
}
printf("clonefile succeeded: %s ~> %s\n", src, dst);
return EXIT_SUCCESS;
}
You can compile and run this program like so:
$ gcc clone.c
$ ./a.out
clonefile succeeded: 1GB.bin ~> clone.bin
$ ./a.out
clonefile failed: File exists
But I don’t use C in any of my projects – can I call this function from Python instead?
cp -c using subprocess
The easiest way to clone a file in Python is by shelling out to cp -c
with the subprocess
module.
Here’s a short example:
import subprocess
# Adding the `-c` flag means the file is cloned rather than copied,
# if possible. See the man page for `cp`.
subprocess.check_call(["cp", "-c", "1GB.bin", "clone.bin"])
I think this snippet is pretty simple, and a new reader could understand what it’s doing.
If they’re unfamiliar with file cloning on APFS, they might not immediately understand why this is different from shutil.copyfile
, but they could work it out quickly.
This approach gets all the nice behaviour of the cp
command – for example, if you try to clone on a volume that doesn’t support cloning, it falls back to a regular file copy instead.
There’s a bit of overhead from spawning an external process, but the overall impact is negligible (and easily offset by the speed increase of cloning).
The problem with this approach is that error handling gets harder.
The cp
command fails with exit code 1 for every error, so you need to parse the stderr to distinguish different errors, or implement your own error handling.
In my project, I wrapped this cp
call in a function which had some additional checks to spot common types of error, and throw them as more specific exceptions.
Any remaining errors get thrown as a generic subprocess.CalledProcessError
.
Here’s an example:
from pathlib import Path
import subprocess
def clonefile(src: Path, dst: Path):
"""Clone a file on macOS by using the `cp` command."""
# Check a couple of common error cases so we can get nice exceptions,
# rather than relying on the `subprocess.CalledProcessError` from `cp`.
if not src.exists():
raise FileNotFoundError(src)
if not dst.parent.exists():
raise FileNotFoundError(dst.parent)
# Adding the `-c` flag means the file is cloned rather than copied,
# if possible. See the man page for `cp`.
subprocess.check_call(["cp", "-c", str(src), str(dst)])
assert dst.exists()
For me, this code strikes a nice balance between being readable and returning good errors.
clonefile() function using ctypes
What if we want detailed error codes, and we don’t want the overhead of spawning an external process?
Although I know it’s possible to make syscalls from Python using the ctypes
library, I’ve never actually done it.
This is my chance to learn!
Following the documentation for ctypes
, these are the steps:
Import ctypes
and load a dynamic link library.
This is the first thing we need to do – in this case, we’re loading the macOS link library that contains the clonefile()
function.
import ctypes
libSystem = ctypes.CDLL("libSystem.B.dylib")
I worked out that I need to load libSystem.B.dylib
by looking at other examples of ctypes
code on GitHub.
I couldn’t find an explanation of it in Apple’s documentation.
I later discovered that I can use otool
to see the shared libraries that a compiled executable is linking to.
For example, I can see that cp
is linking to the same libSystem.B.dylib
:
$ otool -L /bin/cp
/bin/cp:
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1351.0.0)
This CDLL()
call only works on macOS, which makes sense – it’s loading macOS libraries.
If I run this code on my Debian web server, I get an error: OSError: libSystem.B.dylib: cannot open shared object file: No such file or directory.
Tell ctypes
about the function signature.
If we look at the man page for clonefile()
, we see the signature of the C function:
int clonefile(const char * src, const char * dst, int flags);
We need to tell ctypes
to find this function inside libSystem.B.dylib
, then describe the arguments and return type of the function:
clonefile = libSystem.clonefile
clonefile.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
clonefile.restype = ctypes.c_int
Although ctypes
can call C functions if you don’t describe the signature, it’s a good practice and gives you some safety rails.
For example, now ctypes
knows that the clonefile()
function takes three arguments.
If I try to call the function with one or two arguments, I get a TypeError
.
If I didn’t specify the signature, I could call it with any number of arguments, but it might behave in weird or unexpected ways.
Define the inputs for the function. This function needs three arguments.
In the original C function, src
and dst
are char*
– pointers to a null-terminated string of char
values.
In Python, this means the inputs need to be bytes
values.
Then flags
is a regular Python int
.
# Source and destination files
src = b"1GB.bin"
dst = b"clone.bin"
# clonefile(2) supports several options related to symlinks and
# ownership information, but for this example we'll just use
# the default behaviour
flags = 0
Call the function. Now that we have the function available in Python and the inputs in C-compatible types, we can call it:
import os
if clonefile(src, dst, flags) != 0:
errno = ctypes.get_errno()
raise OSError(errno, os.strerror(errno))
print(f"clonefile succeeded: {src} ~> {dst}")
If the clone succeeds, this program runs successfully. But if the clone fails, we get an unhelpful error: OSError: [Errno 0] Undefined error: 0.
The point of calling the C function is to get useful error codes, but we need to opt-in to receiving them.
In particular, we need to add the use_errno
parameter to our CDLL
call:
libSystem = ctypes.CDLL("libSystem.B.dylib", use_errno=True)
Now, when the clone fails, we get different errors depending on the type of failure.
The exception includes the numeric error code, and Python will throw named subclasses of OSError
like FileNotFoundError
, FileExistsError
, or PermissionError
.
This makes it easier to write try … except
blocks for specific failures.
Here’s the complete script, which clones a single file:
import ctypes
import os
# Load the libSystem library
libSystem = ctypes.CDLL("libSystem.B.dylib", use_errno=True)
# Tell ctypes about the function signature
# int clonefile(const char * src, const char * dst, int flags);
clonefile = libSystem.clonefile
clonefile.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
clonefile.restype = ctypes.c_int
# Source and destination files
src = b"1GB.bin"
dst = b"clone.bin"
# clonefile(2) supports several options related to symlinks and
# ownership information, but for this example we'll just use
# the default behaviour
flags = 0
# Actually call the clonefile() function
if clonefile(src, dst, flags) != 0:
errno = ctypes.get_errno()
raise OSError(errno, os.strerror(errno))
print(f"clonefile succeeded: {src} ~> {dst}")
I wrote this code for my own learning, and it’s definitely not production-ready.
It works in the happy case and helped me understand ctypes
, but if you actually wanted to use this, you’d want proper error handling and testing.
In particular, there are cases where you’d want to fall back to shutil.copyfile
or similar if the clone fails – say if you’re on an older version of macOS, or you’re copying files on a volume which doesn’t support cloning.
Both those cases are handled by cp -c
, but not the clonefile()
syscall.
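If I did want to use the ctypes version for real, I’d wrap it in a fallback of my own. Here’s a minimal sketch – clone_or_copy is a hypothetical helper, and a real implementation would inspect ctypes.get_errno() before deciding whether to fall back:
import ctypes
import shutil

libSystem = ctypes.CDLL("libSystem.B.dylib", use_errno=True)
clonefile = libSystem.clonefile
clonefile.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
clonefile.restype = ctypes.c_int

def clone_or_copy(src: bytes, dst: bytes) -> None:
    """Clone a file if possible, falling back to a regular copy."""
    if clonefile(src, dst, 0) == 0:
        return
    # Cloning failed – e.g. an older macOS, or a volume that doesn't
    # support cloning – so fall back to an ordinary copy.
    shutil.copyfile(src, dst)

clone_or_copy(b"1GB.bin", b"clone.bin")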
In my project, I used cp -c
with a wrapper like the one described above.
It’s a short amount of code, pretty readable, and returns useful errors for common cases.
Calling clonefile()
directly with ctypes
might be slightly faster than shelling out to cp -c
, but the difference is probably negligible.
The downside is that it’s more fragile and harder for other people to understand – it would have been the only part of the codebase that was using ctypes
.
File cloning made a noticeable difference. The project involved copying lots of files on an external USB hard drive, and cloning instead of copying full files made it much faster. Tasks that used to take over an hour were now completing in less than a minute. (The files were copied between folders on the same drive – cloned files have to be on the same APFS volume.)
I’m excited to see how file cloning works on Linux in Python 3.14 with Path.copy()
, and I hope macOS support isn’t far behind.