2025-02-20 10:00:00
The Apache Iceberg community has raised a PR, "S3: Disable strong integrity checksums", to disable the newly introduced integrity checksums in the AWS S3 SDKs.
AWS has introduced default data integrity protections for new objects in Amazon S3, which is a positive development. However, they have also chosen to update the default settings in all SDKs, breaking compatibility with nearly all S3-compatible services.
To be clear, the checksum itself is a good thing in my eyes: crc64-nvme looks great. It's fast, secure, and has excellent SIMD support. As a user and developer, I'm excited to use it and integrate it into my projects. However, the S3 API is more than AWS's own service. Many S3-compatible services recommend that their users use the S3 SDKs directly, and changing the default settings in this way has a direct impact on those users.
For example:
The recent AWS SDK bump introduced strong integrity checksums, and broke compatibility with many S3-compatible object storages (pre-2025 Minio, Vast, Dell EC etc).
In Trino project, we received the error report (Missing required header for this request: Content-Md5) from several users and had to disable the check temporarily. We recommend disabling it in Iceberg as well. I faced this issue when I tried upgrading Iceberg library to 1.8.0 in Trino.
Although the feature itself is good, the AWS team has rolled it out poorly by enforcing it by default, causing issues for many users of S3-compatible services. This reminds me of the position Apache OpenDAL should take.
OpenDAL integrates with all services by talking to their APIs directly instead of relying on SDKs, which protects users from disruptions like this one. The OpenDAL community also takes checksum support seriously and is working on a solution that benefits users while staying compatible with services that don't support the new checksums.
Maybe it's time to move away from the S3 SDKs and switch to OpenDAL if you just want to access S3-compatible services for data.
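As a rough illustration, here is a minimal sketch of writing and reading through OpenDAL's Rust API against an S3-compatible endpoint. It assumes a recent OpenDAL release with chainable service builders; the bucket, endpoint, and credentials are placeholders.

use opendal::services::S3;
use opendal::Operator;
use opendal::Result;

#[tokio::main]
async fn main() -> Result<()> {
    // Point the S3 service at any S3-compatible endpoint; all values are placeholders.
    let builder = S3::default()
        .bucket("my-bucket")
        .region("us-east-1")
        .endpoint("http://127.0.0.1:9000")
        .access_key_id("my-access-key")
        .secret_access_key("my-secret-key");
    let op = Operator::new(builder)?.finish();

    // The same read/write calls work unchanged for every service OpenDAL supports.
    op.write("hello.txt", "hello, world!".as_bytes().to_vec()).await?;
    let bs = op.read("hello.txt").await?;
    println!("read {} bytes", bs.len());
    Ok(())
}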
2025-02-18 10:00:00
I am happy to announce the release of BackON v1.4.0 via archive.is.
BackON is a Rust library that makes retries feel like a built-in feature of Rust.
use anyhow::Result;
use backon::ExponentialBuilder;
use backon::Retryable;
async fn fetch() -> Result<String> {
    Ok("hello, world!".to_string())
}
#[tokio::main]
async fn main() -> Result<()> {
    let content = fetch.retry(ExponentialBuilder::default()).await?;
    println!("{content}");
    Ok(())
}
The biggest change in this release was introduced by @wackazong, who correctly added std support for no-std without requiring a global random seed. With this improvement, users on no-std can now use jitter properly as well.
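In practice, jitter is just another knob on the backoff builder. Here is a minimal sketch; the builder method names below are the ones I remember from BackON's API, so double-check the docs:

use std::time::Duration;
use backon::ExponentialBuilder;

// An exponential policy with jitter enabled; as of v1.4.0 this also works
// on no-std targets without a global random seed.
fn retry_policy() -> ExponentialBuilder {
    ExponentialBuilder::default()
        .with_min_delay(Duration::from_millis(100))
        .with_max_times(5)
        .with_jitter()
}

It can then be passed to .retry() exactly like the default builder in the example above.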
@NumberFour8 brought us futures-timer support, allowing backon to be used in an async context without relying on Tokio. @Matt3o12 contributed by making some functions const, enabling many backoff builder APIs to be used in a const context.
Thank you all!
As of the v1.4.0 release, BackON is now:
Thank you all for your trust—let's make retries feel like a built-in feature in Rust!
2025-02-18 09:00:00
I wrote Why should fastcrc exist? via archive.is to start building fastcrc.
I encountered this problem while implementing crc64/nvme for S3 storage. However, I found that the existing implementations were all in poor shape. Instead of simply forking and fixing a single crate, I decided to improve the entire Rust CRC ecosystem by creating fastcrc, a repository that consolidates all CRC implementations, similar to RustCrypto/hashes.
In this way, we can ensure that:
All CRC implementations will share the same core (if possible) and the same dependencies, provide a similar API, and be maintained by the same group.
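To make the "similar API" point concrete, here is a purely hypothetical sketch of what a shared, digest-style interface across CRC variants could look like; it is not fastcrc's actual API.

// Hypothetical trait, for illustration only; fastcrc's real API may differ.
pub trait Checksum: Default {
    /// Feed more input into the running checksum.
    fn update(&mut self, data: &[u8]);
    /// Finish and return the checksum value.
    fn finalize(self) -> u64;
}

// Callers could then swap CRC-32, CRC-64/NVME, etc. without changing their code.
pub fn checksum_of<C: Checksum>(data: &[u8]) -> u64 {
    let mut c = C::default();
    c.update(data);
    c.finalize()
}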
By the way, Fast Labs is a real thing. We have built many fast projects, like fastrace, fastimer, fastant, and logforth. Welcome to join us!
2025-02-17 09:00:00
Happy to read F-Droid Awarded Open Technology Fund's FOSS Sustainability Grant from F-Droid via archive.is.
Open source is playing an increasingly important role in today's world. The discussion on how to support and sustain the open-source community has been ongoing for years, and I'm delighted to see people taking action toward directly funding projects.
We are excited to announce that F-Droid has been awarded $396,044 from the Open Technology Fund’s FOSS Sustainability Fund. This grant is specifically designed to support free and open-source software (FOSS) projects in addressing long-term sustainability challenges, and we are honored to be among the recipients.
Congrats!
The Free and Open Source Software (FOSS) Sustainability Fund is backed by the Open Technology Fund (OTF), a US non-profit organization that works to advance internet freedom in repressive environments by supporting the research, development, implementation, and maintenance of technologies that counter censorship and combat repressive surveillance, so that all citizens can exercise their fundamental human rights online.
The organization has supported many well-known open-source projects, including F-Droid, Let's Encrypt, and FileZilla. From its latest annual report, for 2022, we can observe the following number: "Of the funds allocated for internet freedom, $27 million was designated for OTF through USAGM".
At the same time, the Sovereign Tech Agency, which supports the development, improvement, and maintenance of open digital infrastructure and is sponsored by the German Federal Ministry for Economic Affairs and Climate Action, was allocated 13 million euros for its projects in 2022.
Apart from these government funding efforts, the industry has also launched a campaign called Open Source Pledge. Its message is super clear: "Whether you're a CEO, CFO, CTO, or just a dev, your company surely depends on Open Source software. It's time to pay the maintainers."
The action is simple to understand:
Open Source Pledge reported that over the past year, they have paid $2.5 million to maintainers. That's a really great number to be aware of.
I hope more and more countries or companies will join this effort.
2025-02-12 09:00:00
OpenDAL received an issue from the GitHub staff regarding Time Sensitive: GitHub Actions Cache Service Integration, reported by Bassem Dghaidi, who works on GitHub Actions. via archive.is
Apache OpenDAL is an Open Data Access Layer that facilitates seamless interaction with various storage services. It offers a service called ghac, which enables OpenDAL users to access the GitHub Actions Cache Service just like other storage services such as AWS S3. This service is utilized by sccache and pants to store and retrieve build artifacts, accelerating the build process.
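For context, this is roughly what using ghac looks like from the user's side. It's a sketch from memory, assuming a recent OpenDAL release: inside a GitHub Actions job the service discovers the cache endpoint and runtime token from the job's environment, so no explicit credentials appear in the code.

use opendal::services::Ghac;
use opendal::Operator;
use opendal::Result;

#[tokio::main]
async fn main() -> Result<()> {
    // The ghac service reads the Actions cache endpoint and token from the
    // environment provided to the job, so no configuration is needed here.
    let op = Operator::new(Ghac::default())?.finish();

    // Cache entries then behave like objects in any other OpenDAL service.
    op.write("cache/deps.tar", b"cached bytes".to_vec()).await?;
    let restored = op.read("cache/deps.tar").await?;
    println!("restored {} bytes", restored.len());
    Ok(())
}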
However, ghac itself is not a public API service provided by GitHub. We implemented it by studying the code of actions/cache and replicating the same logic in OpenDAL.
So technically, OpenDAL is not a typical user of GHAC, and conversely, GitHub has no responsibility for OpenDAL's use of GHAC.
That's why I really appreciate the GitHub team, especially Bassem Dghaidi, for their support and helpful notifications.
We have identified that this project is integrating with the legacy cache service without using the official and supported package. Unfortunately, that means that you have to introduce code changes in order to be compatible with the new service we're rolling out.
I believe it's important to include the API version and User-Agent in API design. This ensures you can identify which version of the API users are accessing and which user agent is being used to interact with the service.
Also, from the user's side: Unless you have a specific reason to remain anonymous in certain cases, I encourage you to follow the same approach—let the service know who you are.
The new service uses an entirely new set of internal API endpoints. To help with your changes we have provided the proto definitions below to help generate compatible clients to speed up your migration.
These internal APIs were never intended for consumption the way your project is at the moment. Since this is not a paved path we endorse, it's possible there will be breaking changes in the future. We are reaching out as a courtesy because we do not wish to break the workflows dependent on this project.
Wow, this guy even provided the proto definitions! I'm almost in tears seeing words like, "We do not wish to break the workflows dependent on this project."
I mean, they didn’t have to—it's not part of their business. They could have simply blamed OpenDAL and other clients relying on GHAC, saying, "This is not a paved path we endorse." It would have been enough just to give us a heads-up before the breakage. But instead, they went the extra mile, showed up, and even provided the proto definitions to help us with the migration.
Good job!
Please introduce the necessary changes ASAP before the end of February. Otherwise, storing and retrieving cache entries will start to fail. There will be no need to offer backward compatibility as the new service will be rolled out to all repositories by February 13th 2025.
I have created a tracking issue: Tracking issue for GHAC service upgrade. Feel free to check it out!
2025-02-11 09:00:00
TPC-DS Extension from duckdb via archive.is
TPC-H and TPC-DS are the most widely used big data benchmarks, maintained by the TPC. However, the TPC data generator is not open source, requires an email submission to download, and is not actively maintained. The code targets GCC 9.x and fails to compile on newer GCC versions. Consequently, generating data for TPC-H and TPC-DS tests is often a frustrating challenge.
But hey, DuckDB to the rescue! No need to build or search for documentation—just use DuckDB instead! In this post, I will demonstrate how to use DuckDB to generate TPC test data and export it as Parquet files for loading.
DuckDB is widely available in various Linux distributions. You can also install it with pip install duckdb. On my Arch Linux system, I use paru -S duckdb to install it.
Start DuckDB by simply running duckdb. No setup, no configuration, it just works, like SQLite! If you want to store data on disk instead of just in memory, run duckdb /path/to/duckdb.data instead.
The extensions tpch and tpcds are bundled and enabled by default. This means we can directly use the functions they provide, such as:
For TPC-H:
CALL dbgen(sf = 1);
For TPC-DS:
CALL dsdgen(sf = 1);
Use sf (the scale factor) to control the size of the generated data.
After the test data has been generated, we can use DuckDB's native EXPORT DATABASE statement to export the in-memory data as Parquet:
EXPORT DATABASE 'tpcds_parquet' (FORMAT PARQUET);