Xuanwo | 漩涡

ASF Member, Apache OpenDAL PMC Chair, Rust contributor, advocating for data freedom.

Xuanwo | 漩涡's RSS Preview

Why Is S3 ListObjects Taking 120s to Respond?

2025-05-13 09:00:00

Everyone knows that AWS S3 ListObjects is slow, but can you imagine it being so slow that you have to wait 120 seconds for a response? I've actually seen this happen in the wild.

TL;DR

Delete markers can severely degrade list performance. Make sure to enable lifecycle rules to remove them.

Background

Databend is a cloud-native data warehouse that supports S3 as its storage backend. It includes built-in vacuum functions to delete orphaned objects. Essentially, it loads table snapshots to determine all objects that are still referenced and deletes any objects not referenced by any snapshot. As an optimization, Databend writes data blobs to paths containing time-sortable UUIDs (specifically, UUIDv7). This enables Databend to take advantage of the ListObjects behavior, where all keys are sorted in lexicographical order. So, Databend can simply compute a delete_until key and remove all objects with keys less than delete_until.
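To make the idea concrete, here is a minimal sketch of that vacuum pattern using OpenDAL. It is not Databend's actual implementation; the PREFIX constant and the delete_until argument are purely illustrative:

use anyhow::Result;
use futures::TryStreamExt;
use opendal::Operator;

/// Remove every object whose key sorts before `delete_until`.
/// Keys embed UUIDv7, so lexicographic order matches creation time.
async fn vacuum_before(op: &Operator, delete_until: &str) -> Result<()> {
    const PREFIX: &str = "data/"; // illustrative prefix

    // ListObjects returns keys in lexicographic order, so we can stop
    // as soon as we reach a key that sorts at or after `delete_until`.
    let mut lister = op.lister(PREFIX).await?;
    while let Some(entry) = lister.try_next().await? {
        if entry.path() >= delete_until {
            break;
        }
        op.delete(entry.path()).await?;
    }
    Ok(())
}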

One day, users reported that the vacuum operation failed due to an opendal list timeout.

6e8a1700-f629-4df4-9596-f9a6508c5f4b: http query has change state to Stopped, reason Err(StorageOther. Code: 4000, Text = Unexpected (persistent) at List::next => io operation timeout reached

Context:
 timeout: 60

OpenDAL is a Rust library that offers a unified interface for accessing various storage backends, including S3. It features a built-in TimeoutLayer that helps prevent problematic requests from hanging forever. The default timeout is set to 60 seconds, and the vacuum operation failed because it exceeded this limit.
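For reference, here is a rough sketch of how such a timeout can be attached when building the operator. The bucket name is illustrative, and the exact builder methods may differ slightly between OpenDAL versions:

use std::time::Duration;

use opendal::layers::TimeoutLayer;
use opendal::services::S3;
use opendal::Operator;

fn build_operator() -> opendal::Result<Operator> {
    let builder = S3::default()
        .bucket("my-bucket") // illustrative bucket name
        .region("us-east-2");

    // Fail fast instead of hanging forever; 60s mirrors the default
    // timeout mentioned above.
    Ok(Operator::new(builder)?
        .layer(TimeoutLayer::new().with_timeout(Duration::from_secs(60)))
        .finish())
}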

Databend has become a complex system, and we've encountered many SQL hang issues in the past. So, my initial thought when addressing this ticket was whether there might be areas where we haven't handled things properly, potentially causing problems for tokio.

Debugging

After a quick review of the codebase, I didn’t spot anything obviously wrong. With no clear culprit in sight, I decided to try and reproduce the issue myself. The affected table had been in use for over a year and had accumulated a significant amount of data. Of course, I couldn’t hope to replicate the user’s dataset exactly, but my aim was to capture the general pattern. If I could demonstrate a significant slowdown in ListObjects under certain conditions, the precise scale would just be a matter of degree.

I directed the AI to generate the code for me using OpenDAL.

By the way, I'm using Zed and Claude 3.7 Sonnet through GitHub Copilot.

The code looks roughly like this:

// S3 configuration
let mut builder = S3::default();
builder = builder.bucket("s3-invalid-xml-test");
builder = builder.region("us-east-2");

let op = Operator::new(builder)?
    .layer(RetryLayer::new().with_jitter())
    .finish();

...

// Generate a time-ordered UUIDv7
let uuid = Uuid::now_v7();
let key = format!("{}{}", PREFIX, uuid);

// Create a small file with the generated key
match op.write(&key, "hello, world").await {
    Ok(_) => {
        written.fetch_add(1, Ordering::Relaxed);
        Ok(())
    }
    Err(e) => Err(anyhow::anyhow!("Failed to create key {}: {}", key, e)),
}

I tried various patterns; to save time, let's skip them and go straight to the problematic one:

  • Have a bucket with versioning enabled.
  • Generate a large number of files (millions or even billions).
  • Delete all of them (a sketch of this step follows right after this list).
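Assuming an OpenDAL operator op like the one above, the deletion step is roughly the following sketch (PREFIX is the same illustrative prefix used earlier):

// With versioning enabled, this issues DeleteObject for every key,
// which leaves a delete marker behind instead of physically removing
// the data.
op.remove_all(PREFIX).await?;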

At this point, the bucket is empty. Listing the bucket with the prefix / or /z should, in theory, produce the same result.

What I found was that listing the entire bucket is much slower than listing with the /z prefix. For example, in a bucket where 10 million objects had been deleted, a full list operation would take over 500ms, while listing with /z took only about 8ms. In some cold-start cases, the initial list could take more than 30 seconds to return.

For example:

Starting comparison of latency differences in listing operations with different prefixes
Test parameters: Up to 1000 objects listed per operation, 5 test rounds (including 2 warm-up rounds)

Testing the entire bucket...
Performing 2 warm-up rounds...
 Warm-up #1: Listed 0 objects, time taken: 542.618835ms
 Warm-up #2: Listed 0 objects, time taken: 525.818171ms
Starting 5 official test rounds...
 Test #1: Listed 0 objects, time taken: 536.598969ms
 Test #2: Listed 0 objects, time taken: 539.10924ms
 Test #3: Listed 0 objects, time taken: 531.185516ms
 Test #4: Listed 0 objects, time taken: 536.617262ms
 Test #5: Listed 0 objects, time taken: 537.548909ms
 Average latency for entire bucket: 536.211979ms
 Median latency for entire bucket: 536.617262ms

Testing prefix 'z'...
Performing 2 warm-up rounds...
 Warm-up #1: Listed 0 objects, time taken: 9.004738ms
 Warm-up #2: Listed 0 objects, time taken: 7.567935ms
Starting 5 official test rounds...
 Test #1: Listed 0 objects, time taken: 7.752857ms
 Test #2: Listed 0 objects, time taken: 10.301437ms
 Test #3: Listed 0 objects, time taken: 8.822386ms
 Test #4: Listed 0 objects, time taken: 8.266962ms
 Test #5: Listed 0 objects, time taken: 8.190696ms
 Average latency for prefix 'z': 8.666867ms
 Median latency for prefix 'z': 8.266962ms

====== Latency Comparison Results ======
Entire bucket: Average 536.211979ms
Prefix 'z': Average 8.666867ms
Listing the entire bucket is 61.87 times slower than listing with prefix 'z'
=========================

I also found that a cold start on the same bucket can be quite slow. In some cases, the warm-up took over 30 seconds:

Testing the entire bucket...
Performing 2 rounds of warm-up...
 Warm-up #1: Listed 0 objects, took 31.881571371s
 Warm-up #2: Listed 0 objects, took 1.243807263s
Starting 5 rounds of formal testing...
 Test #1: Listed 0 objects, took 4.264687095s
 Test #2: Listed 0 objects, took 542.109058ms
 Test #3: Listed 0 objects, took 537.914204ms
 Test #4: Listed 0 objects, took 529.365008ms
 Test #5: Listed 0 objects, took 528.04485ms
 Average latency for the entire bucket: 1.280424043s
 Median latency for the entire bucket: 537.914204ms

Why? Why can listing objects be so slow?

Analysis

Let's go back to How S3 Versioning works. After enabling versioning, S3 creates a delete marker for each object you delete. When calling ListObjects, S3 filters out all delete markers and returns only the current versions of the objects.

For example, here is a simple bucket containing only two objects: x and y. x has only one version, while y has two versions.

Actual Storage            ListObjects Results
--------------            -------------------
x (v1)                    x (v1)
y (v1)
y (v2)                    y (v2)

Obviously, ListObjects will only return x (v1) and y (v2). If we delete y, the result will be:

Actual Storage            ListObjects Results
--------------            -------------------
x (v1)                    x (v1)
y (v1)
y (v2)
y (v3: delete marker)

S3 will add a delete marker as v3 for y and exclude it from the results. The key point is that the delete marker still exists in the bucket, so S3 still needs to check for it when listing objects. I believe the AWS S3 team has explored various optimization methods, but this can still be an issue if your bucket contains a large number of delete markers.

In the most severe cases, such as the following:

Actual Storage            ListObjects Results
--------------            -------------------
t1 (delete marker)
t2 (delete marker)
t3 (delete marker)
...
t9999999 (v1)             t9999999 (v1)

S3 needs to scan a large number of delete markers before it can return the results. This is why listing objects can be very slow, and it may even appear as if the HTTP connection is hanging.
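A rough mental model of why this costs so much (this is not AWS's actual implementation, just a toy sketch of a versioned key index):

/// A toy model of a versioned key index. It only illustrates why delete
/// markers make listing expensive.
enum Version {
    Current(String),      // the live version, returned by ListObjects
    DeleteMarker(String), // tombstone left by a versioned delete
}

fn list_objects(index: &[Version], page_size: usize) -> Vec<&str> {
    let mut results = Vec::new();
    for v in index {
        // Millions of delete markers still have to be walked here,
        // even though none of them appear in the response.
        if let Version::Current(key) = v {
            results.push(key.as_str());
            if results.len() == page_size {
                break;
            }
        }
    }
    results
}

fn main() {
    // 9,999,998 tombstones followed by a single live object.
    let mut index: Vec<Version> = (1..9_999_999)
        .map(|i| Version::DeleteMarker(format!("t{i}")))
        .collect();
    index.push(Version::Current("t9999999".to_string()));

    // The scan has to pass every tombstone before the first result shows up.
    println!("{:?}", list_objects(&index, 1000));
}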

AWS mentions this in their documentation on performance degradation after enabling bucket versioning, but they don't provide a detailed explanation of how the degradation happens.

Conclusion

Based on this analysis, we asked users to run aws s3 ls on the same prefix, and they reported that it took 120 seconds to receive the first response. We are aware that AWS S3 ListObjects can be slow, but in certain cases, it can be so slow that it triggers our timeout controls.

My takeaways from this lesson:

  • S3 versioning is not free; only enable it when necessary.
  • Enable lifecycle rules to remove expired delete markers and old noncurrent versions (a minimal configuration sketch follows below).
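For example, here is a minimal lifecycle configuration sketch that cleans up both. The rule ID and the 30-day window are illustrative; this is the JSON shape accepted by the S3 PutBucketLifecycleConfiguration API:

{
  "Rules": [
    {
      "ID": "cleanup-delete-markers-and-old-versions",
      "Status": "Enabled",
      "Filter": {},
      "Expiration": { "ExpiredObjectDeleteMarker": true },
      "NoncurrentVersionExpiration": { "NoncurrentDays": 30 }
    }
  ]
}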

BackON v1.5.0 Released

2025-04-09 09:00:00

I am happy to announce the release of BackON v1.5.0.

BackON is a Rust library that makes retries feel like a built-in feature of Rust.

use anyhow::Result;
use backon::ExponentialBuilder;
use backon::Retryable;

async fn fetch() -> Result<String> {
    Ok("hello, world!".to_string())
}

let content = fetch.retry(ExponentialBuilder::default()).await?;

This release adds a new API called adjust(), which allows you to modify the backoff time for the next retry. This is useful when you want to adjust the backoff duration based on the result of the previous attempt or implement a dynamic backoff strategy based on an HTTP Retry-After header.

For example:

use core::time::Duration;
use std::error::Error;
use std::fmt::Display;
use std::fmt::Formatter;

use anyhow::Result;
use backon::ExponentialBuilder;
use backon::Retryable;
use reqwest::header::HeaderMap;
use reqwest::StatusCode;

#[derive(Debug)]
struct HttpError {
    headers: HeaderMap,
}

impl Display for HttpError {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        write!(f, "http error")
    }
}

impl Error for HttpError {}

async fn fetch() -> Result<String> {
    let resp = reqwest::get("https://www.rust-lang.org").await?;
    if resp.status() != StatusCode::OK {
        let source = HttpError {
            headers: resp.headers().clone(),
        };
        return Err(anyhow::Error::new(source));
    }
    Ok(resp.text().await?)
}

#[tokio::main(flavor = "current_thread")]
async fn main() -> Result<()> {
    let content = fetch
        .retry(ExponentialBuilder::default())
        .adjust(|err, dur| {
            match err.downcast_ref::<HttpError>() {
                Some(v) => {
                    if let Some(retry_after) = v.headers.get("Retry-After") {
                        // Parse the Retry-After header and adjust the backoff duration
                        let retry_after = retry_after.to_str().unwrap_or("0");
                        let retry_after = retry_after.parse::<u64>().unwrap_or(0);
                        Some(Duration::from_secs(retry_after))
                    } else {
                        dur
                    }
                }
                None => dur,
            }
        })
        .await?;
    println!("fetch succeeded: {}", content);

    Ok(())
}

Hope you enjoy this feature. Thank you, everyone!


As of the v1.5.0 release, BackON is now:

  • Used by 1.5k projects on GitHub
  • Has 50 reverse dependencies on crates.io
  • Downloaded approximately 6.3 million times, averaging 60k downloads per day

Thank you all for your trust—let's make retries feel like a built-in feature in Rust!

Where do you belong, system researchers?

2025-03-10 09:00:00

@xiangpeng published a nice post called Where are we now, system researchers? (via archive.is). In this post, he questions the position of system researchers. Xiangpeng is an outstanding system researcher, and the post is written from the viewpoint of someone in that field. As for me, although I have never been a system researcher, I would like to share some comments and offer complementary ideas.


There are two main areas in the field of computer science: academia and industry. Traditionally, research is conducted in academia, and its results are applied in industry. However, the boundary between these two domains is becoming increasingly blurred. Sometimes, industry develops something relatively new that ultimately brings significant changes to academia.

Yet, whenever I discuss these developments with friends in academia, they simply laugh at me and say, "That's not new. A paper published in the 1990s already explored this idea."

Ah, the idea. But where is the implementation?

This post said:

We waste too much time babbling about knowledge we learn from papers – how to schedule a million machines, how to train a billion parameters, how to design infinitely scalable systems. Just thinking about these problems makes us feel important as researchers, although most of us have never deployed a service in the cloud, never used the techniques we proposed, and never worked with the filesystems, kernels, compilers, networks, or databases we studied. We waste time on these theoretical discussions because we don’t know how to code and are unwilling to practice. As Feynman said, “What I cannot create, I do not understand.” Simply knowing how a system works from 1000 feet doesn’t mean we can build it. The nuances of real systems often explain why they’re built in particular ways. Without diving into these details, we’re merely scratching the surface.

I think this is a very good point. I've seen many papers that present interesting ideas but are never implemented. Some develop great abstractions but lack practicality. Others propose excellent concepts without discussing how they could actually work. Sometimes, I feel that friends in academia don't really care about real users.

(Writing code does not make you a good researcher, but not writing code makes you a bad one.)

As I stated above, I'm not a systems researcher, so I'm genuinely curious: is it possible for a good researcher to be unable to write good code? Can someone conduct excellent research without producing any good code? Are there examples of this?

The system research community does not need more novel solutions – novel solutions are essentially combinations of existing techniques. When we need to solve a problem, most of us would figure out a similar solution, and what matters is the execution of the ideas.

Instead, we need more people willing to sit down and code, build real systems, and talk to real users. Be a solid practitioner, don’t be a feel-good researcher.

I believe that's a valid point. I'm looking forward to collaborating with more system researchers to push the boundaries of system research forward.

Paper publishing takes too much time. We spend too much effort arguing what’s new and what’s hard, instead of focusing on doing the actual research. Writing a paper already takes too much time, and then we need to anonymize artifacts, register abstracts, wait for reviews, write rebuttals, revise the paper, and can still be rejected for arbitrary reasons. The turnaround time for a single submission can be up to 6 months.

Ah, writing papers is increasingly becoming a specialized skill. I have failed to master it.

In today's world, arXiv is becoming an increasingly important platform for publishing papers and initiating discussions.

The real difference between papers often lies in numerous small details that sound trivial but are actually essential for relevance. In most cases, figuring out these details takes much more time and demonstrates more novelty than coming up with the initial idea itself.

Referring back to my previous comments: Papers are primarily about ideas. I also agree with Xiangpeng that the real difference between papers often lies in numerous small details that may seem trivial but are actually crucial for relevance.

Conclusion

So, back to the title—where do you belong, system researchers? My answer is: open source.

Try integrating your work with open-source projects or publishing it as open source. More and more researchers are doing this, and I believe it's a great trend. One great example is S3-FIFO.

Open source is a great way to share your work with the world and receive feedback from real users. It's also an excellent opportunity to practice coding and build real systems.

MCP Server OpenDAL

2025-03-05 09:00:00

I'm excited to introduce MCP Server OpenDAL, a model context protocol server for Apache OpenDAL™.

Model Context Protocol

Before discussing MCP, we should first establish some background on model context. At its most basic level, a model can be viewed as a pure function that operates like f(input) -> output, meaning it has no side effects or dependencies on external states. To make meaningful use of an AI model, we must provide all relevant information needed for the task as input.

For example, when building a chatbot, we need to supply the conversation history each time we invoke the model. Otherwise, it would be unable to understand the context of the conversation. However, different AI models have different ways of handling context, making it difficult to scale and migrate between them. Following the same approach as the Language Server Protocol, we can define a standardized interface for model context so that developers can integrate with various AI models without testing each one individually. That's the Model Context Protocol.

Its general architecture can be described as follows:

AI tools will function as MCP clients and connect to various MCP servers via MCP. Each server will specify the resources or tools it has and provide a schema detailing the required input. Then, the model can utilize the tools provided by the MCP server to manage context.

MCP Server OpenDAL

Apache OpenDAL (/ˈoʊ.pən.dæl/, pronounced "OH-puhn-dal") is an Open Data Access Layer that enables seamless interaction with diverse storage services. Its development is guided by its vision of One Layer, All Storage and its core principles: Open Community, Solid Foundation, Fast Access, Object Storage First, and Extensible Architecture.

So MCP Server OpenDAL can be used as an MCP server to provide storage services for model context. It supports various storage services such as the local file system, AWS S3, Google Cloud Storage, etc. Developers can easily integrate with OpenDAL to manage model context.

This project is still in its early stages, and I'm continuing to learn more about AI and Python. It should be exciting to see how it evolves.

ArchLinux removed the deprecated [community] repo

2025-03-03 09:00:00

Arch Linux announced that they would remove the deprecated repositories in Cleaning up old repositories (via archive.is).

Now it's happened.

If you have seen errors like:

:) paru
:: Synchronizing package databases...
 core 115.5 KiB 700 KiB/s 00:00 [####################################################] 100%
 extra 7.7 MiB 8.90 MiB/s 00:01 [####################################################] 100%
 community.db failed to download
 archlinuxcn 1404.2 KiB 4.06 MiB/s 00:00 [####################################################] 100%
error: failed retrieving file 'community.db' from mirrors.tuna.tsinghua.edu.cn : The requested URL returned error: 404
error: failed retrieving file 'community.db' from mirrors.ustc.edu.cn : The requested URL returned error: 404
error: failed retrieving file 'community.db' from mirrors.xjtu.edu.cn : The requested URL returned error: 404
error: failed retrieving file 'community.db' from mirrors.nju.edu.cn : The requested URL returned error: 404
error: failed retrieving file 'community.db' from mirrors.jlu.edu.cn : The requested URL returned error: 404

Please check your /etc/pacman.conf and remove the deprecated section:

[community]
Include = /etc/pacman.d/mirrorlist

All deprecated repositories include:

  • [community]
  • [community-testing]
  • [testing]
  • [testing-debug]
  • [staging]
  • [staging-debug]

Release Arrow Rust 54.2.1 in 6 hours

2025-02-27 10:00:00

Today I helped the arrow-rs community release Arrow Rust 54.2.1 in 6 hours.


Background

chrono v0.4.40 implemented a quarter() method that causes build errors when building arrow.

error[E0034]: multiple applicable items in scope
 --> /Users/yutannihilation/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-arith-54.2.0/src/temporal.rs:92:36
 |
92 | DatePart::Quarter => |d| d.quarter() as i32,
 | ^^^^^^^ multiple `quarter` found
 |
note: candidate #1 is defined in the trait `ChronoDateExt`
 --> /Users/yutannihilation/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-arith-54.2.0/src/temporal.rs:638:5
 |
638 | fn quarter(&self) -> u32;
 | ^^^^^^^^^^^^^^^^^^^^^^^^^
note: candidate #2 is defined in the trait `Datelike`
 --> /Users/yutannihilation/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/chrono-0.4.40/src/traits.rs:47:5
 |
47 | fn quarter(&self) -> u32 {
 | ^^^^^^^^^^^^^^^^^^^^^^^^
help: disambiguate the method for candidate #1
 |
92 | DatePart::Quarter => |d| ChronoDateExt::quarter(&d) as i32,
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~
help: disambiguate the method for candidate #2
 |
92 | DatePart::Quarter => |d| Datelike::quarter(&d) as i32,
 | ~~~~~~~~~~~~~~~~~~~~~

For more information about this error, try `rustc --explain E0034`.
error: could not compile `arrow-arith` (lib) due to 1 previous error

In arrow, we have a ChronoDateExt trait that provides additional APIs not yet supported by chrono. However, this becomes a breaking change when chrono later implements the same API in its own Datelike trait, which is exactly what happened here.
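A minimal reproduction of this kind of conflict (not the arrow-rs code itself, just an illustration of how two traits providing the same method name trigger E0034):

trait ChronoDateExtLike {
    fn quarter(&self) -> u32;
}

trait DatelikeLike {
    fn quarter(&self) -> u32;
}

struct Date;

impl ChronoDateExtLike for Date {
    fn quarter(&self) -> u32 { 1 }
}

impl DatelikeLike for Date {
    fn quarter(&self) -> u32 { 1 }
}

fn main() {
    let d = Date;
    // error[E0034]: multiple applicable items in scope
    // let q = d.quarter();

    // Fully qualified syntax disambiguates, which is exactly what the
    // compiler suggests in the error above.
    let q = DatelikeLike::quarter(&d);
    println!("{q}");
}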

Solution

The simplest solution without changing the code is to pin the chrono version to 0.4.39, but this may prevent the entire project from receiving future chrono releases, requiring another manual fix later. The best solution is for arrow-rs to release a patch that includes the fix, allowing users to simply run cargo update to apply it.

But since arrow-rs is an ASF project and requires three days to complete a release, does that mean our users will experience disruptions for three days?

ASF Emergency Release

It's a widely held misconception that ASF releases are necessarily slow: that they must gather three binding votes and keep the vote open for three full days before being approved for release.

It's not true.

The ASF release policy says:

Release votes SHOULD remain open for at least 72 hours.

However, this policy is relaxed in real-world practice. The PMC can issue an emergency release if necessary, which can be completed within a few hours, provided it secures enough binding votes from PMC members. This arrow-rs release required an emergency update: without the fix, arrow-rs users would fail to build.

Release Arrow Rust 54.2.1 in 6 hours

I started handling the issue and worked to reach a consensus in the arrow community for a quick patch release. Following the arrow-rs release pattern, I created the issue Release arrow-rs / parquet patch version 54.2.1 (Feb 2025) (HOTFIX).

I initially attempted to backport the fix (fix: Use chrono's quarter() to avoid conflict), but after discussing it with @tustvold, I switched to chrono = ">= 0.4.34, < 0.4.40" instead to minimize disruption for users.

@tustvold merged this PR in 2 minutes, and after that, I started a bump PR for the branch checked out from the 54.2.0 tag: [54.2.0_maintenance] Bump Arrow version to 54.2.1. In this PR, I requested an emergency release in this way:

I request that this release be made under emergency circumstances, allowing it to be released as soon as possible after gathering three +1 binding votes, without waiting for three days.

@alamb stepped up and volunteered to handle the release. He helped merge the PR, prepared the release candidate, and then started the release voting. This vote passed in two hours, and the 54.2.1 release of arrow-rs is now ready.

Conclusion

The ASF has rules for good reasons, and they are designed to ensure the stability and security of the project. These rules are not arbitrary, but rather reflect the experiences of the community and the lessons learned from past mistakes. Instead of blindly following these rules, we should thoroughly understand the reasons behind them, adapt them to the real world, and ultimately bring our practices into ASF discussions to refine and improve those rules.

So, we successfully released arrow-rs 54.2.1 in 6 hours. My next step is to discuss this case with the broader ASF community to see what we can learn from it and how we can improve our rules. I also welcome other ASF projects to share their experiences and learnings from similar situations.