2025-03-10 09:00:00
@xiangpeng published a nice post called Where are we now, system researchers? via archive.is. In it, he questions the position of system researchers. Xiangpeng is an outstanding system researcher, and the post is written from the viewpoint of someone in that field. Although I have never been a system researcher myself, I would like to share some comments here and offer complementary ideas.
There are two main areas in the field of computer science: academia and industry. Traditionally, research is conducted in academia, and its results are applied in industry. However, the boundary between these two domains is becoming increasingly blurred. Sometimes, industry develops something relatively new that ultimately brings significant changes to academia.
Yet, whenever I discuss these developments with friends in academia, they simply laugh at me and say, "That's not new. A paper published in the 1990s already explored this idea."
Ah, the idea. But where is the implementation?
This post said:
We waste too much time babbling about knowledge we learn from papers – how to schedule a million machines, how to train a billion parameters, how to design infinitely scalable systems. Just thinking about these problems makes us feel important as researchers, although most of us have never deployed a service in the cloud, never used the techniques we proposed, and never worked with the filesystems, kernels, compilers, networks, or databases we studied. We waste time on these theoretical discussions because we don’t know how to code and are unwilling to practice. As Feynman said, “What I cannot create, I do not understand.” Simply knowing how a system works from 1000 feet doesn’t mean we can build it. The nuances of real systems often explain why they’re built in particular ways. Without diving into these details, we’re merely scratching the surface.
I think this is a very good point. I've seen many papers that present interesting ideas but are never implemented. Some develop great abstractions but lack practicality. Others propose excellent concepts without discussing how they could actually work. Sometimes, I feel that friends in academia don't really care about real users.
(Writing code does not make you a good researcher, but not writing code makes you a bad one.)
As I stated above, I'm not a systems researcher, so I'm genuinely curious: is it possible to be a good researcher while being unable to write good code? Can someone conduct excellent research without producing any solid code? Are there examples of this?
The system research community does not need more novel solutions – novel solutions are essentially combinations of existing techniques. When we need to solve a problem, most of us would figure out a similar solution, and what matters is the execution of the ideas.
Instead, we need more people willing to sit down and code, build real systems, and talk to real users. Be a solid practitioner, don’t be a feel-good researcher.
I believe that's a valid point. I'm looking forward to collaborating with more system researchers to push the boundaries of system research forward.
Paper publishing takes too much time. We spend too much effort arguing what’s new and what’s hard, instead of focusing on doing the actual research. Writing a paper already takes too much time, and then we need to anonymize artifacts, register abstracts, wait for reviews, write rebuttals, revise the paper, and can still be rejected for arbitrary reasons. The turnaround time for a single submission can be up to 6 months.
Ah, writing papers is increasingly becoming a specialized skill. I have failed to master it.
In today's world, arXiv is becoming an increasingly important platform for publishing papers and initiating discussions.
The real difference between papers often lies in numerous small details that sound trivial but are actually essential for relevance. In most cases, figuring out these details takes much more time and demonstrates more novelty than coming up with the initial idea itself.
Referring back to my previous comments: papers are primarily about ideas, and I agree with Xiangpeng that the real difference between papers often lies in numerous small details that may seem trivial but are actually crucial for relevance.
So, back to the title—where do you belong, system researchers? My answer is: open source.
Try integrating your work with open-source projects or publishing it as open source. More and more researchers are doing this, and I believe it's a great trend. One great example is S3-FIFO.
Open source is a great way to share your work with the world and receive feedback from real users. It's also an excellent opportunity to practice coding and build real systems.
2025-03-05 09:00:00
I'm excited to introduce MCP Server OpenDAL, a model context protocol server for Apache OpenDAL™.
Before discussing MCP, we should first establish some background on model context. At its most basic level, a model can be viewed as a pure function that operates like f(input) -> output, meaning it has no side effects or dependencies on external state. To make meaningful use of an AI model, we must provide all relevant information needed for the task as input.
For example, when building a chatbot, we need to supply the conversation history each time we invoke the model; otherwise, it cannot understand the context of the conversation. However, different AI models handle context in different ways, making it difficult to scale and migrate between them. Following the approach of the Language Server Protocol, we can define a standardized interface for model context so that developers can integrate with various AI models without testing each one individually. That is the Model Context Protocol.
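To make the "pure function" framing concrete, here is a minimal sketch in Rust (matching the other code examples in this post). The Message type and the toy reply_to function are hypothetical stand-ins for a real model API; the point is only that the full history must be re-sent on every call:

```rust
// A model call is a pure function: the full context goes in as input,
// and nothing is remembered between calls.
#[derive(Clone)]
struct Message {
    role: &'static str, // "user" or "assistant"
    content: String,
}

// Hypothetical stand-in for invoking a real model: f(history) -> reply.
fn reply_to(history: &[Message]) -> String {
    // A toy "model" that only reflects the context it was handed.
    format!("I have seen {} messages so far.", history.len())
}

fn main() {
    let mut history = Vec::new();
    history.push(Message { role: "user", content: "Hello!".to_string() });

    // Every invocation must re-send the entire history.
    let reply = reply_to(&history);
    history.push(Message { role: "assistant", content: reply.clone() });
    println!("{reply}");
}
```

Because the model itself is stateless, whoever manages that history (and any other context) does all the real work — which is exactly the job MCP standardizes.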
Its general architecture can be described as follows:
AI tools will function as MCP clients and connect to various MCP servers via MCP. Each server will specify the resources or tools it has and provide a schema detailing the required input. Then, the model can utilize the tools provided by the MCP server to manage context.
Apache OpenDAL (/ˈoʊ.pən.dæl/, pronounced "OH-puhn-dal") is an Open Data Access Layer that enables seamless interaction with diverse storage services. Its development is guided by the vision of One Layer, All Storage and its core principles: Open Community, Solid Foundation, Fast Access, Object Storage First, and Extensible Architecture.
So MCP Server OpenDAL can be used as an MCP server to provide storage services for model context. It supports various storage services such as the local file system, AWS S3, Google Cloud Storage, etc. Developers can easily integrate with OpenDAL to manage model context.
This project is still in its early stages, and I'm continuing to learn more about AI and Python. It will be exciting to see how it evolves.
2025-03-03 09:00:00
Arch Linux announced that they would remove the deprecated repositories in Cleaning up old repositories via archive.is.
Now it's happened.
If you have seen errors like:
:) paru
:: Synchronizing package databases...
core 115.5 KiB 700 KiB/s 00:00 [####################################################] 100%
extra 7.7 MiB 8.90 MiB/s 00:01 [####################################################] 100%
community.db failed to download
archlinuxcn 1404.2 KiB 4.06 MiB/s 00:00 [####################################################] 100%
error: failed retrieving file 'community.db' from mirrors.tuna.tsinghua.edu.cn : The requested URL returned error: 404
error: failed retrieving file 'community.db' from mirrors.ustc.edu.cn : The requested URL returned error: 404
error: failed retrieving file 'community.db' from mirrors.xjtu.edu.cn : The requested URL returned error: 404
error: failed retrieving file 'community.db' from mirrors.nju.edu.cn : The requested URL returned error: 404
error: failed retrieving file 'community.db' from mirrors.jlu.edu.cn : The requested URL returned error: 404
Please go check your /etc/pacman.conf and remove the deprecated section:
[community]
Include = /etc/pacman.d/mirrorlist
All deprecated repositories include:
[community]
[community-testing]
[testing]
[testing-debug]
[staging]
[staging-debug]
2025-02-27 10:00:00
Today I helped the arrow-rs community release Arrow Rust 54.2.1 in 6 hours.
chrono v0.4.40 implemented a quarter method, which causes build errors when building arrow:
error[E0034]: multiple applicable items in scope
--> /Users/yutannihilation/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-arith-54.2.0/src/temporal.rs:92:36
|
92 | DatePart::Quarter => |d| d.quarter() as i32,
| ^^^^^^^ multiple `quarter` found
|
note: candidate #1 is defined in the trait `ChronoDateExt`
--> /Users/yutannihilation/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-arith-54.2.0/src/temporal.rs:638:5
|
638 | fn quarter(&self) -> u32;
| ^^^^^^^^^^^^^^^^^^^^^^^^^
note: candidate #2 is defined in the trait `Datelike`
--> /Users/yutannihilation/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/chrono-0.4.40/src/traits.rs:47:5
|
47 | fn quarter(&self) -> u32 {
| ^^^^^^^^^^^^^^^^^^^^^^^^
help: disambiguate the method for candidate #1
|
92 | DatePart::Quarter => |d| ChronoDateExt::quarter(&d) as i32,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~
help: disambiguate the method for candidate #2
|
92 | DatePart::Quarter => |d| Datelike::quarter(&d) as i32,
| ~~~~~~~~~~~~~~~~~~~~~
For more information about this error, try `rustc --explain E0034`.
error: could not compile `arrow-arith` (lib) due to 1 previous error
In arrow, we have a ChronoDateExt trait that provides additional APIs not yet supported by chrono. However, this can become a breaking change if chrono later implements the same API in its own Datelike trait.
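Here is a minimal sketch of the same kind of conflict, using the fully qualified syntax that the compiler output above suggests; the trait and type names are hypothetical stand-ins for ChronoDateExt and Datelike:

```rust
struct Date;

// An extension trait, like arrow's ChronoDateExt.
trait DateExt {
    fn quarter(&self) -> u32;
}

// The upstream trait later gains the same method, like chrono's Datelike.
trait Datelike {
    fn quarter(&self) -> u32;
}

impl DateExt for Date {
    fn quarter(&self) -> u32 { 1 }
}

impl Datelike for Date {
    fn quarter(&self) -> u32 { 2 }
}

fn main() {
    let d = Date;
    // `d.quarter()` would fail with E0034: multiple applicable items in scope.
    // Fully qualified syntax names the trait explicitly:
    println!("{}", DateExt::quarter(&d));  // 1
    println!("{}", Datelike::quarter(&d)); // 2
}
```

Note that the error only appears once both traits are in scope, which is why a routine chrono upgrade could suddenly break a previously clean build.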
The simplest solution without changing the code is to pin the chrono version to 0.4.39, but this may prevent the entire project from receiving future chrono releases, requiring another manual fix later. The best solution is for arrow-rs to release a patch that includes the fix, allowing users to simply run cargo update to apply it.
But since arrow-rs is an ASF project and requires three days to complete a release, does that mean our users will experience disruptions for three days?
It's a widely held misconception that ASF releases are slow: that they must gather three binding votes and hold a three-day voting period before being approved for release.
It's not true.
The ASF release policy says:
Release votes SHOULD remain open for at least 72 hours.
However, this policy is relaxed in real-world practice. The PMC can issue an emergency release if necessary, which can be completed within a few hours, provided it secures enough binding votes from PMC members. The arrow-rs release qualified as an emergency: without this fix, all arrow-rs users would fail to build.
I have started handling these issues and am working to reach a consensus in the arrow community for a quick patch release. Following the arrow-rs release pattern, I created issue Release arrow-rs / parquet patch version 54.2.1 (Feb 2025) (HOTFIX).
I previously attempted to backport the fix fix: Use chrono's quarter() to avoid conflict, but after discussing it with @tustvold, I changed to use chrono = ">= 0.4.34, < 0.4.40" instead to minimize disruption for users.
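As a sketch, the constraint looks like this in a Cargo.toml dependency table (the exact entry in arrow's manifest may carry additional fields):

```toml
[dependencies]
# Accept any compatible chrono release before 0.4.40, which introduced
# the conflicting `quarter()` method on `Datelike`.
chrono = ">= 0.4.34, < 0.4.40"
```

Unlike pinning to exactly 0.4.39, this range still lets cargo pick up earlier compatible releases, and the upper bound can be lifted in a later patch once the code is fixed.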
@tustvold merged this PR in 2 minutes, and after that, I started a bump PR for the branch checked out from the 54.2.0 tag: [54.2.0_maintenance] Bump Arrow version to 54.2.1. In this PR, I requested an emergency release in this way:
I request that this release be made under emergency circumstances, allowing it to be released as soon as possible after gathering three +1 binding votes, without waiting for three days.
@alamb stepped up and volunteered to handle the release. He helped merge the PR, prepared the release candidate, and then started the release vote. The vote passed in two hours, and the 54.2.1 release of arrow-rs is now ready.
The ASF has rules for good reasons, and they are designed to ensure the stability and security of the project. These rules are not arbitrary, but rather reflect the experiences of the community and the lessons learned from past mistakes. Instead of blindly following these rules, we should thoroughly understand the reasons behind them, adapt them to the real world, and ultimately bring our practices into ASF discussions to refine and improve those rules.
So, we successfully released arrow-rs 54.2.1 in 6 hours. My next step is to discuss this case with the broader ASF community to see what we can learn from it and how we can improve our rules. I also welcome other ASF projects to share their experiences and lessons from similar situations.
2025-02-27 09:00:00
I'm happy to see that zlib-rs is faster than C via archive.is
People often assume that Rust is slower than C because it has some unavoidable overhead compared to C. However, such assumptions often overlook an important prerequisite: both projects have been allocated the same level of resources. Rust has been quite hyped in recent years, but hype can be a good thing—many talented developers are drawn to Rust and dedicate their efforts to it.
Today, multiversioning is not natively supported in rust. There are proposals for adding it (which we're very excited about!), but for now, we have to implement it manually which unfortunately involves some unsafe code. We'll write more about this soon (for the impatient, the relevant code is here).
All crates that use SIMD instructions currently need to implement multiversioning manually. I'm eagerly anticipating simd-multiversioning too!
The current way of doing that looks similar to the following:
fn inflate_fast_help(state: &mut State, start: usize) {
#[cfg(any(target_arch = "x86_64", target_arch = "x86"))]
if crate::cpu_features::is_enabled_avx2() {
// SAFETY: we've verified the target features
return unsafe { inflate_fast_help_avx2(state, start) };
}
inflate_fast_help_vanilla(state, start);
}
#[cfg(any(target_arch = "x86_64", target_arch = "x86"))]
#[target_feature(enable = "avx2")]
unsafe fn inflate_fast_help_avx2(state: &mut State, start: usize) {
inflate_fast_help_impl::<{ CpuFeatures::AVX2 }>(state, start);
}
fn inflate_fast_help_vanilla(state: &mut State, start: usize) {
inflate_fast_help_impl::<{ CpuFeatures::NONE }>(state, start);
}
inflate_fast_help serves as the entry point for the API call. It first checks the CPU features and then invokes the appropriate implementation.
I find it a bit unusual that zlib-rs marks inflate_fast_help_avx2 as an unsafe function. My assumption is that they use unsafe to indicate that calling the function without first verifying the target features is not safe. In fact, as of this writing, Rust requires functions annotated with #[target_feature] to be declared unsafe, because calling one on a CPU that lacks the feature is undefined behavior.
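The standard library offers a runtime detection macro for exactly this dispatch pattern. Here is a minimal sketch of it; the function names and the trivial summing workload are hypothetical, chosen only to keep the example self-contained:

```rust
// Dispatch to an AVX2-specialized implementation when the CPU supports it,
// falling back to a portable version otherwise.
fn sum(data: &[u32]) -> u32 {
    #[cfg(any(target_arch = "x86_64", target_arch = "x86"))]
    if std::arch::is_x86_feature_detected!("avx2") {
        // SAFETY: we just verified at runtime that the CPU supports AVX2.
        return unsafe { sum_avx2(data) };
    }
    sum_vanilla(data)
}

#[cfg(any(target_arch = "x86_64", target_arch = "x86"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(data: &[u32]) -> u32 {
    // With the `avx2` feature enabled, the compiler is free to
    // auto-vectorize this loop using AVX2 instructions.
    data.iter().copied().sum()
}

fn sum_vanilla(data: &[u32]) -> u32 {
    data.iter().copied().sum()
}

fn main() {
    println!("{}", sum(&[1, 2, 3, 4])); // 10
}
```

Both implementations compute the same result; the specialization only changes which instructions the compiler is allowed to emit, which is why the runtime check at the call site is the single point where safety must be established.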
Nikita Popov suggested we try the -Cllvm-args=-enable-dfa-jump-thread option, which recovers most of the performance here. It performs a kind of jump threading for deterministic finite automata, and our decompression logic matches this pattern.
enable-dfa-jump-thread is a bit magical.
Here is my understanding of it:
First of all, Jump threading is a compiler optimization that can be thought of as a "shortcut optimization."
Take the following code as an example. Please note that this is not how jump threading works in a compiler; the example is meant to represent the logic.
10. a = SomeNumber();
20. IF a > 10 GOTO 50
...
50. IF a > 0 GOTO 100
...
It's easy to see that if a > 10, then a > 0 is always true. The compiler can optimize this by jumping directly to 100, like this:
10. a = SomeNumber();
20. IF a > 10 GOTO 100
...
50. IF a > 0 GOTO 100
...
This optimization eliminates unnecessary dynamically executed jumps and makes way for further optimizations.
Then, enable-dfa-jump-thread is a flag that enables jump threading for DFAs (deterministic finite automata).
For example:
fn check_status(value: i32) -> String {
let status;
if value > 50 {
status = "OK";
} else {
status = "ERR";
}
// Other code here.
let mut result = String::from("Result: ");
if status == "OK" {
result.push_str("Passed");
} else {
result.push_str("Failed");
}
result
}
After enable-dfa-jump-thread, LLVM can optimize the code by jumping directly to the appropriate branch based on the status value:
fn check_status(value: i32) -> String {
// Other code here.
let mut result = String::from("Result: ");
if value > 50 {
result.push_str("Passed");
} else {
result.push_str("Failed");
}
result
}
There are many DFAs within the decompressor of zlib, so enabling DFA jump threading can significantly improve performance, especially when handling small datasets.
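To see why a decompressor contains DFAs, consider a typical state-machine decoding loop. This is a hypothetical sketch, not zlib-rs code: 0xFF introduces a zero-run in this toy format, and everything else is a literal byte. Each iteration branches on a state set by a previous iteration, which is exactly the pattern DFA jump threading targets:

```rust
#[derive(Clone, Copy, PartialEq)]
enum State {
    Literal, // copy the next byte through
    Length,  // read a run length for a zero-run
    Done,
}

// Toy "decompressor": 0xFF is followed by a run length of zero bytes.
fn decode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut state = State::Literal;
    let mut i = 0;
    while state != State::Done {
        // The branch below depends on `state`, which was decided by the
        // previous iteration; DFA jump threading can thread these jumps
        // so each state transitions directly to the next state's code.
        match state {
            State::Literal => match input.get(i) {
                Some(&0xFF) => { i += 1; state = State::Length; }
                Some(&b) => { out.push(b); i += 1; }
                None => state = State::Done,
            },
            State::Length => {
                let n = input[i] as usize;
                out.extend(std::iter::repeat(0u8).take(n));
                i += 1;
                state = State::Literal;
            }
            State::Done => unreachable!(),
        }
    }
    out
}

fn main() {
    println!("{:?}", decode(&[1, 2, 0xFF, 3, 4])); // [1, 2, 0, 0, 0, 4]
}
```

Without the pass, every iteration re-dispatches through one central match on the state; with it, each state's block can jump straight to its known successor, removing an unpredictable indirect branch from the hot loop.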
The LLVM community is working to enable this flag by default at [DFAJumpThreading] Enable the pass by default.
Our implementation is mostly done, and clearly performs extremely well. However, we're missing some less commonly used API functions related to gzip files that would make us a complete drop-in replacement in all cases.
Most functions, such as gzclose and gzflush, seem easy to implement. Let's take a look!
2025-02-24 09:00:00
I just published the announcement for New OpenDAL Committer: xxchan.
I knew xxchan even before OpenDAL. We used to work together at BeyondStorage, a discontinued project similar to today's OpenDAL but written in Golang.
It's quite nice to see xxchan showing up again in the OpenDAL community. After we left that company, I didn't talk with him for a while, and I never invited him to actively contribute to OpenDAL. He just appears from time to time and fixes issues. He has been contributing to OpenDAL since February 11, 2023, at his own pace, based on his own needs.
That's really a great thing for me. An open-source community needs passionate contributors to drive projects forward by adding major features and improvements. However, it also needs contributors like xxchan, who fix issues, clean up messy code, and contribute at their own pace.
I wish him lots of fun and enjoyment in his contributions to OpenDAL!