
I have been blogging under the name “雨帆” (Yufan) since 2007.

RSS Preview of Yufan | 雨帆

A Brief Review of Go 语言设计与实现 (The Design and Implementation of the Go Programming Language)

2025-10-01 20:57:19

Go 语言设计与实现

I first came across Mr. Zuo's blog, 面向信仰编程 (Faith-Oriented Programming), around 2020, when I was working at Tencent Cloud and the company was pushing everyone to standardize on Go. As a Java developer looking for learning materials, I found his blog and was amazed by the detailed articles and illustrations about Go. Later, while chatting with Yingzi, an editor I knew at Turing Press, I learned she happened to be the editor of his book Go 语言设计与实现. So I got a sample copy as soon as the book was published.

Four years have passed in the blink of an eye. Last Monday I made a point of rereading it, and I think it is time to review this book seriously.

Looking at its Douban reviews, you will find the reactions sharply polarized. Those who like it praise it to the skies; those who don't push their language to extremes, criticizing it very harshly.

What impressed me most is the editor once mentioning that Mr. Zuo's language is too terse. Terse language is acceptable in a blog: posts are short, usually around a thousand characters, and mostly drive at the key points. Writing a book is an entirely different discipline. Friends who have written for a publisher will know that by the time the contract is first drawn up, the outline and topic have already passed internal review, and the manuscript due at each stage is largely fixed. So from the very beginning you need a clear framework, rather than writing wherever your thoughts wander.

Once the framework (the table of contents) is set, the next step is to fill it with content and decide the focus of each chapter. This is really the art of organizing and expressing language. Of course, the editor will help review and polish the wording along the way, but the core of the writing is always the author; the editor mostly gives the text a second pass.

The primary problem with Go 语言设计与实现 is exactly this kind of expression, especially within an existing framework. For example, chapter 1 covers debugging the source code and chapter 2 covers compiler principles; together they run about 50 pages yet span the content of a semester-long undergraduate compilers course. Notably, many readers are not computer science majors and have never studied such courses systematically. Even for readers with the background, the organization is somewhat unbalanced. Take Lex, mentioned in the lexical analysis section: I suspect most readers do not understand it. The author himself notes that from Go 1.5 onward the Go compiler no longer depends on C's Lex, but that remark is left as a dry one-liner with no further discussion; bootstrapping does not even get a mention. Then, in the parsing section, concepts such as LL, LA, and LR will leave many readers bewildered. By contrast, the author of ANTLR4, in a comparable book, lays out concepts like closures and DFAs with clear diagrams and flowcharts, and readers grasp them easily.

The code snippets in the book also deserve a mention; you could even call them a disaster. Code should serve the book's content, but Mr. Zuo's approach seems to be pasting in whatever code the current topic touches. The right way is to first explain the macro design ideas and the overall structure of the code, then gradually drill down into the specifics, explaining what each line does. Compared with the books by Yuhen (雨痕) and Zheng Jianxun (郑建勋), Mr. Zuo falls somewhat short here.

Initially, the illustrations were what I admired most about his blog; I even studied his post on how he draws them. In the book, although the illustrations remain beautiful, they are not always a good fit. Often the piling-up of colors is distracting: the reader's attention goes to the color choices rather than to the content of the diagram. Why are some parts red and others green? On closer inspection, the colors do not distinguish anything important; they are there purely for visual effect.

That is all I will say about the flaws. Overall this is not a bad book, or I would not have reread it. It is just a pity that Mr. Zuo could have done better. His knowledge of Go and the depth of his research far exceed mine as a layman. Much of the time we judge a book with an effortless "good" or "bad," but writing one is no simple matter. Still, I feel this book reads more like something the author wrote for himself. Making it accessible to a wider audience would take more work.

Learning to Compromise with the World

2025-09-28 23:58:28

藍切 健さんオリジナル2曲描き下ろし!! - inika

In early September I posted a joke about Fortress Besieged (《围城》). Unsurprisingly, the comments filled with abuse, so fierce you would think the commenters and I were mortal enemies. It was just like earlier this year, when I mentioned the Jingtai Expressway (京台高速) and explained that the "台" refers to Taiwan, and was likewise attacked by a mob.

The most striking feature of the internet age is amplification. A niche subculture can spread into the mainstream; a faint voice can become a viral meme overnight. At the same time, all kinds of noise are amplified without limit, and almost every utterance can send out ripples: sometimes praise, sometimes censure.

I remember in college I often read Ruan Yifeng's blog and frequently spotted small errors in his articles. In the comments, some people would point them out and others would sneer. His way of handling it was distinctive: fix the error, but never delete the comment. At the time I didn't understand; I felt the jarring remarks should simply be removed. I once firmly believed in the "American style of free speech": you have the right to swing your fist, and I have the right not to be hit.

Yet online cold violence, amplified by circulation, is sharp as a knife and wounds easily. When I was young I often couldn't swallow the insult and felt I had to fight to the bitter end; even when losing an argument, I would console myself like Ah Q, muttering that I had "been cursed by my own son." In 2015, when I hung around Zhihu, the site introduced a "friendliness score" to remind people not to attack one another. Even so, I kept clashing with others in arguments, and my score once dropped to the minimum.

Looking back now, most of that quarreling, stubbornness, and anger was meaningless. You can't delete every comment, and you can't win every debate. Learning to compromise is not admitting you are wrong; it is admitting the world will not bend entirely to your will. The way to get along with the world is usually not to "defeat" it but to "let go."

Letting go is not weakness but a kind of wisdom. It lets you keep your inner calm amid the clamor of the internet, and stops you from burning energy on trifles amid life's disturbances. As some say, growing up means making peace with yourself, and with the world.

星と街 - naru

Life Is Short, and Also Long

2025-09-22 09:45:28

RIZ3 - なにこれ幸せ

At the end of 2021 my elder daughter Taotao was born, and I took on a new role: father. The next New Year I took Taotao, just a hundred days old, back to my hometown to see my grandmother: four generations under one roof, a joyful scene. But the baby was tiny and the weather cold, so after a meal we returned to Wuhu. I had planned to bring Taotao to keep her great-grandmother company once she was a bit older, but my grandmother, long bedridden, did not make it through the autumn. A handful of crushed bone; singing and strumming along the road, drums and clappers, firecrackers in chorus. She lived quietly, yet left this world with the noisiest of ceremonies.

At that moment a saying came to mind. Books say a person dies three times. The first is the medical farewell: the heart stops, the breathing ceases. The second is the legal curtain-fall: the death certificate is signed, and the red "cancelled" stamp lands on the household register. The third is when the last person in the world who remembers you forgets you. Taotao surely won't remember this great-grandmother she met only once, but she will never forget her beloved grandparents, just as I will always remember mine.

Losing family is not unfamiliar to me. When I was only five, my fourth uncle died of liver cancer. What I remember most is my father's tears. From inside the house I heard crying far off. Opening the door, I saw my father, a man who rarely showed emotion, walking back alone along the ridge between the fields, wiping his eyes as he came. I didn't understand and was about to ask him what was wrong when my mother stopped me, saying softly: "Your fourth uncle has passed away. Don't disturb your father right now."

My first formal encounter with death came in the spring when I was seven. My mother took me out of school to attend my youngest aunt's funeral. Bewildered, I arrived at the crematorium, was dressed in hemp mourning clothes, a white cloth tied around my head and a black armband on my left arm, and knelt with my cousin by the fire basin burning paper. Stacks of yellow paper and paper ingots and banknotes went onto the flames, the ashes drifting in the wind; all around was silence except for my cousin's soft sobbing. Then we filed into the memorial hall with the adults. In the crystal coffin lay my aunt, who used to visit our home so often, now lying still and bloodless. Grandma walked in front, supported by my eldest aunt and my mother, crying until she nearly collapsed. At that moment, for the first time, I truly understood: a person can never wake again.

Later, my grandparents passed away one after the other while I was in college, and my eldest aunt died of cancer a few years ago. The familiar faces around me left one by one, and with each loss a piece of my heart went missing that could never be filled. I grew quite afraid of death, feeling that life was far too short: a few fleeting decades and we return to heaven and earth. As the Li Sao has it, "Swiftly I sped, as if I could not keep pace, fearing the years would not wait for me."

Perhaps that is why each new life feels all the more like a gift from fate. In early spring, February 2025, the cries of our second daughter, Youyou, bloomed in the delivery room, and I became the father of two girls. Taotao leaned on the edge of the hospital bed, watching the baby in her mother's arms, lips pouting, apparently displeased. But the next moment she turned around and behaved impeccably, even fussing over her mother. Writing this, I suddenly laugh: the curtain on their "rivalry" rose quietly the very day Youyou was born.

After Youyou's first month, small frictions became routine. Taotao would sometimes suddenly smack her sister, make her cry, and run off. When we dug out her old toys for the baby, she would snatch them back, on the grounds that "sister is too little, she can't use them." After we reassembled the old crawling mat and playpen, Taotao would occupy the whole space by herself and keep her sister out. At first I just thought she was being naughty, until one day, holding Youyou, I caught a glimpse of Taotao sitting there forlorn, and suddenly understood: she was not being naughty; the love that used to be hers alone had been split in half.

And between the crying and laughter of two children, I see ever more clearly how my parents are aging. My mother used to love carrying Taotao out for walks; now, after a few minutes holding Youyou, she says her arms are sore, and when she sits down she braces her back and shifts slowly. My father's temples grow whiter, while Taotao hugs his leg, begging to ride on his shoulders. In a daze I seem to see my eight-year-old self, during the New Year when father came home from working away, likewise riding on his shoulders, climbing to our home on the sixth floor…

Day by day, Taotao is learning to be a big sister. She cuddles her sister before bed, helps Mom soothe her when she cries, and even allows her to play with her beloved crawling mat while she is at kindergarten. And I, watching the two children grow and accompanying my parents as they age, have slowly let go of my fear of death.

I am slowly coming to understand that life is "short" because it must end, just as our loved ones must leave. But life is "long" because love and memory carry on. My grandmother's story will live in my mother's telling and pass to Taotao and Youyou. My parents' love for me will become my care for my two girls. And the bond between Taotao and Youyou will become the warm support of their future lives.

So life is never a solitary journey but a relay carried forward with love and attachment. Those who leave become the light in our memories; those who stay beside us become the strength to move on. Seen this way, life is not short at all, because every bit of love and memory stretches time out very, very long.

小さな手のひら - tatsuya

Blue Period

2025-04-08 14:13:32

Blue Period 01

Friends with a passing knowledge of art history and Picasso will know that the "Blue Period" refers to the nearly monochromatic works, dominated by blues and greens, that Picasso produced between 1900 and 1904. The manga that borrows this title tells the story of Yatora, a high schooler who had never considered painting as a direction in life, who falls in love with it after depicting early-morning Shibuya in pure blue during art class, and who finally, through hard work, gets into Tokyo University of the Arts.

On art-school entrance exams, Japan and China are actually quite similar. In my second year of high school, a classmate suddenly took a long leave and did not reappear until just before the gaokao. When I saw him again, he was carrying a case packed with hundreds of sharpened pencils and all kinds of heavy painting gear. Only later did I learn he had chosen the art-exam route. In the studio, students crowd around the model in layer upon layer, practicing sketches, croquis, and watercolor day and night, all for one more point of advantage in the competition.

Later, when I was dating my wife (a printmaking major), I came to understand the hardship behind the art exams: it is practically a "march of suffering." In China the art exam is separate from the gaokao and comes earlier. In the era before provincial unified art exams became widespread, candidates had to lug heavy painting gear from city to city for each school's independent entrance exam. They had to stand out in subjects like figure sketching and watercolor, earn the professional qualification certificate, and then turn around and cram academics to make sure their gaokao score reached the admission line. The double pressure is hard to appreciate unless you have lived it.

So far I have just finished the first six volumes of Blue Period, and the story pauses exactly at Yatora's admission to Tokyo University of the Arts. Although the plot broadly matches the real experiences I described, the manga's portrayal of tension is positively suffocating. Whether it is the last-minute practice before the exam or the confusion and self-doubt in the face of setbacks, anyone who has been through the gaokao will recognize that tangle of anxiety and longing.

What moves me more is how nakedly the manga exposes the gap between "passion" and "reality": Yatora first picks up the brush simply because he likes painting, but once he truly sets foot on this path he finds nothing is as simple as he imagined. Effort does not guarantee reward; some succeed effortlessly on raw talent, while others who look relaxed are in fact straining with everything they have. This cruel realism comes straight off the page and leaves nowhere to hide.

The author also weaves art knowledge deftly into the plot. From color theory to composition and perspective, content that could easily be dry comes alive through Yatora's learning process. For most readers this may be their first systematic contact with these concepts; if it sparks an interest in art, or even an urge to create, that is this manga's extra gift.

Reading six volumes in one night let me relive the passion and persistence of youth, and marvel again at the power of a classic shonen story. But reality is far crueler than fiction: the manga's ratio of "50 admitted out of 2,000" is already startling, while the competition for China's four major art academies is truly a one-in-ten-thousand crucible.

Blue Period 02

A Century of Datong, Two Years of Deep Affection

2025-03-29 09:16:21

The old school gate of Xiamen Datong Middle School

My junior high, Datong Middle School, occupies just over 20 mu (about 1.3 hectares) and is one of the smallest secondary schools in Xiamen. Yet this little school gave us the warmth of a family workshop and taught us what "great love" means. As the new school anthem sings: "Born in 1925, weathering the baptism of war together with the motherland." Datong Middle School has now reached its centenary. Across a history spanning old and new China, countless students have studied here. I was fortunate to enroll in 2005 and witness the celebration of the school's 80th anniversary.

For me, although my time at Datong lasted only two years, it remains an unfading point of light in my memory. Many of my teachers from those years are still in touch with me. Ye Xiaojing, who first taught me physics, made me love the subject with her vivid teaching; my chemistry teacher Lin Shunqing used her spare time to run chemistry competition training for the whole grade, opening the door to the world of chemistry for me. To this day I remain deeply grateful for their devoted guidance.

Audio of the old school anthem is hard to find on the internet, but this song, adapted from the "Great Harmony" chapter of the Book of Rites (《礼记·大同篇》), I can still sing from beginning to end:

When the Great Way prevails, the world belongs to all; the young are nurtured, the old live out their days.

To perfect this teaching is what our school upholds; its many students are molded and refined here.

Like gold in the furnace, like jade on the grindstone; true to its name, this is Datong, the Great Harmony.

Here stands Datong at the foot of Mount Jing, with Tiger Creek among its rocks and White Deer in its cave.

Where worthies of old left their traces, we carry on the joy of nurture; drawing on their pure grace, the pine wind soughs.

A hundred years to cultivate a person, ten years to grow a tree; strive with vigor, and lift up our nation.

Phoenix flowers in bloom: the Lu Jiaxi statue at Xiamen Datong Middle School

Vector DB Research: Comparing Milvus with Elasticsearch

2025-01-16 00:13:32

Background

In Elasticsearch application scenarios, storing large amounts of data may significantly impact Elasticsearch's read and write performance, so indexes need to be split according to certain data types. Through the relevant technical research, this article examines whether splitting data across Elasticsearch indexes will affect query results in AI search scenarios. It also compares the implementation principles of other vector databases currently available in the industry with our current use of Elasticsearch.

Goals

  1. Elasticsearch vs. Milvus: Comparison in AIC use cases

    Investigate the data storage mechanisms and query processes of mainstream vector databases in the current industry (Qdrant, Milvus). Conduct an in-depth analysis of how they handle data updates (such as incremental updates and deletion operations) and compare them with Elasticsearch.

  2. The impact of single-table and multi-table design on similarity calculation in the Elasticsearch BM25 model

Study the differences between single-index and multi-index structures in Elasticsearch's BM25 calculation, particularly their impact on efficiency and accuracy.

Elasticsearch vs. Milvus: Comparison in storage, query, etc.

Overall Architecture

Elasticsearch Architecture

The Elasticsearch architecture is straightforward. Each node in a cluster can handle requests and redirect them to the appropriate data nodes for searching. We use blue-green deployment for scaling up or down, which improves stability during such changes.

Cons: currently we use only two types of Elasticsearch nodes, master nodes and data nodes. Every data node serves all the remaining roles, which may not be as clear-cut as Milvus's architecture.

Multiple Milvus Architectures

Milvus Lite is the core search-engine component with embedded storage, intended for local prototype verification. It is written in Python and can be integrated into any Python AI project.

Milvus Standalone is based on Docker Compose, with a Milvus instance, a MinIO instance, and an etcd instance. Milvus Distributed is used in the cloud and in production, with all the required modules. In most cases, this report is talking about Milvus Distributed.

Milvus Distributed Architecture

Milvus has a shared-storage, massively parallel processing (MPP) architecture, with storage and computing resources independent of one another. The data plane and the control plane are disaggregated, and the architecture comprises four layers: access layer, coordinator services, worker nodes, and storage. Each layer is independent of the others, for better disaster recovery and scalability.

  • Access Layer: This layer serves as the endpoint for the users. Composed of stateless proxies, the access layer validates client requests before returning the final results to the client. The proxy uses load-balancing components like Nginx and NodePort to provide a unified service address.
  • Coordinator Service: This layer serves as the system’s brain, assigning tasks to worker nodes. The coordinator service layer performs critical operations, including data management, load balancing, data declaration, cluster topology management, and timestamp generation.
  • Worker Nodes: The worker nodes follow the instructions from the coordinator service layer and execute data manipulation language (DML) commands. Due to the separation of computing and storage, these nodes are stateless in nature. When deployed on Kubernetes, the worker nodes facilitate disaster recovery and system scale-out.
  • Storage: Responsible for data persistence, the storage layer consists of meta storage, log broker, and object storage. Meta storage stores snapshots of metadata, such as message consumption checkpoints and node status. On the other hand, object storage stores snapshots of index files, logs, and intermediate query results. The log broker functions as a pub-sub system supporting data playback and recovery.

Even a minimal standalone Milvus deployment needs an object storage service such as MinIO or S3, a standalone etcd cluster, and a Milvus instance. It is quite a complex architecture, mainly deployed and used on Kubernetes.

Summary

  • Complexity: Elasticsearch is simple, with only master nodes and data nodes. Milvus is complex, requiring object storage, etcd, and several types of Milvus nodes, though it can be deployed with Amazon EKS.
  • Potential bottleneck: as an Elasticsearch cluster grows, more replicas may be needed to balance queries and avoid hot zones. In Milvus, etcd needs high-performance disks to serve metadata well and can become a bottleneck as query volume increases; files on object storage must also be pulled to the local disk and loaded into memory for querying, and if this swapping happens frequently, performance may suffer.
  • Scaling: Elasticsearch requires a blue-green deployment to scale an online cluster. Milvus is easy to scale on Kubernetes; the number of compute node instances can be changed on demand.
  • Storage: Elasticsearch stores data on each data node's hard disk, so adding storage means adding data nodes; S3 is only used as backup storage. Milvus is object-storage based; S3 can be used to store all the data.
  • AA switch: Elasticsearch requires two identical clusters. Milvus needs no active-active switch; just reload the query nodes or add more of them.
  • Upgrade: Elasticsearch upgrades work like scaling (blue-green). Milvus upgrades with a Helm command on the Kubernetes cluster.

Data Writing Flow

Index Flow in Elasticsearch

In this diagram, we can see how a new document is stored by Elasticsearch. As soon as it “arrives”, it is committed to a transaction log called “translog” and to a memory buffer. The translog is how Elasticsearch can recover data that was only in memory in case of a crash.

All the documents in the memory buffer will generate a single in-memory Lucene segment when the “refresh” operation happens. This operation is used to make new documents available for search.

Depending on different triggers, all of those segments are eventually merged into a single segment and saved to disk, and the translog is cleared.
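As a concrete illustration, here is a minimal sketch using the official elasticsearch Python client; the index name and document are made up for the example. It shows that a freshly indexed document only becomes searchable after a refresh.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# The document is appended to the translog and to the in-memory buffer.
es.index(index="articles", id="1", document={"title": "Vector DB research"})

# Force a refresh so the buffered documents form a searchable Lucene segment.
# (Normally this happens automatically every index.refresh_interval, 1s by default.)
es.indices.refresh(index="articles")

resp = es.search(index="articles", query={"match": {"title": "vector"}})
print(resp["hits"]["total"])
```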

This diagram shows the whole routine for a simple index request.

Data Writing Flow in Milvus

The picture above shows all the modules used in data writing. All data-writing requests are triggered from the SDK, which sends the request through the load balancer to a proxy node (the number of proxy instances can vary). The proxy caches the data and requests segment information in order to write the data into the message storage.

The message storage is mainly a Pulsar-based platform for persisting the data, playing the same role as the translog in Elasticsearch. The main difference is that Milvus doesn't need an MQ service in front: you can write data directly through its interface, and there is no need for the bulk requests used in Elasticsearch.

The data node consumes the data from the message storage and finally flushes it into the object storage.
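From the client's perspective this whole write path hides behind a single call. A minimal pymilvus sketch (collection name, dimension, and data are assumptions for illustration):

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus

# Quick-start collection: an "id" primary key plus a "vector" field.
client.create_collection(collection_name="articles", dimension=4)

# The proxy routes this insert into the message storage (Pulsar); data nodes
# consume it from there and eventually flush segments to object storage.
client.insert(
    collection_name="articles",
    data=[{"id": 1, "vector": [0.1, 0.2, 0.3, 0.4]}],
)
```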

Data Model in Vector Databases

Data Model in Elasticsearch

As we can see from the diagram, Elasticsearch shards each Lucene index across the available nodes. A shard can be a primary or a replica shard. Each shard is a Lucene index, each of those indexes can have multiple segments, and each segment is a complete HNSW graph.

Data Model in Milvus

Milvus provides users with a top-level concept called a Collection, which maps to a table in a traditional database and is equivalent to an Index in Elasticsearch. Each Collection is divided into multiple Shards, two by default. The number of Shards depends on how much data you need to write and how many nodes you want to distribute the writes across.

Each Shard contains many Partitions, which have their own data attributes. A Shard itself is divided based on the hash of the primary key, while Partitions are often divided by fields or Partition Tags that you specify. Common partitioning schemes include dividing by the date of data entry, by user gender, or by user age. One major advantage of Partitions during queries is that specifying a Partition Tag can filter out a lot of data.

Shards are more about helping you scale write operations, while Partitions help improve read performance. Each Partition within a Shard corresponds to many small Segments. A Segment is the smallest scheduling unit in the whole system and is either a Growing Segment or a Sealed Segment. A Growing Segment is subscribed to by the Query Node, and users keep writing into it until it becomes large enough; once it reaches the default limit of 512 MB, writing is prohibited and it becomes a Sealed Segment, at which point vector indexes are built for it.

The storage is organized by segment in a columnar layout, where each primary key, column, and vector is stored in a separate file.
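To make the hierarchy concrete, here is a hedged pymilvus sketch using the ORM-style API; the collection name, field layout, and partition name are assumptions. The collection is created with two shards, and a partition is added so queries can skip unrelated data.

```python
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(uri="http://localhost:19530")  # assumed local Milvus

schema = CollectionSchema(fields=[
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("vector", DataType.FLOAT_VECTOR, dim=768),
])

# Two shards spread writes across data nodes (hashed on the primary key).
coll = Collection("articles_by_date", schema, shards_num=2)

# Partitions split data by an attribute you choose, e.g. ingestion date,
# so a partition tag at query time can filter out most of the data.
coll.create_partition("2025_01")
```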

Vector Query

Index Types

Both Elasticsearch and Milvus require memory to load vector files and perform queries, but Milvus offers a file-based index type named DiskANN for large datasets, which doesn't require loading all the data into memory, only the index, reducing memory consumption.

As for Elasticsearch, dense_vector with HNSW is the only solution. The default element type is float, but Elasticsearch provides optimized HNSW variants to reduce the index size or increase performance. To use a quantized index, you can set the index type to int8_hnsw, int4_hnsw, or bbq_hnsw.
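A minimal sketch of such a mapping with the elasticsearch Python client (index name, field name, and dimension are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# A dense_vector field indexed as an HNSW graph. "int8_hnsw" additionally
# stores int8-quantized vectors, shrinking memory at a small cost in accuracy.
es.indices.create(
    index="vectors",
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "int8_hnsw"},
            }
        }
    },
)
```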

Milvus supported index types, their classification, and target scenarios:

  • FLAT (no special classification): relatively small dataset; requires a 100% recall rate.
  • IVF_FLAT (no special classification): high-speed query; requires a recall rate as high as possible.
  • IVF_SQ8 (quantization-based): very high-speed query; limited memory resources; accepts a minor compromise in recall rate.
  • IVF_PQ (quantization-based): high-speed query; limited memory resources; accepts a minor compromise in recall rate.
  • HNSW (graph-based): very high-speed query; requires a recall rate as high as possible; large memory resources.
  • HNSW_SQ (quantization-based): very high-speed query; limited memory resources; accepts a minor compromise in recall rate.
  • HNSW_PQ (quantization-based): medium-speed query; very limited memory resources; accepts a minor compromise in recall rate.
  • HNSW_PRQ (quantization-based): medium-speed query; very limited memory resources; accepts a minor compromise in recall rate.
  • SCANN (quantization-based): very high-speed query; requires a recall rate as high as possible; large memory resources.
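As an illustration, a hedged pymilvus sketch that builds a graph-based HNSW index on the collection from the earlier example (parameter values are illustrative, not tuned recommendations):

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus

# M and efConstruction trade build time and memory for recall.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 200},
)
client.create_index(collection_name="articles_by_date", index_params=index_params)
```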

Query Flow in Elasticsearch

The query phase above consists of the following three steps:

  1. The client sends a search request to Node 3, which creates an empty priority queue of size from + size.
  2. Node 3 forwards the search request to a primary or replica copy of every shard in the index. Each shard executes the query locally and adds the results into a local sorted priority queue of size from + size.
  3. Each shard returns the doc IDs and sort values of all the docs in its priority queue to the coordinating node, Node 3, which merges these values into its own priority queue to produce a globally sorted list of results.

The distributed fetch phase consists of the following steps:

  1. The coordinating node identifies which documents need to be fetched and issues a multi GET request to the relevant shards.
  2. Each shard loads the documents and enriches them, if required, and then returns the documents to the coordinating node.
  3. Once all documents have been fetched, the coordinating node returns the results to the client.
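A hedged sketch of the from + size search that drives those per-shard priority queues and the subsequent fetch (index name and query are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Each shard builds a priority queue of from + size = 20 entries; the
# coordinating node merges them and fetches only results 10..19.
resp = es.search(
    index="articles",
    query={"match": {"title": "vector"}},
    from_=10,
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```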

Query Flow in Milvus

In the reading path, query requests are broadcast through DqRequestChannel, and query results are aggregated to the proxy via gRPC.

As a producer, the proxy writes query requests into DqRequestChannel. The way Query Nodes consume DqRequestChannel is quite special: each Query Node subscribes to the channel, so every message in it is broadcast to all Query Nodes.

After receiving a request, the Query Node performs a local query and aggregates at the Segment level before sending the aggregated result back to the corresponding Proxy via gRPC. Note that each query request carries a unique ProxyID identifying its originator; based on it, Query Nodes route the different query results to their respective Proxies.

Once it determines that it has collected all of the Query Nodes' results, the Proxy performs a global aggregation to obtain the final query result and returns it to the client. Note that both queries and results carry the same unique RequestID marking each individual query; based on this ID, the Proxy distinguishes which set of results belongs to which request.
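All of this machinery is invisible to the client: a single pymilvus search call (collection and vector are the assumptions from the earlier write example) triggers the broadcast and the two-level aggregation described above.

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus

client.load_collection("articles")  # segments must be loaded before querying

# The proxy broadcasts this request to the query nodes; each aggregates at
# the segment level, then the proxy performs the global top-K aggregation.
results = client.search(
    collection_name="articles",
    data=[[0.1, 0.2, 0.3, 0.4]],           # query vector(s)
    limit=3,
    search_params={"metric_type": "COSINE"},
)
print(results)
```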

Comparing BM25 between Elasticsearch and Milvus

Why we still care about BM25 in RAG

Hybrid search has long been an important method for improving the quality of Retrieval-Augmented Generation (RAG). Despite the remarkable performance of dense embedding-based search techniques, which have made significant progress in building deep semantic interactions between queries and documents as model scale and pre-training datasets have grown, notable limitations remain. These include poor interpretability and suboptimal performance on long-tail queries and rare terms.

For many RAG applications, pre-trained models often lack domain-specific corpus support, and in some scenarios, their performance is even inferior to BM25-based keyword matching retrieval. Against this backdrop, Hybrid Search combines the semantic understanding capabilities of dense vector search with the precision of keyword matching, offering a more efficient solution to address these challenges. It has become a key technology for enhancing search effectiveness.

How to calculate BM25

BM25 (Best Matching 25) is a ranking function used by search engines to estimate the relevance of documents to a given search query.

$$\mathrm{score}(D,Q)=\sum_{i=1}^{n}\mathrm{IDF}(q_i)\cdot\frac{f(q_i,D)\cdot(k_1+1)}{f(q_i,D)+k_1\cdot\left(1-b+b\cdot\frac{|D|}{\mathrm{avgdl}}\right)}$$

Here is the BM25 formula for a query Q on a document D, where Q contains the keywords q1, q2, …, qn.

  1. f(qi,D) is the number of times the keyword qi occurs in the document D.
  2. |D| is the length of the document D in words.
  3. avgdl (average document length) is the average document length in the text collection from which documents are drawn.
  4. k1 and b are free parameters used for advanced optimization. Commonly, k1 is between 1.2 and 2.0, and b = 0.75.

$$\mathrm{IDF}(q_i)=\ln\!\left(\frac{N-n(q_i)+0.5}{n(q_i)+0.5}+1\right)$$

This is the IDF (inverse document frequency) weight of the query term qi, where N is the total number of documents in the collection and n(qi) is the number of documents containing qi.
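A direct transcription of the two formulas into Python, for readers who want to check the math; the toy corpus at the bottom is made up.

```python
import math

def idf(N: int, n_q: int) -> float:
    """Inverse document frequency: N docs in total, n_q contain the term."""
    return math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)

def bm25(query: list[str], doc: list[str], corpus: list[list[str]],
         k1: float = 1.2, b: float = 0.75) -> float:
    """Score one tokenized document against a tokenized query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N   # average document length
    score = 0.0
    for term in query:
        f = doc.count(term)                   # term frequency f(q_i, D)
        n_q = sum(term in d for d in corpus)  # documents containing q_i
        score += idf(N, n_q) * (f * (k1 + 1)) / (
            f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["vector", "search"], ["keyword", "search"], ["vector", "db"]]
print(bm25(["vector"], corpus[0], corpus))
```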

Why TF-IDF (BM25) as the main calculation

A term that appears in many documents does not provide as much information about the relevance of a document. Using a logarithmic scale ensures that as the document frequency of a term increases, its influence on the BM25 score grows more slowly. Without the logarithm, common terms would disproportionately affect the score. For example, with N = 1,000,000 documents, a rare term appearing in 100 documents gets an IDF of about 9.2, while a common term appearing in 500,000 documents gets only about 0.69.

How Elasticsearch calculates BM25

By default, Elasticsearch calculates scores on a per-shard basis, leveraging Lucene's built-in org.apache.lucene.search.similarities.BM25Similarity, which is also the default similarity in Lucene's IndexSearcher. If we want index-level score calculation, we need to change the search_type from query_then_fetch to dfs_query_then_fetch.

In a dfs_query_then_fetch search, Elasticsearch adds org.elasticsearch.search.dfs.DfsPhase to the search. It collects the statistics into a DfsSearchResult, which contains each shard's document information, hits, and so on. The SearchPhaseController then aggregates all the DFS results into an AggregatedDfs to calculate the score. We can use this search type to get a consistent BM25 score across multiple indexes.
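A hedged sketch of toggling this behavior from the Python client (index pattern and query are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Default: per-shard IDF (query_then_fetch). With dfs_query_then_fetch,
# a pre-query DFS phase gathers global term statistics first, so BM25
# scores are consistent across shards and indexes.
resp = es.search(
    index="articles-*",                      # cross-index query
    query={"match": {"title": "vector"}},
    search_type="dfs_query_then_fetch",
)
print(resp["hits"]["max_score"])
```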

Do we need to use dfs_query_then_fetch in cross-index queries

The only difference between a multi-index and a per-shard BM25 calculation is the IDF. If the data are well distributed across all the indexes and the document count in every shard is large enough, the difference in IDF will be tiny because of the logarithmic scale (see the IDF example above). In that scenario we don't need dfs_query_then_fetch to calculate a global BM25, which requires more resources to cache and compute.

Sparse-BM25 in Milvus

Starting from version 2.4, Milvus supports sparse vectors, and from version 2.5 it provides BM25 retrieval capabilities based on sparse vectors. With the built-in Sparse-BM25, Milvus offers native support for lexical retrieval. The specific features include (a setup sketch follows the list):

  1. Tokenization and Data Preprocessing: Implemented based on the open-source search library Tantivy, including features such as stemming, lemmatization, and stop-word filtering.
  2. Distributed Vocabulary and Term Frequency Management: Efficient support for managing and calculating term frequencies in large-scale corpora.
  3. Sparse Vector Generation and Similarity Calculation: Sparse vectors are constructed using the term frequency (Corpus TF) of the corpus, and query sparse vectors are built based on the query term frequency (Query TF) and global inverse document frequency (IDF). Similarity is then calculated using a specific BM25 distance function.
  4. Inverted Index Support: Implements an inverted index based on the WAND algorithm, with support for the Block-Max WAND algorithm and graph indexing currently under development.
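Putting it together, here is a sketch of a Milvus 2.5-style full-text search setup with pymilvus. The API shape follows the Milvus docs as I understand them, so treat the collection name, field names, and parameters as assumptions.

```python
from pymilvus import MilvusClient, DataType, Function, FunctionType

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus

schema = MilvusClient.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64,
                 is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR,
                 max_length=65535, enable_analyzer=True)  # tokenized (Tantivy)
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)

# Milvus derives the sparse BM25 vector from the raw text automatically.
schema.add_function(Function(
    name="text_bm25",
    function_type=FunctionType.BM25,
    input_field_names=["text"],
    output_field_names=["sparse"],
))

index_params = client.prepare_index_params()
index_params.add_index(field_name="sparse",
                       index_type="SPARSE_INVERTED_INDEX",
                       metric_type="BM25")

client.create_collection("fulltext_docs", schema=schema,
                         index_params=index_params)
```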

Pros and Cons of Sparse-BM25 in Milvus

  • Full-text search in Milvus is still under heavy development; there are plenty of open bugs on GitHub.
  • Full-text search requires creating an extra sparse index on the collection (the document set), so it isn't out of the box as in Elasticsearch.
  • Hybrid search on a collection can rank ANN and BM25 results in a single request and return the top K, similar to Elasticsearch's reciprocal rank fusion (RRF), available since 8.8.0. A sketch of this follows.
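For completeness, a hedged sketch of such a single-request hybrid search in pymilvus; the collection is assumed to combine a dense "vector" field with the sparse BM25 field from the previous example, and the query values are toy placeholders.

```python
from pymilvus import MilvusClient, AnnSearchRequest, RRFRanker

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus

dense_req = AnnSearchRequest(
    data=[[0.1, 0.2, 0.3, 0.4]],             # query embedding (toy dims)
    anns_field="vector",
    param={"metric_type": "COSINE"},
    limit=10,
)
sparse_req = AnnSearchRequest(
    data=["how to compare milvus with elasticsearch"],  # raw text for BM25
    anns_field="sparse",
    param={"metric_type": "BM25"},           # assumed to match the index
    limit=10,
)

# Reciprocal rank fusion merges both rankings into one top-K result,
# much like Elasticsearch's RRF.
results = client.hybrid_search(
    collection_name="hybrid_docs",
    reqs=[dense_req, sparse_req],
    ranker=RRFRanker(60),
    limit=5,
)
print(results)
```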