MoreRSS

site iconManjusaka修改

一个喜欢编程的香港记者,热爱 Python ,讨厌 Java。前饿了么。「捕蛇者说」播客之一。
请复制 RSS 到你的阅读器,或快速订阅到 :

Inoreader Feedly Follow Feedbin Local Reader

Manjusaka的 RSS 预览

怎么样 tracing 你的 SQL?

2026-02-22 18:49:00

过年是个好时候,可以猛猛干活或者看论文。不过上了几天磨,看了几天论文后觉得还是需要整点活

所以这篇文章来聊聊一个经典话题,How to trace your SQL?

正文

在 OpenTelementry 这一套开始铺展开来后,我们对于代码整个生命周期的 tracing 有了一个较为成熟的方案。包括各类 auto-instrumentation 的库,能够让我们在不修改代码的情况下就能对整个调用链进行 tracing。

但是始终有一朵乌云盘绕着 Tracing 世界的大厦上,我们怎么样将 SQL 的执行如同业务代码一样从黑盒中拆出来

在现阶段,我们对于整个 Tracing 的引入都是在做加法,我们选择构建一个 context,注入一些 metadata,让代码中不同的环节都可以获取到上下文。但是还是有一个问题,怎么样在 SQL 中做加法呢?

最直观的想法是,我们可以尝试一个类似机制

1
2
3
4
begin;
set saka.tracing_id = '1234567890';
select * from users where id = 1;
commit;

我们可以在 SQL 中设置一些 tracing_id 之类的东西,这样我们就可以在明确一个事务的上下文了。但是这个方案有一个问题,这会改变我们使用 SQL 的 pattern,我们需要在每个 SQL 语句前面都加上 set saka.tracing_id = ‘1234567890’ 这样的东西,这样就会导致我们在代码中需要修改大量的 SQL 语句,这显然是不可行的。

那么另外一种方式实现的我们可以将 tracing 注入到 SQL 中,something like this:

1
select * from users where id = 1 /* tracing_id: 1234567890 */;

那么怎么做?

Google 提出了一个通用的方案 or 叫一个事实上的的标准吧,叫作 SQLCommenter,它的核心思想就是通过 Hook ORM 等手段,让我们尽可能简单的在 SQL 中注入一些 comment,这些 comment 中包含了 tracing/Custom Tag 的信息,这样我们可以将元数据注入的成本降到最低。

那么我们来根据 SQLCommenter 的方案来看看我们怎么样在 Python 中实现这个功能,

以 pymysql 为例,我们需要实现这个非常简单

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
from pymysql.cursors import SSCursor, SSDictCursor
from flask import g

from main.utils.context import INJECT_SQL_COMMENT, TRACE_ID
from main.utils.ip_utils import local_ip

try:
CURRENT_IP = local_ip()
except:
CURRENT_IP = "127.0.0.1"


def inject_meta_info(query: str) -> str:
if INJECT_SQL_COMMENT.get():
if TRACE_ID.get() != "None":
trace_id = getattr(g, "TRACE_ID")
sql_comment = f"/*X-Amzn-Trace-Id={trace_id}*/"
query = f"{sql_comment} {query}"
query = f"/*source_ip={CURRENT_IP}*/ {query}"
return query


class CustomSSCursor(SSCursor):
def execute(self, query: str, args: Any = None) -> int:
return super(CustomSSCursor, self).execute(inject_meta_info(query), args)


class CustomSSDictCursor(SSDictCursor):
def execute(self, query: str, args: Any = None) -> int:
return super(CustomSSDictCursor, self).execute(inject_meta_info(query), args)

然后我们在使用时,注入自定义的 Cursor 就好了

Python 的生态还是幸福,但是很可惜,我现在是被迫在写 Node.js 的代码了,Node.js 的生态就没有那么幸福了。由于历史原因,我们现在用的是极为美味的 Prisma 作为 ORM。那么我们需要在 Prisma 来看一下怎么样注入 SQL Comment。

首先在 Prisma 最新的 v7.x 版本中,Prisma 本身实现了 SQLCommenter 的功能,something like this:

1
2
3
4
5
6
7
8
9
10
import { queryTags, withQueryTags } from "@prisma/sqlcommenter-query-tags";
import { PrismaClient } from "../prisma/generated/client";
const prisma = new PrismaClient({
adapter,
comments: [queryTags()],
});
// Wrap your queries to add tags
const users = await withQueryTags({ route: "/api/users", requestId: "abc-123" }, () =>
prisma.user.findMany(),
);

最终会生成类似这样的 SQL

1
SELECT ... FROM "User" /*requestId='abc-123',route='/api/users'*/

OK, 很不错,是预期内行为

但是问题在于 Prisma V7 是一个极为屎一样的版本,我们完全无法如品鉴母鸡卡一样品鉴这个功能。因为性能问题(v7 比 v6 慢了 30%-40%),我们完全无法在生产环境中使用 v7 的版本,所以我们只能在 v6 中实现这个功能了。

在 Prisma v6 中,Prisma 数据映射部分和核心的 Query Engine 是完全分开,他们通过走 NAPI-RS 进行通信,而他们自定义了一套 json based 的传输协议,协议样例如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
{
"modelName": "User", // 可选,raw query 不需要
"action": "findMany", // 操作类型
"query": { // 查询详情
"arguments": { // where/orderBy/take 等
"where": { "email": { "contains": "prisma.io" } }
},
"selection": { // 字段选择
"$scalars": true,
"$composites": true,
"posts": { // 关联查询嵌套
"arguments": {},
"selection": { "$scalars": true }
}
}
}
}

而在这一次,我想实现类似官方的语义

1
2
3
4
return this.prisma.post.findMany({
take: 100,
sqlComments: getSqlComments(req),
});

OK,那么我们直接用一个流程图来输出一下对应的实现流程

flowchart TD    subgraph TS["TypeScript (prisma repo)"]        A["用户代码<br/>prisma.user.findMany({sqlComments: {...}, where: {...}})"]        B["getPrismaClient.ts :: _executeRequest()<br/>调用 serializeJsonQuery()"]        C["serializeJsonQuery.ts :: serializeJsonQuery() <b>[改动1]</b><br/>提取 sqlComments 放入 JsonQuery 顶层<br/>输出: {modelName, action, query, sqlComments}"]        D["RequestHandler.ts :: singleLoader / batchLoader<br/>调用 _engine.request(protocolQuery, {traceparent})"]        E["LibraryEngine.ts :: request()<br/>JSON.stringify(query) + JSON.stringify({traceparent})<br/>engine.query(queryStr, headerStr, txId)"]    end    subgraph FFI["C ABI / NAPI FFI 边界"]        F(("FFI"))    end    subgraph Rust["Rust (prisma-engines repo)"]        G["QueryEngine::query()<br/>解析 body_str → RequestBody<br/>从 header 提取 traceparent"]        H["RequestHandler::handle() <b>[改动2]</b><br/>body.into_doc() → (QueryDocument, SqlCommentsVec)<br/>组装 QueryContext {traceparent, sql_comments}"]        I["Single 查询<br/>QueryContext::new(traceparent, sql_comments#0)"]        J["Batch 查询<br/>每个 operation 独立 QueryContext<br/>共享 traceparent,各自 sql_comments"]        K["JsonBody::into_doc() <b>[改动3]</b><br/>JsonSingleQuery 新增 sql_comments 字段<br/>extract_sql_comments() → Vec&lt;(String,String)&gt;"]        L["QueryExecutor::execute() <b>[改动4]</b><br/>traceparent → query_context: QueryContext<br/>管道传递: execute_operation → interpreter → read/write"]        M["interpreter :: read.rs / write.rs <b>[改动5]</b><br/>query_context.sql_trace() → SqlTrace {traceparent, sql_comments}"]        N["ReadOperations / WriteOperations <b>[改动6]</b><br/>traceparent: Option&lt;TraceParent&gt; → trace: SqlTrace"]        O["sql-query-connector :: connection.rs <b>[改动7]</b><br/>Context::new(&connection_info, trace)"]        P["sql-query-builder :: read/write/select <b>[改动8]</b><br/>构建 Quaint AST<br/>调用 .add_trace_id(ctx)"]        Q["sql_trace.rs :: add_trace_id() <b>[改动9]</b><br/>build_trace_comment(sql_comments, traceparent)<br/>1. sql_comments 按 key 字母排序<br/>2. key/value URL 编码, 格式: key='value'<br/>3.traceparent sampled 则追加<br/>4. 逗号拼接, 调用 .comment(result)"]        R["Quaint AST :: .comment(...)<br/>渲染时在 SQL 末尾追加 /* ... */"]    end    S["最终 SQL<br/>SELECT ... FROM &quot;User&quot; WHERE ...<br/>/* controller='UserController',route='%2Fapi%2Fusers' */"]    A --> B --> C --> D --> E    E --> F --> G --> H    H --> I --> L    H --> J --> L    H -.->|内部调用| K    L --> M --> N --> O --> P --> Q --> R --> S

OK,在调整完 FFI ,扩展完 Query Engine 后,我们再调整一下 Prisma 代码生成相关的部分即可

然后我们可以在业务代码中这样使用

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
function getSqlComments(req: Request): Record<string, string> {
const comments: Record<string, string> = {};

// Request info
comments.route = req.path;
comments.method = req.method;
if (req.route?.path) {
comments["route_pattern"] = req.route.path;
}

// OpenTelemetry trace context
const span = trace.getSpan(context.active());
if (span) {
const ctx = span.spanContext();
comments.traceparent = `00-${ctx.traceId}-${ctx.spanId}-0${ctx.traceFlags}`;
}

return comments;
}

@Controller()
export class AppController {
constructor(private readonly prisma: PrismaService) {}

@Get()
getHello(): string {
return `Hello World!`;
}

@Get("posts")
getPosts(@Req() req: Request) {
return this.prisma.post.findMany({
take: 100,
sqlComments: getSqlComments(req),
});
}

@Get("posts/:id")
getPostsById(@Param("id") id: string, @Req() req: Request) {
return this.prisma.post.findUnique({
where: { id },
sqlComments: getSqlComments(req),
});
}

@Get("posts-with-comments")
getPostsWithComments(@Req() req: Request) {
return this.prisma.post.findMany({
take: 100,
include: {
comments: true,
},
sqlComments: getSqlComments(req),
});
}

@Get("posts-with-comments/:id")
getPostWithCommentsById(@Param("id") id: string, @Req() req: Request) {
return this.prisma.post.findUnique({
where: { id },
include: {
comments: true,
},
sqlComments: getSqlComments(req),
});
}

@Post("posts")
createPost(@Body() body: { title: string; body: string }, @Req() req: Request) {
return this.prisma.post.create({
data: {
title: body.title,
body: body.body,
},
sqlComments: getSqlComments(req),
});
}
}

然后我们可以得到这样的 SQL

1
2
3
4
2026-02-22 07:56:38.681 GMT [190] LOG:  execute s412381: SELECT "public"."Post"."id", "public"."Post"."title", "public"."Post"."body", "public"."Post"."createdAt", "public"."Post"."modifiedAt" FROM "public"."Post" WHERE ("public"."Post"."id" = $1 AND 1=1) LIMIT $2 OFFSET $3 /* method='GET',route='%2Fposts%2Fff4ccd6d-2c15-4979-a5f3-0c27e4e2f169',route_pattern='%2Fposts%2F:id',traceparent='00-e984180b391935fbdf2b1e1f6f3b2b12-1286ff6754fbbc57-01' */
2026-02-22 07:56:38.681 GMT [190] DETAIL: parameters: $1 = 'ff4ccd6d-2c15-4979-a5f3-0c27e4e2f169', $2 = '1', $3 = '0'
2026-02-22 07:56:38.681 GMT [190] LOG: execute s412382: SELECT "public"."Post"."id", "public"."Post"."title", "public"."Post"."body", "public"."Post"."createdAt", "public"."Post"."modifiedAt" FROM "public"."Post" WHERE ("public"."Post"."id" = $1 AND 1=1) LIMIT $2 OFFSET $3 /* method='GET',route='%2Fposts%2Fff4ccd6d-2c15-4979-a5f3-0c27e4e2f169',route_pattern='%2Fposts%2F:id',traceparent='00-fa7f8686b25c2efe8942718207c28079-bbeb59140c47895c-01' */
2026-02-22 07:56:38.681 GMT [190] DETAIL: parameters

通常来说,这样的语句已经能在我们常见的数据库调试流程中已经起到很大的帮助了,可以查到某一条慢 SQL 的来源,上下文等信息

但是这就够了吗?我们能不能把数据库也接入到 OpenTelemetry 的 tracing 体系中来呢?我们能不能在数据库的层面上看到整个调用链的 tracing 信息呢?

那没有问题的,这里以我现在更熟悉一点的 PostgreSQL 生态举个例子

Datadog 之前有一个工作,他们在 PostgreSQL 上实现了一个扩展叫作 pg_tracing https://github.com/DataDog/pg_tracing

通过使用 PostgreSQL 的一些扩展点

  1. post_parse_analyze_hook
  2. planner_hook
  3. ExecutorStart_hook
  4. ExecutorRun_hook
  5. ExecutorFinish_hook
  6. ExecutorEnd_hook
  7. ProcessUtility_hook
  8. xact_callback

然后配合一些环形缓冲区和共享内存,就可以在数据库层面上实现一个 tracing 的功能了,这样我们就可以将 PostgreSQL 接入到 OTEL 生态中了。当然这个库的实现里面有不少小技巧,改天可以单独写个文章来聊聊

MySQL 虽然内部的实现是一坨,但是我想如果要在 MySQL 上实现类似的功能应该工作量也不会太大,

最终的效果如下所示

OTEL tracing

差不多就这样

总结

大家都在写各种 AI/Agent 的文章的时候,我还在搞点这种 old school 的东西。恍惚间看到我的前方有一个巨大的风车。But anyway, 我喜欢这些东西, 这就够了

祝大家看的开心。

笑ってほしくて

2026-01-01 23:00:00

每年都会选一句话作为年终总结的标题,去年是“本当の僕らをありがとう”,今年我选择“笑ってほしくて”。

这句话出自 《葬送的芙莉莲》的片尾曲《Anytime anywhere》。

愿你露出笑容

可能是我想对逝去之人,逝去的猫说的,可能也是他们想对我说的

也是我想对所有看到这篇文章的人说的

开篇

实际上一度想放弃写这篇文章,因为总是会害怕。今年在生死边缘搏战了许久。但终究是没有挽留住。每次想起点点滴滴时,总会不免破防

但是就如同歌词所说

こんなに胸が痛いのは/胸口传来的痛楚

あなたといた証かな/是与你同在的证明

那么所以还是要写点什么来纪念这特殊的2025, 也是 saka 的2025

生活

今年生活中最大的事件就是猫咪的过世。小熊是我们在2023年从小区收养的猫咪。一最开始就给上了强度,重度口炎+肾衰。然后后续情况稳定后带回家进行日常的治疗。

而在24年经过一整年的折腾(开腹两次,病危三次),25年年初医生终于说可以两月复查后,我们以为小熊还能陪伴我们很长时间

但是事与愿违,在6月份过完到家纪念日后,小熊的状态就时好时坏。在8月份最后一次住进医院后,虽然期间还是有好转的时候,但是整体情况还是急转直下。最终我们选择放他离开

很难说,小熊的离去对我的打击是怎么样的。我同事说感觉我从中恢复过来还蛮快的。不过中个泪流时分也只有自己知道(其实写这篇文章的时候也是在默默流泪)

今年也是我好友离开我的第五个年头。恍惚间,我已经比他大了

君埋泉下泥销骨,我寄人间雪满头

不得不说古人是真的会写啊。

聊点开心的,今年我的日常生活比去年又丰富了很多。史无前例的,saka 获取了,他的,第一个全成就游戏!所以 2025 saka 严选最佳游戏的是?

是《空之轨迹 1st 重制版》

对不起 P5R,我也很爱你的(

P5R 白金

今年也还在继续的拍照,买了 Z135 F1.8S 圆神,Z105 F2.8 百微,Z35 F1.2S 三个头,在摄影器材的路上越走越远,整体的快门数快 10w 了,整体以猫狗为主。某种意义上来说,我很庆幸用相机记录下了小熊这几年最精彩的时光

今年对于我来说另一个在最大的改变就是,我去了出来能去的最远的地方——日本

我精神图腾中的几大支柱(奥特曼,摇曳露营)都发源自日本,而我第一次出国便是去日本(某种意义上算冥冥之中的一种巧合)

在日本出差的这两周里,解锁了很多新奇的体验。去秋叶原爆米,去抚子的故乡爆米,在出租车上和司机一起唱奥特曼主题曲,和好友一起逛街爆米,和同事一起去热海合宿,见了好看的海边,去了沼津,吃了好吃的海鲜饭等等等。

某种意义上今年开启了我过去很多人生中未曾想过的体验。也让我坚定的了一个想法、

人与人的欢乐实际上是可以相通的

铁道痴

感情

感情进入了第七个年头。如果要为今年的感情沉淀一个主题的话,那可能就是四个字,生死相依

小熊走前的一个月,我们在医院陪护。仪器的滴鸣声,时而奔走的医护。时而逝去的生命。那种氛围真的很压抑

这种时候,两个人的相互的守望,可能是枯燥且祈祷着奇迹发生的日子里,已然发生少数的奇迹。

小熊离去后,我们两也久违的带着小狗出远门去散心,去一起给小狗拍了很多好看的照片,一起救助了其余的小猫,一起给医院捐献了设备。

能在人群中遇到彼此守望的伴侣,本身已然是奇迹

工作&技术

今年年初在朋友邀请下,我加入了一家 AIGC 创业公司,负责一些 infra 上的工作。这算是我职业生涯一个全新的转折点。

如果说这里面最大的感受是什么,那就是我可能在之前聊过的,身份转变所带来的更多的责任与事情。

先聊聊务虚的部分,往常我只需要 focus 在具体的事情的落地,可能其余的东西都会有人帮我兜底。而我今年开始成为一个试图去帮助其余同事能够更好落地事情的人。说实话这种身份上的转变实际上会让你有很多想法转变。就如同我和团队成员 1:1 沟通时我一直在问的一个问题一样“你觉得我还能做什么,能帮助你落地现在的事情?”。很多时候我去思考事情的时候,我不能仅考虑这个事情本身,而是我需要去思考这个事情怎么样才能够帮助团队/同事实现更好的价值。

与此同时,我的职责以及角色会变得更多元化。这一点也是需要走出舒适区。比如今年我在公司做了很多涉及到 DBA 和搜索相关的工作。严格来说,这一部分工作其实离我的好球区实际上很远,但是如果一个事情是重要的,同时当下没有人比我更擅长这件事(换句话说让我做不会是最坏的结果),那么我就需要去承担起这份责任。

接着是务实的部分,如果要说今年很有成就感的事情的话,就是在加入公司后,从头开始做稳定性相关的建设,且收获颇丰。无论是在多次互联网 infra 大范围 crash 后我们能在最短的时间内恢复业务,还是给我们的业务带来了实际上的稳定性与性能提升。我所做的工作都能直接的作用在用户价值上。某种意义上来说,这也是人生的一件幸事

以上都是好的 part,坏的 part 也不是没有,那就是

教练,我想写代码

我现在写代码基本上只能靠工作之外写写代码来保持手感和练手,呜呜呜呜呜,我真的好想转岗去写代码啊(

不过这也不算坏事,我现在非常享受业余时间写代码的生活,所以今年在社区的贡献比去年多了不少,包括把自建图床项目的代码拾掇了很多,给 CPython 的 JIT 贡献了不少代码。也让自己名字第一次在官方文档里留下了痕迹。

如果说明年有什么想法的话,简单的说就是在工作上带好团队,能够继续突破我自己的舒适区。以及希望我能给 CPython 下一阶段的 JIT 贡献更多的代码(其实有一些想法,正好趁着元旦假期梳理一下)

啊,突然想起,我26年还得好好学学 GPU 相关的编程的东西,说实话学习新的东西真的是非常开心的一件事!

saka 的2025

关于 AI

算是凑个热点,聊聊我对 AI 的看法。

很多人都会陷入一种焦虑,“AI 会不会替代我自己”

我自己的看法愈发的清晰,AI 时代会让人的价值更大,而不是被消泯

这里的逻辑很简单,AI 让一切都变得更为高效,内容的生产更为高效,开发效率更为高效,公司的形态更为灵活,在这种情况下,传统的很多东西在 AI 时代都面临的新的挑战。比如我举几个例子

对于愈发高效的内容生产效率,UGC 等形态的 AI Startup 会面临越来越大的 anti nsfw 的合规的挑战

而对于算力的愈发的渴求,边缘 GPU 算力的管控与接入也愈发成为一个重要的课题

AI 带来的生产力提升其实会让人的价值在这个时代更为凸显,而不是被消解

总结

差不多就是这样吧。过去一年有过不少很多的泪水,也有过很多的难眠之夜,也有过很多的快乐。

坦白来说能挺过这一年,是因为身边有着很多的陪伴,有我女朋友,有一群好朋友,有一群好同事。

每一个人的陪伴加上生活中辛酸苦辣甜汇聚在一起就成为了2025年的 saka,或者是 saka 的2025。

曾经想对于以后试图列出很多的展望,但是死生经历过一轮后,觉得剩下的都不太重要了

如同标题一样,

笑ってほしくて/愿你露出笑容

愿我们每一个人都能以笑容度过2026年的每一天

一万年太长,我们只笑今朝(

新年快乐!

成为榜样,但不要成为偶像

2025-10-03 04:00:00

最近刚把空轨 1st 打通关。差不多国庆假期也要结束了,要开始准备接下来的学习和工作了。趁着这个机会,写点东西,当个迟到的 PyCon China 2025 的总结吧

正文

其实去年在结束去年的会议后,我就在考虑要不要彻底退出 PyCon China 一段时间。原因其实也很简单,太累了。不过我向来是个很拖延癌的人。恰逢那时候正在职业的变动期,所以想着要不要再看看。

今年来了一家 AIGC 公司做 infra,说实话我干的还蛮开心的。每天都会学习新的东西。状态相比于去年好了不少(除了每天都需要和该死的 Any 做斗争),所以一度准备继续参加今年的 PyCon China

不过到了筹备期后,原本的计划突生变故。陪伴需求的猫咪病危,去世。一直在医院通宵,陪护以及最终要面临的生死别离。对我的精神和体力造成了极大的消耗。是否继续参加今年的 PyCon China 也成为一个问题。

不过最终还是决定参加了。今年突然出版社那边给我加了一个活,我翻译的书要在现场签售。这就又成为我心里一个新的疑问:嘛,真的会有人来买吗?

嘛,其实这也是每年在参加 PyCon 时都会有的自我怀疑,我真的有资格在这里吗?我讲的东西真的会有人听吗?

不过做了最终的决定那么就继续做吧。所以今年还是给了一个白银赞助+一个主题演讲。

时间过得很快,转眼到了会议前一天,临上飞机前我还在被 AWS 的傻叉 OpenSearch 折磨。下飞机后 yihong,piglei,空想家,jay 他们去开 impact 了。好好好开 impact 不带我是吧。我就孤零零的跑去了酒店。有人演讲前12h文件夹都还没建是谁我不说。

在酒店顶了几个小时把 PPT 赶完,工作收个尾。勉强睡了2h,然后就去会场了。

今年比较轻松的是早上的主持不用我了,我只要负责暖场一下。不过临开场前发现 C 会场的 PPT 还没收,然后摇了晚枫帮忙。

暖场还是我每年的保留节目,Saka 三问.jpg:

  1. 现在还在写 Python 的举个手
  2. 现在还没写 Python 的举个手(拖出去
  3. 用 2.7 的举个手

看起来效果还不错。暖场完我就跑出去了。不过卧槽,今年怎么这么多人来单杀我,妈耶社恐狂哭了

不过说实话,在经历生死一圈后,和老朋友打打闹闹,认识一下新朋友,还是蛮开心的。特别是很多人过来说我影响了他们,我是他们的偶像,他们从我的博客和分享里学到很多的时候。我一直以来的疑惑也得到了解决。虽然说人的自我认同最理想应该是内源性的,但是大家的认可也真的会让我很开心。

哦对了,还有很多人说他们也很喜欢摇曳露营!

摇曳露营的光辉指引我们

花絮

值得一提的是,我司的宝藏同事从日本来一起面基了(是的,我们是 Remote 公司),还给我带了手办

和同事一起

当然有某屑 HR 说要来参会结果早上睡过了,是谁我不点名了

上午的时间其实过去的很快,很快就结束了。然后我就来到了签售地方。出乎意料的,有很多人都来了,有来捧场的老朋友们,有刚刚线下刚对上号的新朋友们。大家一起打打闹闹,我用我的丑字写了很多祝福,也有很多杂话。要说哪一句最真心,那我觉得是“不要用 Next.js”罢。

不要用 Next.js

签售完火急火燎的吃完盒饭,抽烟的时候遇到师父,我开始基情的抚摸他的胸肌,这可能算是每次我们师徒相聚的仪式感了。问他了一个问题:你现在还需要我帮你收尸么?这个出处源自于我们之前约好他要是自杀我会来帮他处理后事。他想了想说,我们不如想想怎么活到150岁吧。很好,很强大,我很喜欢的回答。那我就当不会了。

后面纯爷也来加入了聊天,我顺便向他倾诉了一下把 PostgreSQL 当 MongoDB 用的痛苦。纯爷也只能用爱莫能助的眼光看看我

下午 C 会场实际上因为我需要抽烟提神迟到了几分钟,yihong 帮我暖场缓解尴尬,不得不给他磕一个,以及 yihong 抱起来手感很好,建议网友有条件的可以去试试

下午到我的时候其实因为控场的原因给我的时间比预期的要短一些,所以我临时调整了一些内容。要上去讲的时候,发现很多人都从其余会场赶了过来,算是非常满足了。在 QA 的时候,我给大家说我现在在一家用着 Node.js 写着 Any,把 PostgreSQL 当 MongoDB 用,以及还用着 GraphQL 的公司工作。大家一片会心一笑。我想我们屑 HR 大概今天在现场招不到人了罢

下午的时候,屑 HR 和我们在上海的另外一位同事也来,带来了我需要昏睡红茶零度可乐。说实话在一个 Remote 公司大家面基的机会还是很少的。理所当然,我成为公司群内表情包的一部分。该考虑下找屑 HR 要肖像费了

屑同事们做的表情包

晚上散场后,去和沪爷阿蔡以及几个同事一起组了一个局。不过说实话拉着明天要去霓虹的两位同事去吃日式拉面算不算另一种职场 80?笑死

总结

说实话今年 PyCon 的当天是我这两个月最快乐的一点,也许在很多年后我记不得了今年讲了什么,但是还会记得当天最简单快乐。

嘛,从18年到现在,7年过去了,我也从一个刚出校园的年轻人变成了一个老登。似乎很多东西都在变,但是很多东西又没变。

我还是很菜,但是我好像比之前影响了更多的人,帮到了更多的人?我还有很多东西不会,但是我好像能学的东西也更多了?还是会经历很多痛苦,很多迷茫,很多挣扎。但是生活似乎也还是一如既往的有着无限的希望与美好?

回北京后,一次吃饭时,我给我女朋友说,你知道吗?很多人说我是他们的偶像。我女朋友说:不,你是他们的榜样

是的,成为榜样,但是不要成为偶像。

这篇文章差不多写到这里。要到中秋了,除了祝大家中秋快乐,阖家欢乐以外。也祝大家每个人都能在这个快速迭代的世道里永葆初心。用 Piglei 老师的话说就是“老而不登”

这个世界唯有爱,希望,奥特曼与摇曳露营不可辜负,抚门!

Python 3.14 的进一步性能进化: Tail Call Interpreter

2025-07-02 23:49:00

最近做安全做的我头晕脑胀,来点轻松的换换脑子,让自己放松下

Python 3.14 正式引入了一个新的机制叫作 Tail Call Interpreter(Made by Ken Jin),这无疑又是一个奠定未来基础的重大工作

正文

在聊 Python 3.14 的 Tail Call Interpreter 之前,我们先要来聊 C 语言中最基本的语法 switch-case

1
2
3
4
5
6
7
8
9
10
switch (x) {
case 1:
// do something
break;
case 2:
// do something else
break;
default:
// do default thing
}

对于这样的代码我们最终的汇编会是什么样的呢?可能大家第一反应是先 cmp 然后 je ,不等式秒了,我们编译一个版本来看看

对于这样一段代码

1
2
3
4
5
6
7
8
void small_switch(int x) {
switch(x) {
case 1: printf("One\n"); break;
case 2: printf("Two\n"); break;
case 3: printf("Three\n"); break;
default: printf("Other\n"); break;
}
}

最终汇编的产物会是这样的

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
00000000000011f0 <small_switch>:
11f0:83 ff 02 cmp $0x2,%edi
11f3:74 2b je 1220 <small_switch+0x30>
11f5:83 ff 03 cmp $0x3,%edi
11f8:74 16 je 1210 <small_switch+0x20>
11fa:83 ff 01 cmp $0x1,%edi
11fd:75 31 jne 1230 <small_switch+0x40>
11ff:48 8d 3d fe 0d 00 00 lea 0xdfe(%rip),%rdi # 2004 <_IO_stdin_used+0x4>
1206:e9 25 fe ff ff jmp 1030 <puts@plt>
120b:0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
1210:48 8d 3d f5 0d 00 00 lea 0xdf5(%rip),%rdi # 200c <_IO_stdin_used+0xc>
1217:e9 14 fe ff ff jmp 1030 <puts@plt>
121c:0f 1f 40 00 nopl 0x0(%rax)
1220:48 8d 3d e1 0d 00 00 lea 0xde1(%rip),%rdi # 2008 <_IO_stdin_used+0x8>
1227:e9 04 fe ff ff jmp 1030 <puts@plt>
122c:0f 1f 40 00 nopl 0x0(%rax)
1230:48 8d 3d db 0d 00 00 lea 0xddb(%rip),%rdi # 2012 <_IO_stdin_used+0x12>
1237:e9 f4 fd ff ff jmp 1030 <puts@plt>
123c:0f 1f 40 00 nopl 0x0(%rax)

我们能看到整体如我们所预期的一样,不断的 cmp 然后不断的 je,然后我们评估一下这里的复杂度呢?哦,时间复杂度 O(n) 秒了。

卧槽,那对于 Python 这样一个超大的 switch case 的结构,岂不是每次都是一个 O(n) ?这不得原地升天?

其实不是,通常来说,编译器会根据数据的类型和规模来用不同的方案处理 switch case 的结构

我们来准备几组例子

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
void small_switch(int x) {
switch(x) {
case 1: printf("One\n"); break;
case 2: printf("Two\n"); break;
case 3: printf("Three\n"); break;
default: printf("Other\n"); break;
}
}

void dense_switch(int x) {
switch(x) {
case 10: printf("Ten\n"); break;
case 11: printf("Eleven\n"); break;
case 12: printf("Twelve\n"); break;
case 13: printf("Thirteen\n"); break;
case 14: printf("Fourteen\n"); break;
case 15: printf("Fifteen\n"); break;
case 16: printf("Sixteen\n"); break;
case 17: printf("Seventeen\n"); break;
default: printf("Other\n"); break;
}
}

void sparse_switch(int x) {
switch(x) {
case 1: printf("One\n"); break;
case 100: printf("Hundred\n"); break;
case 1000: printf("Thousand\n"); break;
case 10000: printf("Ten thousand\n"); break;
default: printf("Other\n"); break;
}
}

void large_dense_switch(int x) {
switch(x) {
case 1: printf("Case 1\n"); break;
case 2: printf("Case 2\n"); break;
case 3: printf("Case 3\n"); break;
case 4: printf("Case 4\n"); break;
case 5: printf("Case 5\n"); break;
case 6: printf("Case 6\n"); break;
case 7: printf("Case 7\n"); break;
case 8: printf("Case 8\n"); break;
case 9: printf("Case 9\n"); break;
case 10: printf("Case 10\n"); break;
case 11: printf("Case 11\n"); break;
case 12: printf("Case 12\n"); break;
case 13: printf("Case 13\n"); break;
case 14: printf("Case 14\n"); break;
case 15: printf("Case 15\n"); break;
case 16: printf("Case 16\n"); break;
case 17: printf("Case 17\n"); break;
case 18: printf("Case 18\n"); break;
case 19: printf("Case 19\n"); break;
case 20: printf("Case 20\n"); break;
default: printf("Other\n"); break;
}
}

void mixed_switch(int x) {
switch(x) {
case 1: printf("Case 1\n"); break;
case 2: printf("Case 2\n"); break;
case 3: printf("Case 3\n"); break;

case 50: printf("Case 50\n"); break;

case 100: printf("Case 100\n"); break;
case 101: printf("Case 101\n"); break;
case 102: printf("Case 102\n"); break;

default: printf("Other\n"); break;
}
}

void char_switch(char c) {
switch(c) {
case 'a': printf("Letter a\n"); break;
case 'b': printf("Letter b\n"); break;
case 'c': printf("Letter c\n"); break;
case 'd': printf("Letter d\n"); break;
case 'e': printf("Letter e\n"); break;
default: printf("Other char\n"); break;
}
}

然后我们反汇编以下,看下结果(这里我只把关键的部分贴出来)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
00000000000011f0 <small_switch>:
11f0:83 ff 02 cmp $0x2,%edi # 比较是否为2
11f3:74 2b je 1220 # 跳转到case 2
11f5:83 ff 03 cmp $0x3,%edi # 比较是否为3
11f8:74 16 je 1210 # 跳转到case 3
11fa:83 ff 01 cmp $0x1,%edi # 比较是否为1
11fd:75 31 jne 1230 # 不是则跳转到default

0000000000001240 <dense_switch>:
1240:83 ef 0a sub $0xa,%edi # 减去10 (最小case值)
1243:83 ff 07 cmp $0x7,%edi # 比较范围 (17-10=7)
1246:0f 87 90 00 00 00 ja 12dc # 超出范围跳转default
124c:48 8d 15 15 0f 00 00 lea 0xf15(%rip),%rdx # 加载跳转表地址
1253:48 63 04 ba movslq (%rdx,%rdi,4),%rax # 获取偏移量
1257:48 01 d0 add %rdx,%rax # 计算目标地址
125a:ff e0 jmp *%rax # 间接跳转

00000000000012f0 <sparse_switch>:
12f0:81 ff e8 03 00 00 cmp $0x3e8,%edi # 比较1000
12f6:74 40 je 1338 # 等于则跳转
12f8:7f 16 jg 1310 # 大于1000则继续检查
12fa:83 ff 01 cmp $0x1,%edi # 小于1000,检查1
12fd:74 49 je 1348
12ff:83 ff 64 cmp $0x64,%edi # 检查100
1302:75 24 jne 1328 # 都不是则default
...
1310:81 ff 10 27 00 00 cmp $0x2710,%edi # 检查10000

0000000000001360 <large_dense_switch>:
1360:83 ff 14 cmp $0x14,%edi # 检查是否≤20
1363:0f 87 53 01 00 00 ja 14bc # 超出范围
1369:48 8d 15 18 0e 00 00 lea 0xe18(%rip),%rdx # 跳转表地址
1372:48 63 04 ba movslq (%rdx,%rdi,4),%rax # 直接索引
1376:48 01 d0 add %rdx,%rax
1379:ff e0 jmp *%rax

00000000000014d0 <mixed_switch>:
14d0:83 ff 32 cmp $0x32,%edi # 比较50
14d3:74 7b je 1550
14d5:7f 29 jg 1500 # >50的情况
14d7:83 ff 02 cmp $0x2,%edi # ≤50,检查小值
14da:74 64 je 1540
14dc:83 ff 03 cmp $0x3,%edi
...
1500:83 ff 65 cmp $0x65,%edi # >50,检查100,101,102
1503:74 5b je 1560

0000000000001580 <char_switch>:
1580:83 ef 61 sub $0x61,%edi # 减去'a'的ASCII值
1583:40 80 ff 04 cmp $0x4,%dil # 检查是否≤4 (a-e)
1587:77 63 ja 15ec
1589:48 8d 15 4c 0c 00 00 lea 0xc4c(%rip),%rdx
1590:40 0f b6 ff movzbl %dil,%edi # 零扩展到32位
1594:48 63 04 ba movslq (%rdx,%rdi,4),%rax

我们这里能看到编译器根据数据的不同类型来处理了 switch case 的结构,这里我用一个表格总结一下

Switch类型 Case数量 分布特点 编译器策略 时间复杂度
small_switch 3个 连续(1,2,3) 线性比较 O(n)
dense_switch 8个 连续(10-17) 偏移跳转表 O(1)
sparse_switch 4个 稀疏(1,100,1000,10000) 二分查找 O(log n)
large_dense_switch 20个 连续(1-20) 标准跳转表 O(1)
mixed_switch 7个 部分连续 嵌套比较 O(log n)
char_switch 5个 连续(‘a’-‘e’) 字符偏移表 O(1)

OK,这里我们发现,Switch-case 最终的实现因为数据类型的不一样,会导致我们最终的代码存在一个不可预测性。那么我们有没有办法来优化这个问题呢?答案是有的。

我们来看下面一段代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
#include <stdio.h>

void basic_computed_goto(int operation) {
static void* jump_table[] = {
&&op_add,
&&op_sub,
&&op_mul,
&&op_div,
&&op_mod,
&&op_default
};

int a = 10, b = 3;
int result;

if (operation < 0 || operation > 4) {
operation = 5;
}

printf("Operation %d: a=%d, b=%d -> ", operation, a, b);

goto *jump_table[operation];

op_add:
result = a + b;
printf("ADD: %d\n", result);
return;

op_sub:
result = a - b;
printf("SUB: %d\n", result);
return;

op_mul:
result = a * b;
printf("MUL: %d\n", result);
return;

op_div:
if (b != 0) {
result = a / b;
printf("DIV: %d\n", result);
} else {
printf("DIV: Error (division by zero)\n");
}
return;

op_mod:
if (b != 0) {
result = a % b;
printf("MOD: %d\n", result);
} else {
printf("MOD: Error (division by zero)\n");
}
return;

op_default:
printf("Unknown operation\n");
return;
}

我们能看到这里核心的一个操作是将我们 Switch-cased 的每个 case 都变成了一个标签,然后我们通过一个 jump_table 来直接跳转到对应的标签上, 我们来看一下最关键位置的汇编

1
2
11d3:48 8d 05 c6 2b 00 00 lea    0x2bc6(%rip),%rax        # 3da0 <jump_table.0>
11da:ff 24 d8 jmp *(%rax,%rbx,8)

这里我们可以总结出来使用 Computed Goto 相较于传统的 switch-case 有以下几点优势

  1. 减少分支预测 fallback 的代价
  2. 指令缓存局部性上更优
  3. 减少了 cmp 指令的数量和开销

那么具体能有多快呢?可以参见 CPython 引入的 Computed Goto 的一个测试结果,大概是整体提升了15% 左右

那么 Computed Goto 的方式就是完美的吗?其实也不是,目前 CPython 的解释器 ceval.c 也是目前最大的一个 switch case 中存在几个典型问题

  1. Computed Goto 作为 clang 和 gcc 特化的功能,那么其余平台受益的可能性较小
  2. 目前 Computed Goto 其实并不成熟,在同一个编译器不同的版本可能会有不同的问题
  3. 超大型的 switch case 会导致编译器对于 switch case 的优化不够好
  4. 我们无法使用 perf 精确的去对我们整个过程中 per opcode 的开销进行定量分析,这在于让 Python 变得更快的大背景下将会是一个很大的问题

1,3,4 都很好理解,我们来看一下2的一个例子(感谢 Ken Jin 提供的例子)

在 GCC 11 的时候,switch-case 某个部分会正常的代码

1
2
3
4
5
6
738f: movq%r13, %r15
7392: movzbl%ah, %ebx
7395: movzbl%al, %eax
7398: movq(,%rax,8), %rax
73a0: movl%ebx, -0x248(%rbp)
73a6: jmpq*%rax

而在 GCC 13-15Beta 的时候,则会产生这样的代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
747a: movzbl%ah, %ebx
747d: movzbl%al, %eax
7480: movl%ebx, -0x248(%rbp)
7486: movq(,%rax,8), %rax
748e: jmp0x72a0 <_PyEval_EvalFrameDefault+0x970>

72a0: movq%r15, %xmm0
72a5: movq%r13, %xmm3
72aa: movq%r15, %rbx
72ad: punpcklqdq%xmm3, %xmm0
72b1: movhlps%xmm0, %xmm2
72b4: movq%xmm2, %r10
72b9: movq%r10, %r11
72bc: jmpq*%rax

我们能发现,额外的寄存器被引入了。体系结构 101,额外的寄存器意味着额外的开销。寄存器是很贵的!

那么我们有没有办法来迭代上面的超大的 Switch case 呢?估计有同学在想,既然上面的 switch case 超级大,那么我们将其拆分为多个小的函数
这样编译器可以有足够的上下文来优化,同时我们的 perf 也可以精确的去分析每个函数的开销,岂不美哉?

但是又有同学反对了,函数调用会触发 call 的指令,会带来额外的寄存器入栈和出栈的开销,这样会不会又变慢了呢?

那么能不能优化一下呢?答案是可以的,很多同学可能会想到了,tail call

假设我们有这样一段代码

1
2
3
4
5
6
__attribute__((noinline)) void g(int x) {
printf("Value: %d\n", x);
};
void f(int x) {
return g(x);
}

我们能看到这样一段汇编

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
0000000000001140 <g>:
1140:55 push %rbp
1141:48 89 e5 mov %rsp,%rbp
1144:48 83 ec 10 sub $0x10,%rsp
1148:89 7d fc mov %edi,-0x4(%rbp)
114b:8b 75 fc mov -0x4(%rbp),%esi
114e:48 8d 3d af 0e 00 00 lea 0xeaf(%rip),%rdi # 2004 <_IO_stdin_used+0x4>
1155:b0 00 mov $0x0,%al
1157:e8 d4 fe ff ff call 1030 <printf@plt>
115c:48 83 c4 10 add $0x10,%rsp
1160:5d pop %rbp
1161:c3 ret
1162:66 66 66 66 66 2e 0f data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1169:1f 84 00 00 00 00 00

0000000000001170 <f>:
1170:55 push %rbp
1171:48 89 e5 mov %rsp,%rbp
1174:48 83 ec 10 sub $0x10,%rsp
1178:89 7d fc mov %edi,-0x4(%rbp)
117b:8b 7d fc mov -0x4(%rbp),%edi
117e:e8 bd ff ff ff call 1140 <g>
1183:48 83 c4 10 add $0x10,%rsp
1187:5d pop %rbp
1188:c3 ret
1189:0f 1f 80 00 00 00 00 nopl 0x0(%rax)

其中 call 1140 <g> 指令非常显眼。这也是函数调用本身的开销的一个重要来源

在现在编译器中,存在一种特殊的优化叫作尾递归,即当函数的最后一步是调用另一个函数时,编译器可以优化掉这个调用的开销

我们来测试一下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
#include <stdio.h>
__attribute__((preserve_none)) void g(int x);
__attribute__((noinline, preserve_none)) void g(int x){
printf("Value: %d\n", x);
}

__attribute__((preserve_none)) void f(int x) {
[[clang::musttail]] return g(x);
}

int main() {
f(42);
return 0;
}

我们来看下相关汇编

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
0000000000001140 <g>:
1140:55 push %rbp
1141:48 89 e5 mov %rsp,%rbp
1144:48 83 ec 10 sub $0x10,%rsp
1148:44 89 65 fc mov %r12d,-0x4(%rbp)
114c:8b 75 fc mov -0x4(%rbp),%esi
114f:48 8d 3d ae 0e 00 00 lea 0xeae(%rip),%rdi # 2004 <_IO_stdin_used+0x4>
1156:31 c0 xor %eax,%eax
1158:e8 d3 fe ff ff call 1030 <printf@plt>
115d:48 83 c4 10 add $0x10,%rsp
1161:5d pop %rbp
1162:c3 ret
1163:66 66 66 66 2e 0f 1f data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
116a:84 00 00 00 00 00

0000000000001170 <f>:
1170:55 push %rbp
1171:48 89 e5 mov %rsp,%rbp
1174:44 89 65 fc mov %r12d,-0x4(%rbp)
1178:44 8b 65 fc mov -0x4(%rbp),%r12d
117c:5d pop %rbp
117d:e9 be ff ff ff jmp 1140 <g>
1182:66 66 66 66 66 2e 0f data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1189:1f 84 00 00 00 00 00

我们能看到,f 函数的最后一步是 jmp 1140 <g>,而不是 call 1140 <g>,这就意味着我们在调用 g 函数的时候不会有额外的寄存器分配等传统 call 指令带来的开销。

可能有同学回过味来了,那么这里在做尾递归处理后,感觉完全可以当作一种高性能 Goto 来看嘛。

Bingo,这里其实思路也是差不多这样的,在77年的一篇论文《Debunking the ‘Expensive Procedure Call’ Myth, or, Procedure Call Implementations Considered Harmful, or, Lambda: The Ultimate GOTO》就提到了,高效的过程调用可以和 Goto 性能相近,而在实现上会更简洁。

在 Python 3.14 中,Tail Call Interpreter 的实现就是基于这个思路的。

我们能看到我们对于 opcode 进行 dispatch 的宏进行了尾递归的处理

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
#   define Py_MUSTTAIL [[clang::musttail]]
# define Py_PRESERVE_NONE_CC __attribute__((preserve_none))
Py_PRESERVE_NONE_CC typedef PyObject* (*py_tail_call_funcptr)(TAIL_CALL_PARAMS);

# define TARGET(op) Py_PRESERVE_NONE_CC PyObject *_TAIL_CALL_##op(TAIL_CALL_PARAMS)
# define DISPATCH_GOTO() \
do { \
Py_MUSTTAIL return (INSTRUCTION_TABLE[opcode])(TAIL_CALL_ARGS); \
} while (0)
# define JUMP_TO_LABEL(name) \
do { \
Py_MUSTTAIL return (_TAIL_CALL_##name)(TAIL_CALL_ARGS); \
} while (0)
# ifdef Py_STATS
# define JUMP_TO_PREDICTED(name) \
do { \
Py_MUSTTAIL return (_TAIL_CALL_##name)(frame, stack_pointer, tstate, this_instr, oparg, lastopcode); \
} while (0)
# else
# define JUMP_TO_PREDICTED(name) \
do { \
Py_MUSTTAIL return (_TAIL_CALL_##name)(frame, stack_pointer, tstate, this_instr, oparg); \
} while (0)
# endif
# define LABEL(name) TARGET(name)

那么在保证我们基线性能和 Compute GOTO 甚至更优一点的同时,我们可以得到如下的一些好处

  1. 更广泛的平台支持
  2. 将 case 拆分后,编译器更不容易犯错,整体的性能的可预测性更强
  3. happy perf
  4. 以及我可以用 eBPF 之类的工具做更多的骚操作(

总结

这篇文章差不多就是这样,虽然说是只介绍 Python 3.14 的 Tail Call Interpreter,但是还是完整的介绍了一些整个的一个演进思路

这也带给我一个启发,很多时候,可预测性真的是非常重要的一个特性。

这算是 3.14 中和 remote debug 一起并列为我最喜欢的两个feature,可观测性万岁!

Further Performance Evolution in Python 3.14: Tail Call Interpreter

2025-07-02 23:49:00

I’ve been overwhelmed by security work lately, so let me switch to something lighter to relax my mind.

Python 3.14 has officially introduced a new mechanism called Tail Call Interpreter (Made by Ken Jin), which is undoubtedly another major milestone that lays the foundation for the future.

Main Content

Before discussing Python 3.14’s Tail Call Interpreter, we need to talk about the most basic syntax in C - switch-case.

1
2
3
4
5
6
7
8
9
10
switch (x) {
case 1:
// do something
break;
case 2:
// do something else
break;
default:
// do default thing
}

What would the final assembly look like for such code? Most people’s first reaction might be to use cmp followed by je, and if it doesn’t match, continue. Let’s compile a version to see.

For this code:

1
2
3
4
5
6
7
8
void small_switch(int x) {
switch(x) {
case 1: printf("One\n"); break;
case 2: printf("Two\n"); break;
case 3: printf("Three\n"); break;
default: printf("Other\n"); break;
}
}

The final assembly output would be:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
00000000000011f0 <small_switch>:
11f0:83 ff 02 cmp $0x2,%edi
11f3:74 2b je 1220 <small_switch+0x30>
11f5:83 ff 03 cmp $0x3,%edi
11f8:74 16 je 1210 <small_switch+0x20>
11fa:83 ff 01 cmp $0x1,%edi
11fd:75 31 jne 1230 <small_switch+0x40>
11ff:48 8d 3d fe 0d 00 00 lea 0xdfe(%rip),%rdi # 2004 <_IO_stdin_used+0x4>
1206:e9 25 fe ff ff jmp 1030 <puts@plt>
120b:0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
1210:48 8d 3d f5 0d 00 00 lea 0xdf5(%rip),%rdi # 200c <_IO_stdin_used+0xc>
1217:e9 14 fe ff ff jmp 1030 <puts@plt>
121c:0f 1f 40 00 nopl 0x0(%rax)
1220:48 8d 3d e1 0d 00 00 lea 0xde1(%rip),%rdi # 2008 <_IO_stdin_used+0x8>
1227:e9 04 fe ff ff jmp 1030 <puts@plt>
122c:0f 1f 40 00 nopl 0x0(%rax)
1230:48 8d 3d db 0d 00 00 lea 0xddb(%rip),%rdi # 2012 <_IO_stdin_used+0x12>
1237:e9 f4 fd ff ff jmp 1030 <puts@plt>
123c:0f 1f 40 00 nopl 0x0(%rax)

We can see that overall it’s as we expected - continuous cmp followed by continuous je. Now let’s evaluate the complexity here? Oh, time complexity O(n), that’s straightforward.

Damn, for Python with such a huge switch case structure, wouldn’t that be O(n) every time? Wouldn’t that be a performance disaster?

Actually, no. Usually, compilers use different strategies to handle switch case structures based on data type and scale.

Let’s prepare several examples:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
void small_switch(int x) {
switch(x) {
case 1: printf("One\n"); break;
case 2: printf("Two\n"); break;
case 3: printf("Three\n"); break;
default: printf("Other\n"); break;
}
}

void dense_switch(int x) {
switch(x) {
case 10: printf("Ten\n"); break;
case 11: printf("Eleven\n"); break;
case 12: printf("Twelve\n"); break;
case 13: printf("Thirteen\n"); break;
case 14: printf("Fourteen\n"); break;
case 15: printf("Fifteen\n"); break;
case 16: printf("Sixteen\n"); break;
case 17: printf("Seventeen\n"); break;
default: printf("Other\n"); break;
}
}

void sparse_switch(int x) {
switch(x) {
case 1: printf("One\n"); break;
case 100: printf("Hundred\n"); break;
case 1000: printf("Thousand\n"); break;
case 10000: printf("Ten thousand\n"); break;
default: printf("Other\n"); break;
}
}

void large_dense_switch(int x) {
switch(x) {
case 1: printf("Case 1\n"); break;
case 2: printf("Case 2\n"); break;
case 3: printf("Case 3\n"); break;
case 4: printf("Case 4\n"); break;
case 5: printf("Case 5\n"); break;
case 6: printf("Case 6\n"); break;
case 7: printf("Case 7\n"); break;
case 8: printf("Case 8\n"); break;
case 9: printf("Case 9\n"); break;
case 10: printf("Case 10\n"); break;
case 11: printf("Case 11\n"); break;
case 12: printf("Case 12\n"); break;
case 13: printf("Case 13\n"); break;
case 14: printf("Case 14\n"); break;
case 15: printf("Case 15\n"); break;
case 16: printf("Case 16\n"); break;
case 17: printf("Case 17\n"); break;
case 18: printf("Case 18\n"); break;
case 19: printf("Case 19\n"); break;
case 20: printf("Case 20\n"); break;
default: printf("Other\n"); break;
}
}

void mixed_switch(int x) {
switch(x) {
case 1: printf("Case 1\n"); break;
case 2: printf("Case 2\n"); break;
case 3: printf("Case 3\n"); break;

case 50: printf("Case 50\n"); break;

case 100: printf("Case 100\n"); break;
case 101: printf("Case 101\n"); break;
case 102: printf("Case 102\n"); break;

default: printf("Other\n"); break;
}
}

void char_switch(char c) {
switch(c) {
case 'a': printf("Letter a\n"); break;
case 'b': printf("Letter b\n"); break;
case 'c': printf("Letter c\n"); break;
case 'd': printf("Letter d\n"); break;
case 'e': printf("Letter e\n"); break;
default: printf("Other char\n"); break;
}
}

Then we disassemble and look at the results (I’ll only paste the key parts here):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
00000000000011f0 <small_switch>:
11f0:83 ff 02 cmp $0x2,%edi # Compare if it's 2
11f3:74 2b je 1220 # Jump to case 2
11f5:83 ff 03 cmp $0x3,%edi # Compare if it's 3
11f8:74 16 je 1210 # Jump to case 3
11fa:83 ff 01 cmp $0x1,%edi # Compare if it's 1
11fd:75 31 jne 1230 # If not, jump to default

0000000000001240 <dense_switch>:
1240:83 ef 0a sub $0xa,%edi # Subtract 10 (minimum case value)
1243:83 ff 07 cmp $0x7,%edi # Compare range (17-10=7)
1246:0f 87 90 00 00 00 ja 12dc # Out of range, jump to default
124c:48 8d 15 15 0f 00 00 lea 0xf15(%rip),%rdx # Load jump table address
1253:48 63 04 ba movslq (%rdx,%rdi,4),%rax # Get offset
1257:48 01 d0 add %rdx,%rax # Calculate target address
125a:ff e0 jmp *%rax # Indirect jump

00000000000012f0 <sparse_switch>:
12f0:81 ff e8 03 00 00 cmp $0x3e8,%edi # Compare 1000
12f6:74 40 je 1338 # If equal, jump
12f8:7f 16 jg 1310 # If greater than 1000, continue checking
12fa:83 ff 01 cmp $0x1,%edi # Less than 1000, check 1
12fd:74 49 je 1348
12ff:83 ff 64 cmp $0x64,%edi # Check 100
1302:75 24 jne 1328 # If none match, default
...
1310:81 ff 10 27 00 00 cmp $0x2710,%edi # Check 10000

0000000000001360 <large_dense_switch>:
1360:83 ff 14 cmp $0x14,%edi # Check if ≤20
1363:0f 87 53 01 00 00 ja 14bc # Out of range
1369:48 8d 15 18 0e 00 00 lea 0xe18(%rip),%rdx # Jump table address
1372:48 63 04 ba movslq (%rdx,%rdi,4),%rax # Direct indexing
1376:48 01 d0 add %rdx,%rax
1379:ff e0 jmp *%rax

00000000000014d0 <mixed_switch>:
14d0:83 ff 32 cmp $0x32,%edi # Compare 50
14d3:74 7b je 1550
14d5:7f 29 jg 1500 # >50 case
14d7:83 ff 02 cmp $0x2,%edi # ≤50, check small values
14da:74 64 je 1540
14dc:83 ff 03 cmp $0x3,%edi
...
1500:83 ff 65 cmp $0x65,%edi # >50, check 100,101,102
1503:74 5b je 1560

0000000000001580 <char_switch>:
1580:83 ef 61 sub $0x61,%edi # Subtract ASCII value of 'a'
1583:40 80 ff 04 cmp $0x4,%dil # Check if ≤4 (a-e)
1587:77 63 ja 15ec
1589:48 8d 15 4c 0c 00 00 lea 0xc4c(%rip),%rdx
1590:40 0f b6 ff movzbl %dil,%edi # Zero-extend to 32 bit
1594:48 63 04 ba movslq (%rdx,%rdi,4),%rax

Here we can see that the compiler handles switch case structures differently based on different data types. Let me summarize this with a table:

Switch Type Case Count Distribution Compiler Strategy Time Complexity
small_switch 3 Consecutive(1,2,3) Linear comparison O(n)
dense_switch 8 Consecutive(10-17) Offset jump table O(1)
sparse_switch 4 Sparse(1,100,1000,10000) Binary search O(log n)
large_dense_switch 20 Consecutive(1-20) Standard jump table O(1)
mixed_switch 7 Partially consecutive Nested comparison O(log n)
char_switch 5 Consecutive(‘a’-‘e’) Character offset table O(1)

OK, here we find that the final implementation of switch-case varies depending on data types, leading to unpredictability in our final code. So do we have ways to optimize this problem? The answer is yes.

Let’s look at the following code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
#include <stdio.h>

void basic_computed_goto(int operation) {
static void* jump_table[] = {
&&op_add,
&&op_sub,
&&op_mul,
&&op_div,
&&op_mod,
&&op_default
};

int a = 10, b = 3;
int result;

if (operation < 0 || operation > 4) {
operation = 5;
}

printf("Operation %d: a=%d, b=%d -> ", operation, a, b);

goto *jump_table[operation];

op_add:
result = a + b;
printf("ADD: %d\n", result);
return;

op_sub:
result = a - b;
printf("SUB: %d\n", result);
return;

op_mul:
result = a * b;
printf("MUL: %d\n", result);
return;

op_div:
if (b != 0) {
result = a / b;
printf("DIV: %d\n", result);
} else {
printf("DIV: Error (division by zero)\n");
}
return;

op_mod:
if (b != 0) {
result = a % b;
printf("MOD: %d\n", result);
} else {
printf("MOD: Error (division by zero)\n");
}
return;

op_default:
printf("Unknown operation\n");
return;
}

We can see that the core operation here is to turn each case of our switch-case into a label, then we use a jump_table to directly jump to the corresponding label. Let’s look at the assembly of the most critical part:

1
2
11d3:48 8d 05 c6 2b 00 00 lea    0x2bc6(%rip),%rax        # 3da0 <jump_table.0>
11da:ff 24 d8 jmp *(%rax,%rbx,8)

Here we can summarize that using Computed Goto compared to traditional switch-case has the following advantages:

  1. Reduces the cost of branch prediction fallback
  2. Better instruction cache locality
  3. Reduces the number and overhead of cmp instructions

So how much faster can it be? You can refer to the test results of Computed Goto introduced in CPython, which showed an overall improvement of about 15%.

So is the Computed Goto approach perfect? Actually, no. Currently, CPython’s interpreter ceval.c, which is also currently the largest switch case, has several typical problems:

  1. Computed Goto as a specialized feature of clang and gcc, other platforms have limited benefits
  2. Currently Computed Goto is not mature, different versions of the same compiler may have different issues
  3. Extremely large switch cases cause compilers to not optimize switch cases well enough
  4. We cannot use perf to precisely perform quantitative analysis of per-opcode overhead in our entire process, which will be a big problem in the context of making Python faster

Points 1, 3, and 4 are easy to understand. Let’s look at an example of point 2 (thanks to Ken Jin for providing the example).

In GCC 11, switch-case would generate normal code in certain parts:

1
2
3
4
5
6
738f: movq%r13, %r15
7392: movzbl%ah, %ebx
7395: movzbl%al, %eax
7398: movq(,%rax,8), %rax
73a0: movl%ebx, -0x248(%rbp)
73a6: jmpq*%rax

While in GCC 13-15Beta, it would generate code like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
747a: movzbl%ah, %ebx
747d: movzbl%al, %eax
7480: movl%ebx, -0x248(%rbp)
7486: movq(,%rax,8), %rax
748e: jmp0x72a0 <_PyEval_EvalFrameDefault+0x970>

72a0: movq%r15, %xmm0
72a5: movq%r13, %xmm3
72aa: movq%r15, %rbx
72ad: punpcklqdq%xmm3, %xmm0
72b1: movhlps%xmm0, %xmm2
72b4: movq%xmm2, %r10
72b9: movq%r10, %r11
72bc: jmpq*%rax

We can see that additional registers were introduced. Computer Architecture 101: additional registers mean additional overhead. Registers are expensive!

So do we have ways to iterate on the extremely large switch case above? Some students might be thinking, since the switch case above is extremely large, why don’t we split it into multiple small functions so that the compiler can have enough context to optimize, and our perf can also precisely analyze the overhead of each function. Wouldn’t that be great?

But other students object: function calls trigger call instructions, which bring additional overhead of register push and pop operations. Won’t this make it slower again?

So can we optimize this? The answer is yes. Many students might have thought of it - tail call.

Suppose we have this code:

1
2
3
4
5
6
__attribute__((noinline)) void g(int x) {
printf("Value: %d\n", x);
};
void f(int x) {
return g(x);
}

We can see this assembly:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
0000000000001140 <g>:
1140:55 push %rbp
1141:48 89 e5 mov %rsp,%rbp
1144:48 83 ec 10 sub $0x10,%rsp
1148:89 7d fc mov %edi,-0x4(%rbp)
114b:8b 75 fc mov -0x4(%rbp),%esi
114e:48 8d 3d af 0e 00 00 lea 0xeaf(%rip),%rdi # 2004 <_IO_stdin_used+0x4>
1155:b0 00 mov $0x0,%al
1157:e8 d4 fe ff ff call 1030 <printf@plt>
115c:48 83 c4 10 add $0x10,%rsp
1160:5d pop %rbp
1161:c3 ret
1162:66 66 66 66 66 2e 0f data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1169:1f 84 00 00 00 00 00

0000000000001170 <f>:
1170:55 push %rbp
1171:48 89 e5 mov %rsp,%rbp
1174:48 83 ec 10 sub $0x10,%rsp
1178:89 7d fc mov %edi,-0x4(%rbp)
117b:8b 7d fc mov -0x4(%rbp),%edi
117e:e8 bd ff ff ff call 1140 <g>
1183:48 83 c4 10 add $0x10,%rsp
1187:5d pop %rbp
1188:c3 ret
1189:0f 1f 80 00 00 00 00 nopl 0x0(%rax)

The call 1140 <g> instruction is very prominent. This is also an important source of function call overhead.

In modern compilers, there’s a special optimization called tail recursion, where when the last step of a function is calling another function, the compiler can optimize away the overhead of this call.

Let’s test this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
#include <stdio.h>
__attribute__((preserve_none)) void g(int x);
__attribute__((noinline, preserve_none)) void g(int x){
printf("Value: %d\n", x);
}

__attribute__((preserve_none)) void f(int x) {
[[clang::musttail]] return g(x);
}

int main() {
f(42);
return 0;
}

Let’s look at the related assembly:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
0000000000001140 <g>:
1140:55 push %rbp
1141:48 89 e5 mov %rsp,%rbp
1144:48 83 ec 10 sub $0x10,%rsp
1148:44 89 65 fc mov %r12d,-0x4(%rbp)
114c:8b 75 fc mov -0x4(%rbp),%esi
114f:48 8d 3d ae 0e 00 00 lea 0xeae(%rip),%rdi # 2004 <_IO_stdin_used+0x4>
1156:31 c0 xor %eax,%eax
1158:e8 d3 fe ff ff call 1030 <printf@plt>
115d:48 83 c4 10 add $0x10,%rsp
1161:5d pop %rbp
1162:c3 ret
1163:66 66 66 66 2e 0f 1f data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
116a:84 00 00 00 00 00

0000000000001170 <f>:
1170:55 push %rbp
1171:48 89 e5 mov %rsp,%rbp
1174:44 89 65 fc mov %r12d,-0x4(%rbp)
1178:44 8b 65 fc mov -0x4(%rbp),%r12d
117c:5d pop %rbp
117d:e9 be ff ff ff jmp 1140 <g>
1182:66 66 66 66 66 2e 0f data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1189:1f 84 00 00 00 00 00

We can see that the last step of function f is jmp 1140 <g>, not call 1140 <g>. This means when we call function g, we won’t have additional overhead like register allocation that traditional call instructions bring.

Some students might have realized that after tail recursion processing, this can completely be viewed as a high-performance Goto.

Bingo, the idea here is similar. A 1977 paper “Debunking the ‘Expensive Procedure Call’ Myth, or, Procedure Call Implementations Considered Harmful, or, Lambda: The Ultimate GOTO” mentioned that efficient procedure calls can have performance close to Goto, while being more concise in implementation.

In Python 3.14, the implementation of Tail Call Interpreter is based on this idea.

We can see that we’ve applied tail recursion processing to the macro that dispatches opcodes:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
#   define Py_MUSTTAIL [[clang::musttail]]
# define Py_PRESERVE_NONE_CC __attribute__((preserve_none))
Py_PRESERVE_NONE_CC typedef PyObject* (*py_tail_call_funcptr)(TAIL_CALL_PARAMS);

# define TARGET(op) Py_PRESERVE_NONE_CC PyObject *_TAIL_CALL_##op(TAIL_CALL_PARAMS)
# define DISPATCH_GOTO() \
do { \
Py_MUSTTAIL return (INSTRUCTION_TABLE[opcode])(TAIL_CALL_ARGS); \
} while (0)
# define JUMP_TO_LABEL(name) \
do { \
Py_MUSTTAIL return (_TAIL_CALL_##name)(TAIL_CALL_ARGS); \
} while (0)
# ifdef Py_STATS
# define JUMP_TO_PREDICTED(name) \
do { \
Py_MUSTTAIL return (_TAIL_CALL_##name)(frame, stack_pointer, tstate, this_instr, oparg, lastopcode); \
} while (0)
# else
# define JUMP_TO_PREDICTED(name) \
do { \
Py_MUSTTAIL return (_TAIL_CALL_##name)(frame, stack_pointer, tstate, this_instr, oparg); \
} while (0)
# endif
# define LABEL(name) TARGET(name)

So while ensuring our baseline performance is as good as or even better than Computed Goto, we can get the following benefits:

  1. Broader platform support
  2. After splitting cases, compilers are less likely to make mistakes, and overall performance predictability is stronger
  3. Happy perf
  4. And I can do more cool stuff with tools like eBPF

Summary

This article is pretty much it. Although it claims to only introduce Python 3.14’s Tail Call Interpreter, it still completely introduces the entire evolution of thinking.

This also gives me an insight: predictability is really a very important characteristic in many cases.

This, along with remote debug, are the two features I like most in 3.14. Long live observability!

Python 3.14: Python 世界的一大步

2025-04-26 22:49:00

Python 3.14 目前主要的一些主要的特性其实已经固定了,在我看来,Python 3.14 是一个未来很多年的一个核心版本。因为其确定了是时代的 Python
调试生态的基准,这篇文章将会来聊聊这个 Python 世界中的史诗级改进

正文

在我们日常调试 Python 代码的时候,我们经常会遇到这样一个问题,我们需要采样当前的 Python Runtime 的状态,进而进一步调试我们的 Python 进程

常见的手段莫过于两种

  1. 通过 eBPF + UProbe 等手段来触发
  2. 通过 process_vm_readv 等 Syscall 来直接整块读取内存

无论这两种方式都有一个核心的问题,我们怎么样来解析内存中的数据?

https://github.com/jschwinger233/perf-examples/blob/main/cpython310_backtrace/bpf.c 来做一个例子,在之前的很多年的时候,我们会怎么做

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
#define PAGE_SIZE (1<<12)
#define KASAN_STACK_ORDER 0
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE ((__u64)(PAGE_SIZE << THREAD_SIZE_ORDER))
#define TOP_OF_KERNEL_STACK_PADDING ((__u64)0)

const static u32 ZERO = 0;

struct PyTypeObject {
char _[24];
char *tp_name;
};

struct PyObject {
char _[8];
struct PyTypeObject *ob_type;
};

struct PyVarObject {
struct PyObject ob_base;
char _[8];
};

struct PyASCIIObject {
__u8 _[16];
__u64 length;
__u8 __[24];
};

struct _PyStr {
struct PyASCIIObject ascii;
char buf[100];
};

struct PyCodeObject {
char _[104];
struct _PyStr *co_filename;
struct _PyStr *co_name;
};

struct PyFrameObject {
struct PyVarObject ob_base;
struct PyFrameObject *f_back;
struct PyCodeObject *f_code;
char _[60];
int f_lineno;
};

struct event {
__u64 rip;
__u8 user_mode;
__s8 python_stack_depth;
__u64 filename_len[20];
__u64 funcname_len[20];
unsigned char filename[20][100];
unsigned char funcname[20][100];
};

struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, 1);
__type(key, u32);
__type(value, struct event);
} events SEC(".maps");

struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 1<<29);
} ringbuf SEC(".maps");

SEC("perf_event/cpython310")
int perf_event_cpython310(struct bpf_perf_event_data *ctx)
{
__u64 rsp;
struct event *event;
struct PyFrameObject *frame;

event = bpf_map_lookup_elem(&events, &ZERO);
if (!event)
return 0;

rsp = ctx->regs.sp;
event->rip = ctx->regs.ip;
event->user_mode = !!(ctx->regs.cs & 3);

if (!event->user_mode) {
struct task_struct *task = (struct task_struct *)bpf_get_current_task();
__u64 __ptr = (__u64)BPF_CORE_READ(task, stack);
__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
struct pt_regs *pt_regs = ((struct pt_regs *)__ptr) - 1;

rsp = BPF_CORE_READ(pt_regs, sp);
event->rip = BPF_CORE_READ(pt_regs, ip);
}

char name[5];
bool found = false;

for (int i = 0; i < 200; i++) {
bpf_probe_read_user(&frame, sizeof(frame), (void *)rsp + 8*i);
if (!frame)
continue;

char *tp_name = BPF_PROBE_READ_USER(frame, ob_base.ob_base.ob_type, tp_name);
bpf_probe_read_user(&name, sizeof(name), (void *)tp_name);
if (bpf_strncmp(name, 5, "frame") == 0) {
found = true;
break;
}
}

if (!found) {
event->python_stack_depth = -1;
bpf_ringbuf_output(&ringbuf, event, sizeof(*event), 0);
return 0;
}

for (int i = 0; i < 20; i++) {
event->python_stack_depth = i;
BPF_PROBE_READ_USER_INTO(&event->filename_len[i], frame, f_code, co_filename, ascii.length);
BPF_PROBE_READ_USER_INTO(&event->filename[i], frame, f_code, co_filename, buf);
BPF_PROBE_READ_USER_INTO(&event->funcname_len[i], frame, f_code, co_name, ascii.length);
BPF_PROBE_READ_USER_INTO(&event->funcname[i], frame, f_code, co_name, buf);
frame = BPF_PROBE_READ_USER(frame, f_back);
if (!frame)
break;
}

bpf_ringbuf_output(&ringbuf, event, sizeof(*event), 0);
return 0;
}

char __license[] SEC("license") = "Dual MIT/GPL";

上面的核心代码其实没多少,核心的逻辑就还是我们手动模拟 Python 中关键的 PyFrameObject 结构体,然后我们在内存中不断做一次搜索,暴力匹配到特征一致的内存

其余诸如 PySpy 这样的工具也是类似的思路

这个方式最核心的问题是在于说,Python 每个版本的 ABI 都可能发生变化,所以我们需要不断的根据不同的版本去做兼容(比如 PySpy 维护了从3.7到3.12的不同的 PyFrameObject

那么我们有没有更好的方法来处理这个问题?或者说我们能不能更好的去定位?

可以的,写 Python 的同学肯定都知道我们 Python 中有一个全局的变量 _PyRuntime,其类型为 pyruntimestate,大致的布局如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
struct pyruntimestate {

_Py_DebugOffsets debug_offsets;

int _initialized;

int preinitializing;

int preinitialized;

int core_initialized;

int initialized;

PyThreadState *_finalizing;

unsigned long _finalizing_id;

struct pyinterpreters {
PyMutex mutex;
PyInterpreterState *head;

PyInterpreterState *main;

int64_t next_id;
} interpreters;


unsigned long main_thread;
PyThreadState *main_tstate;


_PyXI_global_state_t xi;

struct _pymem_allocators allocators;
struct _obmalloc_global_state obmalloc;
struct pyhash_runtime_state pyhash_state;
struct _pythread_runtime_state threads;
struct _signals_runtime_state signals;

Py_tss_t autoTSSkey;

Py_tss_t trashTSSkey;

PyWideStringList orig_argv;

struct _parser_runtime_state parser;

struct _atexit_runtime_state atexit;

struct _import_runtime_state imports;
struct _ceval_runtime_state ceval;
struct _gilstate_runtime_state {

int check_enabled;

PyInterpreterState *autoInterpreterState;
} gilstate;
struct _getargs_runtime_state {
struct _PyArg_Parser *static_parsers;
} getargs;
struct _fileutils_state fileutils;
struct _faulthandler_runtime_state faulthandler;
struct _tracemalloc_runtime_state tracemalloc;
struct _reftracer_runtime_state ref_tracer;

_PyRWMutex stoptheworld_mutex;
struct _stoptheworld_state stoptheworld;

PyPreConfig preconfig;
Py_OpenCodeHookFunction open_code_hook;
void *open_code_userdata;
struct {
PyMutex mutex;
struct _Py_AuditHookEntry *head;
} audit_hooks;

struct _py_object_runtime_state object_state;
struct _Py_float_runtime_state float_state;
struct _Py_unicode_runtime_state unicode_state;
struct _types_runtime_state types;
struct _Py_time_runtime_state time;

#if defined(__EMSCRIPTEN__) && defined(PY_CALL_TRAMPOLINE)

int (*emscripten_count_args_function)(PyCFunctionWithKeywords func);
#endif
struct _Py_cached_objects cached_objects;
struct _Py_static_objects static_objects;

PyInterpreterState _main_interpreter;

};

眼尖的同学肯定看到了,我们其中有一段核心的代码

1
2
3
4
5
6
7
8
struct pyinterpreters {
PyMutex mutex;
PyInterpreterState *head;

PyInterpreterState *main;

int64_t next_id;
} interpreters;

维护了一个 PyInterpreterState 的链表,我们可以通过 PyInterpreterState 来获取当前的 Frame,PyInterpreterState 中的 TreadState 来获取当前的线程状态

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
struct pythreads {
uint64_t next_unique_id;
/* The linked list of threads, newest first. */
PyThreadState *head;
_PyThreadStateImpl *preallocated;
/* The thread currently executing in the __main__ module, if any. */
PyThreadState *main;
/* Used in Modules/_threadmodule.c. */
Py_ssize_t count;
/* Support for runtime thread stack size tuning.
A value of 0 means using the platform's default stack size
or the size specified by the THREAD_STACK_SIZE macro. */
/* Used in Python/thread.c. */
size_t stacksize;
} threads;

PyThreadState 中和核心的 struct _PyInterpreterFrame *current_frame 就是我们需要的 frame state,整个流程大概如下

graph TD    PyRuntime["_PyRuntime (pyruntimestate)"] --> Interpreters["interpreters (pyinterpreters)"]    Interpreters -->|head| InterpreterStateHead["PyInterpreterState *head"]    Interpreters -->|main| InterpreterStateMain["PyInterpreterState *main"]        %% Define interpreter state structure    subgraph PyInterpreterState        InterpreterID["int64_t id"]         ThreadsStruct["struct pythreads threads"]        NextInterpreter["PyInterpreterState *next"]    end        InterpreterStateHead --- PyInterpreterState    InterpreterStateMain --- PyInterpreterState        %% Link to threads structure    ThreadsStruct --> ThreadHead["PyThreadState *head"]    ThreadsStruct --> ThreadMain["PyThreadState *main"]        %% Define thread state structure    subgraph PyThreadState        ThreadID["uint64_t thread_id"]        InterpreterPtr["PyInterpreterState *interp"]        CurrentFrame["_PyInterpreterFrame *current_frame"]        NextThread["PyThreadState *next"]    end        ThreadHead --- PyThreadState    ThreadMain --- PyThreadState        %% Frame structure    CurrentFrame --> Frame["_PyInterpreterFrame structure"]        subgraph _PyInterpreterFrame        PreviousFrame["_PyInterpreterFrame *previous"]        CodeObject["PyCodeObject *f_code"]        Locals["PyObject **localsplus"]    end        %% Connected paths in color    PyRuntime ==>|"Main Path"| Interpreters    Interpreters ==>|"Main Path"| InterpreterStateMain    InterpreterStateMain ==>|"Main Path"| ThreadsStruct    ThreadsStruct ==>|"Main Path"| ThreadMain    ThreadMain ==>|"Main Path"| CurrentFrame    CurrentFrame ==>|"Main Path"| Frame        class PyRuntime,InterpreterStateMain,ThreadMain,CurrentFrame,Frame mainPath;    classDef mainPath fill:#f96,stroke:#333,stroke-width:2px;    classDef mainNodes fill:#f9f,stroke:#333,stroke-width:2px;

那么我们现在来解决第一个问题,我们怎么样获取在内存中的 _PyRuntime 的地址呢?

我们把这个问题抽象成下面最简单一个 C 代码

1
2
3
4
5
6
7
8
#include <stdio.h>

int abc=3;

int main() {
printf("abc: %p\n", &abc);
return 0;
}

我们怎么样获取 abc 的地址呢?这里写过 C 的同学可能反应过来了,我们可以使用 __attribute__((section())) 的语法,来将其放到一个特定的段中

1
2
3
4
5
6
7
8
#include <stdio.h>

int abc __attribute__((section(".my_section"))) = 3;

int main() {
printf("abc: %p\n", &abc);
return 0;
}

我们编译,并用 readelf 来解析一下二进制

1
2
╰─ readelf -S ./a.out| grep my_section 
[25] .my_section PROGBITS 0000000000004018 00003018

我们能看到这里我们得到了一个相对地址。后续我们就可以通过解析 ELF 来遍历寻找到 abc 变量的地址

那么在 Python 中同样如此,在代码中有这样一段代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
#define GENERATE_DEBUG_SECTION(name, declaration)     \
_GENERATE_DEBUG_SECTION_WINDOWS(name) \
_GENERATE_DEBUG_SECTION_APPLE(name) \
declaration \
_GENERATE_DEBUG_SECTION_LINUX(name)

// Please note that section names are truncated to eight bytes
// on Windows!
#if defined(MS_WINDOWS)
#define _GENERATE_DEBUG_SECTION_WINDOWS(name) \
_Pragma(Py_STRINGIFY(section(Py_STRINGIFY(name), read, write))) \
__declspec(allocate(Py_STRINGIFY(name)))
#else
#define _GENERATE_DEBUG_SECTION_WINDOWS(name)
#endif

#if defined(__APPLE__)
#define _GENERATE_DEBUG_SECTION_APPLE(name) \
__attribute__((section(SEG_DATA "," Py_STRINGIFY(name)))) \
__attribute__((used))
#else
#define _GENERATE_DEBUG_SECTION_APPLE(name)
#endif

#if defined(__linux__) && (defined(__GNUC__) || defined(__clang__))
#define _GENERATE_DEBUG_SECTION_LINUX(name) \
__attribute__((section("." Py_STRINGIFY(name)))) \
__attribute__((used))
#else
#define _GENERATE_DEBUG_SECTION_LINUX(name)
#endif

GENERATE_DEBUG_SECTION(PyRuntime, _PyRuntimeState _PyRuntime)
= _PyRuntimeState_INIT(_PyRuntime, _Py_Debug_Cookie);
_Py_COMP_DIAG_POP

这样我们就能比较方便的获取到 PyRuntime 在内存中的地址。

那么现在第二个问题是,我们怎么样通过我们前面介绍的调用链获取到地址?

大家可能第一反应还是想通过维护不同版本的数据结构来获取具体的地址。不过这里我们有没有办法可以用更简单的方法来处理呢?答案是有的

眼尖的同学可能看到了我们在 pyruntimestate 中有一个字段叫 debug_offsets,我们来看下我们怎么初始化这个字段的吧

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
#define _Py_DebugOffsets_INIT(debug_cookie) { \
.cookie = debug_cookie, \
.version = PY_VERSION_HEX, \
.free_threaded = _Py_Debug_Free_Threaded, \
.runtime_state = { \
.size = sizeof(_PyRuntimeState), \
.finalizing = offsetof(_PyRuntimeState, _finalizing), \
.interpreters_head = offsetof(_PyRuntimeState, interpreters.head), \
}, \
.interpreter_state = { \
.size = sizeof(PyInterpreterState), \
.id = offsetof(PyInterpreterState, id), \
.next = offsetof(PyInterpreterState, next), \
.threads_head = offsetof(PyInterpreterState, threads.head), \
.threads_main = offsetof(PyInterpreterState, threads.main), \
.gc = offsetof(PyInterpreterState, gc), \
.imports_modules = offsetof(PyInterpreterState, imports.modules), \
.sysdict = offsetof(PyInterpreterState, sysdict), \
.builtins = offsetof(PyInterpreterState, builtins), \
.ceval_gil = offsetof(PyInterpreterState, ceval.gil), \
.gil_runtime_state = offsetof(PyInterpreterState, _gil), \
.gil_runtime_state_enabled = _Py_Debug_gilruntimestate_enabled, \
.gil_runtime_state_locked = offsetof(PyInterpreterState, _gil.locked), \
.gil_runtime_state_holder = offsetof(PyInterpreterState, _gil.last_holder), \
}, \
.thread_state = { \
.size = sizeof(PyThreadState), \
.prev = offsetof(PyThreadState, prev), \
.next = offsetof(PyThreadState, next), \
.interp = offsetof(PyThreadState, interp), \
.current_frame = offsetof(PyThreadState, current_frame), \
.thread_id = offsetof(PyThreadState, thread_id), \
.native_thread_id = offsetof(PyThreadState, native_thread_id), \
.datastack_chunk = offsetof(PyThreadState, datastack_chunk), \
.status = offsetof(PyThreadState, _status), \
}, \
.interpreter_frame = { \
.size = sizeof(_PyInterpreterFrame), \
.previous = offsetof(_PyInterpreterFrame, previous), \
.executable = offsetof(_PyInterpreterFrame, f_executable), \
.instr_ptr = offsetof(_PyInterpreterFrame, instr_ptr), \
.localsplus = offsetof(_PyInterpreterFrame, localsplus), \
.owner = offsetof(_PyInterpreterFrame, owner), \
.stackpointer = offsetof(_PyInterpreterFrame, stackpointer), \
}, \
.code_object = { \
.size = sizeof(PyCodeObject), \
.filename = offsetof(PyCodeObject, co_filename), \
.name = offsetof(PyCodeObject, co_name), \
.qualname = offsetof(PyCodeObject, co_qualname), \
.linetable = offsetof(PyCodeObject, co_linetable), \
.firstlineno = offsetof(PyCodeObject, co_firstlineno), \
.argcount = offsetof(PyCodeObject, co_argcount), \
.localsplusnames = offsetof(PyCodeObject, co_localsplusnames), \
.localspluskinds = offsetof(PyCodeObject, co_localspluskinds), \
.co_code_adaptive = offsetof(PyCodeObject, co_code_adaptive), \
}, \
.pyobject = { \
.size = sizeof(PyObject), \
.ob_type = offsetof(PyObject, ob_type), \
}, \
.type_object = { \
.size = sizeof(PyTypeObject), \
.tp_name = offsetof(PyTypeObject, tp_name), \
.tp_repr = offsetof(PyTypeObject, tp_repr), \
.tp_flags = offsetof(PyTypeObject, tp_flags), \
}, \
.tuple_object = { \
.size = sizeof(PyTupleObject), \
.ob_item = offsetof(PyTupleObject, ob_item), \
.ob_size = offsetof(PyTupleObject, ob_base.ob_size), \
}, \
.list_object = { \
.size = sizeof(PyListObject), \
.ob_item = offsetof(PyListObject, ob_item), \
.ob_size = offsetof(PyListObject, ob_base.ob_size), \
}, \
.set_object = { \
.size = sizeof(PySetObject), \
.used = offsetof(PySetObject, used), \
.table = offsetof(PySetObject, table), \
.mask = offsetof(PySetObject, mask), \
}, \
.dict_object = { \
.size = sizeof(PyDictObject), \
.ma_keys = offsetof(PyDictObject, ma_keys), \
.ma_values = offsetof(PyDictObject, ma_values), \
}, \
.float_object = { \
.size = sizeof(PyFloatObject), \
.ob_fval = offsetof(PyFloatObject, ob_fval), \
}, \
.long_object = { \
.size = sizeof(PyLongObject), \
.lv_tag = offsetof(PyLongObject, long_value.lv_tag), \
.ob_digit = offsetof(PyLongObject, long_value.ob_digit), \
}, \
.bytes_object = { \
.size = sizeof(PyBytesObject), \
.ob_size = offsetof(PyBytesObject, ob_base.ob_size), \
.ob_sval = offsetof(PyBytesObject, ob_sval), \
}, \
.unicode_object = { \
.size = sizeof(PyUnicodeObject), \
.state = offsetof(PyUnicodeObject, _base._base.state), \
.length = offsetof(PyUnicodeObject, _base._base.length), \
.asciiobject_size = sizeof(PyASCIIObject), \
}, \
.gc = { \
.size = sizeof(struct _gc_runtime_state), \
.collecting = offsetof(struct _gc_runtime_state, collecting), \
}, \
.gen_object = { \
.size = sizeof(PyGenObject), \
.gi_name = offsetof(PyGenObject, gi_name), \
.gi_iframe = offsetof(PyGenObject, gi_iframe), \
.gi_frame_state = offsetof(PyGenObject, gi_frame_state), \
}, \
.debugger_support = { \
.eval_breaker = offsetof(PyThreadState, eval_breaker), \
.remote_debugger_support = offsetof(PyThreadState, remote_debugger_support), \
.remote_debugging_enabled = offsetof(PyInterpreterState, config.remote_debug), \
.debugger_pending_call = offsetof(_PyRemoteDebuggerSupport, debugger_pending_call), \
.debugger_script_path = offsetof(_PyRemoteDebuggerSupport, debugger_script_path), \
.debugger_script_path_size = MAX_SCRIPT_PATH_SIZE, \
}, \
}

我们能看到我们使用了 offsetof 这个非常经典的宏来将一下我们常用的字段相较于结构体的偏移写入到 debug_offsets 中去。而 debug_offsets 将固定存在于 pyruntimestate 的第一个字段,同时起改变频率相对较低,所以我们就可以通过 debugger_support 获取不同地址的偏移量来获取最终我们想要的数据。

通过这样的做法,我们实际上就有很多很好玩的事情可以做了。实际上官方也是基于这样一套机制提出了 PEP 768 – Safe external debugger interface for CPython https://peps.python.org/pep-0768/。可以允许用户远程的为一个 Python 进程注入一段调试代码

我们来看一下这个 PEP 的核心实现

在前面介绍过的 ThreadState 中新增了一组结构

1
2
3
4
typedef struct _remote_debugger_support {
int32_t debugger_pending_call;
char debugger_script_path[MAX_SCRIPT_PATH_SIZE];
} _PyRemoteDebuggerSupport;

在执行过程中,如果 debugger_pending_call 为 1 的时候,我们就会去执行 debugger_script_path 中的脚本

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
int _PyRunRemoteDebugger(PyThreadState *tstate)
{
const PyConfig *config = _PyInterpreterState_GetConfig(tstate->interp);
if (config->remote_debug == 1
&& tstate->remote_debugger_support.debugger_pending_call == 1)
{
tstate->remote_debugger_support.debugger_pending_call = 0;

// Immediately make a copy in case of a race with another debugger
// process that's trying to write to the buffer. At least this way
// we'll be internally consistent: what we audit is what we run.
const size_t pathsz
= sizeof(tstate->remote_debugger_support.debugger_script_path);

char *path = PyMem_Malloc(pathsz);
if (path) {
// And don't assume the debugger correctly null terminated it.
memcpy(
path,
tstate->remote_debugger_support.debugger_script_path,
pathsz);
path[pathsz - 1] = '\0';
if (*path) {
run_remote_debugger_script(path);
}
PyMem_Free(path);
}
}
return 0;
}

那么问题来了,我们现在怎么样给目标 Python 进程注入对应的值呢?我们来看看 remote_debugging.c 中的实现

首先入口函数为 _PySysRemoteDebug_SendExec

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
int
_PySysRemoteDebug_SendExec(int pid, int tid, const char *debugger_script_path)
{
#if !defined(Py_SUPPORTS_REMOTE_DEBUG)
PyErr_SetString(PyExc_RuntimeError, "Remote debugging is not supported on this platform");
return -1;
#elif !defined(Py_REMOTE_DEBUG)
PyErr_SetString(PyExc_RuntimeError, "Remote debugging support has not been compiled in");
return -1;
#else

PyThreadState *tstate = _PyThreadState_GET();
const PyConfig *config = _PyInterpreterState_GetConfig(tstate->interp);
if (config->remote_debug != 1) {
PyErr_SetString(PyExc_RuntimeError, "Remote debugging is not enabled");
return -1;
}

proc_handle_t handle;
if (init_proc_handle(&handle, pid) < 0) {
return -1;
}

int rc = send_exec_to_proc_handle(&handle, tid, debugger_script_path);
cleanup_proc_handle(&handle);
return rc;
#endif
}

前面都是一些例行的检查,我们来看看 send_exec_to_proc_handle 这个函数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
static int
send_exec_to_proc_handle(proc_handle_t *handle, int tid, const char *debugger_script_path)
{
uintptr_t runtime_start_address;
struct _Py_DebugOffsets debug_offsets;

if (read_offsets(handle, &runtime_start_address, &debug_offsets)) {
return -1;
}

uintptr_t interpreter_state_list_head = (uintptr_t)debug_offsets.runtime_state.interpreters_head;

uintptr_t interpreter_state_addr;
if (0 != read_memory(
handle,
runtime_start_address + interpreter_state_list_head,
sizeof(void*),
&interpreter_state_addr))
{
return -1;
}

if (interpreter_state_addr == 0) {
PyErr_SetString(PyExc_RuntimeError, "Can't find a running interpreter in the remote process");
return -1;
}

int is_remote_debugging_enabled = 0;
if (0 != read_memory(
handle,
interpreter_state_addr + debug_offsets.debugger_support.remote_debugging_enabled,
sizeof(int),
&is_remote_debugging_enabled))
{
return -1;
}

if (is_remote_debugging_enabled != 1) {
PyErr_SetString(
PyExc_RuntimeError,
"Remote debugging is not enabled in the remote process");
return -1;
}

uintptr_t thread_state_addr;
unsigned long this_tid = 0;

if (tid != 0) {
if (0 != read_memory(
handle,
interpreter_state_addr + debug_offsets.interpreter_state.threads_head,
sizeof(void*),
&thread_state_addr))
{
return -1;
}
while (thread_state_addr != 0) {
if (0 != read_memory(
handle,
thread_state_addr + debug_offsets.thread_state.native_thread_id,
sizeof(this_tid),
&this_tid))
{
return -1;
}

if (this_tid == (unsigned long)tid) {
break;
}

if (0 != read_memory(
handle,
thread_state_addr + debug_offsets.thread_state.next,
sizeof(void*),
&thread_state_addr))
{
return -1;
}
}

if (thread_state_addr == 0) {
PyErr_SetString(
PyExc_RuntimeError,
"Can't find the specified thread in the remote process");
return -1;
}
} else {
if (0 != read_memory(
handle,
interpreter_state_addr + debug_offsets.interpreter_state.threads_main,
sizeof(void*),
&thread_state_addr))
{
return -1;
}

if (thread_state_addr == 0) {
PyErr_SetString(
PyExc_RuntimeError,
"Can't find the main thread in the remote process");
return -1;
}
}

// Ensure our path is not too long
if (debug_offsets.debugger_support.debugger_script_path_size <= strlen(debugger_script_path)) {
PyErr_SetString(PyExc_ValueError, "Debugger script path is too long");
return -1;
}

uintptr_t debugger_script_path_addr = (uintptr_t)(
thread_state_addr +
debug_offsets.debugger_support.remote_debugger_support +
debug_offsets.debugger_support.debugger_script_path);
if (0 != write_memory(
handle,
debugger_script_path_addr,
strlen(debugger_script_path) + 1,
debugger_script_path))
{
return -1;
}

int pending_call = 1;
uintptr_t debugger_pending_call_addr = (uintptr_t)(
thread_state_addr +
debug_offsets.debugger_support.remote_debugger_support +
debug_offsets.debugger_support.debugger_pending_call);
if (0 != write_memory(
handle,
debugger_pending_call_addr,
sizeof(int),
&pending_call))

{
return -1;
}

uintptr_t eval_breaker;
if (0 != read_memory(
handle,
thread_state_addr + debug_offsets.debugger_support.eval_breaker,
sizeof(uintptr_t),
&eval_breaker))
{
return -1;
}

eval_breaker |= _PY_EVAL_PLEASE_STOP_BIT;

if (0 != write_memory(
handle,
thread_state_addr + (uintptr_t)debug_offsets.debugger_support.eval_breaker,
sizeof(uintptr_t),
&eval_breaker))

{
return -1;
}

return 0;
}

我们先不考虑具体的细节的话,这段函数的逻辑还是非常明确的,通过 read_offsets 获取目标的地址偏移,通过 read_memory 这个函数读取不同地址,然后做一些处理后,通过 write_memory 来写入到目标进程中去

read_offsets 这个函数就是我们前面核心提到过的怎么样使用目前 Python 给出的调试信息的例子,我们来看一下其在 Linux 下的实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
static int
read_offsets(
proc_handle_t *handle,
uintptr_t *runtime_start_address,
_Py_DebugOffsets* debug_offsets
) {
if (_Py_RemoteDebug_ReadDebugOffsets(handle, runtime_start_address, debug_offsets)) {
return -1;
}
if (ensure_debug_offset_compatibility(debug_offsets)) {
return -1;
}
return 0;
}

这里的核心函数是 _Py_RemoteDebug_ReadDebugOffsets, 我们接着来看这个的实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
static int
_Py_RemoteDebug_ReadDebugOffsets(
proc_handle_t *handle,
uintptr_t *runtime_start_address,
_Py_DebugOffsets* debug_offsets
) {
*runtime_start_address = _Py_RemoteDebug_GetPyRuntimeAddress(handle);
if (!*runtime_start_address) {
if (!PyErr_Occurred()) {
PyErr_SetString(
PyExc_RuntimeError, "Failed to get PyRuntime address");
}
return -1;
}
size_t size = sizeof(struct _Py_DebugOffsets);
if (0 != _Py_RemoteDebug_ReadRemoteMemory(handle, *runtime_start_address, size, debug_offsets)) {
return -1;
}
return 0;
}

我们注意到,这里的核心还是我们先要获取到 PyRuntime 的地址,那么我们来看看 _Py_RemoteDebug_GetPyRuntimeAddress 的实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
static uintptr_t
_Py_RemoteDebug_GetPyRuntimeAddress(proc_handle_t* handle)
{
uintptr_t address;
address = search_linux_map_for_section(handle, "PyRuntime", "python");
if (address == 0) {
// Error out: 'python' substring covers both executable and DLL
PyErr_SetString(PyExc_RuntimeError, "Failed to find the PyRuntime section in the process.");
}
return address;
}

static uintptr_t
search_linux_map_for_section(proc_handle_t *handle, const char* secname, const char* substr)
{
char maps_file_path[64];
sprintf(maps_file_path, "/proc/%d/maps", handle->pid);

FILE* maps_file = fopen(maps_file_path, "r");
if (maps_file == NULL) {
PyErr_SetFromErrno(PyExc_OSError);
return 0;
}

size_t linelen = 0;
size_t linesz = PATH_MAX;
char *line = PyMem_Malloc(linesz);
if (!line) {
fclose(maps_file);
PyErr_NoMemory();
return 0;
}

uintptr_t retval = 0;
while (fgets(line + linelen, linesz - linelen, maps_file) != NULL) {
linelen = strlen(line);
if (line[linelen - 1] != '\n') {
// Read a partial line: realloc and keep reading where we left off.
// Note that even the last line will be terminated by a newline.
linesz *= 2;
char *biggerline = PyMem_Realloc(line, linesz);
if (!biggerline) {
PyMem_Free(line);
fclose(maps_file);
PyErr_NoMemory();
return 0;
}
line = biggerline;
continue;
}

// Read a full line: strip the newline
line[linelen - 1] = '\0';
// and prepare to read the next line into the start of the buffer.
linelen = 0;

unsigned long start = 0;
unsigned long path_pos = 0;
sscanf(line, "%lx-%*x %*s %*s %*s %*s %ln", &start, &path_pos);

if (!path_pos) {
// Line didn't match our format string. This shouldn't be
// possible, but let's be defensive and skip the line.
continue;
}

const char *path = line + path_pos;
const char *filename = strrchr(path, '/');
if (filename) {
filename++; // Move past the '/'
} else {
filename = path; // No directories, or an empty string
}

if (strstr(filename, substr)) {
retval = search_elf_file_for_section(handle, secname, start, path);
if (retval) {
break;
}
}
}

PyMem_Free(line);
fclose(maps_file);

return retval;
}

我们这里能看到 _Py_RemoteDebug_GetPyRuntimeAddress 调用了 search_linux_map_for_section 来获取当前的 PyRuntime 的地址,而 search_linux_map_for_section 则是通过 /proc/${pid}/maps ,暴力遍历 maps 中的内存段来获取具体的地址。

我们来看看 search_elf_file_for_section 的实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
search_elf_file_for_section(
proc_handle_t *handle,
const char* secname,
uintptr_t start_address,
const char *elf_file)
{
if (start_address == 0) {
return 0;
}

uintptr_t result = 0;
void* file_memory = NULL;

int fd = open(elf_file, O_RDONLY);
if (fd < 0) {
PyErr_SetFromErrno(PyExc_OSError);
goto exit;
}

struct stat file_stats;
if (fstat(fd, &file_stats) != 0) {
PyErr_SetFromErrno(PyExc_OSError);
goto exit;
}

file_memory = mmap(NULL, file_stats.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (file_memory == MAP_FAILED) {
PyErr_SetFromErrno(PyExc_OSError);
goto exit;
}

Elf_Ehdr* elf_header = (Elf_Ehdr*)file_memory;

Elf_Shdr* section_header_table = (Elf_Shdr*)(file_memory + elf_header->e_shoff);

Elf_Shdr* shstrtab_section = &section_header_table[elf_header->e_shstrndx];
char* shstrtab = (char*)(file_memory + shstrtab_section->sh_offset);

Elf_Shdr* section = NULL;
for (int i = 0; i < elf_header->e_shnum; i++) {
char* this_sec_name = shstrtab + section_header_table[i].sh_name;
// Move 1 character to account for the leading "."
this_sec_name += 1;
if (strcmp(secname, this_sec_name) == 0) {
section = &section_header_table[i];
break;
}
}

Elf_Phdr* program_header_table = (Elf_Phdr*)(file_memory + elf_header->e_phoff);
// Find the first PT_LOAD segment
Elf_Phdr* first_load_segment = NULL;
for (int i = 0; i < elf_header->e_phnum; i++) {
if (program_header_table[i].p_type == PT_LOAD) {
first_load_segment = &program_header_table[i];
break;
}
}

if (section != NULL && first_load_segment != NULL) {
uintptr_t elf_load_addr = first_load_segment->p_vaddr
- (first_load_segment->p_vaddr % first_load_segment->p_align);
result = start_address + (uintptr_t)section->sh_addr - elf_load_addr;
}

exit:
if (file_memory != NULL) {
munmap(file_memory, file_stats.st_size);
}
if (fd >= 0 && close(fd) != 0) {
PyErr_SetFromErrno(PyExc_OSError);
}
return result;
}

这段代码稍微有点复杂,我们来拆分看一下

首先函数的声明

1
2
3
4
5
search_elf_file_for_section(
proc_handle_t *handle,
const char* secname,
uintptr_t start_address,
const char *elf_file)

用于在ELF文件中搜索特定的section。参数包括:进程句柄、要查找的section名称、起始地址(文件在进程空间的映射位置)、ELF文件路径。

1
2
3
4
5
int fd = open(elf_file, O_RDONLY);
if (fd < 0) {
PyErr_SetFromErrno(PyExc_OSError);
goto exit;
}

以只读方式打开ELF文件,如果失败则设置Python异常并跳转到退出处理。

1
2
3
4
5
file_memory = mmap(NULL, file_stats.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (file_memory == MAP_FAILED) {
PyErr_SetFromErrno(PyExc_OSError);
goto exit;
}

将文件内容映射到内存,以只读和私有方式,从文件头开始。失败则设置异常并退出。

1
2
Elf_Ehdr* elf_header = (Elf_Ehdr*)file_memory;
Elf_Shdr* section_header_table = (Elf_Shdr*)(file_memory + elf_header->e_shoff);

将文件开头 cast 为ELF文件头结构,并找到section header表的位置,它在文件偏移e_shoff处。

1
2
3
4
5
6
7
8
9
10
11
12
Elf_Shdr* shstrtab_section = &section_header_table[elf_header->e_shstrndx];
char* shstrtab = (char*)(file_memory + shstrtab_section->sh_offset);
Elf_Shdr* section = NULL;
for (int i = 0; i < elf_header->e_shnum; i++) {
char* this_sec_name = shstrtab + section_header_table[i].sh_name;
// Move 1 character to account for the leading "."
this_sec_name += 1;
if (strcmp(secname, this_sec_name) == 0) {
section = &section_header_table[i];
break;
}
}

获取section字符串表(包含所有section名称的表),通过e_shstrndx索引定位。同时遍历所有section,查找匹配的section名称。注意需要跳过section名字的”.”前缀。

1
2
3
4
5
6
7
8
9
Elf_Phdr* program_header_table = (Elf_Phdr*)(file_memory + elf_header->e_phoff);
// Find the first PT_LOAD segment
Elf_Phdr* first_load_segment = NULL;
for (int i = 0; i < elf_header->e_phnum; i++) {
if (program_header_table[i].p_type == PT_LOAD) {
first_load_segment = &program_header_table[i];
break;
}
}

找到program header表,然后搜索第一个PT_LOAD类型的segment,它定义了程序加载时的基地址。

1
2
3
4
5
if (section != NULL && first_load_segment != NULL) {
uintptr_t elf_load_addr = first_load_segment->p_vaddr
- (first_load_segment->p_vaddr % first_load_segment->p_align);
result = start_address + (uintptr_t)section->sh_addr - elf_load_addr;
}

如果找到了目标section和第一个LOAD segment,计算目标section的运行时地址:

  1. 计算ELF文件的加载基地址(考虑对齐)
  2. 目标地址 = 进程中映射的起始地址 + section的虚拟地址 - ELF加载基地址

经过这样一个流程,我们就能最终的获取到 _PyRuntime 中的地址,然后基于此做一些包括 PEP 768 在内很有趣的工作。

总结

Python 3.14 官方其实将进程信息以半正式化的形式形成了一组相对稳定的 ABI,这样可以使我们调试工具能以更好的方式对 Python 进程进行无侵入的调试与观测。PEP 768 其实是这个过程中一个的有效产物。而基于 PEP768 处理的比如 Remote PDB debug,目前也已合入分支。

可以说从 Python 3.14 起,Python 的调试工具和手段将得到极大的丰富与增强。建议大家在出来后的第一时间进行升级(

差不多就这样(