Manjusaka的 RSS 预览

成为榜样，但不要成为偶像

2025-10-03 04:00:00

最近刚把空轨 1st 打通关。差不多国庆假期也要结束了，要开始准备接下来的学习和工作了。趁着这个机会，写点东西，当个迟到的 PyCon China 2025 的总结吧

正文

其实去年在结束去年的会议后，我就在考虑要不要彻底退出 PyCon China 一段时间。原因其实也很简单，太累了。不过我向来是个很拖延癌的人。恰逢那时候正在职业的变动期，所以想着要不要再看看。

今年来了一家 AIGC 公司做 infra，说实话我干的还蛮开心的。每天都会学习新的东西。状态相比于去年好了不少（除了每天都需要和该死的 Any 做斗争），所以一度准备继续参加今年的 PyCon China

不过到了筹备期后，原本的计划突生变故。陪伴需求的猫咪病危，去世。一直在医院通宵，陪护以及最终要面临的生死别离。对我的精神和体力造成了极大的消耗。是否继续参加今年的 PyCon China 也成为一个问题。

不过最终还是决定参加了。今年突然出版社那边给我加了一个活，我翻译的书要在现场签售。这就又成为我心里一个新的疑问：嘛，真的会有人来买吗？

嘛，其实这也是每年在参加 PyCon 时都会有的自我怀疑，我真的有资格在这里吗？我讲的东西真的会有人听吗？

不过做了最终的决定那么就继续做吧。所以今年还是给了一个白银赞助+一个主题演讲。

时间过得很快，转眼到了会议前一天，临上飞机前我还在被 AWS 的傻叉 OpenSearch 折磨。下飞机后 yihong，piglei，空想家，jay 他们去开 impact 了。好好好开 impact 不带我是吧。我就孤零零的跑去了酒店。有人演讲前12h文件夹都还没建是谁我不说。

在酒店顶了几个小时把 PPT 赶完，工作收个尾。勉强睡了2h，然后就去会场了。

今年比较轻松的是早上的主持不用我了，我只要负责暖场一下。不过临开场前发现 C 会场的 PPT 还没收，然后摇了晚枫帮忙。

暖场还是我每年的保留节目，Saka 三问.jpg:

现在还在写 Python 的举个手
现在还没写 Python 的举个手（拖出去
用 2.7 的举个手

看起来效果还不错。暖场完我就跑出去了。不过卧槽，今年怎么这么多人来单杀我，~~妈耶社恐狂哭了~~。

不过说实话，在经历生死一圈后，和老朋友打打闹闹，认识一下新朋友，还是蛮开心的。特别是很多人过来说我影响了他们，我是他们的偶像，他们从我的博客和分享里学到很多的时候。我一直以来的疑惑也得到了解决。虽然说人的自我认同最理想应该是内源性的，但是大家的认可也真的会让我很开心。

哦对了，还有很多人说他们也很喜欢摇曳露营！

值得一提的是，我司的宝藏同事从日本来一起面基了（是的，我们是 Remote 公司），还给我带了手办

当然有某屑 HR 说要来参会结果早上睡过了，是谁我不点名了

上午的时间其实过去的很快，很快就结束了。然后我就来到了签售地方。出乎意料的，有很多人都来了，有来捧场的老朋友们，有刚刚线下刚对上号的新朋友们。大家一起打打闹闹，我用我的丑字写了很多祝福，也有很多杂话。要说哪一句最真心，那我觉得是“不要用 Next.js”罢。

签售完火急火燎的吃完盒饭，抽烟的时候遇到师父，~~我开始基情的抚摸他的胸肌~~，这可能算是每次我们师徒相聚的仪式感了。问他了一个问题：你现在还需要我帮你收尸么？这个出处源自于我们之前约好他要是自杀我会来帮他处理后事。他想了想说，我们不如想想怎么活到150岁吧。很好，很强大，我很喜欢的回答。那我就当不会了。

后面纯爷也来加入了聊天，我顺便向他倾诉了一下把 PostgreSQL 当 MongoDB 用的痛苦。纯爷也只能用爱莫能助的眼光看看我

下午 C 会场实际上因为我需要抽烟提神迟到了几分钟，yihong 帮我暖场缓解尴尬，不得不给他磕一个，以及 yihong 抱起来手感很好，建议网友有条件的可以去试试

下午到我的时候其实因为控场的原因给我的时间比预期的要短一些，所以我临时调整了一些内容。要上去讲的时候，发现很多人都从其余会场赶了过来，算是非常满足了。在 QA 的时候，我给大家说我现在在一家用着 Node.js 写着 Any，把 PostgreSQL 当 MongoDB 用，以及还用着 GraphQL 的公司工作。大家一片会心一笑。我想我们屑 HR 大概今天在现场招不到人了罢

下午的时候，屑 HR 和我们在上海的另外一位同事也来，带来了我需要~~昏睡红茶~~零度可乐。说实话在一个 Remote 公司大家面基的机会还是很少的。理所当然，我成为公司群内表情包的一部分。该考虑下找屑 HR 要肖像费了

晚上散场后，去和沪爷阿蔡以及几个同事一起组了一个局。不过说实话拉着明天要去霓虹的两位同事去吃日式拉面算不算另一种职场 80？笑死

总结

说实话今年 PyCon 的当天是我这两个月最快乐的一点，也许在很多年后我记不得了今年讲了什么，但是还会记得当天最简单快乐。

嘛，从18年到现在，7年过去了，我也从一个刚出校园的年轻人变成了一个老登。似乎很多东西都在变，但是很多东西又没变。

我还是很菜，但是我好像比之前影响了更多的人，帮到了更多的人？我还有很多东西不会，但是我好像能学的东西也更多了？还是会经历很多痛苦，很多迷茫，很多挣扎。但是生活似乎也还是一如既往的有着无限的希望与美好？

回北京后，一次吃饭时，我给我女朋友说，你知道吗？很多人说我是他们的偶像。我女朋友说：不，你是他们的榜样

是的，成为榜样，但是不要成为偶像。

这篇文章差不多写到这里。要到中秋了，除了祝大家中秋快乐，阖家欢乐以外。也祝大家每个人都能在这个快速迭代的世道里永葆初心。用 Piglei 老师的话说就是“老而不登”

这个世界唯有爱，希望，奥特曼与摇曳露营不可辜负，抚门！

Further Performance Evolution in Python 3.14: Tail Call Interpreter

2025-07-02 23:49:00

I’ve been overwhelmed by security work lately, so let me switch to something lighter to relax my mind.

Python 3.14 has officially introduced a new mechanism called Tail Call Interpreter (Made by Ken Jin), which is undoubtedly another major milestone that lays the foundation for the future.

Main Content

Before discussing Python 3.14’s Tail Call Interpreter, we need to talk about the most basic syntax in C - switch-case.

switch (x) {
    case 1:
        // do something
        break;
    case 2:
        // do something else
        break;
    default:
        // do default thing
}

What would the final assembly look like for such code? Most people’s first reaction might be to use cmp followed by je, and if it doesn’t match, continue. Let’s compile a version to see.

For this code:

void small_switch(int x) {
    switch(x) {
        case 1: printf("One\n"); break;
        case 2: printf("Two\n"); break;
        case 3: printf("Three\n"); break;
        default: printf("Other\n"); break;
    }
}

The final assembly output would be:

00000000000011f0 <small_switch>:
    11f0:83 ff 02             cmp    $0x2,%edi
    11f3:74 2b                je     1220 <small_switch+0x30>
    11f5:83 ff 03             cmp    $0x3,%edi
    11f8:74 16                je     1210 <small_switch+0x20>
    11fa:83 ff 01             cmp    $0x1,%edi
    11fd:75 31                jne    1230 <small_switch+0x40>
    11ff:48 8d 3d fe 0d 00 00 lea    0xdfe(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1206:e9 25 fe ff ff       jmp    1030 <puts@plt>
    120b:0f 1f 44 00 00       nopl   0x0(%rax,%rax,1)
    1210:48 8d 3d f5 0d 00 00 lea    0xdf5(%rip),%rdi        # 200c <_IO_stdin_used+0xc>
    1217:e9 14 fe ff ff       jmp    1030 <puts@plt>
    121c:0f 1f 40 00          nopl   0x0(%rax)
    1220:48 8d 3d e1 0d 00 00 lea    0xde1(%rip),%rdi        # 2008 <_IO_stdin_used+0x8>
    1227:e9 04 fe ff ff       jmp    1030 <puts@plt>
    122c:0f 1f 40 00          nopl   0x0(%rax)
    1230:48 8d 3d db 0d 00 00 lea    0xddb(%rip),%rdi        # 2012 <_IO_stdin_used+0x12>
    1237:e9 f4 fd ff ff       jmp    1030 <puts@plt>
    123c:0f 1f 40 00          nopl   0x0(%rax)

We can see that overall it’s as we expected - continuous cmp followed by continuous je. Now let’s evaluate the complexity here? Oh, time complexity O(n), that’s straightforward.

Damn, for Python with such a huge switch case structure, wouldn’t that be O(n) every time? Wouldn’t that be a performance disaster?

Actually, no. Usually, compilers use different strategies to handle switch case structures based on data type and scale.

Let’s prepare several examples:

void small_switch(int x) {
    switch(x) {
        case 1: printf("One\n"); break;
        case 2: printf("Two\n"); break;
        case 3: printf("Three\n"); break;
        default: printf("Other\n"); break;
    }
}

void dense_switch(int x) {
    switch(x) {
        case 10: printf("Ten\n"); break;
        case 11: printf("Eleven\n"); break;
        case 12: printf("Twelve\n"); break;
        case 13: printf("Thirteen\n"); break;
        case 14: printf("Fourteen\n"); break;
        case 15: printf("Fifteen\n"); break;
        case 16: printf("Sixteen\n"); break;
        case 17: printf("Seventeen\n"); break;
        default: printf("Other\n"); break;
    }
}

void sparse_switch(int x) {
    switch(x) {
        case 1: printf("One\n"); break;
        case 100: printf("Hundred\n"); break;
        case 1000: printf("Thousand\n"); break;
        case 10000: printf("Ten thousand\n"); break;
        default: printf("Other\n"); break;
    }
}

void large_dense_switch(int x) {
    switch(x) {
        case 1: printf("Case 1\n"); break;
        case 2: printf("Case 2\n"); break;
        case 3: printf("Case 3\n"); break;
        case 4: printf("Case 4\n"); break;
        case 5: printf("Case 5\n"); break;
        case 6: printf("Case 6\n"); break;
        case 7: printf("Case 7\n"); break;
        case 8: printf("Case 8\n"); break;
        case 9: printf("Case 9\n"); break;
        case 10: printf("Case 10\n"); break;
        case 11: printf("Case 11\n"); break;
        case 12: printf("Case 12\n"); break;
        case 13: printf("Case 13\n"); break;
        case 14: printf("Case 14\n"); break;
        case 15: printf("Case 15\n"); break;
        case 16: printf("Case 16\n"); break;
        case 17: printf("Case 17\n"); break;
        case 18: printf("Case 18\n"); break;
        case 19: printf("Case 19\n"); break;
        case 20: printf("Case 20\n"); break;
        default: printf("Other\n"); break;
    }
}

void mixed_switch(int x) {
    switch(x) {
        case 1: printf("Case 1\n"); break;
        case 2: printf("Case 2\n"); break;
        case 3: printf("Case 3\n"); break;
        
        case 50: printf("Case 50\n"); break;
        
        case 100: printf("Case 100\n"); break;
        case 101: printf("Case 101\n"); break;
        case 102: printf("Case 102\n"); break;
        
        default: printf("Other\n"); break;
    }
}

void char_switch(char c) {
    switch(c) {
        case 'a': printf("Letter a\n"); break;
        case 'b': printf("Letter b\n"); break;
        case 'c': printf("Letter c\n"); break;
        case 'd': printf("Letter d\n"); break;
        case 'e': printf("Letter e\n"); break;
        default: printf("Other char\n"); break;
    }
}

Then we disassemble and look at the results (I’ll only paste the key parts here):

00000000000011f0 <small_switch>:
    11f0:83 ff 02             cmp    $0x2,%edi      # Compare if it's 2
    11f3:74 2b                je     1220            # Jump to case 2
    11f5:83 ff 03             cmp    $0x3,%edi      # Compare if it's 3
    11f8:74 16                je     1210            # Jump to case 3
    11fa:83 ff 01             cmp    $0x1,%edi      # Compare if it's 1
    11fd:75 31                jne    1230            # If not, jump to default

0000000000001240 <dense_switch>:
    1240:83 ef 0a             sub    $0xa,%edi      # Subtract 10 (minimum case value)
    1243:83 ff 07             cmp    $0x7,%edi      # Compare range (17-10=7)
    1246:0f 87 90 00 00 00    ja     12dc           # Out of range, jump to default
    124c:48 8d 15 15 0f 00 00 lea    0xf15(%rip),%rdx # Load jump table address
    1253:48 63 04 ba          movslq (%rdx,%rdi,4),%rax # Get offset
    1257:48 01 d0             add    %rdx,%rax      # Calculate target address
    125a:ff e0                jmp    *%rax          # Indirect jump

00000000000012f0 <sparse_switch>:
    12f0:81 ff e8 03 00 00    cmp    $0x3e8,%edi    # Compare 1000
    12f6:74 40                je     1338           # If equal, jump
    12f8:7f 16                jg     1310           # If greater than 1000, continue checking
    12fa:83 ff 01             cmp    $0x1,%edi      # Less than 1000, check 1
    12fd:74 49                je     1348           
    12ff:83 ff 64             cmp    $0x64,%edi     # Check 100
    1302:75 24                jne    1328           # If none match, default
    ...
    1310:81 ff 10 27 00 00    cmp    $0x2710,%edi   # Check 10000

0000000000001360 <large_dense_switch>:
    1360:83 ff 14             cmp    $0x14,%edi     # Check if ≤20
    1363:0f 87 53 01 00 00    ja     14bc           # Out of range
    1369:48 8d 15 18 0e 00 00 lea    0xe18(%rip),%rdx # Jump table address
    1372:48 63 04 ba          movslq (%rdx,%rdi,4),%rax # Direct indexing
    1376:48 01 d0             add    %rdx,%rax
    1379:ff e0                jmp    *%rax

00000000000014d0 <mixed_switch>:
    14d0:83 ff 32             cmp    $0x32,%edi     # Compare 50
    14d3:74 7b                je     1550
    14d5:7f 29                jg     1500           # >50 case
    14d7:83 ff 02             cmp    $0x2,%edi      # ≤50, check small values
    14da:74 64                je     1540
    14dc:83 ff 03             cmp    $0x3,%edi
    ...
    1500:83 ff 65             cmp    $0x65,%edi     # >50, check 100,101,102
    1503:74 5b                je     1560

0000000000001580 <char_switch>:
    1580:83 ef 61             sub    $0x61,%edi     # Subtract ASCII value of 'a'
    1583:40 80 ff 04          cmp    $0x4,%dil      # Check if ≤4 (a-e)
    1587:77 63                ja     15ec
    1589:48 8d 15 4c 0c 00 00 lea    0xc4c(%rip),%rdx
    1590:40 0f b6 ff          movzbl %dil,%edi      # Zero-extend to 32 bit
    1594:48 63 04 ba          movslq (%rdx,%rdi,4),%rax

Here we can see that the compiler handles switch case structures differently based on different data types. Let me summarize this with a table:

Switch Type	Case Count	Distribution	Compiler Strategy	Time Complexity
small_switch	3	Consecutive(1,2,3)	Linear comparison	O(n)
dense_switch	8	Consecutive(10-17)	Offset jump table	O(1)
sparse_switch	4	Sparse(1,100,1000,10000)	Binary search	O(log n)
large_dense_switch	20	Consecutive(1-20)	Standard jump table	O(1)
mixed_switch	7	Partially consecutive	Nested comparison	O(log n)
char_switch	5	Consecutive(‘a’-‘e’)	Character offset table	O(1)

OK, here we find that the final implementation of switch-case varies depending on data types, leading to unpredictability in our final code. So do we have ways to optimize this problem? The answer is yes.

Let’s look at the following code:

#include <stdio.h>

void basic_computed_goto(int operation) {
    static void* jump_table[] = {
        &&op_add,   
        &&op_sub,   
        &&op_mul,   
        &&op_div,   
        &&op_mod,   
        &&op_default
    };
    
    int a = 10, b = 3;
    int result;
    
    if (operation < 0 || operation > 4) {
        operation = 5;
    }
    
    printf("Operation %d: a=%d, b=%d -> ", operation, a, b);
    
    goto *jump_table[operation];
    
op_add:
    result = a + b;
    printf("ADD: %d\n", result);
    return;
    
op_sub:
    result = a - b;
    printf("SUB: %d\n", result);
    return;
    
op_mul:
    result = a * b;
    printf("MUL: %d\n", result);
    return;
    
op_div:
    if (b != 0) {
        result = a / b;
        printf("DIV: %d\n", result);
    } else {
        printf("DIV: Error (division by zero)\n");
    }
    return;
    
op_mod:
    if (b != 0) {
        result = a % b;
        printf("MOD: %d\n", result);
    } else {
        printf("MOD: Error (division by zero)\n");
    }
    return;
    
op_default:
    printf("Unknown operation\n");
    return;
}

We can see that the core operation here is to turn each case of our switch-case into a label, then we use a jump_table to directly jump to the corresponding label. Let’s look at the assembly of the most critical part:

1 2	11d3:48 8d 05 c6 2b 00 00 lea 0x2bc6(%rip),%rax # 3da0 <jump_table.0> 11da:ff 24 d8 jmp *(%rax,%rbx,8)

Here we can summarize that using Computed Goto compared to traditional switch-case has the following advantages:

Reduces the cost of branch prediction fallback
Better instruction cache locality
Reduces the number and overhead of cmp instructions

So how much faster can it be? You can refer to the test results of Computed Goto introduced in CPython, which showed an overall improvement of about 15%.

So is the Computed Goto approach perfect? Actually, no. Currently, CPython’s interpreter ceval.c, which is also currently the largest switch case, has several typical problems:

Computed Goto as a specialized feature of clang and gcc, other platforms have limited benefits
Currently Computed Goto is not mature, different versions of the same compiler may have different issues
Extremely large switch cases cause compilers to not optimize switch cases well enough
We cannot use perf to precisely perform quantitative analysis of per-opcode overhead in our entire process, which will be a big problem in the context of making Python faster

Points 1, 3, and 4 are easy to understand. Let’s look at an example of point 2 (thanks to Ken Jin for providing the example).

In GCC 11, switch-case would generate normal code in certain parts:

738f: movq%r13, %r15
7392: movzbl%ah, %ebx
7395: movzbl%al, %eax
7398: movq(,%rax,8), %rax
73a0: movl%ebx, -0x248(%rbp)
73a6: jmpq*%rax

While in GCC 13-15Beta, it would generate code like this:

747a: movzbl%ah, %ebx
747d: movzbl%al, %eax
7480: movl%ebx, -0x248(%rbp)
7486: movq(,%rax,8), %rax
748e: jmp0x72a0 <_PyEval_EvalFrameDefault+0x970>

72a0: movq%r15, %xmm0
72a5: movq%r13, %xmm3
72aa: movq%r15, %rbx
72ad: punpcklqdq%xmm3, %xmm0
72b1: movhlps%xmm0, %xmm2
72b4: movq%xmm2, %r10
72b9: movq%r10, %r11
72bc: jmpq*%rax

We can see that additional registers were introduced. Computer Architecture 101: additional registers mean additional overhead. Registers are expensive!

So do we have ways to iterate on the extremely large switch case above? Some students might be thinking, since the switch case above is extremely large, why don’t we split it into multiple small functions so that the compiler can have enough context to optimize, and our perf can also precisely analyze the overhead of each function. Wouldn’t that be great?

But other students object: function calls trigger call instructions, which bring additional overhead of register push and pop operations. Won’t this make it slower again?

So can we optimize this? The answer is yes. Many students might have thought of it - tail call.

Suppose we have this code:

__attribute__((noinline)) void g(int x) {
    printf("Value: %d\n", x);
};
void f(int x) {
    return g(x);
}

We can see this assembly:

0000000000001140 <g>:
    1140:55                   push   %rbp
    1141:48 89 e5             mov    %rsp,%rbp
    1144:48 83 ec 10          sub    $0x10,%rsp
    1148:89 7d fc             mov    %edi,-0x4(%rbp)
    114b:8b 75 fc             mov    -0x4(%rbp),%esi
    114e:48 8d 3d af 0e 00 00 lea    0xeaf(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1155:b0 00                mov    $0x0,%al
    1157:e8 d4 fe ff ff       call   1030 <printf@plt>
    115c:48 83 c4 10          add    $0x10,%rsp
    1160:5d                   pop    %rbp
    1161:c3                   ret
    1162:66 66 66 66 66 2e 0f data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
    1169:1f 84 00 00 00 00 00 

0000000000001170 <f>:
    1170:55                   push   %rbp
    1171:48 89 e5             mov    %rsp,%rbp
    1174:48 83 ec 10          sub    $0x10,%rsp
    1178:89 7d fc             mov    %edi,-0x4(%rbp)
    117b:8b 7d fc             mov    -0x4(%rbp),%edi
    117e:e8 bd ff ff ff       call   1140 <g>
    1183:48 83 c4 10          add    $0x10,%rsp
    1187:5d                   pop    %rbp
    1188:c3                   ret
    1189:0f 1f 80 00 00 00 00 nopl   0x0(%rax)

The call 1140 <g> instruction is very prominent. This is also an important source of function call overhead.

In modern compilers, there’s a special optimization called tail recursion, where when the last step of a function is calling another function, the compiler can optimize away the overhead of this call.

Let’s test this:

#include <stdio.h>
__attribute__((preserve_none)) void g(int x);
__attribute__((noinline, preserve_none)) void g(int x){
    printf("Value: %d\n", x);
}

__attribute__((preserve_none)) void f(int x) {
    [[clang::musttail]] return g(x);
}

int main() {
    f(42);
    return 0;
}

Let’s look at the related assembly:

0000000000001140 <g>:
    1140:55                   push   %rbp
    1141:48 89 e5             mov    %rsp,%rbp
    1144:48 83 ec 10          sub    $0x10,%rsp
    1148:44 89 65 fc          mov    %r12d,-0x4(%rbp)
    114c:8b 75 fc             mov    -0x4(%rbp),%esi
    114f:48 8d 3d ae 0e 00 00 lea    0xeae(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1156:31 c0                xor    %eax,%eax
    1158:e8 d3 fe ff ff       call   1030 <printf@plt>
    115d:48 83 c4 10          add    $0x10,%rsp
    1161:5d                   pop    %rbp
    1162:c3                   ret
    1163:66 66 66 66 2e 0f 1f data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
    116a:84 00 00 00 00 00 

0000000000001170 <f>:
    1170:55                   push   %rbp
    1171:48 89 e5             mov    %rsp,%rbp
    1174:44 89 65 fc          mov    %r12d,-0x4(%rbp)
    1178:44 8b 65 fc          mov    -0x4(%rbp),%r12d
    117c:5d                   pop    %rbp
    117d:e9 be ff ff ff       jmp    1140 <g>
    1182:66 66 66 66 66 2e 0f data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
    1189:1f 84 00 00 00 00 00

We can see that the last step of function f is jmp 1140 <g>, not call 1140 <g>. This means when we call function g, we won’t have additional overhead like register allocation that traditional call instructions bring.

Some students might have realized that after tail recursion processing, this can completely be viewed as a high-performance Goto.

Bingo, the idea here is similar. A 1977 paper “Debunking the ‘Expensive Procedure Call’ Myth, or, Procedure Call Implementations Considered Harmful, or, Lambda: The Ultimate GOTO” mentioned that efficient procedure calls can have performance close to Goto, while being more concise in implementation.

In Python 3.14, the implementation of Tail Call Interpreter is based on this idea.

We can see that we’ve applied tail recursion processing to the macro that dispatches opcodes:

#   define Py_MUSTTAIL [[clang::musttail]]
#   define Py_PRESERVE_NONE_CC __attribute__((preserve_none))
    Py_PRESERVE_NONE_CC typedef PyObject* (*py_tail_call_funcptr)(TAIL_CALL_PARAMS);

#   define TARGET(op) Py_PRESERVE_NONE_CC PyObject *_TAIL_CALL_##op(TAIL_CALL_PARAMS)
#   define DISPATCH_GOTO() \
        do { \
            Py_MUSTTAIL return (INSTRUCTION_TABLE[opcode])(TAIL_CALL_ARGS); \
        } while (0)
#   define JUMP_TO_LABEL(name) \
        do { \
            Py_MUSTTAIL return (_TAIL_CALL_##name)(TAIL_CALL_ARGS); \
        } while (0)
#   ifdef Py_STATS
#       define JUMP_TO_PREDICTED(name) \
            do { \
                Py_MUSTTAIL return (_TAIL_CALL_##name)(frame, stack_pointer, tstate, this_instr, oparg, lastopcode); \
            } while (0)
#   else
#       define JUMP_TO_PREDICTED(name) \
            do { \
                Py_MUSTTAIL return (_TAIL_CALL_##name)(frame, stack_pointer, tstate, this_instr, oparg); \
            } while (0)
#   endif
#    define LABEL(name) TARGET(name)

So while ensuring our baseline performance is as good as or even better than Computed Goto, we can get the following benefits:

Broader platform support
After splitting cases, compilers are less likely to make mistakes, and overall performance predictability is stronger
Happy perf
And I can do more cool stuff with tools like eBPF

Summary

This article is pretty much it. Although it claims to only introduce Python 3.14’s Tail Call Interpreter, it still completely introduces the entire evolution of thinking.

This also gives me an insight: predictability is really a very important characteristic in many cases.

This, along with remote debug, are the two features I like most in 3.14. Long live observability!

Python 3.14 的进一步性能进化: Tail Call Interpreter

2025-07-02 23:49:00

最近做安全做的我头晕脑胀，来点轻松的换换脑子，让自己放松下

Python 3.14 正式引入了一个新的机制叫作 Tail Call Interpreter（Made by Ken Jin），这无疑又是一个奠定未来基础的重大工作

正文

在聊 Python 3.14 的 Tail Call Interpreter 之前，我们先要来聊 C 语言中最基本的语法 switch-case

switch (x) {
    case 1:
        // do something
        break;
    case 2:
        // do something else
        break;
    default:
        // do default thing
}

对于这样的代码我们最终的汇编会是什么样的呢？可能大家第一反应是先 cmp 然后 je ，不等式秒了，我们编译一个版本来看看

对于这样一段代码

void small_switch(int x) {
    switch(x) {
        case 1: printf("One\n"); break;
        case 2: printf("Two\n"); break;
        case 3: printf("Three\n"); break;
        default: printf("Other\n"); break;
    }
}

最终汇编的产物会是这样的

00000000000011f0 <small_switch>:
    11f0:83 ff 02             cmp    $0x2,%edi
    11f3:74 2b                je     1220 <small_switch+0x30>
    11f5:83 ff 03             cmp    $0x3,%edi
    11f8:74 16                je     1210 <small_switch+0x20>
    11fa:83 ff 01             cmp    $0x1,%edi
    11fd:75 31                jne    1230 <small_switch+0x40>
    11ff:48 8d 3d fe 0d 00 00 lea    0xdfe(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1206:e9 25 fe ff ff       jmp    1030 <puts@plt>
    120b:0f 1f 44 00 00       nopl   0x0(%rax,%rax,1)
    1210:48 8d 3d f5 0d 00 00 lea    0xdf5(%rip),%rdi        # 200c <_IO_stdin_used+0xc>
    1217:e9 14 fe ff ff       jmp    1030 <puts@plt>
    121c:0f 1f 40 00          nopl   0x0(%rax)
    1220:48 8d 3d e1 0d 00 00 lea    0xde1(%rip),%rdi        # 2008 <_IO_stdin_used+0x8>
    1227:e9 04 fe ff ff       jmp    1030 <puts@plt>
    122c:0f 1f 40 00          nopl   0x0(%rax)
    1230:48 8d 3d db 0d 00 00 lea    0xddb(%rip),%rdi        # 2012 <_IO_stdin_used+0x12>
    1237:e9 f4 fd ff ff       jmp    1030 <puts@plt>
    123c:0f 1f 40 00          nopl   0x0(%rax)

我们能看到整体如我们所预期的一样，不断的 cmp 然后不断的 je，然后我们评估一下这里的复杂度呢？哦，时间复杂度 O(n) 秒了。

卧槽，那对于 Python 这样一个超大的 switch case 的结构，岂不是每次都是一个 O(n) ？这不得原地升天？

其实不是，通常来说，编译器会根据数据的类型和规模来用不同的方案处理 switch case 的结构

我们来准备几组例子

void small_switch(int x) {
    switch(x) {
        case 1: printf("One\n"); break;
        case 2: printf("Two\n"); break;
        case 3: printf("Three\n"); break;
        default: printf("Other\n"); break;
    }
}

void dense_switch(int x) {
    switch(x) {
        case 10: printf("Ten\n"); break;
        case 11: printf("Eleven\n"); break;
        case 12: printf("Twelve\n"); break;
        case 13: printf("Thirteen\n"); break;
        case 14: printf("Fourteen\n"); break;
        case 15: printf("Fifteen\n"); break;
        case 16: printf("Sixteen\n"); break;
        case 17: printf("Seventeen\n"); break;
        default: printf("Other\n"); break;
    }
}

void sparse_switch(int x) {
    switch(x) {
        case 1: printf("One\n"); break;
        case 100: printf("Hundred\n"); break;
        case 1000: printf("Thousand\n"); break;
        case 10000: printf("Ten thousand\n"); break;
        default: printf("Other\n"); break;
    }
}

void large_dense_switch(int x) {
    switch(x) {
        case 1: printf("Case 1\n"); break;
        case 2: printf("Case 2\n"); break;
        case 3: printf("Case 3\n"); break;
        case 4: printf("Case 4\n"); break;
        case 5: printf("Case 5\n"); break;
        case 6: printf("Case 6\n"); break;
        case 7: printf("Case 7\n"); break;
        case 8: printf("Case 8\n"); break;
        case 9: printf("Case 9\n"); break;
        case 10: printf("Case 10\n"); break;
        case 11: printf("Case 11\n"); break;
        case 12: printf("Case 12\n"); break;
        case 13: printf("Case 13\n"); break;
        case 14: printf("Case 14\n"); break;
        case 15: printf("Case 15\n"); break;
        case 16: printf("Case 16\n"); break;
        case 17: printf("Case 17\n"); break;
        case 18: printf("Case 18\n"); break;
        case 19: printf("Case 19\n"); break;
        case 20: printf("Case 20\n"); break;
        default: printf("Other\n"); break;
    }
}

void mixed_switch(int x) {
    switch(x) {
        case 1: printf("Case 1\n"); break;
        case 2: printf("Case 2\n"); break;
        case 3: printf("Case 3\n"); break;
        
        case 50: printf("Case 50\n"); break;
        
        case 100: printf("Case 100\n"); break;
        case 101: printf("Case 101\n"); break;
        case 102: printf("Case 102\n"); break;
        
        default: printf("Other\n"); break;
    }
}

void char_switch(char c) {
    switch(c) {
        case 'a': printf("Letter a\n"); break;
        case 'b': printf("Letter b\n"); break;
        case 'c': printf("Letter c\n"); break;
        case 'd': printf("Letter d\n"); break;
        case 'e': printf("Letter e\n"); break;
        default: printf("Other char\n"); break;
    }
}

然后我们反汇编以下，看下结果（这里我只把关键的部分贴出来）

00000000000011f0 <small_switch>:
    11f0:83 ff 02             cmp    $0x2,%edi      # 比较是否为2
    11f3:74 2b                je     1220            # 跳转到case 2
    11f5:83 ff 03             cmp    $0x3,%edi      # 比较是否为3
    11f8:74 16                je     1210            # 跳转到case 3
    11fa:83 ff 01             cmp    $0x1,%edi      # 比较是否为1
    11fd:75 31                jne    1230            # 不是则跳转到default

0000000000001240 <dense_switch>:
    1240:83 ef 0a             sub    $0xa,%edi      # 减去10 (最小case值)
    1243:83 ff 07             cmp    $0x7,%edi      # 比较范围 (17-10=7)
    1246:0f 87 90 00 00 00    ja     12dc           # 超出范围跳转default
    124c:48 8d 15 15 0f 00 00 lea    0xf15(%rip),%rdx # 加载跳转表地址
    1253:48 63 04 ba          movslq (%rdx,%rdi,4),%rax # 获取偏移量
    1257:48 01 d0             add    %rdx,%rax      # 计算目标地址
    125a:ff e0                jmp    *%rax          # 间接跳转

00000000000012f0 <sparse_switch>:
    12f0:81 ff e8 03 00 00    cmp    $0x3e8,%edi    # 比较1000
    12f6:74 40                je     1338           # 等于则跳转
    12f8:7f 16                jg     1310           # 大于1000则继续检查
    12fa:83 ff 01             cmp    $0x1,%edi      # 小于1000，检查1
    12fd:74 49                je     1348           
    12ff:83 ff 64             cmp    $0x64,%edi     # 检查100
    1302:75 24                jne    1328           # 都不是则default
    ...
    1310:81 ff 10 27 00 00    cmp    $0x2710,%edi   # 检查10000

0000000000001360 <large_dense_switch>:
    1360:83 ff 14             cmp    $0x14,%edi     # 检查是否≤20
    1363:0f 87 53 01 00 00    ja     14bc           # 超出范围
    1369:48 8d 15 18 0e 00 00 lea    0xe18(%rip),%rdx # 跳转表地址
    1372:48 63 04 ba          movslq (%rdx,%rdi,4),%rax # 直接索引
    1376:48 01 d0             add    %rdx,%rax
    1379:ff e0                jmp    *%rax

00000000000014d0 <mixed_switch>:
    14d0:83 ff 32             cmp    $0x32,%edi     # 比较50
    14d3:74 7b                je     1550
    14d5:7f 29                jg     1500           # >50的情况
    14d7:83 ff 02             cmp    $0x2,%edi      # ≤50，检查小值
    14da:74 64                je     1540
    14dc:83 ff 03             cmp    $0x3,%edi
    ...
    1500:83 ff 65             cmp    $0x65,%edi     # >50，检查100,101,102
    1503:74 5b                je     1560

0000000000001580 <char_switch>:
    1580:83 ef 61             sub    $0x61,%edi     # 减去'a'的ASCII值
    1583:40 80 ff 04          cmp    $0x4,%dil      # 检查是否≤4 (a-e)
    1587:77 63                ja     15ec
    1589:48 8d 15 4c 0c 00 00 lea    0xc4c(%rip),%rdx
    1590:40 0f b6 ff          movzbl %dil,%edi      # 零扩展到32位
    1594:48 63 04 ba          movslq (%rdx,%rdi,4),%rax

我们这里能看到编译器根据数据的不同类型来处理了 switch case 的结构，这里我用一个表格总结一下

Switch类型	Case数量	分布特点	编译器策略	时间复杂度
small_switch	3个	连续(1,2,3)	线性比较	O(n)
dense_switch	8个	连续(10-17)	偏移跳转表	O(1)
sparse_switch	4个	稀疏(1,100,1000,10000)	二分查找	O(log n)
large_dense_switch	20个	连续(1-20)	标准跳转表	O(1)
mixed_switch	7个	部分连续	嵌套比较	O(log n)
char_switch	5个	连续(‘a’-‘e’)	字符偏移表	O(1)

OK，这里我们发现，Switch-case 最终的实现因为数据类型的不一样，会导致我们最终的代码存在一个不可预测性。那么我们有没有办法来优化这个问题呢？答案是有的。

我们来看下面一段代码

#include <stdio.h>

void basic_computed_goto(int operation) {
    static void* jump_table[] = {
        &&op_add,   
        &&op_sub,   
        &&op_mul,   
        &&op_div,   
        &&op_mod,   
        &&op_default
    };
    
    int a = 10, b = 3;
    int result;
    
    if (operation < 0 || operation > 4) {
        operation = 5;
    }
    
    printf("Operation %d: a=%d, b=%d -> ", operation, a, b);
    
    goto *jump_table[operation];
    
op_add:
    result = a + b;
    printf("ADD: %d\n", result);
    return;
    
op_sub:
    result = a - b;
    printf("SUB: %d\n", result);
    return;
    
op_mul:
    result = a * b;
    printf("MUL: %d\n", result);
    return;
    
op_div:
    if (b != 0) {
        result = a / b;
        printf("DIV: %d\n", result);
    } else {
        printf("DIV: Error (division by zero)\n");
    }
    return;
    
op_mod:
    if (b != 0) {
        result = a % b;
        printf("MOD: %d\n", result);
    } else {
        printf("MOD: Error (division by zero)\n");
    }
    return;
    
op_default:
    printf("Unknown operation\n");
    return;
}

我们能看到这里核心的一个操作是将我们 Switch-cased 的每个 case 都变成了一个标签，然后我们通过一个 jump_table 来直接跳转到对应的标签上, 我们来看一下最关键位置的汇编

1 2	11d3:48 8d 05 c6 2b 00 00 lea 0x2bc6(%rip),%rax # 3da0 <jump_table.0> 11da:ff 24 d8 jmp *(%rax,%rbx,8)

这里我们可以总结出来使用 Computed Goto 相较于传统的 switch-case 有以下几点优势

减少分支预测 fallback 的代价
指令缓存局部性上更优
减少了 cmp 指令的数量和开销

那么具体能有多快呢？可以参见 CPython 引入的 Computed Goto 的一个测试结果，大概是整体提升了15% 左右

那么 Computed Goto 的方式就是完美的吗？其实也不是，目前 CPython 的解释器 ceval.c 也是目前最大的一个 switch case 中存在几个典型问题

Computed Goto 作为 clang 和 gcc 特化的功能，那么其余平台受益的可能性较小
目前 Computed Goto 其实并不成熟，在同一个编译器不同的版本可能会有不同的问题
超大型的 switch case 会导致编译器对于 switch case 的优化不够好
我们无法使用 perf 精确的去对我们整个过程中 per opcode 的开销进行定量分析，这在于让 Python 变得更快的大背景下将会是一个很大的问题

1，3，4 都很好理解，我们来看一下2的一个例子（感谢 Ken Jin 提供的例子）

在 GCC 11 的时候，switch-case 某个部分会正常的代码

738f: movq%r13, %r15
7392: movzbl%ah, %ebx
7395: movzbl%al, %eax
7398: movq(,%rax,8), %rax
73a0: movl%ebx, -0x248(%rbp)
73a6: jmpq*%rax

而在 GCC 13-15Beta 的时候，则会产生这样的代码

747a: movzbl%ah, %ebx
747d: movzbl%al, %eax
7480: movl%ebx, -0x248(%rbp)
7486: movq(,%rax,8), %rax
748e: jmp0x72a0 <_PyEval_EvalFrameDefault+0x970>

72a0: movq%r15, %xmm0
72a5: movq%r13, %xmm3
72aa: movq%r15, %rbx
72ad: punpcklqdq%xmm3, %xmm0
72b1: movhlps%xmm0, %xmm2
72b4: movq%xmm2, %r10
72b9: movq%r10, %r11
72bc: jmpq*%rax

我们能发现，额外的寄存器被引入了。体系结构 101，额外的寄存器意味着额外的开销。寄存器是很贵的！

那么我们有没有办法来迭代上面的超大的 Switch case 呢？估计有同学在想，既然上面的 switch case 超级大，那么我们将其拆分为多个小的函数
这样编译器可以有足够的上下文来优化，同时我们的 perf 也可以精确的去分析每个函数的开销，岂不美哉？

但是又有同学反对了，函数调用会触发 call 的指令，会带来额外的寄存器入栈和出栈的开销，这样会不会又变慢了呢？

那么能不能优化一下呢？答案是可以的，很多同学可能会想到了，tail call

假设我们有这样一段代码

__attribute__((noinline)) void g(int x) {
    printf("Value: %d\n", x);
};
void f(int x) {
    return g(x);
}

我们能看到这样一段汇编

0000000000001140 <g>:
    1140:55                   push   %rbp
    1141:48 89 e5             mov    %rsp,%rbp
    1144:48 83 ec 10          sub    $0x10,%rsp
    1148:89 7d fc             mov    %edi,-0x4(%rbp)
    114b:8b 75 fc             mov    -0x4(%rbp),%esi
    114e:48 8d 3d af 0e 00 00 lea    0xeaf(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1155:b0 00                mov    $0x0,%al
    1157:e8 d4 fe ff ff       call   1030 <printf@plt>
    115c:48 83 c4 10          add    $0x10,%rsp
    1160:5d                   pop    %rbp
    1161:c3                   ret
    1162:66 66 66 66 66 2e 0f data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
    1169:1f 84 00 00 00 00 00 

0000000000001170 <f>:
    1170:55                   push   %rbp
    1171:48 89 e5             mov    %rsp,%rbp
    1174:48 83 ec 10          sub    $0x10,%rsp
    1178:89 7d fc             mov    %edi,-0x4(%rbp)
    117b:8b 7d fc             mov    -0x4(%rbp),%edi
    117e:e8 bd ff ff ff       call   1140 <g>
    1183:48 83 c4 10          add    $0x10,%rsp
    1187:5d                   pop    %rbp
    1188:c3                   ret
    1189:0f 1f 80 00 00 00 00 nopl   0x0(%rax)

其中 call 1140 <g> 指令非常显眼。这也是函数调用本身的开销的一个重要来源

在现在编译器中，存在一种特殊的优化叫作尾递归，即当函数的最后一步是调用另一个函数时，编译器可以优化掉这个调用的开销

我们来测试一下

#include <stdio.h>
__attribute__((preserve_none)) void g(int x);
__attribute__((noinline, preserve_none)) void g(int x){
    printf("Value: %d\n", x);
}

__attribute__((preserve_none)) void f(int x) {
    [[clang::musttail]] return g(x);
}

int main() {
    f(42);
    return 0;
}

我们来看下相关汇编

0000000000001140 <g>:
    1140:55                   push   %rbp
    1141:48 89 e5             mov    %rsp,%rbp
    1144:48 83 ec 10          sub    $0x10,%rsp
    1148:44 89 65 fc          mov    %r12d,-0x4(%rbp)
    114c:8b 75 fc             mov    -0x4(%rbp),%esi
    114f:48 8d 3d ae 0e 00 00 lea    0xeae(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1156:31 c0                xor    %eax,%eax
    1158:e8 d3 fe ff ff       call   1030 <printf@plt>
    115d:48 83 c4 10          add    $0x10,%rsp
    1161:5d                   pop    %rbp
    1162:c3                   ret
    1163:66 66 66 66 2e 0f 1f data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
    116a:84 00 00 00 00 00 

0000000000001170 <f>:
    1170:55                   push   %rbp
    1171:48 89 e5             mov    %rsp,%rbp
    1174:44 89 65 fc          mov    %r12d,-0x4(%rbp)
    1178:44 8b 65 fc          mov    -0x4(%rbp),%r12d
    117c:5d                   pop    %rbp
    117d:e9 be ff ff ff       jmp    1140 <g>
    1182:66 66 66 66 66 2e 0f data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
    1189:1f 84 00 00 00 00 00

我们能看到，f 函数的最后一步是 jmp 1140 <g>，而不是 call 1140 <g>，这就意味着我们在调用 g 函数的时候不会有额外的寄存器分配等传统 call 指令带来的开销。

可能有同学回过味来了，那么这里在做尾递归处理后，感觉完全可以当作一种高性能 Goto 来看嘛。

Bingo，这里其实思路也是差不多这样的，在77年的一篇论文《Debunking the ‘Expensive Procedure Call’ Myth, or, Procedure Call Implementations Considered Harmful, or, Lambda: The Ultimate GOTO》就提到了，高效的过程调用可以和 Goto 性能相近，而在实现上会更简洁。

在 Python 3.14 中，Tail Call Interpreter 的实现就是基于这个思路的。

我们能看到我们对于 opcode 进行 dispatch 的宏进行了尾递归的处理

#   define Py_MUSTTAIL [[clang::musttail]]
#   define Py_PRESERVE_NONE_CC __attribute__((preserve_none))
    Py_PRESERVE_NONE_CC typedef PyObject* (*py_tail_call_funcptr)(TAIL_CALL_PARAMS);

#   define TARGET(op) Py_PRESERVE_NONE_CC PyObject *_TAIL_CALL_##op(TAIL_CALL_PARAMS)
#   define DISPATCH_GOTO() \
        do { \
            Py_MUSTTAIL return (INSTRUCTION_TABLE[opcode])(TAIL_CALL_ARGS); \
        } while (0)
#   define JUMP_TO_LABEL(name) \
        do { \
            Py_MUSTTAIL return (_TAIL_CALL_##name)(TAIL_CALL_ARGS); \
        } while (0)
#   ifdef Py_STATS
#       define JUMP_TO_PREDICTED(name) \
            do { \
                Py_MUSTTAIL return (_TAIL_CALL_##name)(frame, stack_pointer, tstate, this_instr, oparg, lastopcode); \
            } while (0)
#   else
#       define JUMP_TO_PREDICTED(name) \
            do { \
                Py_MUSTTAIL return (_TAIL_CALL_##name)(frame, stack_pointer, tstate, this_instr, oparg); \
            } while (0)
#   endif
#    define LABEL(name) TARGET(name)

那么在保证我们基线性能和 Compute GOTO 甚至更优一点的同时，我们可以得到如下的一些好处

更广泛的平台支持
将 case 拆分后，编译器更不容易犯错，整体的性能的可预测性更强
happy perf
以及我可以用 eBPF 之类的工具做更多的骚操作（

总结

这篇文章差不多就是这样，虽然说是只介绍 Python 3.14 的 Tail Call Interpreter，但是还是完整的介绍了一些整个的一个演进思路

这也带给我一个启发，很多时候，可预测性真的是非常重要的一个特性。

这算是 3.14 中和 remote debug 一起并列为我最喜欢的两个feature，可观测性万岁！

Python 3.14: Python 世界的一大步

2025-04-26 22:49:00

Python 3.14 目前主要的一些主要的特性其实已经固定了，在我看来，Python 3.14 是一个未来很多年的一个核心版本。因为其确定了是时代的 Python
调试生态的基准，这篇文章将会来聊聊这个 Python 世界中的史诗级改进

正文

在我们日常调试 Python 代码的时候，我们经常会遇到这样一个问题，我们需要采样当前的 Python Runtime 的状态，进而进一步调试我们的 Python 进程

常见的手段莫过于两种

通过 eBPF + UProbe 等手段来触发
通过 process_vm_readv 等 Syscall 来直接整块读取内存

无论这两种方式都有一个核心的问题，我们怎么样来解析内存中的数据？

用 https://github.com/jschwinger233/perf-examples/blob/main/cpython310_backtrace/bpf.c 来做一个例子，在之前的很多年的时候，我们会怎么做

#define PAGE_SIZE (1<<12)
#define KASAN_STACK_ORDER 0
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE  ((__u64)(PAGE_SIZE << THREAD_SIZE_ORDER))
#define TOP_OF_KERNEL_STACK_PADDING ((__u64)0)

const static u32 ZERO = 0;

struct PyTypeObject {
    char _[24];
    char *tp_name;
};

struct PyObject {
    char _[8];
    struct PyTypeObject *ob_type;
};

struct PyVarObject {
    struct PyObject ob_base;
    char _[8];
};

struct PyASCIIObject {
__u8 _[16];
__u64 length;
__u8 __[24];
};

struct _PyStr {
    struct PyASCIIObject ascii;
    char buf[100];
};

struct PyCodeObject {
    char _[104];
    struct _PyStr *co_filename;
    struct _PyStr *co_name;
};

struct PyFrameObject {
    struct PyVarObject ob_base;
    struct PyFrameObject *f_back;
    struct PyCodeObject *f_code;
    char _[60];
    int f_lineno;
};

struct event {
__u64 rip;
__u8 user_mode;
__s8 python_stack_depth;
__u64 filename_len[20];
__u64 funcname_len[20];
unsigned char filename[20][100];
unsigned char funcname[20][100];
};

struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, 1);
__type(key, u32);
__type(value, struct event);
} events SEC(".maps");

struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 1<<29);
} ringbuf SEC(".maps");

SEC("perf_event/cpython310")
int perf_event_cpython310(struct bpf_perf_event_data *ctx)
{
__u64 rsp;
struct event *event;
struct PyFrameObject *frame;

event = bpf_map_lookup_elem(&events, &ZERO);
if (!event)
return 0;

rsp = ctx->regs.sp;
event->rip = ctx->regs.ip;
event->user_mode = !!(ctx->regs.cs & 3);

if (!event->user_mode) {
struct task_struct *task = (struct task_struct *)bpf_get_current_task();
__u64 __ptr = (__u64)BPF_CORE_READ(task, stack);
__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
struct pt_regs *pt_regs = ((struct pt_regs *)__ptr) - 1;

rsp = BPF_CORE_READ(pt_regs, sp);
event->rip = BPF_CORE_READ(pt_regs, ip);
}

char name[5];
bool found = false;

for (int i = 0; i < 200; i++) {
bpf_probe_read_user(&frame, sizeof(frame), (void *)rsp + 8*i);
if (!frame)
continue;

char *tp_name = BPF_PROBE_READ_USER(frame, ob_base.ob_base.ob_type, tp_name);
bpf_probe_read_user(&name, sizeof(name), (void *)tp_name);
if (bpf_strncmp(name, 5, "frame") == 0) {
found = true;
break;
}
}

if (!found) {
event->python_stack_depth = -1;
bpf_ringbuf_output(&ringbuf, event, sizeof(*event), 0);
return 0;
}

for (int i = 0; i < 20; i++) {
event->python_stack_depth = i;
BPF_PROBE_READ_USER_INTO(&event->filename_len[i], frame, f_code, co_filename, ascii.length);
BPF_PROBE_READ_USER_INTO(&event->filename[i], frame, f_code, co_filename, buf);
BPF_PROBE_READ_USER_INTO(&event->funcname_len[i], frame, f_code, co_name, ascii.length);
BPF_PROBE_READ_USER_INTO(&event->funcname[i], frame, f_code, co_name, buf);
frame = BPF_PROBE_READ_USER(frame, f_back);
if (!frame)
break;
}

bpf_ringbuf_output(&ringbuf, event, sizeof(*event), 0);
return 0;
}

char __license[] SEC("license") = "Dual MIT/GPL";

上面的核心代码其实没多少，核心的逻辑就还是我们手动模拟 Python 中关键的 PyFrameObject 结构体，然后我们在内存中不断做一次搜索，暴力匹配到特征一致的内存

其余诸如 PySpy 这样的工具也是类似的思路

这个方式最核心的问题是在于说，Python 每个版本的 ABI 都可能发生变化，所以我们需要不断的根据不同的版本去做兼容（比如 PySpy 维护了从3.7到3.12的不同的 PyFrameObject。

那么我们有没有更好的方法来处理这个问题？或者说我们能不能更好的去定位？

可以的，写 Python 的同学肯定都知道我们 Python 中有一个全局的变量 _PyRuntime，其类型为 pyruntimestate，大致的布局如下

struct pyruntimestate {

    _Py_DebugOffsets debug_offsets;

    int _initialized;

    int preinitializing;

    int preinitialized;

    int core_initialized;

    int initialized;

    PyThreadState *_finalizing;

    unsigned long _finalizing_id;

    struct pyinterpreters {
        PyMutex mutex;
        PyInterpreterState *head;

        PyInterpreterState *main;

        int64_t next_id;
    } interpreters;


    unsigned long main_thread;
    PyThreadState *main_tstate;


    _PyXI_global_state_t xi;

    struct _pymem_allocators allocators;
    struct _obmalloc_global_state obmalloc;
    struct pyhash_runtime_state pyhash_state;
    struct _pythread_runtime_state threads;
    struct _signals_runtime_state signals;

    Py_tss_t autoTSSkey;

    Py_tss_t trashTSSkey;

    PyWideStringList orig_argv;

    struct _parser_runtime_state parser;

    struct _atexit_runtime_state atexit;

    struct _import_runtime_state imports;
    struct _ceval_runtime_state ceval;
    struct _gilstate_runtime_state {

        int check_enabled;

        PyInterpreterState *autoInterpreterState;
    } gilstate;
    struct _getargs_runtime_state {
        struct _PyArg_Parser *static_parsers;
    } getargs;
    struct _fileutils_state fileutils;
    struct _faulthandler_runtime_state faulthandler;
    struct _tracemalloc_runtime_state tracemalloc;
    struct _reftracer_runtime_state ref_tracer;

    _PyRWMutex stoptheworld_mutex;
    struct _stoptheworld_state stoptheworld;

    PyPreConfig preconfig;
    Py_OpenCodeHookFunction open_code_hook;
    void *open_code_userdata;
    struct {
        PyMutex mutex;
        struct _Py_AuditHookEntry *head;
    } audit_hooks;

    struct _py_object_runtime_state object_state;
    struct _Py_float_runtime_state float_state;
    struct _Py_unicode_runtime_state unicode_state;
    struct _types_runtime_state types;
    struct _Py_time_runtime_state time;

#if defined(__EMSCRIPTEN__) && defined(PY_CALL_TRAMPOLINE)

    int (*emscripten_count_args_function)(PyCFunctionWithKeywords func);
#endif
    struct _Py_cached_objects cached_objects;
    struct _Py_static_objects static_objects;

    PyInterpreterState _main_interpreter;

};

眼尖的同学肯定看到了，我们其中有一段核心的代码

struct pyinterpreters {
    PyMutex mutex;
    PyInterpreterState *head;

    PyInterpreterState *main;

    int64_t next_id;
} interpreters;

维护了一个 PyInterpreterState 的链表，我们可以通过 PyInterpreterState 来获取当前的 Frame，PyInterpreterState 中的 TreadState 来获取当前的线程状态

struct pythreads {
    uint64_t next_unique_id;
    /* The linked list of threads, newest first. */
    PyThreadState *head;
    _PyThreadStateImpl *preallocated;
    /* The thread currently executing in the __main__ module, if any. */
    PyThreadState *main;
    /* Used in Modules/_threadmodule.c. */
    Py_ssize_t count;
    /* Support for runtime thread stack size tuning.
       A value of 0 means using the platform's default stack size
       or the size specified by the THREAD_STACK_SIZE macro. */
    /* Used in Python/thread.c. */
    size_t stacksize;
} threads;

而 PyThreadState 中和核心的 struct _PyInterpreterFrame *current_frame 就是我们需要的 frame state，整个流程大概如下

graph TD    PyRuntime["_PyRuntime (pyruntimestate)"] --> Interpreters["interpreters (pyinterpreters)"]    Interpreters -->|head| InterpreterStateHead["PyInterpreterState *head"]    Interpreters -->|main| InterpreterStateMain["PyInterpreterState *main"]        %% Define interpreter state structure    subgraph PyInterpreterState        InterpreterID["int64_t id"]         ThreadsStruct["struct pythreads threads"]        NextInterpreter["PyInterpreterState *next"]    end        InterpreterStateHead --- PyInterpreterState    InterpreterStateMain --- PyInterpreterState        %% Link to threads structure    ThreadsStruct --> ThreadHead["PyThreadState *head"]    ThreadsStruct --> ThreadMain["PyThreadState *main"]        %% Define thread state structure    subgraph PyThreadState        ThreadID["uint64_t thread_id"]        InterpreterPtr["PyInterpreterState *interp"]        CurrentFrame["_PyInterpreterFrame *current_frame"]        NextThread["PyThreadState *next"]    end        ThreadHead --- PyThreadState    ThreadMain --- PyThreadState        %% Frame structure    CurrentFrame --> Frame["_PyInterpreterFrame structure"]        subgraph _PyInterpreterFrame        PreviousFrame["_PyInterpreterFrame *previous"]        CodeObject["PyCodeObject *f_code"]        Locals["PyObject **localsplus"]    end        %% Connected paths in color    PyRuntime ==>|"Main Path"| Interpreters    Interpreters ==>|"Main Path"| InterpreterStateMain    InterpreterStateMain ==>|"Main Path"| ThreadsStruct    ThreadsStruct ==>|"Main Path"| ThreadMain    ThreadMain ==>|"Main Path"| CurrentFrame    CurrentFrame ==>|"Main Path"| Frame        class PyRuntime,InterpreterStateMain,ThreadMain,CurrentFrame,Frame mainPath;    classDef mainPath fill:#f96,stroke:#333,stroke-width:2px;    classDef mainNodes fill:#f9f,stroke:#333,stroke-width:2px;

那么我们现在来解决第一个问题，我们怎么样获取在内存中的 _PyRuntime 的地址呢？

我们把这个问题抽象成下面最简单一个 C 代码

#include <stdio.h>

int abc=3;

int main() {
    printf("abc: %p\n", &abc);
    return 0;
}

我们怎么样获取 abc 的地址呢？这里写过 C 的同学可能反应过来了，我们可以使用 __attribute__((section())) 的语法，来将其放到一个特定的段中

#include <stdio.h>

int abc __attribute__((section(".my_section"))) = 3;

int main() {
    printf("abc: %p\n", &abc);
    return 0;
}

我们编译，并用 readelf 来解析一下二进制

1 2	╰─ readelf -S ./a.out\| grep my_section [25] .my_section PROGBITS 0000000000004018 00003018

我们能看到这里我们得到了一个相对地址。后续我们就可以通过解析 ELF 来遍历寻找到 abc 变量的地址

那么在 Python 中同样如此，在代码中有这样一段代码

#define GENERATE_DEBUG_SECTION(name, declaration)     \
   _GENERATE_DEBUG_SECTION_WINDOWS(name)            \
   _GENERATE_DEBUG_SECTION_APPLE(name)              \
   declaration                                      \
   _GENERATE_DEBUG_SECTION_LINUX(name)

// Please note that section names are truncated to eight bytes
// on Windows!
#if defined(MS_WINDOWS)
#define _GENERATE_DEBUG_SECTION_WINDOWS(name)                       \
   _Pragma(Py_STRINGIFY(section(Py_STRINGIFY(name), read, write))) \
   __declspec(allocate(Py_STRINGIFY(name)))
#else
#define _GENERATE_DEBUG_SECTION_WINDOWS(name)
#endif

#if defined(__APPLE__)
#define _GENERATE_DEBUG_SECTION_APPLE(name) \
   __attribute__((section(SEG_DATA "," Py_STRINGIFY(name))))      \
   __attribute__((used))
#else
#define _GENERATE_DEBUG_SECTION_APPLE(name)
#endif

#if defined(__linux__) && (defined(__GNUC__) || defined(__clang__))
#define _GENERATE_DEBUG_SECTION_LINUX(name) \
   __attribute__((section("." Py_STRINGIFY(name))))               \
   __attribute__((used))
#else
#define _GENERATE_DEBUG_SECTION_LINUX(name)
#endif

GENERATE_DEBUG_SECTION(PyRuntime, _PyRuntimeState _PyRuntime)
= _PyRuntimeState_INIT(_PyRuntime, _Py_Debug_Cookie);
_Py_COMP_DIAG_POP

这样我们就能比较方便的获取到 PyRuntime 在内存中的地址。

那么现在第二个问题是，我们怎么样通过我们前面介绍的调用链获取到地址？

大家可能第一反应还是想通过维护不同版本的数据结构来获取具体的地址。不过这里我们有没有办法可以用更简单的方法来处理呢？答案是有的

眼尖的同学可能看到了我们在 pyruntimestate 中有一个字段叫 debug_offsets，我们来看下我们怎么初始化这个字段的吧

#define _Py_DebugOffsets_INIT(debug_cookie) { \
    .cookie = debug_cookie, \
    .version = PY_VERSION_HEX, \
    .free_threaded = _Py_Debug_Free_Threaded, \
    .runtime_state = { \
        .size = sizeof(_PyRuntimeState), \
        .finalizing = offsetof(_PyRuntimeState, _finalizing), \
        .interpreters_head = offsetof(_PyRuntimeState, interpreters.head), \
    }, \
    .interpreter_state = { \
        .size = sizeof(PyInterpreterState), \
        .id = offsetof(PyInterpreterState, id), \
        .next = offsetof(PyInterpreterState, next), \
        .threads_head = offsetof(PyInterpreterState, threads.head), \
        .threads_main = offsetof(PyInterpreterState, threads.main), \
        .gc = offsetof(PyInterpreterState, gc), \
        .imports_modules = offsetof(PyInterpreterState, imports.modules), \
        .sysdict = offsetof(PyInterpreterState, sysdict), \
        .builtins = offsetof(PyInterpreterState, builtins), \
        .ceval_gil = offsetof(PyInterpreterState, ceval.gil), \
        .gil_runtime_state = offsetof(PyInterpreterState, _gil), \
        .gil_runtime_state_enabled = _Py_Debug_gilruntimestate_enabled, \
        .gil_runtime_state_locked = offsetof(PyInterpreterState, _gil.locked), \
        .gil_runtime_state_holder = offsetof(PyInterpreterState, _gil.last_holder), \
    }, \
    .thread_state = { \
        .size = sizeof(PyThreadState), \
        .prev = offsetof(PyThreadState, prev), \
        .next = offsetof(PyThreadState, next), \
        .interp = offsetof(PyThreadState, interp), \
        .current_frame = offsetof(PyThreadState, current_frame), \
        .thread_id = offsetof(PyThreadState, thread_id), \
        .native_thread_id = offsetof(PyThreadState, native_thread_id), \
        .datastack_chunk = offsetof(PyThreadState, datastack_chunk), \
        .status = offsetof(PyThreadState, _status), \
    }, \
    .interpreter_frame = { \
        .size = sizeof(_PyInterpreterFrame), \
        .previous = offsetof(_PyInterpreterFrame, previous), \
        .executable = offsetof(_PyInterpreterFrame, f_executable), \
        .instr_ptr = offsetof(_PyInterpreterFrame, instr_ptr), \
        .localsplus = offsetof(_PyInterpreterFrame, localsplus), \
        .owner = offsetof(_PyInterpreterFrame, owner), \
        .stackpointer = offsetof(_PyInterpreterFrame, stackpointer), \
    }, \
    .code_object = { \
        .size = sizeof(PyCodeObject), \
        .filename = offsetof(PyCodeObject, co_filename), \
        .name = offsetof(PyCodeObject, co_name), \
        .qualname = offsetof(PyCodeObject, co_qualname), \
        .linetable = offsetof(PyCodeObject, co_linetable), \
        .firstlineno = offsetof(PyCodeObject, co_firstlineno), \
        .argcount = offsetof(PyCodeObject, co_argcount), \
        .localsplusnames = offsetof(PyCodeObject, co_localsplusnames), \
        .localspluskinds = offsetof(PyCodeObject, co_localspluskinds), \
        .co_code_adaptive = offsetof(PyCodeObject, co_code_adaptive), \
    }, \
    .pyobject = { \
        .size = sizeof(PyObject), \
        .ob_type = offsetof(PyObject, ob_type), \
    }, \
    .type_object = { \
        .size = sizeof(PyTypeObject), \
        .tp_name = offsetof(PyTypeObject, tp_name), \
        .tp_repr = offsetof(PyTypeObject, tp_repr), \
        .tp_flags = offsetof(PyTypeObject, tp_flags), \
    }, \
    .tuple_object = { \
        .size = sizeof(PyTupleObject), \
        .ob_item = offsetof(PyTupleObject, ob_item), \
        .ob_size = offsetof(PyTupleObject, ob_base.ob_size), \
    }, \
    .list_object = { \
        .size = sizeof(PyListObject), \
        .ob_item = offsetof(PyListObject, ob_item), \
        .ob_size = offsetof(PyListObject, ob_base.ob_size), \
    }, \
    .set_object = { \
        .size = sizeof(PySetObject), \
        .used = offsetof(PySetObject, used), \
        .table = offsetof(PySetObject, table), \
        .mask = offsetof(PySetObject, mask), \
    }, \
    .dict_object = { \
        .size = sizeof(PyDictObject), \
        .ma_keys = offsetof(PyDictObject, ma_keys), \
        .ma_values = offsetof(PyDictObject, ma_values), \
    }, \
    .float_object = { \
        .size = sizeof(PyFloatObject), \
        .ob_fval = offsetof(PyFloatObject, ob_fval), \
    }, \
    .long_object = { \
        .size = sizeof(PyLongObject), \
        .lv_tag = offsetof(PyLongObject, long_value.lv_tag), \
        .ob_digit = offsetof(PyLongObject, long_value.ob_digit), \
    }, \
    .bytes_object = { \
        .size = sizeof(PyBytesObject), \
        .ob_size = offsetof(PyBytesObject, ob_base.ob_size), \
        .ob_sval = offsetof(PyBytesObject, ob_sval), \
    }, \
    .unicode_object = { \
        .size = sizeof(PyUnicodeObject), \
        .state = offsetof(PyUnicodeObject, _base._base.state), \
        .length = offsetof(PyUnicodeObject, _base._base.length), \
        .asciiobject_size = sizeof(PyASCIIObject), \
    }, \
    .gc = { \
        .size = sizeof(struct _gc_runtime_state), \
        .collecting = offsetof(struct _gc_runtime_state, collecting), \
    }, \
    .gen_object = { \
        .size = sizeof(PyGenObject), \
        .gi_name = offsetof(PyGenObject, gi_name), \
        .gi_iframe = offsetof(PyGenObject, gi_iframe), \
        .gi_frame_state = offsetof(PyGenObject, gi_frame_state), \
    }, \
    .debugger_support = { \
        .eval_breaker = offsetof(PyThreadState, eval_breaker), \
        .remote_debugger_support = offsetof(PyThreadState, remote_debugger_support),  \
        .remote_debugging_enabled = offsetof(PyInterpreterState, config.remote_debug),  \
        .debugger_pending_call = offsetof(_PyRemoteDebuggerSupport, debugger_pending_call),  \
        .debugger_script_path = offsetof(_PyRemoteDebuggerSupport, debugger_script_path),  \
        .debugger_script_path_size = MAX_SCRIPT_PATH_SIZE, \
    }, \
}

我们能看到我们使用了 offsetof 这个非常经典的宏来将一下我们常用的字段相较于结构体的偏移写入到 debug_offsets 中去。而 debug_offsets 将固定存在于 pyruntimestate 的第一个字段，同时起改变频率相对较低，所以我们就可以通过 debugger_support 获取不同地址的偏移量来获取最终我们想要的数据。

通过这样的做法，我们实际上就有很多很好玩的事情可以做了。实际上官方也是基于这样一套机制提出了 PEP 768 – Safe external debugger interface for CPython https://peps.python.org/pep-0768/。可以允许用户远程的为一个 Python 进程注入一段调试代码

我们来看一下这个 PEP 的核心实现

在前面介绍过的 ThreadState 中新增了一组结构

typedef struct _remote_debugger_support {
    int32_t debugger_pending_call;
    char debugger_script_path[MAX_SCRIPT_PATH_SIZE];
} _PyRemoteDebuggerSupport;

在执行过程中，如果 debugger_pending_call 为 1 的时候，我们就会去执行 debugger_script_path 中的脚本

int _PyRunRemoteDebugger(PyThreadState *tstate)
{
    const PyConfig *config = _PyInterpreterState_GetConfig(tstate->interp);
    if (config->remote_debug == 1
         && tstate->remote_debugger_support.debugger_pending_call == 1)
    {
        tstate->remote_debugger_support.debugger_pending_call = 0;

        // Immediately make a copy in case of a race with another debugger
        // process that's trying to write to the buffer. At least this way
        // we'll be internally consistent: what we audit is what we run.
        const size_t pathsz
            = sizeof(tstate->remote_debugger_support.debugger_script_path);

        char *path = PyMem_Malloc(pathsz);
        if (path) {
            // And don't assume the debugger correctly null terminated it.
            memcpy(
                path,
                tstate->remote_debugger_support.debugger_script_path,
                pathsz);
            path[pathsz - 1] = '\0';
            if (*path) {
                run_remote_debugger_script(path);
            }
            PyMem_Free(path);
        }
    }
    return 0;
}

那么问题来了，我们现在怎么样给目标 Python 进程注入对应的值呢？我们来看看 remote_debugging.c 中的实现

首先入口函数为 _PySysRemoteDebug_SendExec

int
_PySysRemoteDebug_SendExec(int pid, int tid, const char *debugger_script_path)
{
#if !defined(Py_SUPPORTS_REMOTE_DEBUG)
    PyErr_SetString(PyExc_RuntimeError, "Remote debugging is not supported on this platform");
    return -1;
#elif !defined(Py_REMOTE_DEBUG)
    PyErr_SetString(PyExc_RuntimeError, "Remote debugging support has not been compiled in");
    return -1;
#else

    PyThreadState *tstate = _PyThreadState_GET();
    const PyConfig *config = _PyInterpreterState_GetConfig(tstate->interp);
    if (config->remote_debug != 1) {
        PyErr_SetString(PyExc_RuntimeError, "Remote debugging is not enabled");
        return -1;
    }

    proc_handle_t handle;
    if (init_proc_handle(&handle, pid) < 0) {
        return -1;
    }

    int rc = send_exec_to_proc_handle(&handle, tid, debugger_script_path);
    cleanup_proc_handle(&handle);
    return rc;
#endif
}

前面都是一些例行的检查，我们来看看 send_exec_to_proc_handle 这个函数

static int
send_exec_to_proc_handle(proc_handle_t *handle, int tid, const char *debugger_script_path)
{
    uintptr_t runtime_start_address;
    struct _Py_DebugOffsets debug_offsets;

    if (read_offsets(handle, &runtime_start_address, &debug_offsets)) {
        return -1;
    }

    uintptr_t interpreter_state_list_head = (uintptr_t)debug_offsets.runtime_state.interpreters_head;

    uintptr_t interpreter_state_addr;
    if (0 != read_memory(
            handle,
            runtime_start_address + interpreter_state_list_head,
            sizeof(void*),
            &interpreter_state_addr))
    {
        return -1;
    }

    if (interpreter_state_addr == 0) {
        PyErr_SetString(PyExc_RuntimeError, "Can't find a running interpreter in the remote process");
        return -1;
    }

    int is_remote_debugging_enabled = 0;
    if (0 != read_memory(
            handle,
            interpreter_state_addr + debug_offsets.debugger_support.remote_debugging_enabled,
            sizeof(int),
            &is_remote_debugging_enabled))
    {
        return -1;
    }

    if (is_remote_debugging_enabled != 1) {
        PyErr_SetString(
            PyExc_RuntimeError,
            "Remote debugging is not enabled in the remote process");
        return -1;
    }

    uintptr_t thread_state_addr;
    unsigned long this_tid = 0;

    if (tid != 0) {
        if (0 != read_memory(
                handle,
                interpreter_state_addr + debug_offsets.interpreter_state.threads_head,
                sizeof(void*),
                &thread_state_addr))
        {
            return -1;
        }
        while (thread_state_addr != 0) {
            if (0 != read_memory(
                    handle,
                    thread_state_addr + debug_offsets.thread_state.native_thread_id,
                    sizeof(this_tid),
                    &this_tid))
            {
                return -1;
            }

            if (this_tid == (unsigned long)tid) {
                break;
            }

            if (0 != read_memory(
                    handle,
                    thread_state_addr + debug_offsets.thread_state.next,
                    sizeof(void*),
                    &thread_state_addr))
            {
                return -1;
            }
        }

        if (thread_state_addr == 0) {
            PyErr_SetString(
                PyExc_RuntimeError,
                "Can't find the specified thread in the remote process");
            return -1;
        }
    } else {
        if (0 != read_memory(
                handle,
                interpreter_state_addr + debug_offsets.interpreter_state.threads_main,
                sizeof(void*),
                &thread_state_addr))
        {
            return -1;
        }

        if (thread_state_addr == 0) {
            PyErr_SetString(
                PyExc_RuntimeError,
                "Can't find the main thread in the remote process");
            return -1;
        }
    }

    // Ensure our path is not too long
    if (debug_offsets.debugger_support.debugger_script_path_size <= strlen(debugger_script_path)) {
        PyErr_SetString(PyExc_ValueError, "Debugger script path is too long");
        return -1;
    }

    uintptr_t debugger_script_path_addr = (uintptr_t)(
        thread_state_addr +
        debug_offsets.debugger_support.remote_debugger_support +
        debug_offsets.debugger_support.debugger_script_path);
    if (0 != write_memory(
            handle,
            debugger_script_path_addr,
            strlen(debugger_script_path) + 1,
            debugger_script_path))
    {
        return -1;
    }

    int pending_call = 1;
    uintptr_t debugger_pending_call_addr = (uintptr_t)(
        thread_state_addr +
        debug_offsets.debugger_support.remote_debugger_support +
        debug_offsets.debugger_support.debugger_pending_call);
    if (0 != write_memory(
            handle,
            debugger_pending_call_addr,
            sizeof(int),
            &pending_call))

    {
        return -1;
    }

    uintptr_t eval_breaker;
    if (0 != read_memory(
            handle,
            thread_state_addr + debug_offsets.debugger_support.eval_breaker,
            sizeof(uintptr_t),
            &eval_breaker))
    {
        return -1;
    }

    eval_breaker |= _PY_EVAL_PLEASE_STOP_BIT;

    if (0 != write_memory(
            handle,
            thread_state_addr + (uintptr_t)debug_offsets.debugger_support.eval_breaker,
            sizeof(uintptr_t),
            &eval_breaker))

    {
        return -1;
    }

    return 0;
}

我们先不考虑具体的细节的话，这段函数的逻辑还是非常明确的，通过 read_offsets 获取目标的地址偏移，通过 read_memory 这个函数读取不同地址，然后做一些处理后，通过 write_memory 来写入到目标进程中去

而 read_offsets 这个函数就是我们前面核心提到过的怎么样使用目前 Python 给出的调试信息的例子，我们来看一下其在 Linux 下的实现

static int
read_offsets(
    proc_handle_t *handle,
    uintptr_t *runtime_start_address,
    _Py_DebugOffsets* debug_offsets
) {
    if (_Py_RemoteDebug_ReadDebugOffsets(handle, runtime_start_address, debug_offsets)) {
        return -1;
    }
    if (ensure_debug_offset_compatibility(debug_offsets)) {
        return -1;
    }
    return 0;
}

这里的核心函数是 _Py_RemoteDebug_ReadDebugOffsets，我们接着来看这个的实现

static int
_Py_RemoteDebug_ReadDebugOffsets(
    proc_handle_t *handle,
    uintptr_t *runtime_start_address,
    _Py_DebugOffsets* debug_offsets
) {
    *runtime_start_address = _Py_RemoteDebug_GetPyRuntimeAddress(handle);
    if (!*runtime_start_address) {
        if (!PyErr_Occurred()) {
            PyErr_SetString(
                PyExc_RuntimeError, "Failed to get PyRuntime address");
        }
        return -1;
    }
    size_t size = sizeof(struct _Py_DebugOffsets);
    if (0 != _Py_RemoteDebug_ReadRemoteMemory(handle, *runtime_start_address, size, debug_offsets)) {
        return -1;
    }
    return 0;
}

我们注意到，这里的核心还是我们先要获取到 PyRuntime 的地址，那么我们来看看 _Py_RemoteDebug_GetPyRuntimeAddress 的实现

static uintptr_t
_Py_RemoteDebug_GetPyRuntimeAddress(proc_handle_t* handle)
{
    uintptr_t address;
    address = search_linux_map_for_section(handle, "PyRuntime", "python");
    if (address == 0) {
        // Error out: 'python' substring covers both executable and DLL
        PyErr_SetString(PyExc_RuntimeError, "Failed to find the PyRuntime section in the process.");
    }
    return address;
}

static uintptr_t
search_linux_map_for_section(proc_handle_t *handle, const char* secname, const char* substr)
{
    char maps_file_path[64];
    sprintf(maps_file_path, "/proc/%d/maps", handle->pid);

    FILE* maps_file = fopen(maps_file_path, "r");
    if (maps_file == NULL) {
        PyErr_SetFromErrno(PyExc_OSError);
        return 0;
    }

    size_t linelen = 0;
    size_t linesz = PATH_MAX;
    char *line = PyMem_Malloc(linesz);
    if (!line) {
        fclose(maps_file);
        PyErr_NoMemory();
        return 0;
    }

    uintptr_t retval = 0;
    while (fgets(line + linelen, linesz - linelen, maps_file) != NULL) {
        linelen = strlen(line);
        if (line[linelen - 1] != '\n') {
            // Read a partial line: realloc and keep reading where we left off.
            // Note that even the last line will be terminated by a newline.
            linesz *= 2;
            char *biggerline = PyMem_Realloc(line, linesz);
            if (!biggerline) {
                PyMem_Free(line);
                fclose(maps_file);
                PyErr_NoMemory();
                return 0;
            }
            line = biggerline;
            continue;
        }

        // Read a full line: strip the newline
        line[linelen - 1] = '\0';
        // and prepare to read the next line into the start of the buffer.
        linelen = 0;

        unsigned long start = 0;
        unsigned long path_pos = 0;
        sscanf(line, "%lx-%*x %*s %*s %*s %*s %ln", &start, &path_pos);

        if (!path_pos) {
            // Line didn't match our format string.  This shouldn't be
            // possible, but let's be defensive and skip the line.
            continue;
        }

        const char *path = line + path_pos;
        const char *filename = strrchr(path, '/');
        if (filename) {
            filename++;  // Move past the '/'
        } else {
            filename = path;  // No directories, or an empty string
        }

        if (strstr(filename, substr)) {
            retval = search_elf_file_for_section(handle, secname, start, path);
            if (retval) {
                break;
            }
        }
    }

    PyMem_Free(line);
    fclose(maps_file);

    return retval;
}

我们这里能看到 _Py_RemoteDebug_GetPyRuntimeAddress 调用了 search_linux_map_for_section 来获取当前的 PyRuntime 的地址，而 search_linux_map_for_section 则是通过 /proc/${pid}/maps ，暴力遍历 maps 中的内存段来获取具体的地址。

我们来看看 search_elf_file_for_section 的实现

search_elf_file_for_section(
        proc_handle_t *handle,
        const char* secname,
        uintptr_t start_address,
        const char *elf_file)
{
    if (start_address == 0) {
        return 0;
    }

    uintptr_t result = 0;
    void* file_memory = NULL;

    int fd = open(elf_file, O_RDONLY);
    if (fd < 0) {
        PyErr_SetFromErrno(PyExc_OSError);
        goto exit;
    }

    struct stat file_stats;
    if (fstat(fd, &file_stats) != 0) {
        PyErr_SetFromErrno(PyExc_OSError);
        goto exit;
    }

    file_memory = mmap(NULL, file_stats.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (file_memory == MAP_FAILED) {
        PyErr_SetFromErrno(PyExc_OSError);
        goto exit;
    }

    Elf_Ehdr* elf_header = (Elf_Ehdr*)file_memory;

    Elf_Shdr* section_header_table = (Elf_Shdr*)(file_memory + elf_header->e_shoff);

    Elf_Shdr* shstrtab_section = &section_header_table[elf_header->e_shstrndx];
    char* shstrtab = (char*)(file_memory + shstrtab_section->sh_offset);

    Elf_Shdr* section = NULL;
    for (int i = 0; i < elf_header->e_shnum; i++) {
        char* this_sec_name = shstrtab + section_header_table[i].sh_name;
        // Move 1 character to account for the leading "."
        this_sec_name += 1;
        if (strcmp(secname, this_sec_name) == 0) {
            section = &section_header_table[i];
            break;
        }
    }

    Elf_Phdr* program_header_table = (Elf_Phdr*)(file_memory + elf_header->e_phoff);
    // Find the first PT_LOAD segment
    Elf_Phdr* first_load_segment = NULL;
    for (int i = 0; i < elf_header->e_phnum; i++) {
        if (program_header_table[i].p_type == PT_LOAD) {
            first_load_segment = &program_header_table[i];
            break;
        }
    }

    if (section != NULL && first_load_segment != NULL) {
        uintptr_t elf_load_addr = first_load_segment->p_vaddr
            - (first_load_segment->p_vaddr % first_load_segment->p_align);
        result = start_address + (uintptr_t)section->sh_addr - elf_load_addr;
    }

exit:
    if (file_memory != NULL) {
        munmap(file_memory, file_stats.st_size);
    }
    if (fd >= 0 && close(fd) != 0) {
        PyErr_SetFromErrno(PyExc_OSError);
    }
    return result;
}

这段代码稍微有点复杂，我们来拆分看一下

首先函数的声明

search_elf_file_for_section(
        proc_handle_t *handle,
        const char* secname,
        uintptr_t start_address,
        const char *elf_file)

用于在ELF文件中搜索特定的section。参数包括：进程句柄、要查找的section名称、起始地址（文件在进程空间的映射位置）、ELF文件路径。

int fd = open(elf_file, O_RDONLY);
if (fd < 0) {
    PyErr_SetFromErrno(PyExc_OSError);
    goto exit;
}

以只读方式打开ELF文件，如果失败则设置Python异常并跳转到退出处理。

file_memory = mmap(NULL, file_stats.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (file_memory == MAP_FAILED) {
    PyErr_SetFromErrno(PyExc_OSError);
    goto exit;
}

将文件内容映射到内存，以只读和私有方式，从文件头开始。失败则设置异常并退出。

1 2	Elf_Ehdr* elf_header = (Elf_Ehdr)file_memory; Elf_Shdr section_header_table = (Elf_Shdr*)(file_memory + elf_header->e_shoff);

将文件开头 cast 为ELF文件头结构，并找到section header表的位置，它在文件偏移e_shoff处。

Elf_Shdr* shstrtab_section = &section_header_table[elf_header->e_shstrndx];
char* shstrtab = (char*)(file_memory + shstrtab_section->sh_offset);
Elf_Shdr* section = NULL;
for (int i = 0; i < elf_header->e_shnum; i++) {
    char* this_sec_name = shstrtab + section_header_table[i].sh_name;
    // Move 1 character to account for the leading "."
    this_sec_name += 1;
    if (strcmp(secname, this_sec_name) == 0) {
        section = &section_header_table[i];
        break;
    }
}

获取section字符串表（包含所有section名称的表），通过e_shstrndx索引定位。同时遍历所有section，查找匹配的section名称。注意需要跳过section名字的”.”前缀。

Elf_Phdr* program_header_table = (Elf_Phdr*)(file_memory + elf_header->e_phoff);
// Find the first PT_LOAD segment
Elf_Phdr* first_load_segment = NULL;
for (int i = 0; i < elf_header->e_phnum; i++) {
    if (program_header_table[i].p_type == PT_LOAD) {
        first_load_segment = &program_header_table[i];
        break;
    }
}

找到program header表，然后搜索第一个PT_LOAD类型的segment，它定义了程序加载时的基地址。

if (section != NULL && first_load_segment != NULL) {
    uintptr_t elf_load_addr = first_load_segment->p_vaddr
        - (first_load_segment->p_vaddr % first_load_segment->p_align);
    result = start_address + (uintptr_t)section->sh_addr - elf_load_addr;
}

如果找到了目标section和第一个LOAD segment，计算目标section的运行时地址：

计算ELF文件的加载基地址（考虑对齐）
目标地址 = 进程中映射的起始地址 + section的虚拟地址 - ELF加载基地址

经过这样一个流程，我们就能最终的获取到 _PyRuntime 中的地址，然后基于此做一些包括 PEP 768 在内很有趣的工作。

总结

Python 3.14 官方其实将进程信息以半正式化的形式形成了一组相对稳定的 ABI，这样可以使我们调试工具能以更好的方式对 Python 进程进行无侵入的调试与观测。PEP 768 其实是这个过程中一个的有效产物。而基于 PEP768 处理的比如 Remote PDB debug，目前也已合入分支。

可以说从 Python 3.14 起，Python 的调试工具和手段将得到极大的丰富与增强。建议大家在出来后的第一时间进行升级（

差不多就这样（

简单聊聊常见的负载均衡算法

2025-03-23 20:00:00

这篇文章鸽了很久，最终决定还是老老实实写完，来介绍一下常见的一些负载均衡算法实现。本文的代码最终都会放在 load-balancer-algorithm¹ 这个 repo 中

~~我从来没有觉得写博客快乐过~~

正文

先行准备

既然是讲 LoadBalancer 中常用的一些负载均衡算法，我们先来对一些前置准备做一些讨论

我们目前需要两个基础的数据结构

代表着 Backend 节点的结构
代表着请求上下文的结构

那么我们可以得出下面一些基础代码

import dataclasses


@dataclasses.dataclass
class Node:
    host: str = ""
    port: int = 0
    node_available: bool = True

    @property
    def available(self) -> bool:
        return self.node_available

import dataclasses


@dataclasses.dataclass
class RequestContext:
    pass

同时我们在没有后端节点可供选择的时候，我们需要抛出一个异常

1 2	class NoNodesAvailableError(ValueError): pass

好了，我们现在可以进行更进一步的抽象，我们可以将我们的负载均衡算法抽象为策略(Strategy), 那么我们可以得出如下的一些代码

from __future__ import annotations

import typing
from abc import ABC, abstractmethod

if typing.TYPE_CHECKING:
    from load_balancer_algorithm.context import RequestContext
    from load_balancer_algorithm.node import Node


class Strategy(ABC):
    nodes: list[Node] = []

    def __init__(self, nodes:list[Node]) -> None:
        self.nodes = nodes

    @abstractmethod
    def get_node(self, ctx: RequestContext) -> Node:
        pass

    def add_node(self, node: Node) -> None:
        self.nodes.append(node)

    def remove_node(self, node: Node) -> None:
        self.nodes= list(filter(lambda n: n != node, self.nodes))

好了，我们现在可以往下去实现一些负载均衡算法了

随机选择

负载均衡最简单的一个算法是做一个随机的选择，实现非常简单，最简单的伪代码实现差不多这样

1 2	a = [] random.choice(a)

我们来完整实现一下

class RandomStrategy(Strategy):
    def get_node(self, ctx: RequestContext) -> Node:
        nodes = list(filter(lambda node: node.available, self.nodes))
        if not nodes:
            raise NoNodesAvailableError

        return random.choice(nodes)

OK，现在我们增加一个需求，现在我们每个节点都需要有一个权重值，权重值越高的节点被选中的概率越高。我们可以使用 random.choices 来实现这个需求，不过在此之前我们需要对 Node 进行一些修改

import dataclasses


@dataclasses.dataclass
class Node:
    host: str = ""
    port: int = 0
    node_available: bool = True
    weight: int = 0

    @property
    def available(self) -> bool:
        return self.node_available

然后我们来实现一下 WeightedRandomStrategy


class WeightedRandomStrategy(Strategy):
    def get_node(self, ctx: RequestContext) -> Node:
        nodes = list(filter(lambda node: node.available, self.nodes))
        if not nodes:
            raise NoNodesAvailableError

        weights = [node.weight for node in nodes]
        return random.choices(nodes, weights=weights)[0]

Random 确实是我们非常常用的一套负载均衡算法，但是缺点也很明显，其负载均衡的效果有一定的不可预测性，是神是鬼全靠你使用的 Random 函数的质量。运气不好就会出现分布非常密集的情况。那么我们有没有可用的更好的负载均衡算法呢？

轮询算法

我们对于负载均衡算法常见的需求是在逻辑上有一定的可预测性，从这角度上讲，轮询算法是一个非常好的选择。我们可以使用一个 index 来记录当前的节点，然后每次请求的时候都将 index + 1，直到 index 超过节点的数量，然后 index = 0

class RoundRobinStrategy(Strategy):
    def __init__(self, nodes: list[Node]) -> None:
        super().__init__(nodes)
        self.index = 0

    def get_node(self, ctx: RequestContext) -> Node:
        nodes = list(filter(lambda node: node.available, self.nodes))
        if not nodes:
            raise NoNodesAvailableError

        node = nodes[self.index]
        self.index += 1
        if self.index >= len(nodes):
            self.index = 0

        return node

这里我们实现了一个最基础的轮询算法（我们假设不存在节点不可用，节点增删改的情况），所以我们 index 一直可以有规律的变化

这里的结果很明显，如果有一个 [A, B] 的节点列表，那么我们会得到一个 [A, B, A, B, A, B] 的结果

那么现在我们更改一下需求，我们需要实现一个类似 WeightedRandomStrategy 的轮询算法，权重越高的节点被选中的概率越高。

class WeightedRoundRobinStrategy(Strategy):
    def __init__(self, nodes: list[Node]) -> None:
        super().__init__(nodes)
        self.index = 0
    
    def get_node(self, ctx: RequestContext) -> Node:
        nodes = list(filter(lambda node: node.available, self.nodes))
    
        if not nodes:
            raise NoNodesAvailableError
        nodes=[node for node in nodes for _ in range(node.weight)]
        node = nodes[self.index]
        self.index += 1
        if self.index >= len(nodes):
            self.index = 0
        return node

这里的核心算法很简单，我们基于每个节点的权重，得到一个扩展后的节点列表，然后我们就可以使用最基础的轮询算法来实现了

但是这里核心的一个弊端很明显，假设我们有 [A(weight=2),B(weight=1)] 这样一个节点组合，我们会得到 [A, A, B] 这样一个选择结果，这里的节点分布会非常不均匀。那么怎么办呢？我们可以参考一种来自 Nginx 的平滑算法²

我们首先给节点加上一个 current_weight 的熟悉，记录当前节点的权重值

import dataclasses


@dataclasses.dataclass
class Node:
    host: str = ""
    port: int = 0
    node_available: bool = True
    weight = 0
    current_weight: int = 0

    @property
    def available(self) -> bool:
        return self.node_available

    def __post_init__(self):
        self.current_weight = self.weight

然后我们来实现一下 WeightedRoundRobinStrategy

class WeightedRoundRobinStrategy(RoundRobinStrategy):
    def get_node(self, ctx: RequestContext) -> Node:
        nodes = list(filter(lambda node: node.available, self.nodes))
        if not nodes:
            raise NoNodesAvailableError
        best_node = None
        total = 0
        for node in nodes:
            total += node.weight
            node.current_weight = node.weight
            if not best_node or node.current_weight > best_node.current_weight:
                best_node = node
        if not best_node:
            raise NoNodesAvailableError
        best_node.current_weight -= total
        return best_node

这里新增的 current_weight 的作用很简单，

每次选取节点时，遍历可用节点，遍历时把当前节点的 current_weight 的值加上它的 weight
同时累加所有节点的 weight 值为 total 。
如果当前节点的 current_weight 值最大，那么这个节点就是被选中的节点，同时把它的 current_weight 减去 total
没有被选中的节点的 current_weight 不用减少。

这本质上其实很巧妙的将节点打散，同时将 index 的属性利用 current_weight 来处理，经过处理，我们假设有 [A(weight=3),B(weight=2),C(weight=1)] 这样一个节点组合，我们会得到 [A, B, A, C, B, A] 这样一个选择结果，这里的节点分布会相对均匀很多

OK，现在我们轮询函数实现完成了，我们能发现，Random 和轮询算法本质上是两种无状态的算法（最原始的 RoundRobin 有状态，但是我们通过 current_weight 的方式将其变成了无状态），但是我们通常在业务上会有一些根据状态来选择节点的需求，常见的场景有

我们需要请求去往目前负载最低的节点
某一类请求我们需要去往同一个节点

因此下面我们会来介绍两种算法

最小链接/加权最小链接
一致性 Hash 算法

最小链接算法

最小链接算法是一个非常简单的算法，我们需要在每次请求的时候，遍历所有的节点，找到当前连接数最少的节点，然后将请求转发到这个节点上。我们可以使用一个连接数的属性来记录当前节点的连接数

class LeastConnectionStrategy(Strategy):
    def get_node(self, ctx: RequestContext) -> Node:
        best = None
        for node in self.nodes:
            if not node.available:
                continue
            if not best or node.connections < best.connections:
                best = node
        if not best:
            raise NoNodesAvailableError
        best.connections += 1
        return best

OK，那么我们接下来老规矩需要考虑加权的 LeastConnection 算法，这里稍晚有一点绕

假设用 C 表示连接数、W 表示权重、S 表示被选中的节点、Sn 表示未被选中的节点
那么 S 必须满足 C(S) / W(S) < C(Sn) / W(Sn) ，这个条件也可以表示为 C(S) x W(Sn) < C(Sn) x W(S)

那么我们来实现一下

class WeightedLeastConnectionStrategy(LeastConnectionStrategy):
    def get_node(self, ctx: RequestContext) -> Node:
        best = None
        for node in self.nodes:
            if not node.available:
                continue
            if not best or (node.connections / node.weight) < (best.connections / best.weight):
                best = node
        if not best:
            raise NoNodesAvailableError
        best.connections += 1
        return best

当然我们这里实际上有一点问题是，这里的选择可能会连续选择到同一个节点上（因为权重的不均匀），这里可以考虑把符合条件的节点放到一个列表中，然后使用我们前面提到过的 RoundRobin/Random 来选择一个节点来进行请求转发

这里我就不实现了，大家可以自己实现一下

一致性 Hash 算法

我们在业务中经常有这样一种需求，我们需要将同一类请求转发到同一个节点上，这个时候我们就需要使用一致性 Hash 算法来实现了

最基础的一致性 Hash 算法是将请求的 key 和节点的 key 进行 hash 计算，然后将请求转发到 hash 值最接近的节点上。我们可以使用一个 ring 来表示所有的节点，然后在 ring 上找到离请求最近的节点。

但是这样存在比较大的问题是，如果有节点的增删改，这个时候我们已经分配好的逻辑会存在 rebalance 的问题。所以我们需要将这个变动变得最小。

目前主流的几种一致性 Hash 算法的核心思路都是通过虚拟节点来解决这个问题。我们可以将每个节点映射到多个虚拟节点上，然后在 ring 上找到离请求最近的虚拟节点，然后将请求转发到对应的真实节点上。

这样我们就可以将节点的增删改对请求的影响降到最低。

我们将以 Google 的 Maglev 算法为基础来实现一致性 Hash 算法

首先我们更改一下 Node 的代码

import dataclasses


@dataclasses.dataclass
class Node:
    host: str = ""
    port: int = 0
    node_available: bool = True
    weight: int = 0
    current_weight: int = 0
    connections: int = 0

    @property
    def available(self) -> bool:
        return self.node_available

    def __str__(self) -> str:
        return f"{self.host}:{self.port}"

这里我们可以用 str(node) 来获取 nodekey

然后我们来介绍一下 Maglev 算法的核心思路（这里只介绍最简化版本的细节，详情可以参考 Maglev: A Fast and Reliable Software Network Load Balancer³）这篇论文

首先，我们要确定经过预处理后的产物 lookup table 的长度 M。所有 Key 都会被 hash 到这个 lookup table 中去，而 lookup table 中的每个元素都会被映射到一个 Node 上

而计算 lookup table 的计算分为两步

计算每一个 node 对于每一个 lookup table 项的一个取值（也就是原文中提到的 permutation）；
根据这个值，去计算每一个 lookup table 项所映射到的 node（放在 entry 中，此处 entry 用原文的话来讲就是叫做 the final lookup table）。

permutation 是一个 NM 的矩阵，列对应 *lookup table，行对应 node。为了计算 permutation，需要挑选两个 hash 算法，分别计算两个值 offset 与 skip 。最后根据 offset 和 skip 的值来填充 permutation，计算方式描述如下：

offset = hash1(name[i]) mod M
skip = hash2(name[i]) mod (M − 1)+ 1
permutation[i][j] = (offset+ j × skip) mod M

其中 hash1 和 hash2 是两个不同的 hash 函数，我们后续会使用 xxhash 和 mmh3 这两种 hash 函数来实现

然后我们可以给出 lookup table 的计算方式

def calculate_lookup_table(n: int, m: int, permutation: list[list[int]]) -> list[int]:
    # result 是最终记录分布的 Hash 表
    result: list[int] = [-1] * m
    # next 是用来解决冲突的，在遍历过程中突然想要填入的 entry 表已经被占用，
    # 则通过 next 找到下一行。一直进行该过程直到找到一个空位。
    # 因为每一列都包含有 0~M-1 的每一个值，所以最终肯定能遍历完每一行。
    # 计算复杂度为 O(M logM) ~ O(M^2)
    next: list[int] = [0] * n
    flag = 0
    while True:
        for i in range(n):
            x = permutation[i][next[i]]
            while True:
                # 找到空位，退出查找
                if result[x] == -1:
                    break
                next[i] += 1
                x = permutation[i][next[i]]
            result[x] = i
            next[i] += 1
            flag += 1
            # 表已经填满，退出计算
            if flag == m:
                return result

在这里我们能看到，这段循环代码必然结束，而最坏情况下，复杂度会非常高，最坏的情况可能会到 O(M^2)。原文中建议找一个远大于 N 的 M （To avoid this happening we always choose M such that M ≫ N.）可以使平均复杂度维持在 O(MlogM)

我们可以用论文中的图来评估下如果节点存在移除的情况，整体的 rebalance 的效果

我们现在来完整实现一下 Maglev 算法，我们先确定用请求中的 url 来作为 hash key，所以我们需要对 RequestContext 进行一些修改

import dataclasses


@dataclasses.dataclass
class RequestContext:
    url: str = ""

好了，来把剩下的部分实现了

M = 65537


class MaglevStrategy(Strategy):
    @staticmethod
    def calculate_lookup_table(n: int, m: int, permutations: list[list[int]]) -> list[int]:
        # result 是最终记录分布的 Hash 表
        result: list[int] = [-1] * m
        # next 是用来解决冲突的，在遍历过程中突然想要填入的 entry 表已经被占用，
        # 则通过 next 找到下一行。一直进行该过程直到找到一个空位。
        # 因为每一列都包含有 0~M-1 的每一个值，所以最终肯定能遍历完每一行。
        # 计算复杂度为 O(M logM) ~ O(M^2)
        next: list[int] = [0] * n
        flag = 0
        while True:
            for i in range(n):
                x = permutations[i][next[i]]
                while True:
                    # 找到空位，退出查找
                    if result[x] == -1:
                        break
                    next[i] += 1
                    x = permutations[i][next[i]]
                result[x] = i
                next[i] += 1
                flag += 1
                # 表已经填满，退出计算
                if flag == m:
                    return result

    def __init__(self, nodes: list[Node]) -> None:
        super().__init__(nodes)
        permutations = []
        for i in range(len(nodes)):
            permutation = [0] * M
            offset = mmh3.hash(str(nodes[i])) % M
            skip = (xxhash.xxh32(str(nodes[i])).intdigest() % (M - 1)) + 1
            for j in range(M):
                permutation[j] = (offset + j * skip) % M
            permutations.append(permutation)
        self.tables = self.calculate_lookup_table(len(nodes), M, permutations)

    def get_node(self, ctx: RequestContext) -> Node:
        hash_value = mmh3.hash(str(ctx))
        index = hash_value % M
        node_index = self.tables[index]
        return self.nodes[node_index]

如果大家对 Google 整个 Maglev 系统感兴趣，可以去参考一篇我之前写博客，简单聊聊 Maglev ，来自 Google 的软负载均衡实践⁴

总结

好了，这次负载均衡算法告一段落，其实工作中还有一些更组合的场景，比如 sharding 轮询之类的，不过整体思路都不会发生太大变化。希望大家看的开心

Reference

https://github.com/Zheaoli/load-balancer-algorithm

https://github.com/nginx/nginx/commit/52327e0627f49dbda1e8db695e63a4b0af4448b1

https://research.google/pubs/maglev-a-fast-and-reliable-software-network-load-balancer/

https://www.manjusaka.blog/posts/2020/05/22/a-simple-introduction-about-maglev

简单吐槽一下摇曳露营的台配

2025-02-04 02:00:00

看了一下台配摇曳露营的 PV 放松，有些地方很想吐槽，写个短文聊一下

正文

先放出两版本对比

下面是原版

下面是台配

现在我们来聊一聊我觉得这个配音出现了什么样的问题

首先这是出自摇曳露营 S1E1 的最开始的部分。实际上是一个倒序的模式，将五人组的富士山露营在最开始进行展现，然后在季末进行收尾。

这个做法的效果和意图都很明确，其核心在于

让声优用声音将角色本身的性格立住

在这里面我认为问题最大的两个人，

大垣千明
齐藤惠那

其实犬山葵的问题也挺大的，但是出于方言角色确实不太好把握，这里就不多说了。

这两个角色的其实问题都很一致，配音者对于角色的性格把握不准。大垣千明是一个很干脆利落的角色，在四人组中是一个类似数码宝贝中太一一样 Leader 的角色，喜欢开玩笑，有一点假小子的感觉。而齐藤惠那性格和大垣千明有一些类似，不过齐藤会有更多一些少女味，所以她也成为四人组与凛之间的融合剂。

在原配中，大垣千明全程以很干脆利落的声线立住了人设。而齐藤惠那主要的两句台词“犬山同学，这样可以吗？”和“欸？真的吗？”声优很巧妙的换了不同的发声方式，让角色瞬间立体了起来。

在台配中，两者的声质都显得非常黏，或者用更不客气的观感来说，五个人的声质完全很难立住人设，属于是教科书里应该出现的声音，而不是动画里应该出现的声音。

而且台配还有一个比较明显的问题，声优对于情绪的把握出现了问题，比如还是经典的齐藤惠那的“欸？真的吗？”（背景是犬山提醒烤棉花糖不要离火太近，否则会烤焦）这句台词，原配中是一个带着学到新东西的惊讶的语气，而在台配中却在声音中显出了一些焦急。我觉得这是合格的声优不应该出现的意料外的问题。

另外一点其实被很多人忽略了，日语和中文的发音节奏和习惯是不一样的。在引入过程中，台词可能需要做一些适当的调整。比如在原配中，大垣千明的“欸！来芝麻凛的可可”(ほいしまりんココア一丁) ，这里 “一丁” 是日语中一个很口语的用法，声优选择在这里加了一个重音，来体现一个服务生的感觉，从而表现出大垣千明的古灵精怪。而在台配中，直接处理为“来，志摩凛的可可”，这里就没有很好的本土化。如果是我的话，我可能会选择更符合中文语境的 “来，志摩凛，你可可来咯！”这种口语化表达

这点在曾经上海电影译制厂译制的各种作品中体现的非常不错。我举个例子，在爱迪奥特曼第44话，激ファイト！ 80vsウルトラセブン/激斗，爱迪对战奥特赛文。中，不良暴走族在被假冒赛文狂追的时候，说“まだついてきやがるチクショウ”，直译为“该死，他还在跟着我。”，上译的老前辈们处理为“赛文还在追我，TMD”。而且用了非常痞子的声线，我觉得这就是展现了一个非常好的本土化的正面例子

放一个片段大家感受下

总结

差不多就吐槽这么多吧，翻译是一个累活苦活，希望大家也能多包容。希望不同地方的译者也能给我们带来不同的文化碰撞带来的惊喜。

差不多这样，祝大家新年快乐

Manjusaka修改