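JasonkayZK | 张小凯's RSS Preview

A Concise Tutorial on POML, a Next-Generation Prompt Engineering Language

2025-08-20 20:39:59

Traditional prompt engineering usually means writing free-form text, and as an application grows, that text becomes more and more complex.

This leads to a series of problems: prompts are hard to maintain, hard to version-control, hard to reuse across scenarios, and almost impossible to test systematically.

How can these problems be solved?

Microsoft offers an engineering answer: POML (Prompt Orchestration Markup Language).

This article works best alongside the Colab notebook: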

https://colab.research.google.com/drive/1RrZyqB16XMvsFBjir90m-NCXE35kWFdy?usp=sharing


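I. Introduction

POML introduces a structured approach in which prompt content is written in an HTML-like format.

Instead of writing plain-text prompts, users organize the intent of a prompt with semantic components such as <role>, <task>, and <example>, which can improve LLM performance and make prompts easier to maintain.

POML also has a CSS-like styling system that separates content from presentation.

(1) Core Architecture

POML runs on a three-layer architecture that separates concerns and supports flexible prompt development.

The architecture processes a POML file in several stages:

Parse: convert the XML-like syntax into a structured intermediate representation;
Process: apply stylesheets, resolve templates, and integrate external data;
Generate: produce the final prompt in various output formats;

This separation lets developers change presentation styles without touching the core logic, integrate external data sources seamlessly, and keep prompt structure consistent across a project.

(2) Key Features

1. Structured Markup System

POML uses HTML-like syntax and semantic components to make prompts more readable and maintainable.

The main components include:

<role>: defines the role or identity the LLM should adopt;
<task>: specifies the task the LLM needs to complete;
<example>: provides few-shot learning examples;
<output-format>: controls the expected response format;
<hint>: provides additional context or constraints;

For example: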

<poml>
  <role>You are a patient teacher explaining concepts to a 10-year-old.</role>
  <task>Explain the concept of photosynthesis using the provided image as a reference.</task>
  <img src="photosynthesis.png" alt="Diagram of photosynthesis" />
  <output-format>
    Keep the explanation simple, engaging, and under 100 words.
    Start with "Hey there, future scientist!".
  </output-format>
</poml>

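2. External Data Integration

POML integrates external data through dedicated components:

<document>: embeds text files, PDFs, or Word documents;
<table>: integrates structured data from spreadsheets or CSV files;
<img>: includes images with alt text, for vision-capable models;
<audio>: handles audio files for multimodal applications;

For example: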

<hint captionStyle="header" caption="Background Knowledge">
  <Document src="assets/tom_and_jerry.docx"/>
</hint>
<example>
  <input>
    <img src="assets/tom_cat.jpg" alt="The image contains the Tom cat character." syntax="multimedia" />
  </input>
  <output>
    <Document src="assets/tom_introduction.txt"/>
  </output>
</example>

3. Decoupled Presentation Styles

POML has a CSS-like styling system that separates content from presentation.

For example:

<stylesheet>
  role {
    verbosity: concise;
    format: markdown;
  }
  task {
    emphasis: strong;
  }
</stylesheet>

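This lets developers adjust styling aspects such as verbosity, output format, and emphasis without changing the core prompt logic, which significantly reduces the risk of format drift when tuning prompts.

4. Template Engine

POML includes a template engine for generating prompts dynamically:

Variables: {{ variable_name }}
Loops: <for each="item in items">...</for>
Conditionals: <if condition="variable > 0">...</if>
Definitions: <let name="variable" value="expression" />

This makes it possible to build data-driven prompts that adapt to different contexts and inputs.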
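To make the variable syntax concrete, here is a toy sketch of double-brace substitution in plain Python. It is not the POML engine and does not use the POML SDK; the function name `fill_placeholders` is hypothetical, and the choice to render a missing variable as an empty string is an assumption, made because the rendered example later in this article shows "(named )" when no context is passed.

```python
import re

# Toy illustration only: this is NOT how POML is implemented.
# Matches {{ name }} with optional spaces around the variable name.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_placeholders(template: str, context: dict) -> str:
    """Replace each {{ name }} with context[name]; missing names become ''."""
    return _PLACEHOLDER.sub(lambda m: str(context.get(m.group(1), "")), template)


role = "You are a patient teacher(named {{teacher_name}}) explaining concepts."
print(fill_placeholders(role, {"teacher_name": "Jasonkay"}))
print(fill_placeholders(role, {}))  # the placeholder collapses to an empty string
```

Keeping the substitution separate from the prompt text is what lets the same template be reused with different inputs.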

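(3) Development Ecosystem

POML ships with a development toolkit to improve productivity.

1. VSCode Extension

The Visual Studio Code extension provides:

Syntax highlighting and language support
Context-aware auto-completion
Live preview
Integrated testing against LLM providers
Error diagnostics and validation
A prompt gallery of reusable components

2. Multi-Language SDKs

POML provides SDKs for Python and TypeScript/JavaScript: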

Python SDK:

from poml import poml

# Load and render a POML file, injecting template variables through `context`
# (the same poml() call that the walkthrough later in this article uses)
result = poml("example.poml", context={"topic": "photosynthesis"})

TypeScript SDK:

import { loadPoml, renderPoml } from 'pomljs';

// Load and render a POML file
const prompt = await loadPoml('example.poml');
const result = await renderPoml(prompt, { topic: 'photosynthesis' });

II. Basic Usage

(1) Installation

Node.js (via npm):

npm install pomljs

Python (via pip):

pip install poml

(2) A First Example

1. Write a POML file

Create a file named example.poml with the following content:

example.poml

<poml>
  <role>You are a patient teacher(named {{teacher_name}}) explaining concepts to a 10-year-old.</role>
  <task>Explain the concept of photosynthesis using the provided image as a reference.</task>
  <input>
    <img src="photosynthesis.jpg" alt="Diagram of photosynthesis" syntax="multimedia"/>
  </input>
  <output-format>
    Keep the explanation simple, engaging, and under 100 words.
    Start with "Hey there, future scientist!".
  </output-format>
</poml>

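This example defines:

The LLM's role and task, an image as context, and the required output format.
It also contains a template variable, teacher_name.

Once the file is written, you can preview it if you have the POML extension for Visual Studio Code installed.

2. Parse and render the POML

With the POML toolkit, this prompt can be rendered into several formats and tested against an LLM.

For example, in Python: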

from poml import poml

# Process a POML file
# result = poml("example.poml")

# Process with context variables
result = poml("example.poml", context={"teacher_name": "Jasonkay"})
print(f"Process with context variables: {result}")

# Get OpenAI chat format (available in newer versions)
# messages = poml("example.poml", format="openai_chat")
# print(f"Get OpenAI chat format: {messages}")

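The poml function accepts the following parameters:

markup: the POML content (a string or a file path)
context: optional data injected into the template
stylesheet: optional style customization
format: the output format ("dict", "openai_chat", "langchain", "pydantic", or "raw")

Running the code prints: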

Process with context variables: [{'speaker': 'system', 'content': '# Role\n\nYou are a patient teacher(named Jasonkay) explaining concepts to a 10-year-old.\n\n# Task\n\nExplain the concept of photosynthesis using the provided image as a reference.'}, {'speaker': 'human', 'content': [{'type': 'image/webp', 'base64': 'UklGRg......', 'alt': 'Diagram of photosynthesis'}, '# Output Format\n\nKeep the explanation simple, engaging, and under 100 words. Start with "Hey there, future scientist!".']}]
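The rendered result is a list of messages, each with a `speaker` and a `content` that is either a string or a list mixing image dictionaries and text. As a minimal sketch of how such a structure could be post-processed, the snippet below maps it to role/content pairs. The helper `to_role_messages` is hypothetical and not part of the POML SDK, and the speaker-to-role mapping (including the "ai" entry) is an assumption about what a downstream chat API expects.

```python
# Hypothetical helper, not part of the POML SDK.
# Assumption: the target chat API wants {"role": ..., "content": ...} entries
# and calls the "human" speaker "user".
SPEAKER_TO_ROLE = {"system": "system", "human": "user", "ai": "assistant"}


def to_role_messages(rendered):
    """Convert speaker/content messages to role/content, keeping only text parts."""
    messages = []
    for item in rendered:
        content = item["content"]
        if isinstance(content, list):
            # Image parts are dicts; this sketch drops them and joins the text parts.
            content = "\n\n".join(part for part in content if isinstance(part, str))
        role = SPEAKER_TO_ROLE.get(item["speaker"], item["speaker"])
        messages.append({"role": role, "content": content})
    return messages


# Shortened copy of the structure printed above.
rendered = [
    {"speaker": "system", "content": "# Role\n\nYou are a patient teacher(named Jasonkay)."},
    {"speaker": "human", "content": [
        {"type": "image/webp", "base64": "UklGRg......", "alt": "Diagram of photosynthesis"},
        "# Output Format\n\nKeep the explanation simple, engaging, and under 100 words.",
    ]},
]

for message in to_role_messages(rendered):
    print(message["role"], "->", message["content"][:30])
```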

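As you can see, the output is the rendered prompt, with the template variable filled in.

3. Integrating with an LLM (Gemini)

Finally, let's connect the prompt to an external LLM.

The latest POML SDK at the time of writing does not yet support rendering an openai_chat prompt through the format parameter.

So the Gemini API is used here to send the image instead.

Render the following POML file:

example.poml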

<poml>
  <role>You are a patient teacher(named {{teacher_name}}) explaining concepts to a 10-year-old.</role>
  <task>Explain the concept of photosynthesis using the provided image as a reference.</task>
  <output-format>
    Keep the explanation simple, engaging, and under 100 words.
    Start with "Hey there, future scientist!".
  </output-format>
</poml>

First install the Gemini SDK:

pip install -U google-genai

To run the code below, you need to create a Gemini API key:

https://aistudio.google.com/app/apikey

Then replace YOUR_API_KEY below with the key you generated.

from google import genai
from poml import poml
from google.genai import types

GEMINI_API_KEY = "YOUR_API_KEY"

client = genai.Client(api_key=GEMINI_API_KEY)

# Read the picture
with open('photosynthesis.jpg', 'rb') as f:
    image_bytes = f.read()

# Render the POML file
result = poml("example.poml", context={"teacher_name": "Jasonkay"}, chat=False)
# print(f"Process with context variables: {result}")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        result,
        types.Part.from_bytes(
            data=image_bytes,
            mime_type='image/jpeg',
        ),
    ],
)
print(response.text)

Finally, running it prints:

Look at our amazing plant friend! Just like you need food, plants need to eat too! This image shows how they do it, a process called **photosynthesis**. Plants use sunlight from the sun, and "drink" water through their roots. They also breathe in a gas called CO2 (carbon dioxide) from the air, shown by the blue arrow going in. Using these, they make their own sugary food to grow! As a super cool bonus, they release O2 (oxygen) for us to breathe, shown by the blue arrow going out. Amazing, right?

(3) Using Styles

Now let's add styles to the example above to refine the prompt.

example-2.poml

<poml>
  <role>You are a patient teacher(named {{teacher_name}}) explaining concepts to a 10-year-old.</role>
  <task>Explain the concept of photosynthesis using the provided image as a reference.</task>
  <output-format>
    <list listStyle="dash">
      <item className="explanation">Keep the explanation simple, engaging, and under 100 words.</item>
      <item className="greeting">
        Start with "Hey there, future scientist!".
      </item>
    </list>
  </output-format>
</poml>
<stylesheet>
  {
    ".explanation": {
      "syntax": "json"
    },
    "list" : {
      "whiteSpace": "trim"
    }
  }
</stylesheet>

The rendered result is:

# Role

You are a patient teacher(named ) explaining concepts to a 10-year-old.

# Task

Explain the concept of photosynthesis using the provided image as a reference.

# Output Format

```json
"Keep the explanation simple, engaging, and under 100 words."
```

- Start with "Hey there, future scientist!".

For more details, see the official documentation:

https://microsoft.github.io/poml/latest/language/standalone/#stylesheet

III. Further Study

After finishing the basics, you can go deeper with the official documentation linked above.


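A New Blog Dedicated to Learning Japanese

2025-07-30 13:22:44

I used to post Japanese-learning notes here, but they seemed better collected on a separate site.

With some free time recently, I started a new project.

The blog theme is hexo-theme-anzhiyu, which looks really cool!

It took about half an hour to build the blog from scratch with Hexo.

I never wrote the process down properly before, so here is a short record of the setup. If you don't have a blog yet, you can follow along step by step and build your own GitHub Pages blog.

New blog address: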

https://jasonkayzk.github.io/jp/


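I. Install Node.js & Hexo

Node can be downloaded from the official site:

https://nodejs.org/zh-cn/download

I use fnm as the version manager.

Configure a mirror registry in mainland China:

# npmmirror (formerly the Taobao mirror)
npm config set registry https://registry.npmmirror.com/

Reference:

Setting up the latest China mirror registries for npm, yarn, and pnpm (with the official registry and the latest Alibaba registry), plus an nrm tutorial [2025]

You can also install yarn:

npm install -g yarn
# npmmirror (formerly the Taobao mirror)
yarn config set registry https://registry.npmmirror.com/

Then install the hexo-cli command-line tool:

npm install hexo-cli -g

Reference: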

https://hexo.io/zh-cn/

II. Initialize the Hexo Project

Initialize the project jp directly from the command line:

# hexo init jp
INFO  Cloning hexo-starter https://github.com/hexojs/hexo-starter.git
INFO  Install dependencies
......
INFO  Start blogging with Hexo!

Install dependencies and test:

cd jp
npm i
hexo s # local preview

Then visit:

http://localhost:4000/

and you will see your blog.

Reference:

https://hexo.io/zh-cn/

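III. Change the Theme

The default theme is not very attractive. You can search GitHub for Hexo themes (their names usually start with hexo-).

Taking hexo-theme-anzhiyu as an example, its documentation is at:

https://docs.anheyu.com/

First, clone the theme into your project root:

git clone -b main https://github.com/anzhiyu-c/hexo-theme-anzhiyu.git themes/anzhiyu

Then open _config.yml in the Hexo root directory, find the following setting, and change the theme to anzhiyu:

# Extensions
## Plugins: https://hexo.io/plugins/
## Themes: https://hexo.io/themes/
theme: anzhiyu

Then install the pug and stylus renderers:

npm install hexo-renderer-pug hexo-renderer-stylus --save

Run again:

hexo s

Visit again:

http://localhost:4000/

and you will see your blog with the new theme.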

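IV. Custom Configuration

With a freshly cloned theme, much of the content probably won't match what you want (title, categories, and so on).

You can customize it by following the documentation:

https://docs.anheyu.com/page/front-matter.html

Note:

There is a _config.yml in the root directory and another one inside each theme. The root configuration takes precedence over the theme's own configuration.

V. Write a New Post

Running:

hexo new xxx

creates a new post.

In practice this just creates a new *.md file under source/_posts/.

You can add post front-matter to configure the post.

The body is written in ordinary Markdown.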

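VI. Publish to GitHub

Once local testing looks good, edit _config.yml in the root directory:

# Deployment
## Docs: https://hexo.io/docs/one-command-deployment
deploy:
  type: 'git'
  repo: git@github.com:jasonkayzk/jp.git  # replace with your repository address
  branch: main  # deployment branch

Install the plugin:

npm install hexo-deployer-git --save

Then, to publish, the usual steps are:

First, generate the static assets with hexo g;
Then deploy with hexo d.

Reference:

https://hexo.io/docs/one-command-deployment

However, this approach requires generating the static assets locally every time, which is inefficient.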

1. Automatic Deployment with GitHub Actions

With GitHub Actions, the site can be deployed automatically after every push.

Create the following file under .github/workflows/:

deploy.yml

name: Build & Deploy Blog

on:
  workflow_dispatch:
  push:
    branches:
      - dev

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node_version: [18]
    steps:
      - name: Checkout source
        uses: actions/checkout@v3
        with:
          ref: dev
      - name: Cache dependencies
        uses: actions/cache@v3
        with:
          path: |
            node_modules
            public
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-node-
      - name: Use Node.js ${{ matrix.node_version }}
        uses: actions/setup-node@v3
        with:
          # setup-node reads the version from the `node-version` input
          node-version: ${{ matrix.node_version }}
      - name: Add SSH key
        uses: webfactory/[email protected]
        with:
          ssh-private-key: ${{ secrets.HEXO_DEPLOY_PRI }}
      - name: Setup hexo
        run: |
          git config --global user.email "[email protected]"
          git config --global user.name "jasonkayzk"
          npm install hexo-cli -g
          npm install
      - name: Hexo deploy
        env:
          GIT_SSH_COMMAND: ssh -o StrictHostKeyChecking=no
        run: |
          hexo clean
          hexo g
          hexo d

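Then, in the repository's "Secrets and variables" settings, add a repository secret:

Name: HEXO_DEPLOY_PRI;
Value: the private key you use to connect to GitHub (note: the private key, not the public one).

After that, pushing to the dev branch triggers an automatic deployment.

2. Keep the Source on a Separate Branch

Note that the generated site can overwrite your source notes, because hexo d force-pushes to the deployment branch.

The fix is to write the blog on a different branch and deploy to the main branch.

For example:

git checkout -b dev
git push -u origin dev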


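I. Introduction to Parallel Programming and Getting Started with CUDA

2025-07-29 17:41:50

With the development of artificial intelligence, scientific computing (especially matrix and tensor computation) has become more and more important, and so has CUDA-based tensor programming.

In my previous note I translated "An Even Easier Introduction to CUDA", but I felt the original was not very well written.

So I wrote a new one here. It also serves as the opening post of a series on CUDA and parallel programming.

Source code: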

https://github.com/JasonkayZK/high-performance-computing-learn/blob/main/cuda/1_introduction_to_parallel_programming_and_cuda.ipynb


Tip: this article is best read while running the accompanying Colab notebook:

https://colab.research.google.com/github/JasonkayZK/high-performance-computing-learn/blob/main/cuda/1_introduction_to_parallel_programming_and_cuda.ipynb

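(1) CUDA Programming Overview

1. What is CUDA

CUDA is a parallel computing platform and programming model developed by NVIDIA.

It has the following characteristics:

C/C++ syntax;
SIMT (Single Instruction, Multiple Threads): a single instruction is executed by many threads at the same time;
Cooperation with the CPU: the CPU collects results and handles control logic;
Automatic scheduling: work is scheduled automatically according to the launch parameters you set;

2. CUDA Hardware Compute Units

(1) SM units

The GPU hardware is organized as follows.

Each GPU contains multiple SMs (Streaming Multiprocessors), and work is processed inside the SMs.

An SM contains:

CUDA Core/SP: performs parallel arithmetic such as addition and subtraction;
Tensor Core: tensor computation;
......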

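(2) CPU-GPU cooperation

The CPU and GPU work together as follows.

First, by convention:

The CPU side is called the Host, and its memory is RAM;
The GPU side is called the Device, and its memory is Global Memory (usually the GPU's RAM, i.e. video memory);

Global Memory is global in both scope and lifetime.

That is, every thread in the grid of thread blocks can access Global Memory, and its lifetime is as long as the program's execution.

More information: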

https://modal.com/gpu-glossary/device-software/global-memory

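A CUDA program runs through the following main steps:

CPU Prepare: on the host side (the CPU and main memory), the CPU initializes data and sets up computation parameters, deciding which data will be processed and with what logic;
CPU Transfers Data to GPU: over the bus, the CPU copies the prepared data from main memory (RAM) to the GPU's Global Memory (GM), because the GPU can only compute on data in memory it can access;
Read Data from GM: the GPU (for example an NVIDIA A100) reads the input data from its own global memory and loads it to where the compute units can work on it;
Compute: the GPU's parallel cores execute the operations defined by the CUDA kernel on that data, which is where large-scale data-parallel work gets its speed;
Write Back to GM: when the computation finishes, the GPU writes the results back to global memory;
GPU Transfers Data to CPU: over the bus again, the GPU copies the results from global memory back to host RAM for the CPU to process further (display, subsequent host-side logic, and so on), completing one CUDA computation cycle;

This flow lets the CPU and GPU cooperate, with the GPU doing the heavy lifting of parallel computation. It improves the efficiency of compute-intensive tasks and is widely used in deep-learning training and inference, scientific computing, and similar scenarios.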

(2) CUDA Example: Addition

1. Addition on the CPU

add_cpu.cpp

#include <cmath>
#include <iostream>
#include <vector>

// Step 2: Define add function
void add_cpu(std::vector<float> &c, const std::vector<float> &a, const std::vector<float> &b) {
    // The CPU uses a loop to add the elements one by one
    for (size_t i = 0; i < a.size(); i++) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    // Step 1: Prepare & initialize data
    constexpr size_t N = 1 << 20; // ~1M elements

    // Initialize data
    const std::vector<float> a(N, 1);
    const std::vector<float> b(N, 2);
    std::vector<float> c(N, 0);

    // Step 3: Call the cpu addition function
    add_cpu(c, a, b);

    // Step 4: Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (size_t i = 0; i < N; i++) {
        maxError = std::fmax(maxError, std::fabs(c[i] - 3.0f));
    }
    std::cout << "Max error: " << maxError << std::endl;
}

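The program has the following steps:

Prepare and initialize the data;
Define the addition function
- a loop performs the addition for every element
Call the function
Verify the result

2. Converting to GPU Addition (key point!)

The conversion takes the following steps:

Rename the file to *.cu, for example add_cuda.cu (marking it as a CUDA program)
Prepare and initialize the data (CPU): allocate host RAM with vector and the like;
Transfer the data to the GPU: allocate device memory with cudaMalloc and copy data with cudaMemcpy;
The GPU reads from GM, computes, and writes back (by calling a kernel function):
- change the function declaration into a kernel;
- change how it is called;
Transfer the GPU data back to the CPU;
Verify the result

Each step is covered below.

(1) Rename the file to `*.cu`

CUDA source files use the *.cu extension, and the syntax is similar to C++.

(2) Prepare and initialize the data (CPU)

This step is straightforward:

add_cuda.cu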

// Step 1: Prepare & initialize data
constexpr size_t N = 1 << 20; // ~1M elements
constexpr size_t size_bytes = sizeof(float) * N;

// Initialize data
const std::vector<float> h_a(N, 1);
const std::vector<float> h_b(N, 2);
std::vector<float> h_c(N, 0);

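At this point the memory is allocated in host RAM.

(3) Transfer the data to the GPU

Data is transferred to the GPU with functions provided by CUDA:

cudaMalloc allocates device memory;
cudaMemcpy copies data;

add_cuda.cu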

float *d_a, *d_b, *d_c;
CUDA_CHECK(cudaMalloc(&d_a, size_bytes));
CUDA_CHECK(cudaMalloc(&d_b, size_bytes));
CUDA_CHECK(cudaMalloc(&d_c, size_bytes));

CUDA_CHECK(cudaMemcpy(d_a, h_a.data(), size_bytes, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(d_b, h_b.data(), size_bytes, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(d_c, h_c.data(), size_bytes, cudaMemcpyHostToDevice));

This code uses:

the CUDA_CHECK macro to check each call for errors;
cudaMemcpyHostToDevice to specify the direction of the copy;

The CUDA_CHECK macro is defined as follows:

#define CUDA_CHECK(call) \
{ \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        std::cerr << "CUDA Error at " << __FILE__ << ":" << __LINE__ \
        << " - " << cudaGetErrorString(err) << std::endl; \
    } \
}

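(4, supplement) The CUDA hierarchy

i. Thread hierarchy

On the CPU, the work is carried out by a loop.

On the GPU, the model is SIMT: one instruction is executed by many threads at the same time.

So the threads need to be directed, which requires an organizational structure and a numbering scheme.

CUDA organizes threads into:

Grid
Block
Thread

As illustrated in the figure:

Each Grid contains multiple numbered Blocks, and each Block contains multiple numbered Threads.

Every Block in a Grid contains the same number of threads.

Within a Block, threads are numbered 0 to N-1 (assuming each Block contains N Threads).

ii. Computing the thread index

In CUDA:

Every thread has a unique index (idx);
idx = BlockID * BlockSize + ThreadID;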
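The index formula can be checked with a small host-side simulation. This is plain Python, not CUDA: nested loops stand in for the blocks and threads that a GPU would run in parallel, and the function names are hypothetical. The ceiling division in `grid_size` is the same calculation this article uses for the grid dimension in the kernel configuration.

```python
def grid_size(n: int, block_size: int) -> int:
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + block_size - 1) // block_size


def global_indices(n: int, block_size: int) -> list:
    """Simulate idx = block_id * block_size + thread_id for every launched thread."""
    indices = []
    for block_id in range(grid_size(n, block_size)):
        for thread_id in range(block_size):
            idx = block_id * block_size + thread_id
            if idx < n:  # threads past the end of the data do nothing
                indices.append(idx)
    return indices


print(grid_size(1 << 20, 256))   # 4096 blocks cover 2^20 elements
print(global_indices(10, 4))     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With 10 elements and 4 threads per block, 3 blocks are launched (12 threads), and the bounds check discards the last two threads.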

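As illustrated in the figure.

(4) Write and call the kernel function

Where the CPU runs a loop, the GPU mainly relies on many threads running in parallel.

The steps are:

Define the number and size of blocks to direct the threads and parallelize the computation;
Define the addition function that runs on the GPU (the kernel);
Call the GPU addition function with the configuration defined above;

Hierarchy configuration:

// Set up kernel configuration
dim3 block_dim(256);
dim3 grid_dim((N + block_dim.x - 1) / block_dim.x);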

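Kernel definition:

template<typename T>
__global__ void add_kernel(T *c, const T *a, const T *b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

Data can only be passed to the kernel through pointers.

Containers such as vector are defined on the Host side and cannot be allocated in Global Memory.

Kernel call:

// Call cuda add kernel
add_kernel<<<grid_dim, block_dim>>>(d_c, d_a, d_b, N);

Where:

dim3: the CUDA type that describes the thread hierarchy (three dimensions: x, y, z);
<<<>>>: passes the thread hierarchy configuration to the kernel;
Kernel: the entry-point function on the device side;
__global__: marks a function as a kernel;
blockIdx: the block's index, i.e. which block this is;
blockDim: the block size, i.e. how many threads a block has;
threadIdx: the thread's index within its block;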

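(5) Transfer the GPU data back to the CPU

Again with cudaMemcpy:

CUDA_CHECK(cudaMemcpy(h_c.data(), d_c, size_bytes, cudaMemcpyDeviceToHost));

(6) Verify the result and free memory

Verification uses the data that has already been copied into h_c.

Device memory is freed with cudaFree:

add_cuda.cu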

float maxError = 0.0f;
for (int i = 0; i < N; i++) {
    maxError = fmax(maxError, fabs(h_c[i] - 3.0f));
}
std::cout << "Max error: " << maxError << std::endl;

if (d_a) {
    CUDA_CHECK(cudaFree(d_a));
}
if (d_b) {
    CUDA_CHECK(cudaFree(d_b));
}
if (d_c) {
    CUDA_CHECK(cudaFree(d_c));
}

3. Compile & Run the CUDA Program

The program must be compiled with NVCC (the NVIDIA CUDA Compiler).

NVCC is part of the CUDA Toolkit:

https://developer.nvidia.com/cuda-toolkit

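(1) Compilation flow

As illustrated in the figure, the flow is:

For each .cu file: host code and device code are separated (part runs on the CPU, part on the GPU)
For each virtual architecture: the device code is compiled into a fatbin
The host side is compiled with the system C++ compiler (such as g++)
Linking (device, host)
The result is an executable binary that can use the GPU

Supplement: GPU virtual architectures

GPUs produced by NVIDIA in different years may have different architectures.

Take the A100 as an example: it uses the Ampere architecture, and the architectures differ from one another.

This is why Compute Capability (CC) was introduced:

It works like a version number and indicates the supported features and instruction set
The A100 (Ampere architecture) is CC 8.0

Although the A100 is used as the example, from the CUDA programming point of view the architectures do not differ fundamentally at present.

That is why CUDA is described as a programming platform.

Architecture options can also be specified at compile time:

-arch: specifies the virtual architecture, the target for PTX generation. It determines which CUDA features the code may use;
-code: specifies the real architecture and generates binary machine code (cubin) for that specific GPU hardware;

(2) Compile command

Running:

nvcc add_cuda.cu -o add_cuda

compiles the program.

Run it with:

./add_cuda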

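(3) GPU Performance Testing

GPU utilization can be observed with:

nvidia-smi

1. Parallel addition performance comparison

The following configurations are compared:

CPU;
<<<1,1>>>;
<<<256,256>>>;
GPU fully loaded;

The code is as follows:

add_cuda_profiling.cu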

#include <cmath>
#include <iostream>
#include <vector>

#define CUDA_CHECK(call) \
{ \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        std::cerr << "CUDA Error at " << __FILE__ << ":" << __LINE__ \
        << " - " << cudaGetErrorString(err) << std::endl; \
    } \
}

// Step 3: Define add kernel
template<typename T>
__global__ void add_kernel(T *c, const T *a, const T *b, const size_t n, const size_t step) {
    // `step` is the element offset of this launch
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x + step;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

template<typename T>
void vector_add(T *c, const T *a, const T *b, size_t n, const dim3& grid_dim, const dim3& block_dim) {
    // One launch covers grid_dim.x * block_dim.x elements, so launch repeatedly
    size_t step = grid_dim.x * block_dim.x;
    for (size_t i = 0; i < n; i += step) {
        add_kernel<<<grid_dim, block_dim>>>(c, a, b, n, i);
    }
}

int main() {
    // Step 1: Prepare & initialize data
    constexpr size_t N = 1 << 20; // ~1M elements
    constexpr size_t size_bytes = sizeof(float) * N;

    // Initialize data
    const std::vector<float> h_a(N, 1);
    const std::vector<float> h_b(N, 2);
    std::vector<float> h_c(N, 0);

    // Step 2: Allocate device memory & transfer to global memory
    float *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc(&d_a, size_bytes));
    CUDA_CHECK(cudaMalloc(&d_b, size_bytes));
    CUDA_CHECK(cudaMalloc(&d_c, size_bytes));
    CUDA_CHECK(cudaMemcpy(d_a, h_a.data(), size_bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, h_b.data(), size_bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_c, h_c.data(), size_bytes, cudaMemcpyHostToDevice));

    // Step 4: Call the CUDA addition function
    // Set up kernel configuration
    dim3 block_dim(1);
    dim3 grid_dim(1);

    // Call cuda add kernel (grid first, then block, matching vector_add's signature)
    vector_add(d_c, d_a, d_b, N, grid_dim, block_dim);

    // Step 5: Transfer data from global mem to host mem
    CUDA_CHECK(cudaMemcpy(h_c.data(), d_c, size_bytes, cudaMemcpyDeviceToHost));

    // Step 6: Check for errors (all values should be 3.0f)
    float sumError = 0.0f;
    for (size_t i = 0; i < N; i++) {
        sumError += fabs(h_c[i] - 3.0f);
    }
    std::cout << "Sum error: " << sumError << std::endl;

    if (d_a) {
        CUDA_CHECK(cudaFree(d_a));
    }
    if (d_b) {
        CUDA_CHECK(cudaFree(d_b));
    }
    if (d_c) {
        CUDA_CHECK(cudaFree(d_c));
    }
}

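You can change these two lines to try other configurations:

  dim3 block_dim(1);
  dim3 grid_dim(1);

The performance under the different configurations can be analyzed with Nsight Systems (nsys, NVIDIA's system-level performance analysis tool).

Start profiling:

nsys profile -t cuda,nvtx,osrt -o add_cuda_profiling -f true ./add_cuda_profiling

Parse the report and print the statistics: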

nsys stats add_cuda_profiling.nsys-rep

 ** OS Runtime Summary (osrt_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)    Max (ns)     StdDev (ns)            Name
 --------  ---------------  ---------  -------------  -------------  ----------  -----------  -------------  ----------------------
     56.2    7,592,724,284         84   90,389,574.8  100,130,776.0       2,330  370,626,986   45,049,255.4  poll
     42.4    5,736,493,727         26  220,634,374.1  189,702,756.5  41,077,614  752,975,386  124,762,585.8  sem_wait
      1.2      164,252,099        543      302,490.1       13,509.0         529  111,402,991    4,818,716.4  ioctl
      0.1       14,968,499         38      393,907.9      131,267.0         135    5,539,804      890,642.6  pthread_rwlock_wrlock
 ......

 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)     StdDev (ns)            Name
 --------  ---------------  ---------  ------------  -----------  --------  -----------  -------------  ----------------------
     96.9    6,504,565,162  1,048,576       6,203.2      5,159.0     2,928   37,814,020       99,097.6  cudaLaunchKernel
      3.0      203,141,797          3  67,713,932.3    103,908.0    73,162  202,964,727  117,130,625.1  cudaMalloc
      0.1        4,017,591          4   1,004,397.8  1,012,632.0   941,545    1,050,782       45,652.8  cudaMemcpy
      0.0          524,788          3     174,929.3    136,182.0   122,785      265,821       78,999.0  cudaFree
      0.0            2,584          1       2,584.0      2,584.0     2,584        2,584            0.0  cuModuleGetLoadingMode
 ......

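The API Summary results for each configuration show the following.

For <<<1,1>>>, cudaLaunchKernel accounts for a very large share of the time. The reason:

Every kernel launch has overhead, and launching repeatedly from an outer loop makes that overhead enormous.

So this needs to be optimized.

2. Moving the Loop into the Kernel (grid-stride loop)

Because launching the kernel over and over inside a loop is so expensive, the loop can be moved into the kernel itself: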

template<typename T>
__global__ void add_kernel_inner_loop(T *c, const T *a, const T *b, const size_t n, const size_t step) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread handles elements idx, idx + step, idx + 2 * step, ...
    // The loop condition already guarantees i < n, so no extra bounds check is needed
    for (size_t i = idx; i < n; i += step) {
        c[i] = a[i] + b[i];
    }
}

template<typename T>
void vector_add(T *c, const T *a, const T *b, size_t n, const dim3& grid_dim, const dim3& block_dim) {
    // step is the total number of threads in the launch
    size_t step = grid_dim.x * block_dim.x;
    add_kernel_inner_loop<<<grid_dim, block_dim>>>(c, a, b, n, step);
}

Analyzing the new version with nsys gives:

 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)     StdDev (ns)            Name
 --------  ---------------  ---------  ------------  -----------  --------  -----------  -------------  ----------------------
     55.4      204,935,456          3  68,311,818.7    104,741.0    79,097  204,751,618  118,160,333.0  cudaMalloc
     44.4      164,057,041          4  41,014,260.3  1,000,521.5   926,775  161,129,223   80,076,651.2  cudaMemcpy
      0.2          653,441          3     217,813.7    204,732.0   194,409      254,300       32,016.9  cudaFree
      0.1          264,055          1     264,055.0    264,055.0   264,055      264,055            0.0  cudaLaunchKernel
      0.0            2,429          1       2,429.0      2,429.0     2,429        2,429            0.0  cuModuleGetLoadingMode

The time spent in cudaLaunchKernel has dropped sharply: there is now a single launch instead of 1,048,576.

This shows that:

With far fewer kernel launches, the total execution time goes down.
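A grid-stride loop only works if the strided iterations of all threads together cover every element exactly once. The following host-side simulation checks that property. It is plain Python rather than CUDA, the function name is hypothetical, and the sequential loops stand in for threads that a GPU would run in parallel.

```python
def grid_stride_coverage(n: int, grid_dim: int, block_dim: int) -> list:
    """Count how often each element is visited by a simulated grid-stride loop."""
    step = grid_dim * block_dim            # total number of threads in the launch
    visits = [0] * n
    for block_id in range(grid_dim):
        for thread_id in range(block_dim):
            idx = block_id * block_dim + thread_id
            for i in range(idx, n, step):  # i = idx, idx + step, idx + 2 * step, ...
                visits[i] += 1
    return visits


visits = grid_stride_coverage(n=1000, grid_dim=4, block_dim=32)
print(all(v == 1 for v in visits))  # True: every element is handled exactly once
```

The same holds for a single thread (`grid_dim=1, block_dim=1`), which then walks the whole array by itself; that is the <<<1,1>>> case.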

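3. Evaluating CUDA Parallel Addition (Speedup)

Metric:

Speedup = T_cpu / T_gpu

Where:

T_cpu is the execution time of the task on the CPU;
T_gpu is the execution time of the task on the GPU;

Ideal vs. actual speedup

Ideal speedup: when a task is perfectly parallel and there is no overhead, the speedup equals the ratio of processor core counts. For example, a GPU with 1000 CUDA cores could in theory achieve a 1000x speedup over a single-core CPU.

Actual speedup: in practice the speedup is usually far below the ideal, because of:

Serial parts of the task that cannot be parallelized

Data transfer overhead between the CPU and GPU

Thread synchronization and memory access latency

Algorithms that map inefficiently onto the GPU architecture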
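The effect of the serial part can be quantified with Amdahl's law, which this article does not name but which formalizes the first factor in the list. The sketch below assumes a task whose parallel fraction is p and a device with 1000 equal workers; the function name and the sample fractions are illustrative, and the model ignores the transfer and latency costs also listed above.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup = 1 / ((1 - p) + p / workers)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


for p in (1.0, 0.99, 0.9):
    print(f"parallel fraction {p}: speedup {amdahl_speedup(p, 1000):.1f}x")
```

Even with 1000 workers, a task that is 99% parallel tops out near 91x, and one that is 90% parallel stays below 10x.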

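Why is `<<<1,1>>>` slower than the CPU?

Because a single GPU core is actually weaker than a CPU core.

The GPU is fast because it has many workers, not because each worker is fast.

4. Evaluating CUDA Parallel Addition (Total Time)

Now look at the nsys output: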

 ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)  Name
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  ----
    100.0      160,054,287          1  160,054,287.0  160,054,287.0  160,054,287  160,054,287          0.0  void add_kernel_inner_loop<float>(T1 *, const T1 *, const T1 *, unsigned long, unsigned long)

Processing [add_cuda_profiling2.sqlite] with [/usr/local/cuda-12.1/nsight-systems-2023.1.2/host-linux-x64/reports/cuda_gpu_mem_time_sum.py]...

 ** CUDA GPU MemOps Summary (by Time) (cuda_gpu_mem_time_sum):

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)      Operation
 --------  ---------------  -----  ---------  ---------  --------  --------  -----------  ------------------
     78.4        2,318,310      3  772,770.0  763,159.0   761,400   793,751     18,191.4  [CUDA memcpy HtoD]
     21.6          640,473      1  640,473.0  640,473.0   640,473   640,473          0.0  [CUDA memcpy DtoH]

The total time has three parts:

Total time = T_H2D + T_kernel + T_D2H
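As a worked example, the three terms can be filled in with the figures from the nsys summaries shown above (the HtoD total over three copies, the single kernel instance, and the DtoH copy). The helper name is hypothetical. Note that this particular run has a slow kernel, so the kernel term dominates; with a faster launch configuration the kernel term shrinks and the two copy terms become the larger share.

```python
def time_breakdown(t_h2d: int, t_kernel: int, t_d2h: int):
    """Return the total time and each part's share of it."""
    total = t_h2d + t_kernel + t_d2h
    shares = {
        "H2D": t_h2d / total,
        "kernel": t_kernel / total,
        "D2H": t_d2h / total,
    }
    return total, shares


# Nanosecond figures copied from the nsys output in this article.
total, shares = time_breakdown(t_h2d=2_318_310, t_kernel=160_054_287, t_d2h=640_473)
print(total)  # 163013070
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```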

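For <<<256,256>>>, the HtoD and DtoH copies take far longer than the kernel itself.

In other words, moving and copying data back and forth costs more time than the computation.

Can this part be optimized?

That will be covered in later articles.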

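(4) Device Information

For a GPU: more threads => more parallelism => better performance.

The maximum number of threads a GPU can launch can be found in:

The hardware specifications on the official site: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications
At runtime from code (cudaDeviceProp): https://docs.nvidia.com/cuda/cuda-runtime-api/index.html

See also:

https://www.nvidia.cn/docs/IO/51635/NVIDIA_CUDA_Programming_Guide_1.1_chs.pdf

1. cudaDeviceProp

The most important fields:

maxGridSize (int[3]): the maximum number of blocks in each of the x, y, and z dimensions of a grid;
maxThreadsDim (int[3]): the maximum number of threads in each of the x, y, and z dimensions of a block;
maxThreadsPerBlock (int): the maximum total number of threads in a block

A block size must satisfy both maxThreadsDim and maxThreadsPerBlock:

The product x * y * z must not exceed maxThreadsPerBlock;

Each of x, y, and z must not exceed the corresponding entry of maxThreadsDim;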
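The two constraints on a block shape can be sketched as a small check. In CUDA the per-dimension limits are the `maxThreadsDim` field of `cudaDeviceProp` and the total limit is `maxThreadsPerBlock`, and the total is the product x * y * z. The default limits below, (1024, 1024, 64) and 1024, are assumed typical values for recent GPUs and should be replaced by what your device actually reports; the function name is hypothetical.

```python
def block_dim_is_valid(block, max_threads_dim=(1024, 1024, 64), max_threads_per_block=1024):
    """Check a block shape (x, y, z) against per-dimension and total thread limits.

    The default limits are assumed typical values, not queried from a device.
    """
    x, y, z = block
    per_dim_ok = all(1 <= d <= limit for d, limit in zip(block, max_threads_dim))
    total_ok = x * y * z <= max_threads_per_block
    return per_dim_ok and total_ok


print(block_dim_is_valid((256, 1, 1)))   # True
print(block_dim_is_valid((32, 32, 1)))   # True: 32 * 32 = 1024 threads
print(block_dim_is_valid((32, 32, 2)))   # False: 2048 threads in total
print(block_dim_is_valid((1, 1, 128)))   # False: z exceeds its per-dimension limit
```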

2. CUDA Version

Check the CUDA version with:

# CUDA version
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

The CUDA toolkit version is 12.1.

The nvidia-smi command, by contrast, reports the highest CUDA version the driver supports, not the version actually in use:

Tue Jul 29 09:30:09 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   38C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Here the driver supports up to CUDA 12.4.

Afterword

For a more detailed treatment, see:

https://qiankunli.github.io/2025/03/22/cuda.html


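Some Free GPU Resources

2025-07-24 15:37:04

Learning AI often requires GPU resources, but sometimes we don't have an NVIDIA card on hand, or the card we have no longer supports AI workloads.

This article lists some commonly used GPU resources.

More posts on parallel computing and AI will follow on this blog, so stay tuned!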


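I. Google Colab (recommended)

URL:

https://colab.research.google.com/notebooks/

Features:

Provides NVIDIA T4/P100/V100/A100 GPUs (the specific model is assigned at random);
Free users can use up to 12 hours per day (sessions may be interrupted by resource scheduling);
Can be upgraded to Colab Pro ($9.99/month) or Colab Pro+ ($49.99/month) for longer runtimes and higher priority;
Integrated Jupyter Notebook environment, suitable for deep learning and machine learning tasks

II. Kaggle

URL:

https://www.kaggle.com/

Features:

GPU hours: 30 hours per week.
GPU: Tesla P100, comparable to Google Colab's T4.
Quality of service: very good, rarely disconnects.
CPU and memory: four CPUs and 29 GB of RAM.
Ease of use: easy, with a notebook-style interface.
Storage: no persistent storage.

III. Paperspace Gradient

URL:

https://www.paperspace.com/

Features:

GPU hours: no specific limit, but the quality of service is poor.
Storage: persistent storage, so data is not lost.
GPU: M4000, lower-end than Google Colab's T4.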

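IV. Others

1. AWS SageMaker Studio Lab

Features:

GPU hours: 4 hours per day; CPU hours: 12 hours.
GPU: T4, the same as Google Colab.
Quality of service: very good, rarely disconnects.
Ease of use: requires registering on the website.
Storage: persistent storage.

2. Lightning AI

URL:

https://lightning.ai/

Features:

GPU hours: 22 hours per month.
CPU: one Studio with 4 CPUs, completely free.
Quality of service: very good, with a stable connection.
Ease of use: very good, with a VS Code interface.

3. Baidu AI Studio

URL:

https://aistudio.baidu.com/index

Features:

Free GPU runtime with common models (such as T4 and P40; the exact configuration may change at any time);
Integrates PaddlePaddle as well as mainstream deep learning frameworks such as TensorFlow and PyTorch;
An online programming environment similar to Jupyter Notebook, suitable for getting started quickly, learning, and experimenting;

4. Sign-up credits on cloud platforms

Google Cloud Free Tier
- New users get $300 in free trial credit (valid for 90 days) to try many cloud services, including GPU instances
- Users can create NVIDIA T4/V100/A100 GPU instances themselves (manual configuration required)
- Suitable for large-scale machine learning and AI training
- A credit card must be linked at sign-up, but no charges are made during the trial
Microsoft Azure Free Tier
- New users get $200 in free trial credit (valid for 30 days), with support for GPU virtual machines (such as the NC/ND series)
- Suitable for AI deep learning training and enterprise application development
- Credit card verification is required, but no charges are made during the trial
AWS Free Tier (Amazon Web Services)
- The free tier mainly provides 750 hours of t2.micro instances (no GPU); some new users can additionally apply for GPU instances (such as the p3/g4 series)
- Provides the Amazon SageMaker platform for quick deployment of machine learning projects
- A credit card must be linked at sign-up; make sure your usage stays within the free allowance so that no charges are incurred

Other resources: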

https://www.reddit.com/r/KoboldAI/comments/13taldr/google_colabs_possible_alternatives/

https://deepnote.com/compare/alternatives/colab


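Deploying a Kubernetes 1.28 Cluster on Debian 12

2025-07-21 10:41:07

With the summer break and the weather, the university's k8s cluster has been shut down for now, but I still need k8s, so I spent two hours setting one up again.

Because of network conditions in mainland China, GitHub packages and container images are hard to pull, so this article is best suited to environments inside China.

Source code: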

https://github.com/JasonkayZK/kubernetes-learn


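Part 0. Prerequisites

0. Environment check

This part comes from the official Kubernetes documentation:

A compatible Linux host. The Kubernetes project provides generic instructions for Linux distributions based on Debian and Red Hat, and for distributions without a package manager.
2 GB or more of RAM per machine (any less leaves little room for your applications).
2 CPU cores or more.
Full network connectivity between all machines in the cluster (a public or private network is fine).
Unique hostname, MAC address, and product_uuid for every node.

1. Prepare the virtual machines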

IP Address	Hostname	CPU	Memory	Storage	OS Release	Role
192.168.117.200	k1	4C	8G	100GB	Debian 12	Master
192.168.117.201	k2	4C	8G	100GB	Debian 12	Worker
192.168.117.202	k3	4C	8G	100GB	Debian 12	Worker

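VM installation and configuration are not repeated here.

The main items are:

Configure the package sources;
Configure a static IP;
Configure hosts resolution;
Configure passwordless SSH login;
Install the necessary tools: net-tools, wget, curl, htop, and so on;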
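For the hosts-resolution step, the entries can be derived from the VM table. The sketch below builds the lines to append to /etc/hosts on every node and checks that hostnames and IPs are unique, as the environment check requires for hostnames. The function names are hypothetical, and the script only prints the lines; it does not modify any file.

```python
# (IP address, hostname) pairs taken from the VM table in this article.
NODES = [
    ("192.168.117.200", "k1"),
    ("192.168.117.201", "k2"),
    ("192.168.117.202", "k3"),
]


def hosts_entries(nodes) -> str:
    """Render one 'IP<TAB>hostname' line per node, in /etc/hosts format."""
    return "\n".join(f"{ip}\t{name}" for ip, name in nodes)


def has_unique_names_and_ips(nodes) -> bool:
    """True if no hostname and no IP address appears twice."""
    ips = [ip for ip, _ in nodes]
    names = [name for _, name in nodes]
    return len(set(ips)) == len(ips) and len(set(names)) == len(names)


if has_unique_names_and_ips(NODES):
    print(hosts_entries(NODES))
```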

Reference:

"Building a Big Data Image from Scratch, Part 1"

2、卸载docker（如有）

新版本的 k8s 和 docker 底层都依赖 containerd 容易造成冲突，直接卸载docker：

sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-ce-rootless-extrassudo rm -rf /var/lib/dockersudo rm -rf /var/lib/containerdsudo rm -rf /etc/docker

3、设置系统时区和时间同步

使用阿里云的时钟源：

timedatectl set-timezone Asia/Shanghai# 安装 chronyapt-get install -y chrony# 修改为阿里的时钟源sed -i "s/pool 2.debian.pool.ntp.org iburst/server ntp.aliyun.com iburst/g" /etc/chrony/chrony.conf# 启用并立即启动 chrony 服务systemctl restart chronysystemctl enable chrony# 查看与 chrony 服务器同步的时间源chronyc sources

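4. Install the IPVS tools

What ipset and ipvsadm are used for in Kubernetes:

ipset mainly supports Service load balancing and network policies. It enables high-performance packet filtering and forwarding, and fast matching on IP addresses and ports.
ipvsadm is mainly used to configure and manage the IPVS load balancer that implements Service load balancing.

Run:

apt-get install -y ipset ipvsadm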

5. Disable services

Disable swap, the firewall, and so on:

# Turn off all active swap partitions
swapoff -a
# Stop swap partitions from being mounted at boot
sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
# Stop the AppArmor service
systemctl stop apparmor.service
# Disable the AppArmor service
systemctl disable apparmor.service
# Disable Uncomplicated Firewall (ufw)
ufw disable
# Stop the ufw service
systemctl stop ufw.service
# Disable the ufw service
systemctl disable ufw.service
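
The sed expression above only matches entries where `swap` is surrounded by single spaces, and running it twice comments the same line twice. The sketch below is one possible variant, assuming GNU sed; the helper name `comment_swap_entries` is hypothetical. It also accepts tab-separated fields and skips lines that are already comments, and it is demonstrated on a throwaway file rather than the real /etc/fstab.

```shell
#!/bin/sh
# Hypothetical helper: comment out active swap entries in an fstab-style file.
# Lines that are already comments are left alone, so re-running is harmless.
comment_swap_entries() {
  sed -E -i '/^[[:space:]]*#/!{/[[:space:]]swap[[:space:]]/s/^/#/;}' "$1"
}

# Demonstrate on a temporary sample instead of the real /etc/fstab.
sample=$(mktemp)
printf '%s\n' \
  'UUID=1111 /    ext4 errors=remount-ro 0 1' \
  'UUID=2222 none swap sw                0 0' > "$sample"
comment_swap_entries "$sample"
cat "$sample"
rm -f "$sample"
```

After applying it to /etc/fstab and running `swapoff -a`, `swapon --show` should print nothing.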

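6. Kernel tuning

Create a kernel configuration file named kubernetes.conf with the following content:

cat > /etc/sysctl.d/kubernetes.conf << EOF
# Pass bridged IPv6 traffic to ip6tables for processing (has no effect if the firewall is disabled or iptables is not in use)
net.bridge.bridge-nf-call-ip6tables = 1
# Pass bridged IPv4 traffic to iptables for processing (has no effect if the firewall is disabled or iptables is not in use)
net.bridge.bridge-nf-call-iptables = 1
# Enable IPv4 packet forwarding
net.ipv4.ip_forward = 1
# Do not send ICMP redirect messages
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
# Raise the maximum number of tracked connections
net.netfilter.nf_conntrack_max = 1000000
# Set the conntrack timeout for established TCP connections (seconds)
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
# Raise the listen queue size
net.core.somaxconn = 1024
# Protect against SYN flood attacks
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.tcp_synack_retries = 2
# Raise the file descriptor limit
fs.file-max = 65536
# Set swappiness to 0 to reduce swapping to disk
vm.swappiness = 0
EOF

Load the br_netfilter kernel module, which provides the packet filtering needed for bridged traffic:

modprobe br_netfilter

Check that the module was loaded:

lsmod | grep br_netfilter

Read the parameters from the file and apply them to the running system:

sysctl -p /etc/sysctl.d/kubernetes.conf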
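
`sysctl -p` reports keys it cannot set, but the messages are easy to miss. The `net.bridge.*` and `net.netfilter.*` keys in particular only exist once the corresponding kernel modules are loaded. The sketch below compares every key in the file with the live value under /proc/sys. The helper names and the `SYSCTL_ROOT` override are hypothetical, introduced here only for illustration, and the comparison assumes single-valued settings like the ones in this file.

```shell
#!/bin/sh
# Hypothetical helper: map a sysctl key to its file, e.g.
# net.ipv4.ip_forward -> /proc/sys/net/ipv4/ip_forward
sysctl_key_to_path() {
  printf '%s/%s\n' "${SYSCTL_ROOT:-/proc/sys}" "$(printf '%s' "$1" | tr . /)"
}

# Print every key of a sysctl config file whose live value differs or is missing.
report_sysctl_mismatches() {
  # Drop comments and blank lines, then split each "key = value" pair.
  grep -Ev '^[[:space:]]*(#|$)' "$1" | while IFS='=' read -r key want; do
    key=$(printf '%s' "$key" | tr -d '[:space:]')
    want=$(printf '%s' "$want" | tr -d '[:space:]')
    path=$(sysctl_key_to_path "$key")
    have=""
    [ -r "$path" ] && have=$(tr -d '[:space:]' < "$path")
    [ "$have" = "$want" ] || echo "MISMATCH $key: want=$want have=${have:-missing}"
  done
}

# Check the file from this section if it exists on this host.
if [ -f /etc/sysctl.d/kubernetes.conf ]; then
  report_sysctl_mismatches /etc/sysctl.d/kubernetes.conf
fi
```

No output means every value is in effect. A line ending in `have=missing` usually points to a module that has not been loaded yet.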

See:

Linux Operating System: Kernel Tuning

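7. Kernel module configuration

Define the kernel modules that are loaded automatically at boot:

# Kernel modules to load automatically at boot
cat > /etc/modules-load.d/kubernetes.conf << EOF
# /etc/modules-load.d/kubernetes.conf
# Linux bridge support
br_netfilter
# IPVS load balancer
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
# Connection tracking (nf_conntrack_ipv4 was merged into nf_conntrack in kernel 4.19, so Debian 12 needs nf_conntrack)
nf_conntrack
# IP tables rules
ip_tables
EOF

Add execute permission (systemd-modules-load only reads the file, so this step is optional):

chmod a+x /etc/modules-load.d/kubernetes.conf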
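
The file only takes effect at the next boot (or when `systemd-modules-load.service` is restarted), so it is worth checking which of the listed modules are loaded right now. This is a minimal sketch that compares the file against /proc/modules; the helper name `missing_modules` is hypothetical. Modules compiled directly into the kernel never appear in /proc/modules, so a name reported here is a hint to investigate rather than a definite error.

```shell
#!/bin/sh
# Hypothetical helper: print the modules named in a modules-load.d file that are
# not currently loaded.
#   $1: modules-load.d file
#   $2: file in /proc/modules format (defaults to /proc/modules)
missing_modules() {
  # Skip comments and blank lines; every remaining line is a module name.
  grep -Ev '^[[:space:]]*(#|$)' "$1" | while IFS= read -r mod; do
    # Each line of /proc/modules starts with the module name followed by a space.
    grep -q "^${mod} " "${2:-/proc/modules}" || echo "$mod"
  done
}

if [ -f /etc/modules-load.d/kubernetes.conf ] && [ -r /proc/modules ]; then
  missing_modules /etc/modules-load.d/kubernetes.conf
fi
```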

8. Install the containerd runtime

The following instructions apply to Kubernetes 1.28!

(1) Install

Install containerd:

# Compared with the plain containerd archive, cri-containerd also ships runc
wget https://github.com/containerd/containerd/releases/download/v1.7.21/cri-containerd-1.7.21-linux-amd64.tar.gz
tar xf cri-containerd-1.7.21-linux-amd64.tar.gz -C /
# Create the directory that holds the containerd configuration file
mkdir -p /etc/containerd
# Generate a default containerd configuration file
containerd config default > /etc/containerd/config.toml
# Change the sandbox (pause) image to the Alibaba Cloud mirror, version 3.9
sed -i 's#registry.k8s.io/pause:3.8#registry.aliyuncs.com/google_containers/pause:3.9#' /etc/containerd/config.toml
# Make the container runtime (containerd + CRI) use the systemd cgroup driver when creating containers
sed -i '/SystemdCgroup/s/false/true/' /etc/containerd/config.toml
# Change the storage directory (optional)
# mkdir /data1/containerd
# sed -i 's#root = "/var/lib/containerd"#root = "/data1/containerd"#' /etc/containerd/config.toml

(2) Service unit

Set up the systemd unit:

/lib/systemd/system/containerd.service

[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target

Apply the configuration:

# Enable and immediately start the containerd service
systemctl enable --now containerd.service
# Check the current status of the containerd service
systemctl status containerd.service
# Check the versions of containerd, crictl, and runc
containerd --version
crictl --version
runc --version
# Point crictl at the containerd socket
crictl config runtime-endpoint unix:///run/containerd/containerd.sock
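
Both sed edits made to config.toml during installation do nothing, silently, when their pattern does not match (for example, if the default pause tag in another containerd release is not 3.8). The sketch below re-reads the file and reports what is actually configured. It assumes the containerd 1.7 config layout used in this article, and the helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the config enables the systemd cgroup driver.
uses_systemd_cgroup() {
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$1"
}

# Hypothetical helper: print the configured sandbox (pause) image, if any.
sandbox_image() {
  sed -n -E 's/^[[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*"([^"]+)".*/\1/p' "$1"
}

cfg=/etc/containerd/config.toml
if [ -f "$cfg" ]; then
  if uses_systemd_cgroup "$cfg"; then
    echo "cgroup driver: systemd"
  else
    echo "cgroup driver: NOT systemd, check SystemdCgroup in $cfg"
  fi
  echo "sandbox image: $(sandbox_image "$cfg")"
fi
```

If the sandbox image still points at registry.k8s.io, edit the line by hand and restart containerd.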

Part 1. Install the components

Update the apt package index and install the packages needed to use the Kubernetes apt repository:

sudo apt-get update
# apt-transport-https may be a dummy package; if so, you can skip installing it
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
# Download the public signing key for the Kubernetes package repositories. All repositories use the same signing key, so the version in the URL can be ignored:
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
# Add the Kubernetes apt repository. Note that this repository only contains packages for Kubernetes 1.28; for other minor versions, change the minor version in the URL to match (and check that the installation docs you are reading are for the version you plan to install).
# This overwrites any existing configuration in /etc/apt/sources.list.d/kubernetes.list.
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
# Update the apt package index and install kubelet, kubeadm, and kubectl at a fixed version:
sudo apt-get update
sudo apt-get install -y kubelet=1.28.13-1.1 kubeadm=1.28.13-1.1 kubectl=1.28.13-1.1
# Pin the versions
sudo apt-mark hold kubelet kubeadm kubectl
# Note: on releases older than Debian 12 and Ubuntu 22.04, /etc/apt/keyrings does not exist by default; create it with sudo mkdir -m 755 /etc/apt/keyrings.

Enable kubelet at boot:

systemctl enable kubelet

Part 2. Initialize the cluster

1. Initialize the cluster (run on the master node)

Run this subsection on the master node!

Generate the configuration file:

kubeadm config print init-defaults > kubeadm.yaml

Edit the configuration file so that it looks like this:

kubeadm.yaml

apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  ttl: 24h0m0s
  usages:
  - signing
  - authentication
kind: InitConfiguration
#localAPIEndpoint:
#  advertiseAddress: 192.168.2.232
#  bindPort: 6443
nodeRegistration:
  criSocket: unix:///run/containerd/containerd.sock
  imagePullPolicy: IfNotPresent
#  name: node
  taints: null
---
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta3
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
dns: {}
etcd:
  local:
    dataDir: /var/lib/etcd
# Use the Alibaba Cloud image repository and set the k8s version
imageRepository: registry.cn-hangzhou.aliyuncs.com/google_containers
kind: ClusterConfiguration
kubernetesVersion: 1.28.13
# Added
controlPlaneEndpoint: 192.168.117.200:6443 # Change to the master IP!
networking:
  dnsDomain: cluster.local
  serviceSubnet: 10.254.0.0/16
  podSubnet: 10.255.0.0/16  # Pod subnet
scheduler: {}
# Added below:
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs

Check that the image repository setting has taken effect.

kubeadm config images list --config=kubeadm.yaml

Pull the images in advance.

kubeadm config images pull --config=kubeadm.yaml

Check that the images were downloaded.

crictl images

Start the initialization.

kubeadm init --config=kubeadm.yaml

When initialization finishes, the output includes the commands for joining the cluster:

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of control-plane nodes by copying certificate authorities
and service account keys on each node and then running the following as root:

  kubeadm join 192.168.117.200:6443 --token abcdef.0123456789abcdef \
        --discovery-token-ca-cert-hash sha256:91a2398cbadf3967950dc6900e7411d5319e82ad30e139a1163896f9a8c61234 \
        --control-plane

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.117.200:6443 --token abcdef.0123456789abcdef \
        --discovery-token-ca-cert-hash sha256:91a2398cbadf3967950dc6900e7411d5319e82ad30e139a1163896f9a8c61234

2. Join the workers to the cluster (worker nodes)

Run this subsection on the worker nodes!

Run the join command:

kubeadm join 192.168.117.200:6443 --token abcdef.0123456789abcdef \
        --discovery-token-ca-cert-hash sha256:91a2398cbadf3967950dc6900e7411d5319e82ad30e139a1163896f9a8c61234

After a short wait, the node has joined the cluster!

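3. Master node taint (optional)

By default the master node has the control-plane role and carries a taint that keeps ordinary workloads from being scheduled on it;

To enable scheduling on the master node, run:

# Show the taint
kubectl describe node k1 |grep Taints
Taints:    node-role.kubernetes.io/control-plane:NoSchedule
# Remove the taint
kubectl taint node k1 node-role.kubernetes.io/control-plane:NoSchedule-

The master node can now run workloads!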

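Addendum: using kubectl on other nodes

By default the other nodes cannot manage the cluster with kubectl directly; copying the configuration file to them is all that is needed!

Method 1: copy /etc/kubernetes/admin.conf from the master node to the same directory, /etc/kubernetes/, on the other node, then set an environment variable

# Run on the other node; k1 is the master node in this article
scp k1:/etc/kubernetes/admin.conf /etc/kubernetes/admin.conf

Then set the environment variable:

echo 'export KUBECONFIG=/etc/kubernetes/admin.conf' >> ~/.bash_profile
source ~/.bash_profile

Method 2: copy /etc/kubernetes/admin.conf from the master node to the $HOME/.kube directory on the other node and name it config

The $HOME/.kube directory does not exist by default, so create it first:

mkdir -p $HOME/.kube
scp k1:/etc/kubernetes/admin.conf $HOME/.kube/config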

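Part 3. Network plugin

1. Install Calico

Once k8s is deployed it is still not usable: a network plugin has to be set up to assign IPs to Pods, connect the Pod network, and so on.

Calico is one of the most mature open-source pure layer-3 networking frameworks available. It is a widely adopted, battle-tested open-source networking and network security solution for Kubernetes, virtual machine, and bare-metal workloads. Calico provides two main services for cloud-native applications: network connectivity between workloads and network security policy between workloads.

Calico link: projectcalico.docs.tigera.io/about/

Calico is used as the cluster's network plugin here. The official site offers two installation methods:

The operator: changing the images is cumbersome with this method, so it is not used here;
A YAML manifest, which is the method used below;

curl https://raw.githubusercontent.com/projectcalico/calico/v3.28.1/manifests/calico.yaml -O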

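Configuration:

Set CALICO_IPV4POOL_CIDR to the Pod subnet, that is, the podSubnet value in kubeadm.yaml (10.255.0.0/16 in this article; 10.254.0.0/16 is the Service subnet and must not be reused here);
Set CALICO_IPV4POOL_IPIP to Always to enable IPIP encapsulation;

- # - name: CALICO_IPV4POOL_CIDR
- #   value: "192.168.0.0/16"
+ - name: CALICO_IPV4POOL_CIDR
+   value: "10.255.0.0/16"
# Enable IPIP
+ - name: CALICO_IPV4POOL_IPIP
+   value: "Always"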
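
The Calico pool is expected to match the `podSubnet` in kubeadm.yaml, and because the Service and Pod ranges differ by a single digit here, they are easy to mix up. The sketch below extracts both values so they can be compared before the manifest is applied. The helper names are hypothetical, and the parsing assumes the simple `key: value` layout shown in this article rather than arbitrary YAML.

```shell
#!/bin/sh
# Hypothetical helper: print the podSubnet value from a kubeadm config file.
pod_subnet() {
  awk '$1 == "podSubnet:" {print $2; exit}' "$1"
}

# Hypothetical helper: print the active (uncommented) CALICO_IPV4POOL_CIDR value.
calico_pool_cidr() {
  awk '
    /^[[:space:]]*#/ { next }                      # ignore commented-out lines
    found { gsub(/"/, "", $2); print $2; exit }    # the value line right after the name
    /name:[[:space:]]*CALICO_IPV4POOL_CIDR/ { found = 1 }
  ' "$1"
}

if [ -f kubeadm.yaml ] && [ -f calico.yaml ]; then
  if [ "$(pod_subnet kubeadm.yaml)" = "$(calico_pool_cidr calico.yaml)" ]; then
    echo "OK: Calico pool matches podSubnet"
  else
    echo "Mismatch: podSubnet=$(pod_subnet kubeadm.yaml) calico=$(calico_pool_cidr calico.yaml)"
  fi
fi
```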

Change the image addresses:

Search for image: and change each image:

- image: docker.io/calico/cni:v3.28.1
+ image: registry.cn-hangzhou.aliyuncs.com/jasonkay/cni:v3.28.1
- image: docker.io/calico/node:v3.28.1
+ image: registry.cn-hangzhou.aliyuncs.com/jasonkay/node:v3.28.1
- image: docker.io/calico/kube-controllers:v3.28.1
+ image: registry.cn-hangzhou.aliyuncs.com/jasonkay/kube-controllers:v3.28.1

That is, replace docker.io/calico with registry.cn-hangzhou.aliyuncs.com/jasonkay (a mirror I synced on Alibaba Cloud)!
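
The manual search-and-replace can also be scripted. This is a minimal sketch, assuming GNU sed and that every image line in the manifest has the form `image: docker.io/calico/...`; the helper name `rewrite_calico_images` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: point every docker.io/calico image in a manifest at a mirror.
#   $1: manifest file, $2: mirror prefix without a trailing slash
rewrite_calico_images() {
  sed -i "s#image: docker.io/calico/#image: $2/#g" "$1"
}

if [ -f calico.yaml ]; then
  rewrite_calico_images calico.yaml registry.cn-hangzhou.aliyuncs.com/jasonkay
  # List the image lines that remain so the result can be reviewed.
  grep -n 'image:' calico.yaml
fi
```

Reviewing the `grep` output before applying the manifest confirms that no docker.io reference is left.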

Then run:

kubectl apply -f calico.yaml

Wait for the deployment to finish!

2. Verify

Verify that CoreDNS forwards DNS queries correctly:

# Install the DNS tools
apt install -y dnsutils
# Get the IP address of the DNS service
kubectl get svc -n kube-system
NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.254.0.10   <none>        53/UDP,53/TCP,9153/TCP   15h
# Test that names resolve
dig -t a www.baidu.com @10.254.0.10

Part 4. Deploy a test application

Deploy nginx as a test:

nginx-deploy.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.cn-hangzhou.aliyuncs.com/jasonkay/nginx:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
      nodePort: 31080

Apply it:

kubectl apply -f nginx-deploy.yaml

Then open <NodeIP>:31080 (the IP of any cluster node), and Nginx responds normally!

Part 5. Install Helm

Helm is a package manager for Kubernetes, similar to Apt or Yum on Linux;

It helps developers and system administrators deploy, update, and uninstall applications on a Kubernetes cluster more easily.

The three main concepts in Helm:

Concept	Description
Chart	A package of all the resource definitions needed to deploy an application on a Kubernetes cluster
Release	An instance of a Chart deployed on a Kubernetes cluster
Repository	Where Charts are stored, similar to a software repository, used to distribute and share Charts

Installation steps:

1. Add Helm's official GPG key
root@k8s-master:~# curl https://baltocdn.com/helm/signing.asc | gpg --dearmor -o /usr/share/keyrings/helm-keyring.gpg
2. Add Helm's official APT repository
root@k8s-master:~# echo "deb [signed-by=/usr/share/keyrings/helm-keyring.gpg] https://baltocdn.com/helm/stable/debian/ all main" | tee /etc/apt/sources.list.d/helm-stable-debian.list
3. Update the apt sources
root@k8s-master:~# apt-get update
4. Install Helm
root@k8s-master:~# apt-get install -y helm
5. Check that Helm was installed correctly
root@k8s-master:~# helm version
version.BuildInfo{Version:"v3.13.3", GitCommit:"c8b948945e52abba22ff885446a1486cb5fd3474", GitTreeState:"clean", GoVersion:"go1.20.11"}

Part 6. Install the KubeSphere dashboard

The official dashboard is not very convenient, so KubeSphere is recommended here;

Set the download region:

export KKZONE=cn

Installation is simple: helm is all that is needed, and mirrors inside China are supported:

helm upgrade --install -n kubesphere-system \
  --create-namespace ks-core \
  https://charts.kubesphere.com.cn/main/ks-core-1.1.4.tgz \
  --debug --wait \
  --set global.imageRegistry=swr.cn-southwest-2.myhuaweicloud.com/ks \
  --set extension.imageRegistry=swr.cn-southwest-2.myhuaweicloud.com/ks

Once all Pods are ready, the installation is complete and the following is shown:

NOTES:
Thank you for choosing KubeSphere Helm Chart.

Please be patient and wait for several seconds for the KubeSphere deployment to complete.

1. Wait for Deployment Completion

    Confirm that all KubeSphere components are running by executing the following command:

    kubectl get pods -n kubesphere-system

2. Access the KubeSphere Console

    Once the deployment is complete, you can access the KubeSphere console using the following URL:

    http://192.168.6.10:30880

3. Login to KubeSphere Console

    Use the following credentials to log in:

    Account: admin
    Password: P@88w0rd

NOTE: It is highly recommended to change the default password immediately after the first login.
For additional information and details, please visit https://kubesphere.io.

Run the following command to check the Pod status.

kubectl get pods -n kubesphere-system

When every Pod is in the Running state, open the KubeSphere web console at <NodeIP>:30880 and log in with the default account and password (admin/P@88w0rd)!

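Part 7. Recommended tools

1. kubectx

Installing kubectx is recommended; it switches between k8s contexts (for managing several clusters);

kubectx also comes with another tool, kubens, which makes it easy to switch the default namespace;

Install:

apt install -y kubectx

2. nerdctl

nerdctl offers Docker-like commands on the host for operating containerd, which makes day-to-day work more comfortable:

cd /tmp
wget https://github.com/containerd/nerdctl/releases/download/v1.7.6/nerdctl-1.7.6-linux-amd64.tar.gz
tar xf nerdctl-1.7.6-linux-amd64.tar.gz
mv nerdctl /usr/sbin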

Appendix

References:

Source code:

https://github.com/JasonkayZK/kubernetes-learn

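Two Handy Server Scripts: xsync and xcall

2025-07-21 10:20:51

When you maintain several servers at once, you often need to sync files and run commands across all of them.

This article presents two simple scripts that do exactly that!

Source code:

Two Handy Server Scripts: xsync and xcall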

File sync: xsync

1. Prerequisites

xsync depends on the rsync tool, which is easy to install with yum or apt:

apt install -y rsync   # or: yum install -y rsync

Passwordless SSH login also has to be configured!

2. Write the script

Script content:

xsync

#!/bin/bash
# Dependency:
#  1. rsync: yum/apt install -y rsync
#  2. password-less SSH login
#
# 0. Define server list
servers=("server-1" "server-2" "server-3")

# 1. check param num
if [ $# -lt 1 ]; then
  echo "Not enough arguments!"
  exit 1
fi

# 2. traverse all machines
for host in "${servers[@]}"; do
  echo "====================  $host  ===================="
  # 3. traverse dir for each file
  for file in "$@"; do
    # 4. check file exist
    if [ -e "$file" ]; then
      # 5. get parent dir
      pdir=$(cd -P "$(dirname "$file")" && pwd)
      # 6. get file name
      fname=$(basename "$file")
      ssh "$host" "mkdir -p $pdir"
      rsync -av "$pdir/$fname" "$host:$pdir"
    else
      echo "$file does not exist!"
    fi
  done
done

Before using the script, replace the entries in the servers array with the hosts of your own cluster!
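
If several scripts should share one host list, the array can be filled from a file instead of being edited in each script. This is a minimal sketch of that idea; the helper name `load_servers` and the one-host-per-line file format are assumptions made for illustration and are not part of the original scripts.

```shell
#!/bin/sh
# Hypothetical helper: print the hosts listed in a file, one per line,
# ignoring blank lines and lines that start with #.
load_servers() {
  grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Demonstrate with a temporary host list.
list=$(mktemp)
printf '%s\n' '# my cluster' 'server-1' '' 'server-2' > "$list"
for host in $(load_servers "$list"); do
  echo "would sync to: $host"
done
rm -f "$list"
```

Inside a bash script such as xsync, the hard-coded array could then be replaced with `mapfile -t servers < <(load_servers "$HOME/.cluster-hosts")` (bash 4 or newer), where `~/.cluster-hosts` is a hypothetical file name.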

3. Usage

Make the file executable and put it in a directory on your PATH;

Then use it directly, for example:

xsync ~/.bashrc

Command execution: xcall

As with xsync, write:

xcall

#!/bin/bash
# Dependency: password-less SSH login
#
# Define server array (easily extensible)
servers=(
  "server-1"
  "server-2"
  "server-3"
)

# Check if command arguments are provided
if [ $# -eq 0 ]; then
  echo "Error: Please provide a command to execute" >&2
  exit 1
fi

# Execute command across all servers
for server in "${servers[@]}"; do
  echo "--------- $server ----------"
  # Execute remote command and handle errors
  if ssh "$server" "$*"; then
    echo "✓ Command executed successfully"
  else
    echo "✗ Command failed on server: $server" >&2
    # Uncomment below line to exit script on first failure
    # exit 1
  fi
done

Usage is similar:

Make the file executable and put it in a directory on your PATH;

Then use it directly, for example:

xcall ls

Appendix

Source code:
