2026-04-23 08:53:09
I've officially started learning AWS to scale my tech company and master modern automation.
Any resources you'd recommend?
2026-04-23 08:39:21
You wouldn't ship code without tests. But most AI agents ship with nothing — a handful of manual prompts in a notebook, a screenshot of "it worked once," and a prayer that production inputs don't look too different from the test ones.
OAS 1.4 ships oa test, a test harness that runs eval cases against real models, asserts on output shape and content, and emits CI-friendly JSON. Your agents get tested like code, because they are code.
Tests live alongside the spec. One YAML file per agent:
# .agents/summariser.test.yaml
spec: ./summariser.yaml
cases:
- name: summarises short documents
task: summarise
input:
document: "The sky is blue. The grass is green. Water is wet."
expect:
output.summary: { type: string, min_length: 10 }
- name: handles empty facts gracefully
task: summarise
input:
document: ""
expect:
output.summary: { contains: "no content" }
- name: smoke test only
task: summarise
input:
document: "..."
# no expect block — passes if the model returns anything valid
Three cases, one file. Each case targets a task in the spec, provides the input, and optionally asserts on the output.
oa test supports a small, practical set of assertions: enough to catch real bugs without turning tests into a DSL.
| Assertion | Example | Checks |
|---|---|---|
| contains | { contains: "welcome" } | Substring match (case-insensitive by default) |
| equals | { equals: "greeting" } | Exact value equality |
| type | { type: array } | Value type: string, number, boolean, object, array |
| min_length | { min_length: 1 } | Length for strings or arrays |
| max_length | { max_length: 500 } | Upper bound for strings or arrays |
You can combine them:
expect:
output.items: { type: array, min_length: 1, max_length: 10 }
output.items[0].id: { type: string }
output.summary: { contains: "sky", case_sensitive: false }
Paths support dotted access and array indexing (output.items[0].id). The parser is deliberately simple; if you need richer assertions, drop to a post-processing step in CI rather than extending the harness.
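As a mental model, the path resolution can be sketched in a few lines of Python. This is an illustrative reimplementation of the behaviour described here, not oa test's actual parser:

```python
import re

def resolve(path, data):
    """Resolve a dotted path like 'output.items[0].id' against nested data."""
    for part in path.split("."):
        # Split 'items[0]' into the key 'items' and the index 0
        m = re.match(r"^(\w+)(?:\[(\d+)\])?$", part)
        key, idx = m.group(1), m.group(2)
        data = data[key]
        if idx is not None:
            data = data[int(idx)]
    return data

doc = {"output": {"items": [{"id": "a1"}], "summary": "The sky is blue"}}
print(resolve("output.items[0].id", doc))  # → a1
```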
From the terminal:
oa test .agents/summariser.test.yaml
You get human-readable output — green ticks, red crosses, which case failed and why.
For CI, flip to JSON mode:
oa test .agents/summariser.test.yaml --quiet
{
"spec": ".agents/summariser.yaml",
"total": 3,
"passed": 2,
"failed": 1,
"cases": [
{
"name": "summarises short documents",
"passed": true,
"duration_ms": 842
},
{
"name": "handles empty facts gracefully",
"passed": false,
"reason": "output.summary: expected to contain 'no content', got 'The document is empty'",
"duration_ms": 512
},
{
"name": "smoke test only",
"passed": true,
"duration_ms": 654
}
]
}
Pipe this into whatever CI system you use. The exit code is non-zero on any failure, so oa test plays nicely with standard test-runner conventions.
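A post-processing step over that JSON can enforce invariants the built-in assertions don't cover. A sketch in Python — the report structure matches the example output above, while the 5-second latency budget is an invented extra check:

```python
import json

# In CI you'd read the file: report = json.load(open("report.json"))
# Here the example report from above is inlined for illustration.
report = json.loads("""
{
  "spec": ".agents/summariser.yaml",
  "total": 3, "passed": 2, "failed": 1,
  "cases": [
    {"name": "summarises short documents", "passed": true, "duration_ms": 842},
    {"name": "handles empty facts gracefully", "passed": false,
     "reason": "output.summary: expected to contain 'no content'", "duration_ms": 512},
    {"name": "smoke test only", "passed": true, "duration_ms": 654}
  ]
}
""")

failures = [c for c in report["cases"] if not c["passed"]]
for case in failures:
    print(f"FAIL {case['name']}: {case.get('reason', 'no reason given')}")

# An invariant the harness can't express: per-case latency budget
slow = [c for c in report["cases"] if c["duration_ms"] > 5000]

# In a real CI step, finish with: sys.exit(1 if failures or slow else 0)
print(f"{len(failures)} failing, {len(slow)} over budget")
```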
Drop it into a GitHub Actions workflow:
# .github/workflows/test-agents.yml
name: Test agents
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.11"
- run: pipx install open-agent-spec
- name: Run agent tests
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
for test in .agents/*.test.yaml; do
oa test "$test" --quiet
done
Agents now have the same test discipline as the rest of your codebase. Break a prompt? The test case catches it before merge. Swap models? Run the suite and see what drifted.
Model outputs are non-deterministic, so your tests need to assert on shape and invariants, not exact strings.
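One way to see the difference, using the same expect syntax as above (both blocks illustrative):

```yaml
# Brittle: breaks the moment the model rephrases
expect:
  output.summary: { equals: "The sky is blue and the grass is green." }

# Robust: asserts shape and invariants instead
expect:
  output.summary: { type: string, min_length: 10, contains: "sky" }
```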
Assert that oa://prime-vector/summariser still works after the registry updates. Test invariants, not novelty. That's where agent tests earn their keep.
Agents-as-code only works if the agents are actually code-like: specified, versioned, and testable. oa test was the last missing piece. With it, agents get the same discipline as any other component of your system: change them, test them, merge them, deploy them.
Define what your agents do. Let the runtime be someone else's problem.
pipx install open-agent-spec
# Add a test file next to your spec
cat > .agents/example.test.yaml <<'EOF'
spec: ./example.yaml
cases:
- name: greets by name
task: greet
input: { name: "CI" }
expect:
output.response: { contains: "CI" }
EOF
# Run it
oa test .agents/example.test.yaml
One command. One YAML file. Your agents now have a test suite.
Open Agent Spec is MIT-licensed and maintained by Prime Vector. If you're running agents in CI, we'd love to hear what broke — issues welcome on GitHub.
2026-04-23 08:38:47
Imagine an employer or recruiter landing on your profile. Instead of a standard, static PDF, they are greeted with a fast, responsive, and live-hosted website that showcases your professional journey with precision. It doesn’t just show them your experience, it proves you have the technical initiative to build, manage, and deploy your own digital footprint.
In this article, I’ll walk you through how I took a professional Pilot’s resume from a local code editor to a live URL using HTML, CSS, and GitHub Pages, including specific terminal hurdles I cleared along the way.
Learning Objectives:
For this exercise, Git Bash is an essential command-line interface. It provides the Unix-style environment necessary to run these commands seamlessly on a Windows machine.
Step 1: Setting the Foundation
Every successful deployment starts with identity. Before Git can track your progress, it needs to know who is behind the code. I started by configuring my global credentials in the terminal to ensure every commit was correctly attributed to my profile.
Using git config --list is a quick way to verify your setup and avoid Permission Denied errors later in the workflow.
NOTE: Only two lines of output appeared because I have configured my environment before this exercise.
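The setup itself is two commands plus a verification. The identity values here are placeholders (substitute your own), and the throwaway HOME keeps this demo from touching your real configuration:

```shell
export HOME="$(mktemp -d)"   # throwaway HOME: omit this line on your own machine
git config --global user.name  "Jane Doe"
git config --global user.email "jane@example.com"

# Quick verification before any commits
git config --list | grep "user\."
```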
Step 2: Building the Workspace
Organization is key to a smooth deployment. Instead of working out of a cluttered root directory, I used the terminal to create a dedicated space for my project. By navigating to the Desktop and using mkdir and touch, I established a clean environment for my source files.
The Workflow:
cd: Navigating to the right environment.
mkdir website: Creating an isolated container for the project.
touch index.html: Generating the entry point for my live site.
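The full sequence is runnable end to end. I'm using a temp directory here as a stand-in for the Desktop, so the demo is safe to copy-paste:

```shell
cd "$(mktemp -d)"    # stand-in for ~/Desktop in this demo
mkdir website        # an isolated container for the project
cd website
touch index.html     # the entry point for the live site
ls                   # prints: index.html
```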
Step 3: Bridging Local to Remote
With the local code ready, the next step was creating a remote repository on GitHub. This is where the magic of Hosting begins. By clicking that New button, I created a central hub (Git-Basics101) that would eventually serve my resume to the world.
Creating a remote repository is like setting up a destination for a flight—you need a clear target before you can push your data into the clouds.
Then I copied the HTTPS URL to use in the terminal.
Step 4: Troubleshooting the Workflow
The first hurdle was the Git discovery error: when trying to link my remote, I hit fatal: not a git repository. This was a great reminder that git init must be the very first step before any remote connection can be made.
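You can reproduce both the error and the fix in a scratch directory (the repository URL below is a placeholder):

```shell
cd "$(mktemp -d)" && mkdir website && cd website

# Fails with "fatal: not a git repository": no repo exists yet
git remote add origin https://github.com/your-user/Git-Basics101.git || true

git init                     # this must come first
git remote add origin https://github.com/your-user/Git-Basics101.git
git remote -v                # origin is now registered
```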

Step 5: Mastering the Editor
To ensure the final product was polished and free of errors, I utilized the Vim (vi) editor directly within the terminal. This allowed me to inspect the index.html file in a raw environment.
I used the cat command to verify my code before it went live.
NOTE: This resume is for demo purposes only.
Step 6: Staging for Deployment
Here I'm moving files from the untracked zone to the staged zone. By using git status, I could see exactly what Git was watching. Before that, I used ls to verify that the file was in the right location in my local environment. Initially, my index.html was in red (untracked), but with a quick git add ., it turned green: ready and waiting to be committed.
This visual confirmation in the terminal is the developer's safety check to ensure only the right files are being sent to the server.

Step 7: Sealing the Version
With the files staged, it was time to create a permanent snapshot of my work. Running git commit -m "create index.html" acted as the official seal for this version of the project.
The terminal output showing "1 file changed" and "130 insertions" is more than just text; it's a receipt of my progress. By providing a clear, descriptive message, I've ensured that any future collaborator (or even my future self) understands exactly what this change was for.

Step 8: Overcoming the "Origin" Obstacle with Authentication
One of the most common hurdles for any developer is connecting a local project to a remote server. As seen in my terminal, I initially hit a fatal: origin does not appear to be a git repository error. This happened because my local folder hadn't been introduced to GitHub yet.
Note: I resolved this by explicitly adding the remote origin URL. It was a great reminder that Git needs a clear map of where your code is supposed to go before it can start the journey.
In the modern Git workflow, security is paramount. GitHub requires a Personal Access Token (PAT) instead of a regular password. So I navigated to my Developer Settings to generate a Classic token. This token acts as a secure key, allowing my terminal to communicate safely with my GitHub account.
Step 9: The Successful Push
With the connection established, I ran the final command: git push -u origin master. Seeing those lines scroll by (Enumerating objects, Counting objects, and finally the URL of the remote repository) is the ultimate mission-accomplished moment for a developer.
The code was no longer just on my laptop, it was officially live in the cloud.
Step 10: Activating GitHub Pages
With the code successfully pushed to the Git-Basics101 repository, the final piece of the puzzle was turning on the hosting.
I navigated to the Settings tab of my repository, selected the Pages menu on the left, and set the build source to the master branch and saved the change. Within seconds, GitHub provided a live link. This transition from a local index.html file to a globally accessible URL is the ultimate goal of any web deployment project.
Building this resume was more than just a coding exercise, it was a lesson in the modern developer's workflow. Errors aren't failures, they are the terminal's way of teaching you the correct sequence of operations.
The result? A professional, responsive resume that is ready for the world to see.
2026-04-23 08:37:01
I like the moment when something clicks.
When you've been tweaking a recipe and finally nail the flavor. When you're debugging and suddenly see the root cause. When you're reading a paper and a concept from a completely different field connects. The domain changes, but that "this is it" feeling is the same.
I wanted to build a note app that supports the trial and error behind those moments. That's why I started building Graphium.
When I was doing materials science research, I kept running into the same problem: experimental processes were hard to record and even harder to pass on. In grad school, different senior researchers taught me different preprocessing procedures for the same sample material. The know-how lived in people's heads, not in any shared system. We wrote things down in paper lab notebooks, but describing a process precisely is tedious, and searching through those notes later is nearly impossible. On top of that, the data that makes it into publications is just the tip of the iceberg — the trial-and-error data sitting in each lab usually goes unused and eventually disappears.
In 2019, I had a vague idea: "If we could record experimental processes in a structured way, maybe AI could make use of them." (I wrote about this problem in a blog post and a FIT2020 presentation at the time — both in Japanese.) That was the starting point for Graphium. But I had no concrete vision of what it would look like.
Over the next few years, the pieces I needed started coming together in my hands, one by one.
In 2023, around the time GPT-4 came out, I discovered Niklas Luhmann's Zettelkasten — a method where you write individual notes in your own words, link them to each other, and let a network of thought grow over time. I was drawn to the idea that connecting notes could lead to new insights. At the same time, I felt the weight of "permanent notes" — the process of abstracting and restructuring raw notes into refined knowledge. "Could AI handle this part?" I thought. But I couldn't see how this connected to my 2019 problem yet.
In 2024, I came across BlockNote.js, a block-based editor framework. Each element — text, images, data — exists as an independent block with its own ID. I sensed potential in its extensibility. At that point, it felt like "an interesting piece of technology" to me personally, but I didn't yet have a concrete sense of how it would tie into my 2019 problem.
In 2025, I learned about PROV-DM, the W3C standard for describing provenance. It models relationships between Entities (things), Activities (actions), and Agents (actors) in a structured way. That year, a colleague and I wrote a paper called MatPROV, applying PROV-DM to structure the provenance of materials synthesis. It was accepted at a NeurIPS 2025 workshop. For the first time, the vague 2019 idea — "structured recording of experimental processes" — had a formal framework to attach to.
The pieces were accumulating. But I still couldn't see how to put them together into one thing.
In January 2026, a colleague suggested: "What if you could attach context labels to blocks?" A simple idea — give semantic meaning to individual blocks via labels. But the moment I heard it, the scattered pieces clicked into a single picture. Blocks have IDs. IDs mean you can attach labels. Labels mean you can auto-generate a PROV-DM provenance graph. Links between blocks become a Zettelkasten network. And if AI layers a knowledge base on top of that network, you can break through the permanent-note bottleneck.
It also helped that this was a time when AI could vibe-code. Once the ideas connected, I could immediately start building. I went from incubating a concept to implementing it almost overnight.
Then in April 2026, Andrej Karpathy proposed a design pattern called LLM Wiki — an approach where LLMs continuously build and update a Markdown wiki. As he put it: "LLMs don't get bored, don't forget to update a cross-reference, and can touch 15 files in one pass." The idea I'd been carrying since 2023 — "let AI handle Zettelkasten's permanent notes" — suddenly had a concrete implementation pattern.
These pieces combined into a structure where Graphium supports discovery through three layers.
Organize your thinking. Type @ to reference another note, and a network between your notes starts growing. Zettelkasten's "connect your thoughts through links" philosophy, built directly into the editor.
Accelerate discovery. As you accumulate daily notes, AI reads across them and auto-generates a knowledge layer draft. For example, scattered experimental findings across multiple notes get organized into a single synthesized page, which you can then review and edit. With minimal effort, a knowledge base equivalent to Zettelkasten's permanent notes grows over time.
Protect your discoveries. Attach labels like #Input or #Output to blocks, and a provenance graph is auto-generated, showing what went in, what steps were taken, and what came out. Back in grad school, different senior researchers taught me different preprocessing procedures and I couldn't reliably reproduce any of them — this is exactly the kind of tool I wished I'd had back then. Provenance is described using the W3C standard PROV-DM, and when you quote or derive new notes from existing blocks, those connections are structurally recorded too.
A key design principle: these features only appear when you need them.
Without labels, Graphium is a simple note app. Start using @ references and the link network becomes visible. Add # labels and provenance graphs appear. Enable AI and the knowledge layer starts growing. Complexity scales with your actions, step by step.
Structure reveals itself gradually, only to the extent you need it — a progressive design I want to preserve.
The first commit was on March 23, 2026 — just about a month ago. There's plenty left to build, but ideas I'd been carrying in fragments since 2019 are coming together into a single product, and that process itself has been a chain of discoveries.
Right now Graphium is built for individual discovery. But I have a feeling this structure could eventually extend toward formalizing tacit knowledge within teams and growing collective intelligence.
In this series, I'll walk through Graphium's design decisions one by one. Why I built it this way, what I chose not to build, what I'm still figuring out. I want to share the development process as it actually is.
2026-04-23 08:34:57
If you've built multiplayer in Unity, you've probably hit this question at some point:
"Do I use Mirror, or do I use a backend with Socket.IO?"
The answer most tutorials give you is pick one. But they're solving different problems — and using both is not only possible, it's the right architecture for most production multiplayer games.
This guide covers exactly how to integrate socketio-unity with Mirror — what each system owns, where the boundary is, and how to implement the integration without the two systems bleeding into each other.
Socket.IO is a WebSocket client. It connects your game to a Node.js backend and handles everything that requires a server to broker: matchmaking, lobbies, session identity, authoritative scores, reconnection recovery. The backend is the source of truth. Socket.IO is the pipe.
Mirror is an in-scene networking stack. It synchronises transforms, physics, and animation state between players at frame rate. It has no concept of a backend server and cannot validate gameplay events.
The key insight: Mirror never validates — it only synchronises. Socket.IO never touches transforms — it only brokers.
Socket.IO (Node.js Backend) Mirror (In-Scene)
───────────────────────────── ──────────────────
Matchmaking, lobbies, session ──► StartClient() → server
Scores, kills, round state NetworkTransform, Rigidbody
Reconnect recovery, host ID NetworkBehaviour lifecycle
| Concern | Socket.IO | Mirror |
|---|---|---|
| Matchmaking / lobby rooms | ✅ | — |
| Session identity across reconnects | ✅ | — |
| Player transform / position | — | ✅ |
| Rigidbody / physics sync | — | ✅ |
| Animation state | — | ✅ |
| Scores, kill feed, round state | ✅ | — |
| Host migration | ✅ | — |
| Reconnect recovery | ✅ | — |
| Anti-cheat / server validation | ✅ | — |
| WebGL browser support | ✅ | ✅ (via SimpleWebTransport) |
When you're unsure which system owns something, ask: does this need a server to broker it, or does it need low-latency peer sync? The answer tells you where it goes.
socketio-unity (v1.4.0+):
https://github.com/Magithar/socketio-unity.git?path=/package
NativeWebSocket (required dependency):
https://github.com/endel/NativeWebSocket.git#upm
Mirror — install via Package Manager or .unitypackage from the Mirror repo.
The session has four distinct phases. Understanding the phase boundaries is more important than any individual line of code.
Phase 1 — Lobby (Socket.IO only)
Players connect to the /lobby namespace, create or join rooms, exchange session credentials. Mirror is not active at all.
Phase 2 — Match Start (handoff)
The host emits start_match. The backend broadcasts match_started to all room members. This single event is the handoff point — it triggers MirrorGameOrchestrator, which starts Mirror.
Phase 3 — In-Game (both layers active)
Mirror syncs positions via NetworkTransform every frame. Socket.IO delivers authoritative backend events (score_update, player_killed) via the /game namespace. The two systems run in parallel, never touching each other's concerns.
Phase 4 — Teardown (mandatory order)
// Step 1 — Mirror first
if (NetworkServer.active) mirrorNetworkManager.StopHost();
else mirrorNetworkManager.StopClient();
// Step 2 — Clean /game namespace handlers
gameEventBridge.Cleanup();
// Step 3 — Clear netId ↔ playerId mappings
GameIdentityRegistry.Clear();
// Step 4 — Signal intentional leave
lobbyNetworkManager.LeaveRoom();
Reversing steps 1 and 4 is the most common mistake. If you call LeaveRoom() (which closes the socket) before StopHost(), Mirror tries to send disconnect packets over a closed transport. Silent failure, maddening to debug.
Mirror speaks in netId (uint). Socket.IO speaks in playerId (string). They need a translation layer.
GameIdentityRegistry is a static lookup table that maps between the two:
// Register on Mirror player spawn
GameIdentityRegistry.Register(netId, playerId);
// Resolve when a Socket.IO event arrives
var identity = GameIdentityRegistry.GetNetworkObject(playerId);
if (identity != null)
{
// apply event to Mirror object
}
// Clear on ReturnToLobby
GameIdentityRegistry.Clear();
GetNetworkObject checks NetworkServer.spawned first, then NetworkClient.spawned, so it works correctly in all Mirror roles — host, server, or client.
Call Clear() in two places: ReturnToLobby() and store.OnDisconnected. Stale mappings after a disconnect cause events to fire against destroyed objects.
GameEventBridge subscribes to /game namespace events and routes them to Mirror objects via GameIdentityRegistry:
public class GameEventBridge : MonoBehaviour
{
private Action<string> _scoreHandler;
private Action<string> _killHandler;
public void Subscribe()
{
var game = lobbyNetworkManager.Socket.Of("/game");
_scoreHandler = (string json) =>
{
var obj = JObject.Parse(json);
string playerId = obj["playerId"]?.ToString();
int score = obj.Value<int>("score");
var identity = GameIdentityRegistry.GetNetworkObject(playerId);
if (identity != null)
identity.GetComponent<PlayerScore>()?.SetScore(score);
};
_killHandler = (string json) =>
{
var obj = JObject.Parse(json);
string victimId = obj["victimId"]?.ToString();
var identity = GameIdentityRegistry.GetNetworkObject(victimId);
if (identity != null)
identity.GetComponent<PlayerHealth>()?.Die();
};
game.On("score_update", _scoreHandler);
game.On("player_killed", _killHandler);
}
public void Cleanup()
{
var game = lobbyNetworkManager.Socket.Of("/game");
game.Off("score_update", _scoreHandler);
game.Off("player_killed", _killHandler);
}
void OnDestroy() => Cleanup();
}
Critical: never call Subscribe() in Start(). The socket may not be initialized yet. Call it from MirrorGameOrchestrator.HandleMatchStarted(), which is guaranteed to run after the socket is fully connected.
Always cache handler references and call Off() in Cleanup(). The event registry holds delegate references — failing to unsubscribe causes callbacks to fire against destroyed MonoBehaviours.
MirrorGameOrchestrator listens for match_started and coordinates the startup sequence:
public class MirrorGameOrchestrator : MonoBehaviour
{
[SerializeField] private LobbyStateStore store;
[SerializeField] private LobbyNetworkManager lobbyNetworkManager;
[SerializeField] private NetworkManager mirrorNetworkManager;
[SerializeField] private GameEventBridge gameEventBridge;
[SerializeField] private ServerMode serverMode;
[SerializeField] private GameObject lobbyLayer;
[SerializeField] private GameObject gameLayer;
private bool _inGame;
void Start() => store.OnMatchStarted += HandleMatchStarted;
private void HandleMatchStarted(string sceneName, string hostAddress,
int? kcpPort, int? wsPort)
{
if (_inGame) return; // dual guard against duplicate events
if (NetworkClient.active || NetworkServer.active) return;
_inGame = true;
gameEventBridge.Subscribe(); // must happen before StartHost/Client
lobbyLayer.SetActive(false);
gameLayer.SetActive(true);
switch (serverMode)
{
case ServerMode.PeerToPeer:
// Host starts Mirror, others connect to host's LAN IP
if (store.IsHost)
{
    mirrorNetworkManager.StartHost();
}
else
{
    // Braces matter here: without them, StartClient() would also run on the host
    mirrorNetworkManager.networkAddress = hostAddress;
    mirrorNetworkManager.StartClient();
}
break;
case ServerMode.DedicatedKCP:
// All clients connect to dedicated server
SetKcpPort(kcpPort);
mirrorNetworkManager.networkAddress = hostAddress;
mirrorNetworkManager.StartClient();
break;
case ServerMode.DedicatedWebSocket:
SetWsPort(wsPort);
mirrorNetworkManager.networkAddress = hostAddress;
mirrorNetworkManager.StartClient();
break;
}
}
public void ReturnToLobby()
{
if (NetworkServer.active) mirrorNetworkManager.StopHost();
else mirrorNetworkManager.StopClient();
gameEventBridge.Cleanup();
GameIdentityRegistry.Clear();
lobbyNetworkManager.LeaveRoom();
gameLayer.SetActive(false);
lobbyLayer.SetActive(true);
_inGame = false;
}
}
The _inGame flag plus the Mirror state check is a dual guard against duplicate match_started events — the server can broadcast twice if a reconnect happens mid-handshake.
GameLayer must be inactive at scene start. If it's active when Play begins, NetworkManager.Awake() runs before the orchestrator can deactivate it, initialising Mirror prematurely.
MirrorGameOrchestrator exposes a ServerMode enum as an inspector dropdown. Switch between modes without changing code:
| Mode | Who hosts Mirror | Use case |
|---|---|---|
| PeerToPeer | Room creator runs StartHost(), others connect to LAN IP | Local / LAN testing |
| DedicatedKCP | All clients connect to hostAddress:kcpPort | Dedicated server, native builds (UDP) |
| DedicatedWebSocket | All clients connect to hostAddress:wsPort | Dedicated server, WebGL builds |
For dedicated server mode, set MIRROR_SERVER_ADDRESS, MIRROR_KCP_PORT, and MIRROR_WS_PORT as environment variables on your lobby server. The server injects them into every match_started broadcast automatically.
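On the lobby server, that looks like the following (the address and port values are placeholders for your own deployment):

```shell
# Placeholders: point these at your actual Mirror dedicated server
export MIRROR_SERVER_ADDRESS="203.0.113.10"
export MIRROR_KCP_PORT="7777"
export MIRROR_WS_PORT="7778"
```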
PlayerIdentityBridge runs on the Mirror player prefab and registers the netId ↔ playerId mapping when the player spawns:
public class PlayerIdentityBridge : NetworkBehaviour
{
[SyncVar(hook = nameof(OnDisplayNameChanged))]
private string _displayName;
[SerializeField] private TMP_Text nameLabel;
public override void OnStartLocalPlayer()
{
var store = FindObjectOfType<LobbyStateStore>();
CmdRegisterIdentity(store.LocalPlayerId);
string name = store.CurrentRoom?.players
.FirstOrDefault(p => p.id == store.LocalPlayerId)?.name
?? store.LocalPlayerId;
CmdSetDisplayName(name);
}
[Command]
private void CmdRegisterIdentity(string playerId)
{
GameIdentityRegistry.Register(netIdentity.netId, playerId);
// Sync to all clients
RpcRegisterIdentity(netIdentity.netId, playerId);
}
[ClientRpc]
private void RpcRegisterIdentity(uint netId, string playerId)
{
GameIdentityRegistry.Register(netId, playerId);
}
[Command]
private void CmdSetDisplayName(string name)
{
_displayName = name;
OnDisplayNameChanged("", name); // SyncVar hook doesn't fire on host
}
private void OnDisplayNameChanged(string _, string newName)
{
if (nameLabel != null) nameLabel.text = newName;
}
}
Note the manual hook call in CmdSetDisplayName — Mirror SyncVar hooks do not fire on the host when the value is set on the server. This is a common gotcha.
Starting Mirror before match_started — always start Mirror inside HandleMatchStarted. Calling StartHost() from a button before the backend confirms creates an orphaned Mirror session.
Using Mirror [Command] for game validation — [Command] goes to the Mirror host, which is a client and can be spoofed. Route all validation through Socket.IO. The backend emits the result; Mirror executes the visual effect.
Double-spawning when migrating from PlayerSync — if your project previously handled player_join via Socket.IO, disable those handlers during the Mirror game phase. Mirror owns all in-scene player lifecycle.
StartClient() failure leaving the player stranded — wire OnClientDisconnect to call ReturnToLobby() so players return to the lobby instead of seeing a blank screen.
The socketio-unity repo ships a full Mirror Integration sample — lobby → match transition, WASD movement synced via NetworkTransform, lobby display name above each player, graceful shutdown, and a Node.js test server with HTTP endpoints to fire game events from a browser while Unity runs.
Import via Package Manager → Samples → "Mirror Integration".
cd path/to/socketio-unity-mirror-server
npm install
npm run start:mirror
MIT licensed. Zero paid dependencies.
Live WebGL demo: magithar.github.io/socketio-unity/
Have you built a hybrid architecture like this before? What was the hardest part — the boundary between systems, the teardown order, or something else entirely?
2026-04-23 08:19:51
Overall: 8.7 / 10

- Benchmark Performance: 9.5
- Agentic Capabilities: 9.0
- Cost Efficiency: 9.5
- Instruction Following: 7.2
- Ecosystem & Tooling: 7.5
Moonshot AI shipped Kimi Code K2.6 as generally available on April 20, 2026 — one week after beta testers ran the Code Preview. The release is significant: K2.6 tops SWE-Bench Pro at 58.6%, outscoring GPT-5.4 (57.7%) and Claude Opus 4.6 (53.4%) on the benchmark that comes closest to measuring real-world GitHub issue resolution. It does this while running fully open weights under a Modified MIT License and charging $0.60 per million input tokens — roughly 5x cheaper than Claude Sonnet 4.6.
That combination — top-tier coding benchmarks, open weights, and aggressive pricing — makes K2.6 the most credible challenger to Claude Code that developers have seen in 2026.
Kimi K2.6 is Moonshot AI's flagship model, built from the ground up for agentic software engineering. Architecturally, it uses the same Mixture-of-Experts design as K2.5: 1 trillion total parameters with only 32 billion activated per forward pass. The full architecture details: 384 experts in total, 8 selected per token (plus one shared expert that is always active), 61 layers, an attention hidden dimension of 7,168, and 64 attention heads.
What K2.6 changes from K2.5 is execution depth. Kimi K2.5 could reliably follow 30–50 sequential tool calls before losing coherence. K2.6 extends that to 200–300 calls. Agent swarm capacity grows from 100 to 300 simultaneous sub-agents, each capable of executing across up to 4,000 coordinated steps. Moonshot AI demonstrated the practical implications with a real test: K2.6 autonomously overhauled an 8-year-old financial matching engine over 13 hours, achieving a 185% throughput improvement without human intervention.
That's not a benchmark. That's a production refactoring job that would normally take a senior engineer a week.
If you've been following the AI coding tools landscape in 2026, Kimi K2.6 lands in the tier just below Claude Mythos but well above the open-weight field. It's Moonshot AI's direct answer to Claude Sonnet 4.6 and the Cursor background agent ecosystem.
Numbers first, context after.
| Benchmark | Kimi K2.6 | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro | Kimi K2.5 |
|---|---|---|---|---|---|
| SWE-Bench Pro | 58.6% | 57.7% | 53.4% | 54.2% | 50.7% |
| SWE-Bench Verified | 80.2% | — | — | — | — |
| LiveCodeBench v6 | 89.6 | — | 88.8 | — | — |
| HLE-Full (with tools) | 54.0 | 52.1 | 53.0 | 51.4 | — |
| DeepSearchQA (F1) | 92.5% | 78.6% | — | — | — |
| Terminal-Bench 2.0 | 66.7% | — | — | — | — |
| API Input Price | $0.60/M | varies | $3.00/M | varies | $0.60/M |
SWE-Bench Pro is currently the most credible coding evaluation because it tests models on real GitHub issues — bugs filed by actual developers, not synthetic problems. K2.6's 58.6% means it correctly resolves more than half of those issues autonomously, placing it ahead of every closed-weight model in this comparison.
The HLE-Full with tools result (54.0) is perhaps more surprising. Humanity's Last Exam tests genuinely hard multi-domain reasoning, and K2.6 leads there too — which suggests that Moonshot AI's improvements to tool call reliability have broader reasoning implications, not just code execution effects.
One important note: BenchLM currently ranks K2.6 as #6 out of 111 models for coding overall, with an average score of 89.9. It is leading the open-weight category by a significant margin.
Strengths

- Top SWE-Bench Pro score — 58.6% on real GitHub issues beats every frontier model in this comparison, including GPT-5.4 and Claude Opus 4.6

Weaknesses

- English instruction following lags Claude — complex multi-part English prompts with nuanced constraints show more drift than Claude Sonnet 4.6
Kimi K2.6 is available through four channels with different economics:

- Managed API (platform.kimi.ai)
- OpenRouter (moonshotai/kimi-k2.6)
- Microsoft Azure AI Foundry
- Self-Hosted (Hugging Face weights)
For context: at $0.60 input / $2.50 output, K2.6 is 5x cheaper on input and 6x cheaper on output than Claude Sonnet 4.6 ($3/$15). Against Claude Opus 4.6 or 4.7, the gap widens further. For agentic pipelines that generate thousands of tool-call roundtrips, this pricing difference translates directly to project economics.
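The pricing gap is easy to quantify for a concrete workload. A minimal sketch using the per-million-token prices quoted above — the 50M-input / 10M-output workload figures are illustrative, not from any published benchmark:

```python
def workload_cost(input_mtok: float, output_mtok: float,
                  in_price: float, out_price: float) -> float:
    """Total API cost in USD for a workload measured in millions of tokens."""
    return input_mtok * in_price + output_mtok * out_price

# Prices per million tokens, as quoted above.
K26 = (0.60, 2.50)       # Kimi K2.6
SONNET = (3.00, 15.00)   # Claude Sonnet 4.6

# Illustrative agentic run: 50M input tokens, 10M output tokens.
k26_cost = workload_cost(50, 10, *K26)        # $55.00
sonnet_cost = workload_cost(50, 10, *SONNET)  # $300.00
print(f"ratio: {sonnet_cost / k26_cost:.1f}x")
```

Because agentic pipelines are input-heavy (each tool-call roundtrip re-sends context), the effective saving for long chains sits closer to the input-price ratio than the output one.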
The Modified MIT License allows unrestricted commercial use with one exception: if your product exceeds 100 million monthly active users or $20 million in monthly revenue, you must display a visible "Kimi K2.6" attribution in your user interface. Most developer teams won't hit that threshold, but SaaS companies building on top of K2.6 should review the license terms before deploying.
Kimi's API is OpenAI SDK-compatible. If you're already calling OpenAI endpoints, the switch is a base URL change:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-moonshot-api-key",
    base_url="https://api.moonshot.ai/v1",
)

response = client.chat.completions.create(
    model="kimi-k2.6",
    messages=[
        {"role": "user", "content": "Refactor this Python class to use dataclasses."}
    ],
)

print(response.choices[0].message.content)
```
Get your API key at platform.kimi.ai/console/api-keys.
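Since the endpoint is OpenAI-compatible, agentic use goes through the standard function-calling `tools` parameter. A minimal sketch of a tool definition in that schema — the `run_tests` tool itself is hypothetical, shown only to illustrate the shape K2.6's tool-call chains consume:

```python
# A hypothetical tool definition in the OpenAI function-calling schema,
# passed via client.chat.completions.create(tools=[run_tests_tool], ...).
run_tests_tool = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the output.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Test file or directory to run.",
                },
            },
            "required": ["path"],
        },
    },
}

print(run_tests_tool["function"]["name"])
```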
This is the integration that's gained the most traction. Set three environment variables and Claude Code's entire interface — slash commands, subagents, CLAUDE.md — runs against K2.6's backend:
```bash
# Linux / macOS
export ANTHROPIC_BASE_URL="https://api.moonshot.ai/anthropic"
export ANTHROPIC_AUTH_TOKEN="your-moonshot-api-key"
export ANTHROPIC_MODEL="kimi-k2.6"
export ANTHROPIC_DEFAULT_OPUS_MODEL="kimi-k2.6"
export ANTHROPIC_DEFAULT_SONNET_MODEL="kimi-k2.6"

# Then launch Claude Code normally
claude
```
Kimi maintains an Anthropic-compatible API endpoint at api.moonshot.ai/anthropic, which means Claude Code's tool call format, context compaction, and session management work without modification. The practical advantage: you get Claude Code's polished UX at K2.6's pricing.
If you're already using Claude Code for advanced workflows, this is the fastest way to evaluate K2.6 without changing your tooling setup.
Moonshot AI ships its own terminal agent built on K2.6:
```bash
pip install kimi-cli
kimi /login   # OAuth via browser
kimi          # Start coding session
```
The CLI includes repository-aware context, MCP tool integration (kimi mcp add), cron scheduling, and shell mode toggle with Ctrl-X. It supports 256K context tuned for repository-scale codebases and outputs at ~100 tokens/second. For teams comfortable with terminal-first AI coding agents, this is the most direct path.
For teams wanting zero per-token cost:
```bash
# Install dependencies (quote the version spec so the shell
# doesn't treat ">=" as a redirection)
pip install vllm "transformers>=4.57.1"

# Launch vLLM server with K2.6 weights
python -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2.6 \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --dtype bfloat16
```
Hardware baseline: 4× H100 80GB for the full model in bfloat16. For lower-budget setups, community GGUF quantizations from ubergarm reduce VRAM requirements significantly, though at reduced accuracy on complex reasoning tasks.
The recommended inference stack is vLLM or SGLang. vLLM's MRV2 architecture (released March 2026) handles MoE routing well; SGLang is faster for structured output generation. If you're already running vLLM in production, K2.6 slots in without configuration changes beyond the model path.
The 13-hour financial engine refactor is the headline, but production reports are more nuanced.
Where K2.6 genuinely wins:
Where Claude Code still has the edge:
The hybrid workflow gaining traction: K2.6 for code generation and bulk execution, Claude Opus 4.7 for planning, validation, and anything requiring precise instruction adherence. Running K2.6 via the OpenAI-compatible endpoint alongside tools like LiteLLM's proxy makes provider switching transparent to application code.
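The routing logic behind that hybrid workflow can be sketched in a few lines. The task categories and model mapping here are illustrative assumptions, not a published configuration:

```python
# Illustrative task-to-model routing for the hybrid workflow described above.
ROUTES = {
    "generate": "kimi-k2.6",        # bulk code generation
    "execute": "kimi-k2.6",         # long tool-call chains
    "plan": "claude-opus-4.7",      # high-level planning
    "validate": "claude-opus-4.7",  # precise instruction adherence
}

def pick_model(task_category: str) -> str:
    """Return the model for a task category, defaulting to the cheaper generator."""
    return ROUTES.get(task_category, "kimi-k2.6")

print(pick_model("plan"))      # claude-opus-4.7
print(pick_model("refactor"))  # unknown category falls back to kimi-k2.6
```

A proxy like LiteLLM would then map each returned model name to the right provider endpoint, so application code only ever deals with the category.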
K2.6 is the right choice if you're:
Stick with Claude Code if you're:
Compare K2.6 alongside other capable open-weight agents like Goose by Block and Hermes Agent if your priority is moving away from proprietary model dependencies entirely.
The weights are publicly available on Hugging Face under a Modified MIT License. "Modified" because of the revenue/MAU attribution requirement — but for the vast majority of developers and teams, it's functionally open source with commercial use allowed.
Yes. Set `ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic`, `ANTHROPIC_AUTH_TOKEN=<your-kimi-key>`, and `ANTHROPIC_MODEL=kimi-k2.6`. Claude Code's UI, slash commands, and CLAUDE.md handling all work against K2.6's backend via Kimi's Anthropic-compatible endpoint.
The 300 sub-agent, 4,000 coordinated step architecture is accessible via Kimi Code CLI and the managed API. You define an orchestration prompt describing the overall task; K2.6's planning layer spawns sub-agents for parallelizable work (e.g., different modules or files) and coordinates their outputs. Direct programmatic control over individual sub-agent allocation is not yet exposed in the API — it's handled internally by the model.
The Kimi Code CLI is tuned for 256K tokens on repository-scale codebases. Via the managed API, current documentation shows 128K. Self-hosted configurations depend on your --max-model-len setting and available VRAM.
Both are competitive open-weight coding models at aggressive price points. DeepSeek V3.2 has the unique capability of simultaneous thinking + tool use in one API call. K2.6 leads on SWE-Bench Pro and on agent swarm scale. For pure coding throughput and agentic workflows, K2.6 currently has the benchmark edge.
Bottom Line
Kimi Code K2.6 is the most capable open-weight coding model available in April 2026, and its pricing makes it a serious Claude alternative for cost-sensitive agentic pipelines. The benchmark lead is real and the Claude Code drop-in integration removes most switching friction. The honest caveat: complex instruction following and ecosystem maturity still favor Anthropic — but for teams primarily doing code generation at scale, K2.6 earns its place in the stack.
Prefer a deep-dive walkthrough? Watch the full video on YouTube.