
How to Safely Integrate AI Into Structured Backend Systems

2026-05-01 17:03:47

We didn’t start by trying to “add AI” to our system. The system was already there: stable, predictable, and doing exactly what it was supposed to do. It was a typical Java backend built with Spring Boot, with well-defined APIs, structured validation, and workflows that behaved consistently as long as the inputs were clean. Then we introduced AI into one part of the flow. The idea seemed straightforward: let AI interpret user intent and generate structured inputs for downstream services. In controlled scenarios, it worked. But as we moved closer to real usage, things started to behave differently.

Java systems are built around predictability. You define contracts, validate inputs, and control execution paths. AI doesn’t operate the same way. It interprets, approximates, and produces outputs that are usually correct but not always consistent. That difference doesn’t seem like a problem until both systems interact. We saw this clearly when AI started generating payloads for a transaction service. Most of the time, the payload looked right. But occasionally, there were small variations—field names slightly different, dates formatted inconsistently, or optional fields missing. Nothing obviously broken, but enough to cause issues. From the AI’s perspective, the output was valid. From the system’s perspective, it wasn’t. That gap is where things began to fail.

This became more visible in a payment dispute workflow we were building. A user would submit a request in natural language, AI would extract structured data, and that payload would be sent to a Spring Boot API for processing. On paper, the flow was clean. In practice, small inconsistencies started compounding as the request moved through multiple services. One service expected normalized dates, another assumed certain fields were already validated, and another relied on strict naming conventions. The result wasn’t a complete failure but a partial one: some parts of the workflow succeeded while others failed silently. Debugging this was difficult because the issue didn’t originate in a single place; it was spread across the entire pipeline.

Control the gap between AI and execution

Our first instinct was to make the system more tolerant. We relaxed validation rules, added fallback logic, and tried to handle variations dynamically. While this made the system more permissive, it also made it less predictable. The same input could lead to different outcomes depending on how the AI shaped the payload at that moment. We were effectively pushing uncertainty deeper into the system, which made it harder to reason about and maintain.

Things started to improve when we changed the approach. Instead of trying to make Java systems behave more like AI, we introduced a clear boundary between the two. AI was allowed to interpret input, but it was no longer allowed to directly drive execution. Every AI-generated payload had to pass through a control layer before reaching the Spring Boot service. This layer handled mapping fields to known models, enforcing schema alignment, normalizing formats like dates and identifiers, and rejecting invalid payloads early. Only after passing these checks would the request move forward.
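
To make this concrete, here is a minimal sketch of what such a control layer can look like in a Spring Boot service. It is illustrative only, not our production code: DisputeRequest, its field names, and the downstream Consumer are hypothetical stand-ins for whatever your own transaction service actually expects.

import jakarta.validation.ConstraintViolation;
import jakarta.validation.Validation;
import jakarta.validation.Validator;
import jakarta.validation.constraints.NotNull;
import jakarta.validation.constraints.Positive;
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

public class AiPayloadGate {

  // The known, strictly typed model every AI payload must map onto (hypothetical).
  public record DisputeRequest(
      @NotNull String accountId,
      @NotNull @Positive BigDecimal amount,
      @NotNull LocalDate disputeDate) {}

  private final Validator validator =
      Validation.buildDefaultValidatorFactory().getValidator();
  private final Consumer<DisputeRequest> downstream; // the real Spring service call

  public AiPayloadGate(Consumer<DisputeRequest> downstream) {
    this.downstream = downstream;
  }

  // Accepts raw AI output; only validated, normalized requests reach execution.
  public void accept(Map<String, String> aiOutput) {
    DisputeRequest request;
    try {
      // 1. Map loosely structured AI output onto the known model, normalizing
      //    the formats the AI tends to vary (dates, numbers, identifiers).
      request = new DisputeRequest(
          aiOutput.get("account_id"),
          new BigDecimal(aiOutput.getOrDefault("amount", "0")),
          LocalDate.parse(aiOutput.getOrDefault("dispute_date", ""),
              DateTimeFormatter.ISO_DATE));
    } catch (RuntimeException e) {
      // 2. Anything the mapping cannot normalize is rejected at the boundary.
      throw new IllegalArgumentException("Rejected AI payload: " + e.getMessage(), e);
    }

    // 3. Enforce the schema and reject early, before any downstream service runs.
    Set<ConstraintViolation<DisputeRequest>> violations = validator.validate(request);
    if (!violations.isEmpty()) {
      throw new IllegalArgumentException("Rejected AI payload: " + violations);
    }

    // 4. Only now does the request move forward.
    downstream.accept(request);
  }
}

The exact shape matters less than the placement: the gate sits between the AI and every service behind it, so downstream code never has to defend itself against AI-shaped input.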

This shift also changed how we thought about validation. Instead of validating deep inside services, we moved validation to the boundary between AI output and system input. The system stopped asking, “Can I process this request?” and started asking, “Should this request even reach the system?” That small shift prevented bad data from propagating through multiple services and reduced the complexity of downstream debugging.

Another important realization came from observing how AI behaved under real usage. In testing, outputs looked stable and predictable. In production-like scenarios, variation increased significantly. To handle this, we added visibility into the flow by capturing raw AI outputs, tracking rejected payloads, and identifying recurring patterns. This wasn’t just about debugging; it was about understanding how AI behaves over time. Once we saw those patterns, tightening the system became much easier.
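
A sketch of that visibility, again with hypothetical names: keep the raw AI output next to every rejection so recurring deviations become something you can count and query rather than anecdotes.

import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks rejected AI payloads so recurring failure patterns become visible.
public class AiRejectionTracker {

  // Rejection reason -> how often it has occurred, e.g. "unparseable date" -> 37
  private final Map<String, Long> rejectionsByReason = new ConcurrentHashMap<>();

  public void recordRejection(String rawAiOutput, String reason) {
    rejectionsByReason.merge(reason, 1L, Long::sum);
    // Keep the raw output: it shows how the AI deviated, not just that it did.
    System.out.printf("%s REJECTED [%s]: %s%n", Instant.now(), reason, rawAiOutput);
  }

  public Map<String, Long> snapshot() {
    return Map.copyOf(rejectionsByReason);
  }
}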

What this experience showed us is that Java systems don’t need to become more flexible to work with AI. They need stronger boundaries. Instead of expanding business logic to accommodate uncertainty, systems should enforce strict contracts, validate inputs early, and isolate variability before execution. Java remains predictable, and AI becomes manageable within those constraints.

A more stable way to approach integration is to treat AI as an upstream layer. AI can suggest structure, but it shouldn’t define execution. That separation keeps responsibilities clear and prevents variability from leaking into core system behavior.

In the end, integrating AI into Java systems doesn’t fail because the technology is immature. It fails when boundaries are unclear. AI introduces variability, while Java systems depend on consistency. When those two interact without control, even small differences can turn into system-wide issues. What worked for us wasn’t making the system more tolerant or trying to make AI perfect. It was introducing a clear separation between interpretation and execution and enforcing strict validation at that boundary. The system didn’t become perfect, but it became predictable, and in real-world systems that matters far more.


How to Build a Browser-Based Voice Assistant With the AssemblyAI Voice Agent API

2026-05-01 16:29:42

Real-time voice apps have a reputation for being painful to build. You’d normally need a speech-to-text service, an LLM, a text-to-speech engine, a WebSocket server to coordinate them, and some way to handle turn-taking so people aren’t talking over each other.

AssemblyAI’s Voice Agent API handles all of that behind a single WebSocket endpoint. You stream audio in, you get spoken responses back. In this tutorial, we’ll build a browser-based voice assistant from scratch—a tiny Express server for authentication, and an HTML page that captures your mic, talks to the agent, and plays its responses. The whole thing is roughly 120 lines of code across two files.

What we’re building

A browser page where you click a button, talk to an AI voice assistant, and hear it respond in real time. The assistant can also call tools—we’ll wire up a simple weather lookup to demonstrate. No frameworks, no build step, no React. Just vanilla HTML and JavaScript.

Prerequisites

You need Node.js (v18+) and an AssemblyAI API key. If you don’t have one yet, sign up for free—the API key is on your dashboard.

Step 1: The token server

Browsers can’t set custom headers on WebSocket connections, so you can’t pass your API key directly. Instead, your server mints a short-lived token and the browser uses that to authenticate. This keeps your API key off the client entirely.

Create a file called server.js:

const express = require("express");
 
const app = express();
app.use(express.static("public"));
 
app.get("/token", async (req, res) => {
  const response = await fetch(
    "https://agents.assemblyai.com/v1/token?expires_in_seconds=300",
    { headers: { Authorization: `Bearer ${process.env.ASSEMBLYAI_API_KEY}` } }
  );
  if (!response.ok) return res.status(500).send("Token generation failed");
  const { token } = await response.json();
  res.json({ token });
});
 
app.listen(3000, () => console.log("Running on http://localhost:3000"));

That’s the entire backend. One endpoint, 15 lines. Each token is single-use and expires after 5 minutes, so even if someone intercepts one, the blast radius is minimal.

Step 2: Capture mic audio in the browser

Create a public/index.html file. We’ll build it up section by section, starting with the audio capture. The Voice Agent API expects PCM16 mono audio at 24kHz, base64-encoded.

<!DOCTYPE html>
<html>
<head><title>Voice Assistant</title></head>
<body>
  <h1>Voice Assistant</h1>
  <button id="start">Start Conversation</button>
  <button id="stop" disabled>Stop</button>
  <div id="transcript"></div>

  <script>
  let ws, audioCtx, micStream, processor;

  document.getElementById("start").onclick = async () => {
    document.getElementById("start").disabled = true;
    document.getElementById("stop").disabled = false;

    // 1. Get a temporary token from our server
    const { token } = await fetch("/token").then(r => r.json());

    // 2. Open WebSocket to the Voice Agent API
    const wsUrl = new URL("wss://agents.assemblyai.com/v1/ws");
    wsUrl.searchParams.set("token", token);
    ws = new WebSocket(wsUrl);

    // 3. Configure the agent on connect
    ws.onopen = () => {
      ws.send(JSON.stringify({
        type: "session.update",
        session: {
          system_prompt: "You are a helpful voice assistant. " +
            "Keep responses under 2 sentences. " +
            "Use get_weather for weather questions.",
          greeting: "Hi! Ask me anything, or try asking about the weather.",
          output: { voice: "ivy" },
          tools: [{
            type: "function",
            name: "get_weather",
            description: "Get current weather for a city",
            parameters: {
              type: "object",
              properties: {
                location: { type: "string", description: "City name" }
              },
              required: ["location"]
            }
          }]
        }
      }));
    };

A few things to note: the system prompt tells the agent to keep it short (critical for voice UX—nobody wants to listen to a paragraph), and we’ve registered a get_weather tool right in the session config.

Step 3: Handle events and stream audio both ways

Now we need to handle the incoming events from the API and stream our mic audio out. Add this right after the ws.onopen handler:

    // 4. Handle incoming events
    const pendingTools = [];

    ws.onmessage = async (event) => {
      const msg = JSON.parse(event.data);

      switch (msg.type) {
        case "session.ready":
          startMic();  // Begin streaming audio once ready
          break;

        case "reply.audio":
          playAudio(msg.data);
          break;

        case "transcript.user":
          log("You: " + msg.text);
          break;

        case "transcript.agent":
          log("Agent: " + msg.text);
          break;

        case "tool.call":
          // Simulate a weather lookup
          const result = msg.name === "get_weather"
            ? { temp: "72°F", conditions: "Sunny" }
            : { error: "Unknown tool" };
          pendingTools.push({
            call_id: msg.call_id, result
          });
          break;

        case "reply.done":
          if (msg.status === "interrupted") {
            pendingTools.length = 0;
          } else if (pendingTools.length > 0) {
            for (const tool of pendingTools) {
              ws.send(JSON.stringify({
                type: "tool.result",
                call_id: tool.call_id,
                result: JSON.stringify(tool.result)
              }));
            }
            pendingTools.length = 0;
          }
          break;
      }
    };

The key pattern with tool calling: accumulate results during tool.call events, but don’t send them back until reply.done fires. The agent speaks a transition phrase while waiting, and sending results too early causes timing issues.

Step 4: Mic input and audio playback

Finally, wire up the Web Audio API for both capturing mic input (resampled to 24kHz PCM16) and playing the agent’s audio responses. Note the extra closing }; near the end: it closes the outer start.onclick handler.

    // 5. Mic capture — resample to 24kHz PCM16
    async function startMic() {
      audioCtx = new AudioContext({ sampleRate: 24000 });
      micStream = await navigator.mediaDevices.getUserMedia({
        audio: { sampleRate: 24000, channelCount: 1 }
      });
      const source = audioCtx.createMediaStreamSource(micStream);
      processor = audioCtx.createScriptProcessor(4096, 1, 1);

      processor.onaudioprocess = (e) => {
        if (ws.readyState !== WebSocket.OPEN) return;
        const float32 = e.inputBuffer.getChannelData(0);
        const pcm16 = new Int16Array(float32.length);
        for (let i = 0; i < float32.length; i++) {
          pcm16[i] = Math.max(-32768,
            Math.min(32767, Math.floor(float32[i] * 32768)));
        }
        const b64 = btoa(String.fromCharCode(
          ...new Uint8Array(pcm16.buffer)));
        ws.send(JSON.stringify({
          type: "input.audio", audio: b64
        }));
      };

      source.connect(processor);
      processor.connect(audioCtx.destination);
    }


    // 6. Play agent audio
    function playAudio(base64Data) {
      const bytes = atob(base64Data);
      const pcm16 = new Int16Array(bytes.length / 2);
      for (let i = 0; i < pcm16.length; i++) {
        pcm16[i] = bytes.charCodeAt(i * 2)
          | (bytes.charCodeAt(i * 2 + 1) << 8);
      }
      const float32 = new Float32Array(pcm16.length);
      for (let i = 0; i < pcm16.length; i++) {
        float32[i] = pcm16[i] / 32768;
      }
      const buffer = audioCtx.createBuffer(1, float32.length, 24000);
      buffer.copyToChannel(float32, 0);
      const src = audioCtx.createBufferSource();
      src.buffer = buffer;
      src.connect(audioCtx.destination);
      src.start();
    }

    function log(text) {
      const el = document.getElementById("transcript");
      el.innerHTML += "<p>" + text + "</p>";
      el.scrollTop = el.scrollHeight;
    }

    // 7. Stop button
    document.getElementById("stop").onclick = () => {
      if (ws) ws.close();
      if (micStream) micStream.getTracks().forEach(t => t.stop());
      if (processor) processor.disconnect();
      document.getElementById("start").disabled = false;
      document.getElementById("stop").disabled = true;
    };
  };
  </script>
</body>
</html>

Step 5: Run it

Install Express and start the server:

npm install express
ASSEMBLYAI_API_KEY=your_key_here node server.js

Open http://localhost:3000, click "Start Conversation," and talk. You’ll hear the agent greet you and respond to your questions. Try asking "What’s the weather in Tokyo?" to see tool calling in action.

Where to go from here

This is a working voice assistant in two files and about 120 lines of meaningful code. No separate STT, LLM, or TTS services to manage. No orchestration layer. Just one WebSocket doing everything.

Some next steps worth exploring: swap ivy for a multilingual voice like lucia (Spanish/English) or ren (Japanese/English). Add more tools—maybe one that queries your database or creates a support ticket. Adjust the vad_threshold for noisier environments. Or use session.resume to reconnect dropped sessions without losing context (sessions persist for 30 seconds after disconnection).

The full API reference and more examples are in the Voice Agent API docs. If you build something cool with this, I’d love to see it.


7 Things You Can Build With a Single WebSocket (Using AssemblyAI’s Voice Agent API)

2026-05-01 16:23:23

Most voice AI architectures look like a Rube Goldberg machine. You pipe audio into a speech-to-text service, feed the transcript to an LLM, send the LLM’s reply to a text-to-speech engine, then duct-tape the audio back to the user. Each hop adds latency, failure modes, and billing dashboards.

AssemblyAI’s Voice Agent API collapses all of that into one WebSocket connection. You stream mic audio in, you get spoken agent responses back. Turn detection, tool calling, barge-in—it’s all built in. The endpoint is wss://agents.assemblyai.com/v1/ws.

Here are seven things you can build on top of it—each one in surprisingly little code.

1. A multilingual support line that switches voices mid-call

The Voice Agent API ships with 18 English voices and 16 multilingual voices that support code-switching between languages and English. That means your agent can greet a caller in English, detect that they’d prefer Spanish, and swap to the lucia voice with a single session.update message—no reconnection, no new session.

The config is dead simple:

{ "type": "session.update", "session": { "output": { "voice": "lucia" } } }

You could wire this up with a language-detection tool: register a tool called detect_language, and when the agent invokes it, respond with the detected language and fire a session.update to change the voice. The user never notices a seam.

2. A voice-powered knowledge base you can actually talk to

The API’s tool calling feature lets you register external functions that the agent can invoke mid-conversation. Register a search_docs tool that hits your docs index (Pinecone, Elasticsearch, whatever), and suddenly you have a voice interface to your entire knowledge base.

The quickstart in AssemblyAI’s docs actually ships with an MCP server wired up this way. You ask a question out loud, the agent calls the tool, gets the answer, and speaks it back—all over that single WebSocket. No reading docs required, ironically.

3. A real-time language tutor that code-switches naturally

The multilingual voices don’t just speak one language—they support code-switching. The arjun voice handles Hindi/Hinglish and English natively. pierre does French and English. That’s exactly what a language tutor needs: the ability to drop in and out of the target language mid-sentence.

Set your system prompt to something like: "You are a French tutor. Speak mostly in French but switch to English to explain grammar. Correct the user’s pronunciation gently." Pair it with the pierre voice and you’ve got a conversational language partner that’s available 24/7.

4. A voice-driven order system for restaurants

Phone ordering is still massive in food service, and most of it is still handled by humans. Build an agent with tools like get_menu, add_to_order, and submit_order that hit your POS API. The agent takes the order conversationally, confirms items, and submits it—all while the kitchen staff keeps working.

The built-in turn detection with adjustable VAD threshold means the agent won’t cut the customer off mid-sentence when they’re reading a complicated order. And barge-in support means they can interrupt to say "actually, make that a large" without waiting for the agent to finish talking.

5. A meeting copilot you can interrupt and question

Most meeting transcription tools are passive—they record and summarize after the fact. But what if you could talk to the transcript in real time?

Feed meeting audio into the Voice Agent API and register tools like search_transcript and get_action_items. During the meeting, you can ask "What did Sarah say about the deadline?" and get a spoken answer. Session resumption (sessions persist for 30 seconds after disconnection) means you don’t lose context if your connection hiccups.

6. A browser-based voice concierge for SaaS onboarding

The API has a clean browser integration path: your server mints a short-lived token via GET /v1/token, and the browser opens the WebSocket with that token as a query param. Your API key never touches the client.

Embed this in your app’s onboarding flow and you’ve got a voice concierge that walks new users through setup. Register tools like create_project or invite_teammate so the agent can actually perform actions in your app while talking the user through it. It’s like having a customer success rep embedded in your UI—except it scales to every user simultaneously.

7. A lead qualification agent that routes to sales

Inbound sales calls are time-sensitive. Every minute a lead waits, your close rate drops. Build an agent that picks up immediately, asks qualifying questions conversationally, and uses a route_to_sales tool to hand off warm leads to the right rep—complete with a transcript summary.

The system prompt does the heavy lifting here: define your qualification criteria, tell the agent which questions to ask, and specify when to escalate. The agent handles the rest. Because it’s all one WebSocket, the latency between the caller saying something and the agent responding is minimal—no awkward silence while three different services talk to each other behind the scenes.

The common thread

Every one of these projects would’ve been a multi-service integration nightmare a year ago. The Voice Agent API compresses the entire voice AI stack—speech recognition, language understanding, voice synthesis, turn management—into a single WebSocket you can connect to in about 50 lines of code.

The interesting part isn’t any one of these use cases. It’s that the same API handles all of them. Swap the system prompt and tools, and the same connection becomes a completely different agent.

If you want to try it yourself, the Voice Agent API docs have a working quickstart you can run in under five minutes. And if you don’t have an API key yet, grab one for free.


How to Evaluate STT for Voice Agents in Production

2026-05-01 16:04:02

When you're building a voice agent, the benchmarks you reach for matter. Get them wrong, and you optimise for the wrong thing. You ship a system that feels broken in ways you can't immediately diagnose, and end up chasing ghosts in your latency numbers.

There are only a handful of public STT benchmarks. Artificial Analysis is one of the few independent ones (but they currently don’t do real-time). After that, you're mostly left with metrics from the providers themselves, which tend to look… favorable, shall we say.

And most of them are measuring the wrong thing anyway.

The most important thing to remember: any latency metric needs the context of accuracy. I would rather talk to a bot that takes 100ms longer to respond than have to repeat myself because it transcribes something wrong.

And not just accuracy. Reliability and cost have to be factored in when you're actually making a provider decision.

Speed you can trust. That's the target. Not just a fast TTFB.

What I’ll cover in this blog post:

  • The metric that actually matters for voice agents: TTFS
  • Why TTFB is the worst metric in the industry
  • Pipecat evals: the best voice agent benchmark so far
  • Try it yourself
  • Final thoughts
  • Full glossary of STT latency terms



The metric that actually matters for voice agents: TTFS

For cascade voice agents, voice agent latency (sometimes called speech-to-speech latency) is the time from a user finishing speaking to the agent starting to speak back. In that window, a lot of things have to happen. Turn detection, final transcript delivery, RAG if you're using it, LLM call, TTS streaming.

The transcription provider's share of that is TTFS: time to final segment. The time from the user finishing speech to the final, stable transcript arriving. The thing that actually gets passed to your LLM.

You'll also hear it called EoT latency (end of turn), though strictly that's better reserved for when turn completion is detected. Pipecat calls it TTFS publicly and TTFB in their code. Helpful.

In our current system, a framework like Pipecat or LiveKit detects the end of turn in around 200ms, sends a force_end_of_utterance message, and finals come back shortly after. Our internal tooling measures end-of-speech to finals latency at 0.451 ± 0.022s for our Voice SDK with built-in turn detection.

Everything else (RTF, TTFB, partial latency, and final latency) matters less than this number for voice agents. Some of those metrics are actively misleading. More on that below.



Why TTFB is a poor metric to bank on

In my opinion, anyway. But I see it cited over and over again, so it's worth explaining why it's so bad.

TTFB (time to first byte, sometimes TTFT for time to first token) measures the delay from when you start streaming audio to when you get the very first piece of transcript back. The problem: how long is a word? If the engine fires back "Super" almost immediately when someone starts saying "Supercalifragilisticexpialidocious," that's technically fast. It's also useless. Your agent can't act on "Super."

The extreme version of this is Deepgram Flux, as benchmarked by Coval.ai. Their data shows "Two" coming back first regardless of what's actually being said. Likely a statistical bias toward common first words in their training data. Very low TTFB. Completely non-actionable. Your stack still has to wait for a correction before it can do anything.

Firing back a volatile guess doesn't make a service fast. It just means the rest of your AI stack has to wait anyway.

TTFB is popular because it maps naturally from LLM and TTS benchmarking, where the first token genuinely signals useful output. In transcription that analogy breaks down. The first byte is often just a guess that might change a millisecond later. It's not a reliable indicator of when your agent can actually start working.

Coval.ai has independent TTFB measurements worth looking at. Just pay attention to what's actually in that first byte, not only how fast it arrives.



Pipecat evals: the best voice agent benchmark so far

| Service | Transcripts | Perfect | WER Mean | Pooled WER | TTFS Median | TTFS P95 | TTFS P99 |
|----|----|----|----|----|----|----|----|
| AssemblyAI | 99.8% | 66.8% | 3.49% | 3.02% | 256ms | 362ms | 417ms |
| AWS | 100.0% | 77.4% | 1.68% | 1.75% | 1136ms | 1527ms | 1897ms |
| Azure | 100.0% | 82.9% | 1.21% | 1.18% | 1016ms | 1345ms | 1791ms |
| Cartesia | 99.9% | 60.5% | 3.92% | 4.36% | 266ms | 364ms | 898ms |
| Deepgram | 99.8% | 76.5% | 1.71% | 1.62% | 247ms | 298ms | 326ms |
| ElevenLabs | 99.7% | 81.3% | 3.16% | 3.12% | 281ms | 348ms | 407ms |
| Google | 100.0% | 69.0% | 2.84% | 2.85% | 878ms | 1155ms | 1570ms |
| OpenAI | 99.3% | 75.9% | 3.24% | 3.06% | 637ms | 965ms | 1655ms |
| Speechmatics | 99.7% | 83.2% | 1.40% | 1.07% | 495ms | 676ms | 736ms |

Full Pipecat benchmarks →

Two axes: semantic WER and median TTFS. Both matter. Neither works without the other.

The dataset

Pipecat built this eval set originally to train their Smart Turn V3 turn detection model. It's a set of short utterances, some cut off and some finishing naturally as an end of turn. That's a closer approximation to real production audio than most STT test sets, which tend toward clean studio recordings.

Semantic WER

Standard WER counts wrong words. Semantic WER asks a different question: can the LLM understand what was meant based on what was transcribed? That's the right question for voice agents. Your agent doesn't fail because a word was slightly off. It fails because the LLM got the wrong meaning and did the wrong thing.

To calculate it: audio goes through Google’s transcription to establish a ground truth (this is debatable: why them specifically? But the outputs can be human-corrected). Then a large custom prompt asks Claude to compare the meaning of a transcription against that ground truth and compute error per word based on meaning, not exact wording.

That's how you measure accuracy for voice agents.

How TTFS is measured

Pipecat's turn detection pipeline uses two models. Silero VAD (is someone speaking right now?) and Smart Turn V3 (is the turn actually complete?). Once VAD drops for stop_secs, Smart Turn runs on the audio and decides if the user is done. If it decides yes, a message goes to the STT provider to finalise.

TTFS is measured from when VAD initially goes low to when finals come back. That captures real end-of-speech to final transcript latency, including network time, which is exactly what contributes to your voice agent's response delay.

Our internal approach uses force alignment to pinpoint exact word timing, which is more precise. But the Pipecat approach is reproducible by anyone, which matters more for a benchmark.


On where Speechmatics sits

The obvious question you may be asking: Speechmatics' median TTFS is 495ms. Deepgram is 247ms. AssemblyAI is 256ms. Why is that a win?

Look at the accuracy column. 83.2% of turns came back completely perfect. That's the highest on the board. Pooled WER of 1.07%, also the lowest.

The Pareto curve is what matters here. If users have to repeat themselves because the transcript was wrong, you’ve added more perceived latency than any 250ms difference in TTFS would have saved. A faster wrong answer is still a wrong answer. I would take the milliseconds any day.

It's also worth noting that accuracy isn't only an English problem. Domain accuracy across medical transcription, accents, and non-native speakers matters enormously in production, and barely surfaces in most benchmarks. Our latency is consistent across languages and domains, which the Pipecat eval doesn't fully capture.

One more thing for anyone optimising on cost: Speechmatics’ Standard model is cheaper than Deepgram, more accurate, and has the same improved latency as the Enhanced model.



Try it yourself

The Pipecat example in the Speechmatics Academy has been updated to Pipecat 0.101, which now includes live TTFS measurement for transcription services.

To test it out:

  1. Clone the repo: https://github.com/speechmatics/speechmatics-academy/tree/main/integrations/pipecat/02-simple-voice-bot-web
  2. Grab a free API key from Speechmatics
  3. Spin up the bot

Use the endpoint closest to your location for lowest latency. EU is the default. As you talk, the metrics tab updates with live TTFS per turn.



Conclusion

Pipecat has built the most useful public benchmark for voice agents so far. Two axes, real-world turn data, semantic accuracy rather than raw WER. If you're evaluating STT providers, start here.

But as much as I think about latency, I keep coming back to the same conclusion: accuracy matters more.

Latency has reached a point where you can have a comfortable conversation with a voice agent. People who say "latency is UX" aren't wrong. But repeating yourself because the system mishears you is far more annoying than the gap.

Right now, the transcription latency that matters for voice agents is end-of-speech to finals. That might not be the case in a year. Speculative generation is getting more capable. Turn detection is going to get more layered, split across transcription, orchestration, and the LLM, each adding its own backstop, with more of it getting absorbed into the transcription layer itself.

Fast matters. Fast, reliable, and accurate across languages, accents, and domains matters more.


Appendix: STT latency terms, explained

A full breakdown for reference, ordered from least to most relevant for voice agents.

Partials (interim transcripts)

The current best guess. Volatile, subject to change. Emitted sometimes before you've even finished a word. Useful for UI visualisation and for speculative generation, sending to the LLM before the turn ends. Not the ground truth you act on.

Finals

Stable, committed transcripts. Won't change once emitted. Take longer to arrive because the engine needs confidence that the word is complete and surrounding context won't shift its interpretation. This is what you pass to your LLM.

Turns

One half of a conversation. Detecting the boundary between a mid-sentence pause and a completed thought is a hard real-time engineering problem. Frameworks like Pipecat and LiveKit use dedicated turn detection models to decide when to close the turn and trigger the LLM.

RTF (real-time factor)

Not a latency, a speed ratio. Time to transcribe a second of audio. RTF of 0.05 means a 100-second file came back in 5 seconds. For real-time applications, as long as RTF is below 1 you're fine. Mostly relevant for batch benchmarks. Ignore for voice agent work.

TTFB / TTFT (time to first byte / token)

Time from audio stream start to first transcript fragment. Misleading for voice agents as explained above. The first byte is often a volatile guess. Check Coval.ai for independent measurements, but weight them accordingly.

Partial latency

Time from finishing a word to receiving the first partial relating to it. What most people feel when watching a transcription service in action. Useful for perceived responsiveness and speculative generation. Not the number that drives agent response time.

Finals latency

Time from word completion to the relevant final arriving. At Speechmatics this is controlled with the max_delay API parameter. Minimum 700ms to preserve accuracy, recommended around 1.5s for voice agents. Relevant for live captioning and running transcripts.

TTFS (time to final segment)

Time from the user finishing speech to the final transcript arriving. The transcription provider's direct contribution to voice agent response latency. The number that matters.



:::info A note on algorithmic vs network latency: the latency figures above represent algorithmic latency, what you'd measure on a direct connection. The Pipecat evaluations include real-world network latency, which is the honest way to run these tests.

:::


Uphold Refutes Misstatements in New York Attorney General’s Press Release Regarding Cred, LLC Fraud

2026-05-01 16:00:06

Uphold HQ, Inc. (“Uphold”) yesterday entered into a settlement agreement with the New York Office of the Attorney General (“OAG”), to resolve the OAG’s civil inquiry into the collapse of Cred, LLC (“Cred”), a third-party firm, and Cred’s CredEarn program, due to the fraud perpetrated by Cred executives in 2020.

Yesterday, the OAG issued a statement about this settlement, which misrepresents key facts.

Uphold categorically rejects any suggestion that it knowingly promoted Cred’s fraudulent scheme. To the contrary, Cred deliberately and repeatedly misled Uphold. Uphold, like its customers and CredEarn’s other users, was a victim of Cred’s deception. The U.S. Department of Justice identified Uphold as a victim in its criminal prosecution of the Cred executives.

Any statements in the OAG’s press release should not be read to suggest that Uphold acted knowingly or otherwise acted to intentionally deceive customers. Uphold expressly rejects those characterizations and did not agree to them or admit any liability in its settlement with the OAG.

“We are deeply disappointed by the New York Attorney General’s statement, which is profoundly inaccurate and misrepresents the facts of the case,” said Simon McLoughlin, CEO of Uphold.

“Uphold acted with integrity throughout its relationship with Cred LLC, a third-party lending firm that ran into financial difficulties in 2020. As soon as Uphold became aware of the issues at Cred, we demanded that Cred notify its regulators, shut down access to the service, and acted to protect our customers immediately.

“The U.S. Department of Justice, in its criminal investigation of Cred, correctly found that Uphold was a victim of Cred and was not in any way to blame for that company’s actions.”

Uphold and the OAG jointly agreed the factual basis upon which any public statements would be issued. What was published by the OAG yesterday misrepresents those facts and is inconsistent with the parties’ agreement.

As acknowledged by the OAG in the settlement agreement, Uphold did not know of the liquidity issues at Cred until October 2020. Uphold was also unaware that Cred’s statements about the financial viability of its CredEarn product were false, and that Cred was taking active steps to deceive Uphold and CredEarn users.

Critically, as soon as Uphold became aware of Cred’s liquidity issues it acted decisively to protect customers. Within hours, Uphold froze Cred’s access to Uphold, cutting off Cred from continuing to offer its product on Uphold’s platform. Uphold immediately demanded that Cred self-report the losses of customer funds to its regulators, issuing an ultimatum that Uphold itself would notify regulators if Cred failed to do so. These actions brought Cred’s misconduct to light and halted its ability to continue taking in customer funds.

Without Uphold’s intervention, Cred would have continued soliciting and receiving funds while concealing its losses. Uphold put an end to that conduct and thereafter cooperated extensively in federal law enforcement’s prosecution of Dan Schatt and other Cred executives, which resulted in significant prison sentences and financial restitution orders for the victims.

These facts fundamentally contradict any narrative suggesting passive or complicit behavior by Uphold.

Uphold voluntarily cooperated with regulators throughout the investigation and entered into an Assurance of Discontinuance with OAG in good faith to resolve regulatory issues relating primarily to marketing and registration matters—without any admission of liability. That agreement does not contain any allegation that Uphold knew of any fraudulent scheme or that it caused investor losses.

Uphold remains focused on transparency, regulatory compliance, and protecting users.

About Uphold

Uphold is a financial technology company that believes on-chain services are the future of finance. It provides modern infrastructure for on-chain payments, banking and investments. Offering Consumer Services, Business Services and Institutional Trading, Uphold makes financial services easy and trustworthy for millions of customers in more than 140 countries.

Uphold integrates with more than 30 trading venues, including centralized and decentralized exchanges, to deliver superior liquidity, resilience and optimal execution. Uphold never loans out customer assets and is always 100% reserved.

The company pioneered radical transparency and uniquely publishes its assets and liabilities every 30 seconds on a public website (https://uphold.com/en-us/transparency).

Uphold is regulated in the U.S. by FinCEN and State regulators; and is registered in the UK with the FCA and in Europe with the Financial Crime Investigation Service under the Ministry of the Interior of the Republic of Lithuania. Securities products and services are offered by Uphold Securities, Inc., a broker-dealer registered with the SEC and a member of FINRA and SIPC.

To learn more about Uphold’s products and services, visit uphold.com.


:::tip This story was distributed as a release by Blockchain Wire under HackerNoon Business Blogging Program.

:::


BTCC Exchange and the Argentine Football Association Unite as Legends in New Campaign Video

2026-05-01 15:09:44

Lodz, Poland, April 30th, 2026/Chainwire/--BTCC Exchange, the world's longest-running cryptocurrency exchange, has released a landmark campaign video celebrating its official partnership with the Argentine Football Association (AFA). Titled "Legends Made With Every Trade", the video draws a powerful parallel between two names that have stood the test of time and thrived across every cycle in their respective fields.

Founded in 2011, BTCC has served over 11 million users across more than 100 countries, making it a true legend of the crypto industry. The AFA, home to the reigning World Champions and one of the most celebrated football associations, mirrors that legacy on the global sports stage. Together, they represent what it means to be resilient and built to last.

The video is now live on BTCC's YouTube channel, and fans can enter an exclusive giveaway on X for a chance to win exciting prizes: https://x.com/BTCCexchange/status/2049051002108510431

Surrounding the AFA partnership, BTCC will launch a trading championship with a record-breaking prize pool of over one million USDT. The top trader will take home a jersey personally signed by Lionel Messi, captain of the Argentine National Football Team.

The excitement does not stop there. June 2026 marks BTCC's 15th anniversary, a milestone to be celebrated with a mega trading campaign coinciding with the FIFA World Cup, as a token of appreciation for all the support from users worldwide. The campaign will feature large-scale trading competitions and fan-favorite elements including winner prediction challenges, giving traders and football fans alike a chance to be part of something truly historic.

The new campaign video is available to watch: https://www.youtube.com/watch?v=IPQNMdRi5G4

About BTCC

Founded in 2011, BTCC is a leading global cryptocurrency exchange serving over 11 million users across 100+ countries. As the official regional sponsor of the Argentine Football Association (AFA) and with NBA All-Star Jaren Jackson Jr. as its global brand ambassador, BTCC offers secure and accessible cryptocurrency trading services, focused on delivering a user-friendly experience while adhering to applicable regulatory standards.

Official website: https://www.btcc.com/en-US

X: https://x.com/BTCCexchange

Contact: [email protected]

Contact

Aaryn Ling

[email protected]

:::tip This story was published as a press release by Chainwire under HackerNoon’s Business Blogging Program

:::

Disclaimer:

This article is for informational purposes only and does not constitute investment advice. Cryptocurrencies are speculative, complex, and involve high risks. This can mean high price volatility and potential loss of your initial investment. You should consider your financial situation, investment purposes, and consult with a financial advisor before making any investment decisions. The HackerNoon editorial team has only verified the story for grammatical accuracy and does not endorse or guarantee the accuracy, reliability, or completeness of the information stated in this article. #DYOR