The Practical Developer

A constructive and inclusive social network for software developers.

RSS preview of the blog of The Practical Developer

A Developer’s Guide to Surviving the AI Product Tsunami

2025-12-06 10:36:27

Let’s be honest.

For many developers, the last two years have felt less like a technology wave… and more like a tsunami.

Every week brings:

  • a new model
  • a new agent framework
  • a new AI startup
  • a new “this will replace developers” headline
  • a new tool promising 10x productivity

And in the middle of all this noise, one silent question keeps growing:

“Where do I actually stand in this new world?”

This article is not about hype. It’s about survival, relevance, and long-term advantage as a developer in the AI era.

Let me break this down in the most practical way possible.

1. The First Truth: AI Is Not Here to Replace Developers; It’s Here to Replace Unstructured Work

AI doesn’t replace:

  • thinking
  • judgment
  • system design
  • architecture
  • trade-offs
  • constraints
  • security decisions

AI replaces:

  • repetitive coding
  • boilerplate logic
  • mechanical refactoring
  • basic test generation
  • documentation drafting
  • pattern-based tasks

If most of your value lives in repetition, you are at risk. If your value lives in reasoning, you are becoming more powerful than ever.

The future belongs to thinking developers, not typing developers.

2. Stop Chasing Tools. Start Owning Systems

Most developers today are stuck in tool-chasing mode:

  • one week it’s LangChain
  • next week it’s agents
  • then vector DBs
  • then a new multi-modal stack

Tool-chasing creates surface knowledge. Surface knowledge creates panic.

Instead, I believe in owning systems:

  • How data flows end-to-end
  • How memory is stored and retrieved
  • How inference is triggered
  • How decisions are validated
  • How failures are handled
  • How cost is controlled
  • How humans stay in the loop

Tools will change. Systems will remain.

The developer who understands systems deeply will always be employable.

3. The New Core Developer Skill Is Orchestration

In the past, developers wrote logic. Now, developers orchestrate intelligence.

This means:

  • choosing when to use models
  • choosing when NOT to use models
  • routing tasks to the right tool
  • blending rule-based systems with AI
  • balancing automation with safety
  • sequencing agents and workflows
  • designing fallback behaviour

This is not prompt engineering. This is intelligence orchestration.

And it’s becoming one of the most valuable skills in tech.

4. You Don’t Need to Become an AI Researcher to Stay Relevant

This is a big misconception.

Most developers do NOT need to:

  • train models
  • build transformers
  • design architectures
  • work on GPUs
  • publish ML papers

What they DO need to master is:

  • applying AI safely
  • integrating AI into real systems
  • validating outputs
  • managing drift
  • building guardrails
  • designing human-in-the-loop workflows
  • making AI reliable in production

Applied AI will create far more jobs than research AI.

5. The Real Risk Is Becoming a “Wrapper Developer”

A dangerous trend is emerging:

Developers building only:

  • thin UI on top of APIs
  • zero business logic
  • zero unique workflows
  • zero ownership of data
  • zero defensibility

This creates fragile careers.

If your entire skillset can be replaced by:

  • a better API
  • a cheaper model
  • a new framework

… then your leverage is weak.

Strong developers own:

  • domain logic
  • system design
  • business workflows
  • data pipelines
  • decision processes

Wrappers disappear. Systems survive.

6. The Developer Advantage Is Not Coding Speed; It’s Problem Framing

AI now codes fast.

So the advantage shifts to:

  • How you frame the problem
  • How you decompose tasks
  • How you define constraints
  • How you evaluate results
  • How you test edge cases
  • How you design feedback loops

Problem framing is now more valuable than syntax.

The developer who can turn messy reality into a clean, executable system will dominate the next decade.

7. Learn to Build “Human + AI” Systems, Not Fully Autonomous Ones

Fully autonomous systems sound impressive. They also break in the real world.

Smart developers now focus on:

  • AI suggestions + human confirmation
  • AI automation + human review
  • AI generation + human judgment
  • AI speed + human ethics

This hybrid model is what enterprise actually trusts. It’s also what scales safely.

8. The Career Shift: From “Implementer” to “Operator”

The strongest developers are shifting from:

task executors to system operators

They don’t ask:

“What feature should I code today?”

They ask:

  • What system is broken?
  • What workflow is inefficient?
  • What process is leaking value?
  • What automation can compound?
  • Where can AI replace friction without replacing judgment?

This is the mindset that turns developers into:

  • tech leads
  • system architects
  • startup founders
  • AI operators
  • high-leverage consultants

9. Ignore the Fear Narrative. Focus on the Leverage Narrative

Fear-based headlines get clicks:

  • “Developers are doomed.”
  • “AI will replace programmers.”
  • “Coding is dead.”

Reality-based thinking creates careers:

  • AI multiplies output
  • AI reduces low-level work
  • AI increases system complexity
  • AI increases orchestration demand
  • AI increases architectural responsibility

The more AI enters software, the more valuable high-level thinking becomes.

Here’s My Take

The AI product tsunami is real. But it’s not here to wash developers away.

It’s here to wash away:

  • shallow skills
  • repetitive work
  • surface-level knowledge
  • fragile careers

What remains will be:

  • system thinkers
  • orchestration experts
  • applied AI builders
  • judgment-driven developers
  • architecture-first engineers

The safest place to stand in this storm is not behind tools. It’s above systems.

That’s where long-term relevance is built.

Next article:

“The AI Stack I Use to Run My Company (And Why It Works).”

Migrating from Ingress NGINX to Pomerium Ingress Controller

2025-12-06 09:57:00

The Kubernetes community announced that Ingress NGINX will be retired in March 2026. After that, there won't be any more updates, bugfixes, or security patches. Your existing deployments will keep working, but running without security updates is risky, and there will be no further feature development.

Many Kubernetes operators are now evaluating alternatives to the community Ingress NGINX controller. The Pomerium ingress controller offers a compelling migration path that provides the same reverse proxy functionality you're used to, with optional zero trust capabilities (what we do best) that you can adopt incrementally without requiring an immediate overhaul of your existing setup.

Why Consider Pomerium?

While there are several good ingress controller alternatives available, the Pomerium ingress controller provides the same reverse proxy functionality you're used to with Ingress NGINX, but with built-in zero trust features that you can adopt incrementally. Since both Pomerium Core and the ingress controller are open source, you can evaluate and implement without vendor lock-in concerns.

Before You Start

This guide assumes you have a working Kubernetes cluster with existing Ingress NGINX resources, kubectl access, and the ability to install the Pomerium ingress controller in that cluster.

What's Different?

Unlike NGINX, Pomerium has two key requirements:

  1. HTTPS is mandatory - all routes must use TLS
  2. Policies are required - you must specify an access policy (even if it's permissive)

These requirements ensure security by default, but you can configure permissive policies that function exactly like traditional reverse proxies.

A Simple Migration Example

Let's look at a typical Ingress NGINX configuration and its Pomerium equivalent:

Ingress NGINX to Pomerium Ingress Controller:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
+  annotations:
+    ingress.pomerium.io/policy: |
+      - allow:
+          any: true
spec:
-  ingressClassName: nginx
+  ingressClassName: pomerium
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-service
            port:
              number: 80
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls-cert

The configuration is nearly identical—just change the ingress class from nginx to pomerium and add a basic policy. The any: true policy tells Pomerium to allow all requests through without applying access restrictions—essentially functioning as a traditional reverse proxy with no additional authentication or authorization layers. Your existing network security, firewall rules, and application-level authentication remain unchanged.

Policy Options for Basic Reverse Proxy Functionality

For straightforward migration that matches Ingress NGINX's default behavior, you have several policy options:

# Option 1: Allow any request (most similar to Ingress NGINX default)
ingress.pomerium.io/policy: |
  - allow:
      any: true

# Option 2: Truly public access (annotation shortcut)
ingress.pomerium.io/allow_public_unauthenticated_access: 'true'

# Option 3: Any authenticated user (if you want basic auth)
ingress.pomerium.io/allow_any_authenticated_user: 'true'

TLS Certificate Management

Since Pomerium requires HTTPS, consider using cert-manager for automatic certificate provisioning. The Pomerium ingress controller integrates seamlessly with cert-manager:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    ingress.pomerium.io/policy: |
      - allow:
          any: true
spec:
  ingressClassName: pomerium
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-service
            port:
              number: 80
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls-cert # cert-manager will create this
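
The annotation above assumes a ClusterIssuer named letsencrypt-prod already exists in the cluster. If you don't have one yet, a minimal sketch looks like the following (the email is a placeholder; the HTTP-01 solver routed through the pomerium ingress class is just one option, and on cert-manager releases older than 1.12 the solver field is class rather than ingressClassName):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
    - http01:
        ingress:
          ingressClassName: pomerium

A DNS-01 solver works just as well here and avoids exposing a challenge route at all.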

Start Simple, Add Zero Trust When Ready

Migrating to Pomerium gives you immediate reverse proxy functionality identical to what you had with Ingress NGINX. And if you decide to explore zero trust down the road, you can replace the permissive policy with fine-grained rules based on user identity, device status, request context, or other factors:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    ingress.pomerium.io/policy: |
      - allow:
          and:
            - domain:
                is: example.com
spec:
  ingressClassName: pomerium
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-service
            port:
              number: 80
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls-cert

Getting Started

With the March 2026 retirement deadline, you have time to plan your migration carefully. The Pomerium ingress controller installation is straightforward and well-documented. You can run both controllers side-by-side during migration, gradually moving services over as you validate functionality.
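
One low-risk way to do that side-by-side run is to leave your existing nginx-class Ingress untouched and add a second, Pomerium-class Ingress for the same Service under a test hostname, switching traffic over only once you're happy. A rough sketch (the test hostname and secret name are placeholders):

# Your existing Ingress with ingressClassName: nginx keeps serving production traffic.
# This additional resource exposes the same Service through Pomerium on a test host.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-pomerium
  annotations:
    ingress.pomerium.io/policy: |
      - allow:
          any: true
spec:
  ingressClassName: pomerium
  rules:
  - host: app-test.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-service
            port:
              number: 80
  tls:
  - hosts:
    - app-test.example.com
    secretName: app-test-tls-cert

Once the Pomerium route behaves the way you expect, point the real hostname at it (or change the original Ingress's class) and delete the nginx resource.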

Whether you're looking for a sustainable long-term solution or preparing for a zero trust future, the Pomerium ingress controller offers a natural evolution from traditional reverse proxy patterns. Start with the familiar, add security when you're ready.

Learn more about the Pomerium ingress controller at github.com/pomerium/ingress-controller or check out the deployment documentation.

How I Built an MCP Server in 50 Lines of Python (Auto-Generated from OpenAPI)

2025-12-06 09:50:00

The Model Context Protocol (MCP) is revolutionizing how AI assistants interact with external tools. But building MCP servers from scratch? That's tedious.

In this post, I'll show you how we built ConnectSafely's MCP server that automatically generates 12+ LinkedIn automation tools from an existing OpenAPI specification—in under 50 lines of Python.

What We're Building

An MCP server that:

  • ✅ Auto-generates tools from any OpenAPI spec
  • ✅ Handles per-request authentication (API keys)
  • ✅ Works with Claude Desktop, Cursor, and any MCP client
  • ✅ Deploys as a standard web service with health checks

The result? AI assistants can now execute LinkedIn actions like posting comments, fetching profiles, and managing engagement—all through natural language.

The Tech Stack

  • FastMCP: MCP server framework with OpenAPI support
  • FastAPI: Web framework for middleware & health checks
  • httpx: Async HTTP client with event hooks
  • ContextVar: Thread-safe per-request state management

The Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Claude Desktop │────▶│   MCP Server    │────▶│  LinkedIn API   │
│  / Cursor / AI  │     │  (FastMCP)      │     │  (OpenAPI)      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │
        │   ?apiKey=xxx         │  Authorization: Bearer xxx
        └───────────────────────┘
              API Key Flow

The magic happens in the middle layer—our MCP server reads an OpenAPI spec and automatically exposes every endpoint as an MCP tool.

Building It Step by Step

Let's break this down piece by piece:

1. Setting Up Per-Request Authentication

The challenge: MCP servers handle multiple concurrent requests, each potentially with different API keys. We need thread-safe state management.

from contextvars import ContextVar
import httpx

# Store API key per request using context variables (thread-safe)
api_key_context: ContextVar[str] = ContextVar('api_key', default='')

# Event hook to inject Authorization header from context
async def add_auth_header(request: httpx.Request):
    """
    Async event hook that injects the Authorization header 
    from context variable. Called for every outgoing request.
    """
    api_key = api_key_context.get()
    if api_key:
        request.headers["Authorization"] = f"Bearer {api_key}"

Why ContextVar? Unlike global variables, ContextVar maintains separate values for each async context. When Request A sets an API key, it won't leak into Request B.
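
Here's a minimal, standalone sketch (separate from the server code) that shows the isolation in action: two concurrent tasks each set the variable, and neither ever sees the other's value.

import asyncio
from contextvars import ContextVar

api_key_context: ContextVar[str] = ContextVar('api_key', default='')

async def handle_request(name: str, key: str):
    # Each asyncio task runs in its own copy of the context,
    # so this set() is invisible to the other task.
    api_key_context.set(key)
    await asyncio.sleep(0.01)  # yield control so the tasks interleave
    print(f"{name} still sees: {api_key_context.get()}")

async def main():
    await asyncio.gather(
        handle_request("Request A", "key-aaa"),
        handle_request("Request B", "key-bbb"),
    )

asyncio.run(main())
# Request A still sees: key-aaa
# Request B still sees: key-bbb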

2. Creating the HTTP Client with Event Hooks

# Create HTTP client with event hook for dynamic header injection
api_client = httpx.AsyncClient(
    base_url="https://api.connectsafely.ai/linkedin",
    timeout=300.0,  # 5 minute timeout for long operations
    event_hooks={"request": [add_auth_header]}
)

The event_hooks pattern is incredibly powerful—every request automatically gets the correct Authorization header injected, without modifying any endpoint code.

3. Loading the OpenAPI Spec

import sys

try:
    print("Loading OpenAPI spec...")
    openapi_spec = httpx.get(
        "https://api.connectsafely.ai/linkedin/openapi.json", 
        timeout=5.0
    ).json()
    print("OpenAPI spec loaded successfully!")
except httpx.ConnectError:
    print("ERROR: Could not connect to API")
    sys.exit(1)
except Exception as e:
    print(f"ERROR: Failed to load OpenAPI spec: {e}")
    sys.exit(1)

We fetch the OpenAPI spec at startup. This means:

  • Zero manual tool definitions – endpoints become tools automatically
  • Always in sync – update your API, MCP tools update too
  • Self-documenting – tool descriptions come from OpenAPI

4. The Magic: FastMCP from OpenAPI

from fastmcp import FastMCP

# Create the MCP server from the OpenAPI specification
mcp = FastMCP.from_openapi(
    openapi_spec=openapi_spec,
    client=api_client,
    name="LinkedIn API Server"
)

This single line converts your entire OpenAPI spec into MCP tools. Each endpoint becomes a callable tool with:

  • Proper parameter validation
  • Type hints from the schema
  • Descriptions from OpenAPI docs

5. FastAPI Integration for Production Features

from fastapi import FastAPI, Request

# Create MCP ASGI app
mcp_app = mcp.http_app(path='/')

# Create FastAPI app with MCP lifespan
app = FastAPI(lifespan=mcp_app.lifespan)

@app.get("/health")
async def health_check():
    """Health check endpoint for Docker and load balancers."""
    return {"status": "healthy"}

We wrap the MCP app in FastAPI to add:

  • Health check endpoints (essential for Kubernetes/Docker)
  • Custom middleware
  • Additional REST endpoints if needed

6. API Key Extraction Middleware

@app.middleware("http")
async def extract_api_key(request: Request, call_next):
    """
    Middleware to extract API key from query parameter 
    or Authorization header. Supports both formats:
    - ?apiKey=your-key
    - Authorization: Bearer your-key
    """
    # Extract from query parameter first
    api_key = request.query_params.get("apiKey", "")

    if not api_key:
        # Fallback to Authorization header
        auth_header = request.headers.get("Authorization", "")
        if auth_header.startswith("Bearer "):
            api_key = auth_header.split(" ")[1]

    # Store in context for downstream use
    api_key_context.set(api_key)

    response = await call_next(request)
    return response

# Mount MCP server at root
app.mount("/", mcp_app)

This middleware intercepts every request, extracts the API key, and stores it in our ContextVar. The httpx event hook then reads it for outgoing API calls.

7. Running the Server

if __name__ == "__main__":
    import uvicorn

    print("\nStarting MCP server on http://0.0.0.0:3011")
    print("Connect with: http://localhost:3011?apiKey=your-api-key")

    uvicorn.run(app, host="0.0.0.0", port=3011)

Connecting to Claude Desktop

Add this to your Claude Desktop config (claude_desktop_config.json):

{
  "mcpServers": {
    "connectsafely": {
      "url": "http://localhost:3011?apiKey=YOUR_API_KEY"
    }
  }
}

Now Claude can execute LinkedIn automation tasks through natural conversation:

"Post a comment on the latest post from @naval saying 'Great insight on leverage!'"

Claude will use the MCP tools to:

  1. Search for the user's posts
  2. Find the latest one
  3. Post the comment via the API
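
If you want to sanity-check the server outside of an AI client first, FastMCP also ships a Python client you can point at the same URL. A minimal sketch, assuming a recent FastMCP 2.x release that infers the HTTP transport from the URL (replace YOUR_API_KEY with a real key):

import asyncio
from fastmcp import Client

async def main():
    # The apiKey query parameter is read by the middleware shown earlier.
    async with Client("http://localhost:3011/?apiKey=YOUR_API_KEY") as client:
        tools = await client.list_tools()
        print(f"{len(tools)} tools generated from the OpenAPI spec:")
        for tool in tools:
            print("-", tool.name)

asyncio.run(main())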

Why This Pattern is Powerful

1. Zero Duplication

Your REST API and MCP tools share the same OpenAPI source of truth. Update once, both update.

2. Instant AI Integration

Any existing API with an OpenAPI spec can become AI-accessible in minutes.

3. Production Ready

  • Health checks for orchestrators
  • Proper async handling
  • Thread-safe authentication
  • Configurable timeouts

4. Client Flexibility

Works with any MCP-compatible client:

  • Claude Desktop
  • Cursor IDE
  • Custom LangChain agents
  • n8n AI nodes

The Complete Code

Here's everything in one copy-paste block:

import httpx
from fastmcp import FastMCP
from fastapi import FastAPI, Request
from contextvars import ContextVar
import sys

# Store API key per request using context variables (thread-safe)
api_key_context: ContextVar[str] = ContextVar('api_key', default='')

# Event hook to inject Authorization header from context
async def add_auth_header(request: httpx.Request):
    api_key = api_key_context.get()
    if api_key:
        request.headers["Authorization"] = f"Bearer {api_key}"

# Create HTTP client with event hook for dynamic header injection
api_client = httpx.AsyncClient(
    base_url="https://api.connectsafely.ai/linkedin",
    timeout=300.0,
    event_hooks={"request": [add_auth_header]}
)

# Load the OpenAPI spec
try:
    print("Loading OpenAPI spec...")
    openapi_spec = httpx.get(
        "https://api.connectsafely.ai/linkedin/openapi.json", 
        timeout=5.0
    ).json()
    print("OpenAPI spec loaded successfully!")
except Exception as e:
    print(f"ERROR: Failed to load OpenAPI spec: {e}")
    sys.exit(1)

# Create the MCP server from OpenAPI specification
mcp = FastMCP.from_openapi(
    openapi_spec=openapi_spec,
    client=api_client,
    name="LinkedIn API Server"
)

# Create MCP ASGI app
mcp_app = mcp.http_app(path='/')

# Create FastAPI app with MCP lifespan and middleware
app = FastAPI(lifespan=mcp_app.lifespan)

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

@app.middleware("http")
async def extract_api_key(request: Request, call_next):
    api_key = request.query_params.get("apiKey", "")
    if not api_key:
        auth_header = request.headers.get("Authorization", "")
        if auth_header.startswith("Bearer "):
            api_key = auth_header.split(" ")[1]
    api_key_context.set(api_key)
    response = await call_next(request)
    return response

# Mount MCP server to FastAPI app at root
app.mount("/", mcp_app)

if __name__ == "__main__":
    import uvicorn
    print("\nStarting MCP server on http://0.0.0.0:3011")
    print("Connect with: http://localhost:3011?apiKey=your-api-key")
    uvicorn.run(app, host="0.0.0.0", port=3011)

What's Next?

This pattern opens up possibilities:

  • Multi-tenant SaaS: Each user's API key routes to their data
  • Rate limiting: Add middleware to track usage per key (see the sketch after this list)
  • Caching: Cache frequently-requested data at the MCP layer
  • Logging: Track which AI tools are used most
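
For the rate-limiting idea, here's a rough sketch that builds on the app object from the complete example above. It uses a naive in-memory sliding window, which is fine for illustration but wouldn't survive restarts or multiple workers; a real deployment would back this with Redis or similar.

import time
from collections import defaultdict, deque

from fastapi import Request
from fastapi.responses import JSONResponse

WINDOW_SECONDS = 60    # look-back window
MAX_REQUESTS = 60      # allowed calls per key within the window
request_log: dict[str, deque] = defaultdict(deque)

@app.middleware("http")
async def rate_limit(request: Request, call_next):
    api_key = request.query_params.get("apiKey", "")
    now = time.monotonic()
    window = request_log[api_key]

    # Drop timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()

    if len(window) >= MAX_REQUESTS:
        return JSONResponse({"error": "rate limit exceeded"}, status_code=429)

    window.append(now)
    return await call_next(request)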

Try It Yourself

  1. Get a ConnectSafely API key: connectsafely.ai
  2. Or adapt for your own API: Just point to your OpenAPI spec URL
  3. Full documentation: connectsafely.ai/integrations/mcp-server


AWS re:Invent 2025 - Create hyper-personalized voice interactions with Amazon Nova Sonic (AIM374)

2025-12-06 09:46:20

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Create hyper-personalized voice interactions with Amazon Nova Sonic (AIM374)

In this video, Veerdhawal Pande and Ankur Gandhe introduce Amazon Nova 2 Sonic, a speech-to-speech foundation model for real-time conversational AI. They demonstrate new capabilities including support for seven languages with masculine and feminine voices, language switching within sessions, asynchronous task completion, cross-modal text and speech input/output, and turn-taking controllability. The presentation highlights significant improvements in speech understanding accuracy (50% better on alphanumerics), state-of-the-art ASR performance, and enhanced tool calling capabilities. Ankur explains the architectural advantages of the unified speech-to-speech model over traditional cascade systems. Key use cases include customer service automation, voice assistants, and education apps. Integration options through LiveKit, Pipecat, Amazon Connect, and telephony partners like Twilio are covered. Amma Pandekar from Cisco demonstrates an AI receptionist application for a tire chain, showcasing real-world implementation with appointment booking and multi-modal dialogue capabilities.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part


Introduction to Amazon Nova 2 Sonic: A Speech-to-Speech Foundation Model for Real-Time Conversational AI

Before we get started, let me introduce myself. I'm Veerdhawal Pande, Principal Product Manager for Amazon Nova Sonic as part of Amazon General Intelligence, and I'm Ankur Gandhe. I'm a Principal Scientist also working in AGI and focusing on the Nova Sonic models.


Today we are gathered here to talk about Amazon Nova 2 Sonic, our speech-to-speech foundation model for real-time humanlike conversational AI. Before we dive deeper, let's quickly go through the different topics that we intend to cover. First, we'll be talking about the product itself and the core features that we have enabled, key benchmarks, speech-to-speech architecture, and major use cases that our customers are using Nova Sonic for. Later on we'll cover developer tools which make it easy for you to integrate Amazon Nova Sonic, and then we have a special guest today who will give us a demo and speak to the product they are building with Amazon Nova Sonic. After that we'll open the floor for question answering.


With that, let's get started with Amazon Nova 2 Sonic. At Amazon, we are at the forefront of building voice-enabled technologies for conversational AI, whether it's for Alexa or other services such as Amazon Connect for customer service call automation. Yesterday we heard Matt Garman speak in his keynote about Amazon Nova 2 Sonic. We launched this model which enables real-time humanlike conversational AI. It's our second generation of model which went into general availability yesterday on December 2nd. It supports speech and text input and output modalities, and it's backed by a bidirectional streaming API as part of Bedrock to enable low latency conversations. It's available in four regions: IAD, PDX, ARN, and NRT with more regions coming soon.


Core Conversational AI Features and What's New in Nova 2 Sonic

The most important thing here is that this model not only enables smooth conversational AI, but it also offers today's industry's most competitive quality and price performance offering. Let's look deeper into some of the conversational AI features. Accurate speech understanding is the foremost and most fundamental thing for any kind of conversational AI application. We are proud to say that Amazon Nova 2 Sonic delivers best-in-class speech understanding with robustness to different speaking styles, accents, and even noise in the background.

As I mentioned, it supports bidirectional streaming API with low user-perceived latency. This is critical for us to not only process streaming audio at speed, but also generate responses which seem natural and fluid in conversations. The model is capable of handling fluid dialogues not only when it comes back with its responses, but even when users interrupt at will, take pauses, or hesitate. These nonverbal cues tell the model to let users finish their thought before starting to respond.

Another key aspect of this model is its ability to detect not just what the user said, but how they expressed it—the tonality, the voice-based sentiment—and adapt its own responses based on that. This unlocks the ability to empathize with users when they express frustration or elate when they express happiness and mirror the overall sentiment of the user. However, any conversational application is incomplete for enterprise use cases unless it can actually be used to make and complete tasks. That's why the model also delivers tool calling, knowledge grounding with responsible AI, and privacy compliance. What this means is you can generate responses backed by a knowledge repository to be factually correct or invoke other tools to accomplish tasks such as making a reservation or upgrading your membership.


Now, most of what I spoke about until now was also part of Nova Sonic One. So what's new? Let's take a look at what are the new conversational AI capabilities that we have unlocked with this launch of Nova 2 Sonic. First, additional languages. We initially started off with English back in April, added a few more languages in the July and August timeline, but now I'm proud to say we have seven languages, each with its own masculine and feminine voice supported by Nova 2 Sonic. We will dive into more details with each of those.

Next, we have language switching, the ability to not just speak in one language, but switch languages at will within the same session, and the AI can come back and respond in the user's preferred language. Another cool feature is asynchronous task completion. As we just talked about tool calling, APIs can take anywhere from 8 to 10 seconds or even longer. Instead of blocking on those tool responses, the Nova 2 Sonic model enables conversations fluidly where the user can switch topics and continue to talk or even invoke other tools while the previous tool call is working in the background.

Next, we have enabled cross-modal input output. This was a key ask from customers who started to adopt Nova Sonic earlier, where they would like the same model to be able to handle text and speech input and keep the conversational context together. This unlocks new features which we'll dive deeper into. And then of course, turn taking controllability. This refers to the ability for the developer to configure the duration of the pause that the user can take before the AI determines it's an end of a turn. Let's dive deeper into each one of these.


Deep Dive into New Capabilities: Multilingual Support, Language Switching, Asynchronous Tasks, and Turn-Taking Control

As I mentioned, we have now launched 7 languages: English with all different variants US, GB, India, and Australian. Indian and Australian English are the new additions with Nova 2 Sonic. We also have Spanish, French, Italian, German, Hindi, and Portuguese. Now, as I mentioned, each of these languages not only depicts the ability to understand these different languages and sub-dialects, but also the ability to respond in expressive native sounding voices. Let's take a moment to listen to one of these examples.

Let me play what we have here for US English masculine voice, Matthew. Welcome to Any Bank. I'm your AI assistant and I'm excited to help you with anything you need. Want to check your accounts, learn about loan options, or explore some investment ideas? Just let me know. I'm here to make your banking experience easier and a whole lot more enjoyable. Okay, let's try some new voices here. Welcome to Any Bank. I'm your AI assistant and I'm excited to help you with anything you need. Want to check your accounts, learn about loan options, or explore some investment ideas? Just let me know. I'm here to make your banking experience easier and a whole lot more enjoyable. That was Australian English.


Any Bank may apka swagat hai. Main apka AI assistant hoon aur apki madad karne mein mujhe khushi hogi. Chahe aap apne account dekna chahte hain, kisi loan ke bare mein janna chahte hain. That was male for Hindi. And then lastly, let me try one for Portuguese. Bem-vindo ao Any Bank. Sou seu assistente de IA e estou muito feliz em ajudá-lo com qualquer coisa que você precise. Quer verificar suas contas, aprender sobre opções de empréstimo ou explorar algumas ideias de investimento? Apenas me avise. Estou aqui para tornar sua experiência bancária mais fácil e muito mais agradável. Welcome to Any Bank. I'm your AI assistant and I'm excited to help you with anything you need. Want to check your accounts, learn about loan options, or explore some investment ideas? Just let me know. I'm here to make your banking experience easier and a whole lot more enjoyable. Okay, lesson learned. Let's take a moment.


Thank you. That was quite a lot of expressive voices. Coming to language switching, it's not only important to be able to speak expressively and with native sounding voices in a given language, but also the ability to switch within the same session and speak different languages with the same voice. And that's what we have unlocked with the polyglot voices and language switching feature. Let's take a moment to see how the user was able to interact in two different languages and the voice stayed the same while responding expressively in either. Hey, how are you? Hey there, I'm doing great. Thanks for asking. How about you? What can I help you with today? I want to buy shoes. That's an excellent idea. Good shoes make everything better. You'll want to think about the style you need and make sure they fit comfortably. Happy shopping.

Ich will Turnschuhe kaufen. Das ist eine ausgezeichnete Idee. Gute Schuhe machen alles besser. Sie müssen über den Stil nachdenken, den Sie benötigen, und sicherstellen, dass sie bequem sitzen. Viel Spaß beim Einkaufen.


As you saw, the user was able to switch between two different languages within the same session, and the voice was able to adapt to either and respond back. I mentioned earlier a new feature called asynchronous task completion. As you will see in this demo, the user invokes a request which triggers a tool call in the background. While that tool call is happening, the user is able to switch the context and the topic and have a conversation fluidly while the AI comes back and responds to the previous tool call.


While you are looking into that, can you also let me know the closest AnyBank Home Lending branch? And maybe also let me know what will happen with my escrow account. All right, Osmond, let me give you the info you asked for. First, the branch. Your closest AnyBank Home Lending branch is at 240 Greenwich Street, New York, New York, 10007. They're open Monday through Friday from 9:00 a.m. to 6:00 p.m. Their main phone number is 212-232-1420.


Now, regarding your escrow account after the transfer, I'll let you know that after I share the refinance status details, which I'm pulling up right now. OK, Osmond, here's the update on your refinance and your escrow account. Your refinance application for your Palm Beach property is moving along. It's a 15-year fixed at 6.125%, and the good news is it's still pending, but we do need some documents. As you see, the user was asking about refinancing while the tool was looking up details on that. He was able to switch topics and ask about escrow and other things, and the model did not lose the context when the tool call came back. It was able to weave the response back and generate an effective response.


Next feature that we want to highlight here is cross-modal speech and text input output. As you can imagine, a lot of enterprises would like to start off with a text bot and then add the voice interface to their text bot. When users use this kind of application from the mobile phone or the browser, they want to have the convenience of typing in their text input and with the press of a microphone button switch to voice. That's the convenience the cross-modal speech and text input enables. The demo here shows how the user was able to enter the date in text format. The model picks up that date and comes back with a speech response, after which the user continues talking through their voice.

One key aspect of this feature is it's not only unlocking the cross-modal text input output for speech and text, but it also enables new features. Customers would like to use Sonic to make outbound calls, and when they do so they encounter IVR applications where they have to navigate DTMF tones. Some customers when making inputs maybe want to punch in their credit card information with their phone, which also means the model has to interpret the DTMF tones. Now with this technology, we are able to convert those into metadata, a text metadata that the model can interpret and therefore give the convenience to either use voice or keyboard input to enter their information.


Another application of this feature that customers have used is the initial greeting. You could now give a text prompt and generate a welcome greeting at the very beginning of the session to welcome or start the conversation. So Osmond, can you confirm your date of birth for me? OK, Osmond, I've got January 1st, 1990. Is that correct? No, no, it's not. Got it. Thanks for clarifying. So, to confirm, what's your correct date of birth? OK, so I've got January 1st, 1980 as your date of birth. Is that right? Yes.


OK. Lastly, we talk about turn taking controllability. We have experienced that there are customers who would like very quick turn taking, for example, a gaming application.

There are also non-native speakers who are learning a new language, or certain cohorts of customers like the elderly, who may take a little longer to complete their thought or pause in between. What our customers asked for was the ability to configure the duration of the pause that the user can take before the AI determines the end of the turn. With that in mind, we've launched the turn taking controllability feature, which gives you three sensitivity settings. The highest sensitivity, which is the default setting, allows a short pause of 1.5 seconds, and the lowest sensitivity can go up to 2 seconds, which allows users to take a longer pause and be able to complete their thought.


Let's take a quick look at how this plays out without actually cutting off the user's speech input. In the first example, the user says: "I'm planning to visit San Diego next month for a vacation with the whole family. We have 2 kids with us, so what places would you recommend us to visit in San Diego?" The AI responds: "San Diego in December is wonderful for family trips with mild weather and many fun activities." Let's compare this with the other one. The user says: "I'm planning to visit San Diego next month for a vacation with the whole family. We have 2 kids with us, so what places would you recommend us to visit in San Diego? Oh, by the way, what will the weather be like there in December?" The AI responds: "For your family trip to San Diego next month, here are some top recommendations. Must visit spots include the San Diego Zoo, Legoland California, and SeaWorld San Diego."

Basically, it's showing more robustness to a longer pause. The response was more relevant to the user's query. They were really inquiring about the weather conditions and ideas for places to visit next month in December, but in the previous example they were cut off early, so the response they got was not as relevant. This gives you more flexibility in how you can tailor our turn taking behavior of the AI to suit different customer needs.

Performance Benchmarks: Speech Understanding, Reasoning, Tool Calling, and Conversation Quality

With that, let me call upon Ankur, who's going to walk us through some of the key benchmarks. Thank you, Veerdhawal. I hope you're all excited about the new features that we're launching with Nova 2 Sonic. But it's not just the new features; we're also improving the core capabilities of the model. One feedback that we got from our customers was that the ASR, or speech understanding, of Sonic was really great, but there was still room for improvement. There was room to improve in terms of noisy conditions and especially on things like dates and numbers or alphanumerics, which our customers use often to get date of birth or account numbers and things like that.


With Nova 2 Sonic, we have improved on both. Compared to all the other models that we have evaluated, Nova 2 Sonic is state of the art in terms of ASR recognition. On alphanumerics, we've improved over Sonic 1 by about 50%. It's not just the ASR. We've also improved reasoning, and compared to all the other speech to speech models that we compared against, Sonic has one of the best speech to reasoning capabilities.


Voice agents are not just for chit chatting or asking questions. They also need to take some action based upon whatever the user is going to ask the system to do, which is why tool calling is a critical functionality that we measure Sonic against. This includes both simple tasks like having a few tools to call, versus complex tasks where you might want to call multiple tools to accomplish a task, and the asynchronous tool calling functionality that Veerdhawal was just mentioning. Based on the key use cases that our customers had with Sonic 1, we improved our model with better training and better evaluations. Across multiple benchmarks, we showed that Nova 2 Sonic has significantly improved over Sonic 1.

For a great experience with Sonic, it's important to also evaluate the overall conversation quality. For every single language that we evaluate on, we made sure that the quality of conversation was better than everybody else. We took human evaluators to compare our voice and our conversations against similar models, and Sonic won every single time. Sonic is not just highly capable in doing speech understanding and doing tasks and functionality; it also sounds more expressive and natural, providing overall a great experience for anybody who is using Sonic or listening to Sonic.


Why Build Nova Sonic? The Unified Speech-to-Speech Architecture Advantage Over Traditional Cascade Systems

Now, I want to take a step back to ask a question which many people might be asking.

Why did we build Nova Sonic? If you have interacted with Alexa, Siri, or even more recent customer service calls, you have probably already spoken to a voice agent that already exists in the world. So why did we build Nova Sonic?


Typical voice assistants are built in a cascade system. They have a speech-to-text model that converts speech to text. That text is then fed into a text reasoning model that performs all the reasoning, agentic workflows, and tool calling, and it generates a text response which is then sent to a text-to-speech model. I spent most of my career working on Alexa trying to improve this entire pipeline. These systems work great. In the past decade, they have made major improvements in terms of architecture and training to improve the reasoning abilities and to improve the accuracy of these models into the high 90s. So why build Nova Sonic?


We realized that these traditional voice AI systems have a constraint: they are fragmented. Each model is designed to do a different task. Because each component is working on its own, it creates a bottleneck. The systems cannot talk to each other without this text. Without this common brain, you are unable to carry over context beyond what is feasible by the limitations of the system.


We have tried to address this. We have tried to build orchestration around this for something like error recovery. We have tried to build error recovery systems that detect when the automatic speech recognition was wrong and then ask clarification questions. We have tried to build personalization that feeds into each and every single component of this pipeline so that we get better speech recognition, better pronunciations, and better understanding. We have also tried to add more contextualization by carrying context through these different models.

The problem is that this requires a lot of orchestration designed by engineers and scientists, and a lot of handholding and building feature by feature. It is not scalable. We also realized that even though we are building all these things, there are a lot of edge cases that could not be solved just due to the limitations of this architecture. When foundation models came out, we saw something that could change this. We could get rid of the constraints that were in this cascade system and truly build a conversational model that was like a human assistant.


That is what Nova Sonic is. It is a foundational speech-to-speech model that understands and generates speech at the same time, preserving tone, prosody, and conversational flow in a single dialogue, resulting in a more natural interaction. We did this by training on millions of hours of data, learning from human-human interactions and human-agent interactions. The model learns on its own how to act in different conditions, different scenarios, and different acoustic conditions. Because our training data is tailored towards specific use cases, the model naturally learns how to do things like context carryover, how to do error handling, and all these improvements would not have been possible without this architectural change.


The numbers shown earlier on speech understanding accuracy, on tool calling, and on reasoning are a result of this unified architecture. Now customers do not need to build this complex orchestration trying to do everything. With this unified architecture as its backbone, Nova Sonic is able to naturally carry context of the entire conversation. Things like personalization through retrieval-augmented generation or through prompting or even through document upload can happen easily, and it is available to all the components of the model, not just a single one.

Another piece of feedback that we always get with voice assistants is that high latency is very annoying to somebody who is used to talking to human assistants. With this unified model, it minimizes the time that it takes to understand what the user said, reason about it, and then respond. This significantly reduces the user's perceived latency. The foundation model has been trained on millions of hours of data, so it has learned naturally how to understand the nuances of speech, something that would not be possible to encode in just text or some other binary functions.

When you talk, it is not just about what you are talking about, but also how you are talking—the tone and the intonation of your voice. With Nova Sonic, it learns to adapt its responses based on the prosody and the emotion of the user. We also built in this moral responsibility so that the entire conversation in both voice and in text stays consistent. With every generation of the speech-to-speech model, all the abilities I was talking about will continue to improve, which is why now is the time to build with these models and give your customers a much better experience.

I think I've spoken enough about the benefits. I'll head back to V, so he can talk about real customer use cases and how customers are actually using Sonic. Thank you.

Customer Use Cases and Developer Integration Tools: From Amazon Connect to Telephony Partners

So as you can imagine, there are a lot of different use cases that have emerged, but the three most primary use cases are self-service voice-first customer service automation on interactions, voice-enabled personal assistants, and education and language learning apps, which lend themselves very well for the application of this kind of speech-to-speech technology. We have key customers who are building applications and productionizing workloads with this, including Crescendo, iFrame, Rujum, Cisco, Ring, and Amazon Connect. Here's a testimonial from the CTO of Rujum highlighting the accuracy of understanding speech, the ability to handle dialogues fluently with low latency, and overall deliver a better human-like customer experience.


One thing that we learned very quickly after the launch of Sonic 1.0 was that it's not just enough to deploy a model in Bedrock and provide bidirectional streaming APIs. There's more nuance to building these kind of real-time applications, and that required developers in particular to handle all the input output signals and manage the session life cycle. This turned out to be a last mile problem where really deploying this and scaling to their applications took much effort on their behalf.


Therefore, we have partnered with a couple of different teams here which I want to walk through who've made this integration seamless. Let's start with LiveKit and Pipecat. As I mentioned, these are frameworks available for conversational AI which handle all of the input output signaling. They abstract these so that it's easier for developers to directly integrate their mobile apps and browser applications with Sonic on Bedrock while also managing the session life cycle. Typically Bedrock supports 8 minutes of a session lifetime, but with these frameworks developers would like to extend this and carry forward the conversational context to extend the overall cycle.


Next we have Amazon Connect. More recently, as of yesterday in fact, Amazon Connect has now gone into general availability with Nova Sonic to enable customer service call automation. This not only brings the expressivity and fluid dialogue handling that we just talked about of Sonic, but it also unlocks the agentic capabilities of Connect AI agents and different AI tools that they bring for customer service call automation. This integration is now available in general availability in IAD and PDX regions supporting at least three languages: English, Spanish, and Portuguese, with more coming in Q1.

And lastly, not all enterprises have a customer service contact support center. Maybe there are small business owners who would just like to enable an outbound call mechanism to reach out to their customers, or while their shop is closed, they may want to receive phone calls on their behalf and take orders, which is where telephony integration is key. Therefore, we have partnered with the likes of AudioCodes, Vonage, and Twilio, which make it seamless and easy for customers to not only integrate through the telephony channel but also extend through other ways such as SIP telephony, which allows you to connect internet telephones through your browser or mobile app to talk to Sonic and get the same sort of conversational expressivity and dialogue that you get directly integrating to Bedrock.


Partner Demo: Cisco's AI Receptionist Solution Powered by Nova Sonic for Enterprise Voice Workloads

With that, it's time to see a partner demo. And it's my pleasure to welcome Amma Pandekar, who is the VP of product management at Cisco Calling, who would like to speak about what he's doing with Nova Sonic. Thanks for having me. You folks having a good show so far? It's amazing to be here and we have a great partnership with AWS over the years. I'm Amma Pandekar. I work for Cisco as a lead product manager for our voice platform that powers our calling, our contact center, and so on. Just by a quick show of hands, how many of you deal with voice workloads or manage them? Quite a sizable number. But how many of you work with customer service enablement type roles? So what's happening is we're here to talk about how AI is changing the overall voice-enabled workload ecosystem.

Cisco is a big networking powerhouse with networking, security devices, and collaboration all together. We serve the largest financial service houses, including all of the Wall Street banks. The largest development bank in the world is our customer, and we have large retailers including Amazon on our platform. All of the pharmacies and most governments across the world, including the US government, use our services. Healthcare is huge for us as well. We have over 50 million users on our platform and enable about 1.5 billion minutes in the AI assistant that we have built on the Amazon platform.


On the other side, we also have Apex Contact Center, which is a contact center like Amazon Connect that powers customer care scenarios for companies like United Airlines and Maersk. The key is that this is available globally in all markets on the AWS platform. We have 195 markets today across the world with a 99.999% SLA. This has to be super critical for large customers, particularly when it comes to voice.

I'm really excited to show you what we have built in partnership with the Nova Sonic team for an AI receptionist use case. Previously, all of these workloads were hard to deploy and manage. For example, if you have a bank and after 5 o'clock there's nobody there to answer the phone, what are you going to do? You don't have staff in those locations. If you want to call a pharmacy and update your prescription or do a refill after 5 p.m., there's nobody available. That's one case where customers want to do more with less, so they're using tools like AI receptionists for that.

The other scenario we're seeing is cyclical demand during a short period of time. Tax season is a typical example. You want people to be able to answer questions, answer customer calls, book appointments using the tool handling we discussed, and even transfer calls to people who might be able to assist with all the information. How many of you have started a call and had to repeat your information three times over when you call customer service? For me, it starts with my last name, which is complex, so I have to go with P for Pennsylvania and A for blood, and by the end of it, I'm so tired. Things like multimodal input give you the ability to text it or put it on a website, so you don't have to specify or call things out.


Lastly, what these tools are doing is helping customers prioritize their employee time for high-value customer interactions so that you can provide really good customer service. Let me show you a demo of how this works. But before that, very quickly, if you call any business, we have a receptionist that answers the phone powered using the languages and voices that the platform has. We train it with a knowledge base. For example, if you're a pharmacy, you can take all of your pharmacy information and input that to create a knowledge base. It can answer questions about medicines, products, hours, and more.

Thirdly, it can take actions. For example, you can book an appointment or schedule an appointment. You can look up your data from a CRM system and know your account information. When you need help, you can say, "Hey, I need to book an appointment with the doctor," for example. If that's a complex task, then it transfers to a live person, so it's human assisted. That person already has the knowledge that was told to the AI receptionist, so you don't have to repeat yourself.

I'm going to show this to you in action to make it real. This is a demo we built, and we're actually building it for a customer. I cannot name them, but they're a large tire chain in the US with a lot of tire stores across the West Coast and the Midwest. They have thousands of tire chain stores but don't have staff to manage all of those and answer those calls. What they want to do is automate 50 percent of the questions and answer them and even do the appointment handling in an automated way so that the technicians can focus on actually fixing tires and car problems. That is their goal, and we built this with them in mind.


Let me explain this to you. Welcome to Technology Tires Emergency Service. I can help you in Spanish, English, French, or Hindi for your tire emergency. Speak in your language of preference. How can I assist you today? Hey, I have a tire blowout and I need a complete replacement. I'm sorry to hear about your tire blowout. I can help you get that sorted right away. To get started, could you please share your phone number with me? This will let me look up your vehicle details and account information.


Great, thank you for sharing that. I found your account, Amey. You have a 2024 BMW X4. Since you mentioned a tire blowout and need a complete replacement, let me present the tire options available for your vehicle. Just a moment while I pull up the options for you. Perfect. I have the tire options for your 2024 BMW X4. Here are your three options for a complete tire replacement. Option one, Michelin Pilot Sport AS4, premium performance tire with excellent grip and handling. Option 2, Pirelli Cinturato. I'll take the Michelin Pilot, please. Excellent choice. The Michelin Pilot Sport AS4 is a great option. To schedule your tire replacement, I'll need two more details. One, your current location where you'd like the service performed. Two, your preferred date and time for the appointment. Could you share those details with me?


I'm at 67 Broadway, Las Vegas, Nevada, and I'd like the technician to be here somewhere between 11:00 a.m. and 12 noon. Just a moment. I'm getting that scheduled for you now. While you're doing that, do you have any towing services in case you're not able to fix it here? Yes, we do offer towing services if tire repair is not successful, or your vehicle needs to be moved immediately to avoid a potential hazard. Your appointment is now confirmed. Your appointment number is tire 260715. A technician will arrive at 67 Broadway, Las Vegas on November 26th between 10:00 a.m. and 10 p.m. to replace the tire on your 2024 BMW X4 with the Michelin Pilot Sport AS4. You'll receive a confirmation call 30 minutes before arrival. Please make sure your vehicle is accessible at that location. Is there anything else I can help you with today?

Thumbnail 2280

No, that's all. You've been great. Thank you. You're very welcome, Amey. I'm glad I could help. Your technician will arrive shortly and we'll get that tire replaced quickly. If you need any further assistance, don't hesitate to call us. Have a safe day. What do you think of that? You know, the thing I like best about that demo is that it got the pronunciation of my name right. My name is a little complex; I mean, Amey, that's simple, but most of the time it gets pronounced as Amy. So you can even train these models on the correct pronunciations and phonetics to power a great customer experience.

A few things I'll call out that we talked about and used here. One was that we use multimodal dialogue, so callers can enter a phone number, specify an address, or provide a credit card number, all of that very easily. The second is barge-in, which means the caller doesn't have to wait for all the information before talking back; the model recognizes the interruption and responds appropriately. So this is actually real, and we are planning to roll this out to our customers sometime early next calendar year.

Thumbnail 2410

You can see the power of this toolset, what it can do for customers, and how it can essentially revolutionize the voice interactions you have with your employees or your customers on a daily basis. I want to thank Veerdhawal and the Ankur team for letting me be part of this, and thank you for being at re:Invent. I think it's on to questions next, right? Thank you, man. Thank you. All right, that brings us to the end of our session, but I want to point out a few links and open up the floor for questions. Here's a link to the user guide and our technical report, which has all the benchmarks that we talked about. The user guide has code samples and other details on how you can prompt the model to make it work for different use cases. I also want to point out two other sessions related to Sonic that the team is doing to give you hands-on experience building applications. Thank you so much, everyone. Thanks for coming in.

; This article is entirely auto-generated using Amazon Bedrock.

AWS re:Invent 2025 - Building scalable applications with text and multimodal understanding (AIM375)

2025-12-06 09:41:13

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Building scalable applications with text and multimodal understanding (AIM375)

In this video, Amazon AGI introduces Amazon Nova 2.0 multimodal foundation models that process text, images, videos, audio, and speech natively. The session covers three key areas: document intelligence with optimized OCR and key information extraction, image and video understanding with temporal awareness and reasoning capabilities, and Amazon Nova Multimodal Embeddings for cross-modal search across all content types. Box's Tyan Hynes demonstrates real-world applications, including automated materials testing report analysis for engineering firms and continuity checks for production studios, showcasing how the 1 million token context window and native multimodal processing eliminate the need for separate models and manual annotation workflows.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Enterprise Challenges with Multimodal Data: The Untapped Potential

Good morning, everyone. First of all, thank you for being here. As a quick introduction, I'm Dinesh Rajput, a Principal Product Manager with Amazon AGI. AGI stands for Artificial General Intelligence, an organization within Amazon that builds first-party foundation models called Amazon Nova. Today's session will discuss how you can utilize data beyond just text, such as images, documents, videos, audio, and call recordings to build accurate, context-aware applications using Amazon Nova Foundation models. I'm also joined by my colleague Brandon Nair, who will discuss image and video understanding, and we have one of our customers, Tyan Hynes, who represents Box and will share how they use Amazon Nova models to improve their AI workflows.

Thumbnail 70

This is the broad agenda for today. First, we'll discuss the enterprise needs when it comes to multimodal data and the different challenges that customers face. Then we'll provide a quick overview of the Amazon Nova 2 models that we introduced yesterday. We'll do a deep dive on how we've optimized these models for document intelligence use cases as well as visual reasoning use cases. Finally, we'll discuss multimodal embeddings and how they can help you search and retrieve multimodal data across your enterprise, and we'll conclude with a customer success story of how Box is using Amazon Nova models to power its AI workflows.

Thumbnail 120

Let's talk about data first. Today, organizations have immense amounts of data: text, structured data, contracts in shared drives, videos, and call recordings of customers. However, if we're honest, we use a very small portion of that data today. It's mostly either text or the structured data you have in your database tables; this is practically all we use in most of our AI workflows. A lot of this other multimodal data goes unused and doesn't really contribute to our AI applications. Multimodal foundation models are changing that by letting you see what's inside an image, what's happening across the different frames of a video sequence, and what a customer is saying or feeling within a support call. You can use all of this data together, reason over it, and deliver customer insights and improve your AI workflows.

Thumbnail 190

We've been working with customers, and when they try to use multimodal data, there are three key challenges they face. First, there are many separate models and tools. You need one model to process text, another model to process structured data, a third model to process images, and then you might need a fourth model to process videos. This leads to multiple problems. You're forced to stitch together these different tools, which makes the entire process very costly and complicated. The second key problem is that because you have different models across different modalities, it's very difficult to bring all of that context together and reason over these modalities jointly to actually deliver customer insights. Think about combining what you read in a document with what you hear in a customer call to solve the real customer problem; with separate models per modality, that's not really feasible today. The third thing is that a lot of these models are not super accurate, which forces you to have a human in the loop. This manual checking doesn't scale, which leads to cost and efficiency issues when you're trying to deploy these AI workflows.

Thumbnail 300

Introducing Amazon Nova 2.0: Natively Multimodal Foundation Models

We launched the Amazon Nova 1.0 models at re:Invent last year. We've been working with many of our customers over the last year and have had an amazing response, with tens of thousands of customers using us. We've heard similar feedback from all of our customers, and to solve these customer challenges, we've introduced the Amazon Nova 2.0 models that were just launched yesterday. One of the fundamental design goals for the Amazon Nova 2.0 models is to treat every modality as a first-class citizen.

When we designed Amazon Nova 2.0 models, we made them natively multimodal, and they are able to process text, images, videos, audio, and speech, as well as generate text and images. We have a variety of models to cater to your different cost, latency, and accuracy profiles.

We have Amazon Nova 2 Lite, which is our fast, cost-effective reasoning model for most of your AI workloads. We have Nova 2 Pro, which is our most intelligent model for your highly complex tasks. We have Nova 2 Omni, which is our unified model that can not only understand text, images, videos, and audio, but can also generate text as well as images. We have Nova 2 Sonic, which is a conversational speech-to-speech low-latency model. Finally, we have Nova multimodal embeddings, which is a model that can create embeddings across all of your modalities so that you are able to implement any sort of search or retrieval on all of your enterprise data.

Thumbnail 380

Let me quickly go over some of the salient aspects of these models. First, we designed these models to have 1 million tokens of context window. To put that in perspective, it means the model can process 90 minutes of video, hours of audio, and hundreds of pages of documents all in one go. We have made these models multilingual in nature so they can process 200+ languages, and on the speech side we have optimized them for 10+ languages so that your solution truly scales globally.

One final aspect is that we have also included reasoning within these models. Nova Lite, Pro, and Omni all come with reasoning enabled so that you can reason over all of the data, whether it is text or multimodal data together. We have also optimized these models for tool calling so that you can implement agentic workflows using all of your multimodal data.
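
For developers, reasoning is typically something you switch on and budget per request. Below is a hedged sketch of what that could look like through the Bedrock Converse API; the additionalModelRequestFields pass-through is standard boto3, but the nested reasoning keys and the model ID are placeholder assumptions, so use the parameter names from the Nova documentation.

```python
# Hypothetical sketch: capping the reasoning budget through the Converse API's
# pass-through field. "reasoningConfig"/"budgetTokens" are placeholder names,
# not confirmed parameters; check the Nova 2 documentation for the real keys.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-2-pro-v1:0",  # assumption: replace with the real Nova 2 ID
    messages=[{
        "role": "user",
        "content": [{"text": "A forklift enters the frame at 00:12 and exits at 00:47 "
                             "moving at roughly 1.5 m/s. About how far did it travel, "
                             "and what does that imply about the size of the warehouse?"}],
    }],
    inferenceConfig={"maxTokens": 2048},
    # Model-specific knobs go through this pass-through; the field names below are assumed.
    additionalModelRequestFields={"reasoningConfig": {"type": "enabled", "budgetTokens": 1024}},
)

# With reasoning enabled, the final answer is typically the last text block.
print(response["output"]["message"]["content"][-1]["text"])
```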

Thumbnail 440

When launching these models, we made sure that they achieve frontier intelligence on multimodal tasks. Here are some of the benchmarks, mostly multimodal ones. MMMU Pro is a complex visual reasoning task on images. We have document OCR, whether you are doing optical character recognition or extracting text from handwritten pages, slides, and so on. We have RealKIE, which is key information extraction for tables where you want to extract values. We have QVHighlights, which is our video understanding benchmark, and then we have ScreenSpot, which covers UI and browser tasks.

Thumbnail 490

Thumbnail 510

As you can see, Nova Lite and Omni perform at the frontier compared to their peers in this category. Nova 2 Pro is also an extremely competitive model that we have introduced in preview. We will potentially GA this, and it also shows extremely competitive performance on multimodal tasks compared to its peers.

Document Intelligence Deep Dive: OCR and Key Information Extraction

Now let us get into a deep dive around document intelligence. When it comes to document intelligence, we have heard from a lot of our customers that the two primitives that they care about are optical character recognition and key information extraction. We have made sure that Amazon Nova multimodal models deliver absolutely the best performance on these two key primitives so that developers as well as enterprises can build on top of them to enable their different AI workflows.

Thumbnail 550

Here is a quick example of a Nova output from an OCR task. As you can see, we have optimized these models around three key parameters. One is robust real-world OCR. What I mean by robust real-world OCR is that a lot of documents you get in enterprises may be handwritten. The scan quality is not always great. Sometimes the documents are a little tilted, so we have made sure that we have optimized around all of these real-world scenarios so that the information you are able to extract from documents is super accurate and requires minimal additional work.

Second is mixed context understanding. It is very simple for models to just extract a chart or a table or text separately, but most of our real-world documents have all of these things put together. We have made sure that the model also excels in understanding these different sets of context, whether it is charts, tables, or any sort of text, so that you are able to deliver this performance together. Finally, we have structured output.

We've also made sure that our model is able to produce the right structured output whether it's in JSON, HTML, or XML format, so that it's machine readable. You can then extract all of this and put it in your databases so that you can further process it. These are the three key things that we made sure our models really excel at.
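
As an illustration of what that looks like in practice, here is a minimal sketch of calling a Nova model through the Bedrock Converse API with a scanned document and asking for structured JSON back. The Converse request shape and the document content block are standard boto3 bedrock-runtime usage, but the model ID, file name, and prompt are placeholder assumptions rather than anything shown in the session, so check the Nova user guide for the exact identifiers available in your region.

```python
# Hypothetical sketch: native OCR with structured JSON output via the Bedrock
# Converse API. Model ID, file name, and prompt are illustrative assumptions.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "us.amazon.nova-2-lite-v1:0"  # assumption: use the Nova 2 Lite ID listed in your region

with open("scanned_invoice.pdf", "rb") as f:
    doc_bytes = f.read()

response = client.converse(
    modelId=MODEL_ID,
    messages=[{
        "role": "user",
        "content": [
            # The raw document goes in as-is; the model performs OCR natively.
            {"document": {"format": "pdf", "name": "invoice",
                          "source": {"bytes": doc_bytes}}},
            {"text": "Extract every line item from this invoice. Return only JSON "
                     "with the keys vendor, invoice_date, and line_items "
                     "(description, quantity, unit_price)."},
        ],
    }],
    inferenceConfig={"maxTokens": 2048, "temperature": 0},
)

# The text block should now be machine-readable JSON ready for a database.
print(json.loads(response["output"]["message"]["content"][0]["text"]))
```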

Thumbnail 630

The second task is key information extraction. Here we have tried to optimize the models around three key things. First is schema-driven extraction. Any sort of schema that you specify to the model—maybe you want to extract only five rows, or maybe you want to extract all the rows, or maybe you want to label certain rows in a very specific manner—we have made sure that the model does instruction following around what schema you want to define, what sort of indentation you want, and what output you want.

Second is layout-aware text extraction. We have seen in a lot of use cases that the layouts, especially in complex layouts, require a lot of interpretation that the model needs to do. We have made sure that we are covering all of these long tail use cases of really complex real-world tables which require a lot of human interpretation to actually extract the data and make sense of it.

Finally, all of our models also have reasoning capabilities within them, so you could also extract all of this information from these documents and reason over these documents to see if the model has extracted it correctly or even fundamentally if the data is correct or not. As a quick example, you might extract a zip code from a document, but you need some model intelligence to know that a zip code should be five digits and not four digits, and that might be an error in the form. So you don't just want extraction, but you want intelligent extraction within those documents.
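
To make the idea of intelligent extraction concrete, here is a hedged sketch of schema-driven key information extraction followed by the five-digit zip code check mentioned above. The model ID, file names, and schema wording are assumptions for illustration; only the Converse call itself is the documented boto3 API.

```python
# Hypothetical sketch: schema-driven extraction plus a deterministic guardrail.
# Model ID, file names, and schema wording are illustrative assumptions.
import json
import re
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

SCHEMA_PROMPT = (
    "Extract the following fields from the attached form and return only JSON:\n"
    '{"applicant_name": string, "zip_code": string (exactly 5 digits), '
    '"loan_amount": number}'
)

with open("loan_application.pdf", "rb") as f:
    doc_bytes = f.read()

result = client.converse(
    modelId="us.amazon.nova-2-lite-v1:0",  # assumption
    messages=[{
        "role": "user",
        "content": [
            {"document": {"format": "pdf", "name": "loan application",
                          "source": {"bytes": doc_bytes}}},
            {"text": SCHEMA_PROMPT},
        ],
    }],
    inferenceConfig={"temperature": 0},
)

fields = json.loads(result["output"]["message"]["content"][0]["text"])

# Intelligent extraction still benefits from a simple check: a zip code that is
# not exactly five digits gets routed to a human instead of silently stored.
if not re.fullmatch(r"\d{5}", str(fields.get("zip_code", ""))):
    print("zip_code failed validation, flagging for review:", fields.get("zip_code"))
```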

Image and Video Understanding: Visual Perception, Reasoning, and Temporal Analysis

Thank you so much. I'm going to hand it over to my colleague Brandon, who will talk a little bit more about image as well as video understanding. Hey folks, how's it going? My name is Brandon Nair, and I am a Senior Product Manager on Amazon AGI, similar to the team that Dinesh is from.

Thumbnail 750

Dinesh has given us a pretty great deep dive on how Nova 2 models can be applied for understanding use cases, but I'd like to expand that a little bit further and showcase how those capabilities can also solve for image and video understanding use cases. Vision understanding is a foundational capability that enables language models to really understand the world not just in the form of pixels, but similar to the way that humans do—really understanding the context and the meaning behind those images and videos.

Nova 2 models process images as well as video content in their unstructured form: they process it natively and convert it into structured output, such as text written in JSON format. That text can be used by customers to easily search, categorize, and generate rich business insights from that visual content.

Some examples of what these insights could look like include identifying key objects in an image or a video, identifying and extracting text overlays in a video through optical character recognition, understanding temporal relationships or causal relationships within a video, or generating rich captions that can be used by customers to really understand what content is contained within that particular video asset.

Thumbnail 870

We are super excited about the Nova 2 models that have come out. From a vision understanding perspective, the improvements we are bringing with the Nova 2 models break down into three capabilities that we are trying to solve for based on our customer needs. The first is visual perception. The second is reasoning and scene semantics. The third is temporal understanding, or how time influences the understanding of a video. Let's dive through those one by one.

Thumbnail 910

Visual perception can be seen as the model's eyesight. It represents how adept a model is at taking in a particular image or video and understanding all the elements within those videos or images. This includes understanding attributes associated with each element, such as color, shape, and count, as well as understanding geospatial relationships between the different elements. Vision perception forms the basis of common computer vision tasks such as object detection.

Thumbnail 990

Object detection is a way that you prompt a model to identify a particular object and potentially extract the coordinates of that object that can be passed downstream into a system for further processing. An example of this could be logo detection, where you identify a logo and pass it downstream to do an ad attribution use case. In the example on screen, we have a standard living room, and I've prompted Nova 2 to identify and detect plants, cushions, table, and a TV. In the next slide, I've overlaid the bounding box coordinates corresponding to each of those particular objects.

As you can see, Nova has identified all the objects on the screen, even the little plants that are in the bookshelf in the back, and even the black TV, which is literally represented as a black rectangle. The model has an understanding of what to expect in a living room space and is able to detect that as a TV. The second finding from this particular image is how tight these bounding boxes are. This gives you a sense of how well the model is able to focus on the particular object that you are trying to detect, and it is a measure of how accurately the model is able to detect those objects.
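
Here is a hedged sketch of that prompting pattern: ask the model for labels plus bounding boxes as JSON, then overlay them on the image to inspect how tight they are. The model ID, the prompt wording, and the coordinate convention the model returns are assumptions; consult the Nova user guide for the recommended object detection prompt format.

```python
# Hypothetical sketch: prompt-based object detection with bounding boxes, then
# an overlay for visual inspection. Model ID, prompt, and the returned
# coordinate convention are illustrative assumptions.
import json
import boto3
from PIL import Image, ImageDraw

client = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("living_room.jpg", "rb") as f:
    image_bytes = f.read()

response = client.converse(
    modelId="us.amazon.nova-2-lite-v1:0",  # assumption
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
            {"text": "Detect every plant, cushion, table, and TV in this image. "
                     "Return only JSON: a list of {\"label\": ..., \"box\": [x1, y1, x2, y2]} "
                     "using pixel coordinates."},
        ],
    }],
    inferenceConfig={"temperature": 0},
)

detections = json.loads(response["output"]["message"]["content"][0]["text"])

# Draw the returned boxes so you can eyeball how tightly they hug each object.
img = Image.open("living_room.jpg")
draw = ImageDraw.Draw(img)
for det in detections:
    draw.rectangle(det["box"], outline="red", width=3)
    draw.text((det["box"][0], det["box"][1] - 12), det["label"], fill="red")
img.save("living_room_annotated.jpg")
```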

Thumbnail 1080

Visual perception is a mechanism to test how well the model can see, and this capability can be extended further to having the model generate captions of what it's actually seeing. In this example, it could be something like "modern living room" or "bright open spaces with natural sunlight coming through." The second capability I'd like to discuss is that each of the Nova 2 models has reasoning capabilities. This allows the model to extend its uses beyond just visual perception and identifying what it can see, to actually combining different elements to make a logical deduction or inference about what is happening.

Thumbnail 1130

The model consumes reasoning tokens in order to generate this kind of thinking, and within the API we provide parameters to developers that allow them to control the depth or budget of the degree of thinking in order to solve for their particular use case. It's another way to prioritize what makes the most sense for your use case. The third big upgrade that we have is around video and temporal understanding. Video understanding is a critical capability for a bunch of use cases, from media asset management to semantic search to personalization and recommendations to contextual ad placements to video question answering and detection. The list goes on.

Thumbnail 1190

The challenge, however, is that videos are a really complex asset or modality to deal with. As you can see, videos have frames, shots, scenes, and chapters. You may have audio transcripts in there, and you have textual screen overlays. It's a really complex multimodal asset. But this is compounded by the fact that we also need to consider the time dimension. Time is super important to understanding the context of a particular video, what's actually transpiring in a scene, and what's actually taking place. Current solutions have basically two options that they can utilize.

Thumbnail 1210

The first option is to have manual annotation of the videos. This involves a team of people watching a video and literally noting down metadata that is annotated to the video. Due to its manual nature, this approach is really unscalable. It depends on the depth at which someone takes down these annotations, and because there's a human in the loop, you get a degree of variability that can differ from teammate to teammate.

The second option that customers can pursue is to extract frames from the video and send those frames to a vision language model to generate metadata. However, this option too is flawed. Firstly, it requires customers to build out complex pipelines to preprocess their videos into frames and then integrate those frames with a vision language model. Secondly, because you are extracting frames and processing them separately, when you consider the sheer volume of video archives you might want to go through, this can become cost inefficient.

Thumbnail 1310

The third drawback of processing frames separately is that you lose temporal understanding, which is really important for understanding what's actually happening in a video. Nova 2 models support video natively. We've trained the model to understand the temporal aspects of video, so it understands what's happening across the sequence of frames and has a deeper understanding of what's happening in a particular video.

Thumbnail 1370

Thumbnail 1380

Temporal understanding is super important because with it, you also get the ability to ask the model to generate chapters where it could provide a description from time A to time B, indicating what has transpired within those two timeframes. Or you could ask it to process a video and extract the timestamps that correspond to a particular event that might be interesting to you. I have a demo here which showcases this. This is a video, a sped-up version of a documentary that features Werner Vogels, the CTO of AWS. In this video, we have a number of occurrences where someone is standing on a boat. There are actually four occurrences in this particular video.

Thumbnail 1390

Thumbnail 1400

Thumbnail 1410

I provided that video to the Nova 2 model and prompted it to extract each of the timestamps that correspond to when someone is standing on a boat. Nova 2 was able to identify all four of the particular events when someone was standing on a boat, and not only that, but it was also able to localize that timestamp to within one to two seconds of when the start and end time was, which is a pretty powerful capability for identifying different events happening within your video.
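
The pattern behind that demo looks roughly like the sketch below: hand the model a video and ask for the timestamps of a specific event as JSON. The model ID, S3 URI, and prompt wording are assumptions; the video content block with an S3 source is how the Converse API accepts video input for Nova models.

```python
# Hypothetical sketch: timestamp localization over a video supplied from S3.
# Model ID, bucket URI, and prompt are illustrative assumptions.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-2-lite-v1:0",  # assumption
    messages=[{
        "role": "user",
        "content": [
            {"video": {"format": "mp4",
                       "source": {"s3Location": {"uri": "s3://my-bucket/documentary.mp4"}}}},
            {"text": "List every segment where a person is standing on a boat. "
                     "Return only JSON: a list of {\"start\": \"MM:SS\", \"end\": \"MM:SS\", "
                     "\"description\": ...}."},
        ],
    }],
    inferenceConfig={"temperature": 0},
)

for event in json.loads(response["output"]["message"]["content"][0]["text"]):
    print(f"{event['start']} - {event['end']}: {event['description']}")
```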

Thumbnail 1430

Amazon Nova Multimodal Embeddings: Unified Representation Across All Content Types

Now I'm going to switch gears and talk about Amazon Nova Multimodal Embeddings. Amazon Nova Multimodal Embeddings is a separate model from the Nova 2 models. It takes in text, images, documents, video, and audio as inputs, and it outputs an embedding that represents whatever is passed in as input. Before we go any further, let's define what an embedding is. An embedding, simply put, is a representation of the input that you provided to the model.

Thumbnail 1460

You can think about this lovely picture of a Labrador. The Labrador is sitting on a beach with an ocean in the background and a blue handkerchief tied around its neck. All of those elements within that image are important to understand the overall context. When you convert this image into an embedding, you're really trying to represent it in a machine-readable format, which we call an embedding, that captures all of those intricate details and all of that information represented within the image. This is super important because it helps enable semantic search applications where you don't have to rely on metadata.

You can rely just on the embedding itself to retrieve the correct image. It also helps for RAG applications. As you're thinking about building out deep AI workflows that manage to retrieve important information that might be proprietary to your business, with an embedding model, you don't have to have the metadata. You can actually retrieve it just based on the embedding itself.
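
A tiny sketch of what metadata-free retrieval means in practice: once each asset has an embedding, a query embedding is ranked against them by cosine similarity and the closest asset wins, with no keywords or tags involved. The vectors below are random stand-ins for whatever an embedding model would actually return.

```python
# Minimal sketch: rank stored assets against a query purely by embedding
# similarity. Random vectors stand in for real model output.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)

# Pretend these embeddings were produced earlier, one per stored asset.
asset_embeddings = {
    "beach_labrador.jpg": rng.standard_normal(1024),
    "quarterly_report.pdf": rng.standard_normal(1024),
    "support_call_0423.wav": rng.standard_normal(1024),
}

query_embedding = rng.standard_normal(1024)  # e.g. the embedding of "dog on a beach"

ranked = sorted(asset_embeddings.items(),
                key=lambda item: cosine(query_embedding, item[1]),
                reverse=True)
print("best match:", ranked[0][0])
```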

Thumbnail 1540

Amazon Nova multimodal embeddings is a state-of-the-art embedding model that takes in text, images, documents, video, and audio and outputs an embedding. It is the first model in the industry to process all these different content types and to process them within the same embedding space, so it has the same level of understanding. In other words, if you have a text string of the word "dog," if you have an image of a dog, or if you have a video of a dog, they are all represented in the exact same way. This allows you to expand applications to cross-modal capabilities such as doing text plus image to image search or video to video search, or trying to retrieve visual documents that contain both text and images.
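
Here is a hedged sketch of what cross-modal usage could look like: embed a text query and an image with the same model, then compare them directly because they share one embedding space. The model ID and the request and response field names are assumptions (the exact schema lives in the Nova Multimodal Embeddings documentation); only the generic invoke_model call is standard boto3.

```python
# Hypothetical sketch: one embedding model, two modalities, one similarity score.
# Model ID and the request/response field names are assumed placeholders.
import base64
import json
import boto3
import numpy as np

client = boto3.client("bedrock-runtime", region_name="us-east-1")
EMBED_MODEL_ID = "amazon.nova-multimodal-embeddings-v1:0"  # assumption

def embed(payload: dict) -> np.ndarray:
    # invoke_model is standard boto3; the payload schema is a placeholder.
    response = client.invoke_model(modelId=EMBED_MODEL_ID, body=json.dumps(payload))
    return np.array(json.loads(response["body"].read())["embedding"])

text_vec = embed({"inputText": "a dog playing on the beach"})

with open("beach_labrador.jpg", "rb") as f:
    image_vec = embed({"inputImage": base64.b64encode(f.read()).decode()})

# Same embedding space, so a plain cosine similarity works across modalities:
# a text query can be matched against an image index and vice versa.
similarity = float(np.dot(text_vec, image_vec) /
                   (np.linalg.norm(text_vec) * np.linalg.norm(image_vec)))
print(f"text-to-image similarity: {similarity:.3f}")
```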

Thumbnail 1610

Nova Embeddings offers great price performance at approximately 10 to 20 percent lower cost than other alternative embedding models. Some of the key features of the embedding model include unmatched modality coverage. It also provides a very long context length of 8,000 tokens for an embedding model, which is a pretty high amount. This refers to how much context you can bake within a single embedding and still have a meaningful representation of your input. It also provides segmentation capabilities within the API, so if you have longer text, video, or audio, you can first split those into smaller, manageable pieces and then generate embeddings for each of those pieces.

The model comes equipped with both synchronous and asynchronous APIs. The synchronous API handles your latency-sensitive workflows where it might impact your user experience. You can think about something like someone searching for a document where you want that to be retrieved pretty quickly. It also supports asynchronous capabilities, so if you have a very large video file that you want to process and it's not latency-sensitive, you're able to pass it through the asynchronous API and get a response at a later time once the job has been completed.

Lastly, the model comes with a choice of four embedding dimensions, which gives you the option to trade off the level of compression within an embedding against your storage cost. The model is trained with Matryoshka representation learning, which essentially means that it bakes the most important context into the earlier embedding dimensions. So if you truncate the embedding from the native 3,000 dimensions, you can still maintain a very high degree of accuracy. In our benchmarks, we see a pretty minimal accuracy drop when we move from a 3,000-dimension embedding all the way down to a 256-dimension embedding.
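
A small sketch of that trade-off: keep only the leading dimensions of an embedding, re-normalize, and compare retrieval rankings at the full and truncated sizes. The dimension sizes are illustrative, and random vectors are used only to show the mechanics; the accuracy claim above applies to embeddings actually trained with Matryoshka representation learning.

```python
# Minimal sketch: truncate embeddings to fewer leading dimensions, re-normalize,
# and compare retrieval rankings. Sizes are illustrative; random vectors only
# demonstrate the mechanics, not the accuracy claim.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def truncate(vec: np.ndarray, dims: int) -> np.ndarray:
    head = vec[:dims]                   # keep the leading, most informative dimensions
    return head / np.linalg.norm(head)  # re-normalize so cosine comparisons stay fair

rng = np.random.default_rng(0)
query = rng.standard_normal(3072)
docs = [rng.standard_normal(3072) for _ in range(5)]

full_ranking = np.argsort([-cosine(query, d) for d in docs])
small_ranking = np.argsort([-cosine(truncate(query, 256), truncate(d, 256)) for d in docs])

print("ranking at 3072 dims:", full_ranking)
print("ranking at  256 dims:", small_ranking)
# With embeddings trained via Matryoshka representation learning, these two
# rankings stay close; the smaller vectors cut storage by roughly 12x here.
```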

Thumbnail 1740

On this slide, I also talked through some of the benchmarks that we've developed to compare Amazon Nova Multimodal Embeddings against alternative models. The first thing you should notice is that because Amazon Nova Multimodal Embeddings is so comprehensive, we had to pick specific models to compare it to, because other models tend to be more specialized toward a particular modality or domain, such as images, documents, or maybe just videos. As you can see from the slide, across video retrieval, visual documents, as well as text, the model delivers great accuracy on your retrieval tasks. It's a great model that we're really proud of, and we've started to get some great feedback over the past few weeks.

Box's Journey: Unlocking Unstructured Content with Nova Multimodal Embeddings

We're excited to have more folks test it out and give us a sense of how it is being applied within your business use cases. Now I will transition over to Tyan, who will take us through a bit more about how Amazon Nova multimodal embeddings is unlocking new use cases at Box. I don't think the click is working.

Thumbnail 1820

Welcome everybody. I'm excited to talk to you about how Box is specifically using these new models. For those of you who are not familiar with Box, we are the leading intelligent content management company. What does that mean? Companies trust us to store, manage, and share their information so that they have the ability to interact with that information, share it securely with other customers or with other people in their organization, and actually get really useful value out of that information. Over 115,000 organizations trust Box, and these are across many different industries, including highly regulated industries like government, finance, and life sciences.

Thumbnail 1870

Thumbnail 1890

Obviously, it's really critical for Box to be able to be secure about how we manage that content, but we also need to make sure that we're providing access to the full breadth of the content. There's a whole bunch of use cases that our customers have that they want to use this information for across a wide variety of different verticals and industries, all the way from digital asset management and insurance claims management to real product design and development. One of the big challenges with this kind of information is that most of it is in unstructured data. We're talking about PDFs, Word documents, Excel spreadsheets, all kinds of things, video files, and traditionally it's been really hard to get access to that information.

Thumbnail 1930

Previous paradigms have really focused on structured data and being able to do database queries. That's not super useful to a lot of our customers because so much of the information that they need is in that 90 percent of unstructured content. There's a lot of really untapped value that we want to unlock, and obviously AI is a huge way to do that. That's where Box AI comes in, and that's the particular product that I work on. This is the platform that we built that allows Box customers to use AI and apply it to their content. We've had things out for a few years now, but we're continually improving and making sure that we can provide even more value to that content, not just being able to find things and ask questions, but also take that information that you get out of that content and use it to create new assets or power workflows.

Thumbnail 1970

Thumbnail 2000

A couple of key use cases: the very first thing that people like to do is really just being able to generate instant insights from content. How many of you have had a time where you have a really long document that you probably need to read before a meeting and you just need to know a few things? You don't have time to read the 50-page document, so being able to just ask questions really quickly on that is super helpful. Once you have that information, you can do interesting things with it. For example, if you have a long document and you know that you need to be able to search for it later, you can extract that information and save it as metadata. You can save that back to the document, or you can take that information—for example, if you have a loan document—you can pull that information out and then take that metadata and put it in another system as well.

Thumbnail 2040

Thumbnail 2060

Really, we're able to take that information and spread it not just across the Box ecosystem but across your entire ecosystem so that you can get more value. Of course, the real thing we're focused on is using that information to automate workflows. It's great to be able to go and read through a document or look through a video and find the information you want, but what we really want to do is empower our customers to take that information and use it to power new processes so that people don't have to do that manual step and it just gets done automatically.

This is not a sales pitch. I just wanted to set some context for what Box is and why we care about this stuff. Let's actually get into the details of what we're trying to do. Let's go back to that 90 percent of content that is unstructured. That's a lot of content across many different file sizes and file types. In order to access all of that, we need to use RAG, or Retrieval Augmented Generation.

If we have a 10-hour video, we can't just plug that into a model and get the information. We need to use RAG to do that, especially when we're talking about comparing across different documents. The real challenge is that current models tend to just be very text-focused. There are some models that do images as well, and we talked a little bit earlier in the presentation about that.

Thumbnail 2120

But it's a real challenge to access anything beyond text. This is where we're trying to figure out how to solve this problem. For our Box customers, this is really important. There's a lot of content—think PDFs, CAD files, presentations—where we can access the text, but we can't necessarily access all the embedded images and charts. We're losing a lot of context because we're only looking at one dimension of that file. These kinds of files make up a huge percentage of the files that our customers store in Box, so we're losing a lot of information.

We also have a lot of video files, audio presentations, and PDFs. These tend to be disproportionately large files in Box. We actually have a customer with a video that is more than 10 years long. I don't quite know how they've managed that, but as a result, we have some very large files within Box that we have to manage. The only way to get value out of those really large files is using RAG.

Thumbnail 2190

We've been trying to solve this problem for a while. We've done a couple of different things. Obviously, the first thing is to convert audio from your video and audio files into text. There are very good existing models that can do that. I think this is what a lot of multimodal models do right now—they're really just extracting text and then doing the embedding on the text. That allows them to search across text files, audio files, and video files. The other thing is human annotation, which is a very traditional approach. This is what we did even before AI. If you wanted to get information out of a document, you hired a person, told them to find these things, and they would annotate it. We've tried both approaches, but there are some big limitations to those.

Thumbnail 2240

Especially when you're talking about human annotation, it's super difficult to scale. It's very expensive. You have to hire people who are experts in a particular field, so it's quite challenging. Both approaches are quite slow. Transcription is getting better and faster, but human annotation takes a really long amount of time. No matter what approach you use, your potential for search is really limited to whatever keywords you're extracting from human annotation or what is specifically in that transcript from the audio or video. As a result, really important context is lost as part of that process.

Thumbnail 2300

Up to this point, even though we've been trying to solve this particular problem, we haven't really found a great solution. The whole goal is to look at more than just text. We want to look at all the information within a document together at the same time. We want to look across all different file types. My team's goal is simple: if you store it in Box, you should be able to use AI on it. That's not just that you can use AI on a file—you should be able to use AI on the entire content of that file. That's what we're working towards.

Thumbnail 2330

This is where we get to the new Nova multimodal embeddings model, and this really has been a game changer for us. This is going to allow us to unlock a lot of what we've been trying to do. We finally have a single multimodal embeddings model that handles all content types, not just text. It's not pulling a transcript and doing it—it's actually pulling the text, the image, the video, everything from whatever that file is. So we have all that information, and that means we're not losing that critical information that might come from the non-text pieces in that file. For us, that is huge.

Thumbnail 2400

We can really look at documents with different kinds of file types within a document as well as looking across the entire depth and breadth of the files stored within Box. Of course, that unlocks a ton of new use cases that we just couldn't do before. It's an added bonus for us that it's very fast, very scalable, and very affordable, which is great for us.

Thumbnail 2410

Real-World Impact: Customer Success Stories and Future Directions

I'm going to give you some real examples of customers who have actually used this technology. We have a leading engineering and architecture firm that does a lot of materials testing. They receive these reports every month that are extremely long, around 80 pages, with lots of in-depth technical information about the results of all their tests. Previously, they had to hire someone at their company to go through and read the entire report, pulling out project-specific information for different projects to identify anything in those results that might need additional action. Extracting the information from the report was incredibly time-consuming.

Additionally, there are often attached videos or images. For example, if someone is trying to look at a particular room that's been built, they might want to take pictures of what things look like and then see if that actually matches the specific requirements they have. All of this required a human to do manually. However, with this new embeddings model, this customer can now create an agent that can go through all that information and pull out the insights they need in project-specific summaries as well as summaries for executives, along with a list of action items they need to take based on that report so they know what to do next. This is huge for them and saves days and days of work just by being able to do this.

Thumbnail 2510

Let me talk about a fun one too. We definitely have some really cool media and entertainment use cases. You can imagine they use a lot of videos and audio. We have a very large production studio that is a customer of ours, and one of the things that's a real challenge for them is continuity checks. When you're filming something, you have many people working on different sets at different times, which is quite complex. You might film a scene or a few scenes at one time with everything set up in your location, and then for whatever reason you might need to tear that set down, go somewhere else, and then come back to that location at a different time.

You need to set everything up to be exactly like it was before. In the past, that meant someone had to go through the video from that previous time they were on that set and look at where all the objects were and what all the context was. Then they would go and set the set up to match exactly what it was last time. This was super time-consuming, especially because depending on where your camera is, you can't just look at the last shot—you have to go and look at all the different shots so that you have the whole set set up the correct way. Now they can use this to search the video and really be able to find that complex information right down to finding where that coffee cup was located and what direction the particular writing was oriented so that they can get everything set up looking exactly like it was when they finished. This saves them hours and hours of work.

Thumbnail 2630

We're really excited about this one as well. What's next? We're doing a lot of really great stuff with the teams here at Amazon to really start getting this in the hands of our customers. We've proven that it works and solves our use cases. So now we really need to make sure that all our customers have access to this. That's a big effort for us now—to really start applying this embeddings model at scale in our production environment. We're also working on testing and integrating those Nova 2 models that were just announced. My team is actually actively working on testing them right now and looking at them on our use cases so we can see where they really do well.

Thumbnail 2720

One thing that's really great is we're now able to extend the kinds of things that Box AI can do to new use cases. Every time we have some new capability that comes out, there's always new use cases that come up that we just couldn't do before. So we want to be able to really start figuring out now that we have this new capability with these new embeddings models to be able to look at all this different content, what new kinds of things we can do. There's a lot of really active work happening right now to see what new use cases we can apply these models to. That's it. Thank you so much everybody for joining us. We really appreciate you coming out and listening. Please make sure that you complete the session survey—it's in the mobile app. For those of you who have questions, we'll be off to the side over here to answer questions for a few more minutes. Thanks so much everyone.

; This article is entirely auto-generated using Amazon Bedrock.