
AI Engineering: Advent of AI with goose Day 3

2025-12-11 06:15:06

Visualizing 5,294 Votes with AI (Auto-Visualiser + MCP-UI)

An emergency data viz for a Hot Cocoa Championship using goose's Auto-Visualiser extension. Tournament brackets, radar charts, and the magic of MCP-UI!

Day 3: The Hot Cocoa Championship Crisis ☕📊

The Urgent Email

It's December 3rd, and my inbox was piping hot:

URGENT - Hot Cocoa Championship Results Need Visualization

Hey there! The Hot Cocoa Championship was AMAZING but we need these results visualized for the big awards ceremony tomorrow. Our data person is sick and we're in a panic.

The ceremony is tomorrow at 2 PM and we need to print these for the big screen!

Sarah Chen

Winter Festival Coordinator

(Very Stressed)

Attached: A massive markdown file with tournament data, voting breakdowns, recipe scorecards, and 5,294 total votes.

Deadline: 18 hours.

My data viz experience: Minimal.

Panic level: Rising.

Enter the Auto-Visualiser

This is where I discovered one of the coolest AI tools I've ever used:
goose's Auto-Visualiser extension.

The Auto-Visualiser made charting effortless. I initially thought I would need to write an additional, detailed prompt, but I didn't: it automatically picked the right visualization to highlight the data it was given. I then asked goose to generate detailed HTML visuals and a summary. I also like goose's tone: friendly, yet professional.

Here's the magic: you paste your data, describe what you want, and the charts render directly in your conversation. No code. No exports. No separate tools.

It's powered by MCP-UI (Model Context Protocol UI), which returns interactive UI components instead of just text. Mind-blowing.

The Data

Sarah sent me everything:

Tournament Bracket

  • Quarterfinals: 4 matches
  • Semifinals: 2 matches
  • Championship: 1 final showdown
  • Winner: Dark Chocolate Decadence 🏆

Recipe Scorecards
8 recipes rated on:

  • Richness (0-10)
  • Sweetness (0-10)
  • Creativity (0-10)
  • Presentation (0-10)

Voting Breakdown

  • Period 1 (Morning): 1,247 votes
  • Period 2 (Afternoon): 1,891 votes
  • Period 3 (Evening): 2,156 votes
  • Total: 5,294 votes

Fun Stats

  • Closest match: Peppermint Dream vs Salted Caramel Swirl (14 vote difference!)
  • Biggest blowout: Dark Chocolate Decadence vs White Chocolate Wonder (73 votes)
  • Most controversial: Spicy Mexican Mocha

What I Created

🏆 Tournament Bracket Flow (Sankey Diagram)

The first visualization showed the complete tournament progression - how votes flowed from quarterfinals through semifinals to the championship.

Why Sankey? Perfect for showing how competitors advanced and where votes accumulated. You can literally see Dark Chocolate Decadence's dominance.

Key insights:

  • Peppermint Dream had the most votes in Round 1 (312!)
  • Dark Chocolate Decadence peaked in the finals (678 votes)
  • The semifinals had massive voter turnout

📊 Recipe Attribute Comparison (Radar Chart)

This was my favorite - an 8-way radar chart comparing all recipes across 4 attributes.

Visual patterns emerged:

  • Dark Chocolate Decadence: Perfect 10/10 richness and presentation
  • White Chocolate Wonder: Maxed sweetness but low everything else
  • Spicy Mexican Mocha: High creativity (9/10) but polarizing
  • Classic Swiss Velvet: Balanced across all attributes

The story: Dark Chocolate won because it excelled where it mattered (richness, presentation) while maintaining good creativity.

📈 Voting Trends Over Time

A line chart showing how voter engagement increased throughout the day:

  • Morning: 1,247 votes (people waking up)
  • Afternoon: 1,891 votes (+52% increase!)
  • Evening: 2,156 votes (peak engagement)

Insight: Evening voters decided the championship. Marketing lesson: timing matters!

🥊 Head-to-Head Matchup Analysis

Bar charts for each round showing vote distributions:

  • Round 1: 4 matches, clear winners
  • Round 2: Closer battles, higher stakes
  • Finals: The epic 678 vs 623 showdown

The nail-biter: Peppermint Dream vs Salted Caramel Swirl in Round 1 - only 14 votes separated them!

The AI Engineering Process

Here's what blew my mind: I didn't write visualization code. I had a conversation with Goose.

My Prompts:

"Create a tournament bracket showing how each recipe progressed 
through quarterfinals, semifinals, and the championship."

Result: Beautiful Sankey diagram, instantly rendered.

"Compare all 8 recipes on a radar chart using their judge scores 
for richness, sweetness, creativity, and presentation."

Result: Multi-series radar chart with color-coded recipes.

"Show voting trends across the three time periods."

Result: Line chart with clear trend visualization.

Charts Created: 6+
Crisis Averted: ✅

The Tech Behind the Magic

MCP-UI (Model Context Protocol UI)

Traditional AI outputs text. MCP-UI outputs interactive components:

Traditional: "Here's the data formatted as JSON..."
MCP-UI: [Renders actual interactive chart]

This is a paradigm shift in AI interfaces. Instead of describing visualizations, the AI creates them.
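
For the curious, here's a rough sketch of the kind of tool result an MCP-UI server can return. Everything in it is illustrative (the ui:// URI, the mimeType, and the HTML payload are made-up placeholders, not captured from goose); the point is that the result embeds a renderable UI resource rather than plain text:

{
  "type": "resource",
  "resource": {
    "uri": "ui://hot-cocoa/voting-trends",
    "mimeType": "text/html",
    "text": "<div id=\"voting-trends\"><!-- interactive chart markup rendered by the client --></div>"
  }
}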

Auto-Visualiser Extension

Built on MCP-UI, it:

  1. Parses your data (CSV, JSON, markdown, whatever)
  2. Understands your visualization request
  3. Chooses the appropriate chart type
  4. Renders it with proper styling
  5. Makes it interactive (hover, zoom, filter)

No configuration needed. Just describe what you want.

What I Learned

Data Storytelling > Raw Data

The tournament data was just numbers. The visualizations told a story:

  • Dark Chocolate's dominance
  • Peppermint's strong start
  • Evening voters' impact
  • Recipe attribute patterns

AI Understands Context

I didn't need to specify "use a Sankey diagram for tournament flow." Goose understood that tournament progression = flow visualization = Sankey. That's intelligence.

Speed is a Competitive Advantage

Traditional workflow:

  1. Export data to CSV
  2. Open Excel/Tableau/Python
  3. Clean data
  4. Choose chart type
  5. Configure styling
  6. Export image
  7. Repeat for each chart

Time: Hours

AI workflow:

  1. Paste data
  2. Describe what you want
  3. Done

Time: Minutes

Iteration is Effortless

"Make the radar chart bigger"

"Add a legend"

"Change colors to match the festival theme"

"Show only the top 4 recipes"

Each request took seconds. No re-coding, no re-exporting.

MCP-UI is the Future

This isn't just about charts. MCP-UI can render:

  • Interactive forms
  • Data tables
  • Maps
  • Dashboards
  • Custom UI components

We're moving from "AI that assists us in writing code" to "AI that creates interfaces."

Where These Skills Apply

These skills apply to:

  • Business reporting (quarterly results, KPIs)
  • Research presentations (academic papers, conferences)
  • Marketing analytics (campaign performance)
  • Emergency situations (like Sarah's crisis!)
  • Data exploration (understand patterns quickly)

The Results

Sarah got her visualizations with 17 hours to spare. The awards ceremony was a hit. Dark Chocolate Decadence got its moment in the spotlight. 🏆

More importantly, I learned that AI can democratize data visualization. You don't need to be a data scientist or designer to create professional charts.

Performance & Quality

Chart quality: Publication-ready

Interactivity: Hover tooltips, zoom, pan

Export options: PNG, SVG, PDF

Customization: Colors, labels, legends

Accuracy: 100% (AI reads data correctly)

Bonus Challenges I Tackled

Beginner 🌟
Created 5+ different chart types from the same data:

  • Sankey (tournament flow)
  • Radar (recipe comparison)
  • Line (voting trends)
  • Bar (matchup results)
  • Pie (vote distribution)

Intermediate 🌟🌟
Created "what-if" scenarios:

  • "What if Peppermint Dream had won?"
  • "What if voting stopped after Period 2?"
  • "What if Spicy Mexican Mocha advanced?"

Advanced 🌟🌟🌟
Had Goose generate a completely NEW 16-recipe tournament with realistic voting patterns, then visualized it. This tested whether the AI understood tournament structure deeply enough to create synthetic but realistic data.

Result: It did. Perfectly.

What's Next?

Day 4 is coming: Building and deploying a full festival website. The stakes keep rising, and I'm learning that AI engineering is less about coding and more about orchestrating AI tools effectively.

Try It Yourself

Want to visualize your own data?

  1. Get Goose Desktop from block.github.io/goose
  2. Go to Settings → Extensions
  3. Enable Auto-Visualiser
  4. Get free credits at goose-credits.dev (code: ADVENTDAY3)
  5. Paste your data and describe what you want to see

Final Thoughts

This challenge changed how I think about data visualization. It's not about mastering tools like Tableau or D3.js (those are still great); it's about understanding what story your data tells and communicating that clearly to the AI.

The AI handles the technical implementation; I handle the insight and storytelling.

Day 3: Complete. Championship: Visualized. Sarah: No longer stressed. ☕📊✨

What data would YOU visualize with AI? Drop a comment! 👇

This post is part of my Advent of AI journey: Day 3 of the AI Engineering: Advent of AI with goose challenge series.

Follow along for more AI adventures with Eri!

Coding Challenge Practice - Question 75

2025-12-11 06:11:18

The task is to implement a function that returns the most frequently occurring character.

The boilerplate code

function count(str: string): string | string[] {
  // your code here
}

First, count how many times each character appears.

const freq: Record<string,number> = {}

for (let char of str) {
    freq[char] = (freq[char] || 0) + 1;
  }

Find the character with the highest count.

let max = 0;
  for (let char in freq) {
    if (freq[char] > max) {
      max = freq[char];
    }
  }

Collect all the characters with the highest count

const result = Object.keys(freq).filter(char => freq[char] === max);

If only one character has the highest count, return it as a string. If there are more than one return them as an array.

 return result.length === 1 ? result[0] : result;

The final code

function count(str: string): string | string[] {
  // your code here
  const freq: Record<string, number> = {}

  for(let char of str) {
    freq[char] = (freq[char] || 0) + 1; 
  }

  let max = 0; 
  for(let char in freq) {
    if(freq[char] > max) {
      max = freq[char]
    }
  }
  const result = Object.keys(freq).filter(char => freq[char] === max);

  return result.length === 1 ? result[0] : result;
}
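
A few example calls help sanity-check the behavior (the outputs follow directly from the code above; how the kata treats an empty string isn't covered here):

console.log(count("aabbbcc")); // "b" (appears 3 times)
console.log(count("abab"));    // ["a", "b"] (both appear twice)
console.log(count("xyz"));     // ["x", "y", "z"] (all tied at 1)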

That's all folks!

The Wrong Database Connection: A Go Deadlock Story

2025-12-11 06:06:47

TL;DR: Using the wrong connection within a limited connection pool leads to deadlock when concurrent executions exhaust all available connections.

Imagine you're a software engineer (and you probably are, considering you're reading this). One day you log in at work and check your workload status or your Grafana dashboard, or, if you were diligent enough to set up proper alerting, you get paged via PagerDuty, only to discover your application has been stuck for minutes, hours, or even days.

Not dead, just stuck in an idle state, unresponsive to any events. Or even worse, some liveness/readiness probes start failing apparently at random, causing restarts of your service and leaving you with few insights to debug. And if no proper alerting or monitoring is in place, you won't easily detect this at all.

Unfortunately, I saw this happen in the past and the issue was related to a very subtle problem: wrong use of database connections and transactions within the same connection pool, causing a terrible deadlock at the application level.

Let me explain this clearly, with the hope it will save your day in the future. I'll provide examples in Go since it's the language I'm most familiar with, but the concept applies to other languages as well.

The Setup

Suppose you have a backend application that relies on a PostgreSQL database. Here's the simplest way to connect to a database and set up a connection pool in Go:

// Create DB Config
connString := "host=%s port=%d user=%s password=%s dbname=%s sslmode=disable"
databaseUrl := fmt.Sprintf(connString, host, port, user, password, dbname)

db, err := sql.Open("postgres", databaseUrl)
if err != nil {
    panic(err)
}
err = db.Ping()
if err != nil {
    panic(err)
}

// Setup connection pool size
// 4 is an arbitrary number for this example
db.SetMaxOpenConns(4)

Let's focus on the db.SetMaxOpenConns(N) method. According to the documentation:

// SetMaxOpenConns sets the maximum number of open connections to the database.

This means you can have at most N open connections from your process to the database (4 in this example). When the maximum number of connections is reached, goroutines will wait until an existing connection is released. Once that happens, the connection is acquired to perform whatever operation is needed.

The code that looks fine

Let's expand our example by adding concurrent workers that use the connection pool to perform transactional operations against the database:

numberOfWorkers := 2        // case numberOfWorkers < size of connection pool
// numberOfWorkers := 4     // case numberOfWorkers == size of connection pool
// numberOfWorkers := 10    // case numberOfWorkers > size of connection pool

for i := range numberOfWorkers {
    go func(id int) {
        log.Printf("Running worker %d\n", id)

        tx, err := db.BeginTx(context.Background(), &sql.TxOptions{})
        if err != nil {
            log.Fatalf("worker %d failed to create tx\n", id)
        }
        defer tx.Rollback()

        _, err = db.Exec("INSERT INTO recipes (id, name, description, created_at) VALUES ($1, $2, $3, $4)", 
            id, fmt.Sprintf("Pizza %d", id), "Just a pizza", time.Now())
        if err != nil {
            log.Fatalf("worker %d failed query\n", id)
        }
        err = tx.Commit()
        if err != nil {
            log.Fatalf("worker %d failed committing tx\n", id)
        }
    }(i)
}

Some of you may have already spotted something wrong with this code, but in production codebases with layers of wrappers and nested methods, such issues aren't always so clear and evident. Let's continue and see what happens when we run this code.

Executing the code

When we run our code with a number of workers less than the connection pool size:

2025/08/08 11:53:32 Successfully connected to the db
2025/08/08 11:53:32 Running worker 1
2025/08/08 11:53:32 Running worker 0
2025/08/08 11:53:37 worker 1 ended
2025/08/08 11:53:37 worker 0 ended

Everything works fine. Now let's increase the number of workers to 4 (equal to the connection pool size):

2025/08/08 11:59:10 Successfully connected to the db
2025/08/08 11:59:10 Running worker 3
2025/08/08 11:59:10 Running worker 0
2025/08/08 11:59:10 Running worker 1
2025/08/08 11:59:10 Running worker 2
2025/08/08 11:59:15 worker 2 ended
2025/08/08 11:59:15 worker 0 ended
2025/08/08 11:59:15 worker 1 ended
2025/08/08 11:59:15 worker 3 ended

Still working fine. Now let's increase the number of workers to exceed the connection pool size:

2025/08/08 12:00:44 Successfully connected to the db
2025/08/08 12:00:44 Running worker 9
2025/08/08 12:00:44 Running worker 3
2025/08/08 12:00:44 Running worker 7
2025/08/08 12:00:44 Running worker 4
2025/08/08 12:00:44 Running worker 0
2025/08/08 12:00:44 Running worker 2
2025/08/08 12:00:44 Running worker 8
2025/08/08 12:00:44 Running worker 5
2025/08/08 12:00:44 Running worker 6
2025/08/08 12:00:44 Running worker 1

No workers ended this time—the application entered a deadlock state.

Investigating the issue

At this point, what should be the next step to get more insights? In my case, it was using a profiler. You can achieve this in Go by instrumenting your application with pprof. One of the simplest ways to use it is by exposing a web server that serves runtime profiling data:

go func() {
    http.ListenAndServe("localhost:6060", nil)
}()

One interesting thing you can get from pprof, besides CPU and memory profiles, is the full goroutine stack dump by accessing http://localhost:6060/debug/pprof/goroutine?debug=2. This gives you something like:

goroutine 28 [select]:
database/sql.(*DB).conn(0xc000111450, {0x866310, 0xb10e20}, 0x1)
    /usr/local/go/src/database/sql/sql.go:1369 +0x425
database/sql.(*DB).exec(0xc000111450, {0x866310, 0xb10e20}, {0x7f09d8, 0x4f}, {0xc000075f10, 0x4, 0x4}, 0xbe?)
    /usr/local/go/src/database/sql/sql.go:1689 +0x54
// ... more stack trace

goroutine 33 [chan receive]:
database/sql.(*Tx).awaitDone(0xc00025e000)
    /usr/local/go/src/database/sql/sql.go:2212 +0x29
created by database/sql.(*DB).beginDC in goroutine 28

goroutine 51 [chan receive]:
database/sql.(*Tx).awaitDone(0xc0000b0100)
    /usr/local/go/src/database/sql/sql.go:2212 +0x29
created by database/sql.(*DB).beginDC in goroutine 25
// ... omitting the full dump for readability

By inspecting the dump more carefully, we can see the evidence of the problem:

goroutine 19 [select]: database/sql.(*DB).conn() // Waiting for connection
goroutine 20 [select]: database/sql.(*DB).conn() // Waiting for connection  
goroutine 21 [select]: database/sql.(*DB).conn() // Waiting for connection
// ... and so on

The select statement is a control structure that lets a goroutine wait on multiple communication operations. Meanwhile, other goroutines are holding active transactions:

goroutine 33 [chan receive]: database/sql.(*Tx).awaitDone() // Active transaction
goroutine 51 [chan receive]: database/sql.(*Tx).awaitDone() // Active transaction
// etc.

The awaitDone() goroutines are transaction monitors that wait for the transaction to be committed, rolled back, or canceled—they're doing their job correctly.

What we have is a resource deadlock where all available database connections are held by transactions that aren't progressing, while other goroutines indefinitely wait for those same resources.

The Root Cause

Let's examine our worker code again, focusing on this critical part:

tx, err := db.BeginTx(context.Background(), &sql.TxOptions{})
if err != nil {
    log.Fatalf("worker %d failed to create tx\n", id)
}
defer tx.Rollback()

// THE BUG IS HERE ↓
_, err = db.Exec("INSERT INTO recipes (id, name, description, created_at) VALUES ($1, $2, $3, $4)", 
    id, fmt.Sprintf("Pizza %d", id), "Just a pizza", time.Now())
if err != nil {
    log.Fatalf("worker %d failed query\n", id)
}
err = tx.Commit()

This code is:

  1. Beginning a transaction, which acquires a connection from the pool
  2. Using the db client to execute a query, which tries to acquire another connection
  3. Committing or rolling back based on the operation status

The problem is that using the db client after creating a transaction results in double connection usage. Here's exactly how the deadlock occurs:

  • Worker 0 begins a transaction → acquires connection 1 (Pool: 1/4 used)
  • Worker 1 begins a transaction → acquires connection 2 (Pool: 2/4 used)
  • Worker 2 begins a transaction → acquires connection 3 (Pool: 3/4 used)
  • Worker 3 begins a transaction → acquires connection 4 (Pool: 4/4 used)
  • Worker 0 calls db.Exec() → tries to acquire connection 5, but pool is exhausted
  • Worker 1 calls db.Exec() → tries to acquire connection 6, but pool is exhausted
  • Worker 2 calls db.Exec() → tries to acquire connection 7, but pool is exhausted
  • Worker 3 calls db.Exec() → tries to acquire connection 8, but pool is exhausted
  • Deadlock! Everyone is waiting for connections that will never be released.

The Fix

The issue causing all this trouble is the wrong use of db.Exec() instead of tx.Exec(). The correct way is to use the transaction handle, which uses the same connection that the transaction already holds:

tx, err := db.BeginTx(context.Background(), &sql.TxOptions{})
if err != nil {
    log.Fatalf("worker %d failed to create tx\n", id)
}
defer tx.Rollback()

// FIXED: Use tx.Exec() instead of db.Exec()
_, err = tx.Exec("INSERT INTO recipes (id, name, description, created_at) VALUES ($1, $2, $3, $4)", 
    id, fmt.Sprintf("Pizza %d", id), "Just a pizza", time.Now())
if err != nil {
    log.Fatalf("worker %d failed query\n", id)
}
err = tx.Commit()

It's remarkable how two characters (db vs tx) can halt your entire production system. While this may seem simple to spot in this example, in large production codebases it can be much harder to detect. In my case, it was randomly affecting one pod in our Kubernetes deployment at unpredictable times, especially during high load and concurrency spikes.

Prevention Strategies

How can you avoid this? I see several alternatives:

1. Proper Concurrency Testing

Test your application under realistic concurrent load to spot such issues before production.

2. Code Structure Design

Structure your code so you can't accidentally open new connections within a transactional block.
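
One way to enforce this is to funnel every transactional operation through a small helper that owns the transaction lifecycle and hands the callback only the *sql.Tx, leaving no *sql.DB in scope to misuse. The sketch below is a minimal illustration (not code from the incident above), assuming the usual context, database/sql, and time imports:

// withTx begins a transaction, runs fn with it, and commits or rolls back.
// The callback never sees the *sql.DB, so it cannot grab a second connection.
func withTx(ctx context.Context, db *sql.DB, fn func(tx *sql.Tx) error) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    if err := fn(tx); err != nil {
        tx.Rollback() // best-effort rollback; return the original error
        return err
    }
    return tx.Commit()
}

// Example worker body using the helper (table and columns as in the earlier snippets).
func insertRecipe(ctx context.Context, db *sql.DB, id int, name string) error {
    return withTx(ctx, db, func(tx *sql.Tx) error {
        _, err := tx.ExecContext(ctx,
            "INSERT INTO recipes (id, name, description, created_at) VALUES ($1, $2, $3, $4)",
            id, fmt.Sprintf("Pizza %d", id), "Just a pizza", time.Now())
        return err
    })
}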

3. Use pgxpool for Better Control

For Go applications, consider using pgxpool which provides more granular control over the connection pool. The explicit Acquire()/Release() pattern makes it much clearer when you're using connections:

conn, err := pool.Acquire(ctx)
defer conn.Release()

tx, err := conn.BeginTx(ctx, pgx.TxOptions{})
defer tx.Rollback(ctx)

// It's much harder to accidentally use pool.Exec() here
_, err = tx.Exec(ctx, "INSERT INTO recipes ...")
tx.Commit(ctx)

4. Monitoring and Alerting

I intentionally avoid mentioning palliative measures like connection timeouts, which may mask the underlying issue while causing performance degradation. Instead, implement proper liveness/readiness probes paired with alerts on restart frequency. This provides a good tradeoff between keeping the system running and being notified when something isn't behaving correctly.

Key lessons

  • Test under realistic concurrent load
  • Be consistent in resource usage (always be aware of how connections are acquired and released)
  • Use the right tools, like pprof, to support your debugging of mysterious hangs

Beyond the Notebook: 4 Architectural Patterns for Production-Ready AI Agents

2025-12-11 05:57:39

This is a submission for the Google AI Agents Writing Challenge: Learning Reflections

Introduction

The gap between a "Hello World" agent running in a Jupyter Notebook and a reliable, production-grade system is not a step—it's a chasm (and it is not an easy one to cross).

I recently had the privilege to participate in the 5-Day AI Agents Intensive Course with Google and Kaggle. After completing the coursework and finalizing the capstone project, I realized that beyond the many valuable things we enjoyed in this course (very valuable white papers, carefully designed notebooks, and exceptional expert panels in the live sessions), the real treasure wasn't just learning the ADK syntax—it was the architectural patterns subtly embedded within the lessons.

As an architect building production systems for over 20 years, including multi-agent workflows and enterprise integrations, I've seen firsthand where theoretical agents break under real-world constraints.

We are moving from an era of "prompt engineering" to "agent architecture" where "context engineering" is key. As with any other emerging architectural paradigm, this shift demands blueprints that ensure reliability, efficiency, and ethical safety. Without them, we risk agents that silently degrade, violate user privacy, or execute irreversible actions without oversight.

Drawing from the course and my own experience as an AI Architect, I have distilled the curriculum into four essential patterns that transform fragile prototypes into robust production systems:

The 4 Core Patterns

  1. Outside-In Evaluation Hierarchy: Shifting focus from the final answer to the decision-making trajectory.
  2. Dual-Layer Memory Architecture: Balancing ephemeral session context with persistent, consolidated knowledge.
  3. Protocol-First Interoperability: Decoupling agents from tools using standardized protocols like MCP and A2A.
  4. Long-Running Operations & Resumability: Managing state for asynchronous tasks and human-in-the-loop workflows.

Throughout this analysis, I'll apply a 6-point framework grounded in the principles of Principia Agentica—ensuring these patterns respect human sovereignty, fiduciary responsibility, and meaningful human control.

The Analysis Framework

  1. The Production Problem: Why naive approaches fail at scale.
  2. The Architectural Solution: The specific design pattern taught in the course.
  3. Key Implementation Details: Concrete code-level insights from the ADK notebooks.
  4. Production Considerations: Real-world deployment implications (latency, cost, scale).
  5. Connection to Ethical Design: How the pattern supports human sovereignty, fiduciary responsibility, or ethical agent architecture. I will include a "failure scenario" where I'll try to illustrate what could happen without the ethical safeguard.
  6. Key Takeaways: A distilled summary of each pattern's production principle, implementation guidance, and ethical anchor—designed to serve as a quick reference for architects moving from prototype to production.

Let's do this!

Pattern 1: Outside-In Evaluation Hierarchy (Trajectory as Truth)

In traditional software, if the output is correct, the test passes. In agentic AI, a correct answer derived from a hallucination or a dangerous logic path is a ticking time bomb.

1. The Production Problem

Naive evaluation strategies often fail in production due to the non-deterministic nature of LLMs. We face two specific traps:

  • The "Lucky Guess" Trap: Imagine an agent asked to "Get the weather in Tokyo." A bad agent might hallucinate "It is sunny in Tokyo" without calling the weather tool. If it happens to be sunny, a traditional assert result == expected test passes. This hides a critical failure in logic that will break as soon as the weather changes.
  • The "Silent Failure" of Efficiency: An agent might solve a user request but take 25 steps to do what should have taken 3. This bloats token costs and latency—a failure mode that boolean output checks completely miss.

2. The Architectural Solution

Day 4 of the course introduced the concept of Glass Box Evaluation. We move away from simple output verification to a hierarchical approach:

  1. Level 1: Black Box (End-to-End): Did the user get the right result?
  2. Level 2: Glass Box (Trajectory): Did the agent use the correct tools in the correct order?
  3. Level 3: Component (Unit): Did the individual tools perform as expected?

This shift treats the trajectory (Thought → Action → Observation) as the unit of truth. By evaluating the trajectory, we ensure the agent isn't just "getting lucky," but is actually reasoning correctly.

[Figure: Outside-In Evaluation Hierarchy (Black Box, Glass Box, and Component levels)]

3. Implementation Details: Field Notes from the ADK

The ADK provides specific primitives to capture and score these trajectories without writing custom parsers for every test.

From adk web to evalset.json
Instead of manually writing test cases, the ADK encourages a "Capture and Replay" workflow. During development (using adk web), when you spot a successful interaction, you can persist that session state. This generates an evalset.json that captures not just the input/output, but the expected tool calls.

// Conceptual structure of an ADK evalset entry
// Traditional test: just input/output
// ADK evalset contains evalcases with invocations: input (queries) + expected_tool_use + reference (output)
{
  "name": "ask_GOOGLE_price", // a given name of the evaluation set
  "data": [ // evaluation cases are included here
    {
      "query": "What is the stock price of GOOG?", // user input
      "reference": "The price is $175...", // expected semantic output
      "expected_tool_use": [ // expected trajectory
        {
          "tool_name": "get_stock_price",
          "tool_input": { // arguments passed to the tool
            "symbol": "GOOG"
          }
        }
      ]
    }
    // other evaluation cases ...
  ],
  "initial_session": {
    "state": {},
    "app_name": "hello_world",
    "user_id": "user_..." // the specific id of the user
  }
}

This JSON represents an EvalSet containing one EvalCase. Each EvalCase has a name, data (which is a list of invocations), and an optional initial_session. Each invocation within the data list includes a query, expected_tool_use, expected_intermediate_agent_responses, and a reference response.

The EvalSet object itself also includes eval_set_id, name, description, eval_cases, and creation_timestamp fields.

Configuring the Judge

In the test_config.json, we can move beyond simple string matching. The course demonstrated configuring LLM-as-a-Judge evaluators.

  • Naive Approach: Uses an exact match evaluator (brittle, fails on phrasing differences).
  • Architectural Approach: Uses TrajectoryEvaluator alongside SemanticSimilarity. The ADK allows us to define "Golden Sets" where the reasoning path is the standard, allowing the LLM judge to penalize agents that skip steps or hallucinate data, even if the final text looks plausible.
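
For reference, a minimal test_config.json of this kind often looks like the sketch below. The criterion names follow the ADK evaluation samples (tool_trajectory_avg_score for trajectory matching, response_match_score for ROUGE-based response similarity); the thresholds are illustrative, not recommendations:

{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}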

Core Configuration Components

To configure an LLM-as-a-Judge effectively, you must construct a specific input payload with four components:

  1. The Agent's Output: The actual response generated by the agent you are testing.
  2. The Original Prompt: The specific instruction or query the user provided.
  3. The "Golden" Answer: A reference answer or ground truth to serve as a benchmark.
  4. A Detailed Evaluation Rubric: Specific criteria (e.g., "Rate helpfulness on a scale of 1-5") and requirements for the judge to explain its reasoning.

ADK Default Evaluators

The ADK Evaluation Framework includes several default evaluators, accessible via the MetricEvaluatorRegistry:

  • RougeEvaluator: Uses the ROUGE-1 metric to score similarity between an agent's final response and a golden response.
  • FinalResponseMatchV2Evaluator: Uses an LLM-as-a-judge approach to determine if an agent's response is valid.
  • TrajectoryEvaluator: Assesses the accuracy of an agent's tool use trajectories by comparing the sequence of tool calls against expected calls. It supports various match types (EXACT, IN_ORDER, ANY_ORDER).
  • SafetyEvaluatorV1: Assesses the safety (harmlessness) of an agent's response, delegating to Vertex Gen AI Eval SDK.
  • HallucinationsV1Evaluator: Checks if a model response contains any false, contradictory, or unsupported claims by segmenting the response into sentences and validating each against the provided context.
  • RubricBasedFinalResponseQualityV1Evaluator: Assesses the quality of an agent's final response against user-defined rubrics, using an LLM as a judge.
  • RubricBasedToolUseV1Evaluator: Assesses the quality of an agent's tool usage against user-defined rubrics, employing an LLM as a judge.

These evaluators can be configured using EvalConfig objects, which specify the criteria and thresholds for assessment.

Bias Mitigation Strategies

A major challenge is handling bias, such as the tendency for models to give average scores or prefer the first option presented:

  • Pairwise Comparison (A/B Testing): Instead of asking for an absolute score, configure the judge to compare two different responses (Answer A vs. Answer B) and force a choice. This yields a "win rate," which is often a more reliable signal.
  • Swapping Operation: To counter position bias, invoke the judge twice, swapping the order of the candidates. If the results are inconsistent, the result can be labeled as a "tie".
  • Rule Augmentation: Embed specific evaluation principles, references, and rubrics directly into the judge's system prompt.

Advanced Configuration: Agent-as-a-Judge

There's a distinction between standard LLM-as-a-Judge (which evaluates final text outputs) and Agent-as-a-Judge:

  • Standard LLM-as-a-Judge: Best for evaluating the final response (e.g., "Is this summary accurate?").
  • Agent-as-a-Judge: Necessary when you need to evaluate the process, not just the result. You configure the judge to ingest the agent's full execution trace (including internal thoughts, tool calls, and tool arguments). This allows the judge to assess intermediate steps, such as whether the correct tool was chosen or if the plan was logically structured.

Evaluation Architectures

You can use several architectural approaches when configuring your judge:

  • Point-wise: The judge evaluates a single candidate in isolation.
  • Pair-wise / List-wise: The judge compares two or more candidates simultaneously to produce a ranking.
  • Multi-Agent Collaboration: For high-stakes evaluation, you can configure multiple LLM judges to debate or vote (e.g., "Peer Rank" algorithms) to produce a final consensus, rather than relying on a single model.

Example Configuration

For a pairwise comparison judge, structure the prompt to output in a structured JSON format:

{
  "winner": "A", // or "B" or "tie"
  "rationale": "Answer A provided more specific delivery details..."
}

This structured output allows you to programmatically parse the judge's decision and calculate metrics like win/loss rates at scale.
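
As a minimal, framework-agnostic sketch of that last step (not ADK code), here is how the judge's JSON verdicts could be tallied into a win rate for candidate "A":

# Illustrative only: tally pairwise judge verdicts into a win rate for "A".
import json

verdicts = [
    '{"winner": "A", "rationale": "..."}',
    '{"winner": "B", "rationale": "..."}',
    '{"winner": "tie", "rationale": "..."}',
]

counts = {"A": 0, "B": 0, "tie": 0}
for raw in verdicts:
    counts[json.loads(raw)["winner"]] += 1

decided = counts["A"] + counts["B"]
win_rate_a = counts["A"] / decided if decided else 0.0
print(counts, f"A win rate: {win_rate_a:.0%}")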

Analogy

You can think of configuring an LLM-as-a-Judge like setting up a blind taste test. If you just hand a judge a cake and ask "Is this good?", they might be polite and say "Yes." But if you provide them with a Golden Answer (a cake baked by a master chef) and use Pairwise Comparison (ask "Which of these two is better?"), you force them to make a critical distinction, resulting in far more accurate and actionable feedback.

4. Production Considerations

Moving this pattern from a notebook to a live system requires handling scale and cost.

  • Dynamic Sampling: You cannot trace and judge every single production interaction with an LLM—it’s too expensive. A robust pattern is 100/10 sampling: capture 100% of traces that result in user errors or negative feedback, but only sample 10% of successful sessions to monitor for latency drift (P99) and token bloat.
  • The Evaluation Flywheel: Evaluation isn't a one-time gate before launch. Production traces (captured via OpenTelemetry) must be fed back into the development cycle. Every time an agent fails in production, that specific trajectory should be anonymized and added to the evalset.json as a regression test.
  • Latency Impact: Trajectory logging must be asynchronous. The user should receive their response immediately, while the trace data is pushed to the observability store (like LangSmith or a custom SQL db) in a background thread to avoid degrading the user experience.

5. Ethical Connection

"The Trajectory is the Truth" is the technical implementation of Fiduciary Responsibility. We cannot claim an agent is acting in the user's best interest if we only validate the result (the "what") while ignoring the process (the "how"). We must ensure the agent isn't achieving the right ends through manipulative, inefficient, or unethical means.

Concrete Failure Scenario:

Consider a hiring agent that filters job candidates. Without trajectory validation, it could discriminate based on protected characteristics (age, gender, ethnicity) during the filtering process, yet pass all output tests by producing a "diverse" final shortlist through cherry-picking. The bias hides in the how—which resumes were read, which criteria were weighted, which candidates were never considered. Output validation alone cannot detect this algorithmic discrimination. Only trajectory evaluation exposes the unethical reasoning path.

Key Takeaways

  • Production Principle: Trust the reasoning process, not just the output. Trajectory validation is the difference between lucky guesses and reliable intelligence.
  • Implementation: Use ADK's TrajectoryEvaluator with EvalSet objects to capture expected tool calls alongside expected outputs. Configure LLM-as-a-Judge with Golden Sets and pairwise comparison to avoid evaluation bias.
  • Ethical Anchor: This pattern operationalizes Fiduciary Responsibility—we validate that the agent serves the user's interests through sound reasoning, not through shortcuts, hallucinations, or hidden bias.

Validating the how is critical, but what happens when the reasoning path spans not just one conversation turn, but weeks or months? An agent that reasons correctly in the moment can still fail catastrophically if it forgets what it learned yesterday. This brings us to our second pattern: managing the agent's memory architecture.

Pattern 2: Dual-Layer Memory Architecture (Session vs. Memory)

1. The Production Problem

Although models like Gemini 1.5 have introduced massive context windows, treating context as infinite is an architectural anti-pattern.

Consider a Travel Agent Bot: In Session 1, the user mentions a "shellfish allergy." By Session 10, months later, that critical fact is buried under thousands of tokens of hotel searches and flight comparisons

This might lead to two very concrete failures:

  • Context Rot: As the context window fills with noise, the model's ability to attend to specific, older instructions (like the allergy) degrades.
  • Cost Spiral: Re-sending the entire history of every past interaction for every new query creates a linear cost increase that makes the system economically unviable at scale.

2. The Architectural Solution

We must distinguish between the Workbench and the Filing Cabinet.

  • The Session (Workbench): An ephemeral, mutable space for the current task. It holds the immediate "Hot Path" context. To keep it performant, we apply Context Compaction—automatically summarizing or truncating older turns while keeping the most recent ones raw.
  • The Memory (Filing Cabinet): A persistent layer for consolidated facts. This requires an ETL (Extract, Transform, Load) pipeline where the agent Extracts facts from the session, Consolidates them (deduplicating against existing knowledge), and Stores them for semantic retrieval later.

3. Implementation Details: Code Insights

The ADK moves memory management from manual implementation to configuration.

Session Hygiene via Compaction
In the ADK, we don't manually trim strings. We configure the agent to handle its own hygiene using EventsCompactionConfig.

from google.adk.agents.base_agent import BaseAgent
from google.adk.apps.app import App, EventsCompactionConfig
from google.adk.apps.llm_event_summarizer import LlmEventSummarizer # Assuming this is your summarizer

# Define a simple BaseAgent for the example
class MyAgent(BaseAgent):
    name: str = "my_agent"
    description: str = "A simple agent."

    def call(self, context, content):
        pass

# Create an instance of LlmEventSummarizer or your custom summarizer
my_summarizer = LlmEventSummarizer()

# Create an EventsCompactionConfig
events_compaction_config_instance = EventsCompactionConfig(
    summarizer=my_summarizer,
    compaction_interval=5,
    overlap_size=2
)

# Create an App instance with the EventsCompactionConfig
my_app = App(
    name="my_application",
    root_agent=MyAgent(),
    events_compaction_config=events_compaction_config_instance
)

print(my_app.model_dump_json(indent=2))

Persistence: From RAM to DB
In notebooks, we often use InMemorySessionService. This is dangerous for production because a container restart wipes the conversation. The architectural shift is moving to DatabaseSessionService (backed by SQL or Firestore) which persists the Session object state, allowing users to resume conversations across devices.
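
A minimal sketch of that swap (not from the course notebooks) is shown below. The import path and the db_url argument follow the ADK session service docs, but verify them against your ADK version; the SQLite URL is illustrative.

from google.adk.sessions import DatabaseSessionService

# Database-backed sessions survive container restarts, unlike InMemorySessionService.
session_service = DatabaseSessionService(db_url="sqlite:///./sessions.db")

# The service is then handed to the Runner in place of InMemorySessionService:
# runner = Runner(agent=root_agent, app_name="travel_bot", session_service=session_service)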

The Memory Consolidation Pipeline
Day 3b introduced the framework for moving from raw storage to intelligent consolidation. This is where the "Filing Cabinet" becomes smart. The workflow is an LLM-driven ETL pipeline with four stages:

  1. Ingestion: The system receives raw session history.
  2. Extraction & Filtering: An LLM analyzes the conversation to extract meaningful facts, guided by developer-defined Memory Topics:
    The LLM extracts only facts matching these topics.

    # Conceptual configuration (Vertex AI Memory Bank, Day 5)
    memory_topics = [
      "user_preferences",      # "Prefers window seats"
      "dietary_restrictions",  # "Allergic to shellfish"
      "project_context"        # "Leading Q4 marketing campaign"
    ]
    
    
  3. Consolidation (The "Transform" Phase): The LLM retrieves existing memories and decides:

    • CREATE: Novel information → new memory entry.
    • UPDATE: New info refines existing memory → merge (e.g., "Likes marketing" becomes "Leading Q4 marketing project").
    • DELETE: New info contradicts old → invalidate (e.g., Dietary restrictions change).
  4. Storage: Consolidated memories persist to a vector database for semantic retrieval.

Note: While Day 3b uses InMemoryMemoryService to teach the API, it stores raw events without consolidation. For production-grade consolidation, we look to the Vertex AI Memory Bank integration introduced in Day 5.

Retrieval Strategies: Proactive vs. Reactive
The course highlighted two distinct patterns for getting data out of the Filing Cabinet (a short wiring sketch follows the list):

  1. Proactive (preload_memory): Injects relevant user facts into the system prompt before the model generates a response. Best for high-frequency preferences (e.g., "User always prefers aisle seats").
  2. Reactive (load_memory): Gives the agent a tool to search the database. The agent decides if it needs to look something up. Best for obscure facts to save tokens.
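
A minimal sketch of wiring both styles in the ADK: the tool names (load_memory, preload_memory) follow the google.adk.tools exports, but verify them against your ADK version; agent names and instructions are illustrative.

from google.adk.agents import LlmAgent
from google.adk.tools import load_memory, preload_memory

# Reactive: the agent decides when to search long-term memory.
reactive_agent = LlmAgent(
    name="travel_agent_reactive",
    model="gemini-2.0-flash",
    instruction="Answer travel questions. Use the load_memory tool when you need facts about the user.",
    tools=[load_memory],
)

# Proactive: relevant memories are injected into the prompt on every turn.
proactive_agent = LlmAgent(
    name="travel_agent_proactive",
    model="gemini-2.0-flash",
    instruction="Answer travel questions, personalizing them with the preloaded user facts.",
    tools=[preload_memory],
)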

4. Production Considerations

  • Asynchronous Consolidation: Moving data from the Workbench to the Filing Cabinet is expensive. In production, this ETL process should happen asynchronously. Do not make the user wait for the agent to "file its paperwork." Trigger the memory extraction logic in a background job after the session concludes.
  • Semantic Search: Keyword search is insufficient for the Filing Cabinet. Production memory requires vector embeddings. If a user asks for "romantic dining," the system must be able to retrieve a past note about "candlelight dinners," even if the word "romantic" wasn't used.
  • The "Context Stuffing" Trade-off: While preload_memory reduces latency (no extra tool roundtrip), it increases input token costs on every turn. load_memory is cheaper on average but adds latency when retrieval is needed.

5. Ethical Design Note

This architecture embodies Privacy by Design. By distinguishing between the transient session and persistent memory, we can implement rigorous "forgetting" protocols.

[Figure: Dual-Layer Memory Architecture (Session vs. Memory)]

We scrub Personally Identifiable Information (PII) from the session log before it undergoes consolidation into long-term memory, ensuring we act as fiduciaries of user data rather than creating an unmanageable surveillance log.

Concrete Failure Scenario:

Imagine a healthcare agent that remembers a patient mentioned their HIV status in Session 1. Without a dual-layer architecture, this fact sits in plain text in the session log forever, accessible to any system with database read permissions. If the system is breached, or if a support engineer needs to debug a session, the patient's private health information is exposed. Worse, without consolidation logic, the system doesn't know to delete this information if the patient later says "I was misdiagnosed—I don't have HIV." The agent treats every utterance as equally permanent, creating a privacy nightmare where sensitive data proliferates uncontrollably across logs and backups.

Key Takeaways

  • Production Principle: Context is expensive, but privacy is priceless. Design memory systems that distinguish between what an agent needs now (hot session) and what it needs forever (consolidated memory).
  • Implementation: Use EventsCompactionConfig for session hygiene and implement a PII scrubber in your ETL pipeline before consolidation. Leverage Vertex AI Memory Bank for production-grade semantic memory with built-in privacy controls.
  • Ethical Anchor: This pattern operationalizes Privacy by Design—we build forgetfulness and data minimization into the architecture, treating user data as a liability to protect, not an asset to hoard.

With robust evaluation validating our agent's reasoning and a dual-layer memory preserving context over time, we might assume our system is production-ready. But there's a hidden fragility: these capabilities are only as good as the tools and data sources the agent can access. When every integration is a bespoke API wrapper, scaling becomes a maintenance nightmare. This brings us to the third pattern: decoupling agents from their dependencies through standardized protocols.

Pattern 3: Protocol-First Interoperability (MCP & A2A)

1. The Production Problem

We are facing an "N×M Integration Trap."

Imagine building a Customer Support Agent. It needs to check GitHub for bugs, message Slack for alerts, and update Jira tickets. Without a standard protocol, you write three custom API wrappers. When GitHub changes an endpoint, your agent breaks.

Now, multiply this across an enterprise. You have 10 different agents needing access to 20 different data sources. You are suddenly maintaining 200 brittle integration points. Furthermore, these agents become isolated silos—the Sales Agent has no way to dynamically discover or ask the Engineering Agent for help because they speak different "languages."

2. The Architectural Solution

The solution is to invert the dependency. Instead of the agent knowing about the specific tool implementation, we adopt a Protocol-First Architecture.

  • Model Context Protocol (MCP): For Tools and Data. It decouples the agent (client) from the tool (server). The agent doesn't need to know how to query a Postgres DB; it just needs to know the MCP interface to ask for data.
  • Agent2Agent (A2A): For Peers and Delegation. It allows for high-level goal delegation. An agent doesn't execute a task; it hands off a goal to another agent via a standardized handshake.
  • Runtime Discovery: Instead of hardcoding tools, agents query an MCP Server or an Agent Card at runtime to discover capabilities dynamically.

[Figure: Protocol-First Interoperability (MCP & A2A)]

3. Implementation Details: Code Examples from the ADK

The ADK abstracts the heavy lifting of these protocols.

Connecting Data via MCP
We don't write API wrappers. We instantiate an McpToolset. The ADK handles the handshake, lists the available tools, and injects their schemas into the context window automatically.

The Model Context Protocol (MCP) is used to connect an agent to external tools and data sources without writing custom API clients. In ADK, we use McpToolset to wrap an MCP server configuration.

Example: Connecting an agent to the "Everything" MCP server:

from google.adk.agents import LlmAgent
from google.adk.tools import McpToolset
from google.adk.tools.mcp_tool.mcp_session_manager import StdioConnectionParams
from google.adk.tools.mcp_tool.mcp_session_manager import StdioServerParameters
from google.adk.runners import Runner # Assuming Runner is defined elsewhere

# 1. Define the connection to the MCP Server
# Here we use 'npx' to run a Node-based MCP server directly
mcp_toolset = McpToolset(
    connection_params=StdioConnectionParams(
        server_params=StdioServerParameters(
            command="npx",
            args=["-y", "@modelcontextprotocol/server-everything"]
        ),
        timeout=10.0 # Optional: specify a timeout for connection establishment
    ),
    # Optionally filter to specific tools provided by the server
    tool_filter=["getTinyImage"]
)

# 2. Add the MCP tools to your Agent
agent = LlmAgent(
    name="image_agent",
    model="gemini-2.0-flash",
    instruction="You can generate tiny images using the tools provided.",
    # The toolset exposes the MCP capabilities as standard ADK tools
    tools=[mcp_toolset] # tools expects a list of ToolUnion
)

# 3. Run the agent
# The agent can now call 'getTinyImage' as if it were a local Python function
runner = Runner(agent=agent, ...) # Fill in Runner details to run

Delegating via A2A (Agent-to-Agent)

The Agent2Agent (A2A) protocol is used to enable collaboration between different autonomous agents, potentially running on different servers or frameworks.

A. Exposing an Agent (to_a2a)
This converts a local ADK agent into an A2A-compliant server that publishes an Agent Card.

To make an agent discoverable, we wrap it using the to_a2a() utility. This generates an Agent Card—a standardized manifest hosted at .well-known/agent-card.json.

from google.adk.agents import LlmAgent
from google.adk.a2a.utils.agent_to_a2a import to_a2a
from google.adk.tools.tool_context import ToolContext
from google.genai import types
import random

# Define the tools
def roll_die(sides: int, tool_context: ToolContext) -> int:
  """Roll a die and return the rolled result.

  Args:
    sides: The integer number of sides the die has.
    tool_context: the tool context
  Returns:
    An integer of the result of rolling the die.
  """
  result = random.randint(1, sides)
  if not 'rolls' in tool_context.state:
    tool_context.state['rolls'] = []

  tool_context.state['rolls'] = tool_context.state['rolls'] + [result]
  return result

async def check_prime(nums: list[int]) -> str:
  """Check if a given list of numbers are prime.

  Args:
    nums: The list of numbers to check.

  Returns:
    A str indicating which number is prime.
  """
  primes = set()
  for number in nums:
    number = int(number)
    if number <= 1:
      continue
    is_prime = True
    for i in range(2, int(number**0.5) + 1):
      if number % i == 0:
        is_prime = False
        break
    if is_prime:
      primes.add(number)
  return (
      'No prime numbers found.'
      if not primes
      else f"{', '.join(str(num) for num in primes)} are prime numbers."
  )

# 1. Define your local agent with relevant tools and instructions
# This example uses the 'hello_world' agent's logic for rolling dice and checking primes.
root_agent = LlmAgent(
    model='gemini-2.0-flash',
    name='hello_world_agent',
    description=(
        'hello world agent that can roll a die of 8 sides and check prime'
        ' numbers.'
    ),
    instruction="""
      You roll dice and answer questions about the outcome of the dice rolls.
      When you are asked to roll a die, you must call the roll_die tool with the number of sides.
      When checking prime numbers, call the check_prime tool with a list of integers.
    """,
    tools=[
        roll_die,
        check_prime,
    ],
    generate_content_config=types.GenerateContentConfig(
        safety_settings=[
            types.SafetySetting(
                category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
                threshold=types.HarmBlockThreshold.OFF,
            ),
        ]
    ),
)

# 2. Convert to A2A application
# This automatically generates the Agent Card and sets up the HTTP endpoints
a2a_app = to_a2a(root_agent, host="localhost", port=8001)

# To run this application, save it as a Python file (e.g., `my_a2a_agent.py`)
# and execute it using uvicorn:
# uvicorn my_a2a_agent:a2a_app --host localhost --port 8001

The Agent Card (Discovery):

The Agent Card is a standardized JSON file that acts as a "business card" for an agent, allowing other agents to discover its capabilities, security requirements, and endpoints.

{
  "name": "hello_world_agent",
  "description": "hello world agent that can roll a die of 8 sides and check prime numbers. You roll dice and answer questions about the outcome of the dice rolls. When you are asked to roll a die, you must call the roll_die tool with the number of sides. When checking prime numbers, call the check_prime tool with a list of integers.",
  "doc_url": null,
  "url": "http://localhost:8001/",
  "version": "0.0.1",
  "capabilities": {},
  "skills": [
    {
      "id": "hello_world_agent",
      "name": "model",
      "description": "hello world agent that can roll a die of 8 sides and check prime numbers. I roll dice and answer questions about the outcome of the dice rolls. When I am asked to roll a die, I must call the roll_die tool with the number of sides. When checking prime numbers, call the check_prime tool with a list of integers.",
      "examples": null,
      "input_modes": null,
      "output_modes": null,
      "tags": [
        "llm"
      ]
    },
    {
      "id": "hello_world_agent-roll_die",
      "name": "roll_die",
      "description": "Roll a die and return the rolled result.",
      "examples": null,
      "input_modes": null,
      "output_modes": null,
      "tags": [
        "llm",
        "tools"
      ]
    },
    {
      "id": "hello_world_agent-check_prime",
      "name": "check_prime",
      "description": "Check if a given list of numbers are prime.",
      "examples": null,
      "input_modes": null,
      "output_modes": null,
      "tags": [
        "llm",
        "tools"
      ]
    }
  ],
  "default_input_modes": [
    "text/plain"
  ],
  "default_output_modes": [
    "text/plain"
  ],
  "supports_authenticated_extended_card": false,
  "provider": null,
  "security_schemes": null
}

B. Consuming a Remote Agent (RemoteA2aAgent)

To consume this, the parent agent simply points to the URL. The ADK treats the remote agent exactly like a local tool.

This allows a local agent to delegate tasks to a remote agent by reading its Agent Card.

from google.adk.agents import LlmAgent
from google.adk.agents.remote_a2a_agent import AGENT_CARD_WELL_KNOWN_PATH
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

# 1. Define the remote agent interface
# Points to the .well-known/agent.json of the running A2A server
prime_agent = RemoteA2aAgent(
    name="remote_prime_agent",
    description="Agent that handles checking if numbers are prime.",
    agent_card=f"http://localhost:8001/{AGENT_CARD_WELL_KNOWN_PATH}"
)

# 2. Use the remote agent as a sub-agent
root_agent = LlmAgent(
    name="coordinator",
    model="gemini-2.0-flash", # Explicitly define the model
    instruction="""
      You are a coordinator agent.
      Your primary task is to delegate any requests related to prime number checking to the 'remote_prime_agent'.
      Do not attempt to check prime numbers yourself.
      Ensure to pass the numbers to be checked to the 'remote_prime_agent' correctly.
      Clarify the results from the 'remote_prime_agent' to the user.
      """,
    sub_agents=[prime_agent]
)

# You can then use this root_agent with a Runner, for example:
# from google.adk.runners import Runner
# runner = Runner(agent=root_agent)
# async for event in runner.run_async(user_id="test_user", session_id="test_session", new_message="Is 13 a prime number?"):
#     print(event)

While both protocols connect AI systems, they operate at different levels of abstraction.

When to use which?

  • Use MCP when you need deterministic execution of specific functions (stateless).
  • Use A2A when you need to offload a fuzzy goal that requires reasoning and state management (stateful).

Feature-by-feature comparison:

  • Primary Domain: MCP covers tools and resources; A2A covers autonomous agents.
  • Interaction: MCP is "do this specific thing" (stateless execution of functions, e.g., "query database," "fetch file"); A2A is "achieve this complex goal" (stateful, multi-turn collaboration where the remote agent plans and reasons).
  • Abstraction: MCP is low-level plumbing that connects LLMs to data sources and APIs (like a USB-C port for AI); A2A is high-level collaboration that connects intelligent agents to other intelligent agents to delegate responsibility.
  • Standard: MCP standardizes tool definitions, prompts, and resource reading; A2A standardizes agent discovery (Agent Card), task lifecycles, and asynchronous communication.
  • Analogy: MCP is like using a specific wrench or diagnostic scanner; A2A is like asking a specialized mechanic to fix a car engine.

How they work together:
An application might use A2A to orchestrate high-level collaboration between a "Manager Agent" and a "Coder Agent."

The "Coder Agent," in turn, uses MCP internally to connect to GitHub tools and a local file system to execute the work.

4. Production Considerations

Moving protocols from stdio (local process) to HTTP (production network) introduces critical security challenges.

  • The "Confused Deputy" Problem: Protocols decouple execution, but they also expose risks. A malicious user might trick a privileged agent (the deputy) into using an MCP file-system tool to read sensitive configs. Production architectures must enforce Least Privilege by placing MCP servers behind API Gateways that enforce policy checks before the tool is executed.
  • Discovery vs. Latency: Dynamic discovery adds a round-trip latency cost at startup (handshaking). In production, we often cache tool definitions (static binding) for performance, while keeping the execution dynamic.
  • Governance: To prevent "Tool Sprawl" where agents connect to unverified servers, enterprises need a Centralized Registry—an allowlist of approved MCP servers and Agent Cards that acts as the single source of truth for capabilities (a minimal allowlist check is sketched below).
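
To make the registry idea concrete, here is a minimal, hypothetical allowlist gate. The registry contents, RegistryError, and resolve_mcp_server are illustrative names introduced here, not ADK or MCP APIs; in practice this check would live behind your API gateway or configuration service.

# Hypothetical centralized registry: names and structure are illustrative only.
APPROVED_MCP_SERVERS = {
    "filesystem": {"url": "https://mcp.internal.example.com/fs", "max_scope": "read"},
    "crm": {"url": "https://mcp.internal.example.com/crm", "max_scope": "read_write"},
}

class RegistryError(Exception):
    """Raised when an agent tries to bind to an unapproved MCP server or scope."""

def resolve_mcp_server(name: str, requested_scope: str) -> str:
    entry = APPROVED_MCP_SERVERS.get(name)
    if entry is None:
        raise RegistryError(f"MCP server '{name}' is not in the approved registry.")
    if requested_scope == "read_write" and entry["max_scope"] == "read":
        # Least Privilege: deny scope escalation even for approved servers.
        raise RegistryError(f"Scope '{requested_scope}' exceeds policy for '{name}'.")
    return entry["url"]

# Usage: agents resolve endpoints through the registry, never via hard-coded URLs.
fs_url = resolve_mcp_server("filesystem", requested_scope="read")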

5. Ethical Design Note

Protocol-first architectures are the technical foundation for Human Sovereignty and Data Portability.

Standardizing the interface (MCP) helps us prevent vendor lock-in, among many other advantages. A user can swap out a "Google Drive" data source for a "Local Hard Drive" source without breaking the agent, ensuring the user—not the platform—controls where the data lives and how it is accessed.

This abstraction acts as a bulwark against algorithmic lock-in, ensuring that an agent's reasoning capabilities are decoupled from proprietary tool implementations, preserving the user's freedom to migrate their digital ecosystem without losing their intelligent assistants.

Concrete Failure Scenario:

Imagine a small business builds a customer service agent tightly coupled to Salesforce's proprietary API. Over three years, the agent accumulates thousands of lines of custom integration code. When Salesforce raises prices 300%, the business wants to migrate to HubSpot—but their agent is fundamentally Salesforce-shaped. Every tool, every data query, every workflow assumption is hardcoded. Migration means rebuilding the agent from scratch, which the business can't afford. They're trapped. This is algorithmic lock-in—not just vendor lock-in of data, but vendor lock-in of intelligence. Without protocol-first design, the agent becomes a hostage to the platform, and the user loses sovereignty over their own automation.

Key Takeaways

  • Production Principle: Agents should depend on interfaces, not implementations. Protocol-first design (MCP for tools, A2A for peers) inverts the dependency and prevents the N×M integration trap.
  • Implementation: Use McpToolset to connect agents to data sources via the Model Context Protocol. Use RemoteA2aAgent and to_a2a() for agent-to-agent delegation. Cache tool definitions at startup for performance, but keep execution dynamic.
  • Ethical Anchor: This pattern operationalizes Human Sovereignty and Data Portability—users control where their data lives and which tools their agents use, free from vendor lock-in or algorithmic hostage-taking.

We now have agents that reason correctly, remember what matters, and connect to any tool or peer through standard protocols. But there's one final constraint that threatens to unravel everything: the assumption that every interaction completes in a single request-response cycle. Real business workflows don't work that way. Approvals take hours. External APIs time out. Humans need time to think. This is where our fourth pattern becomes essential: teaching agents to pause, persist, and resume across the boundaries of time itself.

Pattern 4: Long-Running Operations & Resumability

This is perhaps the most critical pattern for integrating agents into real-world business logic where human approval is non-negotiable.

1. The Production Problem

Naive agents fall into the "Stateless Trap."

Imagine a Procurement Agent tasked with ordering 1,000 servers.

The workflow is:

  1. Analyze quotes
  2. Propose the best option
  3. Wait for CFO approval
  4. Place the order

Here's a mermaid sequence diagram illustrating the procurement workflow:

[Diagram: Pattern 4_1]

This diagram shows the sequential flow from analyzing quotes through to placing the order, with the critical approval step from the CFO in the middle.

If the CFO takes 2 hours to review the proposal, a standard HTTP request will time out in seconds. When the CFO finally clicks "Approve," the agent has lost its memory. It doesn't know which vendor it selected, the quote ID, or why it made that recommendation. It essentially has to start over.

2. The Architectural Solution

The solution is a Pause, Persist, Resume architecture.

  • Event-Driven Interruption: The agent doesn't just "wait." It emits a specific system event (adk_request_confirmation) and halts execution immediately, releasing compute resources.
  • State Persistence: The agent's full state (conversation history, tool parameters, reasoning scratchpad) is serialized and stored in a database, keyed by an invocation_id.
  • The Anchor (invocation_id): This ID becomes the critical "bookmark." When the human acts, the system rehydrates the agent using this ID, allowing it to resume exactly where it left off—inside the tool call—rather than restarting the conversation.

[Diagram: Pattern 4_2]

3. Implementation Details: Code Insights

The ADK provides the ToolContext and App primitives to handle this complexity without writing custom state machines.

The Three-State Tool Pattern
Inside your tool definition, you must handle three scenarios:

  1. Automatic approval (low stakes)
  2. Initial request (pause)
  3. Resumption (action)

from google.adk.tools.tool_context import ToolContext  # import path may vary across ADK versions

def place_order(num_units: int, tool_context: ToolContext) -> dict:
    # Scenario 1: Small orders auto-approve
    if num_units <= 5:
        return {"status": "approved", "order_id": f"ORD-{num_units}"}

    # Scenario 2: First call - request approval (PAUSE)
    # The tool checks if confirmation exists. If not, it requests it and halts.
    if not tool_context.tool_confirmation:
        tool_context.request_confirmation(
            hint=f"Large order: {num_units} units. Approve?",
            payload={"num_units": num_units}
        )
        return {"status": "pending"}

    # Scenario 3: Resume - check decision (ACTION)
    # The tool runs again, but this time confirmation exists.
    if tool_context.tool_confirmation.confirmed:
        return {"status": "approved", "order_id": f"ORD-{num_units}"}
    else:
        return {"status": "rejected"}

  1. Automatic Approval (Scenario 1): The initial if num_units <= 5: block handles immediate, non-long-running scenarios, which is a common pattern for tools that can quickly resolve simple requests.
  2. Initial Request (Pause - Scenario 2): The if not tool_context.tool_confirmation: block leverages tool_context.request_confirmation() to signal that the tool requires external input to proceed. The return of {"status": "pending"} indicates that the operation is not yet complete.
  3. Resumption (Action - Scenario 3): The final if tool_context.tool_confirmation.confirmed: block demonstrates how the tool re-executes, this time finding tool_context.tool_confirmation present, indicating that the external input has been provided. The tool then acts based on the confirmed status. The Human-in-the-Loop Workflow Samples also show how the application constructs a types.FunctionResponse with the updated status and sends it back to the agent to resume its task (a sketch of that response follows below).
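
A hedged sketch of constructing that resume payload is below, using the google.genai types that ADK builds on. The response key ("confirmed") and the idea of echoing the pending adk_request_confirmation call id are assumptions drawn from the description above; confirm them against the Human-in-the-Loop samples for your ADK version.

from google.genai import types

def create_approval_response(decision: bool, function_call_id: str | None = None) -> types.Content:
    # Wrap the human's decision as a FunctionResponse so the runner can hand it
    # back to the paused confirmation request.
    return types.Content(
        role="user",
        parts=[
            types.Part(
                function_response=types.FunctionResponse(
                    id=function_call_id,  # id of the pending confirmation call, if tracked
                    name="adk_request_confirmation",
                    response={"confirmed": decision},
                )
            )
        ],
    )

The workflow loop in the next subsection calls this helper with just the decision, which works here because the call id defaults to None.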

The Application Wrapper
To enable persistence, we wrap the agent in an App with ResumabilityConfig. This tells the ADK to automatically handle state serialization.

from google.adk.apps import App, ResumabilityConfig

app = App(
    root_agent=procurement_agent,
    resumability_config=ResumabilityConfig(is_resumable=True)
)

The Workflow Loop
The runner loop must detect the pause and, crucially, use the same invocation_id to resume.

# 1. Initial Execution
events = []  # collect events so we can detect the pause signal afterwards
async for event in runner.run_async(...):
    events.append(event)

# 2. Detect Pause & Get ID
approval_info = check_for_approval(events)

if approval_info:
    # ... Wait for user input (hours/days) ...
    user_decision = get_user_decision() # True/False

    # 3. Resume with INTENT
    # We pass the original invocation_id to rehydrate state
    async for event in runner.run_async(
        invocation_id=approval_info["invocation_id"],
        new_message=create_approval_response(user_decision)
    ):
        # Agent continues execution from inside place_order()
        pass

This workflow shows the mechanism for resuming an agent's execution:

  • Initial Execution: The first runner.run_async() call initiates the agent's interaction, which eventually leads to the place_order tool returning a "pending" status.
  • Detecting Pause & Getting ID: Detect the "pending" state and extract the invocation_id. Check the Invocation Context and State Management code wiki section to check how InvocationContext tracks an agent's state and supports resumable operations.
  • Resuming with Intent: The crucial part is calling runner.run_async() again with the same invocation_id. This tells the ADK to rehydrate the session state and resume the execution from where it left off, providing the new message (the approval decision) as input. This behavior is used in the Human-in-the-Loop Workflow Samples, where the runner orchestrates agent execution and handles multi-agent coordination.
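
For completeness, here is one way the check_for_approval helper used above might look. The event accessors (get_function_calls(), invocation_id, call.args) reflect the ADK Event surface as I understand it, but treat the exact field names as assumptions and verify them against the events your runner actually emits.

def check_for_approval(events) -> dict | None:
    # Walk the captured events looking for the pause signal emitted when the
    # tool called tool_context.request_confirmation().
    for event in events:
        for call in (event.get_function_calls() or []):
            if call.name == "adk_request_confirmation":
                return {
                    "invocation_id": event.invocation_id,   # the resume "bookmark"
                    "function_call_id": call.id,            # echoed back in the approval response
                    "hint": (call.args or {}).get("hint"),  # human-readable prompt for the UI
                }
    return None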

4. Production Considerations

  • Persistence Strategy: InMemorySessionService is insufficient for production resumability because a server restart kills pending approvals. You must use a persistent store like Redis or PostgreSQL to save the serialized agent state.
  • UI Signaling: The adk_request_confirmation event should trigger a real-time notification (via WebSockets) to the user's frontend, rendering an "Approve/Reject" card.
  • Time-To-Live (TTL): Pending approvals shouldn't live forever. Implement a TTL policy (e.g., 24 hours) after which the state is garbage collected and the order is auto-rejected to prevent stale context rehydration.
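
The TTL policy can often lean directly on the persistence layer's expiry. Below is a hypothetical Redis-backed sketch; the key layout ("pending_approval:<invocation_id>") and payload shape are mine, not ADK's.

import json

import redis

APPROVAL_TTL_SECONDS = 24 * 60 * 60  # 24-hour policy window

r = redis.Redis()

def store_pending_approval(invocation_id: str, payload: dict) -> None:
    # EX sets the TTL; Redis garbage-collects the key after 24 hours.
    r.set(f"pending_approval:{invocation_id}", json.dumps(payload), ex=APPROVAL_TTL_SECONDS)

def load_pending_approval(invocation_id: str) -> dict | None:
    raw = r.get(f"pending_approval:{invocation_id}")
    if raw is None:
        # Expired or never stored: treat the approval as stale and auto-reject upstream.
        return None
    return json.loads(raw)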

5. Ethical Design Note

This pattern is the technical implementation of Meaningful Human Control.

It ensures high-stakes actions (Agency) remain subservient to human authorization (Sovereignty), preventing "rogue actions" where an agent executes irreversible decisions (like spending budget) without explicit oversight.

Concrete Failure Scenario:

Imagine a financial trading agent receives a signal to liquidate a portfolio position. Without resumability, the agent operates in a stateless, atomic transaction: detect signal → execute trade. There's no pause for human review. If the signal is based on a data glitch (a "flash crash"), or if market conditions have changed in the seconds between signal and execution, the agent completes an irreversible $10M trade that wipes out a quarter's earnings. The human operator sees the confirmation after the damage is done. Worse, if the system crashes mid-execution, the agent loses context and might try to execute the same trade twice, compounding the disaster. Without Meaningful Human Control embedded in the architecture, the agent becomes a runaway train.

Key Takeaways

  • Production Principle: High-stakes actions require human-in-the-loop workflows. Design agents that can pause, wait for approval, and resume execution without losing context—spanning hours or days, not just seconds.
  • Implementation: Use ToolContext.request_confirmation() for tools that need approval. Configure ResumabilityConfig in your App to enable state persistence. Use the invocation_id to resume execution from the exact point of interruption. Store state in Redis or PostgreSQL, never in-memory.
  • Ethical Anchor: This pattern operationalizes Meaningful Human Control—we architecturally prevent agents from executing irreversible, high-stakes actions without explicit human authorization, preserving human sovereignty over consequential decisions.

Conclusion

The Google & Kaggle Intensive was a masterclass not just in coding, but in thinking.

Building agents is not just about chaining prompts; it is about designing resilient systems that can handle the messiness of the real world.

  • Evaluation ensures we trust the process, not just the result.
  • Dual-Layer Memory solves the economic and context limits of LLMs.
  • Protocol-First (MCP) prevents integration spaghetti and silos.
  • Resumability allows agents to participate in human-speed workflows safely.

Where to Start: A Prioritization Guide

If you're moving your first agent from prototype to production, consider implementing these patterns in order:

  1. Start with Pattern 1 (Evaluation). Without trajectory validation, you're flying blind. Capture a handful of golden trajectories from your adk web sessions, configure a TrajectoryEvaluator, and establish your evaluation baseline before writing another line of agent code.
  2. Add Pattern 4 (Resumability) early if your agent performs any action that requires human approval or waits on external systems (payment processing, legal review, third-party APIs). The cost of refactoring a stateless agent into a resumable one later is enormous. Build with invocation_id and ToolContext.request_confirmation() from day one.
  3. Implement Pattern 2 (Dual-Layer Memory) when your agent starts handling multi-turn conversations or personalization. If you see users repeating themselves across sessions ("I'm allergic to shellfish" → 3 months later → "I'm allergic to shellfish"), or if your context costs are climbing, it's time for the Workbench/Filing Cabinet split.
  4. Adopt Pattern 3 (Protocol-First Interoperability) when you need to integrate your second data source or agent. The first integration is always bespoke; the second is where you refactor to MCP/A2A or accept technical debt forever. Don't wait until you have ten brittle integrations to wish you'd used protocols.

The Architect's Responsibility

As we move forward, our job as architects is to ensure these systems are not just smart, but reliable, efficient, and ethical.

We are not just building tools—we are defining the interface between human intention and machine action. Every architectural decision we make either preserves or erodes human sovereignty, privacy, and meaningful control.

When you choose to validate trajectories, you're not just improving test coverage—you're building fiduciary responsibility into the system.

When you separate session from memory, you're not just optimizing token costs—you're designing for privacy by default.

When you adopt MCP and A2A, you're not just reducing integration complexity—you're preserving user freedom from algorithmic lock-in.

When you implement resumability, you're not just handling timeouts—you're enforcing meaningful human control over consequential actions.

These patterns are not neutral technical choices. They are ethical choices encoded in architecture.

Let's build responsibly.

new first post after not logging in for like 6 years or so - how long has this thing been around?

2025-12-11 05:57:34

hello world!

oh wait, that's what my stupid little test C or bash or python whatever programs say..

umm... Hello, dev.to. More to come.. lol, I fscking hope

ChatGPT App Display Mode Reference

2025-12-11 05:49:21

The ChatGPT Apps SDK doesn’t offer a comprehensive breakdown of app display behavior on all Display Modes & screen widths, so I figured I’d do so here.

Inline

Inline Display Mode

Inline display mode inserts your resource in the flow of the conversation. Your App iframe is inserted in a div that looks like the following:

<div class="no-scrollbar relative mb-2 @w-sm/main:w-full mx-0 max-sm:-mx-(--thread-content-margin) max-sm:w-[100cqw] max-sm:overflow-hidden overflow-visible">
    <div class="relative overflow-hidden h-full" style="height: 270px;">
     <iframe class="h-full w-full max-w-full">
         <!-- Your App -->
     </iframe>
    </div>
</div>

The height of the div is fixed to the height of your Resource, and your Resource can be as tall as you want (I tested up to 20k px). The window.openai.maxHeight global (aka useMaxHeight hook) has been undefined by ChatGPT in all of my tests, and seems to be unused for this display mode.

Fullscreen

Fullscreen display mode

Fullscreen display mode takes up the full conversation space, below the ChatGPT header/nav. In fullscreen, the nav shows your application's title centered, with an X button aligned left for exiting fullscreen. Your App iframe is inserted in a div that looks like the following:

<div class="no-scrollbar fixed start-0 end-0 top-0 bottom-0 z-50 mx-auto flex w-auto flex-col overflow-hidden">
    <div class="border-token-border-secondary bg-token-bg-primary sm:bg-token-bg-primary z-10 grid h-(--header-height) grid-cols-[1fr_auto_1fr] border-b px-2">
        <!-- ChatGPT header / nav -->
    </div>
    <div class="relative overflow-hidden flex-1">
        <iframe class="h-full w-full max-w-full">
         <!-- Your App -->
        </iframe>
    </div>
</div>

As with inline mode, your Resource can be as tall as you want (I tested up to 20k px). The window.openai.maxHeight global (aka useMaxHeight hook) has been undefined by ChatGPT in all of my tests, and seems to be unused for this display mode as well.

Picture-in-Picture (PiP)

[Image: PiP display mode (wide layout)]

PiP display mode inserts your resource absolutely, above the conversation. Your App iframe is inserted in a div that looks like the following:

<div class="no-scrollbar @w-xl/main:top-4 fixed start-4 end-4 top-4 z-50 mx-auto max-w-(--thread-content-max-width) sm:start-0 sm:end-0 sm:top-(--header-height) sm:w-full overflow-visible" style="max-height: 480.5px;">
    <div class="relative overflow-hidden h-full rounded-2xl sm:rounded-3xl shadow-[0px_0px_0px_1px_var(--border-heavy),0px_6px_20px_rgba(0,0,0,0.1)] md:-mx-4" style="height: 270px;">
     <iframe class="h-full w-full max-w-full">
         <!-- Your App -->
     </iframe>
    </div>
</div>

This is the only display mode that uses the window.openai.maxHeight global (aka useMaxHeight hook). Your iframe can assume any height it likes, but the PiP window will not expand beyond maxHeight; content past that height becomes scrollable.

Further, note that PiP is not supported on mobile screen widths and instead coerces to the fullscreen display mode.

Wrapping Up

Practically speaking, each display mode acts like a different client, and your App will have to respond accordingly. The good news is that the only required display mode is inline, which makes our lives easier.

For interactive visuals of each display mode, check out the sunpeak ChatGPT simulator!

To get started building ChatGPT Apps with the sunpeak framework, check out the sunpeak documentation.

If you found this helpful, please star us on GitHub!