2025-12-11 06:15:06
Visualizing 5,294 Votes with AI (Auto-Visualiser + MCP-UI)
An Emergency data viz for a Hot Cocoa Championship using Goose's Auto-Visualiser extension. Tournament brackets, radar charts, and the magic of MCP-UI!
Day 3: The Hot Cocoa Championship Crisis ☕📊
The Urgent Email
It's December 3rd, and my inbox was piping hot:
URGENT - Hot Cocoa Championship Results Need Visualization
Hey there! The Hot Cocoa Championship was AMAZING but we need these results visualized for the big awards ceremony tomorrow. Our data person is sick and we're in a panic.
The ceremony is tomorrow at 2 PM and we need to print these for the big screen!
Sarah Chen
Winter Festival Coordinator
(Very Stressed)
Attached: A massive markdown file with tournament data, voting breakdowns, recipe scorecards, and 5,294 total votes.
Deadline: 18 hours.
My data viz experience: Minimal.
Panic level: Rising.
Enter the Auto-Visualiser
This is where I discovered one of the coolest AI tools I've ever used:
goose's Auto-Visualiser extension.
The Auto-Visualiser made charting effortless. I assumed I'd need to write an extra, highly detailed prompt, but I didn't: it automatically picked the right visualization to highlight the data I gave it. I then asked goose to generate detailed HTML visuals and a summary. I also like goose's tone: friendly, yet professional.
Here's the magic: you paste data, describe what you want, and the charts render directly in your conversation. No code. No exports. No separate tools.
It's powered by MCP-UI (Model Context Protocol UI), which returns interactive UI components instead of just text. Mind-blowing.
The Data
Sarah sent me everything:
Tournament Bracket
Recipe Scorecards
8 recipes rated on:
Voting Breakdown
Fun Stats
What I Created
🏆 Tournament Bracket Flow (Sankey Diagram)
The first visualization showed the complete tournament progression - how votes flowed from quarterfinals through semifinals to the championship.
Why Sankey? Perfect for showing how competitors advanced and where votes accumulated. You can literally see Dark Chocolate Decadence's dominance.
Key insights:
📊 Recipe Attribute Comparison (Radar Chart)
This was my favorite - an 8-way radar chart comparing all recipes across 4 attributes.
Visual patterns emerged:
The story: Dark Chocolate won because it excelled where it mattered (richness, presentation) while maintaining good creativity.
📈 Voting Trends Over Time
A line chart showing how voter engagement increased throughout the day:
Insight: Evening voters decided the championship. Marketing lesson: timing matters!
🥊 Head-to-Head Matchup Analysis
Bar charts for each round showing vote distributions:
The nail-biter: Peppermint Dream vs Salted Caramel Swirl in Round 1 - only 14 votes separated them!
The AI Engineering Process
Here's what blew my mind: I didn't write visualization code. I had a conversation with Goose.
My Prompts:
"Create a tournament bracket showing how each recipe progressed
through quarterfinals, semifinals, and the championship."
Result: Beautiful Sankey diagram, instantly rendered.
"Compare all 8 recipes on a radar chart using their judge scores
for richness, sweetness, creativity, and presentation."
Result: Multi-series radar chart with color-coded recipes.
"Show voting trends across the three time periods."
Result: Line chart with clear trend visualization.
Charts Created: 6+
Crisis Averted: ✅
The Tech Behind the Magic
MCP-UI (Model Context Protocol UI)
Traditional AI outputs text. MCP-UI outputs interactive components:
Traditional: "Here's the data formatted as JSON..."
MCP-UI: [Renders actual interactive chart]
This is a paradigm shift in AI interfaces. Instead of describing visualizations, the AI creates them.
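Conceptually (this shape comes from the MCP-UI spec rather than anything goose-specific, and the URI below is purely illustrative), a tool result carries an embedded UI resource instead of plain text:

{
  "type": "resource",
  "resource": {
    "uri": "ui://auto-visualiser/tournament-bracket",
    "mimeType": "text/html",
    "text": "<html><!-- interactive chart markup, rendered inline in the chat --></html>"
  }
}

The host (goose, in this case) recognizes the ui:// resource and renders it as a live component in the conversation instead of printing it as text.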
Auto-Visualiser Extension
Built on MCP-UI, it:
No configuration needed. Just describe what you want.
What I Learned
Data Storytelling > Raw Data
The tournament data was just numbers. The visualizations told a story:
AI Understands Context
I didn't need to specify "use a Sankey diagram for tournament flow." Goose understood that tournament progression = flow visualization = Sankey. That's intelligence.
Speed is a Competitive Advantage
Traditional workflow:
Time: Hours
AI workflow:
Time: Minutes
Iteration is Effortless
"Make the radar chart bigger"
"Add a legend"
"Change colors to match the festival theme"
"Show only the top 4 recipes"
Each request took seconds. No re-coding, no re-exporting.
MCP-UI is the Future
This isn't just about charts. MCP-UI can render:
We're moving from "AI can assist us in writing code" to "AI that creates interfaces."
Where These Skills Apply
These skills apply to:
The Results
Sarah got her visualizations with 17 hours to spare. The awards ceremony was a hit. Dark Chocolate Decadence got its moment in the spotlight. 🏆
More importantly, I learned that AI can democratize data visualization. You don't need to be a data scientist or designer to create professional charts.
Performance & Quality
Chart quality: Publication-ready
Interactivity: Hover tooltips, zoom, pan
Export options: PNG, SVG, PDF
Customization: Colors, labels, legends
Accuracy: 100% (AI reads data correctly)
Bonus Challenges I Tackled
Beginner 🌟
Created 5+ different chart types from the same data:
Intermediate 🌟🌟
Created "what-if" scenarios:
Advanced 🌟🌟🌟
Had Goose generate a completely NEW 16-recipe tournament with realistic voting patterns, then visualized it. This tested whether the AI understood tournament structure deeply enough to create synthetic but realistic data.
Result: It did. Perfectly.
What's Next?
Day 4 is coming: Building and deploying a full festival website. The stakes keep rising, and I'm learning that AI engineering is less about coding and more about orchestrating AI tools effectively.
Try It Yourself
Want to visualize your own data?
Resources
Final Thoughts
This challenge changed how I think about data visualization. It's not really about mastering tools like Tableau or D3.js (those are great). It's about understanding what story your data tells and communicating that clearly to the AI.
The AI handles the technical implementation; I handle the insight and storytelling.
Day 3: Complete. Championship: Visualized. Sarah: No longer stressed. ☕📊✨
What data would YOU visualize with AI? Drop a comment! 👇
This post is part of my Advent of AI journey - AI Engineering: Advent of AI with goose, Day 3 of the AI engineering challenges.
Follow along for more AI adventures with Eri!
2025-12-11 06:11:18
The task is to implement a function that returns the most frequently occurring character in a string.
The boilerplate code
function count(str: string): string | string[] {
// your code here
}
First, count how many times each character appears.
const freq: Record<string,number> = {}
for (let char of str) {
freq[char] = (freq[char] || 0) + 1;
}
Find the character with the highest count.
let max = 0;
for (let char in freq) {
if (freq[char] > max) {
max = freq[char];
}
}
Collect all the characters with the highest count
const result = Object.keys(freq).filter(char => freq[char] === max);
If only one character has the highest count, return it as a string. If there are more than one return them as an array.
return result.length === 1 ? result[0] : result;
The final code
function count(str: string): string | string[] {
const freq: Record<string, number> = {}
for(let char of str) {
freq[char] = (freq[char] || 0) + 1;
}
let max = 0;
for(let char in freq) {
if(freq[char] > max) {
max = freq[char]
}
}
const result = Object.keys(freq).filter(char => freq[char] === max);
return result.length === 1 ? result[0] : result;
}
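A quick sanity check (illustrative examples, not part of the original kata):

console.log(count("hello")); // "l" (the only character appearing twice)
console.log(count("abab"));  // ["a", "b"] (tie at two occurrences each)
console.log(count("x"));     // "x"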
That's all folks!
2025-12-11 06:06:47
TL;DR: Using the wrong connection within a limited connection pool leads to deadlock when concurrent executions exhaust all available connections.
Imagine you're a software engineer (and you probably are, considering you're reading this). One day you log in at work, check your workload status, your Grafana dashboard, or if you were diligent enough to create proper alerting, you get paged via PagerDuty, only to discover your application has been stuck for minutes, hours, or even days.
Not dead, just stuck in an idle state, unresponsive to any events. Or even worse, some liveness/readiness probes start failing apparently at random, causing restarts of your service and leaving you with few insights to debug. And in the case no proper alerting or monitoring is set, you won't easily detect this.
Unfortunately, I saw this happen in the past and the issue was related to a very subtle problem: wrong use of database connections and transactions within the same connection pool, causing a terrible deadlock at the application level.
Let me explain this clearly, with the hope it will save your day in the future. I'll provide examples in Go since it's the language I'm most familiar with, but the concept applies to other languages as well.
Suppose you have a backend application that relies on a PostgreSQL database. Here's the simplest way to connect to a database and set up a connection pool in Go:
// Create DB Config
connString := "host=%s port=%d user=%s password=%s dbname=%s sslmode=disable"
databaseUrl := fmt.Sprintf(connString, host, port, user, password, dbname)
db, err := sql.Open("postgres", databaseUrl)
if err != nil {
panic(err)
}
err = db.Ping()
if err != nil {
panic(err)
}
// Setup connection pool size
// 4 is an arbitrary number for this example
db.SetMaxOpenConns(4)
Let's focus on the db.SetMaxOpenConns(N) method. According to the documentation:
// SetMaxOpenConns sets the maximum number of open connections to the database.
This means you can have at most N open connections from your process to the database (4 in this example). When the maximum number of connections is reached, goroutines will wait until an existing connection is released. Once that happens, the connection is acquired to perform whatever operation is needed.
Let's expand our example by adding concurrent workers that use the connection pool to perform transactional operations against the database:
numberOfWorkers := 2 // case numberOfWorkers < size of connection pool
// numberOfWorkers := 4 // case numberOfWorkers == size of connection pool
// numberOfWorkers := 10 // case numberOfWorkers > size of connection pool
for i := range numberOfWorkers {
go func(id int) {
log.Printf("Running worker %d\n", id)
tx, err := db.BeginTx(context.Background(), &sql.TxOptions{})
if err != nil {
log.Fatalf("worker %d failed to create tx\n", id)
}
defer tx.Rollback()
_, err = db.Exec("INSERT INTO recipes (id, name, description, created_at) VALUES ($1, $2, $3, $4)",
id, fmt.Sprintf("Pizza %d", id), "Just a pizza", time.Now())
if err != nil {
log.Fatalf("worker %d failed query\n", id)
}
err = tx.Commit()
if err != nil {
log.Fatalf("worker %d failed committing tx\n", id)
}
}(i)
}
Some of you may have already spotted something wrong with this code, but in production codebases with layers of wrappers and nested methods, such issues aren't always so clear and evident. Let's continue and see what happens when we run this code.
When we run our code with a number of workers less than the connection pool size:
2025/08/08 11:53:32 Successfully connected to the db
2025/08/08 11:53:32 Running worker 1
2025/08/08 11:53:32 Running worker 0
2025/08/08 11:53:37 worker 1 ended
2025/08/08 11:53:37 worker 0 ended
Everything works fine. Now let's increase the number of workers to 4 (equal to the connection pool size):
2025/08/08 11:59:10 Successfully connected to the db
2025/08/08 11:59:10 Running worker 3
2025/08/08 11:59:10 Running worker 0
2025/08/08 11:59:10 Running worker 1
2025/08/08 11:59:10 Running worker 2
2025/08/08 11:59:15 worker 2 ended
2025/08/08 11:59:15 worker 0 ended
2025/08/08 11:59:15 worker 1 ended
2025/08/08 11:59:15 worker 3 ended
Still working fine. Now let's increase the number of workers to exceed the connection pool size:
2025/08/08 12:00:44 Successfully connected to the db
2025/08/08 12:00:44 Running worker 9
2025/08/08 12:00:44 Running worker 3
2025/08/08 12:00:44 Running worker 7
2025/08/08 12:00:44 Running worker 4
2025/08/08 12:00:44 Running worker 0
2025/08/08 12:00:44 Running worker 2
2025/08/08 12:00:44 Running worker 8
2025/08/08 12:00:44 Running worker 5
2025/08/08 12:00:44 Running worker 6
2025/08/08 12:00:44 Running worker 1
No workers ended this time—the application entered a deadlock state.
At this point, what should be the next step to get more insights? In my case, it was using a profiler. You can achieve this in Go by instrumenting your application with pprof. One of the simplest ways to use it is by exposing a web server that serves runtime profiling data:
// Remember to blank-import the profiling package so its handlers are registered:
// import _ "net/http/pprof"
go func() {
    http.ListenAndServe("localhost:6060", nil)
}()
One interesting thing you can get from pprof, besides CPU and memory profiles, is the full goroutine stack dump by accessing http://localhost:6060/debug/pprof/goroutine?debug=2. This gives you something like:
goroutine 28 [select]:
database/sql.(*DB).conn(0xc000111450, {0x866310, 0xb10e20}, 0x1)
/usr/local/go/src/database/sql/sql.go:1369 +0x425
database/sql.(*DB).exec(0xc000111450, {0x866310, 0xb10e20}, {0x7f09d8, 0x4f}, {0xc000075f10, 0x4, 0x4}, 0xbe?)
/usr/local/go/src/database/sql/sql.go:1689 +0x54
// ... more stack trace
goroutine 33 [chan receive]:
database/sql.(*Tx).awaitDone(0xc00025e000)
/usr/local/go/src/database/sql/sql.go:2212 +0x29
created by database/sql.(*DB).beginDC in goroutine 28
goroutine 51 [chan receive]:
database/sql.(*Tx).awaitDone(0xc0000b0100)
/usr/local/go/src/database/sql/sql.go:2212 +0x29
created by database/sql.(*DB).beginDC in goroutine 25
// ... omitting the full dump for readability
By inspecting the dump more carefully, we can see the evidence of the problem:
goroutine 19 [select]: database/sql.(*DB).conn() // Waiting for connection
goroutine 20 [select]: database/sql.(*DB).conn() // Waiting for connection
goroutine 21 [select]: database/sql.(*DB).conn() // Waiting for connection
// ... and so on
The select statement is a control structure that lets a goroutine wait on multiple communication operations. Meanwhile, other goroutines are holding active transactions:
goroutine 33 [chan receive]: database/sql.(*Tx).awaitDone() // Active transaction
goroutine 51 [chan receive]: database/sql.(*Tx).awaitDone() // Active transaction
// etc.
The awaitDone() goroutines are transaction monitors that wait for the transaction to be committed, rolled back, or canceled—they're doing their job correctly.
What we have is a resource deadlock where all available database connections are held by transactions that aren't progressing, while other goroutines indefinitely wait for those same resources.
Let's examine our worker code again, focusing on this critical part:
tx, err := db.BeginTx(context.Background(), &sql.TxOptions{})
if err != nil {
log.Fatalf("worker %d failed to create tx\n", id)
}
defer tx.Rollback()
// THE BUG IS HERE ↓
_, err = db.Exec("INSERT INTO recipes (id, name, description, created_at) VALUES ($1, $2, $3, $4)",
id, fmt.Sprintf("Pizza %d", id), "Just a pizza", time.Now())
if err != nil {
log.Fatalf("worker %d failed query\n", id)
}
err = tx.Commit()
This code is:

1. Beginning a transaction with db.BeginTx(), which acquires and holds a connection from the pool
2. Then using the db client to execute the query, which tries to acquire another connection

The problem is that using the db client after creating a transaction results in double connection usage. Here's exactly how the deadlock occurs with 10 workers and a pool of 4:

1. Workers 1-4: BeginTx() acquires connections 1-4; the pool is now exhausted
2. Worker 1: db.Exec() → tries to acquire connection 5, but the pool is exhausted
3. Worker 2: db.Exec() → tries to acquire connection 6, but the pool is exhausted
4. Worker 3: db.Exec() → tries to acquire connection 7, but the pool is exhausted
5. Worker 4: db.Exec() → tries to acquire connection 8, but the pool is exhausted

Every connection is held by a transaction that is waiting for its db.Exec() to return, and every db.Exec() is waiting for a connection that will never be released. Nobody can make progress.

The issue causing all this trouble is the wrong use of db.Exec() instead of tx.Exec(). The correct way is to use the transaction handle, which reuses the connection that the transaction already holds:
tx, err := db.BeginTx(context.Background(), &sql.TxOptions{})
if err != nil {
log.Fatalf("worker %d failed to create tx\n", id)
}
defer tx.Rollback()
// FIXED: Use tx.Exec() instead of db.Exec()
_, err = tx.Exec("INSERT INTO recipes (id, name, description, created_at) VALUES ($1, $2, $3, $4)",
id, fmt.Sprintf("Pizza %d", id), "Just a pizza", time.Now())
if err != nil {
log.Fatalf("worker %d failed query\n", id)
}
err = tx.Commit()
It's remarkable how two characters (db vs tx) can halt your entire production system. While this may seem simple to spot in this example, in large production codebases it can be much harder to detect. In my case, it was randomly affecting one pod in our Kubernetes deployment at unpredictable times, especially during high load and concurrency spikes.
How can you avoid this? I see several alternatives:
1. Proper Concurrency Testing
Test your application under realistic concurrent load to spot such issues before production.
2. Code Structure Design
Structure your code so you can't accidentally open new connections within a transactional block.
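One way to enforce this at the API level (my own sketch, not from the original fix) is a small helper that owns the transaction lifecycle and only ever hands callers the tx handle, so reaching for the db client inside the block becomes impossible by construction:

// assumes: import ("context"; "database/sql")
// withTx runs fn inside a transaction; callers only receive the *sql.Tx,
// so they cannot accidentally grab a second connection from the pool.
func withTx(ctx context.Context, db *sql.DB, fn func(tx *sql.Tx) error) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op once Commit succeeds
    if err := fn(tx); err != nil {
        return err
    }
    return tx.Commit()
}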
3. Use pgxpool for Better Control
For Go applications, consider using pgxpool which provides more granular control over the connection pool. The explicit Acquire()/Release() pattern makes it much clearer when you're using connections:
conn, err := pool.Acquire(ctx)
defer conn.Release()
tx, err := conn.BeginTx(ctx, pgx.TxOptions{})
defer tx.Rollback(ctx)
// It's much harder to accidentally use pool.Exec() here
_, err = tx.Exec(ctx, "INSERT INTO recipes ...")
tx.Commit(ctx)
4. Monitoring and Alerting
I intentionally avoid mentioning palliative measures like connection timeouts, which may mask the underlying issue while causing performance degradation. Instead, implement proper liveness/readiness probes paired with alerts on restart frequency. This provides a good tradeoff between keeping the system running and being notified when something isn't behaving correctly.
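Beyond probes, one cheap signal worth exporting (a sketch, not from the original incident) is the pool statistics that database/sql already tracks. A WaitCount that keeps growing while InUse stays pinned at MaxOpenConnections is exactly the fingerprint of the deadlock described above:

// assumes: import ("log"; "time")
go func() {
    for range time.Tick(30 * time.Second) {
        s := db.Stats()
        log.Printf("pool: open=%d inUse=%d idle=%d waitCount=%d waitDuration=%s",
            s.OpenConnections, s.InUse, s.Idle, s.WaitCount, s.WaitDuration)
    }
}()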
2025-12-11 05:57:39
This is a submission for the Google AI Agents Writing Challenge: Learning Reflections
The gap between a "Hello World" agent running in a Jupyter Notebook and a reliable, production-grade system is not a step—it's a chasm (and it is not an easy one to cross).
I recently had the privilege to participate in the 5-Day AI Agents Intensive Course with Google and Kaggle. After completing the coursework and finalizing the capstone project, I realized that beyond the many valuable things in this course (excellent white papers, carefully designed notebooks, and exceptional expert panels in the live sessions), the real treasure wasn't just learning the ADK syntax—it was the architectural patterns subtly embedded within the lessons.
As an architect building production systems for over 20 years, including multi-agent workflows and enterprise integrations, I've seen firsthand where theoretical agents break under real-world constraints.
We are moving from an era of "prompt engineering" to "agent architecture" where "context engineering" is key. As with any other emerging architectural paradigm, this shift demands blueprints that ensure reliability, efficiency, and ethical safety. Without them, we risk agents that silently degrade, violate user privacy, or execute irreversible actions without oversight.
Drawing from the course and my own experience as an AI Architect, I have distilled the curriculum into four essential patterns that transform fragile prototypes into robust production systems:
Throughout this analysis, I'll apply a 6-point framework grounded in the principles of Principia Agentica—ensuring these patterns respect human sovereignty, fiduciary responsibility, and meaningful human control.
Let's do this!
In traditional software, if the output is correct, the test passes. In agentic AI, a correct answer derived from a hallucination or a dangerous logic path is a ticking time bomb.
Naive evaluation strategies often fail in production due to the non-deterministic nature of LLMs. The classic trap: the agent skips the tool call and guesses (say, yesterday's weather), yet the assert result == expected test still passes. This hides a critical failure in logic that will break as soon as the weather changes.

Day 4 of the course introduced the concept of Glass Box Evaluation. We move away from simple output verification to a hierarchical approach:
This shift treats the trajectory (Thought → Action → Observation) as the unit of truth. By evaluating the trajectory, we ensure the agent isn't just "getting lucky," but is actually reasoning correctly.
The ADK provides specific primitives to capture and score these trajectories without writing custom parsers for every test.
From adk web to evalset.json
Instead of manually writing test cases, the ADK encourages a "Capture and Replay" workflow. During development (using adk web), when you spot a successful interaction, you can persist that session state. This generates an evalset.json that captures not just the input/output, but the expected tool calls.
// Conceptual structure of an ADK evalset entry
// Traditional test: just input/output
// ADK evalset contains evalcases with invocations: input (queries) + expected_tool_use + reference (output)
{
  "name": "ask_GOOGLE_price",                        // a given name of the evaluation set
  "data": [                                          // evaluation cases are included here
    {
      "query": "What is the stock price of GOOG?",   // user input
      "reference": "The price is $175...",           // expected semantic output
      "expected_tool_use": [                         // expected trajectory
        {
          "tool_name": "get_stock_price",
          "tool_input": {                            // arguments passed to the tool
            "symbol": "GOOG"
          }
        }
      ]
    }
    // other evaluation cases ...
  ],
  "initial_session": {
    "state": {},
    "app_name": "hello_world",
    "user_id": "user_..."                            // the specific id of the user
  }
}
This JSON represents an EvalSet containing one EvalCase. Each EvalCase has a name, data (which is a list of invocations), and an optional initial_session. Each invocation within the data list includes a query, expected_tool_use, expected_intermediate_agent_responses, and a reference response.
The EvalSet object itself also includes eval_set_id, name, description, eval_cases, and creation_timestamp fields.
Configuring the Judge
In the test_config.json, we can move beyond simple string matching. The course demonstrated configuring LLM-as-a-Judge evaluators.
One recommended setup is running a TrajectoryEvaluator alongside a semantic-similarity check. The ADK allows us to define "Golden Sets" where the reasoning path is the standard, allowing the LLM judge to penalize agents that skip steps or hallucinate data, even if the final text looks plausible.

Core Configuration Components
To configure an LLM-as-a-Judge effectively, you must construct a specific input payload with four components:
ADK Default Evaluators
The ADK Evaluation Framework includes several default evaluators, accessible via the MetricEvaluatorRegistry:
- RougeEvaluator: Uses the ROUGE-1 metric to score similarity between an agent's final response and a golden response.
- FinalResponseMatchV2Evaluator: Uses an LLM-as-a-judge approach to determine if an agent's response is valid.
- TrajectoryEvaluator: Assesses the accuracy of an agent's tool use trajectories by comparing the sequence of tool calls against expected calls. It supports various match types (EXACT, IN_ORDER, ANY_ORDER).
- SafetyEvaluatorV1: Assesses the safety (harmlessness) of an agent's response, delegating to the Vertex Gen AI Eval SDK.
- HallucinationsV1Evaluator: Checks if a model response contains any false, contradictory, or unsupported claims by segmenting the response into sentences and validating each against the provided context.
- RubricBasedFinalResponseQualityV1Evaluator: Assesses the quality of an agent's final response against user-defined rubrics, using an LLM as a judge.
- RubricBasedToolUseV1Evaluator: Assesses the quality of an agent's tool usage against user-defined rubrics, employing an LLM as a judge.

These evaluators can be configured using EvalConfig objects, which specify the criteria and thresholds for assessment.
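As a rough illustration only (treat the exact keys as an assumption; they vary across ADK versions), a minimal criteria-based test_config.json pairs a trajectory threshold with a response-match threshold:

{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}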
Bias Mitigation Strategies
A major challenge is handling bias, such as the tendency for models to give average scores or prefer the first option presented:
Advanced Configuration: Agent-as-a-Judge
There's a distinction between standard LLM-as-a-Judge (which evaluates final text outputs) and Agent-as-a-Judge:
Evaluation Architectures
You can use several architectural approaches when configuring your judge:
Example Configuration
For a pairwise comparison judge, structure the prompt to output in a structured JSON format:
{
"winner": "A", // or "B" or "tie"
"rationale": "Answer A provided more specific delivery details..."
}
This structured output allows you to programmatically parse the judge's decision and calculate metrics like win/loss rates at scale.
Analogy
You can think of configuring an LLM-as-a-Judge like setting up a blind taste test. If you just hand a judge a cake and ask "Is this good?", they might be polite and say "Yes." But if you provide them with a Golden Answer (a cake baked by a master chef) and use Pairwise Comparison (ask "Which of these two is better?"), you force them to make a critical distinction, resulting in far more accurate and actionable feedback.
Moving this pattern from a notebook to a live system requires handling scale and cost.
- Treat each captured evalset.json as a regression test.

"The Trajectory is the Truth" is the technical implementation of Fiduciary Responsibility. We cannot claim an agent is acting in the user's best interest if we only validate the result (the "what") while ignoring the process (the "how"). We must ensure the agent isn't achieving the right ends through manipulative, inefficient, or unethical means.
Concrete Failure Scenario:
Consider a hiring agent that filters job candidates. Without trajectory validation, it could discriminate based on protected characteristics (age, gender, ethnicity) during the filtering process, yet pass all output tests by producing a "diverse" final shortlist through cherry-picking. The bias hides in the how—which resumes were read, which criteria were weighted, which candidates were never considered. Output validation alone cannot detect this algorithmic discrimination. Only trajectory evaluation exposes the unethical reasoning path.
In practice: use the TrajectoryEvaluator with EvalSet objects to capture expected tool calls alongside expected outputs. Configure LLM-as-a-Judge with Golden Sets and pairwise comparison to avoid evaluation bias.

Validating the how is critical, but what happens when the reasoning path spans not just one conversation turn, but weeks or months? An agent that reasons correctly in the moment can still fail catastrophically if it forgets what it learned yesterday. This brings us to our second pattern: managing the agent's memory architecture.
Although models like Gemini 1.5 have introduced massive context windows, treating context as infinite is an architectural anti-pattern.
Consider a Travel Agent Bot: In Session 1, the user mentions a "shellfish allergy." By Session 10, months later, that critical fact is buried under thousands of tokens of hotel searches and flight comparisons.
This might lead to two very concrete failures:
We must distinguish between the Workbench and the Filing Cabinet.
The ADK moves memory management from manual implementation to configuration.
Session Hygiene via Compaction
In the ADK, we don't manually trim strings. We configure the agent to handle its own hygiene using EventsCompactionConfig.
from google.adk.agents.base_agent import BaseAgent
from google.adk.apps.app import App, EventsCompactionConfig
from google.adk.apps.llm_event_summarizer import LlmEventSummarizer # Assuming this is your summarizer
# Define a simple BaseAgent for the example
class MyAgent(BaseAgent):
name: str = "my_agent"
description: str = "A simple agent."
def call(self, context, content):
pass
# Create an instance of LlmEventSummarizer or your custom summarizer
my_summarizer = LlmEventSummarizer()
# Create an EventsCompactionConfig
events_compaction_config_instance = EventsCompactionConfig(
summarizer=my_summarizer,
compaction_interval=5,
overlap_size=2
)
# Create an App instance with the EventsCompactionConfig
my_app = App(
name="my_application",
root_agent=MyAgent(),
events_compaction_config=events_compaction_config_instance
)
print(my_app.model_dump_json(indent=2))
Persistence: From RAM to DB
In notebooks, we often use InMemorySessionService. This is dangerous for production because a container restart wipes the conversation. The architectural shift is moving to DatabaseSessionService (backed by SQL or Firestore) which persists the Session object state, allowing users to resume conversations across devices.
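A minimal sketch of that swap, assuming google.adk.sessions exposes DatabaseSessionService and that Runner accepts a session_service argument as in the course labs (the connection string is hypothetical):

from google.adk.runners import Runner
from google.adk.sessions import DatabaseSessionService

# Persist sessions in PostgreSQL instead of process memory,
# so conversations survive container restarts and scale across replicas.
session_service = DatabaseSessionService(
    db_url="postgresql://agent:secret@db-host:5432/agent_sessions"
)

# runner = Runner(agent=root_agent, app_name="travel_agent", session_service=session_service)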
The Memory Consolidation Pipeline
Day 3b introduced the framework for moving from raw storage to intelligent consolidation. This is where the "Filing Cabinet" becomes smart. The workflow is an LLM-driven ETL pipeline with four stages:
Extraction & Filtering: An LLM analyzes the conversation to extract meaningful facts, guided by developer-defined Memory Topics:
The LLM extracts only facts matching these topics.
# Conceptual configuration (Vertex AI Memory Bank, Day 5)
memory_topics = [
"user_preferences", # "Prefers window seats"
"dietary_restrictions", # "Allergic to shellfish"
"project_context" # "Leading Q4 marketing campaign"
]
Consolidation (The "Transform" Phase): The LLM retrieves existing memories and decides:
Storage: Consolidated memories persist to a vector database for semantic retrieval.
Note: While Day 3b uses InMemoryMemoryService to teach the API, it stores raw events without consolidation. For production-grade consolidation, we look to the Vertex AI Memory Bank integration introduced in Day 5.
Retrieval Strategies: Proactive vs. Reactive
The course highlighted two distinct patterns for getting data out of the Filing Cabinet:
- Proactive (preload_memory): Injects relevant user facts into the system prompt before the model generates a response. Best for high-frequency preferences (e.g., "User always prefers aisle seats").
- Reactive (load_memory): Gives the agent a tool to search the database. The agent decides if it needs to look something up. Best for obscure facts, to save tokens.
- The trade-off: While preload_memory reduces latency (no extra tool roundtrip), it increases input token costs on every turn. load_memory is cheaper on average but adds latency when retrieval is needed.

This architecture embodies Privacy by Design. By distinguishing between the transient session and persistent memory, we can implement rigorous "forgetting" protocols.
We scrub Personally Identifiable Information (PII) from the session log before it undergoes consolidation into long-term memory, ensuring we act as fiduciaries of user data rather than creating an unmanageable surveillance log.
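A minimal sketch of such a scrubbing hook (scrub_pii and the patterns below are hypothetical helpers, not ADK APIs; in production you would plug in a real detector such as Cloud DLP):

import re

# Illustrative patterns only; real PII detection needs far more than two regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scrub_pii(text: str) -> str:
    """Replace obvious PII with typed placeholders before facts reach long-term memory."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

# Applied to each extracted fact before the consolidation ("Transform") step:
# facts = [scrub_pii(fact) for fact in extracted_facts]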
Concrete Failure Scenario:
Imagine a healthcare agent that remembers a patient mentioned their HIV status in Session 1. Without a dual-layer architecture, this fact sits in plain text in the session log forever, accessible to any system with database read permissions. If the system is breached, or if a support engineer needs to debug a session, the patient's private health information is exposed. Worse, without consolidation logic, the system doesn't know to delete this information if the patient later says "I was misdiagnosed—I don't have HIV." The agent treats every utterance as equally permanent, creating a privacy nightmare where sensitive data proliferates uncontrollably across logs and backups.
In practice: configure EventsCompactionConfig for session hygiene and implement a PII scrubber in your ETL pipeline before consolidation. Leverage Vertex AI Memory Bank for production-grade semantic memory with built-in privacy controls.

With robust evaluation validating our agent's reasoning and a dual-layer memory preserving context over time, we might assume our system is production-ready. But there's a hidden fragility: these capabilities are only as good as the tools and data sources the agent can access. When every integration is a bespoke API wrapper, scaling becomes a maintenance nightmare. This brings us to the third pattern: decoupling agents from their dependencies through standardized protocols.
We are facing an "N×M Integration Trap."
Imagine building a Customer Support Agent. It needs to check GitHub for bugs, message Slack for alerts, and update Jira tickets. Without a standard protocol, you write three custom API wrappers. When GitHub changes an endpoint, your agent breaks.
Now, multiply this across an enterprise. You have 10 different agents needing access to 20 different data sources. You are suddenly maintaining 200 brittle integration points. Furthermore, these agents become isolated silos—the Sales Agent has no way to dynamically discover or ask the Engineering Agent for help because they speak different "languages."
The solution is to invert the dependency. Instead of the agent knowing about the specific tool implementation, we adopt a Protocol-First Architecture.
The ADK abstracts the heavy lifting of these protocols.
Connecting Data via MCP
We don't write API wrappers. We instantiate an McpToolset. The ADK handles the handshake, lists the available tools, and injects their schemas into the context window automatically.
The Model Context Protocol (MCP) is used to connect an agent to external tools and data sources without writing custom API clients. In ADK, we use McpToolset to wrap an MCP server configuration.
Example: Connecting an agent to the "Everything" MCP server:
from google.adk.agents import LlmAgent
from google.adk.tools import McpToolset
from google.adk.tools.mcp_tool.mcp_session_manager import StdioConnectionParams
from google.adk.tools.mcp_tool.mcp_session_manager import StdioServerParameters
from google.adk.runners import Runner # Assuming Runner is defined elsewhere
# 1. Define the connection to the MCP Server
# Here we use 'npx' to run a Node-based MCP server directly
mcp_toolset = McpToolset(
connection_params=StdioConnectionParams(
server_params=StdioServerParameters(
command="npx",
args=["-y", "@modelcontextprotocol/server-everything"]
),
timeout=10.0 # Optional: specify a timeout for connection establishment
),
# Optionally filter to specific tools provided by the server
tool_filter=["getTinyImage"]
)
# 2. Add the MCP tools to your Agent
agent = LlmAgent(
name="image_agent",
model="gemini-2.0-flash",
instruction="You can generate tiny images using the tools provided.",
# The toolset exposes the MCP capabilities as standard ADK tools
tools=[mcp_toolset] # tools expects a list of ToolUnion
)
# 3. Run the agent
# The agent can now call 'getTinyImage' as if it were a local Python function
runner = Runner(agent=agent, ...) # Fill in Runner details to run
Delegating via A2A (Agent-to-Agent)
The Agent2Agent (A2A) protocol is used to enable collaboration between different autonomous agents, potentially running on different servers or frameworks.
A. Exposing an Agent (to_a2a)
This converts a local ADK agent into an A2A-compliant server that publishes an Agent Card.
To make an agent discoverable, we wrap it using the to_a2a() utility. This generates an Agent Card—a standardized manifest hosted at .well-known/agent-card.json.
from google.adk.agents import LlmAgent
from google.adk.a2a.utils.agent_to_a2a import to_a2a
from google.adk.tools.tool_context import ToolContext
from google.genai import types
import random
# Define the tools
def roll_die(sides: int, tool_context: ToolContext) -> int:
"""Roll a die and return the rolled result.
Args:
sides: The integer number of sides the die has.
tool_context: the tool context
Returns:
An integer of the result of rolling the die.
"""
result = random.randint(1, sides)
if not 'rolls' in tool_context.state:
tool_context.state['rolls'] = []
tool_context.state['rolls'] = tool_context.state['rolls'] + [result]
return result
async def check_prime(nums: list[int]) -> str:
"""Check if a given list of numbers are prime.
Args:
nums: The list of numbers to check.
Returns:
A str indicating which number is prime.
"""
primes = set()
for number in nums:
number = int(number)
if number <= 1:
continue
is_prime = True
for i in range(2, int(number**0.5) + 1):
if number % i == 0:
is_prime = False
break
if is_prime:
primes.add(number)
return (
'No prime numbers found.'
if not primes
else f"{', '.join(str(num) for num in primes)} are prime numbers."
)
# 1. Define your local agent with relevant tools and instructions
# This example uses the 'hello_world' agent's logic for rolling dice and checking primes.
root_agent = LlmAgent(
model='gemini-2.0-flash',
name='hello_world_agent',
description=(
'hello world agent that can roll a die of 8 sides and check prime'
' numbers.'
),
instruction="""
You roll dice and answer questions about the outcome of the dice rolls.
When you are asked to roll a die, you must call the roll_die tool with the number of sides.
When checking prime numbers, call the check_prime tool with a list of integers.
""",
tools=[
roll_die,
check_prime,
],
generate_content_config=types.GenerateContentConfig(
safety_settings=[
types.SafetySetting(
category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
threshold=types.HarmBlockThreshold.OFF,
),
]
),
)
# 2. Convert to A2A application
# This automatically generates the Agent Card and sets up the HTTP endpoints
a2a_app = to_a2a(root_agent, host="localhost", port=8001)
# To run this application, save it as a Python file (e.g., `my_a2a_agent.py`)
# and execute it using uvicorn:
# uvicorn my_a2a_agent:a2a_app --host localhost --port 8001
The Agent Card (Discovery):
The Agent Card is a standardized JSON file that acts as a "business card" for an agent, allowing other agents to discover its capabilities, security requirements, and endpoints.
{
"name": "hello_world_agent",
"description": "hello world agent that can roll a die of 8 sides and check prime numbers. You roll dice and answer questions about the outcome of the dice rolls. When you are asked to roll a die, you must call the roll_die tool with the number of sides. When checking prime numbers, call the check_prime tool with a list of integers.",
"doc_url": null,
"url": "http://localhost:8001/",
"version": "0.0.1",
"capabilities": {},
"skills": [
{
"id": "hello_world_agent",
"name": "model",
"description": "hello world agent that can roll a die of 8 sides and check prime numbers. I roll dice and answer questions about the outcome of the dice rolls. When I am asked to roll a die, I must call the roll_die tool with the number of sides. When checking prime numbers, call the check_prime tool with a list of integers.",
"examples": null,
"input_modes": null,
"output_modes": null,
"tags": [
"llm"
]
},
{
"id": "hello_world_agent-roll_die",
"name": "roll_die",
"description": "Roll a die and return the rolled result.",
"examples": null,
"input_modes": null,
"output_modes": null,
"tags": [
"llm",
"tools"
]
},
{
"id": "hello_world_agent-check_prime",
"name": "check_prime",
"description": "Check if a given list of numbers are prime.",
"examples": null,
"input_modes": null,
"output_modes": null,
"tags": [
"llm",
"tools"
]
}
],
"default_input_modes": [
"text/plain"
],
"default_output_modes": [
"text/plain"
],
"supports_authenticated_extended_card": false,
"provider": null,
"security_schemes": null
}
B. Consuming a Remote Agent (RemoteA2aAgent)
To consume this, the parent agent simply points to the URL. The ADK treats the remote agent exactly like a local tool.
This allows a local agent to delegate tasks to a remote agent by reading its Agent Card.
from google.adk.agents import LlmAgent
from google.adk.agents.remote_a2a_agent import AGENT_CARD_WELL_KNOWN_PATH
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent
# 1. Define the remote agent interface
# Points to the agent card served at AGENT_CARD_WELL_KNOWN_PATH on the running A2A server
prime_agent = RemoteA2aAgent(
name="remote_prime_agent",
description="Agent that handles checking if numbers are prime.",
agent_card=f"http://localhost:8001/{AGENT_CARD_WELL_KNOWN_PATH}"
)
# 2. Use the remote agent as a sub-agent
root_agent = LlmAgent(
name="coordinator",
model="gemini-2.0-flash", # Explicitly define the model
instruction="""
You are a coordinator agent.
Your primary task is to delegate any requests related to prime number checking to the 'remote_prime_agent'.
Do not attempt to check prime numbers yourself.
Ensure to pass the numbers to be checked to the 'remote_prime_agent' correctly.
Clarify the results from the 'remote_prime_agent' to the user.
""",
sub_agents=[prime_agent]
)
# You can then use this root_agent with a Runner, for example:
# from google.adk.runners import Runner
# runner = Runner(agent=root_agent)
# async for event in runner.run_async(user_id="test_user", session_id="test_session", new_message="Is 13 a prime number?"):
# print(event)
While both protocols connect AI systems, they operate at different levels of abstraction.
When to use which?
| Feature | MCP (Model Context Protocol) | A2A (Agent2Agent Protocol) |
|---|---|---|
| Primary Domain | Tools & Resources. | Autonomous Agents. |
| Interaction | "Do this specific thing". Stateless execution of functions (e.g., "query database," "fetch file"). | "Achieve this complex goal". Stateful, multi-turn collaboration where the remote agent plans and reasons. |
| Abstraction | Low-level plumbing. Connects LLMs to data sources and APIs (like a USB-C port for AI). | High-level collaboration. Connects intelligent agents to other intelligent agents to delegate responsibility. |
| Standard | Standardizes tool definitions, prompts, and resource reading. | Standardizes agent discovery (Agent Card), task lifecycles, and asynchronous communication. |
| Analogy | Using a specific wrench or diagnostic scanner. | Asking a specialized mechanic to fix a car engine. |
How they work together:
An application might use A2A to orchestrate high-level collaboration between a "Manager Agent" and a "Coder Agent."
The "Coder Agent," in turn, uses MCP internally to connect to GitHub tools and a local file system to execute the work.
Moving protocols from stdio (local process) to HTTP (production network) introduces critical security challenges.
Protocol-first architectures are the technical foundation for Human Sovereignty and Data Portability.
Standardizing the interface (MCP) helps us prevent vendor lock-in, among many other advantages. A user can swap out a "Google Drive" data source for a "Local Hard Drive" source without breaking the agent, ensuring the user—not the platform—controls where the data lives and how it is accessed.
This abstraction acts as a bulwark against algorithmic lock-in, ensuring that an agent's reasoning capabilities are decoupled from proprietary tool implementations, preserving the user's freedom to migrate their digital ecosystem without losing their intelligent assistants.
Concrete Failure Scenario:
Imagine a small business builds a customer service agent tightly coupled to Salesforce's proprietary API. Over three years, the agent accumulates thousands of lines of custom integration code. When Salesforce raises prices 300%, the business wants to migrate to HubSpot—but their agent is fundamentally Salesforce-shaped. Every tool, every data query, every workflow assumption is hardcoded. Migration means rebuilding the agent from scratch, which the business can't afford. They're trapped. This is algorithmic lock-in—not just vendor lock-in of data, but vendor lock-in of intelligence. Without protocol-first design, the agent becomes a hostage to the platform, and the user loses sovereignty over their own automation.
In practice: use McpToolset to connect agents to data sources via the Model Context Protocol. Use RemoteA2aAgent and to_a2a() for agent-to-agent delegation. Cache tool definitions at startup for performance, but keep execution dynamic.

We now have agents that reason correctly, remember what matters, and connect to any tool or peer through standard protocols. But there's one final constraint that threatens to unravel everything: the assumption that every interaction completes in a single request-response cycle. Real business workflows don't work that way. Approvals take hours. External APIs time out. Humans need time to think. This is where our fourth pattern becomes essential: teaching agents to pause, persist, and resume across the boundaries of time itself.
This is perhaps the most critical pattern for integrating agents into real-world business logic where human approval is non-negotiable.
Naive agents fall into the "Stateless Trap."
Imagine a Procurement Agent tasked with ordering 1,000 servers.
The workflow runs sequentially: the agent analyzes vendor quotes, drafts a purchase proposal, waits for the CFO's approval, and only then places the order. The critical approval step from the CFO sits right in the middle.
If the CFO takes 2 hours to review the proposal, a standard HTTP request will time out in seconds. When the CFO finally clicks "Approve," the agent has lost its memory. It doesn't know which vendor it selected, the quote ID, or why it made that recommendation. It essentially has to start over.
The solution is a Pause, Persist, Resume architecture.
- Pause: The tool emits a confirmation request (adk_request_confirmation) and halts execution immediately, releasing compute resources.
- Persist: The paused run is serialized and stored, keyed by its invocation_id.
- Resume (invocation_id): This ID becomes the critical "bookmark." When the human acts, the system rehydrates the agent using this ID, allowing it to resume exactly where it left off—inside the tool call—rather than restarting the conversation.

The ADK provides the ToolContext and App primitives to handle this complexity without writing custom state machines.
The Three-State Tool Pattern
Inside your tool definition, you must handle three scenarios:
def place_order(num_units: int, tool_context: ToolContext) -> dict:
# Scenario 1: Small orders auto-approve
if num_units <= 5:
return {"status": "approved", "order_id": f"ORD-{num_units}"}
# Scenario 2: First call - request approval (PAUSE)
# The tool checks if confirmation exists. If not, it requests it and halts.
if not tool_context.tool_confirmation:
tool_context.request_confirmation(
hint=f"Large order: {num_units} units. Approve?",
payload={"num_units": num_units}
)
return {"status": "pending"}
# Scenario 3: Resume - check decision (ACTION)
# The tool runs again, but this time confirmation exists.
if tool_context.tool_confirmation.confirmed:
return {"status": "approved", "order_id": f"ORD-{num_units}"}
else:
return {"status": "rejected"}
- The if num_units <= 5: block handles immediate, non-long-running scenarios, which is a common pattern for tools that can quickly resolve simple requests.
- The if not tool_context.tool_confirmation: block leverages tool_context.request_confirmation() to signal that the tool requires external input to proceed. The return of {"status": "pending"} indicates that the operation is not yet complete.
- The if tool_context.tool_confirmation.confirmed: block demonstrates how the tool re-executes, this time finding tool_context.tool_confirmation present, indicating that the external input has been provided. The tool then acts based on the confirmed status. The Human-in-the-Loop Workflow Samples also highlight how the application constructs a types.FunctionResponse with the updated status and sends it back to the agent to resume its task.

The Application Wrapper
To enable persistence, we wrap the agent in an App with ResumabilityConfig. This tells the ADK to automatically handle state serialization.
from google.adk.apps import App, ResumabilityConfig
app = App(
root_agent=procurement_agent,
resumability_config=ResumabilityConfig(is_resumable=True)
)
The Workflow Loop
The runner loop must detect the pause and, crucially, use the same invocation_id to resume.
# 1. Initial Execution
async for event in runner.run_async(...):
events.append(event)
# 2. Detect Pause & Get ID
approval_info = check_for_approval(events)
if approval_info:
# ... Wait for user input (hours/days) ...
user_decision = get_user_decision() # True/False
# 3. Resume with INTENT
# We pass the original invocation_id to rehydrate state
async for event in runner.run_async(
invocation_id=approval_info["invocation_id"],
new_message=create_approval_response(user_decision)
):
# Agent continues execution from inside place_order()
pass
This workflow shows the mechanism for resuming an agent's execution:
1. The initial runner.run_async() call initiates the agent's interaction, which eventually leads to the place_order tool returning a "pending" status.
2. The application detects the pause and records the invocation_id. Check the Invocation Context and State Management code wiki section to see how InvocationContext tracks an agent's state and supports resumable operations.
3. Once the human decides, the application calls runner.run_async() again with the same invocation_id. This tells the ADK to rehydrate the session state and resume the execution from where it left off, providing the new message (the approval decision) as input. This behavior is used in the Human-in-the-Loop Workflow Samples, where the runner orchestrates agent execution and handles multi-agent coordination.

Production notes:
- InMemorySessionService is insufficient for production resumability because a server restart kills pending approvals. You must use a persistent store like Redis or PostgreSQL to save the serialized agent state.
- The adk_request_confirmation event should trigger a real-time notification (via WebSockets) to the user's frontend, rendering an "Approve/Reject" card.

This pattern is the technical implementation of Meaningful Human Control.
It ensures high-stakes actions (Agency) remain subservient to human authorization (Sovereignty), preventing "rogue actions" where an agent executes irreversible decisions (like spending budget) without explicit oversight.
Concrete Failure Scenario:
Imagine a financial trading agent receives a signal to liquidate a portfolio position. Without resumability, the agent operates in a stateless, atomic transaction: detect signal → execute trade. There's no pause for human review. If the signal is based on a data glitch (a "flash crash"), or if market conditions have changed in the seconds between signal and execution, the agent completes an irreversible $10M trade that wipes out a quarter's earnings. The human operator sees the confirmation after the damage is done. Worse, if the system crashes mid-execution, the agent loses context and might try to execute the same trade twice, compounding the disaster. Without Meaningful Human Control embedded in the architecture, the agent becomes a runaway train.
In practice: use ToolContext.request_confirmation() for tools that need approval. Configure ResumabilityConfig in your App to enable state persistence. Use the invocation_id to resume execution from the exact point of interruption. Store state in Redis or PostgreSQL, never in-memory.

The Google & Kaggle Intensive was a masterclass not just in coding, but in thinking.
Building agents is not just about chaining prompts; it is about designing resilient systems that can handle the messiness of the real world.
If you're moving your first agent from prototype to production, consider implementing these patterns in order:
1. Start with evaluation: capture adk web sessions, configure a TrajectoryEvaluator, and establish your evaluation baseline before writing another line of agent code.
2. Then layer in memory, protocols, and finally resumability: design around invocation_id and ToolContext.request_confirmation() from day one.

As we move forward, our job as architects is to ensure these systems are not just smart, but reliable, efficient, and ethical.
We are not just building tools—we are defining the interface between human intention and machine action. Every architectural decision we make either preserves or erodes human sovereignty, privacy, and meaningful control.
When you choose to validate trajectories, you're not just improving test coverage—you're building fiduciary responsibility into the system.
When you separate session from memory, you're not just optimizing token costs—you're designing for privacy by default.
When you adopt MCP and A2A, you're not just reducing integration complexity—you're preserving user freedom from algorithmic lock-in.
When you implement resumability, you're not just handling timeouts—you're enforcing meaningful human control over consequential actions.
These patterns are not neutral technical choices. They are ethical choices encoded in architecture.
Let's build responsibly.
2025-12-11 05:57:34
hello world!
oh wait, that's what my stupid little test C or bash or python whatever programs say..
umm... Hello, dev.to. More to come.. lol, I fscking hope
2025-12-11 05:49:21
The ChatGPT Apps SDK doesn’t offer a comprehensive breakdown of app display behavior on all Display Modes & screen widths, so I figured I’d do so here.
Inline display mode inserts your resource in the flow of the conversation. Your App iframe is inserted in a div that looks like the following:
<div class="no-scrollbar relative mb-2 @w-sm/main:w-full mx-0 max-sm:-mx-(--thread-content-margin) max-sm:w-[100cqw] max-sm:overflow-hidden overflow-visible">
<div class="relative overflow-hidden h-full" style="height: 270px;">
<iframe class="h-full w-full max-w-full">
<!-- Your App -->
</iframe>
</div>
</div>
The height of the div is fixed to the height of your Resource, and your Resource can be as tall as you want (I tested up to 20k px). The window.openai.maxHeight global (aka useMaxHeight hook) has been undefined by ChatGPT in all of my tests, and seems to be unused for this display mode.
Fullscreen display mode takes up the full conversation space, below the ChatGPT header/nav. This nav converts to the title of your application centered with the X button to exit fullscreen aligned left. Your App iframe is inserted in a div that looks like the following:
<div class="no-scrollbar fixed start-0 end-0 top-0 bottom-0 z-50 mx-auto flex w-auto flex-col overflow-hidden">
<div class="border-token-border-secondary bg-token-bg-primary sm:bg-token-bg-primary z-10 grid h-(--header-height) grid-cols-[1fr_auto_1fr] border-b px-2">
<!-- ChatGPT header / nav -->
</div>
<div class="relative overflow-hidden flex-1">
<iframe class="h-full w-full max-w-full">
<!-- Your App -->
</iframe>
</div>
</div>
As with inline mode, your Resource can be as tall as you want (I tested up to 20k px). The window.openai.maxHeight global (aka useMaxHeight hook) has been undefined by ChatGPT in all of my tests, and seems to be unused for this display mode as well.
PiP display mode inserts your resource absolutely, above the conversation. Your App iframe is inserted in a div that looks like the following:
<div class="no-scrollbar @w-xl/main:top-4 fixed start-4 end-4 top-4 z-50 mx-auto max-w-(--thread-content-max-width) sm:start-0 sm:end-0 sm:top-(--header-height) sm:w-full overflow-visible" style="max-height: 480.5px;">
<div class="relative overflow-hidden h-full rounded-2xl sm:rounded-3xl shadow-[0px_0px_0px_1px_var(--border-heavy),0px_6px_20px_rgba(0,0,0,0.1)] md:-mx-4" style="height: 270px;">
<iframe class="h-full w-full max-w-full">
<!-- Your App -->
</iframe>
</div>
</div>
This is the only display mode that uses the window.openai.maxHeight global (aka useMaxHeight hook). Your iframe can assume any height it likes, but content will be scrollable past the maxHeight setting, and the PiP window will not expand beyond that height.
Further, note that PiP is not supported on mobile screen widths and instead coerces to the fullscreen display mode.
Practically speaking, each display mode acts like a different client, and your App will have to respond accordingly. The good news is that the only required display mode is inline, which makes our lives easier.
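For example, a small height helper can treat maxHeight as the single source of truth (a TypeScript sketch; the window.openai typing below is a local assumption for illustration, and only maxHeight is used since that's the only global observed above):

// Declare the subset of the host global we rely on (assumed shape).
declare global {
  interface Window {
    openai?: { maxHeight?: number };
  }
}

export function applyHeightCap(container: HTMLElement): void {
  const maxHeight = window.openai?.maxHeight;
  if (typeof maxHeight === "number") {
    // PiP: content taller than maxHeight scrolls inside the floating window
    container.style.maxHeight = `${maxHeight}px`;
    container.style.overflowY = "auto";
  } else {
    // Inline / fullscreen: maxHeight was undefined in the tests above
    container.style.maxHeight = "";
    container.style.overflowY = "visible";
  }
}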
For interactive visuals of each display mode, check out the sunpeak ChatGPT simulator!
To get started building ChatGPT Apps with the sunpeak framework, check out the sunpeak documentation.
If you found this helpful, please star us on GitHub!