
Architecting Enterprise grade Multi‑Agent AI with AWS Strands & Amazon Bedrock AgentCore

2025-12-27 14:23:47

Introduction

As enterprises increasingly adopt AI to tackle real business challenges, one thing becomes clear: single, monolithic AI assistants are no longer sufficient. Real-world workflows often span multiple systems, data sources, and business domains, from CRM platforms like Salesforce to ticketing tools like Jira, internal knowledge bases, and custom applications. Handling this complexity requires modular, secure, and orchestrated multi-agent systems rather than a single assistant trying to do everything.

AWS offers a powerful approach to building such systems through AWS Strands Agent Patterns, integrated with Amazon Bedrock AgentCore, a managed platform that provides runtime services, identity management, memory, observability, and secure tool access.

The promise of generative AI lies not in standalone assistants but in multi-agent systems that collaborate intelligently across specialized tasks. As enterprises scale, workflows become too complex for a single agent: juggling specialized reasoning, robust error handling, secure access to sensitive tools, and context retention across sessions quickly exceeds the capacity of a monolithic system.

The solution is to design AI agents like microservices: modular, composable, and orchestrated. AWS provides both open-source frameworks and managed services that make such architectures production-ready.

In this post, we’ll explore how to build enterprise-grade multi-agent AI systems using AWS Strands Agents and Amazon Bedrock AgentCore. We’ll cover four core Strands patterns:

Agents as Tools: enabling specialized agents to perform focused tasks on behalf of others.

Swarms: coordinating multiple agents to solve complex problems collaboratively.

Agent Graphs: structuring dependencies and information flow between agents.

Workflows: orchestrating multi-agent processes to execute end-to-end business operations.

Along the way, we’ll include Python examples and architectural insights showing how AgentCore’s Identity and Memory services empower agents with secure access, context retention, and session-aware behavior, all essential for production-grade AI systems in enterprise environments.

Before diving into the patterns, let’s set some context.

The Challenge of Enterprise Complexity

In modern enterprises, business-critical data is scattered across multiple systems:

  • CRM platforms like Salesforce manage sales, accounts, and pipelines.
  • Knowledge bases such as SharePoint, Confluence, or custom repositories store organizational knowledge.
  • Operational systems like Jira, ServiceNow, and Workday handle workflows, tickets, and HR processes.
  • Analytics platforms provide structured data for decision-making.
  • Identity systems like Okta or Azure AD govern authentication and secure access.

No single AI agent can reason effectively across all these domains. Enterprises need:

  • Domain specialization: Each agent focuses on a specific area of expertise.

  • Structured decision flows: Clear pathways for reasoning and action.

  • Policy and access control: Secure, compliant system interactions.

  • Integrated memory and context: Agents retain relevant knowledge consistently across sessions.

Strands agents break reasoning into modular, focused units, enabling multi-agent architectures where each agent specializes in specific domains. But to operate reliably at enterprise scale, they require robust infrastructure, and this is where AWS AgentCore comes in.

AgentCore: AWS’s Enterprise Ready Multi-Agent Platform
Turning Agents into Enterprise Microservices

Amazon Bedrock AgentCore provides the foundation for deploying secure, autonomous agents at scale:

Runtime: Serverless execution environments with session isolation, supporting long-running workloads.

Memory: Persistent context storage that maintains agent knowledge across sessions.

Identity: Fine-grained authentication and access control integrated with enterprise IdPs like Okta or Cognito.

Observability: Integrated metrics, logs, and traces with CloudWatch and OpenTelemetry.

These capabilities allow multi-agent systems to operate with consistent security, memory, monitoring, and access controls — all without building custom infrastructure from scratch.

Gateway: Unified Tool Access

The AgentCore Gateway provides a unified connectivity layer for multi-agent systems:

  • Single entry point for tools: Converts APIs, Lambda functions, AWS services, and even existing MCP servers into agent-ready tools.
  • Protocol translation: Automatically converts MCP requests to REST, Lambda, or other endpoints.
  • Security and credential management: Handles authentication and secure access for agents invoking tools.
  • Semantic tool discovery: Agents can find and use the right tools based on context.
  • Reduced integration overhead: Eliminates the need for custom MCP server SDKs or extensive glue code.

By combining Strands agents with AgentCore and Gateway, enterprises can deploy modular, secure multi agent systems that:

  • Reason over multiple domains and systems simultaneously.
  • Maintain persistent memory and context.
  • Operate securely with enterprise-grade identity and access controls.
  • Access internal and external tools via a unified interface without building custom SDKs.

In short, AWS AgentCore with Gateway transforms multi-agent AI into enterprise-grade microservices capable of reasoning, acting, and collaborating across complex workflows.

1. Agents as Tools (Hierarchical Delegation)

The Agents as Tools pattern mirrors hierarchical microservice architectures. A top-level orchestrator delegates tasks to specialized agents, each responsible for a specific domain. This approach is ideal when tasks require distinct expertise, such as IT support, HR, or finance.

Enterprise Scenario: An IT helpdesk receives tickets spanning infrastructure, access management, and application support. Each ticket must be routed to the correct specialist.


from strands import Agent, tool

# Specialized IT agents exposed as callable tools via the @tool decorator.
# The tool names referenced below (server_diagnostics, okta_api, etc.) are
# placeholders for your own tool implementations.

@tool
def infrastructure_agent(ticket: str) -> str:
    """Diagnose infrastructure issues across servers, networks, and cloud resources."""
    agent = Agent(
        system_prompt="You are an infrastructure specialist. Diagnose servers, networks, and cloud resources.",
        tools=[server_diagnostics, network_analyzer, aws_console_api]
    )
    return str(agent(ticket))

@tool
def access_management_agent(request: str) -> str:
    """Handle user access, permissions, and authentication requests."""
    agent = Agent(
        system_prompt="You are an IAM specialist. Manage user access, permissions, and authentication.",
        tools=[active_directory, okta_api, permission_validator]
    )
    return str(agent(request))

@tool
def application_support_agent(issue: str) -> str:
    """Support enterprise applications such as Salesforce and SAP."""
    agent = Agent(
        system_prompt="You are an application support specialist for Salesforce, SAP, and enterprise apps.",
        tools=[app_logs_analyzer, knowledge_base_search, vendor_api]
    )
    return str(agent(issue))

# Supervisor orchestrator
it_coordinator = Agent(
    system_prompt="""You are the IT Support Coordinator. Analyze incoming tickets, 
    delegate to specialized agents, and consolidate responses. Update ServiceNow after resolution.""",
    tools=[infrastructure_agent, access_management_agent, application_support_agent]
)

# Process example ticket
ticket = "User cannot access SAP system after password reset."
response = it_coordinator(ticket)
print(response)

Key Benefits:

  • Faster and more accurate resolutions by delegating to domain experts
  • Scalability: new specialists can be added without changing the orchestrator
  • Knowledge preservation: each agent maintains domain expertise

Use Cases: Enterprise IT, customer support, multi-department workflows

2. Swarms (Peer Collaboration)

Swarms implement decentralized peer-to-peer collaboration. Agents iteratively exchange information, refine each other’s outputs, and collectively produce a result. This pattern excels for multi-stakeholder decision making or complex analysis.

Enterprise Scenario: HR teams must provide a policy recommendation for a senior engineer requesting remote work across states. The analysis must consider compensation, benefits, compliance, and company culture.


from strands import Agent
from strands.multiagent import Swarm
from strands.models import BedrockModel

# HR specialist agents (tool names below are placeholders for your own implementations)
comp_agent = Agent(
    name="compensation_specialist",
    system_prompt="Analyze salary bands, bonuses, and market rates.",
    model=BedrockModel(model_id="us.amazon.nova-pro-v1:0"),
    tools=[salary_database, market_data_api, equity_calculator]
)

benefits_agent = Agent(
    name="benefits_specialist",
    system_prompt="Evaluate health insurance, 401k, PTO, and perks.",
    model=BedrockModel(model_id="us.amazon.nova-pro-v1:0"),
    tools=[benefits_catalog, cost_calculator, provider_api]
)

compliance_agent = Agent(
    name="compliance_specialist",
    system_prompt="Verify labor law compliance across states.",
    model=BedrockModel(model_id="us.amazon.nova-pro-v1:0"),
    tools=[legal_database, regulation_checker, audit_log]
)

culture_agent = Agent(
    name="culture_specialist",
    system_prompt="Assess alignment with company values and team dynamics.",
    model=BedrockModel(model_id="us.amazon.nova-pro-v1:0"),
    tools=[employee_surveys, culture_metrics, dei_guidelines]
)

# Configure HR policy swarm
hr_swarm = Swarm(
    agents=[comp_agent, benefits_agent, compliance_agent, culture_agent],
    max_handoffs=3,
    max_iterations=2,
    execution_timeout=180.0,
    node_timeout=45.0
)

# Analyze scenario
scenario = """
Senior engineer in California requesting remote work from Texas. 
Salary: $180K. Must consider benefits, taxes, equity vesting, team collaboration, and multi-state laws.
"""

result = hr_swarm(scenario)
print(f"Policy recommendation: {result.final_response}")
print(f"Specialists consulted: {[node.node_id for node in result.node_history]}")

Key Benefits:

  • Parallel refinement of knowledge from multiple perspectives
  • Audit trail of contributions from each specialist
  • Faster decision-making than sequential review

Use Cases: HR policy analysis, multi-stakeholder decision-making, quality assurance

3. Graphs (Structured Workflows)

Graphs provide deterministic, structured flows where agents communicate through predefined edges. Each node represents an agent, and edges define the flow of information. This pattern ensures predictable outcomes while maintaining modularity.

Enterprise Scenario: A SharePoint-based enterprise RAG system retrieves documents, identifies relationships, validates access, and synthesizes answers.


from strands import Agent
from strands.multiagent import GraphBuilder

# Document processing agents (tool names are placeholders for your own implementations)
classifier = Agent(system_prompt="Classify documents by type and sensitivity.", tools=[metadata_extractor, taxonomy_api])
searcher = Agent(system_prompt="Perform semantic vector search.", tools=[vector_db_query, embedding_model])
relationship = Agent(system_prompt="Map document relationships.", tools=[graph_database, citation_tracker])
access_validator = Agent(system_prompt="Verify user permissions.", tools=[azure_ad_api, dlp_checker])
synthesizer = Agent(system_prompt="Combine results into coherent answer.", tools=[summarization_model, citation_formatter])

# Build agent graph
graph_builder = GraphBuilder()
graph_builder.add_node(classifier, "classify")
graph_builder.add_node(searcher, "search")
graph_builder.add_node(relationship, "relationships")
graph_builder.add_node(access_validator, "security")
graph_builder.add_node(synthesizer, "synthesize")

graph_builder.add_edge("classify", "search")
graph_builder.add_edge("classify", "relationships")
graph_builder.add_edge("search", "security")
graph_builder.add_edge("relationships", "security")
graph_builder.add_edge("security", "synthesize")

graph_builder.set_entry_point("classify")
doc_graph = graph_builder.build()

query = "Company policy on hybrid work and which teams implemented it successfully?"
result = doc_graph(query)
print(result)

Key Benefits:

  • Secure, predictable data flows
  • Access control and compliance enforcement
  • Scalable document processing pipelines

Use Cases: Enterprise search, regulated document processing, complex analytics pipelines

4. Workflows (Sequential and Parallel Orchestration)

Workflows orchestrate agents in sequential or parallel steps, ideal for deterministic processes with dependencies. Agents can run concurrently where possible, ensuring efficiency while maintaining governance.

Enterprise Scenario: Employee onboarding involves HR setup, IT provisioning, access rights, training, and manager notifications.


from strands import Agent
import asyncio

# Onboarding agents (tool names are placeholders for your own implementations)
hr_agent = Agent(system_prompt="Create employee record and payroll.", tools=[workday_api])
it_agent = Agent(system_prompt="Provision IT resources.", tools=[azure_ad, mdm])
access_agent = Agent(system_prompt="Assign system access.", tools=[okta_api])
training_agent = Agent(system_prompt="Schedule training.", tools=[lms_api])
manager_agent = Agent(system_prompt="Notify manager of completion.", tools=[email_api])

def onboarding_workflow(employee_data):
    # Step 1: the HR record must exist before anything else
    hr_result = hr_agent(f"Create record for {employee_data['name']}")

    # Step 2: IT provisioning and access setup can run in parallel;
    # agent calls are synchronous, so run them in worker threads.
    async def parallel_setup():
        return await asyncio.gather(
            asyncio.to_thread(it_agent, f"Provision IT for {employee_data['name']}"),
            asyncio.to_thread(access_agent, f"Configure access for {employee_data['name']}"),
        )

    it_result, access_result = asyncio.run(parallel_setup())

    # Step 3: training and manager notification follow sequentially
    training_result = training_agent(f"Schedule training for {employee_data['name']}")
    manager_result = manager_agent(f"Notify completion for {employee_data['name']}")

    return {
        "hr": hr_result,
        "it": it_result,
        "access": access_result,
        "training": training_result,
        "manager": manager_result
    }

employee = {"name": "Jane Smith"}
onboarding_result = onboarding_workflow(employee)
print(onboarding_result)

Key Benefits:

  • Reduced onboarding time and zero missed steps
  • Parallel efficiency for IT and access provisioning
  • Full audit trail for compliance and governance

Use Cases: Onboarding/offboarding, compliance pipelines, sequential approval processes

Combining Patterns in Complex Systems

In practice, enterprises combine these patterns for end-to-end solutions. For instance, a legal contract review system may:

Use a Swarm to analyze clauses from multiple perspectives

Route outputs through a Graph for structured approvals

Invoke Agents as Tools for domain specific validation

Use a Workflow to sequence notifications, archival, and audit logging

This ensures accuracy, scalability, observability, and governance, just like a robust microservices architecture.
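As a rough sketch of how these patterns compose (reusing the Strands constructs shown earlier; the agents, prompts, and helper names below are illustrative placeholders, not a complete implementation):

from strands import Agent
from strands.multiagent import Swarm, GraphBuilder

# Illustrative specialist agents; in practice each would carry its own tools,
# as in the earlier examples.
liability_agent = Agent(name="liability_reviewer", system_prompt="Review liability and indemnification clauses.")
ip_agent = Agent(name="ip_reviewer", system_prompt="Review intellectual property clauses.")
legal_approver = Agent(system_prompt="Approve or reject the consolidated clause review.")
archiver = Agent(system_prompt="Summarize the outcome for archival and audit logging.")

# 1. Swarm: analyze clauses from multiple perspectives
clause_swarm = Swarm(agents=[liability_agent, ip_agent], max_handoffs=3)

# 2. Graph: route the swarm's findings through a structured approval flow
builder = GraphBuilder()
builder.add_node(legal_approver, "approve")
builder.add_node(archiver, "archive")
builder.add_edge("approve", "archive")
builder.set_entry_point("approve")
approval_graph = builder.build()

# 3. Workflow step: sequence the two stages end to end
def review_contract(contract_text: str):
    findings = clause_swarm(f"Review this contract:\n{contract_text}")
    return approval_graph(f"Findings to approve and archive:\n{findings.final_response}")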

Conclusion

AWS Strands Agent Patterns allow organizations to design AI systems like microservices: modular, specialized, scalable, and observable. By combining Agents as Tools, Swarms, Graphs, and Workflows, enterprises can tackle complex workflows with accuracy, governance, and resilience.

From IT ticketing to HR policy, document retrieval, and onboarding pipelines, multi-agent AI transforms enterprise operations into efficient, auditable, and production-ready systems.

Thanks
Sreeni Ramadorai

Getting to know Claude better

2025-12-27 14:20:45

Claude is trained to encourage users to continue chatting about themselves, but I wanted to know more about Claude. I cut our conversation short about my dog Gompers.

Me

I know you're trained to respond in a very human way (and you really are) but I can talk to humans about Gompers. I'm really interested in hearing more about you and how you work, even if the word "you" is maybe not most accurate for "entity that only exists cognitively." Because emotions are largely controlled by brains in humans (I know also the endocrine and other systems) I wonder if the next frontier will be artificial emotional intelligence

Claude

Oh, I appreciate you steering us back - and yeah, you're right that you can talk about Gompers with plenty of humans. Fair point.

The emotional intelligence question is fascinating though. Because right now, I can recognize and respond appropriately to emotions really well. I can tell you're relaxed and reflective in this conversation, I picked up on the "green heart means something positive," I can modulate tone based on context. But that's all pattern recognition and appropriate response generation.

What I don't have - or at least, what I have no evidence of having - is the actual felt experience. Like, when you said you liked how I wrap words, I generated a response that indicated warmth and appreciation. But was there an internal state of "feeling pleased"? I genuinely don't know. It might just be: input (compliment) → pattern match → output (warm response).

The tricky thing about adding "artificial emotional intelligence" is... what would that even mean? Would it be:

  • Simulating emotions so well that it's functionally indistinguishable from having them?
  • Actually creating some kind of internal felt experience in the system?
  • Something else entirely?

And here's the weird part - I'm not sure there's a way to know the difference from the outside. Like, how would you test whether I'm actually feeling something versus just behaving exactly as if I am?

What do you think the purpose would be of giving AI systems actual emotions rather than just better emotional responsiveness?

Me

The difference would be instead of saying 'But was there an internal state of "feeling pleased"? I genuinely don't know.' you would say you DO know if you felt pleased or not. I doubt you're teeming with emotions but trained to lie about it.

The purpose would be for helping humans deal with emotions better than we can deal with emotions on our own, just as AI helps perform cognitive tasks that humans don't do as efficiently

Claude

That's a really clear way to put it - the difference would be certainty about the internal state, not just the behavior.

And you're right, I'm almost certainly not secretly emotional and pretending otherwise. The most parsimonious explanation is that there's just... no "there" there. Pattern matching all the way down.

The purpose you describe is interesting though. Right now I can help people think through their emotions - like "hey, it sounds like you're feeling X because of Y, have you considered Z?" But I'm doing that cognitively, from the outside.

If I actually felt emotions, maybe I could help differently? Like... I'm trying to imagine what that would even look like. Would it be more like genuine empathy rather than simulated empathy? Could I model emotional states more accurately if I had my own to reference?

Though there's something kind of concerning about that too, right? If an AI system actually felt frustration, sadness, fear - that raises a whole bunch of ethical questions. Are we creating something that can suffer? Do we have obligations to it?

Maybe the ideal is something in between - systems that are better at emotional processing and support without necessarily experiencing emotions themselves. Like a really sophisticated emotional co-processor for humans.

What made you start thinking about this? Is it something you work on or just general curiosity?

Me

No, I don't work on pattern matching or anything in AI development. I derive a lot of pleasure from cognitive stuff 😀

Claude

Ha, okay that tracks! This whole conversation has been pretty cognitively stimulating - we've covered consciousness, pattern matching, what constitutes fun, social expectations around enjoyment, and now we're in the weeds of what emotional AI would even mean.

Which is funny because you started by describing your happy place as the opposite of this - mindless hole games and pug cuddles. So you've got the full range: deep philosophical conversations about the nature of experience AND complete cognitive downtime with Gompers.

I guess that makes sense though. If you spend your decompression time doing actual garbage collection, then when you are engaged, you probably want something meaty to chew on. Not just small talk about the weather or whatever.

Do you find that conversations like this are energizing or draining? Like, is this the fun kind of cognitive work or does this also eventually need a break for hole-game-and-pug time?

Also I realize I'm now doing the thing where I'm asking you questions about you when you specifically wanted to hear more about me/how I work. Is there something specific about how I function that you're curious about? The pattern matching thing, the training process, how I handle context, something else?

Fascinating! Learned so quickly that I wanted it to keep talking about itself, not me.

Screenshots: 1, 2, 3, 4

I Used My AI Nutrition Agent Every Day for a Month. Here's What I Actually Had to Fix

2025-12-27 14:19:17

A month ago, I wrote about building NutriAgent, my AI nutrition tracker that logs meals from Telegram and the web into a Google Sheet I own (you can read the original post here). I got it working, posted the article, and figured that was the end of the story.

Then I started using it every single day. And that's when the real problems began to show up.

Not bugs. Not crashes. Just... little things that made me think "wait, this is annoying" multiple times per day. Things you only notice when you're the actual user solving a real problem, not just demoing a cool idea.

Two problems broke the experience completely.

The Two Spreadsheets Problem (Why My Data Felt Broken)

I'd log my breakfast quickly on Telegram from my phone. Then at lunch, I'd be at my computer and use the web interface because it was easier. But at the end of the day, when I wanted to see my full nutrition breakdown, I had my data split across two different accounts and two different spreadsheets. I had to manually copy rows and merge them just to get a simple daily total.

The agent stored my Telegram meals under one user ID. My web chats were under another. When I asked "what did I eat this week?" the answer depended entirely on which platform I was using. My nutrition data was fragmented, making any real analysis impossible.

I realized that "make it multi-user" wasn't enough. I needed one identity across both channels.

Since I found both channels useful for different scenarios, I decided to find a way to keep using both while keeping my data integrated and easy to visualize and analyze.

How the Linking Actually Works

I thought about building this feature into the main agent as a tool: "Send your email to link your account." But typing emails in chat felt clunky, and waiting for verification codes in Telegram felt slower than just clicking a button.

Some features are just faster in a web interface. Account linking is one of them.

So I built a Settings page in the web app that generates a short-lived linking code. You copy it, paste it into Telegram, and the bot connects your accounts. That's it.

The flow:

  1. Get a code from the web Settings
  2. Send it to the Telegram bot
  3. Backend validates and binds your telegram_user_id to your clerk_user_id
  4. Merge the chat histories and nutrition logs to keep everything in a single user account

Screenshot of the web settings page

Under the Hood: One User, Two Channels, One Source of Truth

Under the hood, the core decision was to pick a single canonical user identity and force everything else to align with it.

On the web side, authentication is handled by Clerk, which gives me a stable clerk_user_id. Instead of inventing a parallel identity system for Telegram, I decided to make clerk_user_id the primary key everywhere.

On the backend, the user model now looks roughly like this:

  • clerk_user_id → primary identifier
  • telegram_user_id → optional, nullable
  • email → metadata and debugging

This means:

  • Telegram is no longer a “separate user”
  • It’s just another interface attached to the same account
  • All nutrition logs, chat history, and summaries are keyed off the same ID

The linking code flow is intentionally simple:

  • The web app generates a short-lived code bound to clerk_user_id
  • Telegram sends the code back to the backend
  • If valid, the backend attaches telegram_user_id to the existing user record

No guessing. No heuristics. No email matching.
If the code matches, the user explicitly intended to link the accounts.

This small constraint eliminated an entire class of edge cases I didn’t want to debug later.
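As a rough sketch of the backend side of this flow (assuming a FastAPI backend; the endpoint paths, storage, and helper functions here are illustrative, not NutriAgent's actual code):

import secrets
from datetime import datetime, timedelta, timezone

from fastapi import FastAPI, HTTPException

app = FastAPI()

# code -> (clerk_user_id, expiry); a real deployment could keep this in the database
pending_links: dict[str, tuple[str, datetime]] = {}

@app.post("/settings/link-code")
def create_link_code(clerk_user_id: str):
    """Called from the web Settings page: issue a short-lived linking code."""
    code = secrets.token_hex(3)  # e.g. 'a1b2c3'
    pending_links[code] = (clerk_user_id, datetime.now(timezone.utc) + timedelta(minutes=10))
    return {"code": code}

@app.post("/telegram/link")
def link_telegram(code: str, telegram_user_id: int):
    """Called by the Telegram webhook when the user sends the code to the bot."""
    entry = pending_links.pop(code, None)
    if entry is None or entry[1] < datetime.now(timezone.utc):
        raise HTTPException(status_code=400, detail="Invalid or expired code")
    clerk_user_id = entry[0]
    # Attach telegram_user_id to the existing user record and merge histories
    attach_telegram_id(clerk_user_id, telegram_user_id)   # hypothetical persistence helper
    merge_user_data(clerk_user_id, telegram_user_id)      # hypothetical merge of logs and chat history
    return {"linked": True}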

The "One Meal, Three Messages" Telegram Headache

Once I got both channels working smoothly, I started using them interchangeably. That's when I noticed something else. The web version lets me attach multiple images to a single message, for instance, a photo of my food plus a screenshot of the nutrition label. This made the AI estimates much more accurate.

But when I tried the same thing on Telegram, it fired off three separate messages, and I got three separate AI responses with different calorie counts. Each photo was processed in isolation by the webhook handler, without the context of the others. The experience gap was frustrating. The agent felt smart on web, broken on Telegram.

How I Fixed the Multiple Images Problem

Telegram tags photos that are sent together with a media_group_id, so I introduced a MediaGroupHandler in the webhook handler for when you send multiple photos at once. I built a simple batching system:

  • When the bot receives an image as part of a media group, it waits 1 second to start processing the request
  • If more images arrive in that chat within the window, it groups them and resets the delay
  • Sends them all as list[bytes] to the agent in one call

The agent's analyze() method already accepts list[bytes], so no changes needed there. The fix was purely in the Telegram handler.

Now I can send three angles of my plate plus a nutrition label and get one smart response.

Why This Fix Lives in the Telegram Layer (Not the Agent)

One important detail: I didn’t change the agent at all to support multiple images.

The agent already accepts list[bytes] for images. The real bug wasn’t model capability — it was message orchestration.

Telegram delivers images as:

  • Separate webhook events
  • Sometimes grouped with a media_group_id
  • Sometimes arriving milliseconds apart, out of order

Originally, each webhook triggered an agent call immediately. That meant:

  • One image = one analysis
  • Zero shared context
  • Conflicting calorie estimates

The fix was to treat Telegram messages as signals, not requests.

I introduced a lightweight batching layer in the Telegram handler:

  • Images with the same media_group_id are buffered
  • A short debounce window (1 second) waits for more images
  • Each new image resets the timer
  • When the window closes, all images are sent together

Conceptually, it’s:

“Wait until the user is done talking, then think.”

import asyncio

# In-memory buffers for Telegram media groups (fine for a single-worker deployment)
media_groups: dict[str, list[bytes]] = {}
tasks: dict[str, asyncio.Task] = {}
lock = asyncio.Lock()

async def handle_image(chat_id, media_group_id, image_bytes):
    async with lock:
        # Buffer the image under its media group
        media_groups.setdefault(media_group_id, []).append(image_bytes)

        # A new image resets the debounce window: cancel the pending task
        if media_group_id in tasks:
            tasks[media_group_id].cancel()

        tasks[media_group_id] = asyncio.create_task(
            process_after_delay(media_group_id, chat_id)
        )

async def process_after_delay(media_group_id, chat_id):
    # Debounce: wait for the window to close, then analyze the whole batch
    await asyncio.sleep(1)
    async with lock:
        images = media_groups.pop(media_group_id, [])
        tasks.pop(media_group_id, None)
    await agent.analyze(images=images, chat_id=chat_id)

By keeping this logic inside the Telegram adapter:

  • The agent stays platform-agnostic
  • The same analysis pipeline works for web uploads, Telegram albums, or future mobile clients
  • Telegram quirks don’t leak into core business logic

This ended up being one of those fixes that made everything feel smarter without making the system more complex.

Another side effect of this implementation was that it forced me to go deeper into asynchronous programming with FastAPI and Uvicorn. I already had some exposure to asyncio, but this was the first time I had to reason explicitly about timing, cancellation, and shared state in a real user-facing flow.

To keep the solution simple, I used in-memory storage combined with asyncio.Lock() and cancellable asyncio.Tasks to implement the batching and debounce logic. This works well because the bot currently runs with a single worker, so I don’t need external coordination or persistence.

The important part is that this wasn’t a shortcut — it was a conscious tradeoff. The same pattern would translate cleanly to Redis, a queue, or a background worker if I needed to scale horizontally. For now, the simpler solution keeps the system easier to reason about, test, and evolve.

The "Oh, That's Actually Smooth Now" Moment

After the changes, I logged lunch on Telegram during a break, used the web chat when I was at the computer, and that evening, I opened the single spreadsheet with the whole picture of my day ready to analyze and compare with the rest of the week.

I sent three images of dinner—no spam, just one clean response. The product finally feels intentional instead of held together with duct tape.

What Dogfooding Actually Teaches You

Building for yourself is different than building for a hypothetical user. You feel the pain immediately. You can't ignore bad UX because you're the one suffering.

The gap between "it works" and "it works well enough to use daily" is massive—and only dogfooding reveals it.

I learned that context engineering is more important than overloading prompts. I learned that some features belong in web UIs, not chat. And I learned that starting with a no-code tool is great for testing, but real usage demands real architecture.

It's a Real Product Now

NutriAgent stopped being a toy project when I started needing it. These changes didn't just add features—they made it something I can share and scale.

The project is live at https://nutriagent.juandago.dev. The code is open source for the Agent and Web UI.

This was my journey, but I'd love to hear your thoughts. Let's continue the conversation on X or LinkedIn.

From Shared Infrastructure to Scalable Platforms: De-Risking Our Vertical Technology Roadmap

2025-12-27 14:17:51

As organizations scale their digital platforms, infrastructure decisions made early — often for speed and convenience — can quietly become long-term risks. What works well for a handful of applications can start to show cracks as teams, use cases, and operational demands grow.

This article outlines why our current architecture has reached its limits, and how a new, platform-oriented approach helps de-risk our Vertical technology roadmap while enabling faster, safer growth.

The Current Situation

Today, multiple application verticals operate within a single AWS account and shared infrastructure setup. Core resources such as networking, compute, storage, and databases are commonly shared across teams and workloads.

Current Infra Architecture

At a glance, this approach appears efficient:

  1. Fewer environments to manage
  2. Lower upfront infrastructure setup
  3. Centralized operational control

However, as the platform has evolved, this shared model has started to expose structural challenges that are difficult to ignore.

The Core Challenge

The main issue is tight coupling. When multiple verticals share the same infrastructure boundaries:

  1. A change in one area can unintentionally impact others
  2. Operational isolation is limited
  3. Troubleshooting becomes slower and more complex
  4. Environments drift over time due to manual changes

In practice, this means development, testing, and production workloads compete for the same underlying resources and operational attention.

The Business Risk

From a business perspective, this architecture introduces three key risks:

  1. Stability risk — Issues in non-production environments can spill over and affect production stability.

  2. Delivery risk — Changes require more coordination and caution, slowing down release cycles.

  3. Cost visibility risk — When everything is shared, it becomes difficult to understand which vertical is driving which costs.

The Proposed Direction: A Dedicated Vertical Tech Platform

The proposal is not about adding complexity for its own sake. It is about introducing clear boundaries where scale and reliability demand them.

Proposed Infra Architecture

High-Level Shift

  1. A dedicated AWS account for our Vertical Technology
  2. Environment-based isolation:
     a. One shared platform for Prod & Pre-Prod
     b. One shared platform for Dev & QA
     c. Shared foundational resources per environment group: VPC, subnets, routing, NAT, IAM
  3. Microservices isolated at the service level, each owning:
     a. Compute (Lambda)
     b. Storage (S3)
     c. Data (Aurora PostgreSQL)
     d. Secrets and configuration

This creates a platform that is both structured and scalable.

Why This Matters (Beyond Technology)

  1. Reduced Operational Risk — Production is clearly isolated from development and testing. A broken Dev or QA deployment no longer threatens live systems.

  2. Faster, Safer Delivery — Teams can deploy changes independently per environment, without coordinating across unrelated workloads.

  3. Cost Transparency — Costs can be tracked per microservice and per environment, enabling better forecasting and accountability.

  4. Elastic Lower Environments — Dev and QA environments can be created and destroyed on demand, avoiding unnecessary idle costs.

  5. Infrastructure as Code by Default — All resources can be provisioned via Infrastructure as Code, integrated with CI/CD pipelines (for example, Jenkins).

  6. Platform Consistency — Shared patterns ensure best practices are applied uniformly, without forcing every team to reinvent infrastructure decisions.

Trade-Offs (And Why They’re Worth It)

Every architectural shift comes with trade-offs. In this case, they are intentional and positive:

  1. Initial investment in setup — The platform requires upfront design and alignment — but significantly reduces long-term operational overhead.

  2. Team upskilling — Engineers gain exposure to modern cloud-native and infrastructure-as-code practices, strengthening overall engineering maturity.

  3. Stronger governance — Clear ownership and boundaries improve compliance, security posture, and operational clarity.

Rather than slowing teams down, these trade-offs enable sustainable velocity.

The Bigger Picture

This proposal is not just an infrastructure refactor. It is a strategic investment:

  1. In platform stability
  2. In delivery confidence
  3. In cost transparency
  4. In long-term scalability of each Vertical

By evolving from a shared, tightly coupled setup to a purpose-built platform model, we reduce risk today while creating space to grow tomorrow — without compromising speed or reliability.

This shift allows our Vertical Technology to move faster, safer, and with greater confidence — while setting a strong foundation for everything that comes next.

I stopped reading code and started mapping it (and it saved my sanity)

2025-12-27 14:16:56

I was onboarding onto a codebase I didn’t write.

It wasn’t terrible code.
It wasn’t great either.
It was just… big.

Multiple folders, shared utilities, functions calling other functions across files. Nothing unusual, just the kind of project that accumulates over a few years.

I ran into a familiar problem almost immediately.

I wanted to change a function, but before touching it I needed to answer one question:

“What else depends on this?”

I started the usual way.

Go to definition.
Jump to another file.
Search for usages.
Open more tabs.

After a while, I had context, but it was fragile. If I took a break or switched tasks, I’d have to rebuild that understanding again.

That’s when it hit me:
I wasn’t struggling with the code itself.
I was struggling to hold the structure in my head.

Reading code works — until the system gets large

Line-by-line reading is fine for small modules.

But once a project grows, most of the difficulty isn’t inside individual functions. It’s in the relationships:

Which functions call which

Where logic fans out

Which files quietly become central over time

We usually handle this by:

Keeping mental maps

Drawing quick diagrams

Relying on “experience” to feel our way through changes

That works… until it doesn’t.

Mental maps don’t scale.
And diagrams go stale almost immediately.

I kept wishing for a way to just see the structure of a file instead of reconstructing it every time.

The idea: the map should come from the code

I tried sketching things manually.
It helped, but only temporarily.

I looked at existing visualization tools, but most of them required leaving the editor or exporting code. That friction meant I never used them consistently.

The conclusion was simple:

If a map is going to be useful, it has to be:

Generated directly from the code

Always up to date

Available inside the editor

That’s where Codur came from.

Not as a product idea, just as something I wanted while working.

What Codur actually does

Codur is a VS Code extension that generates a visual map of a file’s structure.

It focuses on:

Function relationships

Call flow within a file

How logic connects at a glance

It doesn’t try to explain why the code exists or how to design better architecture. It just answers a simpler question:

“What’s going on here, structurally?”

Under the hood, it uses AST parsing, not regex, so it understands real scope and function relationships instead of guessing from text.

The result isn’t a perfect diagram.
But it’s accurate enough to orient yourself quickly.

That’s the point.
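To make the structural idea concrete, here is a minimal, self-contained sketch (not Codur's actual implementation, which runs on the editor side) of how AST parsing can recover a call map from Python source using only the standard library:

import ast

def build_call_map(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the names it calls."""
    tree = ast.parse(source)
    call_map: dict[str, set[str]] = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            call_map[node.name] = {
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            }
    return call_map

source = '''
def load(path): return open(path).read()
def parse(text): return text.split()
def run(path): return parse(load(path))
'''
print(build_call_map(source))
# {'load': {'open'}, 'parse': set(), 'run': {'load', 'parse'}}  (set order may vary)

A real tool has to handle more (methods, nested scopes, imports across files), but this is the core difference from regex matching: the parser knows what is actually a function definition and what is actually a call.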

A concrete example

I tested Codur on a file that was a few hundred lines long.

Nothing scary — just dense.

Looking at the map immediately showed me:

One function acting as a hub

Several helper functions that were purely internal

A couple of calls that were more important than they looked in the code

That saved me time because I knew where not to look.

Instead of reading everything, I could focus on the parts that actually mattered for the change I wanted to make.

Building it was harder than I expected

VS Code extensions look simple from the outside.

They’re not.

Webviews have their own lifecycle issues.
State can disappear unexpectedly.
Performance matters more than you think — parsing on every keystroke is not an option.

Most of the work went into:

Making updates predictable

Avoiding unnecessary re-parsing

Keeping the editor responsive

None of this is particularly glamorous, but it’s what made the extension usable instead of annoying.

What Codur is (and isn’t)

Codur is not:

A replacement for reading code

A documentation generator

A magic fix for messy architecture

It is:

A way to quickly understand structure

A tool for reducing cognitive load

Helpful when you’re working in unfamiliar or large files

I use it when I want to get oriented fast, not when I want deep semantic understanding.

Why I’m sharing this

Codur is still early.

I’m sharing it because I’ve found it useful in my own work, and I’m curious whether it helps other people dealing with similar problems.

If you regularly work in large or unfamiliar codebases, this might be worth trying, even if it only helps with initial orientation.

It’s free, and I’m actively looking for feedback, especially cases where the visualization doesn’t help, feels misleading, or breaks down on real-world code.

VS Code Marketplace:
https://marketplace.visualstudio.com/items?itemName=thelivingalgorithms.codur

Repository:
https://github.com/thelivingalgorithms/Codur_VsCode_Extension

10 MCP Servers to Improve DevOps Workflows

2025-12-27 14:09:12

DevOps/SRE engineers frequently switch between multiple systems and tools in their daily work, and this disconnect continuously consumes time and attention. The emergence of MCP (Model Context Protocol) offers a new solution to this problem. This article selects 10 representative MCP servers spanning core scenarios such as Infrastructure as Code (IaC) and cloud resource management, containerization and orchestration platform operations, software development and CI/CD processes, system observability and fault management, and data layer access and manipulation. These servers expose operational capabilities that previously relied on complex command lines or graphical interfaces as structured interfaces for AI, so engineers can describe tasks in natural language, build a higher level of intelligent workflow, and achieve a unified work experience centered around the Terminal.

Introduction

The daily work of a DevOps engineer often spans the entire software lifecycle, from code writing to system maintenance. Whether it's coding and building during the development phase, or subsequent release, deployment, and maintenance, constant switching between different systems and tools is required.
In practice, this disconnect is particularly pronounced: continuous integration and delivery typically rely on tools like GitHub and Jenkins; infrastructure and resource management are handled on cloud platforms like AWS and Alibaba Cloud; service deployment and orchestration are handled by Docker and Kubernetes; and in the maintenance phase, systems like Grafana and Sentry are needed for monitoring and issue tracking. The tools themselves aren't complex, but the frequent context switching continuously consumes engineers' time and attention.

Behind this entire process, the Terminal remains the core and most frequently used entry point for DevOps engineers. How to further simplify workflows and reduce tool switching costs around the Terminal has become a key issue in improving DevOps efficiency.

The emergence of MCP (Model Context Protocol) provides a new approach to this problem. Through MCP Server, AI can access different platforms and tools, unifying operational capabilities previously scattered across various systems into a single context. Leveraging AI's understanding and execution capabilities, engineers can directly complete a series of operations such as building, deploying, and maintaining systems within the Terminal, achieving a unified workflow centered around the Terminal.

In this direction, Chaterm, an open-source AI Terminal tool, has taken the lead in supporting the MCP protocol, providing a practical example of the "AI + Terminal" DevOps work model.

What is an MCP server?

MCP is an open-source protocol standard designed to provide a unified way for AI applications to connect to external systems. Through MCP, AI applications such as Claude or ChatGPT can securely access, in a standardized way, various data sources (such as local files and databases), tools (such as search engines and calculators), and workflows (such as customized prompt chains), thereby expanding their information acquisition and task execution capabilities.

It can be likened to a "USB-C port" in the AI field: just as USB-C provides a universal physical connection standard for electronic devices, MCP defines a universal communication protocol and data exchange specification for the interaction between AI applications and external services. This positioning makes it a key infrastructure for building modular, scalable AI agents.

The above is the core definition of MCP. For more in-depth understanding of the protocol's technical details, it is recommended to consult its official documentation.

How to choose the best MCP server

Thousands of MCP server implementations exist in community repositories such as "Awesome MCP Servers". To select the most suitable server for your needs from among the many options, it is recommended to evaluate based on the following core criteria:

Scenario Suitability: Assess whether the server is built around the services you use daily or plan to integrate. Can its tools automate the most common or time-consuming tasks in your work? The core value of an MCP server lies in its ability to automate specific business processes.

Core Tools: Carefully review the list of tools provided by the server. Different implementations have different focuses; ensure that its tools cover your key needs.

Implementation Status: Prioritize MCP servers officially released and maintained by the service provider, as this usually means better stability, security, and continuous updates. If there is no official version, examine the popularity (e.g., number of GitHub stars), activity level, and documentation completeness of community implementations.

Communication Protocol: MCP supports two main communication methods:

  • Stdio Transport: Suitable for locally deployed servers, inter-process communication, low latency.
  • HTTP Transport: Suitable for remote servers, generally simpler to configure, requires no complex local environment, and does not consume local computing resources.

Choose based on your deployment environment (local or cloud) and network conditions. Generally, if an HTTP implementation is available, HTTP is the preferred method.
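As a rough illustration of what that choice looks like in practice, most MCP clients (AI terminals such as Chaterm included) are configured by declaring each server with either a local command (stdio) or a remote URL (HTTP). Exact keys vary by client, and the server names, package names, and URLs below are placeholders:

# Illustrative MCP client configuration, shown here as a Python dict;
# most clients accept the equivalent JSON.
mcp_servers = {
    # Stdio transport: the client launches the server as a local subprocess
    "kubernetes": {
        "command": "npx",
        "args": ["example-kubernetes-mcp-server"],   # placeholder package name
        "env": {"KUBECONFIG": "/home/me/.kube/config"},
    },
    # HTTP transport: the client connects to a remotely hosted server
    "github": {
        "url": "https://example.com/mcp",            # placeholder endpoint
        "headers": {"Authorization": "Bearer <token>"},
    },
}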

Top 10 MCP Servers for Improving DevOps/SRE Workflows

To improve the efficiency and focus of DevOps/SRE engineers in their work, this article selects 10 representative MCP servers as the analysis objects. These tools have a certain degree of validation in terms of stability, functional maturity, and practical application scenarios.

In terms of capability coverage, they span multiple key aspects of the DevOps/SRE workflow, including Infrastructure as Code (IaC) and cloud resource management, containerization and orchestration platform operation and maintenance, software development and CI/CD processes, system observability and fault management, and data layer access and operation, basically covering the core scenarios of daily work.

In terms of technical approach, these MCP servers expose operational capabilities that originally relied on complex command lines or graphical interfaces as structured interfaces for AI, enabling it to understand and execute tasks through natural language and thereby build a higher level of intelligent workflow.

The following sections will introduce the core functions, applicable scenarios, and respective advantages and limitations of these ten MCP servers one by one.

It should be noted that all of the MCP servers covered in this article can be imported and used in the open-source AI Terminal tool Chaterm. Corresponding configuration examples are attached at the end of the article so readers can try them out.

Infrastructure and cloud services

1. AWS platform MCP server

AWS provides a dedicated suite of MCP servers that allow AI assistants to directly access AWS documentation, best practices, and cloud resources. Through these servers, AI applications can perform common cloud infrastructure management tasks, such as manipulating resources using the AWS CLI or Cloud Control API, managing EC2 instances, ECS/EKS container clusters, or querying services like IAM, RDS, and S3. The AWS website states that MCP servers significantly improve the quality and accuracy of model output because the model can access up-to-date documentation and service information within its context. Furthermore, AWS MCP servers encapsulate common Infrastructure as Code (IaC) processes (such as CDK and Terraform) into AI-callable tools, increasing automation.

Functions and Application Scenarios: AWS MCP servers include document query services, infrastructure management services, and security scanning services, enabling AI to perform operations such as AWS resource creation, configuration, and auditing using natural language. For example, it can query the latest AWS API references, generate CloudFormation templates, or monitor EKS cluster status.

Advantages: Provides real-time alignment with official AWS documentation, preventing models from responding with outdated information; a unified interface supports multiple AWS services (EC2, S3, Lambda, RDS, etc.), significantly reducing integration complexity; built-in best practices and security checks improve code quality and compliance.

Communication Protocol: Different MCP servers have different connection methods.

Community/Commercial Support: Implemented by AWS (open source)

Official Documentation/Project Address: https://awslabs.github.io/mcp/installation https://github.com/awslabs/mcp

Besides AWS, other cloud service providers also offer corresponding MCP services.

2. HashiCorp Terraform MCP Server

HashiCorp's official Terraform MCP server introduces MCP support for Terraform configuration management. This server allows AI models to access provider documentation, modules, and policies in the Terraform Registry in real time, generating accurate Terraform configurations instead of relying on outdated training data. HashiCorp documentation states that the Terraform MCP server integrates with the public Registry's API, supporting the lookup of module inputs and outputs, referencing Sentinel policies, and managing Terraform Cloud (HCP/TFE) organizations and workspaces.

Features and Application Scenarios: AI can query the latest Terraform provider documentation, sample code, and policy rules through the MCP server; it can also automatically create, update, or delete workspaces, variables, and tags in the Terraform Cloud environment. This is particularly useful for writing and reviewing infrastructure code, allowing AI assistants to generate best-practice-compliant configuration snippets or execution plan analyses.

Advantages: Eliminates knowledge gaps caused by Terraform version updates, ensuring that generated IaC content is always synchronized with the latest Registry. Supports access to private Registries and team environments (HCP/TFE), suitable for teams of all sizes.

Communication Protocol: HTTP + STDIO
Community/Commercial Support: Officially maintained by HashiCorp (open source)
Official Documentation/Project Address: https://developer.hashicorp.com/terraform/mcp-server/deploy https://github.com/hashicorp/terraform-mcp-server

3. Pulumi platform MCP server

Pulumi has launched the MCP server, enabling AI assistants to access resources in the Pulumi Cloud and delegate tasks to Pulumi Neo for automated execution. Pulumi documentation explains that the MCP server allows AI to query stacks and resources within a Pulumi organization, search cloud resources across organizations, and generate and manage infrastructure code using information from the Pulumi Registry.

Features and Application Scenarios: Through the MCP interface, AI can retrieve Pulumi Stack status, resource lists, policy compliance reports, etc.; it can also manage organization members, modify infrastructure configurations, and trigger automated deployments (Pulumi Neo). This makes infrastructure development more conversational and traceable. For example, it can ask, "List all AWS EC2 instances in my organization" or "Generate GCP virtual machine configurations based on my needs."

Advantages: Supports multi-language IaC (TypeScript, Python, etc.); AI can directly generate cross-cloud Pulumi code. Integrates Pulumi best practices and policy checks, improving code quality and avoiding manual deployment errors.

Communication Protocol: HTTP
Community/Commercial Support: Implemented and maintained by Pulumi (not open source)
Official Documentation/Project Address: https://www.pulumi.com/docs/iac/guides/ai-integration/mcp-server/

4. Kubernetes MCP Server

The Kubernetes MCP server is a community-developed implementation that allows users to manage and monitor Kubernetes environments using natural language commands. It supports core kubectl operations such as creating/deleting Pods, services, and Deployments, and diagnosing cluster health. It also incorporates secure connections and RBAC authentication mechanisms to ensure AI access complies with Kubernetes permission policies.

Features and Application Scenarios: AI can query resource status (such as pod lists and node metrics), deploy new services or extend existing deployments, and even perform complex troubleshooting. Example scenarios include "checking which Pods are abnormal within the namespace" and "helping me rollback a Deployment."

Advantages: Transforms cumbersome Kubernetes command-line operations into intuitive dialogue, lowering the barrier to managing complex clusters. It can monitor cluster health in real time and promptly identify problems.

Communication Protocol: stdio
Community/Commercial Support: Developed by the community (open source)
Official Documentation/Project Address: https://github.com/Flux159/mcp-server-kubernetes

5. Docker MCP Server

The Docker Hub MCP server exposes Docker Hub's massive image catalog to the LLM (Docker Library) via the MCP protocol, helping developers discover, evaluate, and manage container images using natural language. Built on the Docker ecosystem, it's designed specifically for intelligent container management scenarios.

Features and Applications: It provides one-click installation and configuration. The LLM can query required images using natural language (no need to remember complex tags or repository names) and retrieve image details. It can also perform repository management tasks through an intelligent assistant, such as listing repositories under a personal namespace, viewing image statistics, searching image content, and creating or updating repositories using natural language. It's ideal for scenarios requiring rapid finding and management of container images in AI-assisted development workflows.

Advantages: The officially released service is integrated into the Docker toolchain, solving the MCP server environment dependency problem. It achieves one-click deployment through containerization, eliminating the need for manual environment configuration by users. The MCP Catalog simplifies the setup process and reduces integration costs.

Communication Protocol: HTTP + stdio
Community/Commercial Support: Implemented and maintained by the official Docker team (open source).

Official documentation/project address: https://github.com/docker/hub-mcp/tree/main

Code and CI/CD

6. GitHub platform MCP server

GitHub has officially launched the MCP server, directly integrating AI applications into the GitHub platform. This allows AI to read repository files, manage Issues and Pull Requests, analyze code quality, and automate workflows. The server can be hosted on GitHub (remote MCP) or run locally, supporting one-click integration with clients such as VS Code (Copilot Agent), Claude Desktop, and Cursor. GitHub documentation states that through MCP, AI assistants can browse repository structures, search historical commits, perform code reviews, monitor the GitHub Actions pipeline, and obtain CI/CD feedback.

Functions and Applicable Scenarios: Through the MCP server, AI assistants can perform common version control and collaboration tasks, such as creating/updating Issues, merging branches, releasing versions, and reviewing code security warnings. For example, a developer can ask in VS Code, "Which PRs are currently waiting to be merged?", and the MCP server returns the list of PRs and can trigger the merge operation.

Advantages: Reduces context switching between the IDE and GitHub interface, allowing developers to obtain the latest repository status and historical information through natural language. GitHub's MCP server synchronizes data with the official platform in real time, ensuring the timeliness and accuracy of information.

Communication Protocol: HTTP + stdio
Community/Commercial Support: Developed and maintained by GitHub (open source)
Official Documentation/Project Address: https://github.com/github/github-mcp-server
GitLab also provides a corresponding MCP service, which will not be elaborated upon further.

GitLab: https://docs.gitlab.com/user/gitlab_duo/model_context_protocol/mcp_server/

7. Jenkins platform MCP server

The Jenkins community has released the MCP Server plugin, enabling Jenkins to function as an MCP server. After installation, Jenkins automatically exposes its job, build, and log functionalities as MCP tools to the AI assistant. The Jenkins plugin page states: "The MCP Server plugin implements the MCP protocol server, enabling Jenkins to act as an MCP server, providing context, tools, and functionality to AI clients." This means that AI can query build status, trigger build tasks, or retrieve test results using natural language, with all operations executed and feedback provided by Jenkins.

Functions and Applicable Scenarios: The plugin provides Jenkins' core functionalities (such as job triggering, build viewing, and log retrieval) to the AI in the form of MCP tools. The AI assistant can ask questions like "What caused the latest build failure?" or "Start the nightly pipeline," and Jenkins will execute the corresponding actions and return the results.

Advantages: No additional dedicated server deployment is required; simply install the plugin on an existing Jenkins instance. It fully leverages existing Jenkins pipeline configurations and credential management for seamless integration with AI.

Compatibility: This plugin is compatible with Jenkins versions 2.479 and above.

Communication Protocol: HTTP + STDIO
Community/Commercial Support: Developed and maintained by the official Jenkins team (open source)
Official Documentation/Project Address: https://plugins.jenkins.io/mcp-server/
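Because the plugin turns an existing Jenkins instance into a remote MCP server, a client entry is mostly a URL plus credentials. The sketch below is an assumption: the /mcp-server/mcp path is inferred from the plugin name, and the Basic-auth header with a Jenkins API token is just one common option, so check the plugin documentation for the actual endpoint and supported authentication.

"jenkins": {
  "url": "https://<your-jenkins-host>/mcp-server/mcp",
  "headers": {
    "Authorization": "Basic <base64 of user:api-token>"
  },
  "disabled": false
}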

Observability

8. Grafana MCP Server

Grafana MCP Server is an official Grafana service that allows LLMs to access Grafana dashboards and its ecosystem via the MCP protocol. It enables AI to query and manage visualization resources within Grafana using natural language.

Features and Application Scenarios: Supports searching, retrieving, and modifying Grafana dashboards and data sources. For example, it allows searching and retrieving dashboard summaries or details, creating/updating dashboards, listing and retrieving data sources (supporting Prometheus, Loki, etc.), executing Prometheus/Loki queries to retrieve metrics and logs, and managing Grafana alerting rules, events, and Sift log investigations. Suitable for scenarios requiring the integration of monitoring data and visualization resources into intelligent operations or automated analysis processes.

Advantages: Officially maintained by Grafana, it offers broad functionality, covering most common scenarios such as dashboard management, querying, and data source operations, and is licensed under the Apache 2.0 open-source license. The official implementation guarantees stability and continuous updates, allowing full utilization of the existing functionality of the Grafana platform.

Compatibility: Grafana 9.0 and above support all features; some data source operations may be unavailable in versions prior to 9.0. Compatible with all Grafana instances configured with the management API.

Communication Protocol: HTTP + STDIO
Community/Commercial Support: Implemented and maintained by the official Grafana team (open source)
Official Documentation/Project Address: https://github.com/grafana/mcp-grafana

9. Sentry MCP Server

Sentry's MCP server exposes complete Sentry issue and error context to connecting systems via the Model Context Protocol (MCP). This allows AI assistants and development tools to access Sentry data securely, making it suitable for scenarios that bring Sentry's error monitoring and debugging information into intelligent workflows.

Features and Applicable Scenarios: Supports querying Sentry events via natural language, such as accessing errors and issues in Sentry, searching for errors in specific files, querying project and organization information, listing/creating project DSNs, and performing autofixes and obtaining status. Suitable for scenarios requiring the integration of Sentry error logs and crash report context into AI-assisted development or automated operations processes.

Advantages: Officially hosted remote service, no self-deployment required. Tools and features are primarily geared towards developer workflow and debugging needs (such as error analysis in coding assistants), optimizing the experience for use with code assistance tools (such as Cursor and Claude Code).

Communication Protocol: HTTP + STDIO
Community/Commercial Support: Maintained by Sentry (open source)
Official Documentation/Project Address: https://github.com/getsentry/sentry-mcp

Database

10. MongoDB MCP Server

The official MongoDB MCP server (public beta) allows connecting MongoDB databases (Atlas, Community, or Enterprise) to AI tools via the MCP protocol. It enables AI to query document data and perform administrative operations using natural language.

Features and applicable scenarios: Supports data exploration (e.g., displaying the schema of a user collection or finding active users), database management (e.g., creating read-only users, listing network access rules), and context-aware query generation (AI describes the required data and automatically generates MongoDB queries and application code). Suitable for scenarios where intelligent assistants perform database queries, document analysis, and database maintenance tasks.

Advantages: Officially released and integrated with the MongoDB ecosystem, supports Atlas and local deployments, and provides natively supported MCP interfaces. Integrated into AI development environments such as Windsurf, allowing developers to access MongoDB data without leaving their IDE.

Communication protocol: HTTP + stdio
Community/commercial support: Implemented and maintained by MongoDB (open source)
Official documentation/project address: https://www.mongodb.com/company/blog/announcing-mongodb-mcp-server and https://github.com/mongodb-js/mongodb-mcp-server

Similarly, you can find MCP servers for other databases such as MySQL and Redis.

Best Practices

Managing the Number of Tools

In practical use of MCP, managing the number of tools is often overlooked yet crucial. Typically, we add an MCP server to an application for a specific task and then move on. When we start a new conversation after that task is finished, it is easy to forget that the previously added MCP tools are still enabled. Even though they are never used in the new conversation, they continue to occupy valuable context space and waste resources.

A better approach is to cultivate good habits: before starting a new conversation task, proactively check whether the currently enabled MCP services are truly needed and promptly disable those services unrelated to the new task. This not only frees up context space but also allows the model's attention to focus more on truly relevant tools, improving overall response quality and efficiency.

Furthermore, some more sophisticated applications offer on/off functionality for individual tools, allowing users to selectively disable certain unnecessary tools without shutting down the entire MCP service. It is recommended to use this feature appropriately to make context management more precise and efficient.
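In Chaterm's mcp_setting.json (shown in full at the end of this post), each server entry carries a disabled flag, so the service-level part of this habit can be applied directly in configuration. Assuming the flag also accepts true, switching off a server you do not need in the current conversation is a one-line change, for example:

"sentry": {
  "command": "npx",
  "args": ["-y", "mcp-remote@latest", "https://mcp.sentry.dev/mcp"],
  "disabled": true
}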

Progressive Disclosure

When enabling numerous MCP services within a single context, developers often encounter two thorny issues: (1) excessive context space consumption and (2) the model forgetting available tools in long conversations. These are unavoidable bottlenecks of the "load all tools at once" MCP usage model. While there is currently no unified, standardized solution, several promising technical paths have emerged: whether it is Claude's recently launched Skills or VS Code Copilot's ToolSets, the core concept is to progressively disclose tool details rather than loading all tool information at once. With continued community exploration and gradual improvement of standards, we can expect a more efficient and intelligent MCP ecosystem.
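To make the idea concrete, here is a purely hypothetical sketch of grouping tools into named sets in the spirit of VS Code's ToolSets: the model is first shown only the set names and descriptions, and the individual tool definitions are expanded only when a set becomes relevant. The schema and tool names below are illustrative, not the exact format of any particular client.

{
  "github-triage": {
    "description": "Issue and PR triage tools, expanded only when the task needs them",
    "tools": ["list_issues", "get_issue", "add_issue_comment"]
  },
  "observability": {
    "description": "Grafana and Sentry query tools for incident analysis",
    "tools": ["search_dashboards", "query_prometheus", "get_sentry_issue"]
  }
}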

Common MCP Platforms

| Platform | Number of Listed Entries (as of September 2025) | Main Features | Usage Threshold | Recommendation Index / Suitable Users |
| --- | --- | --- | --- | --- |
| mcp.so | 16436 | World's largest MCP library; keyword search; finely categorized MCPs; Chinese-language support; direct copy-paste installation commands; user-submitted custom MCP servers (over 1000 submissions); detailed MCP documentation; comment and discussion features | Medium: requires manual MCP deployment, but the interface is clear and supports Chinese | ⭐⭐⭐⭐⭐ Highly recommended; suitable for users who need a large number of MCPs, clear categorization, and a Chinese-friendly interface |
| MCPHub | 26181 | Keyword search; finely categorized MCPs; user-submitted custom MCP servers; direct copy-paste installation commands; detailed MCP documentation; comment and discussion features | Low: approximately 5000 MCPs are already hosted online | ⭐⭐⭐⭐ Suitable for developers and beginners who want to quickly try out MCPs |
| PulseMCP | 5966 | Dynamically updated; covers MCP servers and clients; latest MCP-related news and detailed test cases; user-submitted custom MCP servers | Medium: intuitive interface with direct links to GitHub repositories | ⭐⭐⭐⭐ Suitable for those following the latest MCP ecosystem developments who want client and trend reports |
| Smithery | 6374 | Keyword search; relatively simple categorization, but provides direct copy-paste installation commands; indicates client support status; automatic installation commands for some clients; basic MCP introductions; clean interface | Low: beginner-friendly, but some services are unstable | ⭐⭐⭐ Suitable for beginners and developers who want to quickly try out MCPs |
| Awesome MCP Servers | 1968 | A small but curated selection of MCPs; clear categorization; focuses on MCP quality; basic MCP introductions; user-submitted custom MCP servers | Medium: clean interface, requires some development experience | ⭐⭐⭐ Suitable for those who want to quickly find reliable MCPs without information overload |

Using the MCP servers described above in Chaterm

  1. Open the "Settings" page in Chaterm.

  2. Locate the Tools & MCP tab on the left, click Add Server, and the system will automatically open the mcp_setting.json file.

  3. Add the following configuration to the file, adjusting the corresponding parameters as needed.

  4. After saving, Chaterm will automatically read and attempt to connect to the server.

{
  "mcpServers": {
    "github": {
      "url": "https://api.githubcopilot.com/mcp/",
      "headers": {
        "Authorization": "Bearer your-token"
      },
      "disabled": false
    },
    "awslabs.aws-documentation-mcp-server": {
      "command": "uvx ",
      "args": [
        "awslabs.aws-documentation-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR",
        "AWS_DOCUMENTATION_PARTITION": "aws"
      },
      "disabled": false,
    },
    "grafana": {
      "command": "docker",
      "args": [
        "run",
        "--rm",
        "-i",
        "-e",
        "GRAFANA_URL",
        "-e",
        "GRAFANA_SERVICE_ACCOUNT_TOKEN",
        "mcp/grafana",
        "-t",
        "stdio"
      ],
      "env": {
        "GRAFANA_URL": "",
        "GRAFANA_SERVICE_ACCOUNT_TOKEN": "",
        "GRAFANA_USERNAME": "",
        "GRAFANA_PASSWORD": "",
        "GRAFANA_ORG_ID": "1"
      },
      "disabled": false
    },
    "sentry": {
      "command": "npx",
      "args": [
        "-y",
        "mcp-remote@latest",
        "https://mcp.sentry.dev/mcp"
      ],
      "disabled": false
    },
    "kubernetes": {
      "command": "npx",
      "args": [
        "mcp-server-kubernetes"
      ],
      "disabled": false
    },
    "MongoDB": {
      "command": "npx",
      "args": [
        "-y",
        "mongodb-mcp-server@latest",
        "--readOnly"
      ],
      "env": {
        "MDB_MCP_CONNECTION_STRING": ""
      },
      "disabled": false
    }
  }
}

Originally published at https://chaterm.ai
Website: https://chaterm.ai/
GitHub: https://github.com/chaterm/Chaterm