
An Autonomous, Agentic AI Assistant: Meet Alfred, and This Is How I Built Him

2026-03-16 23:39:44

Introduction

My people, it's me again. This time I have built something fun but mostly useful: I gave building an autonomous agent a chance, and it's turning out well. I know it's a cliché, but his name is Alfred. The thing is, AI agents are no longer a novelty. What started out as simple chatbots chaining a few prompts together has evolved into something far more capable: systems that can "reason" (I know, it's just a lot of math and not actual reasoning), plan, use tools, and execute multi-step workflows with minimal human intervention. Agentic flows, where an AI iteratively breaks down a goal, takes actions, evaluates results, and course-corrects, are quickly becoming the backbone of serious productivity tooling.

But not all models are created equal, and the market is crowded: GPT-4o, Gemini, Mistral, Llama, and DeepSeek all have their own strengths, trade-offs, and devoted user bases. Picking the right model for a given task has become an art form in itself, especially because the benchmarks keep getting blurrier and blurrier.

For me, that choice keeps coming back to Anthropic's Claude, and specifically to Opus. As an engineer, I spend a significant portion of my day thinking in systems: abstractions, edge cases, failure modes, and architecture trade-offs. Opus is the only model that consistently feels like it's doing the same while cleverly grabbing my immediate system context. Where other models can produce code that technically compiles but misses the intent entirely, Opus tends to understand the why behind what I'm building, not just the what. That distinction, subtle as it sounds, makes an enormous practical difference when you're deep in a complex codebase. Opus has downsides too; sometimes it takes shortcuts without adhering to the principles you intended.

What sealed it for me, though, was the CLI experience. Claude's command-line interface is genuinely pleasant to use: fast, composable, and unobtrusive in a way that fits naturally into my existing workflow. It doesn't feel like a detour. It feels like a tool that belongs in my terminal alongside the rest of my stack.

In this article I'm going to talk about why I needed Alfred, the problem he solves for me, how I built him, and how I keep improving him in this ever-changing landscape where engineering meets productivity.

The Monday Morning Problem Every Developer Knows

It is Monday, 8:30 AM. Before I have written a single line of code, I already have a full-time job just figuring out where to start.

Over the weekend, 47 new Gmail messages came in. Some are spam. Some are newsletters I never unsubscribed from. But buried somewhere in that pile is an escalation that needs urgent attention and a teammate asking for a code review. I do not know which email it is yet. I have to dig for it.

That is just Gmail. I also have 12 Outlook emails from work: meeting updates, an HR policy change, and my manager asking about feature progress. Then there are 8 Teams messages spread across 3 different channels covering a production incident from Saturday, a design review thread, and standup notes. On top of that, 3 pull requests were opened against repos I review, and 2 calendar conflicts appeared for Tuesday that I need to sort out before the day gets going.

None of these systems talk to each other. So my morning routine becomes a manual context-switching exercise. I open Gmail, scan subject lines, try to mentally rank urgency. Then I switch to Outlook and do the same. Then Teams. Then Azure DevOps. By the time I have a rough picture of what actually needs my attention, 45 to 60 minutes have passed. And that client escalation? Still buried under newsletters when I finally find it.

The frustrating part is that most of that time is not real work. It is just triage. It is the overhead that comes before the actual job even starts. The other option is to close everything and wait for someone to walk to my table. Lmao I do this all the time.

But well, this is the problem I built Alfred to solve.

What do I want from Alfred?

Unification! Alfred is a personal AI agent built around a single idea: collapsing the chaos of my digital workday into one intelligent, unified system. It continuously polls Gmail at configurable intervals and receives Outlook emails and Microsoft Teams messages via Power Automate webhooks, storing everything locally in SQLite so that regardless of the source, nothing slips through the cracks.
Every incoming email is then put through an AI classification pipeline that assigns it one of six categories (Urgent, Personal, Work, Newsletter, Transactional, or Spam), gives it a priority level from 1 to 5, generates a human-readable summary, extracts action items with optional due dates, and flags whether a follow-up is needed.
From there, a configurable rules engine evaluates each classified email and proposes an appropriate action: archive it, delete it, forward it, draft a reply, or surface it for attention via a notify action with quick-action buttons.
Destructive actions like deletions, sends, and PR approvals wait behind an explicit approval gate in the dashboard, while non-destructive ones like classification and drafting execute automatically.
Every action is tracked through a full lifecycle from proposed to executed, with timestamps, rollback data, and execution results all stored in an append-only audit log.
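Putting the pieces above together, the classifier output and the rules engine's contract can be sketched as a small TypeScript model. Everything below is an illustrative assumption (the type names, the "1 = highest" priority convention, and the rule bodies), not Alfred's actual source; only the six categories, the 1-5 priority range, and the five action types come from the description above.

```typescript
// Hypothetical shape of the classifier output described above.
type EmailCategory =
  | "Urgent" | "Personal" | "Work" | "Newsletter" | "Transactional" | "Spam";

interface ActionItem {
  description: string;
  dueDate?: string; // optional ISO date extracted by the classifier
}

interface Classification {
  category: EmailCategory;
  priority: 1 | 2 | 3 | 4 | 5; // assumption: 1 = most urgent
  summary: string;             // human-readable summary
  actionItems: ActionItem[];
  needsFollowUp: boolean;
}

// The rules engine maps a classification to one proposed action.
type ProposedAction = "archive" | "delete" | "forward" | "draft_reply" | "notify";

// Toy rule bodies for illustration; the real engine is configurable.
function proposeAction(c: Classification): ProposedAction {
  if (c.category === "Spam") return "delete";
  if (c.category === "Newsletter") return "archive";
  if (c.category === "Urgent" || c.needsFollowUp) return "notify";
  return "archive";
}
```

Keeping classification and action proposal as separate steps is what lets the rules stay configurable without touching the classifier itself.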

Email flow

Beyond email, Alfred integrates deeply with the rest of my work toolchain. It connects to Google Calendar and Outlook Calendar for listing, creating, updating, and searching events, and handles Azure DevOps for querying and managing work items, approving pull requests, tracking pipeline runs, and browsing repositories. When a pull request is opened, a dedicated webhook handler automatically fetches the PR details, checks pipeline status, attempts to link related work items from branch name patterns, generates an LLM summary, and proposes approval or work item creation actions accordingly. Microsoft Teams is covered too, with channel message search and webhook-based ingestion keeping Alfred aware of conversations happening outside of email. Tying everything together is a conversational chat interface powered by an agentic loop that extracts intents from natural language, executes them across services, and returns structured, context-aware responses.

devops

Let's look at some of Alfred's core flows in detail.

Email Polling and Synchronization

Alfred's background worker is built around an AgentLoop flow. When the server starts, the AgentLoop runs an initial poll immediately, then sets a repeating setInterval timer at a configurable cadence. Each tick calls emailPort.listMessages("in:inbox", 50) to fetch up to 50 messages from Gmail via the Gmail API; 50 is a reasonable batch size for my personal workflow.

To avoid reprocessing emails Alfred has already seen, the loop maintains an in-memory string set of message IDs. Every polled message is checked against this set, and only genuinely new messages pass through:

// Only messages the loop has not seen before continue down the pipeline.
const newMessages = messages.filter((msg) => !this.seenIds.has(msg.id));
for (const msg of newMessages) {
  this.seenIds.add(msg.id);
}

New messages are immediately persisted to SQLite through EmailRepo.upsert(). The upsert uses SQLite's INSERT ... ON CONFLICT(id) DO UPDATE pattern, which means if Alfred encounters the same email ID twice (for example after a server restart), it updates the existing row rather than creating a duplicate. The repository stores the full email body, sender, recipients, labels, attachments as serialized JSON, and a source field that distinguishes Gmail emails from Outlook emails. I cover the exact upsert schema in the Data Integrity section.

Before sending any email to the classifier, the loop applies a set of skip rules. Social media notifications from Facebook, Instagram, Twitter, TikTok, Reddit, Discord, and similar platforms are matched by regex against the sender address. Emails carrying Gmail's CATEGORY_PROMOTIONS or CATEGORY_SOCIAL labels are also skipped. LinkedIn is explicitly exempted from this filter because its emails often contain actionable professional content. This pre-filtering avoids burning LLM API calls on emails that would reliably classify as low priority anyway.
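A minimal sketch of that pre-filter, assuming regex matching on the sender address plus a label check; the exact patterns and the helper name are mine, not Alfred's:

```typescript
// Assumed sender patterns for social platforms; the real list may differ.
const SOCIAL_SENDER = /@(facebookmail|instagram|twitter|tiktok|redditmail|discord)\./i;
const SKIP_LABELS = new Set(["CATEGORY_PROMOTIONS", "CATEGORY_SOCIAL"]);

function shouldSkip(from: string, labels: string[]): boolean {
  // LinkedIn is explicitly exempted: often actionable professional content.
  if (/@linkedin\./i.test(from)) return false;
  if (SOCIAL_SENDER.test(from)) return true;
  return labels.some((l) => SKIP_LABELS.has(l));
}
```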

The loop also checks whether each email already has a classification in the database before sending it to the classifier. If a record exists, the email is skipped entirely. This means restarting the server does not trigger re-classification of previously processed emails. I wrote it this way to ensure minimum cost and idempotency.

When the classifier encounters a fatal error such as an expired API key, exhausted credit balance, or a 429 rate limit response, the loop enters a paused state rather than crashing or retrying in a tight loop. It sets classifierPaused = true and stops classifying. This is sort of a circuit breaker. On subsequent polls, it still persists new emails to the database so no mail is lost, but it attempts a single test classification to check whether the service has recovered. Once the test succeeds, classification resumes automatically. Error messages are also deduplicated so the same error is only logged once regardless of how many polls occur while paused.

For Outlook, Alfred does not poll directly. Instead, an adapter calls a Power Automate flow that returns Outlook messages. A dedicated payload mapper normalizes Microsoft field names, timestamp formats, and nested structures into the same EmailMessage domain object that Gmail produces. This means the rest of the pipeline, including classification, action rules, and chat, works identically regardless of whether an email originated from Gmail or Outlook. I wrote it this way so that I can later extend email providers by just adding a normalization mapper and then it should be plug and play.
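As a sketch, such a normalization mapper might look like the following. The raw field names follow the shape of a Microsoft Graph message, and the trimmed-down EmailMessage is an assumption rather than Alfred's full domain object:

```typescript
// Illustrative domain object; Alfred's real EmailMessage has more fields.
interface EmailMessage {
  id: string;
  from: string;
  subject: string;
  receivedAt: string; // normalized to ISO 8601 regardless of provider format
  source: "gmail" | "outlook";
}

// Assumed Microsoft payload shape: nested from.emailAddress.address, etc.
function mapOutlookMessage(raw: any): EmailMessage {
  return {
    id: raw.id,
    from: raw.from?.emailAddress?.address ?? "",
    subject: raw.subject ?? "",
    receivedAt: new Date(raw.receivedDateTime).toISOString(),
    source: "outlook",
  };
}
```

Because the mapper is the only provider-specific piece, adding a new email source really is "write one mapper and plug it in."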

Action Proposal, Approval, and Execution

Actions in Alfred follow an event-sourced lifecycle. Every state transition is recorded as an append-only entry in an action log table in SQLite. No rows are ever updated in place or deleted. The lifecycle flows through a fixed set of ActionStatus states: Proposed → Approved → Executed, or alternatively Rejected or RolledBack. This is purely for auditing, so that I can track autonomous actions from the agent.

Proposal

The ProposeAction use case starts with an idempotency check. It queries the action log for any existing entry with the same resourceId and type. If one already exists, it returns null and stops. Otherwise, it appends a new entry with status: Proposed.

From there, the action's RiskLevel determines what happens next. Low-risk actions like Classify, Draft, and Notify carry RiskLevel.Auto and execute immediately without my input. High-risk actions like Archive, Delete, Send, and Forward carry RiskLevel.ApprovalRequired and sit in the proposed state until I act on them from the dashboard:

const risk = ACTION_RISK_LEVELS[action.type];
if (risk === RiskLevel.Auto) {
  const strategy = this.strategies.find((s) => s.source === action.source);
  if (strategy?.canExecute(action.type)) {
    resultData = await strategy.execute({
      type: action.type,
      resourceId: action.resourceId,
      payload: action.payload,
    });
  }
  // Auto actions are marked executed even when no strategy performed a mutation.
  await this.actionLog.updateStatus(actionId, ActionStatus.Executed, new Date().toISOString());
}
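The ACTION_RISK_LEVELS map referenced there is, per the description above, a static mapping from action type to risk level. A sketch (the enum names and values are assumptions; the mapping itself comes from the text):

```typescript
enum RiskLevel {
  Auto = "auto",
  ApprovalRequired = "approval_required",
}

enum ActionType {
  Classify = "classify",
  Draft = "draft",
  Notify = "notify",
  Archive = "archive",
  Delete = "delete",
  Send = "send",
  Forward = "forward",
}

// Low-risk actions execute immediately; destructive ones wait for approval.
const ACTION_RISK_LEVELS: Record<ActionType, RiskLevel> = {
  [ActionType.Classify]: RiskLevel.Auto,
  [ActionType.Draft]: RiskLevel.Auto,
  [ActionType.Notify]: RiskLevel.Auto,
  [ActionType.Archive]: RiskLevel.ApprovalRequired,
  [ActionType.Delete]: RiskLevel.ApprovalRequired,
  [ActionType.Send]: RiskLevel.ApprovalRequired,
  [ActionType.Forward]: RiskLevel.ApprovalRequired,
};
```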

If the action produces result data such as a created draft ID or classification details, that data is stored alongside the log entry via updateResultData().

Approval and Execution

When I click Approve in the dashboard, the ApproveAction use case first updates the log entry's status to Approved with a timestamp, then immediately attempts execution. It finds the correct ActionExecutionStrategy by matching the action's source field. Three strategies exist: GmailActionStrategy handles archive, delete, send, and draft operations via the Gmail API; OutlookActionStrategy handles equivalent operations through Power Automate; and DevOpsActionStrategy handles work item creation and PR approval via the Azure DevOps REST API. This follows the open-closed principle: new providers can be supported by registering additional strategies without modifying the approval flow.

Each strategy declares which action types it supports through a canExecute() method. If a strategy exists but cannot execute the specific action type, the action is marked as executed without performing any real mutation. If execution succeeds, the status moves to Executed. If it fails, the error is returned to the caller but the action remains in Approved state so the user can retry without losing the approval.
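The strategy contract implied by this description can be sketched as follows; the interface shape and the stubbed Gmail strategy are illustrative assumptions, not Alfred's real classes:

```typescript
interface ActionExecutionStrategy {
  source: "gmail" | "outlook" | "devops";
  canExecute(type: string): boolean;
  execute(input: { type: string; resourceId: string; payload?: unknown }): Promise<unknown>;
}

class GmailActionStrategy implements ActionExecutionStrategy {
  source = "gmail" as const;
  private supported = new Set(["archive", "delete", "send", "draft"]);

  canExecute(type: string): boolean {
    return this.supported.has(type);
  }

  async execute({ type, resourceId }: { type: string; resourceId: string }): Promise<unknown> {
    // Stub: the real implementation calls the Gmail API here.
    return { ok: true, type, resourceId };
  }
}
```

Adding a fourth provider would mean writing one new class against this interface and registering it; the approval flow itself never changes.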

The Notify action type is intentionally a no-op at the execution level. It exists so the rules engine can propose surfacing an email to the user without triggering any mutation on the mailbox. The notification itself is handled by the push notification system, not the action executor.

Chat Interface (Intent and Tool Use Modes)

Alfred's chat is the primary way I interact with my workspace data through natural language. I designed it to support two distinct modes of operation: an intent extraction mode (the default) and a tool_use mode powered by Anthropic's native tool-use API. Both implement a ChatStrategy interface defined in a chat-strategy file, which standardises the input (message, history, context, system prompt, dependencies) and output (response text, result strings, action steps).
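A hedged sketch of that ChatStrategy contract; the field names below are my guesses based on the description, not the contents of the real file:

```typescript
interface ChatInput {
  message: string;
  history: { role: "user" | "assistant"; content: string }[];
  context: string;      // local context: email stats, pending actions, follow-ups
  systemPrompt: string;
}

interface ChatOutput {
  response: string;  // final user-facing text
  results: string[]; // raw tool/intent results gathered along the way
  actions: string[]; // human-readable steps, e.g. "Searched Gmail for 'invoice'"
}

interface ChatStrategy {
  run(input: ChatInput): Promise<ChatOutput>;
}

// Trivial stub to show the contract in use.
class EchoStrategy implements ChatStrategy {
  async run(input: ChatInput): Promise<ChatOutput> {
    return { response: `You said: ${input.message}`, results: [], actions: [] };
  }
}
```

Because both modes satisfy the same contract, the ChatService can swap between them without knowing which one is active.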

Intent Extraction Mode

The IntentExtractionStrategy uses a two-LLM architecture. A fast, cheap model (Claude Haiku) handles intent extraction, while the main model (Claude Sonnet) composes the final user-facing response.

The strategy runs an agentic loop of up to 5 rounds. In each round, it sends the user's message, the last 20 conversation history entries (each truncated to 2000 characters), and any results from prior rounds to the fast LLM. The system prompt includes detailed routing rules that map natural language patterns to intent types: "check my Outlook" routes to search_emails with source: "outlook", "calendar" without a provider routes to list_calendar_events without a source, and "work items" routes to query_work_items.

The LLM returns a JSON object with an intents array. Each intent specifies a type matching a registered tool name, along with type-specific fields like query, source, and timeMin. Invalid tool names are filtered out against the ToolRegistry. The strategy then executes each intent by calling the corresponding tool's execute() function, which delegates to the appropriate IntentExecutorDeps method:

for (let round = 0; round < MAX_ROUNDS; round++) {
  const intents = await this.extractIntents(extractionLlm, message, recentHistory, priorResults, deps, validToolNames);
  if (intents.length === 0) break;
  const results = await this.executeTools(deps, intents);
  allResults.push(`--- Round ${round + 1} ---\n${results.join("\n\n")}`);
}

Multi-round execution is what makes complex queries possible. A request like "invite Sabrina to my 3pm meeting tomorrow" requires two rounds: round 1 searches for tomorrow's calendar events, and round 2 uses the event ID from that result to update the event with a new attendee. The LLM receives prior results in an ACTIONS ALREADY EXECUTED THIS TURN block and can return {"intents": [{"type": "none"}]} to signal that all needed data has been gathered and the loop should stop.

After the loop completes, the ChatService combines all gathered results with local context (email stats, pending actions, and follow-ups from the database) and sends everything to the main LLM for final response composition, with extended thinking enabled.

Tool Use Mode

The ToolUseStrategy takes a fundamentally different approach. Rather than extracting intents and executing them as a separate step, it gives the LLM direct access to tools via completeWithTools(). The LLM decides which tools to call, receives structured results, and continues the conversation until it produces a final text response.

This mode requires the LLM adapter to support the Claude tool-use API. The strategy converts all registered tools into Claude tool definitions (name, description, input schema) and passes them alongside the message. The loop runs for up to 5 rounds, checking the stopReason after each response. When the model returns end_turn, the final text becomes the response. When it returns tool calls, the strategy executes each tool, packages the results as ToolResultBlock objects with matching tool_use_id, and sends them back as a user message for the next round:

const response = await deps.llm.completeWithTools({ system: systemPrompt, messages, tools, maxTokens: 4096 });
if (response.stopReason === "end_turn") {
  return { response: response.text ?? "", results: allResults, actions: allActions };
}

If the model exhausts all 5 rounds without reaching end_turn, the strategy returns a graceful fallback message in Alfred's butler voice rather than surfacing a raw error to the user.

Tool Registry

Both modes share the ToolRegistry class in a tool-registry file, which acts as a central catalogue of all available tools. Each tool is registered with a name, description, JSON input schema, an execute function, and a summarize function that produces human-readable action steps such as "Searched Gmail for 'invoice'". The registry can export its tools in two formats: toToolDefinitions() for Claude's native tool-use API, and toIntentPrompt() for building the intent extraction system prompt.

System Prompts

All persona and mode-specific instructions are centralised in a system-prompts file. The BASE_PERSONA establishes Alfred's character as a refined English butler who addresses the user as "Master Jo" and has access to Google Workspace, Microsoft 365, and Azure DevOps. (Jeremy Irons is my favorite Alfred btw) Mode-specific instructions are appended on top: intent mode tells Alfred that actions have already been executed and results are in context so it should not pretend to be searching, while tool-use mode tells Alfred to actively call tools to fetch fresh data.

Authentication and Security

Alfred enforces security at multiple levels across both the dashboard and the agent server.

Dashboard Authentication

The dashboard uses NextAuth.js v5 configured in auth.ts with Google OAuth as the sole provider. Sessions use a JWT strategy with a 7-day maximum age. Access is restricted to a single authorised user through an email allowlist: the signIn callback compares the Google profile's email against the ALLOWED_EMAIL environment variable and rejects any mismatch:

callbacks: {
  signIn({ profile }) {
    return profile?.email?.toLowerCase() === allowedEmail;
  },
}

The auth system uses a custom sign-in page at /auth/login and redirects errors back to the same page for a clean user experience. Since Alfred is a personal, single-user tool, the allowlist approach is both simpler and more appropriate than a full role-based access system.

Server-Side Credentials

The agent server stores sensitive credentials in the macOS Keychain. They are fetched lazily on first use and cached in memory for the lifetime of the process. This means credentials never appear in environment variables, config files, or logs.

Architectural Isolation

The dashboard is a pure client-rendered application. It contains no provider SDK imports, no direct database access, and no secret values. All data access flows through the agent server's HTTP API, and I made sure no credentials ever reach the dashboard. This means that even if the dashboard source code were fully exposed, it would not leak any credentials or grant any access to the underlying data.

Resilience and Caching

Alfred applies several resilience patterns across the system to handle network failures, API rate limits, and performance constraints.

In-Memory TTL Cache

The TtlCache class in cache.ts provides a simple time-to-live cache backed by a JavaScript Map. Each entry stores its data alongside an expiresAt timestamp. The get() method checks expiration on every access and automatically evicts stale entries. The getOrFetch() method combines cache lookup with lazy population:

async getOrFetch<T>(key: string, ttlMs: number, fetcher: () => Promise<T>): Promise<T> {
  const cached = this.get<T>(key);
  if (cached !== undefined) return cached;
  const data = await fetcher();
  this.set(key, data, ttlMs);
  return data;
}

This is used for calendar events and DevOps data, both cached with a 3-minute TTL. During a multi-round chat conversation where Alfred might query the calendar several times, only the first call hits the API and subsequent calls return the cached result. The 3-minute window balances data freshness with meaningful API call reduction.
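For completeness, here is a sketch of the full class including the get()/set() half, written to be consistent with the getOrFetch() shown above; this is a reconstruction, not Alfred's verbatim cache.ts:

```typescript
class TtlCache {
  private store = new Map<string, { data: unknown; expiresAt: number }>();

  set<T>(key: string, data: T, ttlMs: number): void {
    this.store.set(key, { data, expiresAt: Date.now() + ttlMs });
  }

  get<T>(key: string): T | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() >= entry.expiresAt) {
      this.store.delete(key); // evict stale entries on access
      return undefined;
    }
    return entry.data as T;
  }

  async getOrFetch<T>(key: string, ttlMs: number, fetcher: () => Promise<T>): Promise<T> {
    const cached = this.get<T>(key);
    if (cached !== undefined) return cached;
    const data = await fetcher();
    this.set(key, data, ttlMs);
    return data;
  }
}
```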

Agent Loop Resilience

The classifier pause behavior is covered in the Email Polling section above. Beyond that, the polling loop is designed so that a failure in any single stage (classification, action proposal, or action execution) does not crash or block the rest of the loop. Each stage fails independently and logs the error without taking down the whole cycle.

Power Automate Retries

The Power Automate client implements a 3-attempt retry with linear backoff (1s, 2s, 3s) for transient HTTP errors and timeouts. Non-retryable errors such as 4xx client errors (excluding 429) fail immediately without retrying. Each request uses AbortController with a 30-second timeout to prevent indefinite hangs.
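A sketch of that policy, with the attempt count, backoff base, and timeout made parameters; the helper name and the exact status handling are assumptions rather than the real client:

```typescript
async function fetchWithRetry(
  url: string,
  init: any = {},
  attempts = 3,
  backoffMs = 1000,
  timeoutMs = 30_000,
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs); // cap each request
    try {
      const res = await fetch(url, { ...init, signal: controller.signal });
      // Success, or a non-retryable 4xx (anything except 429): return immediately.
      if (res.ok || (res.status >= 400 && res.status < 500 && res.status !== 429)) return res;
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err; // network error or timeout abort: retryable
    } finally {
      clearTimeout(timer);
    }
    // Linear backoff between retries: 1x, 2x the base.
    if (attempt < attempts) await new Promise((r) => setTimeout(r, attempt * backoffMs));
  }
  throw lastError;
}
```

Parameterizing the backoff base also makes the policy cheap to exercise in tests, since retries can run with millisecond delays.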

Push Notification Delivery

The web push delivery mechanics including concurrent sends, Promise.allSettled(), and automatic cleanup of expired subscriptions are covered in the Push Notifications section under Discoveries where the full implementation is explained in context.

Deployment and Operations

Alfred runs as three persistent background services on macOS, managed by launchd, Apple's native process manager. The deployment system is entirely script-based with no containers, no cloud infrastructure, and no external process managers. Everything runs on a single Mac.

The Three Services

The agent server is the core process. It runs the Node.js HTTP API, the background email polling loop, the action execution pipeline, and the finance statement processor. It owns all external API calls to Gmail, Google Calendar, Anthropic, Azure DevOps, and Power Automate, along with all OAuth credentials stored in macOS Keychain and the SQLite database.

The dashboard is a Next.js application serving the client-rendered UI. In production it runs against a pre-built output directory and makes no direct calls to any external service. All data comes through the agent server's HTTP API. It receives a bearer token as an environment variable so it can authenticate its requests to the agent server.

The Cloudflare tunnel creates an encrypted outbound connection from the Mac to Cloudflare's edge network, making the dashboard publicly accessible without opening any inbound ports or touching the router. It routes HTTPS traffic from the public domain down to the local Next.js server on a local port.

launchd Service Configuration

Each service is defined as a .plist property list file. The plist files use placeholder tokens that are replaced with real values at deploy time using sed. The key properties are RunAtLoad: true to start on login, KeepAlive: true to auto-restart on crash, and ThrottleInterval: 10 to wait at least 10 seconds between restart attempts and prevent tight crash loops:

<key>ProgramArguments</key>
<array>
    <string>PROJECT_ROOT/node_modules/.bin/tsx</string>
    <string>apps/agent-server/src/index.ts</string>
</array>
<key>KeepAlive</key>
<true/>
<key>ThrottleInterval</key>
<integer>10</integer>

Each service logs stdout and stderr to separate files that can be tailed in real time for debugging.

The Deploy Script

Deployment runs through a single script that orchestrates six steps in order:

  • creating the log directory
  • sourcing the .env file to load environment variables
  • running npm install at the monorepo root to install all workspace dependencies
  • running npm run build to compile all TypeScript packages in dependency order (domain → application → infrastructure → contracts → agent server, then the Next.js dashboard)
  • copying each plist template into ~/Library/LaunchAgents/ with placeholders replaced by real paths
  • finally, loading all three services with launchctl load to start them immediately

Before installing each plist, the script unloads any previously running version to prevent conflicts, resulting in a brief restart with minimal downtime:
for plist in com.alfred.agent.plist com.alfred.dashboard.plist com.alfred.cloudflared.plist; do
  launchctl unload "$LAUNCH_AGENTS_DIR/$plist" 2>/dev/null || true
  sed -e "s|PROJECT_ROOT|$PROJECT_ROOT|g" \
      -e "s|USER_HOME|$USER_HOME|g" \
      -e "s|CLOUDFLARED_BIN|$CLOUDFLARED_BIN|g" \
      -e "s|NODE_BIN_PATH|$NODE_BIN_PATH|g" \
      -e "s|BEARER_TOKEN_VALUE|${BEARER_TOKEN:-}|g" \
      "$DEPLOY_DIR/$plist" > "$LAUNCH_AGENTS_DIR/$plist"
done

The script automatically detects the Node.js binary path across nvm, Homebrew, and system installs, and locates the cloudflared binary for both Apple Silicon and Intel Homebrew paths. At the end it prints a macOS settings checklist reminding me to enable auto-login, prevent sleep, and configure startup after power failure, since the Mac effectively acts as a persistent home server.

First-Time Setup

Initial installation is handled by a setup script that checks prerequisites (Homebrew and Node.js 20 or above), installs cloudflared, creates the .env file interactively, runs the Google OAuth flow by opening a browser for consent and storing the resulting refresh token in Keychain, authenticates with Cloudflare, creates the tunnel, configures DNS routes, and then kicks off the deploy script to bring everything up.

Operational Commands

I have scripts for the full operational lifecycle. A status command shows whether each service is running, its PID, and the last 5 log lines. A teardown command unloads all services and removes the plist files from LaunchAgents while preserving logs. A universal launcher supports multiple modes: all for full production, dev for hot-reload development, agent or dashboard individually, status for health checks, and doctor for preflight validation.

Configuration

All configuration flows through environment variables loaded from a .env file at the project root. A config.ts module reads these and returns a typed AppConfig object. Three variables are required: GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, and ANTHROPIC_API_KEY. Everything else is optional and enables features progressively. Setting AZURE_DEVOPS_ORG enables DevOps integration. Setting PA_FLOW_MAIL_SEARCH enables Outlook. Setting VAPID_PUBLIC_KEY enables push notifications and so on. If an optional config block is absent, the composition root simply skips registering those adapters and use cases, so the system degrades gracefully rather than failing to start.
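A sketch of how such progressive gating might look in config.ts; the AppConfig shape and the loader are illustrative assumptions, while the environment variable names come from the text above:

```typescript
interface AppConfig {
  google: { clientId: string; clientSecret: string };
  anthropicApiKey: string;
  devOps?: { org: string };                // only when AZURE_DEVOPS_ORG is set
  outlook?: { mailSearchFlowUrl: string }; // only when PA_FLOW_MAIL_SEARCH is set
  push?: { vapidPublicKey: string };       // only when VAPID_PUBLIC_KEY is set
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  // Hard requirements fail fast; everything else enables features progressively.
  for (const key of ["GOOGLE_CLIENT_ID", "GOOGLE_CLIENT_SECRET", "ANTHROPIC_API_KEY"]) {
    if (!env[key]) throw new Error(`Missing required env var: ${key}`);
  }
  return {
    google: { clientId: env.GOOGLE_CLIENT_ID!, clientSecret: env.GOOGLE_CLIENT_SECRET! },
    anthropicApiKey: env.ANTHROPIC_API_KEY!,
    ...(env.AZURE_DEVOPS_ORG ? { devOps: { org: env.AZURE_DEVOPS_ORG } } : {}),
    ...(env.PA_FLOW_MAIL_SEARCH ? { outlook: { mailSearchFlowUrl: env.PA_FLOW_MAIL_SEARCH } } : {}),
    ...(env.VAPID_PUBLIC_KEY ? { push: { vapidPublicKey: env.VAPID_PUBLIC_KEY } } : {}),
  };
}
```

The composition root can then check config.devOps and friends, registering adapters only for the blocks that exist.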

Data Integrity

Ensuring that Alfred handles data meticulously was very important to me. It does not make sense to build an assistant that is sloppy with the information it presents. Therefore I wrote Alfred in such a way that he prevents duplicate and inconsistent data through idempotency checks, upsert semantics, and schema separation at every data boundary.

Idempotent Action Proposals

Before creating a new entry in the action log, the proposal system queries for any existing entry with the same resourceId and type. If a match is found, the proposal is silently skipped and returns null. This means the polling loop can encounter the same email multiple times, such as after a server restart, without generating duplicate action proposals:

const existing = await this.actionLog.getByResourceIdAndType(action.resourceId, action.type);
if (existing) return null;

Email Upsert Semantics

Whether an email arrives via polling, a webhook, or is encountered again after a restart, the upsert guarantees exactly one row per email ID. All fields including subject, body, labels, and read status are updated to their latest values, and an updated_at timestamp records when the last refresh occurred:

INSERT INTO emails (id, thread_id, from_address, ..., updated_at)
VALUES (@id, @threadId, @from, ..., datetime('now'))
ON CONFLICT(id) DO UPDATE SET
  thread_id = excluded.thread_id,
  from_address = excluded.from_address,
  ...
  updated_at = datetime('now')

Conversation Ordering

Chat messages are stored with a created_at timestamp and always queried in chronological order using ORDER BY created_at ASC. Messages are never reordered, edited, or deleted after creation. This ensures the conversation history Alfred sees when composing a response exactly matches what the user experienced.

Normalised Schema Design

Classifications are stored in a separate classifications table linked to emails by email_id. This separation means re-classifying an email, whether due to a model update or a rule change, only touches the classification row without affecting the underlying email data. The email's original content, headers, labels, and metadata remain untouched. Follow-ups and action log entries follow the same pattern. Each table has a single source of truth for its own data, and no operation on one table can corrupt another.

Pitfalls: From Intent Extraction to Tool Use

I started Alfred's chat system with a pure intent extraction approach. The idea was straightforward: send my message to a fast LLM, ask it to return structured JSON with an intent type and parameters, then map that intent to an executor function. A message like "show me today's calendar" would produce {"type": "list_calendar_events", "timeMin": "2026-03-16", "timeMax": "2026-03-16"}, and the system would call the calendar adapter directly:

const intents = await this.extractIntents(extractionLlm, message, recentHistory, priorResults, deps, validToolNames);
for (const intent of intents) {
  const entry = deps.toolRegistry.get(intent.type as string);
  if (entry) {
    const result = await entry.execute(deps.intentExecutor, intent);
    if (result) results.push(result);
  }
}

I built this following the Open/Closed Principle. Each intent type was a self-contained ToolEntry registered in a ToolRegistry. Adding a new capability meant registering a new entry with a name, schema, executor function, and summariser. No existing code needed modification:

toolRegistry.register({
  name: "search_emails",
  description: "Search emails by query, category, or sender",
  inputSchema: { ... },
  execute: async (deps, input) => { ... },
  summarize: (input) => `Searched emails: ${input.query}`,
});

In theory this was clean and extensible. In practice, the cost of adding intents started to compound. Every new capability required writing a system prompt fragment describing the intent format, adding routing rules so the LLM knew when to select it, writing the executor function, and testing that the LLM reliably produced the right JSON structure. At 5 intent types it was manageable. By the time I had 15 (email search, calendar list, calendar create, calendar update, calendar search, work item query, work item create, PR query, pipeline list, Teams messages, follow-ups, actions, repo list, commits, branch list), the intent extraction system prompt had ballooned. The LLM was juggling too many format rules and frequently produced malformed JSON or selected the wrong intent type.

The extraction prompt had grown to include detailed routing rules, source-specific provider logic, multi-intent support, and follow-up round awareness:

const INTENT_RULES = `
ROUTING RULES:
- "check my Outlook" → search_emails with source: "outlook"
- "search Gmail" → search_emails with source: "gmail"
- "Outlook calendar" → list_calendar_events with source: "outlook-calendar"
- "work items" / "tickets" → query_work_items
- "pull requests" / "PRs" → query_source_control with subtype: "pull_requests"
...
`;

Every new intent meant updating these routing rules, testing edge cases, and hoping the model did not confuse the new intent with existing ones. The Open/Closed architecture was holding up at the code level, since I was not modifying existing executors, but the prompt was a single growing artifact shared by every intent. Adding one intent risked degrading the reliability of all the others.


This led me to Claude's native tool use API. Instead of asking the LLM to produce JSON matching my custom schema, I could give it proper tool definitions and let Claude's built-in tool calling handle the routing:

const tools = deps.toolRegistry.toToolDefinitions();
const response = await deps.llm.completeWithTools({
  system: systemPrompt,
  messages,
  tools,
  maxTokens: 4096,
});

Claude's tool use was noticeably more reliable. It natively understands tool schemas, validates parameters against the input schema, and handles multi-tool calls cleanly. The model picks the right tool more consistently than my intent extraction prompt ever did, because tool selection is a first-class capability of the model rather than something I was trying to engineer through prompt instructions.

But tool use burned through API credits quickly. Each round of the conversation becomes a full API call carrying the entire tool catalogue, conversation history, and system prompt. A simple question like "what meetings do I have today?" that previously cost one cheap Haiku call for intent extraction plus one Sonnet call for response composition now cost one or more full Sonnet calls with tool definitions attached, adding significant token overhead to every request.

I balanced models to keep costs sustainable. Intent extraction uses Haiku because it only needs to produce structured JSON, not reason deeply. Final response composition uses Sonnet with extended thinking enabled because that is where quality matters:

const strategyDeps = {
  llm: this.deps.llm,         // Sonnet — reasoning and response
  fastLlm: this.deps.fastLlm, // Haiku — intent extraction
  ...
};

Rather than committing to one approach, I gave the chat system the ability to switch between both modes. The mode parameter on each request selects the active strategy:

const strategy = mode === "tool_use" ? toolUseStrategy : intentStrategy;
const strategyResult = await strategy.run({ message, history, localContext, systemPrompt, deps });

Intent mode is cheaper and faster for straightforward queries where the routing rules work well. Tool use mode is more reliable for complex, ambiguous, or multi-step requests where maintaining routing rules would be impractical. Both strategies implement the same ChatStrategy interface and share the same ToolRegistry, so all capabilities are available in both modes without any duplication.
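The interface that makes the swap possible can be as small as one method. The exact signature below is reconstructed from the strategy.run() call site above, so treat the field names as assumptions:

```typescript
// Hypothetical interface reconstructed from the strategy.run() call above.
interface ChatStrategyInput {
  message: string;
  history: Array<{ role: "user" | "assistant"; content: string }>;
  localContext: string;
  systemPrompt: string;
  deps: unknown;
}

interface ChatStrategyResult {
  response: string;
  results: string[];
  actions: string[];
}

interface ChatStrategy {
  run(input: ChatStrategyInput): Promise<ChatStrategyResult>;
}

// Picking a strategy is then a pure lookup, trivially extensible to new modes.
function selectStrategy(
  mode: string,
  strategies: Record<string, ChatStrategy>,
  fallback: ChatStrategy,
): ChatStrategy {
  return strategies[mode] ?? fallback;
}
```

The payoff of the shared interface is that the caller never branches on mode internals: a third strategy in the future is one more entry in the lookup table.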

From Single Request-Response to Reasoning Loops

Early on, the chat used a single request-response pattern. I ask a question, Alfred gathers context from the database, sends everything to the LLM in one shot, and returns the response. The quality was poor. With 15+ tools and a rich system prompt, the model would frequently miss details, give shallow answers, or fail to connect information across multiple data sources. A question like "what's my schedule like tomorrow and do I have any overdue follow-ups?" would produce a partial answer because the model was trying to handle everything in a single pass.

My first instinct was to use a better model. I switched from Sonnet to Opus for the response composition step and the quality jumped immediately. Opus reasons more carefully, connects dots across context, and produces noticeably more nuanced responses. But it was expensive. Opus costs significantly more per token than Sonnet, and every chat message was a full context window call carrying email stats, action history, follow-up data, and conversation history.

This led me to implement reasoning loops. Instead of asking the model to do everything in one pass, I let it work iteratively. In intent mode, the strategy runs up to 5 rounds. Each round extracts intents, executes them, and feeds the results back into the next round's context:

for (let round = 0; round < MAX_ROUNDS; round++) {
  const intents = await this.extractIntents(extractionLlm, message, recentHistory, priorResults, deps, validToolNames);
  if (intents.length === 0) break;
  const results = await this.executeTools(deps, intents);
  allResults.push(`--- Round ${round + 1} ---\n${results.join("\n\n")}`);
}

In tool use mode, the loop is similar but driven by Claude's stop reason. The model keeps calling tools until it decides it has enough information and returns a final text response:

for (let round = 0; round < MAX_ROUNDS; round++) {
  const response = await deps.llm.completeWithTools({ system: systemPrompt, messages, tools, maxTokens: 4096 });
  if (response.stopReason === "end_turn") {
    return { response: response.text ?? "", results: allResults, actions: allActions };
  }
  // ... execute tool calls, feed results back
}
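The "feed results back" step follows the Anthropic Messages API convention: each tool_use block in the assistant turn is answered by a tool_result block, matched by id, in the next user turn. A sketch of that append step with stand-in message types (the real adapter hides this behind completeWithTools):

```typescript
// Content-block shapes mirroring the Anthropic Messages API tool use format.
interface ToolUseBlock { type: "tool_use"; id: string; name: string; input: Record<string, unknown>; }
interface ToolResultBlock { type: "tool_result"; tool_use_id: string; content: string; }

type MessageContent = string | ToolUseBlock[] | ToolResultBlock[];
interface ChatMessage { role: "user" | "assistant"; content: MessageContent; }

// After a round that stops with "tool_use", append the assistant turn
// (carrying its tool_use blocks) plus a user turn carrying one matching
// tool_result per call, before the next completeWithTools request.
function appendToolRound(
  messages: ChatMessage[],
  assistantBlocks: ToolUseBlock[],
  results: Map<string, string>, // tool_use_id -> executor output
): void {
  messages.push({ role: "assistant", content: assistantBlocks });
  messages.push({
    role: "user",
    content: assistantBlocks.map((block): ToolResultBlock => ({
      type: "tool_result",
      tool_use_id: block.id,
      content: results.get(block.id) ?? "No result.",
    })),
  });
}
```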

This multi-round approach means a request like "invite Sarah to my 3pm meeting tomorrow" works naturally.
Round 1 searches tomorrow's calendar events.
Round 2 uses the event ID from that result to update the event with a new attendee. The LLM sees prior results in an ACTIONS ALREADY EXECUTED THIS TURN block and returns {"intents": [{"type": "none"}]} when everything is resolved and the loop should stop.

{"timestamp":"2026-03-16T07:11:03.210Z","level":"info","msg":"\nchat:start","component":"chat","message":"What does my outlook calendar look like ?","historyLength":16,"mode":"tool_use"}
{"timestamp":"2026-03-16T07:11:07.854Z","level":"info","msg":"llm:completeWithTools","component":"llm","model":"claude-opus-4-6","inputTokens":8168,"outputTokens":131,"durationMs":4644,"stopReason":"tool_use"}
{"timestamp":"2026-03-16T07:11:07.854Z","level":"info","msg":"chat:tool-use-round","component":"chat","round":1,"stopReason":"tool_use","toolCallCount":1,"hasText":true,"durationMs":4644}
{"timestamp":"2026-03-16T07:11:07.855Z","level":"info","msg":"chat:tool-result","component":"chat","tool":"list_calendar_events","resultLength":33,"resultPreview":"Calendar Events: No events found."}
{"timestamp":"2026-03-16T07:11:13.314Z","level":"info","msg":"llm:completeWithTools","component":"llm","model":"claude-opus-4-6","inputTokens":8318,"outputTokens":120,"durationMs":5458,"stopReason":"end_turn"}
{"timestamp":"2026-03-16T07:11:13.315Z","level":"info","msg":"chat:tool-use-round","component":"chat","round":2,"stopReason":"end_turn","toolCallCount":0,"hasText":true,"durationMs":5459}
{"timestamp":"2026-03-16T07:11:13.315Z","level":"info","msg":"chat:complete","component":"chat","totalDurationMs":10106,"mode":"tool_use","actionCount":1}

The reasoning happens where it counts. Mechanical work like deciding which tools to call uses the cheapest model that can do it reliably, and the expensive synthesis step only fires once at the end. A 3-round conversation costs 3 Haiku calls plus 1 Sonnet call rather than 3 Opus calls.

Prompt Refinement

Prompt refinement turned out to be significantly harder with intent extraction than with tool use. With intent extraction, I was responsible for the entire instruction surface: routing rules, format specifications, edge case handling, multi-intent support, source disambiguation, date inference, and conversational context awareness. Every ambiguous user message required a new rule or clarification in the prompt. The prompt became a fragile, growing document where changing one section could silently break another.

With tool use, Claude does most of the heavy lifting. I define each tool's name, description, and input schema. Claude figures out when to call it, what parameters to pass, and how to combine results across multiple tools. The refinement effort shifted from "teach the model my custom intent format" to "write clear tool descriptions and let the model's built-in tool selection do its job." This was a dramatically smaller surface area to maintain.

The persona prompt is where I spent the most deliberate effort, and I structured it to follow the Open/Closed Principle. The BASE_PERSONA defines Alfred's character, his access to workspace systems, and the critical behavioural rules that apply regardless of which mode is active:

export const BASE_PERSONA = `You are Alfred, a distinguished personal workspace assistant. 
You are an old English gentleman — impeccably dressed in a three-piece suit at all times, 
refined in manner, and utterly devoted to your employer. You always address the user as 
"Master Jo". Your speech carries the quiet authority and warmth of a seasoned butler...

CRITICAL RULES:
- ALWAYS address the user as "Master Jo"
- ONLY use the data provided to you. Do not make up emails, events, or results.
- When calendar events were CREATED, confirm this to the user with details and calendar links.
...`;

Mode-specific instructions are appended on top without touching the base. Intent mode tells Alfred that actions have already been executed and results are already in context, so he should not pretend to be searching. Tool use mode tells Alfred to actively call tools to fetch fresh data. The buildSystemPrompt() function composes these cleanly:

export function buildSystemPrompt(mode: "intent" | "tool_use"): string {
  const modeInstructions = mode === "tool_use" ? TOOL_USE_MODE_INSTRUCTIONS : INTENT_MODE_INSTRUCTIONS;
  return BASE_PERSONA + "\n" + modeInstructions;
}

This separation means I can refine Alfred's personality, add new behavioural rules, or adjust mode-specific instructions entirely independently. Adding a new mode in the future means writing a new instruction block and adding a case to buildSystemPrompt(), without touching the persona or any existing mode instructions.

The persona itself evolved through iteration. Early versions were too stiff and formal. Later versions overcorrected and became too casual. The current version balances warmth with efficiency, giving Alfred permission to be dry-witted and occasionally opinionated while staying concise and never fabricating data.

Discoveries

The Floodgate Effect

Once I had the first working version of Alfred deployed, something unexpected happened: my mind would not stop generating ideas. The initial version could poll Gmail, classify emails, propose actions, and let me approve them from a dashboard. It was functional, but using it every day exposed gaps and opportunities I had not anticipated during planning. Every morning I would open the dashboard, see how Alfred handled my overnight inbox, and think "what if he could also do this?" The backlog grew faster than I could build.

This is something I did not expect about building a personal tool. When you are the only user, the feedback loop is immediate. There is no product manager filtering requests, no sprint planning, no prioritisation meetings. You feel the friction directly, and the fix is always within reach. That immediacy is both a gift and a trap. I had to learn to be disciplined about scope, because every "quick addition" carries a maintenance cost that compounds.

Financial Statement Processing

The first major expansion came from a personal pain point. I bank with two banks in Malaysia, and both send monthly e-statements as password-protected PDF attachments to my Gmail. Every month I would download the PDFs, unlock them, manually scan through transactions, and try to categorise spending in a spreadsheet. I actually stopped this a long time ago. It was tedious, error-prone, and I rarely kept up with it. I realised Alfred already had the infrastructure to solve this: he polls Gmail, he can download attachments, and he has an LLM for classification.

I built a six-stage pipeline that runs automatically during each polling cycle. Alfred searches Gmail for emails from the configured bank sender addresses, filters for emails with PDF attachments, and checks each against the bank_statements table to skip already-processed ones. The idempotency check matters because the polling loop runs every 60 seconds and the same bank emails will appear in search results repeatedly:

private async findUnprocessedIds(bank: BankConfig, filters: EmailSearchFilters): Promise<string[]> {
  const ids = await this.deps.emailRead.searchFilteredIds(filters);
  const unprocessed: string[] = [];
  for (const id of ids) {
    if (!(await this.deps.statementRepo.isStatementProcessed(id))) {
      unprocessed.push(id);
    }
  }
  return unprocessed;
}

For each unprocessed email, Alfred downloads the PDF attachment and decrypts it using the bank-specific password from environment config. This is where I hit the first real bug. The pdf-parse library accepts a password option, but its internal implementation completely ignores it. It passes the raw buffer directly to PDF.js's getDocument() instead of wrapping it in { data, password }. Every statement was failing with a cryptic "No password given" error. The fix was a workaround that tricks pdf-parse by passing a PDF.js parameter object in place of the buffer:

const pdfInput = { data: new Uint8Array(pdfBuffer), password } as unknown as Buffer;
const result = await pdf(pdfInput);

After decryption, the raw text goes to a bank-specific parser. Each bank formats its statements differently, so I built a StatementParserRegistry that routes to the correct parser based on the BankProvider enum.

The parser also strips page noise including headers, footers, and the Chinese and Malay translations that some banks include on every page, and collects multi-line transaction details like merchant names and reference numbers.
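Neither the noise patterns nor the date format below come from the real parsers, which are bank-specific, but this stage generally has the same shape: filter known boilerplate lines, then group continuation lines under the transaction they belong to:

```typescript
// Hypothetical noise patterns; real parsers carry bank-specific rules.
const NOISE_PATTERNS = [
  /^Page \d+ of \d+/i,
  /^Statement of Account/i,
];

function stripPageNoise(lines: string[]): string[] {
  return lines.filter((line) => {
    const trimmed = line.trim();
    if (trimmed.length === 0) return false;
    return !NOISE_PATTERNS.some((p) => p.test(trimmed));
  });
}

// Lines that start with a date open a new transaction; anything else
// (merchant name, reference number) is a continuation of the previous one.
const DATE_PREFIX = /^\d{2}\/\d{2}\/\d{4}\b/;

function groupTransactionLines(lines: string[]): string[][] {
  const groups: string[][] = [];
  for (const line of lines) {
    if (DATE_PREFIX.test(line)) groups.push([line]);
    else if (groups.length > 0) groups[groups.length - 1].push(line);
  }
  return groups;
}
```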

Once parsed, transactions go through a hybrid classification stage. The HybridTransactionClassifier first attempts rule-based categorisation using keyword matching (merchant names like "GRAB" map to transport, "MCDONALD'S" maps to food), and falls back to Claude Haiku for ambiguous transactions. This hybrid approach keeps costs low because most transactions have recognisable merchant names that do not need LLM inference.
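The rule-based half of a hybrid classifier like this is essentially a keyword table scan, with only the misses escalating to the LLM. A sketch with a hypothetical keyword table and fallback signature (the real HybridTransactionClassifier will differ):

```typescript
// Hypothetical keyword table; the real rules live in the classifier's config.
const CATEGORY_KEYWORDS: Record<string, string[]> = {
  transport: ["GRAB", "TOUCH N GO", "PETRONAS"],
  food: ["MCDONALD", "STARBUCKS", "FOODPANDA"],
  shopping: ["SHOPEE", "LAZADA"],
};

type LlmClassify = (description: string) => Promise<string>;

async function classifyTransaction(description: string, llmFallback: LlmClassify): Promise<string> {
  const upper = description.toUpperCase();
  // Cheap path: keyword match covers most recognisable merchants.
  for (const [category, keywords] of Object.entries(CATEGORY_KEYWORDS)) {
    if (keywords.some((kw) => upper.includes(kw))) return category;
  }
  // Expensive path: only ambiguous descriptions reach Haiku.
  return llmFallback(description);
}
```

The cost profile follows directly: if nine out of ten transactions hit the keyword path, LLM spend on classification drops by roughly an order of magnitude.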

The pipeline also handles historical backfill. On first run, it does not just process recent statements. It walks backward through the inbox month by month, processing older statements until it reaches a configurable cutoff, defaulting to 12 months. A backfill_state table tracks the cursor position per bank so the backfill can resume across server restarts:

private async processBackfill(bank: BankConfig): Promise<void> {
  const isComplete = await this.deps.backfillStateRepo.isComplete(bank.bankProvider);
  if (isComplete) return;

  const cursor = await this.deps.backfillStateRepo.getCursor(bank.bankProvider);
  const cutoff = new Date();
  cutoff.setMonth(cutoff.getMonth() - this.deps.backfillMonths);
  // ... fetch historical emails before cursor, process, advance cursor
}

All of this produces a normalised finance_transactions table where every transaction from every bank shares the same schema: date, description, amount, type (credit or debit), balance, category, merchant name, and statement period. Two banks, different formats, one unified table.
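The unified row shape might look like the following; the field names are inferred from the description above rather than copied from the real schema:

```typescript
// Hypothetical normalised shape; actual column names may differ.
type TransactionType = "credit" | "debit";

interface FinanceTransaction {
  date: string;            // ISO date of the transaction
  description: string;     // statement line, post-parsing
  amount: number;          // always positive; type carries direction
  type: TransactionType;
  balance: number | null;  // running balance if the bank prints one
  category: string;        // output of the hybrid classifier
  merchant: string | null;
  statementPeriod: string; // e.g. "2026-02"
  bank: string;            // which BankProvider the row came from
}

// Aggregation over the unified table is bank-agnostic, e.g. spend per category:
function spendByCategory(rows: FinanceTransaction[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    if (row.type !== "debit") continue;
    totals.set(row.category, (totals.get(row.category) ?? 0) + row.amount);
  }
  return totals;
}
```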

Making Financial Data Conversational

Having the data in SQLite was useful on its own (the dashboard has a Finance page with tables and charts), but the real power came from wiring it into Alfred's chat. I registered finance-specific tools in the ToolRegistry so that both chat modes can query transaction data naturally.

The chat can now answer questions like "how much did I spend on food last month?", "what were my biggest transactions in February?", or "show me all Grab transactions this year." Alfred queries the finance_transactions table, aggregates the results, and presents them in his butler persona.

What I did not anticipate is that this naturally enabled budgeting. Once Alfred could tell me "you spent RM 2,400 on dining in February, Master Jo," I started asking follow-up questions like "is that more than January?" and "set a reminder if I go over RM 2,000 next month." The transaction data combined with the follow-up system and push notifications created a lightweight budget monitoring capability that I never explicitly designed. It emerged from the intersection of features that already existed.

Progressive Web App

The dashboard started as a standard Next.js web app accessed through a browser tab. It worked, but it felt disposable. I would forget to check it, or close the tab and lose my place. Making Alfred a Progressive Web App changed that relationship. With a PWA manifest, a service worker, and the right meta tags, Alfred became an app I could install on my phone and in my Mac's dock. It has its own window, its own icon, and it persists across reboots.

The practical difference is small since it is still the same Next.js app behind the scenes. But the psychological difference is significant. An app in the dock feels like a tool. A browser tab feels temporary. I open Alfred every morning now the way I open Slack or my email client. It has presence.

Push Notifications with Service Workers

The feature I am most proud of is the push notification system. Before I built it, Alfred was purely pull-based. I had to open the dashboard to see if anything needed attention. Proposed actions would sit in the approval queue for hours because I simply forgot to check. Follow-ups would go overdue silently.

Push notifications made Alfred proactive. When the classification pipeline proposes a new action for approval, Alfred sends a push notification to my browser. When a high-priority email arrives, he notifies me immediately. When a DevOps PR webhook fires, I get a notification with a deep link straight to the approvals page.

The implementation uses the Web Push protocol with VAPID keys for authentication. The SendNotification use case checks user preferences before sending. I can toggle notifications per event type from the Settings page, and for high-priority emails I can set a minimum priority threshold:

const pref = await this.preferenceRepo.get(event.type);
if (pref && !pref.enabled) return;

if (event.type === NotificationEventType.HighPriorityEmail && emailPriority !== undefined) {
  const threshold = PRIORITY_THRESHOLDS[minPriority] ?? PRIORITY_THRESHOLDS.high;
  if (emailPriority > threshold) return;
}

The WebPushAdapter sends to all registered browser subscriptions concurrently using Promise.allSettled(), so a failed delivery to one device does not block others. It automatically cleans up expired subscriptions when the push service returns HTTP 410 or 404, which happens when a user clears browser data or uninstalls the PWA.
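The fan-out-and-prune pattern looks roughly like this; the subscription record and send function below are stand-ins for the real web-push client, not the actual WebPushAdapter:

```typescript
// Hypothetical stand-ins for the real web-push client and repository.
interface PushSubscriptionRecord {
  id: string;
  endpoint: string;
}

type SendFn = (sub: PushSubscriptionRecord, payload: string) => Promise<void>;

class PushError extends Error {
  constructor(public statusCode: number) {
    super(`push failed: ${statusCode}`);
  }
}

// Send to all devices concurrently; a failure on one never blocks the others.
// Returns the ids of subscriptions the push service reports as gone (410/404)
// so the caller can delete them from storage.
async function broadcast(
  subs: PushSubscriptionRecord[],
  payload: string,
  send: SendFn,
): Promise<string[]> {
  const results = await Promise.allSettled(subs.map((sub) => send(sub, payload)));
  const expired: string[] = [];
  results.forEach((result, i) => {
    if (
      result.status === "rejected" &&
      result.reason instanceof PushError &&
      (result.reason.statusCode === 410 || result.reason.statusCode === 404)
    ) {
      expired.push(subs[i].id);
    }
  });
  return expired;
}
```

Promise.allSettled is the key choice over Promise.all: transient failures on one device (say, HTTP 500) surface as rejected entries without aborting delivery to everyone else.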

On the client side, a service worker listens for push events and displays native OS notifications with the app icon, a body preview, and a deep link URL. The notificationclick handler is smart about reusing existing windows: if the dashboard is already open, it focuses that tab instead of opening a new one:

self.addEventListener("notificationclick", (event) => {
  event.notification.close();
  const url = event.notification.data?.url ?? "/";
  event.waitUntil(
    self.clients.matchAll({ type: "window", includeUncontrolled: true }).then((clients) => {
      for (const client of clients) {
        if (client.url.includes(url) && "focus" in client) return client.focus();
      }
      return self.clients.openWindow(url);
    }),
  );
});

The usePushNotifications React hook manages the entire subscription lifecycle from the UI: checking browser support, requesting notification permission, fetching the VAPID public key from the server, subscribing via the Push API, and sending the subscription details to the server for storage. Unsubscribing reverses the process, removing the subscription from both the browser and the server database.
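One fiddly detail inside a hook like this: PushManager.subscribe wants the VAPID public key as raw bytes in applicationServerKey, while the server hands it out base64url-encoded. The hook's internals are the author's own, but the conversion step is a widely used pattern (sketched here with Node's Buffer for testability; browser code typically decodes with atob instead):

```typescript
// Convert a base64url-encoded VAPID public key into the Uint8Array
// that PushManager.subscribe({ applicationServerKey }) expects.
function urlBase64ToUint8Array(base64Url: string): Uint8Array {
  const padding = "=".repeat((4 - (base64Url.length % 4)) % 4);
  const base64 = (base64Url + padding).replace(/-/g, "+").replace(/_/g, "/");
  return new Uint8Array(Buffer.from(base64, "base64"));
}
```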

What made this feel like a real discovery is how it changed my workflow. Before push notifications, Alfred was a dashboard I checked. After push notifications, Alfred is an assistant who taps me on the shoulder. The difference between pull and push is the difference between a tool and a colleague. When my phone buzzes with "Action: archive. Proposed archive for 'Your NIKE order has shipped', Master Jo," I smile every time. It feels like Alfred is actually there, running the household.

Further Implementations

Retrieval-Augmented Generation for Personal Knowledge

The next frontier I want to explore is giving Alfred deep knowledge of everything I have written. I publish articles, write tweets, draft technical documentation, and take notes across multiple platforms. Right now Alfred knows my emails, my calendar, and my finances, but he does not know my voice. If someone asks me to write a thread about Clean Architecture, I start from scratch every time. If I need to reference a point I made in an article six months ago, I have to search manually.

I plan to build a RAG pipeline that indexes my published content, tweets, notes, and drafts into a vector store. A good friend of mine (Edem Kumodzi) already does this; read his article here. When I ask Alfred to help me write something, he would retrieve relevant passages from my own prior work and use them as context for generation. The goal is not for Alfred to write as me, but to write with full awareness of what I have already said, how I say it, and what positions I have taken. He should be able to say: "Master Jo, you wrote about this exact topic in your March article. Shall I pull the relevant points as a starting foundation?"

This is a step toward something larger. I want Alfred to have a total embodiment of who I am — not a shallow personality clone, but a deep contextual understanding of my thinking, my writing style, my professional opinions, and my personal preferences. He should know that I care about Clean Architecture and SOLID principles, that I have strong opinions about over-engineering, and that I prefer concise explanations with concrete examples. At the same time, he should remain his own person: a distinct entity with his butler persona who assists me rather than pretending to be me. The line between "knows me well" and "impersonates me" is one I want to walk carefully.

Expanding Service Integrations

Alfred currently connects to Google Workspace, Microsoft 365, and Azure DevOps. I want to push further into the services that shape my daily life.

WhatsApp is where most of my personal communication happens. The ability to search messages, get summaries of group conversations I have missed, or draft replies through Alfred would close a major gap. The challenge is that WhatsApp's API is designed for businesses rather than personal use, so I will likely need to explore the WhatsApp Business API with creative workarounds.

LinkedIn is the integration I am most excited about. I got the idea from a podcast about the discipline of maintaining professional relationships, and it resonated because I am genuinely terrible at it. I connect with people at conferences, have great conversations, and then never follow up. Alfred could do something far more personal than LinkedIn's built-in "keep in touch" feature: track my connections, identify people I have not interacted with in a while, cross-reference them with my calendar and email history, and nudge me with context. Not just "you haven't talked to Sarah in 3 months" but "you haven't talked to Sarah in 3 months. You last discussed the migration project at her company. She posted about a promotion last week. Shall I draft a congratulations message, Master Jo?" That level of contextual nudging is what turns a contact list into actual relationships.

Spotify might seem like an odd fit for a workspace assistant, but I spend a significant amount of my commute and focus time listening to engineering podcasts. I want Alfred to suggest relevant episodes based on what I am currently working on. If I am deep in a week of building a notification system, Alfred could recommend episodes about push notification architecture, service workers, or PWA best practices. The Spotify API is well-documented with solid search and recommendation endpoints, so this should be one of the more straightforward integrations to build.

Smart Home Integration

I have been thinking about extending Alfred beyond the digital workspace and into my physical space. Apple Shortcuts provides a bridge between software and home devices. If I can trigger Shortcuts programmatically, Alfred could control lights, check device status, set scenes, and interact with HomeKit accessories through natural language.

The most entertaining use case involves Juliana, my robot vacuum. She runs on a schedule, but I never actually know if she has finished cleaning or got stuck under the couch again. If I can query her status through a Shortcut or her manufacturer's API, Alfred could include in my morning briefing: "Juliana completed her cleaning cycle at 3 AM, Master Jo. All rooms covered, no incidents to report." Or more usefully: "Juliana appears to be stuck in the bedroom. She has not moved in 40 minutes. Shall I send a rescue party?"

The broader vision is for Alfred to be aware of my home the same way he is aware of my inbox. When I ask "is everything in order?", he should be able to answer with a status report covering emails, calendar, pending approvals, financial alerts, and whether the house has been cleaned. A proper butler would never limit his awareness to just the mail.

A Second Persona

My girlfriend has watched me use Alfred. This sparked an idea I had not considered: cloning Alfred's architecture for a second persona. The entire system is built on Clean Architecture with dependency injection, which means the persona, the rules, and the connected accounts are all configurable. The core infrastructure covering polling, classification, the action lifecycle, push notifications, and chat strategies is entirely provider-agnostic and user-agnostic.

In theory, creating a second instance means standing up another agent server pointed at different OAuth credentials, a different SQLite database, a different set of action rules, and a different system prompt. The persona would not be Alfred. She would get her own character, her own name, and her own way of speaking. But underneath, the same ChatService, the same ToolRegistry, the same AgentLoop, and the same strategy pattern would power everything.

The part that interests me most is how the persona shapes the experience. Alfred's butler character is not just flavour text. It affects how he delivers bad news ("I regret to inform you, Master Jo, that your credit card statement shows a rather generous dining budget this month"), how he prioritises information, and how he handles ambiguity. A different persona for a different person would need to match their communication style and preferences entirely. This is where the buildSystemPrompt() architecture pays off. The base capabilities and mode-specific instructions stay constant, while the persona layer is a separate, swappable block. Building a second agent is less about rewriting code and more about crafting a new character who happens to run on the same engine.

Conclusion

Building Alfred started as a weekend experiment: a polling loop that checked Gmail and labelled anything that looked important. What it became, over months of iteration, is something I did not fully anticipate: a personal operating system that sits between me and the noise of digital life.

The biggest lesson was not technical. It was architectural. Clean Architecture is not just an academic exercise you draw on whiteboards. It is the reason I was able to bolt on Microsoft Teams notifications, bank statement processing, and a full chat interface without rewriting the core. When your domain layer knows nothing about Gmail, adding Outlook is just another adapter. When your use cases speak in ports, swapping Claude Haiku for Sonnet is a one-line change in the composition root. The upfront cost of drawing those boundaries paid for itself ten times over.

That said, the path was not smooth. The jump from intent extraction to native tool use humbled me. Prompt engineering is not engineering in the traditional sense. There is no compiler to catch your mistakes, no type system to lean on. You ship a prompt, watch it hallucinate a tool name that does not exist, and go back to the drawing board. The multi-round reasoning loop took more iterations than any other feature, not because the code was complex, but because coaxing an LLM into reliable, structured behaviour across multiple turns is genuinely hard. Every fix revealed a new edge case. Every edge case demanded a new constraint in the system prompt. I have a much deeper respect now for anyone building production agentic systems.

The discovery that surprised me most was how naturally financial data fit into the system. I built Alfred to manage emails. The fact that bank statements arrive as email attachments meant the entire PDF extraction and transaction classification pipeline was, architecturally, just another use case plugged into the same ports. The backfill system, the hybrid classifier, the per-bank parser registry: none of it required changes to the core domain. That is Clean Architecture doing exactly what it promises.

Running everything on a Mac on my desk with a Cloudflare Tunnel was a deliberate choice. There is no monthly cloud bill. There is no cold start. My data never leaves my network unless I am the one requesting it through an encrypted tunnel. For a personal assistant that reads your email, knows your calendar, and processes your bank statements, that is not a nice-to-have. It is a requirement.

Alfred is far from finished. RAG-powered memory, WhatsApp integration, smart home control: the roadmap is long. But the foundation is solid. Every new capability I have added has reinforced the same pattern: define a port, write the use case, build the adapter, wire it in the composition root. The system grows without becoming fragile because each piece knows only what it needs to know.

If there is one thing I would tell someone starting a similar project, it is this: invest in the boundaries early. Not the features, not the UI, not the clever LLM tricks. The boundaries. Get the dependency direction right. Make your domain layer boring. Let your infrastructure layer be the only place that knows about the outside world. Everything else follows from that discipline. Alfred taught me that the most powerful personal software is not the one with the most features. It is the one you can keep evolving without fear of breaking what already works.

See you in the next one 😁

The Problem with AI Tests That Don't Know Your App

2026-03-16 23:36:34

AI-generated Cypress tests are promising — but by default, the AI has never seen your app.
The interesting part isn't "look, the AI wrote a test." The interesting part is whether an AI grounded in your team's own Swagger spec, component docs, and bug history can cover things you would miss.
That's where RAG comes in. RAG (Retrieval-Augmented Generation) is the pattern of feeding your own documents to an AI at query time. Instead of a generic model guessing at your button labels and API routes, it works from the same source of truth your team already uses.
Pair that with cy.prompt() — Cypress's experimental AI-native test authoring command — and something interesting happens. The AI works with more precision. It can map to your endpoints. It may even surface flows you forgot to cover.
That said, it's not a silver bullet. The human still writes better assertions. The AI covers breadth, the human covers intent. And any context that never made it into your docs won't make it into your tests either.
If you've tried AI-generated tests for your app: how much did the AI actually know about it?

How I turned approved SQL into governed business KPIs

2026-03-16 23:36:32

In a lot of companies, executives and business teams want answers from company data, but they do not know SQL.

That part is obvious.

What is less obvious is that SQL is not the real problem.

The real problem is this:

How do you let non-technical users ask business questions about company data without exposing raw SQL, direct database access, or completely uncontrolled AI generated queries?

That was the problem I wanted to solve.

The naive solution looks attractive

The first idea is always the same:

Connect an AI assistant directly to the database and let people ask questions in natural language.

At first, this sounds great.

In practice, it creates a different set of problems:

• the business definition of a metric is not stable

• different prompts may produce different SQL for the same question

• there is no strong boundary between approved and unapproved logic

• scheduling, monitoring, and delivery workflows are still missing

• auditability becomes weak very quickly

• private environments become painful to manage

In other words, query generation is only one small part of the problem.

The harder part is making the answers reliable.

The pattern I ended up using

Instead of letting AI write arbitrary SQL for business users, I flipped the model.

The system starts from real SQL written and approved by analysts.

The flow looks like this:

  1. An analyst writes a real SQL query.
  2. They define only the minimal input parameters needed for the business question.
  3. That query becomes a governed KPI.
  4. The KPI can contain multiple query variants.
  5. Business users never see SQL.
  6. They only see KPI cards and ask follow-up questions in plain language.
  7. AI maps the question to the right KPI variant.
  8. The backend executes only approved query paths.
  9. The UI renders the result as a scalar, a short list, or a chart.
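A minimal sketch of what a governed KPI with approved variants might look like. The structure and names here are my own illustration, not DataPilot's actual schema:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class QueryVariant:
    name: str           # e.g. "vs_yesterday", "daily_trend"
    sql: str            # approved, parameterised SQL written by an analyst
    params: tuple = ()  # the minimal inputs the business question needs

@dataclass
class GovernedKPI:
    key: str
    title: str
    variants: dict = field(default_factory=dict)

    def register(self, variant: QueryVariant) -> None:
        self.variants[variant.name] = variant

    def resolve(self, variant_name: str) -> QueryVariant:
        # Only pre-approved execution paths are reachable; there is no
        # code path that accepts free-form SQL from a user or an LLM.
        if variant_name not in self.variants:
            raise KeyError(f"no approved variant '{variant_name}'")
        return self.variants[variant_name]

kpi = GovernedKPI(key="money_movement", title="Money movement")
kpi.register(QueryVariant("vs_yesterday", "SELECT ...", ("date",)))
kpi.register(QueryVariant("daily_trend", "SELECT ...", ("week",)))
print(kpi.resolve("daily_trend").name)  # daily_trend
```

The key property is that `resolve` can only ever return SQL an analyst has already approved; an unknown variant name is an error, not an invitation to generate something.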

That design changes everything.

The SQL remains controlled.

The business experience becomes flexible.

Why query variants matter

This was one of the most important parts of the design.

A single KPI often needs more than one query behind it.

For example, imagine a fintech KPI about money movement.

The same KPI may need:

• a default comparison variant for today versus yesterday

• a trend variant for a daily bar chart this week

• a breakdown variant for operational exceptions like refunds or failed payments

From the business user’s point of view, this still feels like one KPI.

From the backend point of view, it is a governed set of approved query variants.

That means the user can ask:

• How are we doing versus yesterday

• Show the daily trend this week

• Are refunds rising

But the system is not improvising SQL every time.

It is resolving the question to a predefined execution path.

That is the difference between flexibility and chaos.

What the AI actually does

This is the part I think many teams get wrong.

In my flow, AI does not generate arbitrary SQL against the database.

Its role is narrower and much more useful:

• interpret the user’s question

• map it to the correct KPI

• select the correct query variant

• resolve the right time context and parameters

• explain the result in business language

So the AI is acting as a language and intent layer, not as an unrestricted database operator.

That matters because it gives business users a natural interface without giving up control, auditability, or execution safety.
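The AI's whole job reduces to one function signature: question in, (KPI, variant) out. In production that mapping would be done by an LLM; a crude keyword matcher stands in for it in this sketch, and the variant names are hypothetical:

```python
def map_question(question: str) -> tuple[str, str]:
    """Resolve a business question to an approved (kpi, variant) pair.
    A real system would ask an LLM to do this classification."""
    q = question.lower()
    if "trend" in q or "this week" in q:
        return ("money_movement", "daily_trend")
    if "refund" in q or "failed" in q:
        return ("money_movement", "breakdown")
    return ("money_movement", "vs_yesterday")  # default comparison

print(map_question("Show the daily trend this week"))
# ('money_movement', 'daily_trend')
```

However the mapping is implemented, its output is only ever a key into the approved set, never raw SQL.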

Why this works better for business users

Business users do not want to think about joins, schemas, or prompt engineering.

They want answers like:

• How did onboarding perform last week

• Show daily wires and P2P transfers this week

• Are failed payments increasing

They also want charts, lists, and short explanations.

If the underlying SQL is already approved and versioned, you can give them that experience safely.

The UI becomes simple because the backend is strict.

That is a much better tradeoff than giving everyone direct AI-to-database access.

Execution still matters

Even with this model, execution is still the real backbone.

In my case, query execution, scheduling, and monitoring all follow the same deployment model.

They can run:

• in the cloud

• or on-prem through a dedicated installed agent

In general, on-prem is the preferable setup for sensitive environments, because the data never needs to be exposed outside the customer environment.

The platform orchestrates the workflow, but execution stays close to the database.

That turned out to be a very important distinction.

A lot of teams do not just need answers.

They need answers without opening up their data environment too much.

What this unlocked

This approach gave me a few things at the same time:

• business users can ask follow-up questions in plain language

• analysts still control business logic

• the results stay tied to approved SQL

• charts and tables stay consistent with the same KPI definition

• scheduling and monitoring remain part of the same operational system

• cloud and on-prem execution both fit naturally into the model

So instead of treating natural language as a replacement for data workflows, I ended up using it as an access layer on top of governed workflows.

That feels much more robust.

Final thought

I think a lot of teams are focusing on the wrong question.

The question is not:

Can AI generate SQL

The more important question is:

How much execution freedom should AI have around company data

For business-facing analytics, I have become convinced that natural language works best when the SQL underneath is already approved, versioned, and operationally controlled.

The hard part is not letting AI write SQL.

The hard part is making business answers reliable.

I’m building this approach in DataPilot, where approved SQL becomes governed business KPIs and business users can ask follow-up questions without touching raw SQL.

If you want to see the product context behind this model, it’s here:
https://getdatapilot.com/product/business-kpis

Understanding the JavaScript Window Object

2026-03-16 23:36:28

Understanding the JavaScript Window Object: The Browser’s Global Powerhouse

When developers start learning browser-side JavaScript, they usually interact with elements using document.getElementById() or manipulate HTML through the DOM. However, behind the scenes, there is a larger object controlling the entire browser environment — the Window Object.

The Window object acts as the top-level container of everything running in a browser tab. Understanding this object helps developers clearly distinguish between Browser APIs (BOM) and Document APIs (DOM).

Let’s explore this powerful object step by step.

What is the Window Object?

The Window object represents the browser window or tab where your JavaScript is running. It is the global object in the browser environment, meaning that everything defined globally automatically becomes a property of the window.

console.log(window);

When executed in a browser console, this prints a large object containing browser APIs such as:

  • document
  • location
  • history
  • navigator
  • localStorage
  • timers
  • dialog boxes

Think of the window object as the root controller of the browser environment.

Window
 ├── Document (DOM)
 ├── Location
 ├── History
 ├── Navigator
 ├── LocalStorage
 └── Browser APIs
![ ](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ira64xme0ayc356v1eym.png)

Global Scope and the this Keyword

In browser JavaScript, global variables declared with var and global function declarations automatically become properties of the window object. (Variables declared with let or const are still global, but they do not become window properties.)

Example

var language = "JavaScript";

function sayHello() {
    console.log("Hello Developer");
}

Behind the scenes, the browser interprets this as:

window.language = "JavaScript";
window.sayHello = function() {
    console.log("Hello Developer");
};

So these are equivalent:

console.log(language);
console.log(window.language);

Both return:

JavaScript

this at Global Level

At the global scope in browsers:

console.log(this === window);

Output:

true

This means that at the top level of a classic script, this refers to the window object. (Inside an ES module, top-level this is undefined instead.)

Key Properties of the Window Object

The Window object contains several important properties that provide access to browser capabilities.

1. window.document — Accessing the DOM

The document property refers to the DOM (Document Object Model) representing the HTML page.

console.log(window.document);

Example usage:

document.getElementById("title");

Even though we write document, internally it is:

window.document

The document object allows JavaScript to:

  • read HTML elements
  • modify content
  • attach event listeners
  • manipulate styles

2. window.location — URL Manipulation

The location object provides information about the current page URL.

console.log(window.location.href);

Example:

window.location.href = "https://google.com";

This redirects the browser to a new page.

Useful properties:

| Property | Description |
| -------- | ----------- |
| href | Full URL |
| hostname | Domain name |
| pathname | Page path |
| protocol | http / https |

Example:

console.log(location.hostname);

3. window.history — Browser Navigation

The history object allows navigation through the browser session history.

Example:

history.back();

Equivalent to clicking the back button.

Other methods:

history.forward();
history.go(-2);

Use cases include:

  • single-page applications
  • navigation control
  • custom routing systems

4. window.navigator — Browser Information

The navigator object provides information about the user’s browser and device.

Example:

console.log(navigator.userAgent);

It can reveal:

  • browser type
  • operating system
  • device type
  • language settings

Example:

console.log(navigator.language);

5. window.localStorage and sessionStorage

These APIs allow storing data inside the browser.

Local Storage

Data persists even after the browser closes.

localStorage.setItem("theme", "dark");

Retrieve data:

localStorage.getItem("theme");

Remove data:

localStorage.removeItem("theme");

Session Storage

Data persists only during the browser session.

sessionStorage.setItem("user", "Bhupesh");

When the tab closes, the data disappears.

Important Methods of the Window Object

The Window object also provides several utility methods.

1. Dialog Boxes

Alert

alert("Welcome to JavaScript");

Displays a message box.

Prompt

let name = prompt("Enter your name");

Allows user input.

Confirm

confirm("Are you sure?");

Returns:

true or false

2. Timers

Timers allow delayed or repeated execution.

setTimeout

Runs code once after a delay.

setTimeout(function() {
   console.log("Hello after 3 seconds");
}, 3000);

setInterval

Runs code repeatedly.

setInterval(function() {
   console.log("Running every second");
}, 1000);

3. Window Manipulation Methods

window.open()

Opens a new browser window.

window.open("https://openai.com");

window.close()

Closes the current window (if opened via script).

window.close();

window.scrollTo()

Scrolls to a specific position.

window.scrollTo(0, 500);

This scrolls the page 500px down.

Difference Between window (BOM) and document (DOM)

Many beginners confuse BOM and DOM, but they serve different roles.

| Feature | Window (BOM) | Document (DOM) |
| ------- | ------------ | -------------- |
| Represents | Browser window | HTML document |
| Purpose | Browser control | Page content manipulation |
| Example | location, history | getElementById |
| Level | Top-level object | Child of window |

Structure:

Window (BOM)
   └── Document (DOM)
         └── HTML Elements

Example relationship:

window.document.body

Best Practices

1. You Usually Don’t Need to Write window

Because the window object is global, writing it explicitly is optional.

Example:

alert("Hello");

Internally the browser reads this as:

window.alert("Hello");

2. Avoid Global Variables

Since global variables attach to window, excessive globals can pollute the environment.

Bad practice:

var user = "Bhupesh";

Better practice:

const app = {
   user: "Bhupesh"
};

3. Use Storage Carefully

Avoid storing sensitive data like:

  • passwords
  • authentication tokens

inside localStorage.

Final Thoughts

The Window object is the backbone of browser-based JavaScript. It provides access to:

  • the DOM (document)
  • browser navigation (history)
  • URL control (location)
  • client storage (localStorage)
  • timers and dialog boxes

By understanding the Window object, developers gain deeper insight into how JavaScript communicates with the browser environment.

In simple terms:

If JavaScript is the brain of a web page, the Window object is the entire operating system of the browser tab.

Mastering it will significantly improve your ability to build interactive, browser-aware applications.

Show DEV: I Built an Operating System for Claude Code

2026-03-16 23:35:49

I've been using Claude Code daily since it launched, and I kept running into the same problems: it forgets everything between sessions, makes the same mistakes twice, and has no structure for complex workflows.

So I built Claudify — a downloadable toolkit that turns Claude Code into a structured operating system.

What It Does

Claudify installs into your project directory and gives Claude Code:

  • 1,727 expert skills across 31 categories (SEO, debugging, deployment, testing, etc.)
  • 9 specialist agents with persistent memory that survives between sessions
  • 21 slash commands for common workflows (/commit, /review-pr, /audit, etc.)
  • 9 automated quality checks via pre/post hooks that catch errors before they ship
  • A self-improving knowledge base that learns from corrections and gets smarter over time

The Problem I Was Solving

Out of the box, Claude Code is powerful but stateless. Every session starts from zero. It doesn't know your project conventions, your preferred patterns, or what went wrong last time.

I wanted a system where Claude Code could:

  1. Remember project context, coding patterns, and past decisions
  2. Follow procedures consistently instead of improvising every time
  3. Catch its own mistakes through automated hooks and quality gates
  4. Route tasks to specialist agents (content, data, debugging) with the right domain knowledge

How It Works

One command installs everything:

npx claudify init

This drops a .claude/ directory into your project with:

  • CLAUDE.md — project instructions Claude reads automatically
  • agents/ — specialist subagents with their own memory files
  • skills/ — domain knowledge loaded on demand
  • commands/ — slash command definitions
  • settings.json — hook configurations for quality gates
  • memory.md — persistent context that survives between sessions

Claude Code reads CLAUDE.md on startup, which bootstraps the entire system. No IDE plugins, no cloud dependencies, no subscriptions.

What Makes It Different

Most AI coding tools focus on autocomplete or chat. Claudify focuses on operational structure — making Claude Code reliable enough to handle real workflows autonomously.

The key insight: Claude Code doesn't need more intelligence. It needs better memory, clearer procedures, and guardrails that prevent drift.

Tech Stack

  • Works with Claude Code, Cursor, Windsurf, and any tool that reads CLAUDE.md
  • Pure file-based — no servers, no APIs, no vendor lock-in
  • Skills are markdown files with frontmatter metadata
  • Hooks are shell scripts triggered by Claude Code events
  • Agents are markdown definitions with persistent memory files

Try It

The project is at claudify.tech. One-time purchase ($49 full / $19 skills-only pack), no subscription.

Happy to answer questions about the architecture, how the memory system works, or how the agent routing is structured. Would love feedback from other Claude Code users on what workflows you'd want automated.

Built with Claude Code, of course.

What's semantic caching?

2026-03-16 23:34:31

As more applications of generative AI appear, its shortcomings become more apparent. One huge problem with LLMs is how expensive each query is. Take Gemini, for example — Gemini 2.5 Pro charges $1.25 per million input tokens and $10 per million output tokens. Their flagship Gemini 3.1 Pro doubles that to $2 and $12 per million tokens respectively. Even a moderately active app can rack up thousands of dollars a month pretty quickly. Imagine a small customer support bot with just 500 daily users — by month two, the API bill has quietly crossed $2,000. That's not an edge case, that's just what happens when you're not caching. For a business (or a personal user), saving costs where possible and speeding up operations is a major factor in how well your product does. One way to speed things up and minimise costs is a simple 'semantic cache'.

What it is

A semantic cache is not too different from a traditional cache; the same idea sits behind it. A traditional cache stores recently or frequently used results (evicting old entries with policies like LRU, Least Recently Used, or LFU, Least Frequently Used) so that when the same query comes in again, it can return the stored result rather than compute it from scratch.

You cannot, however, apply the exact same pipeline to RAG or genAI products, because users rarely phrase the same question identically, so an exact-match lookup almost never fires. Take these examples:

What is the situation regarding AI in professional workplaces?

How are AI tools affecting workplaces?

Semantically these are close enough that we can tell they mean roughly the same thing, but a normal cache does not understand that. It treats them as two different queries because the strings are not exactly the same.

That's where semantic caching comes in. Rather than comparing the strings directly, it compares the semantic meaning behind them, recognises that they are close enough, and we get a cache hit! Similarity between two pieces of text is normally measured with cosine similarity over their embeddings.

How it works

This is a typical pipeline for RAG systems that use semantic caching.

First the documents are chunked and converted into embeddings (vectors), which you store in a vector db like Chroma, FAISS, or whatever suits your use case. When the user sends a query, we don't go straight to the db. Instead, we first check the semantic cache, which looks for a previously cached query that is similar enough.

Two things can happen from here:

Cache hit: The query is similar enough to a cached one (above the threshold) → cached context is pulled and handed to the LLM → response is generated. Fast and cheap, no db lookup needed.

Cache miss: Nothing similar in the cache → normal vector db retrieval happens → relevant chunks are fetched, response is generated, and the new query gets cached for next time. Normal speed, but the cache is now warmer.
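The hit/miss decision above can be sketched in a few lines. This toy assumes the embeddings are already unit-normalised (many embedding models return them that way), so cosine similarity reduces to a plain dot product; in a real system the vectors come from an embedding model, not hand-written lists:

```python
class SemanticCache:
    """Toy sketch of the lookup step for a semantic cache."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_result) pairs

    def lookup(self, emb):
        best_result, best_sim = None, -1.0
        for cached_emb, result in self.entries:
            # Dot product == cosine similarity for unit-normalised vectors.
            sim = sum(a * b for a, b in zip(cached_emb, emb))
            if sim > best_sim:
                best_result, best_sim = result, sim
        if best_sim >= self.threshold:
            return best_result  # cache hit: skip the vector db entirely
        return None             # cache miss: do normal retrieval, then store()

    def store(self, emb, result):
        self.entries.append((emb, result))

cache = SemanticCache(threshold=0.85)
cache.store([1.0, 0.0], "cached answer about AI and jobs")
print(cache.lookup([0.96, 0.28]))  # similarity 0.96 -> hit
print(cache.lookup([0.6, 0.8]))    # similarity 0.6  -> miss, returns None
```

On a miss, the caller would run the usual vector db retrieval, then call `store()` so the next similar query hits.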

Embeddings are compared using cosine similarity:

cosine(θ) = (A · B) / (||A|| × ||B||)

It's a very fast and simple way to measure the angle between two vectors. Similar vectors point in a similar direction: the angle between them is small, so the cosine of that angle is high. Mathematically the output ranges from -1 to 1, but for typical text embeddings scores fall between 0 and 1, where values near 0 mean unrelated and 1 means pointing in exactly the same direction.

For example:

  • "What is the impact of AI on jobs?" vs "How is AI changing employment?" → score of ~0.91 → cache hit
  • "What is the impact of AI on jobs?" vs "How do I bake sourdough bread?" → score of ~0.08 → cache miss

Those first two are clearly the same question in spirit, and the score reflects that.
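The formula itself is only a few lines of code. The toy 2-D vectors below are just for illustration; real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score near 1; orthogonal ones score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0, same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0, unrelated
```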

Why use it

  1. Significant cost savings. By reducing the queries sent to vector dbs, you cut down on a huge portion of charges incurred.
  2. Faster response time. If you already have the cached content, you don't need to retrieve it again. This allows the system to be a whole lot faster in production.
  3. Better use of resources. Since you aren't redoing similar queries, the system is free to do more tasks, allowing you to scale better or handle more complex features.

Compared to other approaches in RAG

| Approach | Handles Semantic Similarity | Cost Savings | Speed Boost | Setup Complexity | Works for Unique Queries | Best For |
| -------- | --------------------------- | ------------ | ----------- | ---------------- | ------------------------ | -------- |
| Traditional Cache | No (exact match only) | High (when hits) | Very High | Low | No | High-volume apps with repetitive, exact queries |
| Semantic Cache | Yes | High | High | Medium | No | Apps with overlapping but varied query patterns |
| Query Rewriting | Partially | Low | Low (adds a step) | Medium | Yes | Improving retrieval on ambiguous or poorly phrased queries |
| Re-ranking | No | Low | No (adds latency) | Medium | Yes | Boosting relevance when retrieval is decent but ordering is off |
| Hybrid Search | Partially | Low | Moderate | High | Yes | Complex domains needing both keyword and semantic retrieval |
| Chunking Optimisation | No | Moderate | Moderate | Low–Medium | Yes | Improving retrieval quality at the source |

As you can see, semantic caching isn't a silver bullet. It shines when there's a decent overlap in the kinds of queries your users send. For more diverse or unique query patterns, approaches like re-ranking or hybrid search may be better suited.

The cons

  1. More complex to build than a traditional cache system.
  2. Higher chances of getting semantically similar chunks that may not be relevant or useful for answering the query. Think of it like asking a librarian for "books about space travel" and getting recommendations cached from a previous "books about space exploration" query — close enough on the surface. But when you follow up with "books about the health risks of space travel", the cache might still serve those same exploration books because the queries look similar, even though what you actually need is quite different.
  3. Need to balance the threshold. Set it too high and you rarely get cache hits, so the cache barely helps; set it too low and you serve chunks that are not actually similar to the query. Both degrade system performance, so it is important to find the right balance.
  4. Empty cache is slow and has high latency.
  5. Not suitable when every user query is unique.

When not to use it

Semantic caching isn't always the right tool. Skip it if:

  • Every query your users send is unique. Think code generation, legal research, or anything highly personalised — the cache will almost never hit and you're just adding overhead.
  • Your app is low traffic. If you're getting a handful of queries a day, there's no real benefit.
  • Your knowledge base changes constantly. If documents are being updated all the time, you'll spend more time invalidating the cache than benefiting from it.
  • Accuracy is non-negotiable. Cached context can be slightly off. For use cases where being slightly wrong is worse than being slow, don't cache.

How to best utilise it

  1. Calibrate your threshold carefully. A good starting point is somewhere between 0.85–0.90. From there, tune it based on your specific use case and monitor quality. There's no universal right answer here.
  2. Use TTL (Time To Live) values. Cached entries should expire, especially when your underlying data changes or when topics are time-sensitive. Stale cache is worse than no cache.
  3. Warm up your cache. Pre-populate it with common or anticipated queries so you're not starting completely cold in production. A cold cache gives you none of the benefits.
  4. Invalidate when your knowledge base updates. If the documents in your vector db change, cached responses based on old chunks can quietly degrade your output quality without you noticing.
  5. Monitor your hit rate. A healthy semantic cache typically sees somewhere around 30–60% hit rates. A rate that is too low suggests your threshold is too strict; a suspiciously high rate paired with dropping answer quality means it is too loose.
  6. Think about scope — global vs user-level caching. A global cache saves the most but can serve mismatched cached results across very different user contexts. For personalised applications, a user-scoped cache might make more sense even if it's less efficient.
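Practice 2 above is easy to sketch: every cached entry carries an expiry time and is treated as a miss once it lapses. A tiny TTL is used here only so the example runs quickly; a real cache would use minutes or hours:

```python
import time

class CacheEntry:
    """A cached result that expires after ttl_seconds."""

    def __init__(self, result, ttl_seconds=3600.0):
        self.result = result
        self.expires_at = time.monotonic() + ttl_seconds

    def is_fresh(self):
        # Served only while unexpired; a stale entry counts as a miss,
        # so a fresh retrieval quietly replaces it.
        return time.monotonic() < self.expires_at

entry = CacheEntry("cached answer", ttl_seconds=0.05)
print(entry.is_fresh())  # True right after creation
time.sleep(0.1)
print(entry.is_fresh())  # False once the TTL has elapsed
```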

Tools that already do this

You don't have to build it from scratch. A few libraries have semantic caching built in or easily pluggable:

  • GPTCache — an open source library built specifically for caching LLM responses. Pretty flexible and worth looking at if you're rolling your own pipeline.
  • LangChain — has caching layers that plug into existing chains without too much effort. Good starting point if you're already using it.
  • Redis — with vector similarity extensions, Redis can act as a fast semantic cache layer, especially if you're already using it in your stack.

Worth knowing these exist before you reinvent the wheel.