
File 02: Automation Bias

2025-11-20 17:38:31

When Trust Becomes Blind Faith (Introduction)

Your CI/CD pipeline has been green for weeks. The automated tests pass, the linter is happy, deployment scripts run smoothly. Life is good. Until one day, production breaks, and you realize the automation you trusted completely missed a critical bug.

That's Automation Bias: the dangerous tendency to trust automated systems more than they deserve, even when they're wrong.

In psychology, automation bias describes our tendency to favor suggestions from automated decision-making systems and to ignore contradictory information made without automation. It's not about laziness; it's about how our brains handle authority, even when that authority is a piece of software.

This bias becomes most dangerous precisely when systems work well. The more reliable a system appears, the more we trust it blindly. We stop questioning, we stop verifying, we stop thinking.

The Silent Failure (The problem)

In software development, automation bias shows up in every corner of our workflow.

  • A team's automated testing pipeline has been running successfully for months. When a critical bug slips through to production, the team initially assumes the automated tests must be correct and looks for other causes, wasting hours before realizing the tests had produced a false negative: they passed despite the bug.

  • A security team receives hundreds of automated alerts daily. After months of mostly false positives, they begin to ignore alerts or dismiss them without investigation. When a real security breach occurs, the automated system correctly flags it, but the team dismisses it as another false positive.

  • Developers rely heavily on automated code analysis tools to catch bugs and style issues. They become complacent, assuming the tools will catch everything. When the tools miss a subtle logic error that causes a production outage, the team realizes they've stopped doing thorough manual code reviews.

  • A developer uses an AI coding assistant to generate code. The AI produces syntactically correct code that looks good, so the developer accepts it without thorough review. They trust the AI's output, assuming it understands the full context and edge cases. The code works in most scenarios but fails silently in an edge case, causing a data integrity issue in production.

The bias warps our judgment. We attribute more credibility to systems than to our own expertise. We reduce cognitive load by trusting automation, but in doing so, we become lazy thinkers.

When this happens:

  • Critical bugs slip through automated checks.
  • Security threats get ignored due to alert fatigue.
  • Human expertise degrades from lack of use.
  • Teams lose the ability to question automated outputs.

Ironically, the better our automation works, the worse we become at noticing when it doesn't.

Keeping Humans in the Loop (Mitigation)

Escaping automation bias begins with remembering that automation should enhance, not replace, human judgment. You can't eliminate automation, but you can design processes that keep you engaged.

Some practical ways:

  • Always verify critical automated decisions manually. Don't let automation make you passive in monitoring systems.

  • Question automated outputs regularly. Ask "What could the system be missing?" Make it a habit, not an exception.

  • Cross-check with multiple sources. Use different tools or methods to validate results. Don't rely on a single automated system for mission-critical decisions.

  • Maintain manual skills alongside automated tools. Ensure team members stay trained on manual processes even when automation is available.

  • Design human-in-the-loop processes. Require human confirmation for critical decisions. Make automated decision-making processes explainable. (A minimal sketch of such a gate follows this list.)

  • Implement intelligent alerting that reduces noise. Manage alert fatigue before it makes you ignore real threats.
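
To make this concrete, here is a minimal, hypothetical sketch of a human-in-the-loop deployment gate in Python. The check names and the confirmation flow are invented for illustration; the point is simply that automation advises while a person decides:

import sys

def deploy_gate(check_results: dict[str, bool], critical: bool) -> bool:
    """Return True only if checks pass and, for critical deploys, a human signs off."""
    failed = [name for name, ok in check_results.items() if not ok]
    if failed:
        print(f"Automated checks failed: {failed}")
        return False
    if critical:
        # Automation says "go", but a person still has to look and agree.
        answer = input("All checks green. Type 'deploy' to confirm: ")
        return answer.strip().lower() == "deploy"
    return True

if __name__ == "__main__":
    checks = {"tests": True, "lint": True, "security_scan": True}  # hypothetical check results
    sys.exit(0 if deploy_gate(checks, critical=True) else 1)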

The goal is not to eliminate automation—that would be counterproductive. The goal is to keep humans engaged, skilled, and questioning. Automation should be a tool, not a crutch. Systems should enhance judgment, not replace it.

Debugging the human mind, one bias at a time.

Ethereum’s Trustless Manifesto, MetaMask Multichain Accounts, RIP-7560 Explained, x402 Protocol

2025-11-20 17:33:43

Welcome to our weekly digest! Here, we discuss the latest trends and advancements in account abstraction, chain abstraction and everything related, as well as bring some insights from Etherspot’s kitchen.

The latest news we'll cover:

  • Ethereum and Vitalik Publish “Trustless Manifesto”
  • Etherspot Explains: RIP-7560 - Educational Piece
  • MetaMask Launches Multichain Accounts
  • Account Abstraction Dilemma and x402 Protocol’s Breakthrough

Please fasten your seatbelts!

Ethereum and Vitalik Publish “Trustless Manifesto”

The Ethereum Foundation, in collaboration with Vitalik Buterin and the Account Abstraction team, has published a document titled the “Trustless Manifesto,” which has been permanently deployed on-chain.

The manifesto articulates core values of decentralisation, self-custody, verifiability and resistance to convenience-engineered centralisation. Its deployment as a smart contract, with no administrator or owner, signals a commitment to trust-minimised architecture where users can pledge adherence by calling a pledge() function.

The Account Abstraction team emphasises that the document is not merely symbolic: it reaffirms the idea that verification replaces blind trust, and every protocol design decision must consider whether it introduces unnecessary intermediaries.

The manifesto lays philosophical groundwork for account and chain abstraction builders, insisting that UX innovations not compromise permissionless access or verifiability. As wallet and multi-chain flows increase, this declaration sets a standard. If flows rely on opaque relayers or centralised bridges, they compromise Ethereum’s foundational trust model.


Etherspot Explains: RIP-7560 — Educational Piece

Etherspot published an explainer on X outlining how RIP-7560 advances Account Abstraction by shifting key mechanisms from smart-contract infrastructure into the protocol layer of rollups.

The post begins by revisiting ERC-4337, noting that it introduced UserOperations, bundlers, the shared EntryPoint contract, and optional paymasters to make wallets programmable without modifying Ethereum’s base protocol. These components enabled features such as batched actions, custom validation logic, recovery mechanisms, and the ability to transact without holding ETH.

The post explains that RIP-7560 builds on these foundations by proposing native transaction types handled directly by rollups rather than through an external EntryPoint contract and off-chain bundlers. This change moves validation, execution, and fee logic into the rollup protocol itself, reducing the number of moving parts and lowering overhead for AA workflows.

The native processing model allows smart wallets to operate more efficiently and enables rollups to standardize AA behavior across their ecosystems. This design reduces reliance on contract-based routing, potentially improving performance while maintaining compatibility with existing tooling.

The post emphasizes that RIP-7560 remains fully backward-compatible with ERC-4337. Wallets, paymasters, and bundlers built on top of 4337 can continue functioning as before, while rollups that adopt RIP-7560 will process UserOperations natively rather than through contract execution. Etherspot frames this as a meaningful evolution: ERC-4337 demonstrated that Account Abstraction works in practice, and RIP-7560 aims to establish AA as built-in infrastructure for rollups.

Follow Etherspot on X for more EIP/ERC/RIP explainers!

MetaMask Launches Multichain Accounts

The MetaMask team announced the official launch of Multichain Accounts, marking a major shift in how wallet accounts are structured.

Under the new system, a single “Multichain Account” can hold parallel key sets across multiple networks: EVM chains like Ethereum, non-EVM chains like Solana and, soon, Bitcoin, all derived from a single seed phrase.

The blog post explains that as users increasingly transact across diverse chains, it has become untenable to manage separate addresses for each network. Multichain Accounts aim to simplify this by re-architecting the wallet “account” layer: it now groups addresses rather than creating new ones for each chain.

For builders of account abstraction and chain abstraction flows, this update matters: it reduces user friction when switching chains or onboarding across multiple networks. By lowering network-management complexity, wallets like MetaMask can provide a smoother surface for AA-powered features: sponsored transactions, cross-chain intents, unified balances, while still leveraging multichain architecture.

Account Abstraction Dilemma and x402 Protocol’s Breakthrough

Odaily published an analysis examining the limitations of Ethereum’s AA and the emergence of the x402 protocol as a more practical standard for cross-chain payments. The article highlights criticisms that AA, despite years of investment in ERC-4337, Paymasters, and wallet infrastructure, has been “all talk and no action” and overly dependent on an EVM-only model.

The author notes that Paymasters shift gas costs to project teams, but the “motivation to burn money on payment is very weak,” making it difficult to maintain sustainable ROI.

The piece explains that AA depends on smart contracts, on-chain state, and EVM execution, which limits its reach beyond Ethereum-compatible environments. Attempting to extend AA to ecosystems such as Solana or Bitcoin requires additional middleware layers, increasing cost and complexity. This contributes to what the author describes as AA becoming “technology for technology’s sake,” a product of Ethereum’s earlier research-driven culture.

In contrast, the article describes the x402 protocol as relying on the longstanding HTTP 402 status code, allowing it to work across Web2 APIs, Web3 RPCs, and traditional payment gateways using only an HTTP request header. This design makes x402 a naturally cross-chain solution, where facilitators can interact with multiple chains, index user payment history uniformly, and enable developers to integrate once to serve the entire ecosystem.

The analysis argues that x402 offers a unified upstream protocol layer, reducing compatibility burdens at the application layer. Within this framework, ERC-8004 becomes an optional trust layer rather than a universal standard. By positioning ERC-8004 as “plug and play” inside the x402 ecosystem, the model avoids the top-down adoption challenges that AA faced.

We’d like to add here that account abstraction is much more than a payment request mechanism. It’s a foundation for programmable wallets, permissions, batching, and secure automation. Off-chain protocols like x402 can play a role in coordination, but they don’t replace the on-chain execution layer that AA provides.

Start exploring Account Abstraction with Etherspot!

  • Learn more about account abstraction here.
  • Head to our docs and read all about Etherspot Modular SDK.
  • Skandha — a developer-friendly TypeScript ERC-4337 Bundler.
  • Arka — an open-source Paymaster Service for gasless & sponsored transactions.
  • Explore our TransactionKit, a React library for fast & simple Web3 development.
  • Follow us on X (Twitter) and join our Discord.

❓Is your dApp ready for Account Abstraction? Check it out here: https://eip1271.io/

Follow us

Etherspot Website | X | Discord | Telegram | Github | Developer Portal

Statistics Day 6: Your First Data Science Superpower: Feature Selection with Correlation & Variance

2025-11-20 17:30:00

Feature selection is one of the most important steps before building any machine learning model.

And one of the simplest tools to do this is correlation.

But correlation alone doesn’t tell the whole story.
To use it correctly, you also need to understand variance, standard deviation, and a few other related statistical terms.

This blog breaks everything down in the simplest way possible — no heavy maths, just practical understanding.

1. What Is Correlation?

Correlation tells us how two numerical features move together.

  • If they grow together → positive correlation
  • If one grows while the other falls → negative correlation
  • If they don’t move in any clear pattern → zero correlation

Correlation ranges from –1 to +1:

  • +1 → perfectly move together
  • –1 → perfectly opposite
  • 0 → no relationship

In feature selection, correlation helps you answer:

“Which features are actually related to the target?”
“Which features are repeating the same information?”

2. How Do We Use Correlation for Feature Selection?

A. Select Features That Are Correlated With the Target

If you're predicting house price, and size_in_sqft has high correlation with price, that feature is useful.

Example:

| Feature | Correlation with Price |
| --- | --- |
| Size (sqft) | 0.82 |
| No. of rooms | 0.65 |
| Age of house | –0.20 |
| Zip code | 0.05 |

High correlation → strong predictive power.

Correlation Heatmap
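
To try this yourself, here's a small pandas sketch. The column names and numbers are made up to mirror the table above:

import pandas as pd

# Toy housing data -- replace with your own DataFrame
df = pd.DataFrame({
    "size_sqft": [800, 1200, 1500, 2000, 2500],
    "rooms":     [2, 3, 3, 4, 5],
    "age":       [30, 5, 22, 12, 7],
    "price":     [100, 160, 190, 260, 330],
})

# Pearson correlation of every feature with the target
print(df.corr()["price"].sort_values(ascending=False))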

B. Remove Features That Are Highly Correlated With Each Other

When two features are too similar, they cause multicollinearity, which confuses models (especially regression).

Example:

  • height and total_floors → correlation 0.95
  • They’re giving the same information.
  • You keep only one.

This makes your model:

  • simpler
  • faster
  • less noisy
  • more stable
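
Continuing the toy DataFrame from above, one common (if blunt) way to drop one feature from each highly correlated pair:

import numpy as np

# Keep only the upper triangle so each pair is inspected once
corr = df.drop(columns="price").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair with correlation above 0.9
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("Dropping:", to_drop)
df_reduced = df.drop(columns=to_drop)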

C. The Big Warning: Correlation Only Catches Linear Relationships

If a feature has a non-linear relationship with the target, correlation may say “0”, even when the feature is useful.

Example:
Predicting salary based on experience — relationship grows but flattens → non-linear curve.

Low correlation does not mean useless feature.

High vs Low Correlation

Best practice:
Include the feature anyway and check feature importance using:

  • Random Forest
  • XGBoost
  • SHAP values
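
For example, a quick Random Forest importance check on the same toy data. Tree-based importances pick up non-linear signal that plain correlation misses:

from sklearn.ensemble import RandomForestRegressor

X, y = df_reduced.drop(columns="price"), df_reduced["price"]
model = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)

# Importances sum to 1; higher means the feature mattered more to the trees
for name, score in sorted(zip(X.columns, model.feature_importances_),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")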

3. Variance — How Spread Out the Data Is

Variance tells you how much the values are spread from the average.

  • Low variance → values are almost the same
  • High variance → wide variety of values

Example:

| Values | Variance |
| --- | --- |
| 50, 50, 50, 50 | Very low |
| 10, 80, 120, 200 | Very high |

In feature selection:

Features with extremely low variance (almost constant features) should be removed.

Variance graph

Example:

  • A column with 99% “No” and 1% “Yes”
  • Gives almost no information

This is called low-variance filtering.
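
scikit-learn ships a ready-made filter for exactly this. A minimal sketch (the 0.05 threshold here is an arbitrary example -- tune it to your data):

from sklearn.feature_selection import VarianceThreshold
import pandas as pd

# A near-constant column: 99% zeros ("No") and a single one ("Yes")
X = pd.DataFrame({
    "almost_constant": [0] * 99 + [1],
    "useful": list(range(100)),
})

selector = VarianceThreshold(threshold=0.05)  # drop columns with variance below 0.05
selector.fit(X)
print("Kept columns:", X.columns[selector.get_support()].tolist())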

4. Standard Deviation — The More Interpretable Version of Variance

Standard deviation (SD) is the square root of variance.

Why do we use SD?

Because SD is in the same units as the data, so it’s easier to interpret.

Example:

  • Variance = 2500
  • SD = 50 → “On average, values are 50 units away from the mean.”

In data science:

  • High SD → more spread
  • Low SD → less spread

SD is important in:

  • normal distribution
  • Z-score normalization
  • outlier detection

5. Practical Use Cases in Real Data Science

A. Feature Engineering

  • Remove highly correlated features
  • Keep features that correlate with the target
  • Remove low-variance features
  • Treat outliers using SD

B. Model Stability (Regression Models)

High correlation among features (multicollinearity):

  • inflates coefficients
  • makes the model unstable
  • reduces interpretability

Solution:

  • Correlation matrix
  • Variance Inflation Factor (VIF)
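
If you want actual numbers, statsmodels can compute VIF directly. A sketch reusing the toy df_reduced from earlier (a common rule of thumb: VIF above 5-10 signals trouble):

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add an intercept column, as VIF expects a full design matrix
X = sm.add_constant(df_reduced.drop(columns="price"))
for i, col in enumerate(X.columns):
    if col != "const":
        print(f"{col}: VIF = {variance_inflation_factor(X.values, i):.2f}")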

C. Detecting Outliers

Using SD:

  • Any value > 3 SD from the mean is often considered an outlier. This helps clean the dataset before modeling.
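
A tiny NumPy sketch of the 3-SD rule (the numbers are made up):

import numpy as np

values = np.array([50] * 20 + [120])  # one suspicious reading
z = (values - values.mean()) / values.std()

# Flag anything more than 3 standard deviations from the mean
print("Outliers:", values[np.abs(z) > 3])  # -> [120]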

D. Normalization

Z-score = (value – mean) ÷ SD
Used heavily in:

  • KNN
  • SVM
  • Gradient descent-based models

Because these models depend on distance, standardization is essential.
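
In code, scikit-learn's StandardScaler applies exactly this formula per column:

from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[50.0], [60.0], [70.0], [80.0]])
X_scaled = StandardScaler().fit_transform(X)  # (value - mean) / SD
print(X_scaled.ravel())  # now mean 0, SD 1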

6. Quick Summary Table

| Concept | Meaning | Why It Matters for Feature Selection |
| --- | --- | --- |
| Correlation | How two features move together | Helps identify useful or redundant features |
| Variance | How spread out the data is | Remove near-constant features |
| Standard Deviation | Average spread from the mean | Used in scaling and outlier detection |
| High feature-to-target correlation | Strong predictor | Keep it |
| High feature-to-feature correlation | Redundant | Remove one |
| Low correlation | Not always useless | Check with ML model importance |

7. Final Takeaways

  • Use correlation to pick predictive features.
  • Remove features that are too similar to each other.
  • Use variance and standard deviation to spot boring or noisy features.
  • Always validate with ML models because correlation misses non-linear relationships.

Feature selection is not just theory — it’s one of the most practical skills in data science.

If you understand correlation, variance, and SD, you're already ahead.

Connect on Linkedin: https://www.linkedin.com/in/chanchalsingh22/
Connect on YouTube: https://www.youtube.com/@Brains_Behind_Bots

I love breaking down complex topics into simple, easy-to-understand explanations so everyone can follow along. If you're into learning AI in a beginner-friendly way, make sure to follow for more!

WordPress to Hugo: Lightning Fast Sites in 2025

2025-11-20 17:30:00

In a previous post, I shared how I transformed my old laptop into a home server. That experimental setup became the perfect testing ground for something I'd been curious about — migrating from WordPress to Hugo. What started as a weekend project has turned into this comprehensive guide.

Why Leave WordPress?

As someone obsessed with page load speeds, WordPress has always been a mixed bag. Sure, it's powerful, but between PHP generating pages on-the-fly, database queries, and plugin overhead, achieving true speed is a constant battle.

Every tenth of a second matters for SEO and UX. Studies show 50% of visitors abandon slow pages. While I'd optimized my WordPress site with OpenLiteSpeed and premium hosting, I knew there was untapped potential.

Hugo's advantages:
⚡ Static files serve faster than dynamic PHP
🔒 No database or plugins to patch constantly
✍️ Write in Markdown, deploy anywhere
🚀 Minimal maintenance overhead

The biggest concern? Migration horror stories — lost images, broken links, frustrated bloggers giving up halfway. But with the right approach, these pitfalls are avoidable.

Understanding Hugo's Core Concepts

Hugo is a static site generator that transforms Markdown into HTML. Unlike WordPress (dynamic generation on each visit), Hugo pre-builds everything once.

Key components:

Content directory: Articles and pages in Markdown
Layouts/Themes: Templates for appearance
Static directory: Images, CSS, JS served as-is

Each Markdown file starts with "front matter" — metadata similar to WordPress custom fields but cleaner:

---
title: "My Article"
date: 2025-01-20
tags: ["webdev", "performance"]
---

Installation and Setup

Install Hugo Extended

The "Extended" version is essential for modern themes with Sass support:

cd /usr/local/bin
wget https://github.com/gohugoio/hugo/releases/download/v0.157.0/hugo_extended_0.157.0_Linux-64bit.tar.gz
tar -xzf hugo_extended_0.157.0_Linux-64bit.tar.gz hugo
rm hugo_extended_0.157.0_Linux-64bit.tar.gz
hugo version

Create Your Site

cd ~/projects
hugo new site my-hugo-site
cd my-hugo-site

Add Hextra Theme

Modern, fast, and clean:

git init
git submodule add https://github.com/imfing/hextra themes/hextra
cp themes/hextra/hugo.toml ./hugo.toml

Configure hugo.toml:

baseURL = "https://your-domain.com/"
title = "Your Site Title"
theme = "hextra"

Test Locally

hugo server -D --bind=0.0.0.0

Visit http://localhost:1313

Migration Process

Export from WordPress

Dashboard → Tools → Export → All Content

This generates a WXR file with all your content. Transfer it to your Hugo server.

Convert to Markdown

Rather than relying on unreliable automated tools, here's a custom Python script that uses Pandoc.
First create convert_wp_to_hugo.py:

#!/usr/bin/env python3
import os, subprocess, re
from lxml import etree
from datetime import datetime

# Install dependencies first:
# apt install -y python3 python3-venv pandoc
# python3 -m venv venv && source venv/bin/activate
# pip install lxml pyyaml

INPUT_XML = "/path/to/wordpress-export.xml"
OUTPUT_DIR = "content/posts"
os.makedirs(OUTPUT_DIR, exist_ok=True)

def to_slug(s):
    return "".join(c.lower() if c.isalnum() or c in "-" else "-" for c in s).strip("-")

def yaml_safe(s: str) -> str:
    # Quote values containing YAML special characters or leading/trailing whitespace
    if re.search(r"[:#\-\?\[\]\{\},&\*!\|>'\"%@`]|^\s|\s$", s) or " " in s:
        escaped = s.replace('"', '\\"')
        return f'"{escaped}"'
    return s

root = etree.parse(INPUT_XML, parser=etree.XMLParser(encoding="utf-8"))

for item in root.xpath("//item"):
    status = item.findtext("{http://wordpress.org/export/1.2/}status") or ""
    post_type = item.findtext("{http://wordpress.org/export/1.2/}post_type") or "post"

    if status != "publish" or post_type not in {"post", "page"}:
        continue

    title = (item.findtext("title") or "Untitled").strip()
    slug = item.findtext("{http://wordpress.org/export/1.2/}post_name") or to_slug(title)
    date = item.findtext("{http://wordpress.org/export/1.2/}post_date") or datetime.utcnow().isoformat()
    html = item.findtext("{http://purl.org/rss/1.0/modules/content/}encoded") or ""

    # Convert HTML to Markdown via Pandoc
    p = subprocess.run(
        ["pandoc", "-f", "html", "-t", "gfm-smart", "--wrap=none"],
        input=html.encode("utf-8"), capture_output=True
    )
    md = p.stdout.decode("utf-8").strip()

    # Extract categories and tags
    categories = [cat.text.strip() for cat in item.findall("category") 
                 if cat.get("domain") == "category" and cat.text]
    tags = [cat.text.strip() for cat in item.findall("category") 
           if cat.get("domain") == "post_tag" and cat.text]

    # Build front matter (escape backslashes and quotes in the title)
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    fm = ["---", f'title: "{safe_title}"',
          f"slug: {slug}", f"date: {date}", "draft: false"]

    if categories:
        fm.append("categories:")
        for c in categories:
            fm.append(f"  - {yaml_safe(c)}")

    if tags:
        fm.append("tags:")
        for t in tags:
            fm.append(f"  - {yaml_safe(t)}")

    fm.extend(["---", ""])

    out_name = f"{date[:10]}-{slug}.md" if post_type == "post" else f"{slug}.md"

    with open(os.path.join(OUTPUT_DIR, out_name), "w", encoding="utf-8") as f:
        f.write("\n".join(fm) + "\n" + md + "\n")

print(f"✓ Conversion complete → {OUTPUT_DIR}")

Run it:

python3 convert_wp_to_hugo.py

Image Optimization

Serving modern image formats (AVIF/WebP) dramatically improves performance.

Image Conversion Script

Here is the code:

#!/usr/bin/env python3
import os, re, requests
from pathlib import Path
from PIL import Image
from io import BytesIO
import pillow_avif  # noqa: F401 -- import registers the AVIF encoder with Pillow

# pip install pillow pillow-avif-plugin requests

MD_DIR = Path("content")
OUT_DIR = Path("static/images")
OUT_DIR.mkdir(parents=True, exist_ok=True)

def save_formats(img_bytes, base_path: Path):
    """Save image as AVIF, WebP and JPEG side by side"""
    try:
        # Convert to RGB so images with alpha channels survive the JPEG save
        img = Image.open(BytesIO(img_bytes)).convert("RGB")
        img.save(f"{base_path}.avif", "AVIF", quality=80)
        img.save(f"{base_path}.webp", "WEBP", quality=80)
        img.save(f"{base_path}.jpg", "JPEG", quality=85)
        print(f"✓ {base_path.name}")
    except Exception as e:
        print(f"✗ Error: {e}")

# Find all image URLs in Markdown
img_pattern = re.compile(r'!\[([^\]]*)\]\((https?://[^)]+)\)')

for md_file in MD_DIR.rglob("*.md"):
    content = md_file.read_text(encoding="utf-8")
    urls = {url for _, url in img_pattern.findall(content) if url.startswith("http")}

    for url in urls:
        try:
            filename = url.split("?")[0].split("/")[-1]
            base_path = OUT_DIR / Path(filename).stem

            r = requests.get(url, timeout=30)
            r.raise_for_status()
            save_formats(r.content, base_path)
        except Exception as e:
            print(f"Failed {url}: {e}")

print(f"\n✓ All images processed → {OUT_DIR}")

Hugo Image Shortcode

Create layouts/shortcodes/img.html:

{{ $src := .Get "src" }}
{{ $alt := .Get "alt" | default "" }}
<picture>
  <source type="image/avif" srcset="{{ $src }}.avif" />
  <source type="image/webp" srcset="{{ $src }}.webp" />
  <img src="{{ $src }}.jpg" alt="{{ $alt }}" loading="lazy" />
</picture>

Use in Markdown:

{{< img src="/images/my-screenshot" alt="Description" >}}

Update Markdown Files

Replace standard image syntax with shortcode:

#!/usr/bin/env python3
import os, re

pattern = re.compile(r'!\[([^\]]*)\]\((https?://[^)]+)\)')

for root, _, files in os.walk("content"):
    for filename in files:
        if not filename.endswith('.md'):
            continue

        filepath = os.path.join(root, filename)
        with open(filepath, 'r', encoding='utf-8') as f:
            content = f.read()

        def replacement(match):
            alt_text, url = match.groups()
            image_name = os.path.splitext(os.path.basename(url))[0]
            return f'{{{{< img src="/images/{image_name}" alt="{alt_text}" >}}}}'

        new_content = pattern.sub(replacement, content)

        if new_content != content:
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(new_content)
            print(f"✓ Updated: {filepath}")

Deployment

Build and deploy your static site:

hugo  # Generates site in public/
cp -r public/* /var/www/html/

Your site is now pure HTML — no PHP, no database queries, just blazing-fast static content.

Pre-Migration Checklist

Backup everything: Full WordPress files + database export. Test restore separately.

Plan URL structure: Set up redirects from old WordPress URLs to Hugo structure for SEO.
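
Hugo's built-in aliases front-matter key covers most of this: each alias makes Hugo emit a small redirect page at the old URL. A hypothetical addition to the conversion script above (adjust the pattern to your actual WordPress permalink structure):

# Inside the front-matter builder of convert_wp_to_hugo.py:
# emit a redirect from the old /YYYY/MM/slug/ permalink
year, month = date[:4], date[5:7]
fm.append("aliases:")
fm.append(f"  - /{year}/{month}/{slug}/")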

Test thoroughly: Multiple browsers and devices, especially mobile performance.

Lost features: WordPress search, comments, contact forms need alternatives (Algolia, Disqus, Netlify Forms).

Security: Even static sites need proper SSL and security headers.

Performance Optimizations

For maximum speed:

  • Use Hugo's partialCached for expensive operations
  • Implement Hugo Pipes for asset bundling
  • Consider Page Bundles for better organization
  • Set up CI/CD for automated deployments

Common Issues

"try not defined" errors: Hugo version too old. Update to latest Extended version.

404 on production: Build with hugo and copy public/ folder to web server.

Can't access from network: Add --bind=0.0.0.0 and check firewall settings.

Results

After migrating to Hugo, my site loads in under 200ms — a dramatic improvement from WordPress. Page speeds that once required extensive optimization now come naturally.

The maintenance burden has practically disappeared. No more plugin updates, security patches, or database optimization. Just write in Markdown and deploy.

Conclusion

After completing this migration and running extensive performance tests (detailed in my previous article "WordPress vs Hugo: When Reality Challenges the Speed Myths"), I'll be honest: Hugo didn't convince me as much as I expected.

The performance gains? On paper, Hugo should dominate. In practice, with a properly optimized WordPress setup (OpenLiteSpeed + LS Cache), the difference was only 32ms for content-heavy pages. Hugo's advantage becomes clear at massive scale, but for most sites, a well-configured WordPress can match it.

What I learned:

  • Static isn't automatically faster - my Hugo pages had significant HTML parsing overhead
  • WordPress with good caching essentially becomes a static site generator
  • The real Hugo advantage is simplicity and predictability, not raw speed
  • You'll miss WordPress's admin interface more than you think

Was it worth it? As an experiment, absolutely. I learned valuable lessons about web performance, infrastructure, and the real trade-offs between platforms. The migration process itself was enriching.

Should you migrate? Only if you value simplicity over flexibility, are comfortable with Markdown workflows, and don't need dynamic features. A properly optimized WordPress might serve you just as well.

The future probably belongs to hybrid approaches anyway - static generation where it makes sense, dynamic features where needed. Both platforms are evolving toward this middle ground.

Have questions about the migration or want to discuss the performance results? Drop them in the comments below!



Follow me for more web development tutorials and infrastructure guides!

📬 Want essays like this in your inbox?

I just launched a newsletter about thinking clearly in tech — no spam, no fluff.

Subscribe here: https://buttondown.com/efficientlaziness

Efficient Laziness — Think once, well, and move forward.

Centralized EKS monitoring across multiple AWS accounts

2025-11-20 17:25:45

Complex systems require extensive monitoring and observability. Systems as complex as Kubernetes clusters have so many moving parts that sometimes it's a task and a half just to configure their monitoring properly. Today I'm going to talk in depth about cross-account observability for multiple EKS clusters, explore various implementation options, outline the pros and cons of each approach, and explain one of them in close detail. Whether you’re an aspiring engineer seeking best-practice advice, a seasoned professional ready to disagree with everything, or a manager looking for ways to optimize costs -- this article might be just right for you.

Cluster observability

What is good observability? Good observability answers the questions:

  • How is our cluster doing?
  • Are all the components working as intended?
  • Are there any errors?
  • What about our application?
  • What do our users see?
  • Are they experiencing any issues?

As you can see, apart from the health of the cluster itself, good observability monitors the health of the application and user-facing metrics as well. So far, we can highlight two types of data in our cluster: cluster operational metrics and application metrics.

EKS Cluster Monitoring

But is that all? Didn't we forget something?

How are we going to find out whether we have issues with our services if our monitoring is down? Exactly -- we can't. So we have to introduce meta-monitoring metrics: the data about the health of the monitoring system itself.
And now we have three types of metrics.

EKS Cluster Monitoring Monitoring

Observability fundamentals

The bare minimum set of data we can scrape from our cluster comes from the Prometheus Node Exporter (or VictoriaMetrics vmagent). This is the most technical data -- CPU, memory, network latencies, temperature, you name it. It doesn't get more technical than this.

K8s Node

But should we stop here? Technical data about our system shows only our part of the setup -- the health of the underlying system. It's extremely helpful when something goes very wrong with our service, but it doesn't show the full picture. To see that, we also need visibility into the customer-facing side of our service. We need application metrics: request latency, number of responses, application errors. The good thing is that exposing a metrics endpoint within our application may be enough -- Prometheus will scrape this endpoint by itself without additional daemons.

K8s Nodes In A Cluster

Something's missing, right? For example, where can we see application errors? Not just the error codes or counts, but actual error messages?

Sure, kubectl logs my-pod is cool and all, but a production-ready app is rarely a single pod (at least I hope so, for most of us).

So we think about log collection as well. This adds yet another pod to each node we have: a log collector agent.

Cluster With Logging Agent

Is the picture complete now?

Not quite. As mentioned previously, to check whether our monitoring setup is even alive, it would be nice to have some sort of external monitoring -- the monitoring of the monitoring. Monitoringception. Since the sole purpose of this external service is to answer the question of whether the monitoring is alive, it can be a very simple setup: a daemon that queries the monitoring endpoint. Because this daemon must remain alive even when the whole system is down, it has to run outside the perimeter of the application workload -- on an external EC2 instance outside the EKS cluster. To achieve maximum independence, this service can be set up in an entirely different cloud provider or even on a dedicated VPS in a datacenter on another continent (or in your home lab).

It can even be done as a dead man's switch: an alert in the monitoring system that should always be in the "failing" state and should turn green only if the monitoring system is down. Don't like a constant influx of alerts? Set up a notification for when data hasn't been received for X minutes; it should do the trick as well. [1]
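
A minimal sketch of such an external watcher, assuming a plain Prometheus whose health endpoint is reachable from the watcher host (the URL and the alert hook are placeholders):

#!/usr/bin/env python3
import time, requests

PROM_URL = "http://prometheus.internal:9090/-/healthy"  # Prometheus health endpoint
CHECK_EVERY = 60  # seconds

def alert(msg: str):
    print(f"ALERT: {msg}")  # swap for PagerDuty/Slack/SNS/etc.

while True:
    try:
        r = requests.get(PROM_URL, timeout=10)
        if r.status_code != 200:
            alert(f"monitoring unhealthy: HTTP {r.status_code}")
    except requests.RequestException as e:
        alert(f"monitoring unreachable: {e}")
    time.sleep(CHECK_EVERY)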

Full EKS Cluster

Cross-account observability

Now let's crank it up a notch.

This is what a service architecture might look like. But in modern development, it rarely is like this. Mature businesses typically divide their infrastructure by environments -- at the very least, development and production. They might also include, but are not limited to, staging, UAT, etc.

So we spin up both environments in the same cluster, but in different namespaces, right?

This is a bad practice, as resources are still shared between both applications, and one can affect the other in ways we wouldn't want.

The other option is to spin up another environment in the same AWS account, but on a different cluster. Is this a good option? Most likely, yes. Buuuut the Well-Architected Framework and other guidelines advise against sharing a single account between several environments. Full environment separation is better for both account and application security.

So now we have two AWS accounts.

It was fairly simple to organize observability within a single account -- the networks were contained, enclosed in a single perimeter, and available directly from one endpoint to another. Having several accounts introduces the complexity of cross-account networking and finally brings us to the actual topic of this article.

Cross-account networking options

As with every task in cloud operations, this topic can be approached in several ways:

  1. VPC Peering
  2. AWS Transit Gateway
  3. AWS NLBs with VPC Endpoints and VPC Endpoint Services
  4. AWS VPN / Direct Connect
  5. Public internet + Authorization + TLS

Let's discuss each of them briefly.

NOTE
For the examples below, the cost of each implementation will include only additional charges (i.e., networking and service charges). The price of the monitoring services (Prometheus + storage, Grafana, etc.) will not be referenced.

NOTE
A table summarizing all the options with detailed price breakdown and implementation complexity comparisons can be found in the annex.

VPC Peering

This option assumes creating multiple VPC Peering connections from the root account to the branch accounts. It is quite cheap and may work well for small organizations with one or two accounts. It also fits best when you are still planning your infrastructure, because it requires CIDR planning: VPC CIDRs cannot overlap with this option. On top of that, you will have to configure route tables manually for each account. On the bright side: the peering connections themselves are free, and so is same-AZ traffic.

Pros

  • Native AWS networking, low latency
  • No additional data transfer costs within the same region
  • Simple security model with Security Groups
  • Direct IP connectivity, no NAT required

Cons

  • Non-transitive routing (requires mesh topology for 4+ accounts)
  • CIDR blocks cannot overlap between VPCs
  • Manual route table management for each peering

AWS Transit Gateway

This setup looks more like an enterprise-scale solution: better convenience, higher price. It allows for centralized management of the observability networking setup but requires extensive route planning. It is, however, very scalable by utilizing Transit Gateways, which can be attached to a large number of accounts. It can also be shared across organization accounts via Resource Access Manager (one more point for house Enterprise).

Pros

  • Scalable to hundreds of VPCs
  • Centralized management
  • Supports overlapping CIDR blocks with routing domains
  • Single point of policy enforcement
  • Can be shared across accounts via Resource Access Manager

Cons

  • Pricy
  • Additional hop adds ~1-2 ms latency
  • More complex initial setup
  • Bandwidth limits per VPC attachment (50 Gbps burst)
  • Requires careful route table planning

AWS NLBs + VPC Endpoints and VPC Endpoint Services

The third option, almost the golden mean, involves exposing services via Network Load Balancers and Endpoint Services and connecting to them via VPC Endpoints. It is cheaper than the previous option but more expensive than the first: NLBs and VPC Endpoints cost money simply to run. On the other hand, it doesn't limit the number of accounts that can be connected, doesn't require route table configuration, keeps traffic off the public internet, and with one-to-many connections, you only need to configure a small number of endpoints and endpoint services.

Pros

  • No CIDR overlap issues
  • Works across regions
  • Manual control over who can connect: accept/refuse endpoint service connections
  • Private connectivity without internet exposure
  • No route table modifications needed

Cons

  • Higher cost
  • Requires NLB per source cluster
  • Additional latency from the NLB layer
  • Cross-region data transfer can be expensive

AWS VPN / Direct Connect

This option is an overkill unless you have a dedicated NOC department, because it has all the fun: full network connectivity, encrypted traffic, interface configuration, BGP configuration, and many associated charges for traffic and connection speed.

Pros

  • Full network connectivity
  • Can support multiple use cases beyond monitoring
  • Encrypted tunnels
  • Direct Connect offers dedicated bandwidth and lower latency

Cons

  • Requires a specialized NOC department
  • You will have to have an existing datacenter to justify using Direct Connect
  • High cost (especially Direct Connect)
  • Complex setup and management
  • VPN has bandwidth limitations (~1.25 Gbps per tunnel)
  • Direct Connect has long setup times (weeks)
  • Requires BGP knowledge for Direct Connect

Public Internet + Authorization + TLS

And the final option: the very basic "just shove it onto the public internet and slap a login page on top of it" approach. Super simple to set up! Just spin up an ALB before the Prometheus instance, season it with TLS certificates, and expose it to the public internet. Super unsafe! Works for hobby and experimental projects. Please don't use in a production environment (unless you are absolutely certain -- but still, please don't).

Pros

  • Simplest networking
  • No VPC connectivity required
  • Works across any AWS accounts/regions
  • Easy to test and debug
  • Lowest AWS networking costs

Cons

  • Security concerns
  • Data transfer over public internet
  • Potential latency/reliability issues
  • Harder to pass security audits
  • Need to manage SSL certificates
  • Vulnerable to DDoS

As you can see, there are several options to consider and choose from. All of them have their pros and cons. For my scenario, I ended up choosing to set up NLBs with VPC Endpoints.

Why?

The first reason is that it's relatively easy to implement. To set up a connection between accounts, we need only an endpoint service and endpoints in the other accounts (plural). They connect in a one-to-many relationship: one endpoint service can handle several connecting endpoints.

The second reason is that I had overlapping CIDRs in each account, so VPC Peering was immediately off the table.

The third reason is that it has moderately less configuration overhead. I won't have to configure client connections or route tables. Proper routing between multiple accounts is a very precise art at which I most certainly suck.

And the last reason is that it's still quite secure. The traffic from services doesn't leave the AWS network, doesn't traverse the public internet, and connections are only allowed after approval (it can be configured to auto-approve, though, which may be convenient -- but healthy paranoia is my constant companion).

Chosen cross-account solution

The whole system is actually not that overcrowded.

For the sake of this article, let's assume we have three products: Bulba, Char, and Squi. Each product has two environments: normal and sparkling. We also have a root account (let's call it Ketch) for organization management and, for simplicity, observability as well. This gives us seven AWS accounts and seven EKS clusters. Why seven and not six? Well, the root account also runs an EKS cluster as an observability backend.

Each product account (Bulba, Char, and Squi) includes the following elements:

  • EKS cluster:
    • Prometheus server
    • Node exporter pods
    • Logs and traces collector pods (for my setup, Alloy, I choose you!)
    • Internal NLB to expose the Prometheus backend as a service (only in the private subnet)
  • Endpoint service that exposes the NLB for Prometheus. For each deployed endpoint service, a corresponding endpoint must be created in the root account (more on that later)
  • Endpoint for connecting to the root account logging solution (Loki)
  • Endpoint for connecting to the root account tracing solution (Tempo)
  • Security groups with allowed ingress and egress ports

The root account (Ketch) also contains several elements:

  • EKS cluster:
    • Prometheus server: centralized Prometheus backend that queries federated Prometheus backends in product accounts
    • Loki backend, which receives logs from product accounts
    • Tempo backend, which receives traces from product accounts
    • Internal NLB to expose Loki backend as a service (in the private subnet)
    • Internal NLB to expose Tempo backend as a service (in the private subnet)
  • S3 buckets for chunks and logs (Loki)
  • S3 bucket for traces (Tempo)
  • Endpoints for connecting to product account Prometheus backends in federated mode
  • Endpoint service for Loki, which accepts endpoint connections from product accounts
  • Endpoint service for Tempo, which accepts endpoint connections from product accounts
  • Security groups with allowed ingress and egress ports

That's quite a lot of components to keep track of. Luckily, we have IaC to ease the configuration and deployment of these components.

To understand why so many components are needed (particularly the endpoint service-endpoint pairs), we need to look at the traffic flow model.

Prometheus cross-account connections

Zooming in a bit, we can see the exact details of how the connection is configured.

Prometheus connection setup

Since our goal is to keep metrics data secure and prevent it from traversing public subnets or the public internet, we create an endpoint service. This service acts as an open receiver on one account (the product account) for the connector on the root account. This creates a 1-to-1 connection that is both secure and governed, since requests to connect to the endpoint service must be approved.

The setup for metrics differs from the setup for logs and traces.

Prometheus uses a pull model: the backend queries the endpoints for data, effectively pulling the data from them. Loki and Tempo, however, use a push model: deployed logs and traces collection pods send (push) the data to the centralized backend.

In this case, the endpoint service/endpoint pair is reversed and simpler: only one endpoint service per service (two in total) is created in the root account, and all product accounts create endpoints and request connections with the root endpoint service.

And this is how the accounts look in the end:

Product account

Root account

TIP
For consistency, you might want to use a metrics-gathering service with a push model (e.g., the aforementioned VictoriaMetrics vmagent). This way, endpoint services are only created in the root account, and product accounts only have endpoints.

Operational considerations

As mentioned previously, the solution I chose in this article is not the only correct one out there. For this setup, I had specific requirements that needed to be fulfilled, as well as particular implementation caveats to consider.

My example is definitely not the cheapest option. The cheapest would be using VPC Peering. But unfortunately, my existing setup -- with the same CIDRs in EKS clusters and the possibility to extend beyond those six (+1) AWS accounts -- made this option unavailable.

The described setup is also located in a single region -- in cross-region data transfer scenarios, costs can increase drastically and very quickly. For each additional region, an additional VPC endpoint/service will have to be created.

There's also always the possibility to just expose the backend ports to the public internet (with proper authorization, of course!) and not bother with endpoint configuration at all. But even with TLS-encrypted traffic, this is a rather unsafe option and will absolutely not help you pass any SOC 1/SOC 2/ISO 27001 certification.

Closing thoughts

This was a very interesting challenge to tackle and implement -- I had an absolute blast setting up the POC and confirming that it works. I was excited by the variety of options I could choose from and the differences between the tools available in these scenarios.

And I think it's beautiful. It not only shows the complexity of AWS services (which can sometimes be a downside), but also that there's always more than one solution to each problem. Every engineer will approach a challenge differently -- which, in my humble opinion, means our jobs are secure for the observable future.

Thank you for reading, and see you in the next one!

Annex

A. Summarizing table

| Option | CIDRs can overlap | Scalability | Implementation complexity | Price |
| --- | --- | --- | --- | --- |
| VPC Peering | ✗ | ●○○ | ●●○○ | $ |
| AWS Transit Gateway | ✓ | ●●● | ●●●○ | $$$ |
| AWS NLBs + VPC Endpoints + Services | ✓ | ●●○ | ●●○○ | $$ |
| AWS VPN / Direct Connect | ✓ | ●●● | ●●●● | $$$-$$$$ |
| Public internet + Authorization + TLS | ✓ | ●●● | ●○○○ | $$ |

B. Price breakdown

The variety and complexity of the outlined options call for more in-depth research than the scope of this article allows. I agree with that. To make your life easier and to decode the cryptic $ signs in the table above, I decided to create an example price breakdown and give you some specific numbers, based on which you can at least roughly figure out whether an option is suitable for you or not.

To keep the numbers comparable, we will assume the following prerequisites:

  • 5 AWS accounts (1 root and 4 product accounts)
  • All accounts are located in single region (us-west-2), but in multiple AZs
  • 500 GB of data transferred monthly
  • Each account has an EKS cluster with 2 nodes (c7a.medium)
  • 750 GB of storage for 1 month data retention and a little extra

Option 1: VPC Peering

  • VPC Peering Connections: 4 × $0 = $0 / month
  • Data Transfer (cross-AZ [2], bidirectional): 500 GB × 2 × $0.01/GB = $10/month
  • NLBs in product accounts (Prometheus): 4 × $16.20/month = $64.80/month
  • NLB in root account (Loki): $16.20/month
  • NLB LCU charges: 500 GB × $0.006/GB = $3/month
  • EC2 instances for EKS: 10 × $37.46/month = $374.60/month
  • EBS storage (gp3): 750 GB × $0.08/GB = $60/month

Total Monthly Cost: $528.60

Option 2: AWS Transit Gateway

  • TGW VPC Attachments: 5 × $0.05/hr = $180/month
  • TGW Data Transfer: 500 GB × $0.02/GB = $10/month
  • Data Transfer (cross-AZ, bidirectional): 500 GB × 2 × $0.01/GB = $10/month
  • EC2 instances for EKS: 10 × $37.46/month = $374.60/month
  • EBS storage (gp3): 750 GB × $0.08/GB = $60/month

Total Monthly Cost: $634.60

Option 3: AWS NLBs + VPC Endpoints + Services

  • VPC Endpoints in root account (Prometheus): 4 × $0.01/hr = $28.40/month
  • VPC Endpoints in product accounts (Loki): 4 × $0.01/hr = $28.40/month
  • VPC Endpoint Data Transfer: 500 GB × $0.01/GB = $5/month
  • NLBs in product accounts (Prometheus): 4 × $16.20/month = $64.80/month
  • NLB in root account (Loki): $16.20/month
  • NLB LCU charges: 500 GB × $0.006/GB = $3/month
  • Data Transfer (cross-AZ, bidirectional): 500 GB × 2 × $0.01/GB = $10/month
  • EC2 instances for EKS: 10 × $37.46/month = $374.60/month
  • EBS storage (gp3): 750 GB × $0.08/GB = $60/month

Total Monthly Cost: $590.40

Option 4a: Site-to-Site VPN

  • VPN Connections: 4 × $0.05/hour = $144/month
  • VPN Data Transfer: 500 GB × $0.09/GB = $45/month
  • Data Transfer (cross-AZ, bidirectional): 500 GB × 2 × $0.01/GB = $10/month
  • EC2 instances for EKS: 10 × $37.46/month = $374.60/month
  • EBS storage (gp3): 750 GB × $0.08/GB = $60/month

Total Monthly Cost: $633.60

Option 4b: AWS Direct Connect

  • Direct Connect Port (1 Gbps): $0.30/hour = $216/month
  • Direct Connect Data Transfer: 500 GB × $0.02/GB = $10/month
  • Data Transfer (cross-AZ, bidirectional): 500 GB × 2 × $0.01/GB = $10/month
  • EC2 instances for EKS: 10 × $37.46/month = $374.60/month
  • EBS storage (gp3): 750 GB × $0.08/GB = $60/month

Total Monthly Cost: $670.60 [3]

Option 5: Public Internet + TLS + Auth

  • NLBs in product accounts (Prometheus): 4 × $16.20/month = $64.80/month
  • NLB in root account (Loki): $16.20/month
  • NLB LCU charges: 500 GB × $0.006/GB = $3/month
  • Data Transfer (out to the internet): 500 GB × $0.09/GB = $45/month
  • EC2 instances for EKS: 10 × $37.46/month = $374.60/month
  • EBS storage (gp3): 750 GB × $0.08/GB = $60/month

Total Monthly Cost: $563.60

For pricing calculations, I used the official AWS pricing pages, as well as a handy AWS EC2 instance and price comparison tool.

C. IaC snippets

The code presented in this section does not represent a fully working infrastructure. It highlights only the most relevant parts of the PoC implementation. It has not been tested and serves solely as a reference. You can copy it, but without additional initial setup (at minimum a VPC and an EKS cluster) and further adjustments, it will not work. Use at your own risk!

Product account

# variables.tf
variable "eks_nlb_endpoint_services" {
  description = "EKS NLB endpoint services configuration"
  type = map(object({
    nlb_arn            = list(string)
    allowed_principals = list(string)
  }))
}

variable "loki_service_name" {
  description = "VPC Endpoint Service name for Loki"
  type = string
}

variable "tempo_service_name" {
  description = "VPC Endpoint Service name for Tempo"
  type = string
}

variable "vpc_id" {
  type        = string
}

variable "private_subnet_ids" {
  type        = list(string)
}

variable "name_prefix" {
  type        = string
}
# main.tf
resource "aws_vpc_endpoint_service" "nlb_endpoint_services" {
  for_each = var.eks_nlb_endpoint_services

  acceptance_required        = true
  allowed_principals         = each.value.allowed_principals
  network_load_balancer_arns = each.value.nlb_arn
}

resource "aws_security_group" "observability" {
  name        = "${var.name_prefix}-observability"
  description = "Allow traffic from observability resources"
  vpc_id      = var.vpc_id
}

resource "aws_vpc_security_group_ingress_rule" "loki" {
  security_group_id = aws_security_group.observability.id
  description       = "Allow traffic from observability resources: Loki"
  cidr_ipv4         = "10.0.0.0/16"
  from_port         = 8080
  to_port           = 8080
  ip_protocol       = "tcp"
}

resource "aws_vpc_security_group_ingress_rule" "tempo" {
  security_group_id = aws_security_group.observability.id
  description       = "Allow traffic from observability resources: Tempo"
  cidr_ipv4         = "10.0.0.0/16"
  from_port         = 4317
  to_port           = 4318
  ip_protocol       = "tcp"
}

resource "aws_vpc_endpoint" "loki" {
  vpc_id              = var.vpc_id
  service_name        = var.loki_service_name
  vpc_endpoint_type   = "Interface"
  security_group_ids  = [aws_security_group.observability.id]
  subnet_ids          = var.private_subnet_ids
  private_dns_enabled = false
}

resource "aws_vpc_endpoint" "tempo" {
  vpc_id              = var.vpc_id
  service_name        = var.tempo_service_name
  vpc_endpoint_type   = "Interface"
  security_group_ids  = [aws_security_group.observability.id]
  subnet_ids          = var.private_subnet_ids
  private_dns_enabled = false
}
# terraform.tfvars
# tfvars files cannot contain expressions, so account IDs and ARNs
# must be spelled out as literals.
eks_nlb_endpoint_services = {
  prometheus = {
    nlb_arn = [
      "arn:aws:elasticloadbalancing:us-west-2:111111111111:loadbalancer/net/xxxx-yyyy/zzzz"
    ]
    allowed_principals = ["arn:aws:iam::123456789012:root"]
  }
}
loki_service_name = "com.amazonaws.vpce.us-west-2.vpce-svc-qwertyuiop"
tempo_service_name = "com.amazonaws.vpce.us-west-2.vpce-svc-qwertyuiop"
vpc_id             = "vpc-xxxxx"
private_subnet_ids = ["subnet-xxxxx", "subnet-yyyyy"]
name_prefix        = "char"

Root account

# variables.tf
variable "account_id" {
  type        = string
}

variable "name_prefix" {
  type        = string
}

variable "vpc_id" {
  type        = string
}

variable "private_subnet_ids" {
  type        = list(string)
}

variable "allowed_principals" {
  description = "Allowed AWS account principals for VPC endpoint services"
  type        = list(string)
}

variable "loki_nlb_arns" {
  type        = list(string)
}

variable "tempo_nlb_arns" {
  type        = list(string)
}

variable "char_normal_prometheus_vpcesvc_name" {
  description = "VPC Endpoint Service name for Prometheus; normal account"
  type        = string
}

variable "char_sparkling_prometheus_vpcesvc_name" {
  description = "VPC Endpoint Service name for Prometheus; sparkling account"
  type        = string
}
# s3.tf
resource "aws_s3_bucket" "loki_chunks" {
  bucket = "${var.name_prefix}-${var.account_id}-loki-chunks"
}

resource "aws_s3_bucket" "loki_ruler" {
  bucket = "${var.name_prefix}-${var.account_id}-loki-ruler"
}

resource "aws_s3_bucket" "tempo" {
  bucket = "${var.name_prefix}-${var.account_id}-tempo-traces"
}
# iam.tf
resource "aws_iam_policy" "loki_buckets" {
  name = "${var.name_prefix}-loki-buckets"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "LokiBuckets"
        Effect = "Allow"
        Action = [
          "s3:ListBucket",
          "s3:PutObject",
          "s3:GetObject",
          "s3:DeleteObject"
        ]
        Resource = [
          aws_s3_bucket.loki_chunks.arn,
          "${aws_s3_bucket.loki_chunks.arn}/*",
          aws_s3_bucket.loki_ruler.arn,
          "${aws_s3_bucket.loki_ruler.arn}/*"
        ]
      }
    ]
  })
}

resource "aws_iam_role" "loki_pod_identity" {
  name = "${var.name_prefix}-loki-pod-identity"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowEksAuthToAssumeRoleForPodIdentity"
        Effect = "Allow"
        Principal = {
          Service = "pods.eks.amazonaws.com"
        }
        Action = [
          "sts:AssumeRole",
          "sts:TagSession"
        ]
      }
    ]
  })

}

resource "aws_iam_role_policy_attachment" "loki_pod_identity" {
  role       = aws_iam_role.loki_pod_identity.name
  policy_arn = aws_iam_policy.loki_buckets.arn
}

resource "aws_iam_policy" "tempo_bucket" {
  name = "${var.name_prefix}-tempo-bucket"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "TempoBucket"
        Effect = "Allow"
        Action = [
          "s3:ListBucket",
          "s3:PutObject",
          "s3:GetObject",
          "s3:DeleteObject",
          "s3:GetObjectTagging",
          "s3:PutObjectTagging"
        ]
        Resource = [
          aws_s3_bucket.tempo.arn,
          "${aws_s3_bucket.tempo.arn}/*"
        ]
      }
    ]
  })
}

resource "aws_iam_role" "tempo_pod_identity" {
  name = "${var.name_prefix}-tempo-pod-identity"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowEksAuthToAssumeRoleForPodIdentity"
        Effect = "Allow"
        Principal = {
          Service = "pods.eks.amazonaws.com"
        }
        Action = [
          "sts:AssumeRole",
          "sts:TagSession"
        ]
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "tempo_pod_identity" {
  role       = aws_iam_role.tempo_pod_identity.name
  policy_arn = aws_iam_policy.tempo_bucket.arn
}

resource "aws_eks_pod_identity_association" "loki" {
  cluster_name    = var.name_prefix
  namespace       = "monitoring"
  service_account = "loki"
  role_arn        = aws_iam_role.loki_pod_identity.arn
}

resource "aws_eks_pod_identity_association" "tempo" {
  cluster_name    = var.name_prefix
  namespace       = "monitoring"
  service_account = "tempo"
  role_arn        = aws_iam_role.tempo_pod_identity.arn
}
# vpc.tf
resource "aws_vpc_endpoint_service" "loki" {
  acceptance_required        = true
  allowed_principals         = var.allowed_principals
  network_load_balancer_arns = var.loki_nlb_arns
}

resource "aws_vpc_endpoint_service" "tempo" {
  acceptance_required        = true
  allowed_principals         = var.allowed_principals
  network_load_balancer_arns = var.tempo_nlb_arns
}

resource "aws_vpc_endpoint" "char-normal-prometheus" {
  vpc_id              = var.vpc_id
  service_name        = var.char_normal_prometheus_vpcesvc_name
  vpc_endpoint_type   = "Interface"
  security_group_ids  = [aws_security_group.observability.id]
  subnet_ids          = var.private_subnet_ids
  private_dns_enabled = false
}

resource "aws_vpc_endpoint" "char-sparkling-prometheus" {
  vpc_id              = var.vpc_id
  service_name        = var.char_sparkling_prometheus_vpcesvc_name
  vpc_endpoint_type   = "Interface"
  security_group_ids  = [aws_security_group.observability.id]
  subnet_ids          = var.private_subnet_ids
  private_dns_enabled = false
}

resource "aws_security_group" "observability" {
  name        = "observability"
  description = "Allow traffic from observability resources"
  vpc_id      = var.vpc_id
}

resource "aws_vpc_security_group_ingress_rule" "prometheus" {
  security_group_id = aws_security_group.observability.id
  cidr_ipv4         = "10.0.0.0/16"
  from_port         = 9090
  to_port           = 9090
  ip_protocol       = "tcp"
}
# terraform.tfvars.example
account_id                             = "123456789012"
name_prefix                            = "ketch"
vpc_id                                 = "vpc-xxxxx"
private_subnet_ids                     = ["subnet-xxxxx", "subnet-yyyyy"]
loki_nlb_arns                          = ["arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/net/xxxx-yyyy/zzzz"]
tempo_nlb_arns                         = ["arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/net/xxxx-yyyy/zzzz"]
char_normal_prometheus_vpcesvc_name    = "com.amazonaws.vpce.us-west-2.vpce-svc-xyz"
char_sparkling_prometheus_vpcesvc_name = "com.amazonaws.vpce.us-west-2.vpce-svc-xyz"
allowed_principals = [
  "arn:aws:iam::111111111111:root",
  "arn:aws:iam::222222222222:root"
]
  1. It doesn’t matter all that much which approach you choose to implement for monitoring your monitoring. In light of recent events, some folks even created a downdetector for a downdetector’s downdetector. I mean, it’s hilariously fun, but the key point remains solid: you need to know whether your eyes and ears (infrastructure-wise) are even working. ↩

  2. We assume 100% cross-AZ traffic in this example to maximize potential traffic costs and avoid complicating the calculations with percentages of same-AZ versus cross-AZ traffic. ↩

  3. Direct Connect may also require a specific partner to enable and perform the physical connection to the AWS network, so expect to add a few hundred (or even a few thousand) dollars on top of the initial bill for setup. ↩

A Step-by-Step Guide on How to Integrate AI Into Your Existing Health App

2025-11-20 17:23:54

I thought plugging AI into a health app would be a weekend project. Spoiler: it wasn’t. It was messy, frustrating, and at one point, I wondered if my laptop fans could legally qualify as medical devices because of how hard they were working.

But here’s the thing: healthcare is already leaning on AI harder than most industries. According to Grand View Research, the global AI in healthcare market was valued at $22.45 billion in 2023 and is expected to grow at 37% annually through 2030. That’s not just hype; it’s reality. If you’re building or maintaining a health app today, the question isn’t whether you should integrate AI. It’s how soon you can do it without breaking everything.

This guide walks through everything I learned about adding AI to an existing health app.

Step 1: Admit That Your App Isn’t Ready for AI

When I first looked at my health app’s codebase, it felt like inviting a brain surgeon to operate in a garage. The app was functional, with calorie tracking, step counts, and some reminders, but architecturally it wasn’t ready for machine learning models.

Here’s what I had to do before any AI integration could happen:

  • Clean the Data: My user data was riddled with inconsistencies. Think “10,000 steps” logged in one field and “10k” in another. AI models choke on that stuff (see the sketch after this list).
  • Upgrade Storage: SQL alone wasn’t cutting it. I needed a pipeline that could handle structured + unstructured data, especially if I wanted natural language features.
  • Audit Permissions: Healthcare data = sensitive data. If you don’t nail HIPAA or GDPR compliance upfront, AI is the least of your worries.
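
To give a feel for the cleanup, here’s a minimal sketch of a step-count normalizer. The function name and the accepted formats are illustrative assumptions, not my actual schema.

# normalize_steps.py
import re

def normalize_step_count(raw: str) -> int | None:
    """Coerce messy strings like '10,000', '10k', or '10000 steps' to an int."""
    cleaned = raw.strip().lower().replace(",", "")
    match = re.match(r"^(\d+(?:\.\d+)?)\s*(k?)", cleaned)
    if not match:
        return None  # unparseable; flag for manual review instead of guessing
    value, suffix = match.groups()
    return int(float(value) * (1000 if suffix == "k" else 1))

assert normalize_step_count("10,000") == 10000
assert normalize_step_count("10k") == 10000
assert normalize_step_count("ten thousand") is None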

Step 2: Pick the Right AI Use Case (Not the Shiny One)

The temptation? Predicting diseases like some sci-fi oracle. The reality? I didn’t have the data (or regulatory clearance) for that.
So I started smaller. I integrated an AI-powered symptom checker that could take user inputs in plain English and map them to potential health insights (a simplified sketch follows the list below). Why this worked:

  • Easier Data Scope: free-text symptom descriptions plus a curated mapping table, not years of labeled clinical records.
  • Faster Integration: it plugged into the existing input flow instead of demanding a new one.
  • Immediate User Value: users got something genuinely useful from day one.
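
Here’s a heavily simplified sketch of that first keyword-mapping version. The symptom table and wording are made up for illustration, and none of it is medical advice.

# symptom_checker.py
SYMPTOM_MAP = {
    "headache": "Headaches have many causes, from dehydration to stress.",
    "nausea": "Short-lived nausea is common; persistent nausea is worth a check-up.",
}

def check_symptoms(user_input: str) -> str:
    """Map plain-English input to gentle, non-diagnostic insights."""
    text = user_input.lower()
    hits = [msg for keyword, msg in SYMPTOM_MAP.items() if keyword in text]
    if not hits:
        return "I couldn't match that to anything I know. Please consult a doctor."
    return " ".join(hits) + " This is not a diagnosis; consult a doctor if symptoms persist."

print(check_symptoms("I have a headache and some nausea"))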

Lesson: Choose a use case that matches both your data maturity and user needs. If you aim too high, you’ll spend six months tweaking models no one will ever see.

Step 3: Build the Pipeline (aka Where I Broke Everything)

This was the most painful part. You don’t just “add AI” like a WordPress plugin. I needed an actual pipeline, sketched in code after this list:

  • Data Ingestion (fitness trackers, manual logs, APIs)
  • Preprocessing (cleaning, normalization, anonymization)
  • Model Training/Integration (TensorFlow, PyTorch, or a managed API)
  • Deployment (embedding the model into the app flow)
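
As a skeleton, the pipeline looked roughly like this. The function bodies are placeholder assumptions; the separation into small, testable stages is the point.

# pipeline.py
def ingest() -> list[dict]:
    """Pull raw records from trackers, manual logs, and partner APIs."""
    return [{"user_id": "u1", "steps": "10k", "note": "headache after run"}]

def preprocess(records: list[dict]) -> list[dict]:
    """Clean, normalize, and strip identifiers before anything touches a model."""
    return [{**r, "user_id": None} for r in records]

def infer(records: list[dict]) -> list[dict]:
    """Run the model (local or hosted) and attach its output to each record."""
    return [{**r, "insight": "possible dehydration"} for r in records]

def publish(records: list[dict]) -> None:
    """Hand results back to the app layer for display."""
    for r in records:
        print(r["insight"])

publish(infer(preprocess(ingest())))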

The first time I deployed, the model was so slow it made my app feel like dial-up internet. Users would type “headache” and get results ten seconds later. Not exactly confidence-inspiring.

What fixed it? Offloading heavy computation to the cloud and only keeping lightweight inference on-device. That balance is critical if you want to avoid frustrating your users.
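
The resulting split looked roughly like the sketch below: try the cloud first, fall back on-device. The endpoint URL, timeout, and fallback message are illustrative assumptions.

# inference_split.py
import requests  # third-party HTTP client: pip install requests

CLOUD_ENDPOINT = "https://api.example.com/v1/insights"  # hypothetical URL

def cloud_infer(text: str, timeout_s: float = 2.0) -> str | None:
    """Full model in the cloud; give up quickly rather than hang the UI."""
    try:
        resp = requests.post(CLOUD_ENDPOINT, json={"text": text}, timeout=timeout_s)
        resp.raise_for_status()
        return resp.json()["insight"]
    except requests.RequestException:
        return None  # network or server trouble; fall back locally

def local_infer(text: str) -> str:
    """Tiny on-device rule set: fast and dumb, but it never leaves users hanging."""
    return "We couldn't reach the full model right now. If symptoms persist, consult a doctor."

def get_insight(text: str) -> str:
    return cloud_infer(text) or local_infer(text)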

Step 4: Test Like You’re a Paranoid Doctor

Healthcare apps don’t get the same forgiveness as social apps. If your AI makes a mistake, people panic.

Here’s how I tested mine (a few of those checks are sketched after this list):

  • Edge Cases: What happens if someone types “I feel weird”?
  • Multilingual Input: Health is global, and so are users.
  • False Positives: Better to say “consult a doctor” than confidently misdiagnose.
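
In pytest form, a few of those paranoid checks looked like this; check_symptoms is the hypothetical function from the Step 2 sketch.

# test_symptom_checker.py
import pytest
from symptom_checker import check_symptoms  # the Step 2 sketch

def test_vague_input_does_not_crash():
    assert "consult a doctor" in check_symptoms("I feel weird").lower()

def test_absurd_input_stays_calm():
    assert "consult a doctor" in check_symptoms("I ate 50 bananas in an hour").lower()

@pytest.mark.parametrize("text", ["j'ai mal à la tête", "me duele la cabeza"])
def test_non_english_input_fails_safely(text):
    # The keyword version won't understand these; it must fail safe, not guess.
    assert "consult a doctor" in check_symptoms(text).lower()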

I also pulled in a small circle of test users (read: friends and family) to break the system. One typed “I ate 50 bananas in an hour” just to see what would happen. It turns out models don’t like absurd diets either.

Step 5: Handle Privacy Before It Handles You

This one nearly derailed me. Collecting health data means you’re holding a ticking legal time bomb if you’re not careful.

What I learned:

  • Always anonymize data before training models (see the sketch after this list).
  • Store personal identifiers separately from health metrics.
  • Log every access request for transparency.
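
The first two rules boiled down to code like this. The hashing scheme and field names are simplified assumptions; in production the salt lived in a secrets manager.

# anonymize.py
import hashlib
import os

SALT = os.environ.get("ANON_SALT", "dev-only-salt")  # real salt came from a secrets manager

def pseudonymize(user_id: str) -> str:
    """One-way hash so training data can be grouped per user without naming anyone."""
    return hashlib.sha256((SALT + user_id).encode()).hexdigest()[:16]

def split_record(record: dict) -> tuple[dict, dict]:
    """Return (identity_row, metrics_row), destined for two separate stores."""
    pid = pseudonymize(record["user_id"])
    identity = {"pseudo_id": pid, "user_id": record["user_id"]}
    metrics = {"pseudo_id": pid, **{k: v for k, v in record.items() if k != "user_id"}}
    return identity, metrics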

I ended up spending more time on compliance than on coding. Boring? Yes. Necessary? Absolutely.

Step 6: Know When to Get Help

Somewhere between debugging preprocessing scripts and trying to optimize model latency, I realized I was way out of my depth. That’s when I started looking into outside help from teams that do this full-time.

Step 7: Launch Small, Learn Fast

When I finally rolled out the AI feature, I didn’t blast it to every user. I launched a beta. That way, feedback trickled in from a manageable group, and I could iterate without fear of a meltdown. Early users pointed out quirks I hadn’t even considered:

  • The symptom checker didn’t recognize slang like “tummy ache.”
  • Results felt too clinical for casual users.
  • Some people expected AI to replace doctors (which it shouldn’t).

Each round of feedback made the feature sharper and safer.
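
The beta gating itself was the easy part. Here’s a minimal sketch of a deterministic percentage rollout, assuming a stable user ID; the 5% bucket is an illustrative choice.

# beta_flag.py
import hashlib

def in_beta(user_id: str, rollout_percent: int = 5) -> bool:
    """The same user always lands in the same bucket, so the beta group stays stable."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent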

What I’d Tell You If You’re About to Try This

Integrating AI into a health app isn’t just a technical challenge; it’s a balancing act between user trust, regulatory compliance, and technical feasibility.

If you’re thinking about it, here’s my blunt advice:

  • Don’t chase flashy features; start practical.
  • Expect your first deployment to fail (mine did).
  • Prioritize privacy above all else.
  • And most importantly: remember you’re dealing with people’s health. AI should assist, not replace, medical judgment.

Looking back, I wouldn’t say I “mastered” AI in healthcare, but I survived it. And now, when my app’s users type in symptoms and get meaningful, timely insights, the pain feels worth it.

If you’re about to dive into the same rabbit hole, just remember: AI isn’t a magic wand. It’s a tool. Use it wisely, and maybe you’ll save yourself from debugging your life at 3 AM.