2025-11-20 17:38:31
Your CI/CD pipeline has been green for weeks. The automated tests pass, the linter is happy, deployment scripts run smoothly. Life is good. Until one day, production breaks, and you realize the automation you trusted completely missed a critical bug.
That's Automation Bias: the dangerous tendency to trust automated systems more than they deserve, even when they're wrong.
In psychology, automation bias describes our tendency to favor suggestions from automated decision-making systems and to discount contradictory information from non-automated sources. It's not about laziness; it's about how our brains handle authority, even when that authority is a piece of software.
This bias becomes most dangerous precisely when systems work well. The more reliable a system appears, the more we trust it blindly. We stop questioning, we stop verifying, we stop thinking.
In software development, automation bias shows up in every corner of our workflow.
A team's automated testing pipeline has been running successfully for months. When a critical bug slips through to production, the team initially assumes the automated tests must be correct and looks for other causes, wasting hours before realizing the tests had produced a false negative: they passed even though the code was broken.
A security team receives hundreds of automated alerts daily. After months of mostly false positives, they begin to ignore alerts or dismiss them without investigation. When a real security breach occurs, the automated system correctly flags it, but the team dismisses it as another false positive.
Developers rely heavily on automated code analysis tools to catch bugs and style issues. They become complacent, assuming the tools will catch everything. When the tools miss a subtle logic error that causes a production outage, the team realizes they've stopped doing thorough manual code reviews.
A developer uses an AI coding assistant to generate code. The AI produces syntactically correct code that looks good, so the developer accepts it without thorough review. They trust the AI's output, assuming it understands the full context and edge cases. The code works in most scenarios but fails silently in an edge case, causing a data integrity issue in production.
The bias warps our judgment. We attribute more credibility to systems than to our own expertise. We reduce cognitive load by trusting automation, but in doing so, we become lazy thinkers.
When this happens, the irony is hard to miss: the better our automation works, the worse we become at noticing when it doesn't.
Escaping automation bias begins with remembering that automation should enhance, not replace, human judgment. You can't eliminate automation, but you can design processes that keep you engaged.
Some practical ways:
Always verify critical automated decisions manually. Don't let automation make you passive in monitoring systems.
Question automated outputs regularly. Ask "What could the system be missing?" Make it a habit, not an exception.
Cross-check with multiple sources. Use different tools or methods to validate results. Don't rely on a single automated system for mission-critical decisions.
Maintain manual skills alongside automated tools. Ensure team members stay trained on manual processes even when automation is available.
Design human-in-the-loop processes. Require human confirmation for critical decisions. Make automated decision-making processes explainable.
Implement intelligent alerting that reduces noise. Manage alert fatigue before it makes you ignore real threats.
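To make the "human-in-the-loop" bullet concrete, here is a minimal sketch (function names are hypothetical, not from any specific tool) of an automated decision that surfaces its evidence and still requires explicit human confirmation before anything critical happens:

```python
def automated_check(change) -> dict:
    """Pretend automated analysis; returns a recommendation plus the evidence behind it."""
    return {"recommendation": "deploy", "confidence": 0.92,
            "evidence": ["all 1,214 tests green", "no lint warnings"]}

def human_in_the_loop_deploy(change, deploy_fn):
    result = automated_check(change)
    # Show the human the evidence, not just the verdict, so they can disagree.
    print(f"Automation recommends: {result['recommendation']} "
          f"(confidence {result['confidence']:.0%})")
    for item in result["evidence"]:
        print(f"  - {item}")
    answer = input("Type 'yes' to proceed, anything else to abort: ").strip().lower()
    if answer == "yes":
        deploy_fn(change)
    else:
        print("Aborted by human reviewer.")
```

The point is not the code itself but the shape of the process: the automation proposes, a person disposes.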
The goal is not to eliminate automation—that would be counterproductive. The goal is to keep humans engaged, skilled, and questioning. Automation should be a tool, not a crutch. Systems should enhance judgment, not replace it.
Debugging the human mind, one bias at a time.
2025-11-20 17:33:43
Welcome to our weekly digest! Here, we discuss the latest trends and advancements in account abstraction, chain abstraction and everything related, as well as bring some insights from Etherspot’s kitchen.
The latest news we'll cover: the Ethereum Foundation's “Trustless Manifesto,” Etherspot's RIP-7560 explainer, MetaMask's Multichain Accounts launch, and Odaily's analysis of Account Abstraction versus the x402 protocol.
Please fasten your seatbelts!
The Ethereum Foundation, in collaboration with Vitalik Buterin and the Account Abstraction team, has published a document titled the “Trustless Manifesto,” which has been permanently deployed on-chain.
The manifesto articulates core values of decentralisation, self-custody, verifiability and resistance to convenience-engineered centralisation. Its deployment as a smart contract, with no administrator or owner, signals a commitment to trust-minimised architecture where users can pledge adherence by calling a pledge() function.
The Account Abstraction team emphasises that the document is not merely symbolic: it reaffirms the idea that verification replaces blind trust, and every protocol design decision must consider whether it introduces unnecessary intermediaries.
The manifesto lays philosophical groundwork for account and chain abstraction builders, insisting that UX innovations not compromise permissionless access or verifiability. As wallet and multi-chain flows increase, this declaration sets a standard. If flows rely on opaque relayers or centralised bridges, they compromise Ethereum’s foundational trust model.
Etherspot published an explainer on X outlining how RIP-7560 advances Account Abstraction by shifting key mechanisms from smart-contract infrastructure into the protocol layer of rollups.
The post begins by revisiting ERC-4337, noting that it introduced UserOperations, bundlers, the shared EntryPoint contract, and optional paymasters to make wallets programmable without modifying Ethereum’s base protocol. These components enabled features such as batched actions, custom validation logic, recovery mechanisms, and the ability to transact without holding ETH.
The post explains that RIP-7560 builds on these foundations by proposing native transaction types handled directly by rollups rather than through an external EntryPoint contract and off-chain bundlers. This change moves validation, execution, and fee logic into the rollup protocol itself, reducing the number of moving parts and lowering overhead for AA workflows.
The native processing model allows smart wallets to operate more efficiently and enables rollups to standardize AA behavior across their ecosystems. This design reduces reliance on contract-based routing, potentially improving performance while maintaining compatibility with existing tooling.
The post emphasizes that RIP-7560 remains fully backward-compatible with ERC-4337. Wallets, paymasters, and bundlers built on top of 4337 can continue functioning as before, while rollups that adopt RIP-7560 will process UserOperations natively rather than through contract execution. Etherspot frames this as a meaningful evolution: ERC-4337 demonstrated that Account Abstraction works in practice, and RIP-7560 aims to establish AA as built-in infrastructure for rollups.
Follow Etherspot on X for more EIP/ERC/RIP explainers!
The MetaMask team announced the official launch of Multichain Accounts, marking a major shift in how wallet accounts are structured.
Under the new system, a single “Multichain Account” can hold parallel key sets across multiple networks: EVM chains like Ethereum, plus Solana and, soon, Bitcoin, all derived from a single seed phrase.
The blog post explains that as users increasingly transact across diverse chains, it has become untenable to manage separate addresses for each network. Multichain Accounts aim to simplify this by re-architecting the wallet “account” layer: it now groups addresses rather than creating new ones for each chain.
For builders of account abstraction and chain abstraction flows, this update matters: it reduces user friction when switching chains or onboarding across multiple networks. By lowering network-management complexity, wallets like MetaMask can provide a smoother surface for AA-powered features: sponsored transactions, cross-chain intents, unified balances, while still leveraging multichain architecture.
Odaily published an analysis examining the limitations of Ethereum’s AA and the emergence of the x402 protocol as a more practical standard for cross-chain payments. The article highlights criticisms that AA, despite years of investment in ERC-4337, Paymasters, and wallet infrastructure, has been “all talk and no action” and overly dependent on an EVM-only model.
The author notes that Paymasters shift gas costs to project teams, but the “motivation to burn money on payment is very weak,” making it difficult to maintain sustainable ROI.
The piece explains that AA depends on smart contracts, on-chain state, and EVM execution, which limits its reach beyond Ethereum-compatible environments. Attempting to extend AA to ecosystems such as Solana or Bitcoin requires additional middleware layers, increasing cost and complexity. This contributes to what the author describes as AA becoming “technology for technology’s sake,” a product of Ethereum’s earlier research-driven culture.
In contrast, the article describes the x402 protocol as relying on the longstanding HTTP 402 status code, allowing it to work across Web2 APIs, Web3 RPCs, and traditional payment gateways using only an HTTP request header. This design makes x402 a naturally cross-chain solution, where facilitators can interact with multiple chains, index user payment history uniformly, and enable developers to integrate once to serve the entire ecosystem.
The analysis argues that x402 offers a unified upstream protocol layer, reducing compatibility burdens at the application layer. Within this framework, ERC-8004 becomes an optional trust layer rather than a universal standard. By positioning ERC-8004 as “plug and play” inside the x402 ecosystem, the model avoids the top-down adoption challenges that AA faced.
We’d like to add here that account abstraction is much more than a payment request mechanism. It’s a foundation for programmable wallets, permissions, batching, and secure automation. Off-chain protocols like x402 can play a role in coordination, but they don’t replace the on-chain execution layer that AA provides.
Start exploring Account Abstraction with Etherspot!
❓Is your dApp ready for Account Abstraction? Check it out here: https://eip1271.io/
Follow us
Etherspot Website | X | Discord | Telegram | Github | Developer Portal
2025-11-20 17:30:00
Feature selection is one of the most important steps before building any machine learning model.
And one of the simplest tools to do this is correlation.
But correlation alone doesn’t tell the whole story.
To use it correctly, you also need to understand variance, standard deviation, and a few other related statistical terms.
This blog breaks everything down in the simplest way possible — no heavy maths, just practical understanding.
Correlation tells us how two numerical features move together.
Correlation ranges from –1 to +1: –1 means a perfect negative relationship, 0 means no linear relationship, and +1 means a perfect positive relationship.
In feature selection, correlation helps you answer:
“Which features are actually related to the target?”
“Which features are repeating the same information?”
If you're predicting house price, and size_in_sqft has high correlation with price, that feature is useful.
Example:
| Feature | Correlation with Price |
|---|---|
| Size (sqft) | 0.82 |
| No. of rooms | 0.65 |
| Age of house | –0.20 |
| Zip code | 0.05 |
High correlation → strong predictive power.
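A quick way to build this kind of feature-to-target table is pandas' built-in correlation. A minimal sketch (the file and column names are made up for the house-price example above):

```python
import pandas as pd

df = pd.read_csv("houses.csv")  # assumed columns: size_sqft, rooms, age, zip_code, price

# Pearson correlation of every numeric feature with the target
corr_with_price = df.corr(numeric_only=True)["price"].drop("price")
print(corr_with_price.sort_values(ascending=False))
```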
When two features are too similar, they cause multicollinearity, which confuses models (especially regression).
Example:
height and total_floors → correlation 0.95. This makes your model unstable and its coefficients hard to interpret, because both features carry essentially the same information.
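To spot redundant pairs like this programmatically, you can scan the feature-to-feature correlation matrix. A sketch (the 0.9 cutoff is just a common rule of thumb, not a hard rule):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("houses.csv")
corr = df.drop(columns=["price"]).corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Feature pairs whose absolute correlation exceeds the 0.9 threshold
redundant_pairs = (
    upper.stack()                      # (feature_a, feature_b) -> correlation
         .loc[lambda s: s > 0.9]
         .sort_values(ascending=False)
)
print(redundant_pairs)
```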
If a feature has a non-linear relationship with the target, correlation may say “0”, even when the feature is useful.
Example:
Predicting salary based on experience — relationship grows but flattens → non-linear curve.
Low correlation does not mean useless feature.
Best practice:
Include the feature anyway and check feature importance using a model-based method, for example tree-based feature importances or mutual information.
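For instance, a tree-based model can pick up non-linear relationships that plain correlation misses. A minimal sketch with scikit-learn (same hypothetical house-price data):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("houses.csv")
X = df.drop(columns=["price"]).select_dtypes("number")
y = df["price"]

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```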
Variance tells you how much the values are spread from the average.
Example:
| Values | Variance |
|---|---|
| 50, 50, 50, 50 | Very low |
| 10, 80, 120, 200 | Very high |
In feature selection:
Features with extremely low variance (almost constant features) should be removed.
Example: a column where 99% of the rows contain the same value adds almost no information, so it can be dropped.
This is called low-variance filtering.
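scikit-learn ships a transformer for exactly this. A minimal sketch (the threshold value is an assumption you should tune for your data):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.read_csv("houses.csv")
X = df.drop(columns=["price"]).select_dtypes("number")

selector = VarianceThreshold(threshold=0.01)   # drop near-constant columns
X_reduced = selector.fit_transform(X)

kept_columns = X.columns[selector.get_support()]
print(kept_columns)
```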
Standard deviation (SD) is the square root of variance.
Why do we use SD?
Because SD is in the same units as the data, so it’s easier to interpret.
Example: if house sizes have an SD of 300 sqft, a typical house differs from the average size by roughly 300 sqft.
In data science, SD is important in feature scaling (standardization) and outlier detection.
Two practical situations worth remembering:
High correlation among features (multicollinearity): the usual solution is to remove one of the correlated features (or combine them).
Using SD to standardize features: Z-score = (value – mean) ÷ SD. Standardization is used heavily in distance-based models such as KNN, K-means, and SVM. Because these models depend on distance, standardization is essential.
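A minimal sketch of z-score standardization with scikit-learn, equivalent to applying the formula above column by column:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("houses.csv")
X = df.drop(columns=["price"]).select_dtypes("number")

scaler = StandardScaler()                  # (value - mean) / SD per column
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0).round(3))      # ~0 for every column
print(X_scaled.std(axis=0).round(3))       # ~1 for every column
```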
| Concept | Meaning | Why It Matters for Feature Selection |
|---|---|---|
| Correlation | How two features move together | Helps identify useful or redundant features |
| Variance | How spread out the data is | Remove near-constant features |
| Standard Deviation | Average spread from the mean | Used in scaling and outlier detection |
| High Feature-to-Target Correlation | Strong predictor | Keep it |
| High Feature-to-Feature Correlation | Redundant | Remove one |
| Low Correlation | Not always useless | Check with ML model importance |
Feature selection is not just theory — it’s one of the most practical skills in data science.
If you understand correlation, variance, and SD, you're already ahead.
Connect on Linkedin: https://www.linkedin.com/in/chanchalsingh22/
Connect on YouTube: https://www.youtube.com/@Brains_Behind_Bots
I love breaking down complex topics into simple, easy-to-understand explanations so everyone can follow along. If you're into learning AI in a beginner-friendly way, make sure to follow for more!
2025-11-20 17:30:00
In a previous article, I shared how I transformed my old laptop into a home server. That experimental setup became the perfect testing ground for something I'd been curious about — migrating from WordPress to Hugo. What started as a weekend project has turned into this comprehensive guide.
As someone obsessed with page load speeds, WordPress has always been a mixed bag. Sure, it's powerful, but between PHP generating pages on-the-fly, database queries, and plugin overhead, achieving true speed is a constant battle.
Every tenth of a second matters for SEO and UX; studies suggest around half of visitors abandon slow-loading pages. While I'd optimized my WordPress site with OpenLiteSpeed and premium hosting, I knew there was untapped potential.
Hugo's advantages:
⚡ Static files serve faster than dynamic PHP
🔒 No database or plugins to patch constantly
✍️ Write in Markdown, deploy anywhere
🚀 Minimal maintenance overhead
The biggest concern? Migration horror stories — lost images, broken links, frustrated bloggers giving up halfway. But with the right approach, these pitfalls are avoidable.
Hugo is a static site generator that transforms Markdown into HTML. Unlike WordPress (dynamic generation on each visit), Hugo pre-builds everything once.
Key components:
Content directory: Articles and pages in Markdown
Layouts/Themes: Templates for appearance
Static directory: Images, CSS, JS served as-is
Each Markdown file starts with "front matter" — metadata similar to WordPress custom fields but cleaner:
---
title: "My Article"
date: 2025-01-20
tags: ["webdev", "performance"]
---
The "Extended" version is essential for modern themes with Sass support:
cd /usr/local/bin
wget https://github.com/gohugoio/hugo/releases/download/v0.157.0/hugo_extended_0.157.0_Linux-64bit.tar.gz
tar -xzf hugo_extended_0.157.0_Linux-64bit.tar.gz
rm hugo_extended_0.157.0_Linux-64bit.tar.gz
hugo version
cd ~/projects
hugo new site my-hugo-site
cd my-hugo-site
I chose the Hextra theme (modern, fast, and clean):
git init
git submodule add https://github.com/imfing/hextra themes/hextra
cp themes/hextra/hugo.toml ./hugo.toml
Configure hugo.toml:
baseURL = "https://your-domain.com/"
title = "Your Site Title"
theme = "hextra"
hugo server -D --bind=0.0.0.0
Visit http://localhost:1313
Dashboard → Tools → Export → All Content
This generates a WXR file with all your content. Transfer it to your Hugo server.
Rather than unreliable automated tools, here's a custom Python script using Pandoc.
First create convert_wp_to_hugo.py:
#!/usr/bin/env python3
import os, subprocess, re
from lxml import etree
from datetime import datetime

# Install dependencies first:
# apt install -y python3 python3-venv pandoc
# python3 -m venv venv && source venv/bin/activate
# pip install lxml pyyaml

INPUT_XML = "/path/to/wordpress-export.xml"
OUTPUT_DIR = "content/posts"
os.makedirs(OUTPUT_DIR, exist_ok=True)

def to_slug(s):
    return "".join(c.lower() if c.isalnum() or c in "-" else "-" for c in s).strip("-")

def yaml_safe(s: str) -> str:
    # Quote values containing YAML special characters or leading/trailing whitespace
    if re.search(r"[:#\-\?\[\]\{\},&\*!\|>'\"%@`]|^\s|\s$", s) or " " in s:
        escaped = s.replace('"', '\\"')
        return f'"{escaped}"'
    return s

root = etree.parse(INPUT_XML, parser=etree.XMLParser(encoding="utf-8"))
for item in root.xpath("//item"):
    status = item.findtext("{http://wordpress.org/export/1.2/}status") or ""
    post_type = item.findtext("{http://wordpress.org/export/1.2/}post_type") or "post"
    if status != "publish" or post_type not in {"post", "page"}:
        continue
    title = (item.findtext("title") or "Untitled").strip()
    slug = item.findtext("{http://wordpress.org/export/1.2/}post_name") or to_slug(title)
    date = item.findtext("{http://wordpress.org/export/1.2/}post_date") or datetime.utcnow().isoformat()
    html = item.findtext("{http://purl.org/rss/1.0/modules/content/}encoded") or ""

    # Convert HTML to Markdown via Pandoc
    p = subprocess.run(
        ["pandoc", "-f", "html", "-t", "gfm-smart", "--wrap=none"],
        input=html.encode("utf-8"), capture_output=True
    )
    md = p.stdout.decode("utf-8").strip()

    # Extract categories and tags
    categories = [cat.text.strip() for cat in item.findall("category")
                  if cat.get("domain") == "category" and cat.text]
    tags = [cat.text.strip() for cat in item.findall("category")
            if cat.get("domain") == "post_tag" and cat.text]

    # Build front matter (escape backslashes and quotes in the title)
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    fm = ["---", f'title: "{safe_title}"',
          f"slug: {slug}", f"date: {date}", "draft: false"]
    if categories:
        fm.append("categories:")
        for c in categories:
            fm.append(f" - {yaml_safe(c)}")
    if tags:
        fm.append("tags:")
        for t in tags:
            fm.append(f" - {yaml_safe(t)}")
    fm.extend(["---", ""])

    out_name = f"{date[:10]}-{slug}.md" if post_type == "post" else f"{slug}.md"
    with open(os.path.join(OUTPUT_DIR, out_name), "w", encoding="utf-8") as f:
        f.write("\n".join(fm) + "\n" + md + "\n")

print(f"✓ Conversion complete → {OUTPUT_DIR}")
Run it:
python3 convert_wp_to_hugo.py
Serving modern image formats (AVIF/WebP) dramatically improves performance.
Here is the code:
#!/usr/bin/env python3
import os, re, requests
from pathlib import Path
from PIL import Image
from io import BytesIO
import pillow_avif  # registers the AVIF plugin with Pillow
# pip install pillow pillow-avif-plugin requests

MD_DIR = Path("content")
OUT_DIR = Path("static/images")
OUT_DIR.mkdir(parents=True, exist_ok=True)

def save_formats(img_bytes, base_path: Path):
    """Save image in multiple formats"""
    try:
        img = Image.open(BytesIO(img_bytes))
        img.save(f"{base_path}.avif", "AVIF", quality=80)
        img.save(f"{base_path}.webp", "WEBP", quality=80)
        img.convert("RGB").save(f"{base_path}.jpg", "JPEG", quality=85)  # JPEG can't store alpha
        print(f"✓ {base_path.name}")
    except Exception as e:
        print(f"✗ Error: {e}")

# Find all image URLs in Markdown
img_pattern = re.compile(r'!\[([^\]]*)\]\((https?://[^)]+)\)')

for md_file in MD_DIR.rglob("*.md"):
    content = md_file.read_text(encoding="utf-8")
    urls = {url for _, url in img_pattern.findall(content) if url.startswith("http")}
    for url in urls:
        try:
            filename = url.split("?")[0].split("/")[-1]
            base_path = OUT_DIR / Path(filename).stem
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            save_formats(r.content, base_path)
        except Exception as e:
            print(f"Failed {url}: {e}")

print(f"\n✓ All images processed → {OUT_DIR}")
Create layouts/shortcodes/img.html:
{{ $src := .Get "src" }}
{{ $alt := .Get "alt" | default "" }}
<picture>
<source type="image/avif" srcset="{{ $src }}.avif" />
<source type="image/webp" srcset="{{ $src }}.webp" />
<img src="{{ $src }}.jpg" alt="{{ $alt }}" loading="lazy" />
</picture>
Use in Markdown:
{{< img src="/images/my-screenshot" alt="Description" >}}
Replace standard image syntax with shortcode:
#!/usr/bin/env python3
import os, re

pattern = re.compile(r'!\[([^\]]*)\]\((https?://[^)]+)\)')

def replacement(match):
    alt_text, url = match.groups()
    image_name = os.path.splitext(os.path.basename(url))[0]
    return f'{{{{< img src="/images/{image_name}" alt="{alt_text}" >}}}}'

for root, _, files in os.walk("content"):
    for filename in files:
        if not filename.endswith('.md'):
            continue
        filepath = os.path.join(root, filename)
        with open(filepath, 'r', encoding='utf-8') as f:
            content = f.read()
        new_content = pattern.sub(replacement, content)
        if new_content != content:
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(new_content)
            print(f"✓ Updated: {filepath}")
Build and deploy your static site:
hugo # Generates site in public/
cp -r public/* /var/www/html/
Your site is now pure HTML — no PHP, no database queries, just blazing-fast static content.
Backup everything: Full WordPress files + database export. Test restore separately.
Plan URL structure: Set up redirects from old WordPress URLs to the Hugo structure for SEO (see the front-matter sketch after this list).
Test thoroughly: Multiple browsers and devices, especially mobile performance.
Lost features: WordPress search, comments, contact forms need alternatives (Algolia, Disqus, Netlify Forms).
Security: Even static sites need proper SSL and security headers.
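One simple way to handle those redirects is Hugo's built-in aliases front matter: list each old WordPress URL in a post's front matter and Hugo generates a redirect page at that path. You could extend the export script above to emit them; a hedged sketch of the extra lines, assuming your old permalinks followed the /YYYY/MM/post-name/ pattern (adjust if yours differed):

```python
# Inside the export loop, after `fm` is built (sketch; assumes the old
# WordPress permalink structure was /YYYY/MM/post-name/):
old_url = f"/{date[:4]}/{date[5:7]}/{slug}/"
fm.append("aliases:")
fm.append(f" - {old_url}")
```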
For maximum speed, use partialCached for expensive template operations.
"try not defined" errors: Hugo version too old. Update to the latest Extended version.
404 on production: Build with hugo and copy public/ folder to web server.
Can't access from network: Add --bind=0.0.0.0 and check firewall settings.
After migrating to Hugo, my site loads in under 200ms — a dramatic improvement from WordPress. Page speeds that once required extensive optimization now come naturally.
The maintenance burden has practically disappeared. No more plugin updates, security patches, or database optimization. Just write in Markdown and deploy.
After completing this migration and running extensive performance tests (detailed in my previous article "WordPress vs Hugo: When Reality Challenges the Speed Myths"), I'll be honest: Hugo didn't convince me as much as I expected.
The performance gains? On paper, Hugo should dominate. In practice, with a properly optimized WordPress setup (OpenLiteSpeed + LS Cache), the difference was only 32ms for content-heavy pages. Hugo's advantage becomes clear at massive scale, but for most sites, a well-configured WordPress can match it.
What I learned:
Was it worth it? As an experiment, absolutely. I learned valuable lessons about web performance, infrastructure, and the real trade-offs between platforms. The migration process itself was enriching.
Should you migrate? Only if you value simplicity over flexibility, are comfortable with Markdown workflows, and don't need dynamic features. A properly optimized WordPress might serve you just as well.
The future probably belongs to hybrid approaches anyway - static generation where it makes sense, dynamic features where needed. Both platforms are evolving toward this middle ground.
Have questions about the migration or want to discuss the performance results? Drop them in the comments below!
Follow me for more web development tutorials and infrastructure guides!
📬 Want essays like this in your inbox?
I just launched a newsletter about thinking clearly in tech — no spam, no fluff.
Subscribe here: https://buttondown.com/efficientlaziness
Efficient Laziness — Think once, well, and move forward.
2025-11-20 17:25:45
Complex systems require extensive monitoring and observability. Systems as complex as Kubernetes clusters have so many moving parts that sometimes it's a task and a half just to configure their monitoring properly. Today I'm going to talk in depth about cross-account observability for multiple EKS clusters, explore various implementation options, outline the pros and cons of each approach, and explain one of them in close detail. Whether you’re an aspiring engineer seeking best-practice advice, a seasoned professional ready to disagree with everything, or a manager looking for ways to optimize costs -- this article might be just right for you.
What is good observability? Good observability answers questions like: Is the cluster itself healthy? Is the application running on it healthy? Are users actually getting the experience they expect?
As you can see, apart from the health of the cluster itself, good observability monitors the health of the application and user-facing metrics as well. So far, we can highlight two types of data in our cluster: cluster operational metrics and application metrics.
But is that all? Didn't we forget something?
How are we going to find out whether we have issues with our services if our monitoring is down? Exactly -- we can't. So we have to introduce meta-monitoring metrics: the data about the health of the monitoring system itself.
And now we have three types of metrics.
The bare minimum set of data we can scrape from our cluster comes from the Prometheus Node Exporter (or VictoriaMetrics vmagent). This is the most technical data -- CPU, memory, network latencies, temperature, you name it. It doesn't get more technical than this.
But should we stop here? Technical data about our system shows only our part of the setup -- the health of the underlying system. It's extremely helpful when something goes very wrong with our service, but it doesn't show the full picture. To see that, we also need visibility into the customer-facing side of our service. We need application metrics: request latency, number of responses, application errors. The good thing is that exposing a metrics endpoint within our application may be enough -- Prometheus will scrape this endpoint by itself without additional daemons.
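In a Python service, for example, exposing such an endpoint can be as small as a few lines with the official prometheus_client library (a sketch; the metric names and the fake workload are made up):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["status"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():                       # observe request duration
        time.sleep(random.uniform(0.01, 0.1))  # pretend to do work
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```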
Something's missing, right? For example, where can we see application errors? Not just the error codes or counts, but actual error messages?
Sure, kubectl logs my-pod is cool and all, but a production-ready app is rarely a single pod (at least I hope so, for most of us).
So we think about log collection as well. This adds yet another pod to each node we have: a log collector agent.
Is the picture complete now?
Not quite. As mentioned previously, to check whether our monitoring setup is even alive, it would be nice to have some sort of external monitoring -- the monitoring of the monitoring. Monitoringception. Since the sole purpose of this external service is to answer the question of whether the monitoring is alive, it can be a very simple setup: a daemon that queries the monitoring endpoint. Because this daemon must remain alive even when the whole system is down, it has to run outside the perimeter of the application workload -- on an external EC2 instance outside the EKS cluster. To achieve the maximum independence of the system this service can be set up in an entirely different cloud provider or even on a dedicated VPS in a datacenter on another continent (or in your home lab).
It can even be done as a dead-man switch: an alert in the monitoring system that should always be in the "failing" state and should turn green only if the monitoring system is down. Don't like a constant influx of alerts? Set up a notification for when data hasn't been received for X minutes; that should do the trick as well.1
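A minimal sketch of such an external watchdog, assuming the monitoring stack exposes a health endpoint (the URL and webhook below are placeholders, not part of the setup described in this article):

```python
#!/usr/bin/env python3
import time

import requests

MONITORING_HEALTH_URL = "https://prometheus.internal.example.com/-/healthy"  # placeholder
ALERT_WEBHOOK = "https://hooks.example.com/notify"                           # placeholder

def monitoring_is_up() -> bool:
    try:
        r = requests.get(MONITORING_HEALTH_URL, timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False

while True:
    if not monitoring_is_up():
        # The monitoring system itself is down -- raise the alarm out-of-band.
        requests.post(ALERT_WEBHOOK,
                      json={"text": "Monitoring endpoint is unreachable!"},
                      timeout=10)
    time.sleep(60)
```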
Now let's crank it up a notch.
This is what a service architecture might look like. But in modern development, it rarely is like this. Mature businesses typically divide their infrastructure by environments -- at the very least, development and production. They might also include, but are not limited to, staging, UAT, etc.
So we spin up both environments in the same cluster, but in different namespaces, right?
This is a bad practice, as resources are still shared between both applications, and one can affect the other in ways we wouldn't want.
The other option is to spin up another environment in the same AWS account, but on a different cluster. Is this a good option? Most likely, yes. Buuuut the Well-Architected Framework and other guidelines advise against sharing a single account between several environments. Full environment separation is better for both account and application security.
So now we have two AWS accounts.
It was fairly simple to organize observability within a single account -- the networks were contained, enclosed in a single perimeter, and available directly from one endpoint to another. Having several accounts introduces the complexity of cross-account networking and finally brings us to the actual topic of this article.
As with every task in cloud operations, this topic can be approached in several ways: VPC Peering, AWS Transit Gateway, NLBs with VPC Endpoint Services, AWS VPN / Direct Connect, or simply exposing the backends to the public internet behind authorization and TLS.
Let's discuss each of them briefly.
NOTE
For the examples below, the cost of each implementation will include only additional charges (i.e., networking and service charges). The price of the monitoring services (Prometheus + storage, Grafana, etc.) will not be referenced.
NOTE
A table summarizing all the options with a detailed price breakdown and implementation complexity comparison can be found in the annex.
This option assumes creating multiple VPC Peering connections from the root account to branched accounts. It is quite cheap and may work well for small organizations with one or two accounts. It may also work if you only plan to create your infrastructure, because it requires CIDR planning: VPC CIDRs cannot overlap with this option. On top of that, you will have to configure route tables manually for each account. On the bright side: same-AZ traffic is free, and VPC peering is also free.
This setup looks more like an enterprise-scale solution: better convenience, higher price. It allows for centralized management of the observability networking setup but requires extensive route planning. It is, however, very scalable by utilizing Transit Gateways, which can be attached to a large number of accounts. It can also be shared across organization accounts via Resource Access Manager (one more point for house Enterprise).
The third option, almost the golden mean, involves exposing services via Network Load Balancers and Endpoint Services and connecting to them via VPC Endpoints. It is cheaper than the previous option but more expensive than the first: NLBs and VPC Endpoints cost money simply to run. On the other hand, it doesn't limit the number of accounts that can be connected, doesn't require route table configuration, keeps traffic off the public internet, and with one-to-many connections you only need to configure a small number of endpoints and endpoint services.
This option is overkill unless you have a dedicated NOC department, because it has all the fun: full network connectivity, encrypted traffic, interface configuration, BGP configuration, and many associated charges for traffic and connection speed.
And the final option: the very basic "just shove it onto the public internet and slap a login page on top of it" approach. Super simple to set up! Just spin up an ALB in front of the Prometheus instance, season it with TLS certificates, and expose it to the public internet. Super unsafe! Works for hobby and experimental projects. Please don't use it in a production environment (unless you are absolutely certain -- but still, please don't).
As you can see, there are several options to consider and choose from. All of them have their pros and cons. For my scenario, I ended up choosing to set up NLBs with VPC Endpoints.
Why?
The first reason is that it's relatively easy to implement. To set up a connection between accounts, we need only an endpoint service and endpoints in the other accounts (plural). They connect in a one-to-many relationship: one endpoint service can handle several connecting endpoints.
The second reason is that I had overlapping CIDRs in each account, so VPC Peering was immediately off the table.
The third reason is that it has moderately less configuration overhead. I won't have to configure client connections or route tables. Proper routing between multiple accounts is a very precise art at which I most certainly suck.
And the last reason is that it's still quite secure. The traffic from services doesn't leave the AWS network, doesn't traverse the public internet, and connections are only allowed after approval (it can be configured to auto-approve, though, which may be convenient -- but healthy paranoia is my constant companion).
The whole system is actually not that overcrowded.
For the sake of this article, let's assume we have three products: Bulba, Char, and Squi. Each product has two environments: normal and sparkling. We also have a root account (let's call it Ketch) for organization management and, for simplicity, observability as well. This gives us seven AWS accounts and seven EKS clusters. Why seven and not six? Well, the root account also runs an EKS cluster as an observability backend.
Each product account (Bulba, Char, and Squi) includes the following elements:
The root account (Ketch) also contains several elements:
That's quite a lot of components to keep track of. Luckily, we have IaC to ease the configuration and deployment of these components.
To understand why so many components are needed (particularly the endpoint service-endpoint pairs), we need to look at the traffic flow model.
Zooming in a bit, we can see the exact details of how the connection is configured.
Since our goal is to keep metrics data secure and prevent it from traversing public subnets or the public internet, we create an endpoint service. This service acts as an open receiver on one account (the product account) for the connector on the root account. This creates a 1-to-1 connection that is both secure and governed, since requests to connect to the endpoint service must be approved.
The setup for metrics differs from the setup for logs and traces.
Prometheus uses a pull model: the backend queries the endpoints for data, effectively pulling the data from them. Loki and Tempo, however, use a push model: deployed logs and traces collection pods send (push) the data to the centralized backend.
In this case, the endpoint service/endpoint pair is reversed and simpler: only one endpoint service per service (two in total) is created in the root account, and all product accounts create endpoints and request connections with the root endpoint service.
And this is how the accounts look in the end:
TIP
For consistency, you might want to use a metrics-gathering service with a push model (e.g., the aforementioned VictoriaMetrics vmagent). This way, endpoint services are only created in the root account, and product accounts only have endpoints.
As mentioned previously, the solution I chose in this article is not the only correct one out there. For this setup, I had specific requirements that needed to be fulfilled, as well as particular implementation caveats to consider.
My example is definitely not the cheapest option. The cheapest would be using VPC Peering. But unfortunately, my existing setup -- with the same CIDRs in EKS clusters and the possibility to extend beyond those six (+1) AWS accounts -- made this option unavailable.
The described setup is also located in a single region -- in cross-region data transfer scenarios, costs can increase drastically and very quickly. For each additional region, an additional VPC endpoint/service will have to be created.
There's also always the possibility to just expose the backend ports to the public internet (with proper authorization, of course!) and not bother with endpoint configuration at all. But even with TLS-encrypted traffic, this is a rather unsafe option and will absolutely not help you pass any SOC 1/SOC 2/ISO 27001 certification.
This was a very interesting challenge to tackle and implement -- I had an absolute blast setting up the POC and confirming that it works. I was excited by the variety of options I could choose from and the differences between the tools available in these scenarios.
And I think it's beautiful. It not only shows the complexity of AWS services (which can sometimes be a downside), but also that there's always more than one solution to each problem. Every engineer will approach a challenge differently -- which, in my humble opinion, means our jobs are secure for the observable future.
Thank you for reading, and see you in the next one!
| Option | CIDRs can overlap | Scalability | Implementation Complexity | Price |
|---|---|---|---|---|
| VPC Peering | X | ●○○ | ●●○○ | $ |
| AWS Transit Gateway | ✓ | ●●● | ●●●○ | $$$ |
| AWS NLBs + VPC Endpoints + Services | ✓ | ●●○ | ●●○○ | $$ |
| AWS VPN / Direct Connect | ✓ | ●●● | ●●●● | $$$-$$$$ |
| Public internet + Authorization + TLS | ✓ | ●●● | ●○○○ | $$ |
The variety and complexity of the outlined options deserve more in-depth research than fits within the scope of this article, and I agree with that. To make your life easier and to decode the cryptic $ signs in the table above, I decided to create an example price breakdown with some specific numbers, based on which you can at least roughly figure out whether an option is suitable for you or not.
To keep the numbers comparable, we will assume the following prerequisites:
Total Monthly Cost: $528.6
Total Monthly Cost: $634.6
Total Monthly Cost: $590.4
Total Monthly Cost: $633.6
Total Monthly Cost: $670.63
Total Monthly Cost: $563.6
For pricing calculations, I used the following AWS resources:
- Amazon VPC pricing
- Elastic Load Balancing pricing
- Amazon EC2 On-Demand Pricing
- Amazon EBS pricing
- AWS Pricing Calculator
As well as this handy AWS EC2 instance and price comparison tool:
The code presented in this section does not represent a fully working infrastructure. It highlights only the most relevant parts of the PoC implementation. It has not been tested and serves solely as a reference. You can copy it, but without additional initial setup (at minimum a VPC and an EKS cluster) and further adjustments, it will not work. Use at your own risk!
# variables.tf
variable "eks_nlb_endpoint_services" {
description = "EKS NLB endpoint services configuration"
type = map(object({
nlb_arn = list(string)
allowed_principals = list(string)
}))
}
variable "loki_service_name" {
description = "VPC Endpoint Service name for Loki"
type = string
}
variable "tempo_service_name" {
description = "VPC Endpoint Service name for Tempo"
type = string
}
variable "vpc_id" {
type = string
}
variable "private_subnet_ids" {
type = list(string)
}
variable "name_prefix" {
type = string
}
# main.tf
resource "aws_vpc_endpoint_service" "nlb_endpoint_services" {
for_each = var.eks_nlb_endpoint_services
acceptance_required = true
allowed_principals = each.value.allowed_principals
network_load_balancer_arns = each.value.nlb_arn
}
resource "aws_security_group" "observability" {
name = "${var.name_prefix}-observability"
description = "Allow traffic from observability resources"
vpc_id = var.vpc_id
}
resource "aws_vpc_security_group_ingress_rule" "loki" {
security_group_id = aws_security_group.observability.id
description = "Allow traffic from observability resources: Loki"
cidr_ipv4 = "10.0.0.0/16"
from_port = 8080
to_port = 8080
ip_protocol = "tcp"
}
resource "aws_vpc_security_group_ingress_rule" "tempo" {
security_group_id = aws_security_group.observability.id
description = "Allow traffic from observability resources: Tempo"
cidr_ipv4 = "10.0.0.0/16"
from_port = 4317
to_port = 4318
ip_protocol = "tcp"
}
resource "aws_vpc_endpoint" "loki" {
vpc_id = var.vpc_id
service_name = var.loki_service_name
vpc_endpoint_type = "Interface"
security_group_ids = [aws_security_group.observability.id]
subnet_ids = var.private_subnet_ids
private_dns_enabled = false
}
resource "aws_vpc_endpoint" "tempo" {
vpc_id = var.vpc_id
service_name = var.tempo_service_name
vpc_endpoint_type = "Interface"
security_group_ids = [aws_security_group.observability.id]
subnet_ids = var.private_subnet_ids
private_dns_enabled = false
}
# terraform.tfvars
eks_nlb_endpoint_services = {
prometheus = {
nlb_arn = [
"arn:aws:elasticloadbalancing:us-west-2:${data.aws_caller_identity.current.account_id}:loadbalancer/net/xxxx-yyyy/zzzz"
]
allowed_principals = ["arn:aws:iam::123456789012:root"]
}
}
loki_service_name = "com.amazonaws.vpce.us-west-2.vpce-svc-qwertyuiop"
tempo_service_name = "com.amazonaws.vpce.us-west-2.vpce-svc-qwertyuiop"
vpc_id = "vpc-xxxxx"
private_subnet_ids = ["subnet-xxxxx", "subnet-yyyyy"]
name_prefix = "char"
# variables.tf
variable "account_id" {
type = string
}
variable "name_prefix" {
type = string
}
variable "vpc_id" {
type = string
}
variable "private_subnet_ids" {
type = list(string)
}
variable "allowed_principals" {
description = "Allowed AWS account principals for VPC endpoint services"
type = list(string)
}
variable "loki_nlb_arns" {
type = list(string)
}
variable "tempo_nlb_arns" {
type = list(string)
}
variable "char_normal_prometheus_vpcesvc_name" {
description = "VPC Endpoint Service name for Prometheus; normal account"
type = string
}
variable "char_sparkling_prometheus_vpcesvc_name" {
description = "VPC Endpoint Service name for Prometheus; sparkling account"
type = string
}
# s3.tf
resource "aws_s3_bucket" "loki_chunks" {
bucket = "${var.name_prefix}-${var.account_id}-loki-chunks"
}
resource "aws_s3_bucket" "loki_ruler" {
bucket = "${var.name_prefix}-${var.account_id}-loki-ruler"
}
resource "aws_s3_bucket" "tempo" {
bucket = "${var.name_prefix}-${var.account_id}-tempo-traces"
}
# iam.tf
resource "aws_iam_policy" "loki_buckets" {
name = "${var.name_prefix}-loki-buckets"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "LokiBuckets"
Effect = "Allow"
Action = [
"s3:ListBucket",
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject"
]
Resource = [
aws_s3_bucket.loki_chunks.arn,
"${aws_s3_bucket.loki_chunks.arn}/*",
aws_s3_bucket.loki_ruler.arn,
"${aws_s3_bucket.loki_ruler.arn}/*"
]
}
]
})
}
resource "aws_iam_role" "loki_pod_identity" {
name = "${var.name_prefix}-loki-pod-identity"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowEksAuthToAssumeRoleForPodIdentity"
Effect = "Allow"
Principal = {
Service = "pods.eks.amazonaws.com"
}
Action = [
"sts:AssumeRole",
"sts:TagSession"
]
}
]
})
}
resource "aws_iam_role_policy_attachment" "loki_pod_identity" {
role = aws_iam_role.loki_pod_identity.name
policy_arn = aws_iam_policy.loki_buckets.arn
}
resource "aws_iam_policy" "tempo_bucket" {
name = "${var.name_prefix}-tempo-bucket"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "LokiBuckets"
Effect = "Allow"
Action = [
"s3:ListBucket",
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject",
"s3:GetObjectTagging",
"s3:PutObjectTagging"
]
Resource = [
aws_s3_bucket.tempo.arn,
"${aws_s3_bucket.tempo.arn}/*"
]
}
]
})
}
resource "aws_iam_role" "tempo_pod_identity" {
name = "${var.name_prefix}-tempo-pod-identity"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowEksAuthToAssumeRoleForPodIdentity"
Effect = "Allow"
Principal = {
Service = "pods.eks.amazonaws.com"
}
Action = [
"sts:AssumeRole",
"sts:TagSession"
]
}
]
})
}
resource "aws_iam_role_policy_attachment" "tempo_pod_identity" {
role = aws_iam_role.tempo_pod_identity.name
policy_arn = aws_iam_policy.tempo_bucket.arn
}
resource "aws_eks_pod_identity_association" "loki" {
cluster_name = var.name_prefix
namespace = "monitoring"
service_account = "loki"
role_arn = aws_iam_role.loki_pod_identity.arn
}
resource "aws_eks_pod_identity_association" "tempo" {
cluster_name = var.name_prefix
namespace = "monitoring"
service_account = "tempo"
role_arn = aws_iam_role.tempo_pod_identity.arn
}
# vpc.tf
resource "aws_vpc_endpoint_service" "loki" {
acceptance_required = true
allowed_principals = var.allowed_principals
network_load_balancer_arns = var.loki_nlb_arns
}
resource "aws_vpc_endpoint_service" "tempo" {
acceptance_required = true
allowed_principals = var.allowed_principals
network_load_balancer_arns = var.tempo_nlb_arns
}
resource "aws_vpc_endpoint" "char-normal-prometheus" {
vpc_id = var.vpc_id
service_name = var.char_normal_prometheus_vpcesvc_name
vpc_endpoint_type = "Interface"
security_group_ids = [aws_security_group.observability.id]
subnet_ids = var.private_subnet_ids
private_dns_enabled = false
}
resource "aws_vpc_endpoint" "char-sparkling-prometheus" {
vpc_id = var.vpc_id
service_name = var.char_sparkling_prometheus_vpcesvc_name
vpc_endpoint_type = "Interface"
security_group_ids = [aws_security_group.observability.id]
subnet_ids = var.private_subnet_ids
private_dns_enabled = false
}
resource "aws_security_group" "observability" {
name = "observability"
description = "Allow traffic from observability resources"
vpc_id = var.vpc_id
}
resource "aws_vpc_security_group_ingress_rule" "prometheus" {
security_group_id = aws_security_group.observability.id
cidr_ipv4 = "10.0.0.0/16"
from_port = 9090
to_port = 9090
ip_protocol = "tcp"
}
# terraform.tfvars.example
account_id = "123456789012"
name_prefix = "ketch"
vpc_id = "vpc-xxxxx"
private_subnet_ids = ["subnet-xxxxx", "subnet-yyyyy"]
loki_nlb_arns = ["arn:aws:elasticloadbalancing:us-west-2:${var.account_id}:loadbalancer/net/xxxx-yyyy/zzzz"]
tempo_nlb_arns = ["arn:aws:elasticloadbalancing:us-west-2:${var.account_id}:loadbalancer/net/xxxx-yyyy/zzzz"]
char_normal_prometheus_vpcesvc_name = "com.amazonaws.vpce.us-west-2.vpce-svc-xyz"
char_sparkling_prometheus_vpcesvc_name = "com.amazonaws.vpce.us-west-2.vpce-svc-xyz"
allowed_principals = [
"arn:aws:iam::111111111111:root",
"arn:aws:iam::222222222222:root"
]
It doesn’t matter all that much which approach you choose to implement for monitoring your monitoring. In light of recent events, some folks even created a downdetector for a downdetector’s downdetector. I mean, it’s hilariously fun, but the key point remains solid: you need to know whether your eyes and ears (infrastructure-wise) are even working. ↩
We assume 100% cross-AZ traffic in this example to maximize potential traffic costs and avoid complicating the calculations with percentages of same-AZ versus cross-AZ traffic. ↩
Direct Connect may also require a specific partner to enable and perform the physical connection to the AWS network, so expect to add a few hundred (or even thousands) of dollars on top of the initial bill for setup. ↩
2025-11-20 17:23:54
I thought plugging AI into a health app would be a weekend project. Spoiler: it wasn’t. It was messy, frustrating, and at one point, I wondered if my laptop fans could legally qualify as medical devices because of how hard they were working.
But here’s the thing: healthcare is already leaning on AI harder than most industries. According to Grand View Research, the global AI in healthcare market was valued at $22.45 billion in 2023 and is expected to grow at 37% annually through 2030. That’s not just hype, it’s reality. If you’re building or maintaining a health app today, the question isn’t whether you should integrate AI. It’s how soon can you do it without breaking everything?
In this guide, I'll walk you through everything I learned about adding AI to an existing health app.
Step 1: Admit That Your App Isn’t Ready for AI
When I first looked at my health app’s codebase, it felt like inviting a brain surgeon to operate in a garage. The app was functional: calorie tracking, step counts, and some reminders, but architecturally, it wasn’t ready for machine learning models.
Here’s what I had to do first for the AI integration in healthcare apps:
Step 2: Pick the Right AI Use Case (Not the Shiny One)
The temptation? Predicting diseases like some sci-fi oracle. The reality? I didn’t have the data (or regulatory clearance) for that.
So I started smaller. I integrated an AI-powered symptom checker that could take user inputs in plain English and map them to potential health insights. Why this worked:
Lesson: Choose a use case that matches both your data maturity and user needs. If you aim too high, you’ll spend six months tweaking models no one will ever see.
Step 3: Build the Pipeline (aka Where I Broke Everything)
This was the most painful part. You don’t just “add AI” like a WordPress plugin. I needed an actual pipeline:
The first time I deployed, the model was so slow it made my app feel like dial-up internet. Users would type “headache” and get results ten seconds later. Not exactly confidence-inspiring.
What fixed it? Offloading heavy computation to the cloud and only keeping lightweight inference on-device. That balance is critical if you want to avoid frustrating your users.
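Here's a rough sketch of that split, with hypothetical names (classify_locally, CLOUD_INFERENCE_URL are illustrations, not my actual endpoints): answer the easy, common queries on-device and only call the cloud for the heavy model.

```python
import requests

CLOUD_INFERENCE_URL = "https://api.example-health.com/v1/symptom-check"  # hypothetical endpoint

def classify_locally(text: str):
    """Tiny on-device lookup/model: handles the easy, common cases instantly."""
    common = {
        "headache": "hydration, rest, consider OTC pain relief",
        "fatigue": "review sleep habits; check iron/B12 if it persists",
    }
    return common.get(text.strip().lower())

def check_symptoms(text: str) -> str:
    local = classify_locally(text)
    if local is not None:
        return local  # fast path, no network round-trip
    # Heavy path: send the query to the cloud-hosted model
    resp = requests.post(CLOUD_INFERENCE_URL, json={"query": text}, timeout=5)
    resp.raise_for_status()
    return resp.json()["insight"]
```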
Step 4: Test Like You’re a Paranoid Doctor
Healthcare apps don’t get the same forgiveness as social apps. If your AI makes a mistake, people panic.
Here’s how I tested mine:
I also pulled in a small circle of test users (read: friends and family) to break the system. One typed “I ate 50 bananas in an hour” just to see what would happen. It turns out models don’t like absurd diets either.
Step 5: Handle Privacy Before It Handles You
This one nearly derailed me. Collecting health data means you’re holding a ticking legal time bomb if you’re not careful.
What I learned:
I ended up spending more time on compliance than on coding. Boring? Yes. Necessary? Absolutely.
Step 6: Know When to Get Help
Somewhere between debugging preprocessing scripts and trying to optimize model latency, I realized I was way out of my depth. That’s when I started looking into outside help from teams that do this full-time.
Step 7: Launch Small, Learn Fast
When I finally rolled out the AI feature, I didn't blast it to every user. I launched a beta. That way, feedback trickled in from a manageable group, and I could iterate without fear of a meltdown. Early users pointed out quirks I hadn't even considered:
Each round of feedback made the feature sharper and safer.
What I’d Tell You If You’re About to Try This
Integrating AI into a health app isn’t just a technical challenge; it’s a balancing act between user trust, regulatory compliance, and technical feasibility.
If you’re thinking about it, here’s my blunt advice:
Looking back, I wouldn’t say I “mastered” AI in healthcare, but I survived it. And now, when my app’s users type in symptoms and get meaningful, timely insights, the pain feels worth it.
If you’re about to dive into the same rabbit hole, just remember: AI isn’t a magic wand. It’s a tool. Use it wisely, and maybe you’ll save yourself from debugging your life at 3 AM.