MoreRSS

site iconThe Practical DeveloperModify

A constructive and inclusive social network for software developers.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of The Practical Developer

CinemaSins: Everything Wrong With The Wiz In 15 Minutes Or Less

2025-11-23 06:02:29

TL;DR

CinemaSins takes a fresh look at The Wiz (now that Wicked is back in theaters) with a rapid-fire “Everything Wrong With The Wiz In 15 Minutes Or Less” video. They highlight the sins of the film, introduce their writers, and point you toward their YouTube channels, social media, and website for more content.

They also invite fans to fill out a quick poll, join their Discord and Reddit communities, and support the team on Patreon for exclusive perks. Follow them on Twitter, Instagram, TikTok, and more for daily movie critiques and behind-the-scenes fun.

Watch on YouTube

CinemaSins: Everything Wrong With KPop Demon Hunters In 16 Minutes Or Less

2025-11-23 06:02:22

Everything Wrong With KPop Demon Hunters In 16 Minutes Or Less

CinemaSins serves up their signature snark in a bite-sized roast of the new KPop Demon Hunters movie, rattling off every plot hole, trope and over-the-top moment in just 16 minutes. Dive deeper at cinemasins.com or catch more sin-filled content on YouTube via TVSins, CommercialSins and the CinemaSins Podcast Network.

Hungry for more? Hit their Linktree for polls, Patreon support and all the socials—Twitter, Instagram, TikTok, Discord and Reddit. Big shout-out to sin scribes Jeremy, Chris, Aaron, Jonathan, Deneé, Ian and Daniel for keeping the cinematic guilt trip hilarious.

Watch on YouTube

How to Achieve 100/100 on PageSpeed Insights

2025-11-23 05:55:52

Reading time: 9 minutes

Why write about PageSpeed optimization in 2025?

PageSpeed optimization isn't new. Google released Lighthouse in 2016, and performance best practices have been documented for years. You might expect that by 2025, most professional websites would score well.

The opposite is true.

Modern websites are often slower than sites built five years ago. Despite faster networks and more powerful devices, the average website in 2025 is heavier, more complex, and performs worse than its predecessors. Why?

The complexity creep:

  • Modern frameworks ship more JavaScript by default
  • Analytics, marketing pixels, and chat widgets accumulate over time
  • High-resolution images and videos are now standard
  • Third-party integrations (payment processors, CRMs, booking systems) each add overhead
  • "It works on my machine" testing misses real-world mobile performance

I regularly encounter professionally built websites from 2024-2025 scoring 40-70/100. These aren't old legacy sites—they're new builds using modern tools, costing thousands of dollars, from established agencies.

Why does achieving 100/100 matter?

You might ask: "Isn't 85/100 good enough? Does the last 15 points really matter?"

The answer depends on what you're optimizing for.

The business case:

Google's research shows that 53% of mobile users abandon sites taking longer than 3 seconds to load. Each additional second of load time correlates with approximately 7% reduction in conversions. These aren't small numbers—they directly affect revenue.

For a business with 10,000 monthly visitors and a 3% conversion rate:

  • A fast site (2 seconds, 100/100 score): 300 conversions/month
  • A slow site (8 seconds, 60/100 score): Approximately 210 conversions/month

That's 90 lost conversions monthly. At $100 average transaction value, that's $108,000/year in lost revenue.

The SEO case:

Core Web Vitals became Google ranking factors in 2021. Two identical sites with identical content will rank differently based on performance. The faster site gets more organic traffic. More traffic means more conversions. The performance advantage compounds over time.

The user experience case:

Beyond metrics, there's a qualitative difference between a site that scores 85/100 and one that scores 100/100. The 100/100 site feels instant. Content appears immediately. Nothing jumps around. Users trust it more, engage more, and return more often.

The competitive advantage:

If your competitors score 60-70/100 and you score 100/100, you've created a measurable advantage in user experience, search rankings, and conversion rates. In competitive markets, these margins matter.

So yes, the last 15 points matter—not for the score itself, but for the business outcomes those points represent.

What does a perfect PageSpeed score require?

Most developers know the basics—optimize images, minify CSS, reduce JavaScript. But knowing the basics and achieving 100/100 are different things. The gap between 85/100 and 100/100 isn't about doing more of the same. It requires understanding which techniques have the most impact and implementing them correctly.

I've built multiple sites scoring 100/100/100/100 across all four metrics (Performance, Accessibility, Best Practices, SEO). In this guide, I'll explain the specific techniques that matter most, why they work, and what to watch out for.

You'll learn:

  • How critical CSS inlining affects render times
  • Why image format choice matters more than compression level
  • How to eliminate render-blocking resources
  • What causes layout shift and how to prevent it
  • How to test performance under real-world conditions

Before we start, a reality check: getting to 100/100 takes time the first time through—typically 4-6 hours for a complete site. Once you understand the patterns, subsequent optimizations go faster. But there's no shortcut around learning what works.

Why does critical CSS inlining improve scores?

The problem: When browsers load a page, external CSS files block rendering. The browser downloads your HTML, encounters <link rel="stylesheet">, pauses rendering, downloads the CSS, parses it, then finally renders the page. This delay affects your Largest Contentful Paint (LCP) score significantly.

The solution: Inline your critical CSS directly in the <head> tag. The browser can render immediately without waiting for external files.

A real example: A professional services site I optimized scored 94/100 before CSS inlining. After moving critical styles inline, it scored 100/100. The only change was moving approximately 3KB of above-the-fold CSS into the HTML head.

Here's what that structure looks like:

<head>
  <style>
    /* Critical CSS - inline everything needed for initial render */
    body { margin: 0; font-family: system-ui, sans-serif; }
    header { background: #1a1a1a; color: white; padding: 1rem; }
    .hero { min-height: 400px; background: linear-gradient(...); }
  </style>

  <!-- Load non-critical CSS asynchronously -->
  <link rel="stylesheet" href="/full-styles.css" media="print" onload="this.media='all'">
</head>

What to watch for: Keep inline CSS under 8KB. Beyond this size, you're delaying HTML download time, which can actually hurt your First Contentful Paint score instead of helping it. Extract only the styles needed for above-the-fold content.

Framework considerations: If you're using Nuxt, Next.js, or similar frameworks, look for build-time CSS extraction features. Nuxt's experimental.inlineSSRStyles handles this automatically during static generation.

How do modern image formats reduce file size?

The problem: Images typically account for 60-80% of page weight. Unoptimized images directly affect load times, especially on mobile networks.

The solution: Use AVIF format where supported, with WebP fallback, and serve appropriately sized images for different viewports.

A real example: I built a healthcare website (medconnect.codecrank.ai) with professional medical imagery and team photos. Initial image exports totaled approximately 2.5MB per page. After optimization:

  • Format conversion: Changed from PNG/JPG to AVIF (50-80% smaller at equivalent quality)
  • Responsive sizing: Served 400px images for mobile, 1200px for desktop (not full 4K resolution for all devices)
  • Result: Lightweight page weight with excellent visual quality
  • Performance score: 100/100/100/100 (perfect scores across all metrics)

The implementation:

<picture>
  <source srcset="hero-400.avif 400w, hero-800.avif 800w, hero-1200.avif 1200w"
          type="image/avif">
  <source srcset="hero-400.webp 400w, hero-800.webp 800w, hero-1200.webp 1200w"
          type="image/webp">
  <img src="hero-800.jpg"
       alt="Hero image"
       width="1200"
       height="800"
       loading="lazy">
</picture>

How browsers handle this: Modern browsers automatically select the best format and size they support. Chrome uses AVIF, Safari uses WebP (AVIF support pending), and older browsers fall back to JPG. Mobile devices get 400px versions, desktop gets 1200px. You write it once, browsers handle the rest.

Tools worth knowing: Squoosh (squoosh.app) for manual conversion with quality preview, or Sharp (Node.js library) for batch processing. Both give you control over quality settings per image.

What makes resources render-blocking?

The problem: External JavaScript files block the browser's main thread during download and execution. Even optimized scripts add 200-500ms of blocking time, which directly affects your Time to Interactive and Total Blocking Time scores.

The solution: Defer JavaScript execution until after initial page render. Load scripts after the page displays content to users.

The Google Analytics consideration: Standard GA4 implementation is the most common performance issue I encounter. The default tracking code blocks rendering and adds approximately 500ms to LCP.

Standard implementation (blocks rendering):

<script async src="https://www.googletagmanager.com/gtag/js?id=G-XXXXXXXXXX"></script>

Performance-optimized implementation:

<script>
  window.addEventListener('load', function() {
    // Load GA4 after page fully renders
    var script = document.createElement('script');
    script.src = 'https://www.googletagmanager.com/gtag/js?id=G-XXXXXXXXXX';
    document.head.appendChild(script);
  });
</script>

The trade-off: This approach won't track users who leave within the first 2 seconds of page load. In practice, this represents less than 1% of traffic for most sites and is worth the performance improvement.

What to check: Review your <head> section. Any <script> tag without defer or async attributes is blocking. Move it to the page bottom or defer its execution.

How do you prevent cumulative layout shift?

The problem: Content shifting position while the page loads creates a poor user experience and hurts your CLS (Cumulative Layout Shift) score. Common causes include images loading without reserved space, web fonts swapping in, or ads inserting dynamically.

The solution: Reserve space for all content before it loads.

A real example: A professional services site I built maintains a CLS score of 0 (zero layout shift). Here's the approach:

1. Set explicit image dimensions:

<img src="hero.jpg" width="1200" height="800" alt="Hero">
<!-- Browser reserves 1200x800 space before image loads -->

2. Use CSS aspect-ratio for responsive images:

img {
  width: 100%;
  height: auto;
  aspect-ratio: 3/2; /* Maintains space even as viewport changes */
}

3. Configure fonts to display fallbacks immediately:

@font-face {
  font-family: 'CustomFont';
  src: url('/fonts/custom.woff2');
  font-display: swap; /* Show system font immediately, swap when custom font loads */
}

The result: Content positions remain stable throughout the load process. The page feels responsive and professional.

Debugging layout shift: Run PageSpeed Insights and review the filmstrip view. If you see elements jumping position, add explicit dimensions or aspect-ratios to the shifting elements.

What common issues occur during optimization?

Even when following established practices, you might encounter these issues:

"I inlined my CSS but my score dropped"

This happens when inline CSS exceeds 8-10KB. The browser must download the entire HTML file before rendering anything, which delays First Contentful Paint.

Solution: Extract only above-the-fold styles. Identify which CSS is needed for initial viewport rendering, inline only that portion, and load the rest asynchronously:

<link rel="stylesheet" href="/full-styles.css" media="print" onload="this.media='all'">

"My AVIF images appear blurry or pixelated"

Default AVIF quality settings are often too aggressive. A quality setting of 50 works for photographs but degrades images containing text or graphics.

Solution: Increase quality to 75-85 for images with text or fine details. Use image conversion tools that show quality previews before batch processing.

"CLS score remains poor despite setting dimensions"

Common culprits beyond images: web fonts loading (text reflows when custom font loads), ads inserting dynamically, or content above images pushing them down during load.

Solutions:

  • Fonts: Use font-display: swap and preload critical font files
  • Images: Always set explicit width and height attributes or use CSS aspect-ratio
  • Dynamic content: Set minimum heights on containers before content populates

"Performance is great on desktop but needs work on mobile"

Mobile devices have slower processors and network connections. What renders quickly on a development machine often struggles on mid-range Android phones over 4G networks. If you're only testing on desktop, you're missing how most users experience your site.

Solution: Always test mobile performance with Chrome DevTools throttling enabled (4x CPU slowdown, Fast 3G network). This simulates realistic mobile conditions and reveals actual user experience. Aim for 90+ on mobile, and you will likely get 100 on desktop.

How should you test performance?

The challenge: Your site might perform well on your development machine but struggle in real-world conditions. Mobile users on 4G with mid-range phones experience performance very differently than you do on a MacBook Pro with fiber internet.

The testing process:

  1. Run PageSpeed Insights (pagespeed.web.dev) on mobile configuration first
  2. Check Core Web Vitals against targets:
    • LCP (Largest Contentful Paint): under 2.5 seconds
    • TBT (Total Blocking Time): under 200ms
    • CLS (Cumulative Layout Shift): under 0.1
  3. Review the filmstrip: Look for empty frames where nothing renders
  4. Address the largest issue first (typically render-blocking CSS or oversized images)
  5. Re-test and iterate until reaching 90+/100

A perspective on perfection: Don't necessarily chase 100/100 if you're already at 95+. The final 5 points often represent diminishing returns. Consider whether that time is better spent on content, user experience, or other priorities.

Real device testing: Test on actual mobile devices when possible, not just Chrome DevTools simulation. Real hardware reveals issues that simulators miss.

What do these techniques look like in practice?

Examples you can test right now:

I built these sites to demonstrate these techniques in production:

Professional Services Example - zenaith.codecrank.ai

  • Scores: 100/100/100/100 (all four metrics)
  • LCP: 1.8 seconds
  • Techniques used: Inline critical CSS, AVIF images, zero render-blocking scripts, CLS: 0
  • Verification: Click "⚡ PageSpeed Me" in the footer to test live

MedConnect Pro (Healthcare Site) - medconnect.codecrank.ai

  • Scores: 100/100/100/100 (perfect scores across all metrics)
  • Techniques used: AVIF conversion, responsive images, semantic HTML, accessibility-first design
  • Verification: Click "⚡ PageSpeed Me" in the footer to test live

Mixology (Cocktail Recipes) - mixology.codecrank.ai

  • Score: 96/100 mobile, 100/100 desktop
  • Techniques used: Static generation, optimized images, minimal JavaScript

You can test any of these sites yourself. Newer sites even include a 'PageSpeed Me' button linking directly to Lighthouse testing. I have nothing to hide—the scores are verifiable.

The Honest Take

Most developers ship sites that "work" and move on. Getting to 100/100 takes additional time and attention to detail that many choose not to invest.

I put PageSpeed buttons on every site I build because the work should speak for itself. If I claim 100/100, you can verify it immediately. If I don't achieve it, I'll explain exactly why (client-requested features, necessary third-party integrations, etc.).

Fair warning: PageSpeed scores can fluctuate by a few points depending on network conditions, server load, and time of day. A site scoring 100/100 might test at 98/100 an hour later. What matters is consistent high performance (90-100 range), not chasing a perfect score every single test.

This transparency is uncommon in web development. Many agencies don't want you testing their work. I built these techniques into my process specifically so performance isn't an afterthought—it's built in from the start.

The results are measurable, verifiable, and reproducible. Some clients care deeply about performance. Some don't. I serve those who do.

What's the realistic time investment?

Optimizing to 100/100 isn't quick the first time through. For a typical site, expect:

  • 1-2 hours: Image optimization (format conversion, resizing, quality testing)
  • 1-2 hours: CSS optimization (critical extraction, inline implementation)
  • 1 hour: Layout shift fixes (dimensions, aspect-ratios, font configuration)
  • 1 hour: Testing and iteration

Total: 4-6 hours for first-time implementation

However: Once you've completed this process for one site, you understand the patterns. Your second site takes approximately 2 hours. Your tenth site takes 30 minutes because you've built the tooling and established the workflow.

Most developers never invest this time because "good enough" ships. But if you care about user experience, SEO performance, and conversion rates, it's worth learning these techniques.

What comes next?

You now understand how to achieve 100/100 PageSpeed scores. You know the techniques, the trade-offs, and the testing approach.

In my next article, I'll examine why performance optimization often gets overlooked in professional web development. I'll share a real case study—a $5,000 professional website scoring 40/100—and explain what affects the cost and quality of web development.

Want to verify these techniques work? Visit any of the sites mentioned above and click "⚡ PageSpeed Me" to test them live. Then consider: what would perfect performance scores mean for your business?

Need help with performance optimization? Visit codecrank.ai to learn about our approach to web development. We build performance optimization into every project from day one.

All performance metrics verified with Google Lighthouse. Sites tested on mobile with 4G throttling and mid-tier device simulation.

Building a Simple Ticket Tracker CLI in Go

2025-11-23 05:49:28

A lightweight, command-line alternative to complex ticketing systems.

Using golang and cobra-cli, I built a simple command-line interface for managing support tickets. Tickets are stored locally in a CSV file.

Introduction

As developers, we often find ourselves juggling multiple tasks, bugs, and feature requests. While tools like Jira, Trello, or GitHub Issues are powerful, sometimes you just need something simple, fast, and local to track your daily work without leaving the terminal.

That's why I built Ticket CLI—a simple command-line tool written in Go to track daily tickets and store them in a CSV file. No servers, no databases, just a binary and a text file.

The Tech Stack

For this project, I choose:

  • Go: For its speed, simplicity, and ability to compile into a single binary.
  • Cobra: The industry standard for building modern CLI applications in Go. It handles flag parsing, subcommands, and help text generation effortlessly.
  • Standard Library (encoding/csv): To keep dependencies low, I used Go's built-in CSV support for data persistence.

How It Works

The project follows a standard Go CLI structure:

ticket-cli/
├── cmd/            # Cobra commands (add, list, delete)
├── internal/       # Business logic
│   └── storage/    # CSV handling
└── main.go         # Entry point

1. The Command Structure

Using Cobra, I defined commands like add, list, and delete. Here's a snippet of how the add command handles flags to create a new ticket:

// cmd/add.go
var addCmd = &cobra.Command{
    Use:   "add",
    Short: "Add a new ticket",
    Run: func(cmd *cobra.Command, args []string) {
        // default values logic...

        t := storage.Ticket{
            ID:          fmt.Sprintf("%d", time.Now().UnixNano()),
            Title:       flagTitle,
            Customer:    flagCustomer,
            Priority:    flagPriority,
            Status:      flagStatus,
            Description: flagDescription,
        }

        if err := storage.AppendTicket(t); err != nil {
            fmt.Println("Error saving ticket:", err)
            return
        }
        fmt.Println("Ticket saved with ID:", t.ID)
    },
}

2. Data Persistence (The "Database")

Instead of setting up SQLite or a JSON store, I opted for CSV. It's human-readable and easy to debug. The internal/storage package handles reading and writing to tickets.csv.

// internal/storage/storage.go
func AppendTicket(t Ticket) error {
    // ... (file opening logic)
    w := csv.NewWriter(f)
    defer w.Flush()

    rec := []string{t.ID, t.Date, t.Title, t.Customer, t.Priority, t.Status, t.Description}
    return w.Write(rec)
}

Installation & Usage

You can clone the repo and build it yourself:

git clone https://github.com/yourusername/ticket-cli
cd ticket-cli
go mod tidy
go build -o ticket-cli .

Adding a Ticket

./ticket-cli add --title "Fix login bug" --priority high --customer "Acme Corp"

Listing Tickets

./ticket-cli list

Filter by date

./ticket-cli list --date 2025-11-15

CSV Storage

A CSV file is created automatically at the Project directory, and it keeps getting updated, once a new ticket is added using the .ticket-cli add --flags

Columns :

ID, Date, Title, Customer, Priority, Status, Description

Future Improvements

This is just an MVP. Some ideas for the future include:

  • JSON/SQLite Storage: For more complex querying.
  • TUI (Text User Interface): Using bubbletea for an interactive dashboard.
  • Cloud Sync: Syncing tickets to a Gist or S3 bucket.

Conclusion

Building CLI tools in Go is a rewarding experience. Cobra makes the interface professional, and Go's standard library handles the rest. If you're looking for a weekend project, try building your own developer tools!

Check out the code on GitHub.

From Prototype to Production: How to Engineer Reliable LLM Systems

2025-11-23 05:48:00

Over the past two years, large language models have moved from research labs to real-world products at an incredible pace. What began as a single API call quickly evolves into a distributed system touching compute, networking, storage, monitoring, and user experience. Teams soon realize that LLM engineering is not prompt engineering — it’s infrastructure engineering with new constraints.
In this article, we’ll walk through the key architectural decisions, bottlenecks, and best practices for building robust LLM applications that scale.

1. Why LLM Engineering Is Different

Traditional software systems are built around predictable logic and deterministic flows. LLM applications are different in four ways:

1.1 High and variable latency

Even a small prompt can require billions of GPU operations. Latency varies dramatically based on:

  • token length (prompt + output)
  • GPU generation
  • batching efficiency
  • model architecture (transformer vs. MoE) As a result, you must design for latency spikes, not averages.

1.2 Non-deterministic outputs

The same input can return slightly different answers due to sampling. This complicates:

  • testing
  • monitoring
  • evaluation
  • downstream decision logic LLM systems need a feedback loop, not one-off QA.

1.3 GPU scarcity and cost#

LLMs are one of the most expensive workloads in modern computing. GPU VRAM, compute, and network speed all constrain throughput.
Architecture decisions directly affect cost.

1.4 Continuous evolution

New models appear monthly, often with:

  • higher accuracy
  • lower cost
  • new modalities
  • longer context windows LLM apps must be built to swap models without breaking the system.

2. The LLM System Architecture

A production LLM application has five major components:

  • Model inference layer (API or self-hosted GPU)
  • Retrieval layer (vector DB / embeddings)
  • Orchestration layer (agents, tools, flows)
  • Application layer (backend + frontend)
  • Observability layer (logs, traces, evals) Let’s break them down.

3. Model Hosting: API vs. Self-Hosted#

3.1 API-based hosting (OpenAI, Anthropic, Google, Groq, Cohere)

Pros:

  • Zero GPU management
  • High reliability
  • Fast iteration
  • Access to top models
    Cons:

  • Expensive at scale

  • Limited control over latency

  • Vendor lock-in

  • Private data may require additional compliance steps
    Use API hosting when your product is early or workloads are moderate.

3.2 Self-Hosted (NVIDIA GPUs, AWS, GCP, Lambda Labs, vLLM)#

Pros:

  • Up to 60% cheaper at high volume
  • Full control over batching, caching, scheduling
  • Ability to deploy custom/finetuned models
  • Deploy on-prem for sensitive data
    Cons:

  • Complex to manage

  • Requires GPU expertise

  • Requires load-balancing around VRAM limits
    Use self-hosting when:

  • you exceed ~$20k–$40k/mo in inference costs

  • latency control matters

  • models must run in-house

  • you need fine-tuned / quantized variants

    4. Managing Context and Memory#

4.1 Prompt engineering is not enough

Real systems require:

  • message compression
  • context window optimization
  • retrieval augmentation (RAG)
  • caching (semantic + exact match)
  • short-term vs long-term memory separation

    4.2 RAG (Retrieval-Augmented Generation)

    RAG extends the model with external knowledge. You need:

  • a vector database (Weaviate, Pinecone, Qdrant, Milvus, pgvector)

  • embeddings model

  • chunking strategy

  • ranking strategy
    Best practice:
    Use hybrid search (vector + keyword) to avoid hallucinations.

    4.3 Agent memory

    Agents need memory layers:

  • Ephemeral memory: what’s relevant to the current task

  • Long-term memory: user preferences, history

  • Persistent state: external DB, not the LLM itself

    5. Orchestration: The Real Complexity

    As soon as you do more than “ask one prompt,” you need an orchestration layer:

  • LangChain

  • LlamaIndex

  • Eliza / Autogen

  • TypeChat / E2B

  • Custom state machines
    Why?
    Because real workflows require:

  • tool use (API calls, DB queries)

  • conditional routing (if…else)

  • retries and fallbacks

  • parallelization

  • truncation logic

  • evaluation before showing results to users
    Best practice:
    Use a deterministic state machine under the hood.
    Use LLMs only for steps that truly require reasoning.

    6. Evaluating LLM Outputs

    LLM evals are not unit tests. They need:
    a curated dataset of prompts
    automated scoring (BLEU, ROUGE, METEOR, cosine similarity)
    LLM-as-a-judge scoring
    human evaluation

    6.1 Types of evaluations

  • Correctness: factual accuracy

  • Safety: red teaming, jailbreak tests

  • Reliability: consistency across temperature=0

  • Latency: P50, P95, P99

  • Cost: tokens per workflow
    Best practice:
    Run nightly evals and compare the current model baseline with:

  • new models

  • new prompts

  • new RAG settings

  • new finetunes
    This prevents regressions when you upgrade.

    7. Monitoring & Observability

    Observability must be built early.

    7.1 What to log

  • prompts

  • responses

  • token usage

  • latency

  • truncation events

  • RAG retrieval IDs

  • model version

  • chain step IDs

    7.2 Alerting

    Alert on:

  • latency spikes

  • cost spikes

  • retrieval failures

  • model version mismatches

  • hallucination detection thresholds
    Tools like LangSmith, Weights & Biases, or Arize AI can streamline this.

    8. Cost Optimization Strategies

    LLM compute cost is often your biggest expense. Ways to reduce it:

    8.1 Use smaller models with good prompting

    Today’s 1B–8B models (Llama, Mistral, Gemma) are extremely capable.
    Often, a well-prompted small model beats a poorly-prompted big one.

    8.2 Cache aggressively

  • semantic caching

  • response caching

  • template caching
    This reduces repeated calls.

    8.3 Use quantization

    Quantized 4-bit QLoRA models can cut VRAM use by 70%.

    8.4 Batch inference

    Batching increases GPU efficiency dramatically.

    8.5 Stream tokens

    Streaming reduces perceived latency and helps UX.

    8.6 Cut the context

    Long prompts = long latency = expensive runs.

    9. Security & Privacy Considerations

    LLM systems must handle:

    9.1 Prompt injection

    Never trust user input. Normalize, sanitize, or isolate it.

    9.2 Data privacy

    Don’t send sensitive data to external APIs unless fully compliant.

    9.3 Access control

    Protect:

  • model APIs

  • logs

  • datasets

  • embeddings

  • vector DBs

    9.4 Output filtering

    Post-processing helps avoid toxic or harmful outputs.

    10. Future of LLM Engineering

    Over the next 18 months, we’ll see:

  • long-context models (1M+ tokens)

  • agent frameworks merging into runtime schedulers

  • LLM-native CI/CD pipelines

  • cheaper inference via MoE and hardware-optimized models

  • GPU disaggregation (compute, memory, interconnect as separate layers)
    The direction is clear:
    LLM engineering will look more like distributed systems engineering than NLP.

    Conclusion

    Building a production-grade LLM system is much more than writing prompts. It requires thoughtful engineering across compute, memory, retrieval, latency, orchestration, and evaluation.
    If your team is moving from early experimentation to real deployment, expect to invest in:

  • reliable inference

  • RAG infrastructure

  • model orchestration

  • observability

  • cost optimization

  • security
    The companies that succeed with LLMs are not the ones that use the biggest model — but the ones that engineer the smartest system around the model.

🚀 Integrating API Gateway with Private ALB: The New, Simpler, and More Scalable Way

2025-11-23 05:46:52

🧩 The Problem

When you work with microservices in AWS (especially in ECS, EKS, or internal applications inside a VPC), sooner or later you need to expose a REST endpoint through Amazon API Gateway, but without making your backend public.

For many years, the only “official” way to integrate API Gateway (REST) with a private Application Load Balancer (ALB) was by placing a Network Load Balancer (NLB) in the middle.

This created three common issues in real-world projects:

  1. More infrastructure than necessary (ALB + NLB just to create a bridge).
  2. Higher latency because traffic needed to make an extra hop.
  3. More cost and operational overhead: two load balancers to monitor, scale, and secure.

For students or small teams, this architecture was confusing and far from intuitive:

“Why do I need an NLB if my backend is already behind an ALB?”

And yes… we were all asking ourselves the same thing.

🔧 The Old Solution:

Until recently, the flow looked like this:

API Gateway → VPC Link → NLB → ALB (Privado)

old-solution

The NLB acted as a “bridge” because API Gateway could only connect to an NLB using VPC Link. ALB wasn’t supported directly.

This worked, but:

  • It was more complex to explain in classes or onboarding sessions.
  • Costs were higher (NLB hourly charges + NLCUs).
  • It introduced extra points of failure.
  • It didn’t feel natural for modern ALB-based architectures.

⭐ The New Solution:

AWS finally heard our prayers 🙏.

Now API Gateway (REST) supports direct private integration with ALB using VPC Link v2.
The new flow looks like this:

API Gateway → VPC Link v2 → ALB (Privado)

new-solution

In summary:

  • No NLB.
  • No unnecessary bridge.
  • Lower cost.
  • Lower latency.
  • Easier to teach and understand.

This allows you to naturally expose your internal microservices behind a private ALB without adding any extra resources.

⚔️ Comparison: Before vs Now

Aspect Old (NLB Required) New (Direct to ALB)
Infrastructure More complex (extra NLB) Much simpler
Cost Hourly NLB + extra NLCUs Only ALB + API Gateway
Latency Higher (extra hop through NLB) Lower
Maintenance Two load balancers One load balancer
Security Good, but more SG rules Equally secure, fewer failure points
Clarity Hard to explain Much more intuitive
Scalability Depends on NLB Highly scalable VPC Link v2
Flexibility Limited to NLB Supports multiple ALBs/NLBs

🎓 Why is this change important?

If you’re learning cloud architecture — especially microservices — this change is a huge benefit:

  • The architecture is easier to understand.
  • You can focus on real concepts, not historical workarounds.
  • Lower cost for student projects or test environments.
  • The traffic path becomes cleaner and more direct.
  • You can use ALB (which is more flexible) without depending on an NLB.

It also unlocks modern patterns:

  • Expose EKS/ECS microservices via API Gateway without making them public.
  • Build more secure internal corporate APIs.
  • Use a single VPC Link v2 to route to multiple ALBs.

🎯 Conclusions

AWS simplified a pattern that had been unnecessarily complex for years. The direct integration between API Gateway and private ALB:

  • Lowers cost
  • Reduces complexity
  • Reduces latency
  • Improves architectural clarity
  • Is perfect for students, small teams, and enterprises

If you were building internal APIs and previously needed an NLB just to bridge API Gateway to ALB… you can forget about that now.

The architecture is cleaner, more modern, and aligned with real cloud-native best practices.

📚 References

Here are the official sources and recommended materials for deeper study:

🌐 Languages

Spanish