
The Practical Developer

A constructive and inclusive social network for software developers.

RSS preview of the Blog of The Practical Developer

Iterating the Recursive

2026-01-09 00:27:00

In another of my many side projects (or pet projects, or whatever you want to call them) I wrote a small parser for a reduced subset of HTML, for a note-taking app. My goal is for the user to write as if in a small word processor, and for the result to be saved as Markdown. All of this sits on top of JEditorPane, since it is an app written in Java on Swing (I really like C#, but being able to ship the whole application as a single jar file is a big plus).

The thing is, JEditorPane can save its content in several formats, but the one that seemed most suitable to me was HTML. So the goal is that the user, without worrying about formats, ends up writing something like the following in the editor, for a note about everyday chores:

<html>
<body>
A realizar:
<ol>
    <li>Recoger la bici del taller.</li>
    <li>Dar de comer a las tortugas.</li>
    <li>Pedir cita para colonoscopia.</li>
</ol>
</body>
</html>

...and have the application itself translate it into the following:

# Tareas cotidianas

A realizar:

1. Recoger la bici del taller.
2. Dar de comer a las tortugas.
3. Pedir cita para colonoscopia.

Although the translation may seem immediate (perhaps with some kind of preprocessor), problems would show up right away if we didn't write a parser. Its goal is to convert the HTML input into a syntax tree. The example above would end up as the following syntax tree:

Root
├── p
│   └── "A realizar:"
└── ol
    ├── li
    │   └── "Recoger la bici del taller."
    ├── li
    │   └── "Dar de comer a las tortugas."
    └── li
        └── "Pedir cita para colonoscopia."

And this is where today's curiosity comes in. We have to walk the tree to generate Markdown text. How do we traverse the syntax tree? One way is the following, relying on recursion and on the "Visitor" design pattern.

class Tree {
    // ...
    public void run(Visitor v)
    {
        this.runOver( v, this.root );
    }

    private void runOver(Visitor v, Element element)
    {
        v.visit( element );

        // Recurse into the children of the current element
        for(Element subElement: element.getSubElements()) {
            this.runOver( v, subElement );
        }
    }
}

We thus manage to separate the code that deals with the tree from the code that performs the different tasks on it. The runOver(Visitor v, Element element) method goes over all the elements using a depth-first strategy. This method is very simple thanks to recursion: every element gets visited with, in fact, very little code.

If we had a Visitor class like the following:

class Visitor {
    // ...
    public void visit(Element element)
    {
        String elementText = element.toString();

        if ( !elementText.isEmpty() ) {
            text.append( elementText );
            text.append( '\n' );
        }
    }

    // Accumulates the generated text
    private final StringBuilder text = new StringBuilder();
}

It would execute like this (what follows is a representation of the stack, that is, the call stack):

runOver( v, root )
    |-runOver( v, p )
    |    |-runOver( v, p.get( 0 ) )
    |-runOver( v, ol )
    |    |-runOver( v, ol.get( 0 ) )
    |    |    |-runOver( v, li.get( 0 ) )
    |    |-runOver( v, ol.get( 1 ) )
    |    |    |-runOver( v, li.get( 0 ) )
    |    |-runOver( v, ol.get( 2 ) )
    |    |    |-runOver( v, li.get( 0 ) )

And we would get the following output:

A realizar:
Recoger la bici del taller.
Dar de comer a las tortugas.
Pedir cita para colonoscopia.

Recursion has two drawbacks. The first one evaporates with time: at the beginning it is hard to think about and understand code that calls itself. It becomes easier if you keep the most important case in mind: the base case, the one where the recursive calls return. Here, that is the element that has no subelements.

The second drawback has to do with performance: the mere fact of using recursion carries the risk of overflowing the stack, that is, the call stack. Each recursive call adds one more entry to the call stack, so if one of our trees is 100 elements deep (that is, 100 nested calls before any return happens), 100 stack entries will be reserved (although some compilers can apply certain optimizations in some situations).

In any case, we won't have this problem if we use an iterative approach; that is, if instead of an algorithm that invokes itself for each element, we use a loop that enumerates the elements. If we think about the role of the stack, we reach the conclusion that the call stack acts as the collection of elements being visited: a new call is made for each subelement of a given element, and each return happens because all of its subelements have already been visited.

The strategy is to replace this implicit call stack with an explicit collection, a real collection in our run() method (the private runOver() method will no longer be needed). The catch is that, if we use a stack, we have to insert the subelements of a given element in the reverse of their natural order. So, instead, we will use a queue.
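
Just to see why the order matters: a depth-first variant with an explicit stack would look roughly like this (a sketch, assuming the same Tree/Element/Visitor API as above and the usual java.util imports). Each element's children must be pushed in reverse so that they come off the stack in their natural order.

public void runDepthFirst(Visitor v)
{
    var elements = new ArrayDeque<Element>();
    elements.push( this.getRoot() );

    while ( !elements.isEmpty() ) {
        var element = elements.pop();
        v.visit( element );

        // Push the children in reverse so they are popped left to right
        var subElements = new ArrayList<Element>( element.getSubElements() );
        Collections.reverse( subElements );

        for(Element subElement: subElements) {
            elements.push( subElement );
        }
    }
}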

Let's start, then, with a queue: elements are inserted at the back and removed from the front. We insert the root element into this queue, and the loop runs as long as the queue is not empty. When we visit an element, we add its subelements to the back of the queue and continue with the next element at the front of the queue.

class Tree {
    // ...
    public void run(Visitor v)
    {
        var elements = new ArrayDeque<Element>();
        elements.add( this.getRoot() );

        while( !elements.isEmpty() ) {
            var element = elements.peek();

            elements.remove();
            elements.addAll( element.getSubElements() );

            v.visit( element );
        }
    }
}

Let's take the previous example, the to-do note. On the first loop iteration we insert the root, which is then removed and its subelements are inserted. The contents of elements become:

elements -> [ p, ol ]
text     -> ""

The first element, p, is taken; its only child node is the text "A realizar.". Then p is removed.

elements -> [ ol, "A realizar." ]
text     -> ""

In the next step, the children of ol are taken and appended to the end of elements.

elements -> [ "A realizar.", li, li, li ]
text     -> ""

Now the first element is taken. It is the only one so far that contains text, so its text is appended to the StringBuilder. The text has no subelements, so the queue is not affected.

elements -> [ li, li, li ]
text     -> "A realizar.\n"

We take the first element again, to add its subelements.

elements -> [ li, li, "Recoger la bici del taller." ]
text     -> "A realizar.\n"

And we do it again.

elements -> [ li, "Recoger la bici del taller.", "Dar de comer a las tortugas." ]
text     -> "A realizar.\n"

And again.

elements -> [ "Recoger la bici del taller.", "Dar de comer a las tortugas.", "Pedir cita para colonoscopia." ]
text     -> "A realizar.\n"

From this point on, since the texts contain no subelements, the next three loop iterations simply append them to the StringBuilder.

elements -> [ "Dar de comer a las tortugas.", "Pedir cita para colonoscopia." ]
text     -> "A realizar.\nRecoger la bici del taller.\n"

We keep going...

elements -> [ "Pedir cita para colonoscopia." ]
text     -> "A realizar.\nRecoger la bici del taller.\nDar de comer a las tortugas.\n"

And we finish. Since the queue no longer contains any element, the algorithm terminates.

elements -> []
text     -> "A realizar.\nRecoger la bici del taller.\nDar de comer a las tortugas.\nPedir cita para colonoscopia.\n"

Hurray! Our code now runs without having to worry about whether the complexity of our tree can overflow the stack or not. Besides, since it makes fewer function calls, our code will be faster. In exchange, of course, we use more memory: that of the elements queue. We will always be trading performance for memory and vice versa!

10 AWS Production Incidents That Taught Me Real-World SRE

2026-01-09 00:25:12

After responding to hundreds of AWS production incidents, I've learned that textbook solutions rarely match production reality. Here are 10 incidents that taught me how AWS systems actually break and how to fix them fast.

[Image: AWS production incidents]

1. HTTP 4XX Alarms: When Your Users Can't Reach You

3 AM wake-up call: CloudWatch alarm firing for elevated 4XX errors. Traffic looked normal, but 30% of requests were getting 403s.
What I thought: API Gateway throttling or IAM issues.
What it actually was: A code deployment changed how we validated JWT tokens. The validation was now rejecting tokens from our mobile app's older version (which 30% of users hadn't updated yet).
The approach:
Check CloudWatch Insights for specific 4XX types (400, 403, 404)
Correlate with recent deployments using AWS Systems Manager
Examine API Gateway execution logs for rejection patterns
The fix:
Quick triage query in CloudWatch Insights
fields @timestamp, @message, statusCode, userAgent
| filter statusCode >= 400 and statusCode < 500
| stats count() by statusCode, userAgent
| sort count desc

Fast action: Rolled back the deployment, added backward compatibility for token validation, and set up monitoring for version distribution.
Lesson learned: 4XX errors are user-facing problems. Always correlate them with deployment times and check for breaking changes in validation logic.

[Image: golden signals]

2. HTTP 5XX Alarms: The System Is Breaking

The scenario: 5XX errors spiking during peak traffic. Load balancer health checks passing, but 15% of requests failing.
What I thought: Backend service overwhelmed.
What it actually was: Lambda functions timing out because of cold starts during a traffic spike, returning 504 Gateway Timeout through API Gateway.
The approach:
Distinguish between different 5XX codes (500, 502, 503, 504)
Check ELB/ALB target health in real-time
Examine Lambda concurrent executions and duration
The fix:
Added provisioned concurrency for critical Lambda functions
aws lambda put-provisioned-concurrency-config \
--function-name critical-api-handler \
--provisioned-concurrent-executions 10 \
--qualifier PROD

Implemented exponential backoff in API Gateway

Fast action: Enabled Lambda provisioned concurrency for traffic-sensitive functions and added CloudWatch alarms for concurrent execution approaching limits.
Lesson learned: 5XX errors need immediate action. Set up separate alarms for 502 (bad gateway), 503 (service unavailable), and 504 (timeout)—each tells a different story.

3. Route53 Health Check Failures: DNS Thinks You're Dead

The incident: Route53 failover triggered automatically at 2 PM, routing all traffic to our secondary region, which wasn't ready for full load.
What I thought: Primary region having issues.
What it actually was: Security group change blocked Route53 health check endpoint. Service was healthy, but Route53 couldn't verify it.
The approach:
Verify health check endpoint is accessible from Route53 IP ranges
Check security groups and NACLs
Test health check URL manually from different regions
The fix:
Whitelist Route53 health checker IPs in security group
Route53 publishes IP ranges at:
https://ip-ranges.amazonaws.com/ip-ranges.json

Quick health check test
curl -v https://api.example.com/health \
-H "User-Agent: Route53-Health-Check"

Fast action: Added Route53 health checker IPs to security group, implemented internal health checks that validate both endpoint accessibility and actual service health.
Lesson learned: Route53 health checks are not the same as your service being healthy. Ensure your health check endpoint tells the full story—database connectivity, downstream dependencies, not just "service is running."

4. Database Connection Pool Exhaustion: The Silent Killer

The scenario: Application logs showing "connection pool exhausted" errors. RDS metrics looked fine—CPU at 20%, connections well below max.
What I thought: Need to increase RDS max_connections.
What it actually was: Application wasn't releasing connections properly after exceptions. Connection pool filled up with zombie connections.
The approach:
Check RDS DatabaseConnections metric vs your pool size
Examine application connection acquisition/release patterns
Look for long-running queries holding connections
The fix:
Implemented proper connection management
from contextlib import contextmanager

@contextmanager
def get_db_connection():
    conn = connection_pool.get_connection()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()  # Critical: Always release

Added connection pool monitoring
cloudwatch.put_metric_data(
    Namespace='CustomApp/Database',
    MetricData=[{
        'MetricName': 'ConnectionPoolUtilization',
        'Value': pool.active_connections / pool.max_size * 100
    }]
)

Fast action: Implemented connection timeout, added circuit breakers, and created CloudWatch dashboard tracking connection pool health.
Lesson learned: Database connection pools need aggressive monitoring. Set alarms at 70% utilization, not 95%. By then, it's too late.
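
As a sketch of that kind of alarm on the custom metric from the snippet above (assuming boto3 and a hypothetical SNS topic for notifications):

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alert at 70% pool utilization, well before the pool is actually exhausted
cloudwatch.put_metric_alarm(
    AlarmName='db-connection-pool-utilization-high',
    Namespace='CustomApp/Database',
    MetricName='ConnectionPoolUtilization',
    Statistic='Average',
    Period=60,
    EvaluationPeriods=3,
    Threshold=70,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:oncall-alerts']  # hypothetical topic
)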

5. API Rate Limits: When AWS Says "Slow Down"

The incident: Lambda functions failing with "Rate exceeded" errors during a batch job. Processing completely stopped.
What I thought: Hit AWS service limits.
What it actually was: Batch job making 10,000 concurrent DynamoDB writes with no backoff strategy. Hit write capacity limits within seconds.
The approach:
Identify which AWS API is rate limiting (check error messages)
Check Service Quotas dashboard for current limits
Implement exponential backoff with jitter
The fix:
import time
import random
from botocore.exceptions import ClientError

def exponential_backoff_retry(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except ClientError as e:
            if e.response['Error']['Code'] in ['ThrottlingException', 'TooManyRequestsException']:
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff with jitter
                sleep_time = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(sleep_time)
            else:
                raise

Use AWS SDK built-in retry
import boto3
from botocore.config import Config

config = Config(
    retries = {
        'max_attempts': 10,
        'mode': 'adaptive'
    }
)
dynamodb = boto3.client('dynamodb', config=config)
Fast action: Implemented rate limiting on application side, added CloudWatch metrics for throttled requests, and requested limit increases where justified.
Lesson learned: Don't fight AWS rate limits—work with them. Build backoff into your code from day one, not after the incident.

6. Unhealthy Target Instances: The Load Balancer Lottery

The scenario: ALB sporadically marking healthy instances as unhealthy. Some requests succeeded, others got 502 errors.
What I thought: Instances actually becoming unhealthy under load.
What it actually was: Health check interval too aggressive (5 seconds) with tight timeout (2 seconds). During brief CPU spikes, instances couldn't respond in time and got marked unhealthy.
The approach:
Review target group health check settings
Check instance metrics during health check failures
Examine health check response times
The fix:
Adjusted health check to be more forgiving
aws elbv2 modify-target-group \
--target-group-arn arn:aws:elasticloadbalancing:... \
--health-check-interval-seconds 30 \
--health-check-timeout-seconds 5 \
--healthy-threshold-count 2 \
--unhealthy-threshold-count 3

Made health check endpoint lightweight
Don't do: health check that queries database
Do: health check that verifies process is alive
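
A minimal sketch of such a lightweight endpoint (assuming a Flask service; the point is that it touches no database or downstream dependency):

from flask import Flask

app = Flask(__name__)

@app.route('/health')
def health():
    # Only proves the process is up and able to serve requests.
    # Deep checks (database, downstream APIs) belong in a separate monitoring endpoint.
    return {'status': 'ok'}, 200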

Fast action: Separated deep health checks (for monitoring) from load balancer health checks (for routing). ALB health checks should be fast and simple.
Lesson learned: Aggressive health checks cause more problems than they solve. Balance between catching real failures and avoiding false positives.

7. Lambda Cold Starts: The Hidden Latency Tax

The incident: P99 latency for API calls spiking to 8 seconds during low traffic periods, while P50 stayed at 200ms.
What I thought: Backend database performance issue.
What it actually was: Lambda cold starts. Functions were shutting down during quiet periods, causing massive latency when the next request arrived.
The approach:
Check Lambda Duration metrics and look for bimodal distribution
Examine Init Duration in CloudWatch Logs Insights
Calculate cold start frequency
The fix:
CloudWatch Insights query to identify cold starts
fields @timestamp, @duration, @initDuration
| filter @type = "REPORT"
| stats
avg(@duration) as avg_duration,
avg(@initDuration) as avg_cold_start,
count(@initDuration) as cold_start_count,
count(*) as total_invocations
| limit 20

Solutions applied:

  1. Provisioned concurrency for critical paths
  2. Keep functions warm with EventBridge schedule (see the sketch after this list)
  3. Optimize cold start time (smaller deployment package)
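
A sketch of option 2, assuming an EventBridge rule that invokes the function every few minutes with a payload like {"warmup": true} (the names here are hypothetical):

def lambda_handler(event, context):
    # Warm-up ping from the EventBridge schedule: return immediately,
    # keeping the execution environment alive without doing real work.
    if event.get('warmup'):
        return {'statusCode': 200, 'body': 'warm'}

    # Normal invocation path
    return handle_request(event)  # hypothetical: your real request handler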

Fast action: Implemented provisioned concurrency for user-facing APIs, scheduled pings to keep functions warm, and reduced deployment package size by 60%.
Lesson learned: Cold starts are inevitable with Lambda. Design around them—use provisioned concurrency for latency-sensitive operations, or accept the trade-off for batch jobs.

8. DynamoDB Throttling: When NoSQL Says No

The incident: Writes succeeding, but reads failing with ProvisionedThroughputExceededException during daily report generation.
What I thought: Need to increase read capacity units.
What it actually was: Report query using Scan operation without pagination, creating hot partition that consumed all capacity in seconds.
The approach:
Check DynamoDB metrics: ConsumedReadCapacity, ThrottledRequests
Identify access patterns causing hot partitions
Review query patterns (Scan vs Query)
The fix:
Before: Scan without pagination (disaster)
response = table.scan()
items = response['Items']

After: Query with pagination and exponential backoff
def query_with_pagination(table, key_condition):
    items = []
    last_evaluated_key = None

    while True:
        if last_evaluated_key:
            response = table.query(
                KeyConditionExpression=key_condition,
                ExclusiveStartKey=last_evaluated_key
            )
        else:
            response = table.query(
                KeyConditionExpression=key_condition
            )

        items.extend(response['Items'])

        last_evaluated_key = response.get('LastEvaluatedKey')
        if not last_evaluated_key:
            break

    return items

Enable DynamoDB auto scaling
aws application-autoscaling register-scalable-target \
--service-namespace dynamodb \
--resource-id "table/YourTable" \
--scalable-dimension "dynamodb:table:ReadCapacityUnits" \
--min-capacity 5 \
--max-capacity 100

Fast action: Converted Scans to Queries where possible, implemented pagination, enabled auto-scaling, and added composite sort keys to enable efficient queries.
Lesson learned: DynamoDB throttling is almost always a design problem, not a capacity problem. Fix your access patterns before throwing money at provisioned capacity.

9. ELB Connection Draining: Killing Requests During Deployment

The incident: 5% of requests failed during every deployment with 502 errors, despite using blue-green deployments.
What I thought: Instances shutting down too quickly.
What it actually was: Connection draining timeout set to 30 seconds, but some API calls took up to 60 seconds. ALB killed connections mid-request.
The approach:
Check ALB access logs for 502s during deployment windows
Review connection draining settings
Measure actual request duration (P99)
The fix:
Increase connection draining timeout
aws elbv2 modify-target-group-attributes \
--target-group-arn arn:aws:elasticloadbalancing:... \
--attributes Key=deregistration_delay.timeout_seconds,Value=120

Add deployment health check
Wait for active connections to drain before proceeding
while [ $(aws elbv2 describe-target-health \
--target-group-arn $TG_ARN \
--query "TargetHealthDescriptions[?TargetHealth.State=='draining'] | length(@)" \
--output text) -gt 0 ]
do
  echo "Waiting for connections to drain..."
  sleep 10
done

Fast action: Increased deregistration delay, implemented graceful shutdown in application (stop accepting new requests, finish existing ones), added pre-deployment validation.
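
A minimal sketch of that graceful-shutdown behavior, assuming the platform (ECS, or your deployment script) sends SIGTERM to the service process when it takes a target out of service:

import signal

shutting_down = False

def handle_sigterm(signum, frame):
    # Deregistration has started: stop advertising ourselves as healthy,
    # keep serving in-flight requests, and exit once they are finished.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# In the health-check handler, return 503 once shutting_down is True,
# so the load balancer stops routing new requests to this target.
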
Lesson learned: Connection draining timeout should be longer than your longest request duration. Monitor P99 request latency and set draining timeout accordingly.

10. Security Group Lockout: How I Locked Myself Out of Production

The incident: Deployment script failed mid-way, leaving security groups in an inconsistent state. Couldn't SSH to instances, couldn't roll back.
What I thought: Need to manually fix security groups.
What it actually was: Automation script had no rollback mechanism. Changed security groups in production without testing.
The approach:
Use AWS Systems Manager Session Manager (doesn't need SSH)
Document security group changes before modifying
Always test infrastructure changes in staging
The fix:
Access instance without SSH using Session Manager
aws ssm start-session --target i-1234567890abcdef0

Implement security group changes with backup
1. Describe current security groups
aws ec2 describe-security-groups \
--group-ids sg-12345 > security-group-backup.json

2. Make changes atomically
aws ec2 authorize-security-group-ingress \
--group-id sg-12345 \
--ip-permissions IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges='[{CidrIp=0.0.0.0/0}]'

3. Validate the change worked
4. Only then remove the old rule

Better: Use CloudFormation for security groups
Changes are tracked, rollback is automatic
Fast action: Enabled Systems Manager Session Manager on all instances, started managing security groups through CloudFormation, implemented change approval process.

Lesson learned: Never modify security groups manually in production. One wrong click can lock you out. Use infrastructure as code and Session Manager as a safety net.

Tools That Make This Easier
When incidents happen, speed matters. I built an Incident Helper to automate the repetitive parts of incident response, gathering CloudWatch logs, checking service health, and identifying common AWS issues.
It won't solve incidents for you, but it cuts down the time spent collecting information so you can focus on fixing the actual problem.

Fix the production alarm
The Real Lesson
AWS gives you powerful tools, but they don't come with training wheels. Every service has failure modes you won't discover until 3 AM on a Saturday.
The incidents that teach you the most aren't the catastrophic ones—they're the subtle ones that make you question your assumptions. The 4XX error that reveals a deployment process gap. The throttling error that exposes an architecture flaw.

Document your incidents. Build your runbooks. Test your failovers. Discuss weekly with your teams. Because the next incident is already scheduled, you just don't know when.

Day-30 Kubernetes Networking Decoded: Why We Need Ingress and Ingress Controllers

2026-01-09 00:21:17

Moving beyond NodePort and LoadBalancers to smarter traffic routing.
Today’s exploration into Kubernetes networking took me a step deeper than the standard Service types. After getting comfortable with ClusterIP, NodePort, and LoadBalancer, I ran into a logical question: What happens when I have 50 microservices? Do I really want to pay for 50 Cloud Load Balancers or open 50 random ports on my nodes?

The answer, thankfully, is no. That is where Ingress and Ingress Controllers come in. Here is what I learned today about how they work and why they are essential for production clusters.

The Problem: Why Standard Services Aren't Enough
Before understanding Ingress, I had to understand the limitations of the other methods:

NodePort: It opens a specific port on every Node in the cluster. It's messy to manage, has security implications, and you are limited to a specific port range (30000-32767). It’s fine for testing, but bad for production.

LoadBalancer: This creates a distinct external IP address (usually a cloud load balancer from AWS, GCP, or Azure) for each service. If you have 20 microservices, that’s 20 separate bills for 20 load balancers. It’s expensive and inefficient.

The Solution: Ingress
I learned that Ingress is essentially a smart router for your cluster. Instead of exposing every service directly to the internet, you expose one entry point, and that entry point decides where the traffic goes based on rules you define.

Think of it like an office building:

NodePort is like giving everyone their own key to a side door.

LoadBalancer is like building a separate main entrance for every single employee.

Ingress is having one main reception desk. You walk in, tell the receptionist who you are looking for ("I need the Billing Department"), and they direct you to the right room.

In technical terms, Ingress allows you to do Path-Based Routing or Host-Based Routing.

example.com/api -> Routes to the Backend Service

example.com/shop -> Routes to the Frontend Service
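
As a rough sketch, those two rules could be expressed in a single Ingress resource like this (the Service names are hypothetical; ingressClassName assumes the NGINX controller discussed below):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-routing
spec:
  ingressClassName: nginx
  rules:
    - host: example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: backend-service    # hypothetical Service name
                port:
                  number: 80
          - path: /shop
            pathType: Prefix
            backend:
              service:
                name: frontend-service   # hypothetical Service name
                port:
                  number: 80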

The Missing Piece: The Ingress Controller
Here was the "aha!" moment for me today: Ingress by itself does nothing.

If you create an Ingress resource (the YAML file), it’s just a piece of paper with rules on it. It’s a configuration request. For those rules to actually work, you need an implementation. This is called the Ingress Controller.

The Ingress Controller is the actual software (a Pod) running in your cluster that reads your Ingress rules and processes the traffic.

Ingress = The Rules (The Map)

Ingress Controller = The Enforcer (The Traffic Cop)

The most popular controller is NGINX, but there are others like Traefik, HAProxy, and Istio.

Why They Are Needed (Summary)
Cost Efficiency: You only pay for one Cloud Load Balancer (which sits in front of the Ingress Controller) regardless of how many services you have inside.

Clean URLs: You can route traffic based on domains (app.com, api.app.com) or paths (/app, /login) rather than weird port numbers like 192.168.1.5:32044.

SSL/TLS Termination: You can manage your security certificates in one place (the Ingress) rather than configuring SSL on every single microservice application.

Learning about Ingress feels like graduating from "making things work" to "making things scalable." It separates the routing logic from the application logic and saves massive amounts of cloud resources.

Linkedin: https://www.linkedin.com/in/dasari-jayanth-b32ab9367/

Browser-Based kubectl Access: Managing Kubernetes Without Bastion Hosts

2026-01-09 00:17:00

TL;DR

  • This article presents a browser-based kubectl access pattern

  • Designed for temporary, auditable cluster interaction

  • No bastion host, no SSH, no heavy management tools

  • All actions go through the Kubernetes API and RBAC

  • Not intended for daily production operations

Accessing Kubernetes clusters securely is a recurring challenge, especially in environments where SSH access, bastion hosts, or heavy management tools are discouraged.

In this article, I’ll walk through a browser-based kubectl access pattern that enables temporary, auditable interaction with a Kubernetes cluster, without relying on jump hosts or always-on management platforms.

This approach is intentionally not designed for daily production operations. Its value lies in controlled access, not convenience.

Why Kubernetes Access Is Hard to Get Right

Most teams rely on one or more of these approaches:

  • Bastion hosts with SSH access

  • kubectl configured on local laptops/machines

  • Full-featured Kubernetes management tools

  • Cloud-provider shell environments

All of them work, but they come with trade-offs:

  • credential sprawl

  • infrastructure overhead/increased attack surface

  • long-lived access paths/credentials

  • limited auditability/unclear audit boundaries

In regulated or security-sensitive environments, these trade-offs become unacceptable.

Yet teams still need:

  • break-glass access

  • short-lived troubleshooting

  • training and workshop environments

  • controlled/restricted support access

This is the gap the browser-based approach addresses.

What “Browser-Based kubectl” Actually Means

This pattern does not introduce a new Kubernetes UI.

Instead, it exposes a restricted web terminal that runs kubectl inside the cluster, using:

  • a dedicated ServiceAccount

  • strict RBAC permissions

  • native Kubernetes audit logging

All access happens through HTTP(S).

There is:

  • no SSH access,
  • no node-level access and login,
  • no user kubeconfig distribution.

High-Level Architecture

Conceptually, the flow looks like this:

User Browser
     |
     v
HTTP(S) (restricted)
     |
     v
Ingress / Load Balancer
     |
     v
Service
     |
     v
Web Terminal Pod
     |
     v
kubectl
     |
     v
Kubernetes API Server

Important details:

  • The terminal runs as a Pod inside the cluster

  • Authorization is enforced by Kubernetes RBAC

  • All kubectl actions result in Kubernetes API requests, which can be captured by Kubernetes audit logs depending on the audit policy

  • Access can be disabled instantly by removing the Pod or Service

  • No persistent external access paths remain

and represented in an ASCII diagram:

[Image: Web Terminal for kubectl - ASCII diagram]

A few pictures, in operation

  • basic commands:
    [Image: Web Terminal for kubectl commands]

  • jump into a pod and use the kubectl exec command - within the defined namespace:
    [Image: Web Terminal for kubectl exec command]

Security Model: Why This Is Auditable by Design

This pattern relies on layered security, not a single control.

Network layer

  • TLS termination

  • IP allowlists

  • No direct node exposure

Kubernetes authorization

  • Dedicated ServiceAccount

  • Least-privilege RBAC

  • Optional read-only mode
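
A minimal sketch of what that authorization layer can look like (the names are hypothetical and not taken from the reference implementation): a dedicated ServiceAccount bound to a read-only Role in a single namespace.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-terminal            # hypothetical name
  namespace: web-terminal
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: web-terminal-readonly
  namespace: web-terminal
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/log", "services", "deployments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: web-terminal-readonly
  namespace: web-terminal
subjects:
  - kind: ServiceAccount
    name: web-terminal
    namespace: web-terminal
roleRef:
  kind: Role
  name: web-terminal-readonly
  apiGroup: rbac.authorization.k8s.io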

Auditability

  • Every action flows through the Kubernetes API

  • Native audit logs capture requests

Auditability: What Is (and Is Not) Logged

This is an important clarification.

  • kubectl commands themselves are not logged

  • Kubernetes API requests are

When a user executes a command:

  • kubectl get pods

  • kubectl describe deployment ...

  • kubectl apply -f …

The resulting API calls can be recorded in Kubernetes audit logs, depending on the configured audit policy.

This means:

  • Resource access and mutations are traceable

  • RBAC enforcement is preserved

  • No hidden or opaque access paths exist (unlike SSH sessions)

This pattern relies on Kubernetes’ native security model, not on custom logging logic.

⚠️ Important

This approach does not bypass Kubernetes security controls, it depends on them.

How This Compares to Other Access Patterns

[Image: How web-based kubectl compares to other access patterns]

This pattern is not a replacement, it fills a specific operational niche.

When You Should (and Should Not) Use This

Recommended

  • Break-glass scenarios

  • Training and workshops

  • Restricted production environments

  • Short-lived support access

Not recommended

  • Daily production operations

  • CI/CD automation

  • Persistent admin workflows

The limitations are intentional. They help prevent accidental misuse.

Lessons Learned

After experimenting with this pattern in real environments, a few things became clear:

  • Kubernetes RBAC remains the single most important control

  • Auditability improves when access paths are explicit

  • Removing SSH simplifies security reviews

  • Temporary access patterns reduce long-term risk

Convenience is easy to add. Removing access later is much harder.

Final Thoughts

Secure Kubernetes access is less about tools and more about boundaries.

Browser-based kubectl access provides a minimal, auditable, and intentionally constrained way to interact with a cluster when traditional approaches are unavailable or undesirable.

Used correctly, it solves a real problem, without becoming a new one.

Reference Implementation

The repository demonstrating this pattern is available here: https://github.com/zsoterr/k8s-web-terminal-kubectl

And what's next?

I am planning a number of modifications and additions. You can find more information about these in README.md in the GitHub repository.

Note

This DEV.to post is a concise version of a longer, experience-based guide.

If you’re interested in deeper technical details, you can read it among my Medium stories.

About the Author

I’m Róbert Zsótér, Kubernetes & AWS architect. If you’re into Kubernetes, EKS, Terraform, AI and cloud-native security, follow my latest posts here:

Let’s build secure, scalable clusters, together.

Note: Originally published on Medium Browser-Based kubectl Access: Managing Kubernetes Without Bastion Hosts or Heavy Tools

React (RSC) Exploits Are Real and It's Hurting Engineering Teams

2026-01-09 00:11:01

If you work in web engineering, the transition into 2026 hasn’t been defined by new features, but by a single, terrifying realization: The boundary between your client and server is broken.

In December, "React2Shell" (CVE-2025-55182) exposed a critical RCE in React Server Components. Teams had barely patched before two more high-severity exploits emerged in the same subsystem. This is the definitive technical post-mortem.

The "React2Shell" RCE (CVE-2025-55182)

To understand why this vulnerability (CVSS 10.0) was so devastating, we first have to understand the machinery it destroyed, the React Flight Protocol.

For years, the mental model for React developers was simple: React renders UI in the browser, and an API (REST/GraphQL) fetches data. The server and client were distinct worlds, separated by a clear network boundary.

React Server Components (RSC) erased that boundary.

[Image: React Flight protocol]

In the RSC world, the server doesn't send HTML to the client. It streams a proprietary serialization format called "Flight." This stream contains descriptions of UI components, serialized data, and crucially Promises that resolve to data.

How React2Shell Works

React2Shell was not standard SQL injection or XSS. It was logic abuse targeting a deserializer. Attackers crafted HTTP POST requests with specific "Flight" payloads. These binary-like streams mimicked valid component trees while hiding a "Thenable" Object, or fake Promise.

  1. Injection: The attacker sends a serialized object containing malicious __proto__ keys.
  2. Deserialization: The server parses the stream and tries to resolve the fake Promise.
  3. Prototype Pollution: The lack of sanitization lets __proto__ overwrite Object.prototype on the running Node.js process.
  4. Gadget Chain: The modified prototype triggers internal calls targeting the Function constructor.

Accessing the Function constructor means game over: arbitrary strings like require('child_process').exec(...) are compiled and executed immediately.

How Next.js was Affected

Because Next.js App Router enables RSC by default, every Next.js application (versions 15.x, 16.x, and canary builds) was vulnerable out of the box. You didn't need to write a buggy Server Action to be hacked. You didn't even need to use Server Components explicitly. If your application booted up react-server-dom-webpack, it was listening for these Flight payloads.

[Image: client and server components]

Credits: Next.js for the image

The DevOps Nightmare

For DevOps engineers, "React2Shell" was a worst-case scenario.

  • WAF Blindness: Standard Web Application Firewalls (WAFs) are trained to look for SQL injection (' OR 1=1) or script tags (<script>). They were not trained to inspect the proprietary text format of React Flight. The malicious payloads passed right through Cloudflare and AWS WAF rules initially.
  • Supply Chain Hell: The vulnerability wasn't in user code; it was deep in node_modules. You couldn't just "fix the code." You had to wait for Vercel and the React team to release patched binaries, then rebuild and redeploy every single microservice.

The Aftershocks (DoS and Leaks)

Just as engineering teams were recovering from the RCE panic, the security community found more cracks in the foundation. On December 11, 2025, the React team issued a new advisory for three additional CVEs.

While these aren't RCEs, they are operationally devastating and highlight the fragility of the current serialization implementation.

1. Denial of Service (CVE-2025-55184)

Severity: High (7.5/10)

This vulnerability allows attackers to "freeze" infrastructure with a single request.

Mechanism:
Flight protocol allows data chunks to reference other chunks. Researchers discovered they could create circular dependencies in payloads.

  • Chunk A references Chunk B.
  • Chunk B references Chunk A.

When the React deserializer attempts to resolve this structure, it enters an infinite synchronous loop.

[Image: Node.js server event loop]

The Impact on Infrastructure:
Because Node.js is single-threaded, this infinite loop blocks the entire Event Loop.

  1. CPU Spike: The CPU immediately jumps to 100%.
  2. Request Drops: The server stops responding to all other users. Health checks fail.
  3. Cluster Destabilization: In Kubernetes, the liveness probe might fail, causing the pod to restart. But if the attacker sends a stream of these requests, they can put the entire cluster into a "CrashLoopBackOff" state, effectively taking down the application.

Note: The initial fix for this was incomplete, leading to CVE-2025-67779 (the "fix for the fix"), which forced DevOps teams to patch their systems a third time in two weeks.

2. Source Code Leak (CVE-2025-55183)

Severity: Medium (5.3/10)

Arguably the most embarrassing vulnerability for the React ecosystem. It allows attackers to trick servers into sending source code back to clients.

Mechanism:
The exploit relies on JavaScript string coercion. If a Server Action's return value causes .toString() to be implicitly called on a function inside the server runtime, V8's default behavior returns the function's source code.

"Secrets" Risk:
Developers often assume code inside use server files is private. They might write:

// DO NOT DO THIS
const STRIPE_KEY = "sk_live_12345"; 
export async function purchase() { ... }

Under normal circumstances, STRIPE_KEY stays on the server. But with CVE-2025-55183, an attacker can manipulate the serialization of the return value to dump the scope of the function, potentially revealing hardcoded API keys, internal comments, and database schema details.
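
A safer sketch of the same Server Action, assuming the key is injected through an environment variable (the variable name here is hypothetical):

"use server";

export async function purchase() {
  // Read the key from the environment at runtime instead of hardcoding it in the module
  const stripeKey = process.env.STRIPE_SECRET_KEY; // assumption: set in your deployment config
  if (!stripeKey) {
    throw new Error("Payment provider key is not configured");
  }
  // ... call the payment provider with stripeKey ...
}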

Immediate Action Plan

If you are running Next.js App Router (v13.3+) or React 19, you must execute the following plan immediately:

  1. The "Immediate" Update: Do not trust semantic versioning ranges. Pin your versions to the releases that explicitly fix all discussed CVEs.
npm install [email protected] [email protected] [email protected]
# Verify deeply nested dependencies
npm list react-server-dom-webpack

Ensure react-server-dom-webpack is at least version 19.0.2.

  2. Implement Rate Limiting:
    The DoS exploit is cheap for attackers. You must implement aggressive rate limiting on all routes, especially POST requests, at the infrastructure level (Nginx, Cloudflare, AWS WAF).

  3. Audit for Secrets:
    Assume your server code can be leaked. Run a scan of your app/ directory. If you find a single API key, hardcoded password, or internal IP address in a .ts or .js file, move it to an environment variable immediately.

Conclusion

React Server Components are a powerful evolution of the web, but "React2Shell" has proven that this power comes with a terrifying new attack surface. The Flight protocol is complex, and as we have seen, complexity is the enemy of security.

Thank you for reading! If you found this blog post helpful, please consider sharing it with others who might benefit. Feel free to check out my other blog posts and visit my socials!


CinemaSins: Everything Wrong With Austin Powers in Goldmember in 19 Minutes Or Less

2026-01-09 00:00:17

This webpage introduces a CinemaSins video, "Everything Wrong With Austin Powers in Goldmember," promising to hilariously pick apart every single "sin" in the admittedly dumb (but fun!) movie. It's basically a quick, informal dissection of the film's flaws in under 20 minutes.

Watch on YouTube