RSS preview of the blog of The Practical Developer

The “var” Error in C# — Why “The contextual keyword ‘var’ may only appear within a local variable declaration” Happens

2025-11-22 22:35:23


If you’re working with Unity or learning C#, sooner or later you’ll see this legendary error:

The contextual keyword ‘var’ may only appear within a local variable declaration

At first glance, it feels completely cryptic.

You typed var like everyone says you should… and the compiler just yelled at you.

In this article you’ll learn:

  • What var actually is in C#
  • Why it only works in local variable declarations
  • The real difference between C# and old UnityScript / JavaScript-style syntax
  • How to fix common patterns that trigger this error
  • A clean, idiomatic C# version of the sprint/movement code

1. What var really means in C#

In C#, var is a contextual keyword, not a magic dynamic type.

This works:

void Update()
{
    var speed = 10.0f;           // compiler infers: float
    var name  = "Player";        // compiler infers: string
    var pos   = transform.position; // compiler infers: UnityEngine.Vector3
}

Why? Because:

  • You’re inside a method (Update())
  • You’re declaring a local variable
  • The variable has an initializer, so the compiler can infer its type

The compiler rewrites it to:

float speed = 10.0f;
string name = "Player";
Vector3 pos = transform.position;

So far so good.

2. Where var is allowed (and where it isn’t)

var is allowed in:

  • Local variable declarations inside methods, constructors, property getters/setters, foreach, etc.

void Start()
{
    var rb = GetComponent<Rigidbody>();       // local variable
}

void Update()
{
    foreach (var enemy in enemies)           // foreach variable
    {
        // ...
    }

    using var scope = myLock.EnterScope();   // C# 8+ using declaration
}

var is NOT allowed in:

  • Fields (class-level variables)
  • Method parameters
  • Return types
  • Properties, events, indexers
  • Any place where the type must be known outside the method body

Examples of illegal var usage:

public class PlayerController : MonoBehaviour
{
    // ❌ Not allowed – this is a field
    var speed = 10.0f;

    // ❌ Not allowed – return type
    public var CreateEnemy() { ... }

    // ❌ Not allowed – parameter type
    public void Move(var direction) { ... }
}

All of these will give you some version of:

The contextual keyword ‘var’ may only appear within a local variable declaration

Because the compiler only lets var live in local variable declarations, where the type is fully inferable from the initializer and stays inside the method scope.

3. The Unity trap: mixing UnityScript and C#

Many old Unity forum posts and tutorials use UnityScript (a JavaScript-like language Unity used to support) or JScript-style syntax, such as:

#pragma strict

@script RequireComponent( CharacterController )

var moveTouchPad : Joystick;
var rotateTouchPad : Joystick;

This is not C#.

So if you try to “translate” that into C# and write something like:

var Sprint : MonoBehaviour  {

var float NaturalSpeed = 10.0f;
var float tempSpeed = 0.0f;
var float SpeedMultiplier = 1.2f;
var Vector3 moveDirection;
var float FirstPersonControl;

...
}

You’ll get multiple errors, including our friend:

The contextual keyword ‘var’ may only appear within a local variable declaration

…because in C# you don’t write types after the variable name and you don’t use var in fields.

Correct C# field declarations

Here’s how these should actually look in C#:

public class Sprint : MonoBehaviour
{
    public float naturalSpeed    = 10.0f;
    private float tempSpeed      = 0.0f;
    public float speedMultiplier = 1.2f;
    private Vector3 moveDirection;
    private float firstPersonControl;
}

Notes:

  • Access modifiers come before the type in C#: public float speed;
  • No var here — these are fields, not local variables
  • Types are float, Vector3, not var float or var Vector3

Use var inside methods, not at the field level.

4. Fixing the sprint / movement logic in clean C#

Let’s refactor the sprint-related code into idiomatic C# and solve other common errors along the way (like Vector3 + float issues).

Common error #2: Operator '+' cannot be applied to operands of type 'UnityEngine.Vector3' and 'float'

In Unity, you can’t add a Vector3 and a float directly:

// ❌ Won’t compile
transform.Translate(moveDirection + tempSpeed * Time.deltaTime);

Here moveDirection is a Vector3 and tempSpeed * Time.deltaTime is a float.

You need either:

  • A Vector3 multiplied by the float, or
  • Add the float to a specific component: moveDirection.x + something

A clean C# sprint controller

Here’s a more idiomatic version of a simple sprint/movement script:

using UnityEngine;

public class SprintController : MonoBehaviour
{
    [Header("Movement")]
    public float naturalSpeed    = 5.0f;
    public float sprintMultiplier = 1.5f;

    private Vector3 _moveDirection;
    private CharacterController _characterController;

    private void Awake()
    {
        _characterController = GetComponent<CharacterController>();
    }

    private void Update()
    {
        // 1. Read input
        float horizontal = Input.GetAxis("Horizontal");
        float vertical   = Input.GetAxis("Vertical");

        // 2. Base movement direction in local space
        _moveDirection = new Vector3(horizontal, 0f, vertical);

        if (_moveDirection.sqrMagnitude > 1f)
        {
            _moveDirection.Normalize();
        }

        // 3. Calculate speed
        float currentSpeed = naturalSpeed;

        if (Input.GetKey(KeyCode.LeftShift))
        {
            currentSpeed *= sprintMultiplier;
        }

        // 4. Apply speed and deltaTime
        Vector3 velocity = _moveDirection * currentSpeed;

        // 5. Move with CharacterController to handle collisions
        _characterController.Move(velocity * Time.deltaTime);
    }
}

Where does var fit here? In local variables:

private void Update()
{
    var horizontal = Input.GetAxis("Horizontal");   // OK – local
    var vertical   = Input.GetAxis("Vertical");     // OK – local

    var input = new Vector3(horizontal, 0f, vertical);
    // ...
}

You can use var for locals if it improves readability, but fields keep their explicit type.

5. Patterns that will always trigger this var error

Use this as a quick mental checklist. If you do this with var, you’ll get the error:

❌ Using var in a field

public class Example : MonoBehaviour
{
    var speed = 10f;        // ❌ Not allowed
}

✅ Fix:

public class Example : MonoBehaviour
{
    public float speed = 10f;
}

❌ Using var in a method signature

public var GetSpeed()            // ❌
{
    var speed = 10f;             // ✅ local – this is fine
    return speed;
}

public void Move(var direction)  // ❌
{
    // ...
}

✅ Fix:

public float GetSpeed()
{
    var speed = 10f;   // OK
    return speed;
}

public void Move(Vector3 direction)
{
    // ...
}

❌ Trying to “combine” UnityScript-style syntax with C#

// ❌ This is not C#
var float NaturalSpeed = 10.0f;
var Vector3 moveDirection;

✅ Fix:

// ✅ Proper C#
public float naturalSpeed = 10.0f;
private Vector3 moveDirection;

6. When should you actually use var in C#?

Now that you know the rules, here’s a practical guideline:

👍 Great places to use var

  • When the type is obvious on the right-hand side:

var rb    = GetComponent<Rigidbody>();
var enemy = new Enemy();           // clear enough
var pos   = transform.position;    // Unity devs know this is Vector3

  • When working with LINQ or long generic types:

var grouped = items
    .GroupBy(x => x.Category)
    .ToDictionary(g => g.Key, g => g.ToList());

👎 Places to avoid var

  • Public API boundaries (method parameters/return types)
  • Fields that represent important domain concepts and benefit from being explicit:

// Prefer explicit type here
private float _moveSpeed;
private Vector3 _currentDirection;

Use var primarily for local inference, not to hide important types.

7. Key takeaways

Let’s recap the big ideas so you never get bitten by this error again:

  1. var is only for local variables inside methods, not for fields, parameters or return types.
  2. It requires an initializer so the compiler can infer the type.
  3. UnityScript syntax (var x : float) is not C#. In modern Unity projects, you should use pure C#.
  4. For sprint/movement code:
    • Use explicit fields for configuration (float, Vector3)
    • Use var only for locals when it makes the code more readable
    • Don’t add Vector3 + float directly; multiply Vector3 by float instead
  5. When in doubt, start with explicit types, then introduce var where it’s safe and clear.

Written by: Cristian Sifuentes – C# / .NET Engineer | Unity Enthusiast | Clean Code Advocate

Have you run into other confusing C# compiler errors in your Unity projects?

Drop them in the comments — they often hide deeper language concepts that are worth mastering.

Simple Linear Regression

2025-11-22 22:33:15

Loading the Required Packages

To proceed with reading the data, performing numerical operations, and visualizing relationships, we need the following libraries:

  • pandas – for reading and handling CSV files
  • numpy – for working with arrays and numerical transformations
  • matplotlib – for plotting and visual exploration of the data

Installation (run once):

pip install pandas numpy matplotlib

Dataset Introduction – The Classic Advertising Dataset

This is the famous Advertising dataset from the book Introduction to Statistical Learning (ISLR).

  • All monetary values are in thousands of dollars
  • TV – advertising budget spent on television
  • Radio – advertising budget spent on radio
  • Newspaper – advertising budget spent on newspapers
  • Sales – actual sales (target variable)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Level 2: Loading and Initial Inspection of the Dataset

Loading the Dataset

Reads the CSV file and stores it in a pandas DataFrame called df.

(If your file has a different name or path, adjust the string accordingly.)

df = pd.read_csv("/home/pyrz-tech/Desktop/MachineLearning/advertising.csv")

Preview the First Rows

df.head() displays the first 5 rows of the DataFrame, allowing a quick visual verification of the loaded data.

df.head()

Dataset Dimensions

df.shape Returns the total number of rows and columns in the dataset.

df.shape
(200, 4)

Column Information

df.info() Shows column names, data types, non-null counts, and memory usage.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB

Descriptive Statistics

df.describe() Provides summary statistics (count, mean, std, min, quartiles, max) for numerical columns.

df.describe()

Quick Summary of What We’ve Seen So Far

After running the basic checks, we confirmed:

  • Shape: 200 rows × 4 columns
  • All feature columns (TV, Radio, Newspaper) and the target (Sales) are of type float64
  • No missing values
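
The “no missing values” observation can be confirmed explicitly with a quick check (not part of the original walkthrough):

df.isnull().sum()   # every count is 0, so there are no missing values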

Visual Inspection of Individual Feature–Sales Relationships

We now carefully examine the relationship between each advertising channel and Sales using individual scatter plots with regression lines. The goal is to visually assess:

  • Strength of the linear relationship
  • Density and spread of points around the fitted line
  • Which feature appears to have the strongest and most compact linear relationship with Sales

Visual Inspection Using Matplotlib’s scatter() Method

We now plot the relationship between each advertising feature and Sales using pure matplotlib.scatter() (no seaborn regplot) so that we can fully control the appearance and clearly see the raw data points.

plt.scatter(df.TV, df.Sales)


plt.scatter(df.Radio, df.Sales)


plt.scatter(df.Newspaper, df.Sales)

Visual Analysis Summary and Feature Selection for Simple Linear Regression

As observed in the scatter plots above:

  • All three advertising channels (TV, Radio, Newspaper) show a positive relationship with Sales.
  • The TV advertising budget exhibits the strongest, most densely clustered, and clearest linear relationship with Sales.
  • The TV feature has the steepest slope, the tightest spread around the trend, and the fewest apparent outliers.

Therefore, based on visual inspection and exploratory analysis, we select TV as the single predictor variable for our Simple Linear Regression model.

Selected Feature

Feature: TV

Target: Sales

Creating a Clean Subset for Focused Analysis

To work more cleanly and concentrate only on the selected feature (TV) and the target (Sales), we create a new DataFrame called cdf (clean DataFrame) containing just these two columns.

From now on, we will perform all subsequent steps (visualization, modeling, evaluation) using cdf instead of the full df. This keeps our workspace focused and readable.

cdf = df[['TV', 'Sales']]

Train-Test Split (Manual Random Split)

We now split the clean dataset (cdf) into training and test sets using a simple random mask.

Approximately 80 % of the data will be used for training and the remaining 20 % for testing.

This is a common manual approach when we want full control over the splitting process without importing train_test_split from scikit-learn.

train and test DataFrames are ready for model training and evaluation.

msk = np.random.rand(len(cdf)) < 0.8
train = cdf[msk]
test = cdf[~msk]

print(f'msk => {msk[:4]} ...')
print(f'train => {train.head()}')
print('...')
print(f'test => {test.head()} ...')
print('...')
print(f'len(train) => {len(train)}')
print(f'len(test) => {len(test)}')
msk => [ True  True  True False] ...
train =>       TV  Sales
0  230.1   22.1
1   44.5   10.4
2   17.2   12.0
5    8.7    7.2
6   57.5   11.8
...
test =>        TV  Sales
3   151.5   16.5
4   180.8   17.9
8     8.6    4.8
9   199.8   15.6
10   66.1   12.6 ...
...
len(train) => 156
len(test) => 44
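
For comparison, the same 80/20 split could be done with scikit-learn’s train_test_split (a minimal sketch, not used in the rest of this walkthrough); passing random_state makes the split reproducible, while the random mask above changes on every run:

from sklearn.model_selection import train_test_split

# Alternative split: 20% of rows go to the test set, seeded for reproducibility
train_df, test_df = train_test_split(cdf, test_size=0.2, random_state=42)
print(f'len(train_df) => {len(train_df)}')
print(f'len(test_df) => {len(test_df)}')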

Visualizing the Training and Test Sets on the Same Plot

Before training the model, we plot both the training and test data points on the same scatter plot (with different colors) to visually confirm that:

  • The split appears random
  • Both sets cover the same range of TV and Sales values
  • There is no systematic bias in the split

plt.scatter(train.TV, train.Sales)
plt.scatter(test.TV, test.Sales, color='green')

Converting Training Data to NumPy Arrays

For the scikit-learn LinearRegression model, we need the feature and target variables as NumPy arrays (or array-like objects).

We use np.asanyarray() to convert the pandas columns from the training set into the required format.

train_x = np.asanyarray(train[['TV']])
train_y = np.asanyarray(train[['Sales']])

Fitting the Simple Linear Regression Model

We now import the LinearRegression class from scikit-learn, create a model instance, and train it using the prepared training arrays (train_x and train_y).

After running, the simple linear regression model is fully trained using only the TV advertising budget to predict Sales.

The coefficient tells us how much Sales increases (in thousand units) for every additional thousand dollars spent on TV advertising.

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(train_x, train_y)
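
To inspect the learned parameters (a quick check; the exact values depend on the random split), print the slope and intercept directly:

print(f'slope (coef_) => {reg.coef_[0][0]}')
print(f'intercept => {reg.intercept_[0]}')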

Visualizing the Fitted Regression Line

In this step we plot the training data points together with the regression line found by the model. This allows us to visually verify that the fitted line reasonably captures the linear relationship between TV advertising and Sales.

The line is drawn using the learned parameters:

  • reg.coef_[0][0] → slope of the line
  • reg.intercept_[0] → y-intercept

plt.scatter(train_x, train_y)
plt.plot(train_x, reg.coef_[0][0] * train_x + reg.intercept_[0], '-g')

Preparing Test Data and Making Predictions

We convert the test set to NumPy arrays (required format for scikit-learn) and use the trained model to predict Sales values for the test observations.

test_x = np.asanyarray(test[['TV']])
test_y = np.asanyarray(test[['Sales']])
predict_y = np.asanyarray(reg.predict(test_x))

Evaluating Model Performance with R² Score

We import the r2_score metric from scikit-learn to measure how well our Simple Linear Regression model performs on the test set.

The R² score (coefficient of determination) tells us the proportion of variance in Sales that is explained by the TV advertising budget.

  • R² ≈ 1.0 → perfect fit
  • R² ≈ 0 → model explains nothing

from sklearn.metrics import r2_score

Computing and Displaying the R² Score

We use the imported r2_score function to calculate the coefficient of determination on the test data and print the result directly.

This single line gives us the final performance metric: the higher the value (closer to 1.0),
the better our simple linear regression model using only TV advertising explains the variation in Sales.

print(f'r^2 score is : {r2_score(test_y, predict_y)}')
r^2 score is : 0.8674734235783073
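
As a sanity check (a small sketch, not part of the original notebook), the same value can be computed by hand from the definition R² = 1 − SS_res / SS_tot:

ss_res = np.sum((test_y - predict_y) ** 2)      # residual sum of squares
ss_tot = np.sum((test_y - test_y.mean()) ** 2)  # total sum of squares
print(f'manual r^2 => {1 - ss_res / ss_tot}')   # matches r2_score(test_y, predict_y)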




Follow me on GitHub:

https://github.com/PyRz-Tech

Optimizing Data Processing on AWS with Data Compaction

2025-11-22 22:18:28

Original Japanese article: AWSでの効率的なデータ処理を考える~データコンパクション~

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

In my previous article,
Designing a Cost-Efficient Parallel Data Pipeline on AWS Using Lambda and SQS,
I introduced a pattern where a large file is split into chunks using AWS Lambda and processed in parallel through SQS.

In this article, we look at the opposite scenario: when a large number of small event files—such as data from IoT devices or application logs—are continuously generated and uploaded.
For these use cases, we explore the Compactor Pattern.

What Is the Small-File Problem in Data Lakes?

When building a data lake, we often need to ingest huge numbers of small files, especially from IoT devices or application logs.
Over time, this can negatively impact performance in several ways:

  • A massive number of tiny files accumulate in S3, increasing load on table storage and metadata management
  • Query performance in Athena / AWS Glue degrades when scanning many small files
  • Frequent snapshot updates in Iceberg/Delta increase costs and contention

By the way, in my previous article I implemented a backoff mechanism, but tuning it against the number of conflicts was painful… (that implementation needs a revisit!)

What Is the Compactor Pattern?

The Compactor Pattern is an approach that periodically merges many small files in a data lake into fewer large files.

By consolidating files, we can reduce query overhead, metadata pressure, and performance bottlenecks.

Typical Flow

  1. Scheduled or Trigger-Based Execution
    Run compaction periodically (e.g., every hour/day) or when a threshold number of files is reached.

  2. Small File Detection
    Scan S3 or Iceberg/Delta manifests to detect small files.

  3. Merge (Compaction)
    Use AWS Glue (or similar) to merge files and rewrite them as larger Parquet files.

  4. Cleanup
    Remove old small files or unused snapshots (garbage collection).

Pre-Compaction: Compact Before Writing to the Data Lake

In this pattern, incoming small files are buffered (temporary storage, queue, etc.), compacted, and only then written into the data lake.
Think of it as cleaning up at the entrance.

Pros

  • Optimized file structure from the start
  • Reduces load and snapshot contention in the data lake
  • Simpler Iceberg/Delta table management

Cons

  • Higher latency (buffering required)
  • Reduced real-time characteristics
  • If compaction fails before writing, data-loss risk exists → retry design is important

Post-Compaction: Compact After Writing to the Data Lake

In this pattern, small files are written directly into Iceberg/Delta, and compaction is performed later by a separate job.
Think of it as cleaning up at the exit.

Pros

  • Lowest write latency
  • Friendly for real-time ingestion
  • Lower write-failure risk (files written in small chunks)

Cons

  • Small files temporarily accumulate, degrading performance
  • Snapshot/transaction conflicts may increase in Iceberg/Delta

Implementing Pre-Compaction

Architecture

This pattern consists of two major components:

  1. Ingest (File Registration)
  • Small files uploaded from IoT devices or services → stored in S3
  • S3 Event triggers Lambda
  • Lambda registers metadata (URI, size, status=PENDING) into DynamoDB
  2. Compaction
  • Triggered when file count or total size exceeds threshold
  • Lambda merges files (DuckDB in this sample)
  • Writes merged Parquet to S3 / Iceberg

Sample Code: Ingest

# ingest.py
import os
import json
import uuid
import boto3
from urllib.parse import unquote_plus

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['CHUNK_TABLE'])

def lambda_handler(event, context):
    for rec in event['Records']:
        bucket = rec['s3']['bucket']['name']
        key = unquote_plus(rec['s3']['object']['key'])
        size = rec['s3']['object']['size']
        uri = f's3://{bucket}/{key}'
        table.put_item(
            Item={
                'ChunkId': str(uuid.uuid4()),
                'Uri': uri,
                'SizeBytes': size,
                'Status': 'PENDING',
                'Timestamp': int(context.aws_request_id[:8], 16)  # numeric value derived from the request ID (not a wall-clock timestamp)
            }
        )
    return {'statusCode': 200, 'body': json.dumps({'message': 'Registered'})}

Sample Code: Compaction

# compaction.py
import os
import boto3
import duckdb
import time
from datetime import datetime
from pyiceberg.catalog.glue import GlueCatalog

# Environment variables
TABLE_NAME = os.environ['CHUNK_TABLE']
TARGET_TOTAL_SIZE = int(os.environ.get('TARGET_TOTAL_SIZE', 100 * 1024 * 1024))  # Default 100MB
ICEBERG_CATALOG_NAME = os.environ.get('ICEBERG_CATALOG_NAME', 'my_catalog')
ICEBERG_NAMESPACE = os.environ.get('ICEBERG_NAMESPACE', 'icebergdb')
ICEBERG_TABLE_NAME = os.environ.get('ICEBERG_TABLE_NAME', 'yellow_tripdata')

# AWS clients
dynamodb = boto3.resource('dynamodb')
s3 = boto3.client('s3')

def lambda_handler(event, context):
    items = get_pending_items()
    selected_items = []
    accumulated_size = 0

    for item in items:
        item_size = item['SizeBytes']
        if accumulated_size + item_size > TARGET_TOTAL_SIZE:
            break
        selected_items.append(item)
        accumulated_size += item_size

    if not selected_items:
        return {'message': 'Below threshold, skipping processing', 'count': 0, 'size': 0}

    uris = [item['Uri'] for item in selected_items]

    print(f"Executing merge process {uris}")

    arrow_table = merge_parquet_in_memory(uris)

    print(f"arrow_table {arrow_table}")

    append_to_iceberg(arrow_table)
    mark_done([item['ChunkId'] for item in selected_items])

    return {'message': 'Compaction completed', 'merged_rows': arrow_table.num_rows}

def get_pending_items():
    table = dynamodb.Table(TABLE_NAME)
    resp = table.scan(
        FilterExpression="#st = :pending",
        ExpressionAttributeNames={'#st': 'Status'},
        ExpressionAttributeValues={':pending': 'PENDING'}
    )
    return resp.get('Items', [])

def merge_parquet_in_memory(uris):
    con = duckdb.connect(database=':memory:')
    con.execute("SET home_directory='/tmp'")
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")

    # Read and merge Parquet files
    df = con.read_parquet(uris, union_by_name=True).arrow()
    return df

def append_to_iceberg(arrow_table, retries=5):
    catalog = GlueCatalog(region_name="ap-northeast-1", name=ICEBERG_CATALOG_NAME)
    delay = 10

    for attempt in range(retries):
        try:
            table = catalog.load_table(f"{ICEBERG_NAMESPACE}.{ICEBERG_TABLE_NAME}")
            table.refresh()
            current_snapshot = table.current_snapshot()
            snapshot_id = current_snapshot.snapshot_id if current_snapshot else "None"
            print(f"Attempt {attempt + 1}: Using snapshot ID {snapshot_id}")

            table.append(arrow_table)
            print("Data has been appended to the Iceberg table.")
            return
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if "Cannot commit" in str(e) or "branch main has changed" in str(e):
                if attempt < retries - 1:
                    delay *= 2
                    print(f"Retrying after {delay} seconds.")
                    time.sleep(delay)
                else:
                    print("Maximum retry attempts reached. Aborting process.")
                    raise
            else:
                raise

def mark_done(ids):
    table = dynamodb.Table(TABLE_NAME)
    for cid in ids:
        table.update_item(
            Key={'ChunkId': cid},
            UpdateExpression="SET #st = :c",
            ExpressionAttributeNames={'#st': 'Status'},
            ExpressionAttributeValues={':c': 'COMPACTED'}
        )

Results

Uploaded Files

Registered Data in DynamoDB

Files on Iceberg

Points to Consider

  • Backoff tuning
    Iceberg snapshot conflicts happen frequently, so retry/backoff strategy must be tuned based on your environment.

  • File size control
    Optimal Iceberg file size is typically 128 MB–1 GB.

  • EventBridge trigger frequency
    Too slow → loss of freshness
    Too fast → wasted invocations, duplicate compaction risks
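
On the last point, the schedule itself can be defined as an EventBridge rule. The sketch below uses boto3 with hypothetical names (rule name, Lambda ARN); it also assumes the compaction Lambda has already granted EventBridge invoke permission via lambda add_permission, which is omitted here:

import boto3

events = boto3.client('events', region_name='ap-northeast-1')

# Run the compaction Lambda once per hour; tune the rate against your freshness needs
events.put_rule(
    Name='compaction-schedule',            # hypothetical rule name
    ScheduleExpression='rate(1 hour)',
    State='ENABLED'
)

events.put_targets(
    Rule='compaction-schedule',
    Targets=[{
        'Id': 'compaction-lambda',
        'Arn': 'arn:aws:lambda:ap-northeast-1:123456789012:function:compaction'  # hypothetical Lambda ARN
    }]
)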

Implementing Post-Compaction

This is a much simpler setup.

Architecture

AWS recommends using Athena to run OPTIMIZE and VACUUM operations:
https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-data-optimization.html

Sample Code: Post-Compaction Lambda (Optimize and Vacuum)

This Lambda executes OPTIMIZE and VACUUM commands on the Iceberg table via Athena.

import boto3

athena = boto3.client('athena', region_name='ap-northeast-1')

TEMP_OUTPUT = 's3://20250421testresult/'

def lambda_handler(event, context):
    queries = [
        "OPTIMIZE icebergdb.yellow_tripdata REWRITE DATA USING BIN_PACK",
        "VACUUM icebergdb.yellow_tripdata"
    ]

    for query in queries:
        response = athena.start_query_execution(
            QueryString=query,
            QueryExecutionContext={'Database': 'icebergdb'},
            ResultConfiguration={'OutputLocation': TEMP_OUTPUT}
        )
        print(f"Started Athena query: {response['QueryExecutionId']}")

Results

Before Execution


After Execution


As you can see, the table has been successfully optimized.

Automatic Compaction with AWS Glue (Post-Compaction)

Iceberg tables registered in Glue Data Catalog can use Table Optimizer, which supports:

  • Automatic compaction
  • Snapshot retention
  • Orphan file cleanup

Docs:
https://docs.aws.amazon.com/glue/latest/dg/table-optimizers.html
https://docs.aws.amazon.com/glue/latest/dg/compaction-management.html

Configuration

Make sure to check all three options: compaction, snapshot retention, and orphan file cleanup.

Notes

  • Charged per DPU → cost increases with fragmentation
  • Only available in supported regions

Use Cases for Each Compaction Approach

Lambda-Based Compaction (Pre-Compaction)

Use Cases

  • When Glue auto-compaction cannot be used, such as with Delta Lake.
  • When you want to implement compaction logic tailored to complex business requirements.
  • When you want to leverage existing serverless infrastructure like Lambda/Step Functions for flexible configurations.

Pros

  • Flexible logic implementation: Can freely customize file selection criteria and merge procedures.
  • Multi-format support: Works with Iceberg, Delta Lake, and other formats.
  • Cost control: Lambda runs only when needed, avoiding DPU billing.

Cons

  • High implementation and operational cost: Requires building and managing Lambda, DynamoDB, EventBridge, etc.
  • Increased monitoring effort: Custom metrics and failure detection logic must be implemented and maintained.
  • Scalability considerations: Be mindful of performance bottlenecks with large datasets.

Lambda-Based Compaction (Post-Compaction)

Use Cases

  • When Glue auto-compaction cannot be used, such as with Delta Lake.
  • When you want to automate periodic file consolidation while keeping operational overhead low.

Pros

  • Low implementation effort: Only need to run queries via Athena from Lambda.

Cons

  • Scalability considerations: Performance bottlenecks may appear with large datasets.

AWS Glue Auto-Compaction (Post-Compaction)

Use Cases

  • When managing Apache Iceberg tables centrally via the Glue Catalog.
  • When you want to automate periodic file consolidation while minimizing operational overhead.
  • When you prefer to rely on standard features without custom compaction logic, suitable for medium to large-scale data lakes.

Pros

  • Minimal implementation effort: Enable via Glue console or CLI.
  • Simplified management: Monitor via CloudWatch metrics and the Glue console.
  • Native support: Supports compaction, snapshot retention, and orphan file deletion.

Cons

  • Glue DPU billing: Charged per minute, costs may increase depending on frequency.
  • Limitations: Automatic processing is limited to Iceberg.
  • Trigger flexibility: For fine-grained or dynamic triggers, additional design is required.

Conclusion

In this article, we discussed data compaction—an important consideration when operating a data lake.
In a data lake, files of various sizes are ingested at different times. When a large number of small files accumulate, processing efficiency can degrade.

By performing compaction as introduced in this article, you can maintain an environment that allows for efficient data processing. There are several approaches available, so you should choose the configuration that best fits your requirements and the current state of your system.

I hope this article serves as a useful reference for designing an optimal architecture for your data lake.

The Ultimate Technical Writing Stack for 2025

2025-11-22 22:18:27

Technical writing in 2025 isn’t something you can treat as an afterthought anymore. Over the last few years, I’ve watched docs go from “nice to have” to a major part of how teams onboard users, support developers, and even ship product updates. And honestly, the pressure has only increased.

I’ve had to rethink the way I work. Developers expect answers instantly. Products ship updates faster than ever. APIs change without warning. And documentation, if it’s not well-structured and well-maintained, becomes the bottleneck for everyone.

This article isn’t a list of popular tools. I’m not interested in trend-chasing or giving you ten random apps you’ll never use. Instead, I want to walk you through the stack that has actually held up for me and for other writers I’ve worked with. These are tools that make writing smoother, keep docs accurate, and help teams adapt without everything falling apart.

1. Documentation Platforms That Don’t Break When You Grow

DeveloperHub

If you need structured, developer-facing documentation without wrestling with config files, DeveloperHub gets out of the way and lets you write. It handles navigation, versioning, AI search and reusable components cleanly, which is why a lot of teams end up choosing it once their docs stop being “small.”

Why teams switch to DeveloperHub once their docs stop being “cute and small”:

  • No config hell. No endless YAML acrobatics, no build failures because you forgot a comma.
  • Navigation built for grown-up docs. Once you go beyond a handful of pages, a bad TOC structure becomes a pain. DeveloperHub keeps it clean.
  • Real versioning. Not “fake branches.” Actual production-ready version control for complex products.
  • AI search that doesn’t embarrass you. Users actually find what they need without you manually tuning search indexes every week.

Apidog

Whenever I’m documenting anything API-related, Apidog saves me time, a lot of time. Instead of manually fixing broken examples or outdated parameters, Apidog just pulls from your OpenAPI/Swagger file. Your docs stay tied to real schema changes, and you’re not constantly “catching up” to engineering. If your product revolves around APIs, this tool honestly feels like cheating in the best way.

GitBook

GitBook is simple and clean, and sometimes that’s exactly what you need. I like it for early-stage products or teams with lighter documentation. You’ll eventually hit limits if your structure gets more complex, but if simplicity is your priority, it’s a solid choice.

Docusaurus

If your team prefers everything inside Git (Markdown, versioning, PR reviews, CI/CD), Docusaurus is still one of the most dependable options. It does require setup and a bit of discipline, but when it’s configured properly, it’s incredibly powerful. I tend to recommend it for engineering-heavy teams that already live in GitHub.

2. Authoring Tools That Make Writing Less Painful

VS Code + Markdown

When I’m working in a docs-as-code environment, nothing beats this combo. Clean diffs, predictable formatting, and extensions that actually help instead of getting in the way. It’s not fancy, but it’s reliable.

Obsidian

For early drafts, planning, and personal notes, Obsidian has become my default workspace. The linking system makes it easy to map large topics before formalizing them. Half of my documentation ideas start here before they ever make it into a platform.

3. Tools That Make Docs Easier to Understand

Search: Algolia or Meilisearch

Good search changes everything. Most users don’t navigate your sidebar; they search. If that search is bad, your entire documentation experience suffers. I’ve seen teams switch to Algolia or Meilisearch and immediately see fewer support questions.

Mermaid & Excalidraw

I got tired of outdated diagrams years ago. Mermaid lets me keep diagrams text-based, versioned, and easy to update. Excalidraw handles the quick sketches or conceptual visuals that need to feel more human.

OpenAPI / JSON Schema

If your docs involve APIs, using schemas is non-negotiable. It keeps things consistent and prevents the classic “the docs say one thing, the API does another” problem.

4. Workflow Tools That Keep Everything Moving

Linear or Jira

I treat documentation as real work, meaning it deserves its own backlog. The moment I started tracking docs in a proper ticketing system, the quality and predictability of updates improved dramatically.

Slack + GitHub Integration

If your docs team isn’t wired into GitHub through Slack, you’re basically choosing to be blind. Changes to the product should never sneak up on writers. Ever.

Pull requests, merged updates, feature flags going live: those things need to hit your Slack instantly. When the alerts flow straight into a channel the docs team actually watches, you see trouble before it turns into a fire.

A PM tweaks an API? You know. A developer merges a breaking UI change at 11 p.m.? You know. A feature ships without telling anyone? Not with this setup.

This pipeline keeps writers reacting early, not scrambling after the damage is already done.

CI/CD for Docs

Manual publishing is where most documentation problems start. I’ve lost count of how many times I’ve seen teams ship a feature and forget to update the docs simply because the publishing step depended on someone remembering to click a button.

Automated pipelines remove that entire layer of risk. When docs are wired into CI/CD, they build and deploy the same way the product does: consistently, predictably, and without anyone babysitting the process. Every merged PR triggers an update. Every version release builds a matching documentation version. And if something breaks (a missing file, a bad link, an outdated reference), the pipeline catches it before users ever see it.

5. AI as a Support Tool (Not the Writer)

I use AI the same way I’d use a junior assistant: to summarize PRs, draft example snippets, or help brainstorm structure. But the final decisions (accuracy, clarity, tone, narrative) still need a human. AI is helpful, but it’s not the writer.

Picking the Right Stack (Short Version)

If you’re solo or a small team:

  • Apidog (if you handle APIs)
  • Obsidian
  • Docusaurus

If you’re scaling fast:

  • DeveloperHub
  • OpenAPI
  • Algolia
  • GitHub + CI/CD

If you want simplicity that still works:

  • DeveloperHub or GitBook
  • VS Code
  • Mermaid

Final Thoughts

The right tools make documentation feel lighter, not heavier. Over the years, I’ve learned that a good stack doesn’t just help writers; it helps entire teams move faster. When your docs are clear, accurate, and easy to update, everything else becomes simpler.

2025 isn’t about using every shiny new tool out there. It’s about finding the combination that lets you write clearly, keep up with product changes, and help developers get the answers they need without friction. And honestly, once you settle into a stack that actually works for your workflows, not someone else’s, the entire documentation process stops feeling like a chore and starts feeling like part of the product itself.

I’ve seen teams where the docs become the source of truth, not the afterthought. That doesn’t happen by accident. It happens when your tools support the way you think, the way you write, and the speed at which your product evolves. If there’s one thing I’ve learned, it’s this: when your documentation stack is solid, everything downstream gets easier, from onboarding and support to product adoption and even internal communication.

My AI Stopped "Guessing" and Started "Thinking": Implementing a Planning & Reasoning Architecture

2025-11-22 22:11:54

In previous articles, I talked about how I generate tests using LLMs, parse Swagger schemas, and fight against hardcoded data. But "naked" LLM generation has a fundamental problem: it is linear. The model often tries to guess the next step without understanding the big picture.

Yesterday, I deployed the biggest architectural update since I started development — the System of Planning and Reasoning.

Now, Debuggo doesn't just "write code." It acts like a Senior QA: first, it analyzes requirements, assesses risks, decomposes the task into subtasks, and only then begins to act.

I want to show you "under the hood" how this works and, most importantly, honestly compare: did it actually get faster?

The Problem: Why Does AI Get Lost?

Previously, if I asked: "Create a group, add a user to it, verify the table, and delete the group", the AI often lost context halfway through the test. It might forget the ID of the created group by the time it needed to delete it, or start clicking on elements that hadn’t loaded yet.

I needed the AI to "stop and think" before pushing buttons.

The Solution: Agentic Architecture

I implemented a multi-layer system based on the ReAct (Reasoning + Acting) pattern and state machines.

Here is what the test generation architecture looks like now:

graph TD
    A[Test Case] --> B[Planning Agent]
    B --> C{Analysis}
    C --> D[Complexity & Risks]
    C --> E[Dependencies]
    C --> F[Subtasks]
    D & E & F --> G[Execution Plan]
    G --> H[Reasoning Loop]
    H --> I[Step Generation]

1. Planning Agent: The Brain of the Operation

Before generating a single step, the Planning Agent launches. It performs a static analysis of the future test.

Complexity Score

The agent calculates the mathematical complexity of the test from 0 to 100:

  • Number of steps × 10 (max 50 points)
  • Diversity of actions (click, type, hover) × 5 (max 30 points)
  • Test type (API adds +20 points)

If the complexity is above 80, the system automatically switches to "High Alert" mode (stricter validation and more frequent DOM checks).
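
Here is a rough sketch of how such a score could be computed, based purely on the description above (my reading of it, not Debuggo’s actual code):

def complexity_score(actions, test_type="UI"):
    """Toy version of the complexity heuristic: one entry in `actions` per test step."""
    step_points = min(len(actions) * 10, 50)        # number of steps × 10, capped at 50
    action_points = min(len(set(actions)) * 5, 30)  # diversity of actions × 5, capped at 30
    type_points = 20 if test_type == "API" else 0   # API tests add +20
    return step_points + action_points + type_points

# 6 steps using 6 distinct action types in a UI test → 50 + 30 + 0 = 80
print(complexity_score(["navigate", "type", "click", "hover", "wait", "assert"]))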

Decomposition into Subtasks

Instead of a "wall of text" with 15 steps, the system breaks the test into logical blocks. Example from the system logs:

┌─────────────────────────────────────────────────────────┐
│ Subtask 1: Fill out Group Creation Form                 │
├─────────────────────────────────────────────────────────┤
│ Steps: [1-5] | Actions: navigate → type → click         │
└─────────────────────────────────────────────────────────┘
          ↓
┌─────────────────────────────────────────────────────────┐
│ Subtask 2: Verify Result                                │
├─────────────────────────────────────────────────────────┤
│ Steps: [9-12] | Actions: wait → assert                  │
└─────────────────────────────────────────────────────────┘

This allows the AI to focus on a specific micro-goal without losing the overall context.

2. Reasoning System: The ReAct Pattern

The most interesting part happens during the generation process. I abandoned direct prompting in favor of a Reasoning Loop (Thought → Action → Observation).

Now, every step goes through a cycle like this:

Turn 2: Planning
├─ Thought: "I need to split the test into 4 subtasks"
├─ Action: "Create execution plan"
├─ Observation: "4 subtasks, 78% confidence"
└─ Confidence: 0.88

The system literally "talks to itself" (saving this conversation to the DB), making decisions based on a Confidence Score. If confidence drops below 0.75, the system pauses to look for an alternative path.

3. Self-Healing (Error Recovery)

Even with a cool plan, things can go wrong. I implemented a State Machine that handles failures on the fly.

For example, if the AI gets a selector_not_found error, it triggers the MODIFY strategy:

  1. The Agent re-analyzes the HTML.
  2. Finds an alternative anchor (e.g., text instead of ID).
  3. Generates a new selector.
  4. Updates the step and retries.
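
To make the MODIFY strategy concrete, here is a minimal, generic fallback sketch in Playwright for Python (my own illustration, not Debuggo’s code; the selector and anchor text are hypothetical):

from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

def click_with_fallback(page, selector, anchor_text, timeout_ms=3000):
    """Try the primary selector first; on failure, re-anchor on visible text."""
    try:
        page.locator(selector).click(timeout=timeout_ms)
    except PlaywrightTimeoutError:
        # selector_not_found: fall back to a text anchor instead of the brittle selector
        page.get_by_text(anchor_text, exact=True).click(timeout=timeout_ms)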

Real Benchmarks: The Cost of "Thinking"

Implementing agents isn't free. "Thinking" takes time. I decided to check if it was worth it by comparing the generation of the exact same tests before and after implementing Reasoning.

The results were unexpected.

Test 1: Simple (EULA Popup)
Goal: Login and accept the agreement.

  • Before Reasoning (Linear): 00:58 (4 steps)
  • After Reasoning (Agent): 03:22 (5 steps)
  • Verdict: 📉 Slower.

The system spent time planning a simple task. However, it automatically added a 5th step: verifying that we are actually on the homepage after accepting the EULA. Previously, this step was skipped.
Takeaway: Slower, but the test became more reliable.

Test 2: Medium (E2E User Creation)
Goal: Create an admin, logout, login as the new admin.

  • Before Reasoning (Linear): 06:38 (20 steps)
  • After Reasoning (Agent): 09:56 (20 steps)
  • Verdict: 😐 Overhead.

The number of steps didn't change. The linear model handled it fine, while the Agent spent an extra 3 minutes "thinking" and checking dependencies. This is the honest price of architecture.

Test 3: Complex (Download Template)
Goal: Find a specific template deep in the menu, download it, verify the list.

This is where the magic happened.

  • Before Reasoning (Linear): 23:38 (39 steps!)
  • After Reasoning (Agent): 08:11 (12 steps!)
  • Verdict: 🚀 3x Faster and Cleaner.

Why the difference? Without planning, the old model got "lost." It clicked the wrong places, went back, tried again—generating 39 steps of garbage and errors. The new model built a plan first, understood the direct path, and did everything in 12 steps.

Main Takeaway

Yes, on simple tests we see a dip in generation speed (overhead for LLM work). But on complex scenarios, where a standard AI starts "hallucinating" and walking in circles, the Planning Agent saves tens of minutes and produces a clean, optimal test.

The AI doesn't get lost anymore.

Current version metrics:

  • Plan Confidence: > 0.75
  • Error Rate: < 5%
  • Recovery Success: > 80%

I'm continuing to monitor this architecture. If you have complex cases that usually break test generators—I'd be happy if you tried them in Debuggo.

NPR Music: Ghost-Note: Tiny Desk Concert

2025-11-22 22:08:39

Ghost-Note’s Tiny Desk concert kicked off with Robert “Sput” Searight’s trademark “buckle up,” and never let go. The supergroup—born in 2015 as a drum-and-percussion duo by Searight and Nate Werth (of Snarky Puppy fame)—laid down gritty funk in tracks like “JB’s Out” and “Move with a Purpose,” complete with tight call-and-response riffs and bubbling harmonies. Dominique Xavier Taplin’s spacey keys paved the way for Mackenzie Green’s sultry “Synesthesia,” and Searight amped the energy even higher on “Be Somebody,” a loving nod to James Brown.

They wrapped things on a high note with “Slim Goodie,” a playful love story that features fiery percussion solos from Werth and Searight and Mackenzie Green’s pleading vocals that leave you craving your own Slim Goodie. With a full lineup of drums, horns, guitar, bass, keys, and vocals, Ghost-Note proved their evolution from a duo into a full-on funk powerhouse.

Watch on YouTube