
High Performance, Low Cost: Building a Professional RAG Chatbot from Scratch

2026-01-27 16:33:08

Building a RAG Chatbot from Scratch: Part 1

Choosing the Right Engine

Hello everyone! Today, I’m kicking off a short series where I’ll be documenting my journey of building a specialized chatbot. Unlike a standard chatbot that provides general answers, I want this one to have a very specific "job": answering questions based on the 2024 Indonesian Government Financial Reports compiled by the Ministry of Finance.

You might be wondering: "What’s the difference between a regular chatbot and a RAG-based chatbot?" The primary difference lies in the information source and how the AI formulates its response.

Understanding the RAG Difference

In a standard AI setup, the process is quite linear: the user asks a question, and the AI responds based on the data it was trained on. However, for my project, I am adding a critical component that prevents the AI from needing to "guess" or rely on outdated training data.


By adding Stored Information (our 2024 Financial Report), the general-purpose AI becomes a Specialized AI. It will only provide answers relevant to the context found in that stored data. We will discuss what happens when a user asks something "out of context" in future articles, but today, my focus is on selecting the right AI model.
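To make the idea concrete, here is a minimal sketch of what that will eventually look like in code. The retrieve_context helper is purely hypothetical at this point (in later parts it will be replaced by real retrieval over the report), and the client setup mirrors the Nebula Lab example later in this post.

from openai import OpenAI

client = OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxxxxx",  # replace with your actual key
    base_url="https://llm.ai-nebula.com/v1"
)

def retrieve_context(question):
    # Hypothetical placeholder: later this will search the stored
    # 2024 financial report for the passages most relevant to the question.
    return "Relevant excerpts from the 2024 financial report..."

def answer(question):
    context = retrieve_context(question)
    response = client.chat.completions.create(
        model="gpt-5.2",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided context. "
                        "If the answer is not there, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content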

Selecting the Model: Why Nebula Lab?

When looking for a model, I felt overwhelmed by the different platforms—GPT, Claude, and Gemini all live in different ecosystems. I initially looked at OpenRouter, a popular API aggregator. However, after some research and a tip from a friend, I discovered Nebula Lab.

Nebula Lab (ai-nebula.com) is an API aggregator that offers not just LLMs, but also marketing tools. Here is why I decided to switch from OpenRouter to Nebula:

  • Cost-Effectiveness: Their prices are significantly lower. For example, GPT-5.2 is listed at $1.40 USD per 1M tokens. Compared to official OpenAI pricing, Nebula is genuinely more affordable.

  • No Platform Fees: Unlike some aggregators that charge a 5% platform fee, Nebula Lab doesn't tack on extra costs.
  • Model Variety: They host all the heavy hitters, including OpenAI, Google, and Anthropic.
  • Clean UI: The interface is simple and easy to navigate.

  • Clear Documentation: For a beginner, their documentation is straightforward and easy to implement.

Testing the API

Getting started was incredibly easy:

  1. Navigate to the Model Center.
  2. Select API Key on the left sidebar.
  3. Generate your key.


To ensure everything was working, I tested the connection using two methods provided in their documentation:

1. Testing via CURL


I ran the following command in my terminal (note that the backslash line continuations below are for bash/zsh; in Command Prompt or PowerShell, put the command on a single line):

curl https://llm.ai-nebula.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -d '{
        "model": "gpt-5.2",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'


The response came back instantly and looked exactly as expected: the "GPT-5.2" model responded perfectly.

2. Testing via Python


I then used Python (version 3.13.2) for a more integrated test:

from openai import OpenAI

client = OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxxxxx", # Replace with your actual key
    base_url="https://llm.ai-nebula.com/v1"
)

response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)


Success! The code ran smoothly without a single hitch. I’m really impressed with Nebula Lab’s variety and ease of use.

What's Next?

In the next article, we’ll start building the actual chatbot and gradually begin injecting our financial data to transform this from a simple API call into a full-fledged RAG system.

If you want to try it out yourself, check out Nebula Lab here: https://openai-nebula.com/

Parallel & Concurrent Computing

2026-01-27 16:30:14

Parallel and concurrent computing are no longer niche topics for high-performance researchers; they are essential for anyone wanting to squeeze real performance out of modern hardware.

1. Motivation: The End of "Free Lunch"

For decades, software got faster simply because hardware engineers increased CPU clock speeds. However, around 2004, we hit a "Power Wall." Increasing clock speeds further generated more heat than could be dissipated.

  • CPU Core Stagnation: Instead of making one core faster (increasing GHz), manufacturers began adding more cores to a single chip.
  • The Shift: To gain performance now, developers must write code that can run across these multiple cores simultaneously.

2. Serial vs. Parallel Execution

The difference lies in how tasks are queued and processed.

Feature | Serial Execution | Parallel Execution
Workflow | One task must finish before the next begins. | Multiple tasks (or parts of a task) run at the same time.
Hardware | Uses a single processor core. | Uses multiple cores or multiple processors.
Analogy | A single grocery checkout line. | Multiple checkout lanes open at once.

3. Key Definitions

Concurrency vs. Parallelism

These terms are often used interchangeably, but they describe different concepts:

  • Concurrency: The art of dealing with many things at once. It’s about structure. A system is concurrent if it can handle multiple tasks by switching between them (interleaving).
  • Parallelism: The act of doing many things at once. It’s about execution. It requires hardware capable of running tasks at the exact same moment.

Deterministic vs. Non-deterministic Execution

  • Deterministic: Given the same input, the program always produces the same output and follows the same execution path.
  • Non-deterministic: The outcome or the order of execution can change between runs, even with the same input. This is common in parallel systems because the thread scheduler decides when each task runs, often leading to different interleaving.

4. Common Pitfalls

Writing parallel code is notoriously difficult because of the "bugs" that only appear when timing is just right (or wrong).

Race Conditions

A race condition occurs when the output depends on the sequence or timing of uncontrollable events.

  • Example: Two threads try to increment a counter simultaneously. If they both read "10," add 1, and write back "11," the counter only increases by 1 instead of 2.

Deadlocks

A deadlock is a "Mexican Standoff" in code. It happens when:

  1. Thread A holds Resource 1 and waits for Resource 2.
  2. Thread B holds Resource 2 and waits for Resource 1. Neither can proceed, and the program freezes.
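A minimal sketch of how that standoff can happen with two threading.Lock objects is below. Be warned that it may genuinely hang: each thread grabs its first lock before the other releases anything (the function names are illustrative).

import threading

lock_1 = threading.Lock()   # Resource 1
lock_2 = threading.Lock()   # Resource 2

def worker_a():
    with lock_1:            # holds Resource 1...
        with lock_2:        # ...and waits for Resource 2
            print("worker_a finished")

def worker_b():
    with lock_2:            # holds Resource 2...
        with lock_1:        # ...and waits for Resource 1
            print("worker_b finished")

t1 = threading.Thread(target=worker_a)
t2 = threading.Thread(target=worker_b)
t1.start()
t2.start()
t1.join()
t2.join()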

Synchronization Issues

To prevent race conditions, we use "locks" or "mutexes." However, over-synchronizing leads to problems:

  • Contention: Too many threads fighting for the same lock, which slows the system down to serial speeds.
  • Starvation: A thread is perpetually denied access to resources because other "greedier" threads keep taking them.

Understanding how memory is allocated is the "make or break" moment for designing parallel systems. It dictates how your workers (threads or processes) talk to each other and how much they’ll fight over resources.

2.1 Shared Memory Parallelism (Multithreading)

In this model, multiple threads live within a single process. Imagine a single kitchen (the memory) where multiple chefs (threads) are working at the same counter.

  • Shared Space: All threads can see and modify the same variables. This makes communication lightning-fast because you don't have to "send" data; it's already there.
  • The Synchronization Tax: Since everyone is touching the same "ingredients," you need strict rules (locks/mutexes) to prevent them from chopping the same carrot at the same time. This adds significant logic complexity.
  • The Python Catch (GIL): In standard Python (CPython), the Global Interpreter Lock (GIL) ensures only one thread executes Python bytecode at a time. Even on a 16-core machine, multithreading in Python won't give you a 16x speedup for CPU-heavy math; it’s mostly useful for I/O tasks like downloading files.

2.2 Distributed Memory Parallelism (Multiprocessing)

Here, you have multiple processes, each with its own private "kitchen." No process can peek into another's memory.

  • Independence: Since memory isn't shared, you don't have to worry about one process accidentally overwriting another’s variables. This eliminates many race conditions.
  • Message Passing: If Process A needs data from Process B, it must be explicitly "sent" over a communication channel (like a Pipe or Queue). This is called Message Passing.
  • True Parallelism: Because each process has its own memory and its own instance of the Python interpreter, the GIL is bypassed. This is the go-to method for compute-bound tasks (e.g., heavy data processing, image rendering).
  • The Overhead: Creating a new process is "heavier" and slower than creating a thread, and sending large amounts of data between processes can be a performance bottleneck.

Summary Comparison

Feature | Multithreading (Shared) | Multiprocessing (Distributed)
Memory | Shared among all threads | Private to each process
Communication | Fast (Shared variables) | Slower (Message passing)
Complexity | High (Needs locks/semaphores) | Lower (Isolation)
Python GIL | Restricted by GIL | Bypasses GIL
Best Use Case | I/O-bound (API calls, DB reads) | CPU-bound (Math, Data Science)

The Global Interpreter Lock (GIL) is perhaps the most famous (and infamous) technical detail of the Python programming language. It is essentially a "safety latch" that has shaped how the entire Python ecosystem handles performance.

3.1 What is the GIL and Why Does It Exist?

The GIL is a mutex (a lock) that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.

  • The Reason: Python uses reference counting for memory management. If two threads increment or decrement the "use count" of an object simultaneously, it could lead to memory leaks or, worse, deleting an object that is still in use.
  • The Benefit: It makes the implementation of CPython (the standard Python version) much simpler and faster for single-threaded programs. It also makes integrating C libraries (which might not be thread-safe) much easier.

3.2 Impact on Python Multithreading

Because of the GIL, even if your computer has 32 CPU cores, a standard Python program using threading will only utilize one core at a time for execution.

  • The Illusion of Parallelism: To a human, it looks like threads are running in parallel because the GIL switches between them very quickly (every 5ms or so).
  • CPU-Bound Bottleneck: If your code is doing heavy math (CPU-bound), multithreading actually makes it slower than a single-threaded program. This is because of the "lock overhead"—the time wasted by threads fighting over who gets to hold the GIL.

3.3 How the GIL is Bypassed

The GIL isn't an impenetrable wall; it’s more like a gate that can be opened under specific conditions.

1. Native Extensions (The "C" Escape)

Libraries like NumPy, SciPy, and Pandas are written in C or Fortran. When you perform a massive matrix multiplication in NumPy, the library "releases" the GIL, does the heavy lifting in C across multiple cores, and "grabs" the GIL back only when it’s done.

Note: This is why Python is a powerhouse for Data Science despite the GIL.

2. I/O Operations

When a thread is waiting for something external—like a website to respond, a file to be read from a disk, or a database query—it voluntarily releases the GIL.

  • While Thread A waits for a download, Thread B can take the GIL and start working. This makes Python threads excellent for network-heavy tasks.

3. Multiprocessing

As we discussed earlier, the GIL is per-interpreter. By using the multiprocessing module, you launch entirely separate instances of the Python interpreter.

  • Each process has its own GIL.
  • Each process can sit on its own CPU core.
  • This is the standard way to achieve "True Parallelism" in Python for pure Python code.

Summary: Threading vs. Multiprocessing in Python

Task Type | Recommended Approach | Why?
CPU-Bound (Math, Compression) | multiprocessing | Bypasses GIL, uses all cores.
I/O-Bound (Web Scraping, API) | threading | Efficiently uses "waiting time."
Scientific Computing | NumPy / Pandas | Releases GIL internally in C code.

In Python, the threading module is the go-to choice for tasks where the bottleneck isn't your CPU's speed, but rather the latency of external systems.

4.1 Threading Use Cases

Threads are ideal for I/O-bound workloads. In these scenarios, the processor spends most of its time idle, waiting for a response from a device or network.

  • Network Requests: Fetching data from multiple APIs or web scraping. While Thread A waits for a server in New York to respond, the GIL is released, allowing Thread B to start a request to a server in London.
  • Disk Operations: Reading or writing multiple files. Since disk I/O is significantly slower than CPU cache, threads allow you to overlap the "wait time" of different file operations.
  • User Interfaces (GUIs): Keeping the interface responsive. One thread handles the "click" events while a background thread does the heavy lifting, preventing the window from freezing.

4.2 ThreadPoolExecutor

Modern Python development favors the concurrent.futures.ThreadPoolExecutor over the older threading.Thread class. It provides a higher-level interface for managing a "pool" of threads.

What is a Thread Pool?

Instead of creating and destroying a thread for every single task (which is expensive), you create a Pool of workers that stay alive and pick up tasks from a queue as they become available.

Key Methods: map vs. submit

The ThreadPoolExecutor offers two primary ways to run tasks:

  1. map(func, *iterables):
    • Works like the built-in map.
    • Executes the function across all items in the iterable in parallel.
    • Pros: Very simple; returns results in the order they were submitted.
  2. submit(func, *args):
    • Schedules a single callable and returns a Future object.
    • Pros: More flexible; allows you to handle individual task completion and different arguments for each task.

Code Example: Efficiently Fetching Data

from concurrent.futures import ThreadPoolExecutor
import requests

urls = ["https://google.com", "https://python.org", "https://github.com"]

def fetch_status(url):
    response = requests.get(url)
    return f"{url}: {response.status_code}"

# Using a context manager ensures threads are cleaned up automatically
with ThreadPoolExecutor(max_workers=3) as executor:
    # 'map' handles the distribution of URLs to the 3 threads
    results = list(executor.map(fetch_status, urls))

for r in results:
    print(r)

Why use a Pool instead of manual Threads?

  • Resource Management: It prevents you from accidentally spawning 10,000 threads and crashing your system.
  • Cleanliness: Using the with statement (context manager) ensures that all threads are joined and resources are released even if an error occurs.
  • Future Objects: It provides "Futures," which are placeholders for results that haven't happened yet, allowing you to check whether a task is done or was cancelled (see the sketch below).
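Here is a minimal sketch of the submit() and Future workflow, reusing the fetch_status function from the example above. as_completed yields each Future as soon as its task finishes, regardless of submission order.

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

urls = ["https://google.com", "https://python.org", "https://github.com"]

def fetch_status(url):
    response = requests.get(url)
    return f"{url}: {response.status_code}"

with ThreadPoolExecutor(max_workers=3) as executor:
    # submit() schedules each call and immediately returns a Future
    futures = [executor.submit(fetch_status, url) for url in urls]

    # Results arrive in completion order, not submission order
    for future in as_completed(futures):
        print(future.result())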


To master parallel computing, you must be able to diagnose the bottleneck. Is your code waiting for the "brain" (CPU) or the "delivery truck" (I/O)? Choosing the wrong tool for the workload can actually make your code slower.

6.1 Identifying Workload Characteristics

CPU-Bound (Compute-Heavy)

The speed is limited by the CPU's clock speed and core count.

  • Examples: Matrix multiplication, image processing, data compression, searching for prime numbers.
  • Significance: These tasks keep the processor usage at 100%.

I/O-Bound (Wait-Heavy)

The speed is limited by Input/Output operations. The CPU often sits idle, waiting for data.

  • Examples: Web scraping (Network), reading thousands of small CSVs (Disk), waiting for a database query to return.
  • Significance: Processor usage is usually low; the system is waiting on external latency.

6.2 Performance Comparison Table

Here is how each execution style behaves under different pressures:

Workload Type | Serial Execution | Multithreading | Multiprocessing
I/O-Bound | Very Slow (Total wait time) | Fastest (Overlaps wait time) | Fast (But uses more memory)
CPU-Bound | Slow | Slowest (GIL overhead + context switching) | Fastest (Uses all cores)

6.3 Demonstrations (Mental Model)

I/O-Bound: The sleep() Test

Imagine a task that does nothing but time.sleep(1). This simulates waiting for a network response.

  • Serial: To do this 10 times, it takes 10 seconds.
  • Multithreading: You spawn 10 threads. They all start "sleeping" at the same time. The total time is roughly 1 second.
  • Why? The GIL is released during sleep(), letting threads wait in parallel.

CPU-Bound: The Mathematical Loop

Imagine calculating the sum of squares for 50 million numbers.

  • Serial: Takes X seconds.
  • Multithreading: Takes X + overhead seconds. Because of the GIL, only one thread can do math at a time. The CPU is essentially "juggling" threads, which wastes time.
  • Multiprocessing: If you have 4 cores, it takes roughly X / 4 seconds. Each core handles a chunk of the numbers independently.
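A minimal sketch of both experiments is below. Exact timings depend on your machine, but the pattern (threads help the sleeps, processes help the math) should hold.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_task(_):
    time.sleep(1)                            # stands in for a network wait

def cpu_task(n):
    return sum(i * i for i in range(n))      # pure-Python math, holds the GIL

if __name__ == "__main__":
    # I/O-bound: ten 1-second waits overlap, finishing in roughly 1 second
    start = time.time()
    with ThreadPoolExecutor(max_workers=10) as pool:
        list(pool.map(io_task, range(10)))
    print(f"Threaded sleeps: {time.time() - start:.2f}s")

    # CPU-bound: four chunks of math run on four separate cores
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(sum(pool.map(cpu_task, [5_000_000] * 4)))
    print(f"Multiprocess math: {time.time() - start:.2f}s")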

Summary: The Decision Tree

  1. Is the CPU usage low while the program is running? $\rightarrow$ It's I/O-bound. Use threading or asyncio.
  2. Is one core pegged at 100%? $\rightarrow$ It's CPU-bound. Use multiprocessing or a library like NumPy.
  3. Are you limited by memory? $\rightarrow$ Be careful with multiprocessing, as each process copies the memory space.

While Multithreading is like having one chef with multiple hands, Multiprocessing is like hiring four chefs in four separate kitchens. This is the only way to achieve "true" parallelism for Python-native code.

7.1 The multiprocessing Module

This module bypasses the GIL by creating entirely new instances of the Python interpreter for each task.

  • Process-based parallelism: Each process has its own memory space and its own GIL.
  • Safety: Since memory isn't shared by default, one process can't accidentally corrupt another's data.

7.2 Pool, Map, and Starmap

The multiprocessing.Pool class is the workhorse for data-parallel tasks.

  • map(func, iterable): The simplest way to parallelize. It chops the iterable into chunks and sends them to the worker processes.
  • starmap(func, iterable_of_tuples): Used when your function requires multiple arguments.
    • Example: If func(x, y) is your function, starmap takes [(1, 2), (3, 4)].

ProcessPoolExecutor

Found in concurrent.futures, this provides an identical interface to the ThreadPoolExecutor we saw earlier. It is generally preferred in modern code for its consistency and better error handling.
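A minimal sketch of both interfaces, using illustrative square and power functions:

from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

def power(x, y):
    return x ** y

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(pool.map(square, [1, 2, 3, 4]))                  # [1, 4, 9, 16]
        print(pool.starmap(power, [(2, 3), (3, 2), (10, 6)]))  # [8, 9, 1000000]

    # Same idea with the concurrent.futures interface
    with ProcessPoolExecutor(max_workers=4) as executor:
        print(list(executor.map(square, [1, 2, 3, 4])))        # [1, 4, 9, 16]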

7.3 Communication & Shared Memory

Sometimes processes do need to talk to each other. Since they don't share memory, we use special constructs:

Tool | Description | Best For...
Value / Array | Allocates a small piece of shared memory (C-style) that all processes can see. | Simple counters or flags.
Queue | A thread- and process-safe FIFO (First-In-First-Out) pipe. | Passing complex objects or results back to the main process.
Pipe | A direct connection between two processes. | Fast, two-way communication between exactly two workers.
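As a small sketch of the Queue approach, each worker below pushes its partial result back to the parent process (the work itself is just an illustrative sum):

from multiprocessing import Process, Queue

def worker(numbers, results):
    results.put(sum(numbers))            # send the partial result to the parent

if __name__ == "__main__":
    results = Queue()
    jobs = [
        Process(target=worker, args=(range(i * 1000, (i + 1) * 1000), results))
        for i in range(4)
    ]

    for job in jobs:
        job.start()

    total = sum(results.get() for _ in jobs)   # one result per worker

    for job in jobs:
        job.join()

    print(f"Total: {total}")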

7.4 Limitations in Interactive Environments (Jupyter)

A common "gotcha" for data scientists is that the multiprocessing module often fails or behaves unpredictably in Jupyter Notebooks or the IPython console.

  1. Serialization (Pickling): Python must "pickle" (serialize) your function and data to send it to the other process. If you define a function inside a notebook cell, the worker process might not be able to find its definition.
  2. The if __name__ == "__main__": block: On Windows and macOS, you must wrap your multiprocessing code in this block to prevent a recursive loop of process creation.
    • Jupyter doesn't always handle this entry point correctly.

Workaround: If you run into issues in Jupyter, move your functions into a separate .py file and import them into your notebook.
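Assuming a hypothetical workers.py module, the pattern looks like this. The guard ensures that only the real entry point spawns workers, while the child processes merely import the module.

# workers.py -- a plain importable module, safe for the workers to pickle from
def square(x):
    return x * x

# main.py (or a notebook cell that imports from workers.py)
from multiprocessing import Pool
from workers import square

if __name__ == "__main__":          # only the real entry point creates the pool
    with Pool(4) as pool:
        print(pool.map(square, range(10)))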

7.5 Summary: When to use Multiprocessing

  • YES: For "number crunching" (e.g., calculating $\pi$ to a billion digits).
  • YES: For heavy image/video processing.
  • NO: For simple I/O (it uses way more RAM than threads).
  • NO: When you need to share massive amounts of data (the "pickling" overhead will kill your performance).

When multiple threads or processes try to change the same piece of data at the same time, you enter the world of Race Conditions. This is the most common source of "heisenbugs"—bugs that seem to disappear when you try to look for them.

8.1 Shared State Modification Problems

A race condition occurs when the final outcome of a program depends on the timing or scheduling of the execution.

If two threads are incrementing a shared variable, the operation looks like one step in Python (x += 1), but the CPU sees three distinct steps:

  1. Read the current value of $x$.
  2. Add 1 to that value.
  3. Write the new value back to memory.

If Thread A is interrupted after step 1, and Thread B finishes all three steps, Thread A will eventually overwrite Thread B's work with an outdated value.

8.2 Demonstration of Incorrect Results

In a perfectly synchronized world, if you have 10 threads each adding 1 to a counter 100,000 times, the result should be 1,000,000.

In a race condition scenario, the result might be 742,384. This happens because thousands of "updates" were lost when threads stomped on each other’s data.
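A minimal sketch that can reproduce the lost-update problem is below. The final count varies between runs, and on some interpreter versions you may need more threads or iterations to observe it.

import threading

counter = 0

def increment(times):
    global counter
    for _ in range(times):
        counter += 1        # read, add 1, write back: three interleavable steps

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)              # expected 1,000,000; may be lower when updates are lost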

8.3 Threading vs. Multiprocessing Behavior

The way these two handle "shared state" is fundamentally different, which changes how they fail.

  • In Multithreading: Race conditions are common and dangerous. Because all threads share the same memory, they can all "see" and "touch" the same variables globally.
  • In Multiprocessing: Race conditions are rare by default. Since each process has its own memory, incrementing x in Process A does nothing to x in Process B.
    • Exception: You only face race conditions in multiprocessing if you explicitly use Shared Memory constructs (like Value or Array) or shared external resources (like a database or a file on disk).

8.4 Synchronization Primitives

To fix these issues, we use tools that force threads to "wait their turn."

1. The Lock (Mutex)

A Lock is the simplest tool. It has two states: locked and unlocked.

  • A thread must "acquire" the lock before touching the shared data.
  • If another thread holds the lock, everyone else must wait.
  • Analogy: The "talking stick" in a meeting. You can't speak unless you hold the stick.
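Applied to the broken counter from section 8.2, the fix is a minimal change: every increment happens while holding the lock.

import threading

counter = 0
lock = threading.Lock()

def safe_increment(times):
    global counter
    for _ in range(times):
        with lock:          # acquire before touching shared data, release on exit
            counter += 1

threads = [threading.Thread(target=safe_increment, args=(100_000,)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)              # reliably 1,000,000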

2. The Semaphore

A Semaphore is like a Lock, but it allows a specific number of threads to enter.

  • Analogy: A restaurant with 10 tables. The first 10 groups get in; the 11th must wait until someone leaves.

3. The RLock (Re-entrant Lock)

A standard Lock can cause a thread to "deadlock itself" if it tries to acquire the same lock twice. An RLock allows the same thread to acquire the lock multiple times without freezing.

Summary: The Cost of Safety

While synchronization prevents data corruption, it comes with a performance price:

  • Overhead: Managing locks takes CPU time.
  • Serial Bottlenecks: If every thread is waiting for the same lock, your "parallel" program is actually running one-by-one (serial).

Numerical integration is a "perfect" parallel problem. It follows the embarrassingly parallel pattern, where a large task can be easily divided into independent sub-tasks that don't need to communicate with each other.

9.1 The Grid-Based Technique (Rectangle Rule)

To find the area under a curve f(x) between a and b, we divide the interval into $N$ small rectangles. The total area is the sum of the areas of these rectangles.

$$Area \approx \sum_{i=0}^{N-1} f(x_i) \Delta x$$

In a serial approach, a single CPU core calculates rectangle #1, then #2, then #3, all the way to N. If N is 100 million, this takes a significant amount of time.

9.2 Identifying Parallelizable Regions

The beauty of integration is that the calculation of "Rectangle #500" does not depend on the result of "Rectangle #499."

  • The Strategy: Split the total range $[a, b]$ into sub-intervals.
  • The Workers: If you have 4 cores, Core 1 handles the first 25%, Core 2 the second 25%, and so on.
  • The Reduction: Once all cores finish their local sums, you add those 4 sums together to get the final answer.

9.3 Implementation Strategies

Multithreading Approach

  • Performance: Low. Because integration is CPU-bound (pure math), the Python GIL will prevent the threads from running the math in parallel.
  • Use Case: Only beneficial if the function $f(x)$ involved an I/O wait (e.g., fetching a coordinate from a remote database), which is rare in pure math.

Multiprocessing Approach

  • Performance: High. This is the correct tool. By using a ProcessPoolExecutor, each core gets a chunk of the grid.
  • Efficiency: You get nearly "linear scaling." If 1 core takes 10 seconds, 4 cores should take roughly 2.5 seconds.
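Here is a minimal sketch of that strategy, integrating an illustrative f(x) = x^2 over [0, 1] with the rectangle rule. Each worker sums the rectangles in its own sub-interval, and the main process adds up the partial results.

from concurrent.futures import ProcessPoolExecutor

def f(x):
    return x * x                               # illustrative integrand

def partial_area(bounds):
    start, end, dx = bounds
    total, x = 0.0, start
    while x < end:                             # rectangle rule over this sub-interval
        total += f(x) * dx
        x += dx
    return total

if __name__ == "__main__":
    a, b, n, workers = 0.0, 1.0, 10_000_000, 4
    dx = (b - a) / n
    step = (b - a) / workers
    chunks = [(a + i * step, a + (i + 1) * step, dx) for i in range(workers)]

    with ProcessPoolExecutor(max_workers=workers) as executor:
        area = sum(executor.map(partial_area, chunks))

    print(f"Approximate area: {area:.6f}")     # exact answer is 1/3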

9.4 Performance Measurement

To prove the speedup, we use the time module. It is vital to measure only the calculation, excluding the time it takes to set up the data.

import time

start_time = time.time()

# ... Parallel Integration Logic ...

end_time = time.time()
print(f"Execution Time: {end_time - start_time:.4f} seconds")

Critical Metrics:

  1. Speedup ($S$): $S = \frac{T_{serial}}{T_{parallel}}$
  2. Efficiency ($E$): $E = \frac{S}{Number\ of\ Cores}$ (Ideally, this is close to 1.0 or 100%).

9.5 Summary Table: Integration Performance

Method | Execution | Expected Speedup
Serial | One core, one by one. | 1x (Baseline)
Multithreading | Context switching on one core. | ~0.9x (Slower due to overhead)
Multiprocessing | Multiple cores simultaneously. | ~3.8x (on a 4-core machine)
NumPy (Vectorized) | Optimized C-backend/SIMD. | Fastest (often 50x - 100x)

To wrap up our foundations, we look at the "low-hanging fruit" of the computing world. An Embarrassingly Parallel problem is one where little to no effort is needed to separate the problem into a number of parallel tasks.

10.1 Definition and Characteristics

A problem is embarrassingly parallel if there is no dependency (or very little) between the sub-tasks.

  • No Communication: Task A doesn't need to know what Task B is doing to finish its job.
  • No Shared State: Workers don't need to update a global variable constantly (which avoids those pesky race conditions).
  • High Scalability: These problems scale almost perfectly; doubling your CPU cores usually halves the execution time.

10.2 Core Examples

Monte Carlo Simulations

These simulations use repeated random sampling to obtain numerical results (like predicting stock market trends or calculating $\pi$). Since every "random trial" is independent, you can run a million trials on one core or divide them across a thousand cores with zero logic changes.
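A minimal sketch of the idea, estimating $\pi$ by throwing random points at a unit square and counting how many land inside the quarter-circle. Each worker runs its own independent batch of trials.

import random
from concurrent.futures import ProcessPoolExecutor

def count_hits(trials):
    hits = 0
    for _ in range(trials):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:               # inside the quarter-circle
            hits += 1
    return hits

if __name__ == "__main__":
    workers, trials_each = 4, 1_000_000
    with ProcessPoolExecutor(max_workers=workers) as executor:
        total_hits = sum(executor.map(count_hits, [trials_each] * workers))

    print(f"Estimated pi: {4 * total_hits / (workers * trials_each):.5f}")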

Weather Ensemble Models

Meteorologists don't just run one weather forecast; they run dozens of "ensembles" with slightly different starting conditions. Since Forecast A doesn't affect Forecast B, they are computed in parallel across massive supercomputers.

Batch Data Processing

Imagine you have 10,000 high-resolution photos to resize. Resizing photo #1 has nothing to do with photo #100. This is a classic "Map" operation where a worker pool can chew through the pile of files as fast as the disk can provide them.

CNN (Convolutional Neural Network) Workloads

In Deep Learning, a Convolutional layer applies filters to an image. Each "pixel" calculation or each "filter" application can be done independently. This is why GPUs—which have thousands of tiny cores—are so much faster than CPUs for AI tasks.

FFT (Fast Fourier Transform)

While the classic DFT is $O(N^2)$, the FFT reduces complexity to $O(N \log N)$. In many implementations, the data is split into "even" and "odd" parts that can be processed recursively in parallel, making it a staple of digital signal processing.

Summary of the "Parallel Spectrum"

Type | Communication Needs | Difficulty to Parallelize
Embarrassingly Parallel | None | Very Easy
Coarse-Grained | Occasional | Moderate
Fine-Grained | Constant/Frequent | Hard (High risk of overhead)

While Python’s multiprocessing is great for a single machine, MPI (Message Passing Interface) is the gold standard for high-performance computing (HPC) across clusters of multiple computers. It is the language of supercomputers.

11.1 MPI Fundamentals

Unlike the shared-memory models we’ve discussed, MPI is built entirely on the Distributed-Memory Model.

  • Independent Processes: Each process has its own address space. There is no shared "global variable." If Process 0 has a variable x, Process 1 cannot see it unless Process 0 explicitly sends it.
  • The "Rank": Every process in an MPI job is assigned a unique ID called a Rank (starting from 0). You use this rank to tell each process what part of the work it should do.
  • The "Communicator": This is a group of processes that can talk to each other. The default group containing all your processes is called COMM_WORLD.

11.2 mpi4py: MPI for Python

The mpi4py library provides the Python bindings for the MPI standard. It allows Python scripts to communicate across a network.

Key Concepts

  • COMM_WORLD: The primary communicator.
  • Get_size(): Tells you the total number of processes running.
  • Get_rank(): Tells the current process its unique ID.
  • Point-to-Point Communication: Using send() and recv() to move data between specific ranks.
  • Collective Communication: Using bcast() (one-to-all) or reduce() (all-to-one) to synchronize data.

11.3 Running MPI Programs

You cannot run an MPI script by simply typing python script.py. You must use a process manager, typically mpirun or mpiexec, which handles the launching of multiple instances across your CPU cores or network nodes.

The Command:
mpirun -n 4 python3 my_script.py
(This launches 4 independent instances of your script.)

Example: The "Who Am I?" Pattern

from mpi4py import MPI

# Get the communicator
comm = MPI.COMM_WORLD

# Get the size (total processes) and rank (my ID)
size = comm.Get_size()
rank = comm.Get_rank()

print(f"Hello! I am process {rank} out of {size} total processes.")

if rank == 0:
    data = {'key': 'value'}
    comm.send(data, dest=1)
    print("Process 0 sent data to Process 1.")
elif rank == 1:
    data = comm.recv(source=0)
    print(f"Process 1 received: {data}")

Summary: MPI vs. Multiprocessing

Feature | multiprocessing | mpi4py (MPI)
Scope | Single Machine (Multi-core) | Multi-Node (Clusters/Supercomputers)
Memory | Shared-memory constructs available | Strictly Distributed (Message Passing)
Launch | Standard Python interpreter | mpirun / mpiexec
Scaling | Limited by one motherboard | Scales to thousands of CPUs


In MPI, communication is how independent processes coordinate to solve a single problem. There are two primary ways processes "talk": one-to-one (Point-to-Point) or all-together (Collective).

12.1 Point-to-Point Communication

This is the most basic form of messaging, involving exactly two processes: a sender and a receiver.

  • send(obj, dest): The source process sends a Python object to a specific rank.
  • recv(source): The destination process waits to receive an object from a specific rank.
  • Blocking Communication: By default, these operations are "blocking." The sender waits until the message is safely in the transmission buffer, and the receiver waits (sleeps) until the message actually arrives. If you recv() and no one ever send(), your program will hang forever.

12.2 Collective Communication

Collective operations involve all processes in a communicator (e.g., COMM_WORLD). These are highly optimized and usually much faster than writing multiple point-to-point loops.

Operation | Description | Analogy
Broadcast (bcast) | One process sends the same data to everyone else. | A teacher giving a handout to the whole class.
Scatter (scatter) | One process takes a list and gives one piece to each process. | Dealing a deck of cards to players.
Gather (gather) | One process collects a piece of data from everyone else into a list. | A teacher collecting homework from every student.
Reduce (reduce) | Everyone sends data to one process, which "crunches" it (e.g., Sum, Max). | Everyone votes, and the teller announces only the total count.
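A minimal sketch of scatter and reduce with mpi4py, summing numbers across the ranks. Launch it with something like mpirun -n 4 python3 collective_demo.py (the file name is just an example).

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Rank 0 prepares one chunk of numbers per process
chunks = [list(range(i * 10, (i + 1) * 10)) for i in range(size)] if rank == 0 else None

my_chunk = comm.scatter(chunks, root=0)            # every rank gets exactly one chunk
partial = sum(my_chunk)

total = comm.reduce(partial, op=MPI.SUM, root=0)   # partial sums combined on rank 0

if rank == 0:
    print(f"Grand total: {total}")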

12.3 Performance: The "Case" Matters

In mpi4py, there is a massive performance difference between lowercase methods (e.g., send) and uppercase methods (e.g., Send).

Lowercase Methods (send, recv, bcast)

  • Mechanism: Uses pickle to serialize Python objects.
  • Flexibility: Can send almost any Python object (dicts, lists, custom classes).
  • Performance: Slower. The overhead of pickling and unpickling large amounts of data can create a bottleneck.

Uppercase Methods (Send, Recv, Bcast)

  • Mechanism: Uses Buffer-based communication. It points directly to a contiguous block of memory.
  • Flexibility: Requires data to be in a buffer-like format, typically NumPy arrays.
  • Performance: Extremely Fast. This is near-C speeds because it avoids the Python overhead and communicates the raw memory directly.

Rule of Thumb: If you are moving NumPy arrays for math, always use the uppercase methods (e.g., comm.Send(my_array, dest=1)).
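A minimal sketch of the buffer-based style, shipping a NumPy array from rank 0 to rank 1 without any pickling (run under mpirun with at least two processes):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(1_000_000, dtype=np.float64)
    comm.Send(data, dest=1)                    # sends the raw memory buffer
elif rank == 1:
    buffer = np.empty(1_000_000, dtype=np.float64)
    comm.Recv(buffer, source=0)                # receives into a pre-allocated array
    print(f"Rank 1 received {buffer.size} values, last = {buffer[-1]}")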

12.4 Summary: When to use what?

  • Use Point-to-Point for complex logic where specific workers need unique instructions.
  • Use Collective for mathematical synchronization (e.g., summing partial results of an integral).
  • Use Uppercase methods whenever you are doing heavy data lifting with NumPy.

The Hitchhiker's Guide to LTS: Key changes when upgrading from Java 8 to Java 11

2026-01-27 16:29:12

This is the first article in a series on what developers can expect when upgrading between LTS versions of Java. In this part, we'll look at the key changes that programmers will encounter when switching from Java 8 to Java 11.


Introduction

When Java 25 was released, we published an article about it. Its author highlighted the main changes and discussed how convenient and exciting they are for developers. After the publication, one of our readers reached out and said they'd like us to cover the challenges developers face when moving from one LTS version of Java to the next, starting with Java 8.

After thinking about it, we decided, "Why not?" After all, when reading various blogs, one often comes across comments from developers who say that they're still using Java 8. For them, an article like that can spark serious consideration about making the jump. For everyone else, it's simply a pleasant retrospective.

We're kicking off this series by comparing Java 8 with the next LTS release: Java 11.

LTS who?
LTS (Long-Term Support) is a software release model where certain stable versions receive extended support, including security updates, bug fixes, and technical support, for a longer period than standard releases.

First, let's take a look at some Java 8 features. It introduced some pretty advanced changes, and since some people are still using it 11 years later, let's review the most important ones.

The first things that come to mind are the Stream API, lambdas, and method and constructor references, which transform constructs like this:

List<User> activeUsers = new ArrayList<>();
for (User user : users) {
  if (user.isActive()) {
    activeUsers.add(user);
  }
}

activeUsers.sort(new Comparator<User>() {
  @Override
  public int compare(User u1, User u2) {
    return u1.getCreatedAt().compareTo(u2.getCreatedAt());
  }
});

List<UserDto> result = new ArrayList<>();
for (User user : activeUsers) {
  result.add(UserMapper.toDto(user));
}

Into more concise ones:

List<UserDto> result = users.stream()
        .filter(User::isActive)
        .sorted(Comparator.comparing(User::getCreatedAt))
        .map(UserMapper::toDto)
        .collect(Collectors.toList());

Let's not forget the many functional interfaces in java.util.function that made the constructs above possible, such as Function, Supplier, Consumer, and Predicate.

Default method implementations in interfaces also appeared in Java 8.

Java's first steps toward functional-style constructs caused a real stir in 2014. However, time doesn't stand still, and four years later the world saw the next LTS release—Java 11.

So, did anything significant happen to the language between versions 8 and 11? Let's take a look.

Changes developers can expect in Java 11

I'd like to point out that this article is just a summary of the things Java programmers should focus on first, not a complete list of changes. Some of them may be a sticking point, because in certain cases they may prevent your Java 8 project from running in a Java 11 environment. Others help streamline code thanks to new language features and constructs. Well, let's get started.

Enough said. var, JEP 286

I'd like to remind you that Java is a statically typed language, meaning that all types must be known at compile time. Before Java 10, programmers had to specify a variable's full type when initializing it.

However, since the compiler can determine the type from the initialization expression, there's no reason to write it out, especially if it's long and awkward. Why not hide that redundant representation behind a special keyword?

JEP 286 made this possible. Instead of something like an abstract StateDatabaseHelperContainerMapMessage, we can now use the concise var:

var stateDbHelper = new StateDatabaseHelperContainerMapMessage();

P.S. I'd like to thank him for the enterprise name.

This is an excellent approach that is definitely worth adopting. Still, keep in mind that the initialization expression should make it clear what kind of object we're dealing with. The examples below illustrate how not to use var:

var x = foo();
var data = get();

If the type isn't clear from the variable name and initializing expression, it's better to use explicit typing!

The module system in Java. JEP 261

Before Java 9, a Java project—whether an application or a library—was simply a set of classes loaded via the classpath. It was just a list of necessary classes, JAR files, and the directories that contained them. This architectural approach presented developers with the following issues:

  • Lack of a higher encapsulation level. The entire application had access to any public class on the classpath, regardless of whether it was intended for external use or presented as an internal implementation.
  • Dependency issues. If a dependency exists at compile time but is missing from the classpath, the application crashes during execution rather than when it starts.

The Java Platform Module System (JPMS), introduced in JEP 261, enables representing Java applications or libraries as a set of modules rather than a set of classes. These modules:

  • declare their dependencies on other modules;
  • hide the internal implementation packages from external use.

Here's a brief example of how you can leverage this. Let's say we have the following library structure:

src/
 └─ com.example.lib/
    ├─ com/example/lib/api/LibPublicApi.java
    └─ com/example/lib/internal/InternalClass.java

We want library users to interact with it only via the classes from the api package, while keeping the classes in the internal package inaccessible from the outside—even though they are declared as public. We can achieve this by defining the library as a separate Java module.

To do this, let's create the module-info.java file in the root of the module's source directory and configure it as follows:

module com.example.lib {
 exports com.example.lib.api;
}

The module-info.java file and the module com.example.lib { .... } construct indicate that this directory and its packages form a Java module. The exports construct opens the com.example.lib.api package to anyone who uses this module. That's all regarding encapsulation: packages that aren't explicitly exported will be unavailable outside the module.

If one module requires another, we explicitly state this in the module configuration by adding the following line:

requires com.example.lib;

So, if we run the application and the JVM can't find the required module, the application/library will crash when it starts rather than during execution.

By the way, the module system proved to be especially useful for the standard Java library. By splitting the JDK into modules, the platform developers could clearly define which parts belong to the public API and which are internal implementation. This made it possible to gradually restrict access to internal APIs and provide official replacements, reducing the risk that changes to the internal implementation would break user code. A partial timeline of these changes is shown below:

  • JEP 260: Encapsulate Most Internal APIs;
  • JEP 396: Strongly Encapsulate JDK Internals by Default;
  • JEP 403: Strongly Encapsulate JDK Internals.

Another significant outcome of implementing the module system was the introduction of the jlink tool, which is designed to create customizable Java runtime images. Since the standard Java library has been divided into modules, developers can create a minimal environment that includes only parts of the platform that a specific application actually needs.

The jlink tool analyzes an application's module dependencies and creates a self-contained runtime that includes only the necessary standard library modules. This approach can significantly reduce the distribution size and increase the speed of starting the application.

Let's say we have a modular application, com.example.app. We can build a custom runtime for it with a single command:

jlink \
  --module-path $JAVA_HOME/jmods:mods \
  --add-modules com.example.app \
  --output app-runtime

As a result, the process creates an app-runtime directory that contains a minimal Java runtime image with only the modules required by com.example.app. The application runs using the JVM from this directory, without using Java installed on the system. This allows a Java application to ship with its own runtime.

You can find a more detailed overview of the Java module system here.

G1 is now the default GC. JEP 248

In Java 8, the default garbage collector, Parallel GC, prioritized achieving the highest possible throughput, which resulted in rare but prolonged stop-the-world pauses.

As the Java ecosystem evolved and the platform transitioned to Java 11, the requirements for applications changed significantly. JVM heap sizes have increased, and microservice architecture and containerization have become common. In this environment, high GC throughput was no longer the only priority. Pauses lasting even a few seconds were deemed unacceptable for online services, even though the total time spent in GC remained relatively low.

For this reason, G1 became the default garbage collector starting with Java 9. Unlike Parallel GC, G1 deliberately trades some throughput for shorter, more predictable pauses. It runs garbage collection more frequently and aims to keep stop-the-world pauses within defined limits. As a result, although total GC time may increase, the impact on application latency is far more stable and controllable.

So, if you used to explicitly enable G1 using the -XX:+UseG1GC JVM flag when starting your application, you no longer need to do so after moving away from Java 8.

An API for Immutable collections. JEP 269

Creating immutable collections with constant values is a fairly common task. Prior to Java 9, the API for this task was not user-friendly. Developers had to create something like this:

Set<String> set = new HashSet<>();
set.add("a");
set.add("b");
set.add("c");
set = Collections.unmodifiableSet(set);

Or this:

Set<String> set = 
    Collections.unmodifiableSet(Stream.of("a", "b", "c").collect(toSet()));

Since Java 9, it has been possible to define immutable collections as follows:

Set<String> set = Set.of("a", "b", "c");

Now that's an improvement!

Remember that these factory methods don't accept null values, and Set.of and Map.of also reject duplicate elements and keys. Otherwise, you'll run into a NullPointerException or an IllegalArgumentException, respectively.

Such methods exist for all collections and associative arrays (that is, for Map). This is crucial to keep in mind when migrating from Java 8!

Compact strings. JEP 254

Before Java 9, the internal representation of a string used a char array because Java strings rely on UTF-16 encoding, which allocates two bytes per character. However, JEP 254 points out that:

  • strings often consume a significant portion of the heap;
  • most strings contain only Latin characters.

Each Latin character fits into a single byte. To save memory, Java changed the internal string representation from char[] to byte[] and added a flag that indicates which encoding the string uses:

  • ISO-8859-1 / Latin-1 (one byte per character) when all characters in the string fit into it;
  • UTF-16 otherwise (two bytes per character).

Since we're talking about strings, I can't skip over the new methods that were added to the String class:

  • repeat creates a new string by repeating the original one a given number of times;
  • strip removes the leading and trailing whitespace;
  • stripLeading removes whitespace only from the beginning of the string;
  • stripTrailing removes whitespace only from the end of the string;
  • isBlank checks whether the string contains anything other than whitespace;
  • lines splits the string into a stream of lines using line terminators.

Keep this in mind when upgrading to Java 11. This means you won't have to build all these methods yourself or drag in any third-party dependencies for them in your own project.

Removed from the JDK

Starting with JDK 11, some large modules have been removed from the standard Java distribution. Notably, JavaFX was removed from the JDK and moved to a separate project, OpenJFX, which is now distributed and developed independently.

As part of JEP 320, the Java EE and CORBA modules were removed from the JDK due to their outdated status and lack of active development. This change streamlined the Java platform, shifting the focus of its development to core capabilities and moving enterprise and UI solutions to external ecosystems.

If your project used any of these modules, you'll need to add them as separate dependencies.

Additional elements in @Deprecated. JEP 277

When developing an API, it's important to notify users when the lifecycle of its components is coming to an end. If certain methods become outdated, developers shouldn't rely on them anymore and should start using more suitable alternatives instead. Java provides the Deprecated annotation for this exact purpose.

Before Java 9, the annotation didn't carry any additional information, creating ambiguity about its meaning in a given context. To provide developers with more clarity regarding a deprecated API, the Deprecated annotation now includes two parameters:

  • since is a string parameter indicating the version in which the marked API was officially recognized as deprecated;
  • forRemoval is a boolean parameter that signals to developers whether the API will be removed in future versions.

These clarifying parameters streamline communicating the API status to its users.

HttpClient instead of HttpURLConnection. JEP 321 and JEP 110

Before Java 11, HTTP requests were handled via HttpURLConnection. A GET request and response output looked as follows:

URL url = new URL("https://api.example.com/data");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();

connection.setRequestMethod("GET");
connection.setConnectTimeout(5000);
connection.setReadTimeout(5000);
connection.setRequestProperty("Accept", "application/json");

int status = connection.getResponseCode();

InputStream inputStream;
if (status >= 200 && status < 300) {
    inputStream = connection.getInputStream();
} else {
    inputStream = connection.getErrorStream();
}

BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
StringBuilder responseBody = new StringBuilder();

String line;
while ((line = reader.readLine()) != null) {
    responseBody.append(line);
}

reader.close();
connection.disconnect();

System.out.println(responseBody);

JEP 110 first introduced HttpClient as an incubating API in Java 9, and JEP 321 standardized it in Java 11. It offers several advantages over HttpURLConnection:

  • it's non-blocking, allowing asynchronous requests;
  • it provides a more convenient API;
  • it's higher-level;
  • it supports HTTP/2 and WebSocket.

This is a simple example of a request via HttpClient:

HttpClient client = HttpClient.newHttpClient();

HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.example.com/data"))
        .header("Accept", "application/json")
        .GET()
        .build();

HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());

System.out.println(response.body());

There's also sendAsync, which enables creating a chain of asynchronous actions:

client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
      .thenApply(HttpResponse::body)
      .thenAccept(System.out::println)
      .exceptionally(ex -> {
          ex.printStackTrace();
          return null;
      });

This is a very important change to keep in mind so you can implement new features without relying on the outdated HttpURLConnection.

Files.readString and Files.writeString

These are some minor changes related to reading and writing files. In Java 11, to read the file's contents or write something to it, just use the Files.readString and Files.writeString methods, respectively:

var fileContent = Files.readString(Path.of("file.txt"));
var content = "Hello, File!";
Files.writeString(Path.of("file.txt"), content);

In the past, the simplest way to do this was:

String text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);

By the way, using Path.of to define a path is now considered the preferred option; as of Java 11, the previously used Paths.get() is treated as a legacy alternative that may be deprecated in a future release.

A data filter for serialization. JEP 290

This change benefits developers who use Java mechanisms to deserialize external data in their applications.

Before Java 9, developers had to write custom classes to control which classes were deserialized. For example, these:

public class ObjectInputStreamWithClassCheck extends ObjectInputStream {
  private final static List<String> ALLOWED_CLASSES = Arrays.asList(
        User.class.getName()
  );

  public ObjectInputStreamWithClassCheck(InputStream in) throws .... {
    super(in);
  }

  @Override
  protected Class<?> resolveClass(ObjectStreamClass desc) throws .... {
    if (!ALLOWED_CLASSES.contains(desc.getName())) {
      throw new NotSerializableException(
          "Class is not available for deserialization"
      );
    }

    return super.resolveClass(desc);
  }
}

They also had to use them for deserialization:

var ois = new ObjectInputStreamWithClassCheck(externalData);
Object obj = ois.readObject();

With JEP 290, this capability is available in the standard library. To specify which objects are available for deserialization, just use the ObjectInputFilter filter:

ObjectInputFilter myFilter = 
                  ObjectInputFilter.Config.createFilter("java.util.Date;!*");
ObjectInputStream ois = new ObjectInputStream(externalData);
ois.setObjectInputFilter(myFilter);
Object obj = ois.readObject();

You can configure filters using special string expressions. In the example above, we enabled deserialization only for objects of the java.util.Date class. If anything else is deserialized, we'll encounter the InvalidClassException.

By the way, we have an article discussing the consequences of unsafe deserialization. I recommend reading it.

Conclusion

That concludes the article. We understand that it's impossible to cover everything but did our best to spotlight the issues that developers are most likely to come across. I'd like to briefly mention JShell's arrival and the option to use the java command to immediately compile and run a single file. Note that version 8 was the last with the 1.* prefix. Starting with version 9, Java versions are designated by whole numbers.

It's time to say goodbye! We'll continue this series of articles by discussing the transition to the next LTS version, Java 17. So, if you found this content interesting, consider subscribing to our blog! See ya soon!

2026 New Year Keyboard Upgrade

2026-01-27 16:20:59

To kick off 2026, I finally did two things I had been postponing: some long overdue switch maintenance and a small but meaningful keycap upgrade.

The keyboard is a Pteron36 designed by Harshit Goel. It is a minimalist split build intended to work well with compact layouts like Miryoku, where layers + home row modifiers let ~36 keys cover the operations usually spread across an 87–112 key keyboard.

I also do not type on QWERTY — I use the Workman layout. I will do a separate post on why and how I transitioned.

Maintenance: Why and What?

After ~5 years of heavy use, with dust and grime building up, the switches started feeling scratchy. The ideal fix is usually to open each switch and lube the stem + springs—but my switches are soldered to the PCB, and I did not have the bandwidth or the motivation to desolder and fully disassemble everything.

So I went with lazy lubing: press the stem down and apply lube to the inside walls of the housing and the exposed sides of the stem. It does not reach the springs, but it can still reduce friction and smooth out the keypress.

For lube, I used Krytox GPL 205g0, a common choice in the keyboard hobby.

Photo of dismantled keys

The Upgrade: Keycaps That Fake a Key Well

I had originally planned a bigger upgrade: moving to a Dactyl Manuform style board with a concave "key well," since that curvature can make finger travel feel more natural.

But after talking to a few folks in the ergonomic keyboard community, I pivoted to a lighter change: trying KLP Lamé keycaps. These are sculpted, curved keycaps designed to reduce vertical finger travel and create a pseudo key-well effect on flatter boards.

I found a 3D printing service that could print them in resin. The shape makes the rows feel "guided", especially when reaching above/below the home row.

Closeup of sculpted keycaps

Result

After lubing, the board feels noticeably smoother and the scratchiness is largely gone. The keycaps are still an adjustment, but it is a good one—my fingers feel like they "land" sooner than they used to, especially on vertical reaches.

Completed keyboard

Why I built a native macOS app to fight "Configuration Drift"

2026-01-27 16:19:46

We’ve all been there. You get a brand new MacBook, and for the first three months, it’s a dream. Then, slowly, the "drift" sets in. A Homebrew update breaks a symlink. Your /opt/homebrew folder starts eating 20GB. Xcode caches grow to the size of a small moon.

As an engineering leader, I got tired of the "voodoo" fixes and manual cleanup scripts. I wanted something native, fast, and local-first.

So, I built MacFlow.

What is MacFlow?

MacFlow is a 100% native macOS assistant designed to give you total control over your development environment.

🎥 The Full Walkthrough

I recorded a walkthrough of the current beta features here:

Key Features

  • AI Workspace Discovery: Tell MacFlow what you're building, and it finds and maps the necessary stacks locally.
  • Real-time Drift Detection: Get notified when your environment diverges from your intended state.
  • Deep System Hygiene: Reclaim GBs of space from NPM, Docker, and Xcode caches in one click.
  • Local-First Security: Apple Notarized and runs entirely on your machine. No data leaves your Mac.

Join the Beta

We are currently in Open Beta and looking for feedback from the dev community. If you care about a perfectly dialed-in machine, I'd love for you to give it a spin.

👉 Read the full launch details and technical breakdown on our blog: macflow.ai/blog/introducing-macflow-native-macos-command-center

Download the Beta at MacFlow.ai

I'll be around in the comments to answer any technical questions!

A Call for Volunteers: Web Developers & Graphic Designers

2026-01-27 16:17:33

Happy New Year everyone!🎉

We’re currently building the official PlotSense website and are looking for passionate Web Developers and Graphic / Product Designers to join our growing product team.

PlotSense is an open, explainable AI-driven data visualisation project focused on turning analytics into clear, human-understandable insights.

Who We’re Looking For?

  1. Web Developers (Frontend & Backend)
    • Build and maintain frontend & backend components
    • Turn designs into clean, functional, responsive UI
    • Bonus: strong product thinking or experience with tech products

  2. Graphic / UI-UX Designers
    • Create brand visuals, UI assets, and social media creatives
    • Strong eye for layout, typography, and visual storytelling
    • Experience designing for digital or tech products is a plus

You’ll be joining the product team to expand our capabilities and bring fresh ideas, helping shape the look, feel, and experience of PlotSense.

Why Volunteer?
✔ Work on a real AI + data product
✔ Build portfolio-worthy, visible work
✔ Collaborate with data & ML professionals
✔ Gain recognition as the project grows
✔ Gain Open-Source project experience

What You’ll Be Working On?
• PlotSense marketing & documentation website
• Explaining AI-powered visual analytics in a clear, engaging way
• Showcasing demos, use cases, and future roadmap
• Shaping the public face of PlotSense

If you’re interested…
Fill this form 👇

https://forms.gle/mqStonTebrgvD1pX8

Let’s build something meaningful where data tells a better story and AI explains it.