2026-03-01 01:46:53
I am Madhesh, a passionate developer with a strong interest in Agentic AI and DevOps. I enjoy learning new things, and I have always wanted to start writing blogs to connect with people. I chose to work on RAG because large language models (LLMs) are everywhere, and RAG adds significant power to them by providing proper context for user queries.
LLMs often hallucinate on domain-specific or recent data because they don’t have the proper context for user queries. Traditional LLM outputs rely solely on trained data, which may not contain up-to-date or domain-specific information. RAG overcomes these problems with strong retrieval pipelines. In this blog, I walk through designing and implementing a complete RAG pipeline using Elastic as the vector database. From ingesting documents to semantic retrieval and LLM augmentation, discover how Elastic’s vector capabilities deliver accurate, hallucination-resistant AI applications.
The naive way to search for relevant content in a document or database is by using a basic keyword search.
Example - search in a file:
grep "keyword" file.txt
Example - SQL keyword search in a database:
SELECT * FROM table_name WHERE column_name LIKE '%keyword%';
Keyword search works by finding exact matches. But if the user uses different words with the same meaning, keyword search fails. That is where semantic search and vector embeddings become useful.
TF-IDF is a classic method to score how important a term is in a document relative to a corpus.
TF(t, d) = number of times term t appears in document d
DF(t) = number of documents containing term t
IDF(t) = log(N / DF(t)), where N = total number of documents
TF-IDF(t, d) = TF(t, d) × IDF(t)
TF-IDF weights terms that are frequent in a document but rare in the corpus, giving more relevant ranking than pure keyword counts.
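A minimal sketch using scikit-learn's TfidfVectorizer (which uses a smoothed variant of the IDF formula above); the toy corpus is illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "machine learning is powerful",
    "deep learning uses neural networks",
    "machine learning and AI",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # one weighted vector per document

# Terms common across the corpus (like "learning") get lower weights
# than terms concentrated in one document (like "neural").
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```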
BM25 is a ranking algorithm used in retrieval systems to determine the relevance of documents to a given user query. It is the default ranking algorithm used in systems like Elasticsearch and Whoosh. BM25 improves over TF-IDF by saturating term frequency (repeated occurrences of a term yield diminishing returns) and normalizing for document length, so long documents are not unfairly favored.
Compute BM25 in Python:
from rank_bm25 import BM25Okapi

# Toy corpus
docs = [
    "machine learning is powerful",
    "deep learning uses neural networks",
    "machine learning and AI",
]

# BM25 expects pre-tokenized documents
tokenized = [doc.split() for doc in docs]
bm25 = BM25Okapi(tokenized)

query = "machine learning".split()
scores = bm25.get_scores(query)  # one relevance score per document
print(scores)
BM25 produces a score for each document based on the query and ranks them by relevance.
When a user query uses a different word but similar meaning, keyword methods fail. This is where vector embeddings solve the problem.
Embeddings transform text into numerical vectors that capture semantic meaning. Similar texts have vectors close to each other in vector space.
Generate embeddings:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["machine learning", "deep learning"]
vectors = model.encode(texts)
print(vectors.shape) # (2, 384)
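To see the "closeness" claim concretely, here is a small follow-on sketch using the same model and the cosine-similarity helper built into sentence-transformers (the example phrases are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([
    "machine learning",  # semantically close to the next phrase
    "statistical models that learn from data",
    "chocolate cake recipe",
])

# Cosine similarity: higher means more semantically similar
print(util.cos_sim(emb[0], emb[1]))  # relatively high
print(util.cos_sim(emb[0], emb[2]))  # much lower
```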
A RAG pipeline consists of several stages. Offline, documents are ingested and indexed. When a query arrives, the system retrieves the most relevant chunks, augments the prompt with that context, and generates a grounded, accurate response.
First, ingest raw data into the RAG system. To make retrieval effective, apply proper preprocessing, starting with chunking.
Chunking breaks large documents into smaller pieces that are easier to index and retrieve. Good chunking balances context with retrieval efficiency.
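A minimal fixed-size chunker with overlap, to make the trade-off concrete (the sizes here are illustrative, not recommendations):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Overlap preserves context that would otherwise be cut at chunk
    boundaries, at the cost of some duplicated storage.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

chunks = chunk_text("...your document text...", chunk_size=500, overlap=50)
```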
Once text is chunked and embedded into vectors, store it in a vector database (e.g., Elasticsearch). The vector DB stores embeddings and performs similarity search to match user queries with relevant chunks.
Create the index with a dense_vector field sized to the embedding model (384 dimensions for all-MiniLM-L6-v2):
curl -X PUT "localhost:9200/docs" -H "Content-Type: application/json" -d '
{
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "vector": { "type": "dense_vector", "dims": 384 }
    }
  }
}'
Index a document together with its embedding (the vector is truncated here; store the full 384-dimensional embedding):
curl -X POST "localhost:9200/docs/_doc" -H "Content-Type: application/json" -d '
{
  "text": "machine learning is powerful",
  "vector": [0.12, -0.93, ...]
}'
Keyword (BM25) search:
curl -X GET "localhost:9200/docs/_search" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "text": "machine learning"
    }
  }
}'
Semantic kNN search using a query embedding (shown truncated):
curl -X GET "localhost:9200/docs/_search" -H "Content-Type: application/json" -d '
{
  "knn": {
    "field": "vector",
    "query_vector": [0.12, -0.93, ...],
    "k": 3,
    "num_candidates": 10
  }
}'
Hybrid search, combining keyword and semantic scoring in one request:
curl -X GET "localhost:9200/docs/_search" -H "Content-Type: application/json" -d '
{
  "query": {
    "bool": {
      "should": [
        { "match": { "text": "machine learning" }},
        {
          "knn": {
            "field": "vector",
            "query_vector": [0.12, -0.93, ...],
            "k": 3,
            "num_candidates": 10
          }
        }
      ]
    }
  }
}'
Hybrid search combines keyword ranking (BM25) and semantic ranking (vector similarity).
Reranking is a post-processing step that improves result relevance by applying stronger scoring methods. It considers semantic relevance and similarity to reorder results for better quality. Reranking is more computationally expensive and is usually applied only to top results.
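One common way to rerank, sketched with a sentence-transformers cross-encoder (the model choice and candidate list are illustrative):

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score (query, document) pairs jointly, which is slower
# but more accurate than comparing precomputed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "machine learning"
candidates = [
    "machine learning is powerful",
    "deep learning uses neural networks",
    "machine learning and AI",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked)
```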
Elastic can serve as the retrieval backend for a RAG system. When a user query arrives, the query is embedded, Elastic retrieves the most relevant chunks via hybrid search, and those chunks are passed to the LLM as context alongside the original question.
This integration reduces hallucination and increases response accuracy.
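A minimal end-to-end sketch using the official Elasticsearch Python client and the embedding model from earlier; the index name matches the curl examples above, and generate_answer is a hypothetical stand-in for whatever LLM call you use:

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(query: str, k: int = 3) -> list[str]:
    # Embed the query and run a kNN search over the stored vectors
    query_vector = model.encode(query).tolist()
    resp = es.search(
        index="docs",
        knn={
            "field": "vector",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10,
        },
    )
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate_answer(prompt)  # hypothetical call to your LLM
```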
When building a RAG pipeline, most developers focus heavily on the LLM and ignore the retrieval layer. In practice, retrieval quality matters more than model size. If the retriever returns irrelevant chunks, even the best LLM will confidently generate incorrect answers. I realized this while experimenting with chunk sizes and indexing strategies: small changes in chunking and overlap significantly changed answer quality.
Another important point is that hybrid search often performs better than pure vector search. Vector similarity is powerful for semantic understanding, but keyword signals still matter in production. In many cases, combining BM25 with vector search improved precision and reduced noise. Reranking also made a visible difference, especially when the initial retrieval returned loosely related results.
Latency is another real-world factor that is often underestimated. Running embeddings, querying vectors, reranking, and then calling an LLM adds up quickly. In production systems, you must balance accuracy with response time. Tuning the top-K retrieval size, embedding model selection, and reranking depth directly impacts both performance and cost.
Finally, data freshness matters. RAG systems must support continuous indexing. If documents are not updated properly, the system becomes stale and starts returning outdated context. In production, retrieval pipelines must be monitored just like any other backend service.
Elastic Cloud provides a fully managed Elasticsearch environment with built-in scaling, security, and monitoring. Instead of managing nodes, shard allocation, replication, and cluster health manually, Elastic Cloud handles infrastructure operations. This allows developers to focus on indexing documents, embedding pipelines, hybrid retrieval, and LLM integration rather than maintaining search infrastructure.
For a RAG pipeline, Elastic Cloud supports dense vector storage, approximate kNN search, hybrid BM25-plus-vector ranking, and continuous indexing, all without manual cluster management.
A production-ready RAG architecture on Elastic Cloud typically includes an ingestion and chunking pipeline, an embedding service, a hybrid retrieval layer, an optional reranking step, and the LLM generation layer.
As embeddings scale into millions of vectors, cluster sizing becomes critical. Elastic Cloud allows vertical and horizontal scaling by adjusting node size and instance count without downtime. This is essential when handling increasing search traffic or expanding document collections.
Security is also a major factor. Elastic Cloud provides TLS encryption, API keys, and access controls out of the box. In AI applications dealing with private documents or enterprise data, this becomes non-negotiable.
In real-world systems, RAG is not only about retrieval and generation quality. It is about cluster stability, index performance, scaling strategy, and operational visibility. Elastic Cloud provides the infrastructure layer that makes large-scale RAG systems stable, secure, and production-ready.
Engineers tend to over-engineer things, but the true value of RAG is simple: strengthening LLM responses with real context from scalable systems like Elasticsearch. RAG makes LLMs less prone to hallucination and vastly improves relevance and accuracy.
If either stage of the pipeline, retrieval or generation, produces low-quality results, improve that part first; in practice, the retrieval components usually deserve attention before the model.
Project Repository:
GitHub on RAG
Note: The content of this blog is fully organic. AI was utilized solely for grammatical error correction and structural alignment.
2026-03-01 01:46:45
Page load time is important. According to the folks @ Retail TouchPoints:
A one-second delay eats away 7% of the coveted conversion rate.
A one-second delay decreases customer satisfaction by 16%.
Honestly, I always felt like having a slow website caused more damage than that. Anyway, no one wants a slow site.
If your site is slow, one easy win is minifying your CSS and JavaScript. Minification works by stripping unnecessary white space and comments from your files. With the white space removed, the file is smaller and downloads faster.
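Just to illustrate the idea (this is not how the bundle below works internally), here's a naive CSS minifier sketch; real minifiers handle far more edge cases:

```python
import re

def minify_css(css: str) -> str:
    # Strip /* ... */ comments
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)
    # Collapse runs of whitespace to a single space
    css = re.sub(r"\s+", " ", css)
    # Drop spaces around punctuation CSS doesn't need
    css = re.sub(r"\s*([{}:;,])\s*", r"\1", css)
    return css.strip()

print(minify_css("body {\n  color: red; /* brand */\n}"))
# body{color:red;}
```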
The easiest way I have found to minify CSS and JavaScript in Symfony is the SensioLabs Minify Bundle. This bundle makes minifying your JavaScript and CSS so incredibly easy that it's virtually no effort on your part. This is an easy win.
2026-03-01 01:42:54
Large Language Models are no longer prototypes running in notebooks.
They’re running in production systems that serve thousands (sometimes millions) of users.
And that changes everything.
If you’re working on RAG pipelines, AI agents, AI code review tooling, or any other LLM-powered product, then one truth becomes painfully clear:
Shipping once is easy. Maintaining and refactoring continuously is hard.
This blog breaks down battle-tested patterns for continuous refactoring with LLM systems, patterns that actually work in production.
Traditional software is deterministic: you ship it, and it keeps behaving the way you tested it.
LLM systems are probabilistic: model versions change, prompts drift, retrieval corpora evolve, and behaviour shifts under your feet.
LLM systems behave more like living organisms than static software.
So your architecture must evolve continuously.
One of the biggest anti-patterns in LLM engineering:
prompt = "Answer the question politely."
That’s not engineering. That’s chaos.
Production Pattern
Prompt Refactoring Framework:
| Layer | Refactor Strategy |
|---|---|
| System Prompt | Stability + constraints |
| Context Injection | Reduce noise |
| Few-shot Examples | Optimize token efficiency |
| Output Formatting | Enforce structured JSON |
Tip: Treat prompt updates like schema migrations, never casual edits.
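A sketch of what "prompts as versioned artifacts" can look like; the structure and field names are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PromptVersion:
    """A prompt treated like a schema migration: immutable and versioned."""
    version: str
    system: str              # stability + constraints
    few_shot: list[str] = field(default_factory=list)  # token-budgeted examples
    output_schema: str = ""  # enforced structured output, e.g. a JSON shape

REGISTRY = {
    "answering-v3": PromptVersion(
        version="answering-v3",
        system="You are a support assistant. Answer only from the given context.",
        output_schema='{"answer": "string", "sources": ["string"]}',
    ),
}

def get_prompt(name: str) -> PromptVersion:
    # Rolling back a bad prompt change becomes a registry lookup, not a hotfix
    return REGISTRY[name]
```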
Your RAG pipeline is not “set and forget.”
It degrades.
Production Refactor Pattern
1. Add Retrieval Metrics (see the recall@k sketch after this list)
2. Continuous Chunk Optimization
3. Retrieval A/B Testing
Split traffic between the current retrieval configuration and a candidate variant (for example, a new chunking strategy or embedding model).
Pro Tip: A RAG pipeline is a product, not an integration.
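For step 1, a minimal recall@k sketch; the labelled queries and document IDs are illustrative:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of known-relevant documents found in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / max(len(relevant_ids), 1)

# Evaluate against a small hand-labelled query set
labelled = {"what is bm25?": {"doc-12", "doc-31"}}
for query, relevant in labelled.items():
    retrieved = ["doc-12", "doc-07", "doc-31", "doc-50", "doc-02"]  # from your retriever
    print(query, recall_at_k(retrieved, relevant, k=5))  # 1.0 here
```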
AI agents are seductive.
But production agents are fragile.
When scaling AI agents, refactoring means:
Production-Grade Agent Refactor Checklist:
- Tool call validation layer
- Execution timeout guard
- Retry with structured fallback
- Deterministic planning phase
- Logging full thought chains (internally only)
In enterprise AI architecture, agents should:
Plan deterministically.
Execute probabilistically.
Validate strictly.
That separation alone reduces failure rates dramatically.
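A sketch of the "validate strictly" part: checking a proposed tool call against a schema before anything executes. The tool registry and schema here are hypothetical:

```python
import json

# Hypothetical tool registry: name -> required argument types
TOOLS = {"search_orders": {"customer_id": str, "limit": int}}

def validate_tool_call(raw: str) -> dict:
    """Reject malformed or unknown tool calls before execution."""
    call = json.loads(raw)  # raises on invalid JSON -> structured fallback
    name, args = call["tool"], call["args"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    for arg, typ in TOOLS[name].items():
        if not isinstance(args.get(arg), typ):
            raise ValueError(f"bad argument {arg!r} for {name}")
    return call

call = validate_tool_call('{"tool": "search_orders", "args": {"customer_id": "c42", "limit": 5}}')
```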
In early-stage systems, everything talks to the LLM directly.
In production? That becomes a nightmare.
The Refactor: Layered AI Architecture

Client Layer
↓
Orchestration Layer
↓
LLM Abstraction Layer
↓
Retrieval Layer
↓
Observability & Evaluation Layer
Why?
Because this enables swapping models without touching clients, testing each layer in isolation, and centralizing observability and evaluation.
This is where LLM engineering becomes real software engineering.
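A sketch of the LLM abstraction layer as a small interface, so providers can be swapped without touching orchestration code; the class and method names are illustrative:

```python
from typing import Protocol

class LLMClient(Protocol):
    """What the orchestration layer is allowed to know about an LLM."""
    def complete(self, prompt: str) -> str: ...

class HostedModelClient:
    def complete(self, prompt: str) -> str:
        ...  # call the provider SDK here

class LocalModelClient:
    def complete(self, prompt: str) -> str:
        ...  # call a self-hosted model here

def orchestrate(llm: LLMClient, prompt: str) -> str:
    # Orchestration depends only on the interface, never on a vendor SDK
    return llm.complete(prompt)
```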
AI code review tools are everywhere.
Most fail because they flag style nits instead of design problems, lack repository-wide context, and bury reviewers in low-value comments.
Production Refactor Strategy
The secret?
AI code review must behave like a senior engineer, not a linter.
If you're not measuring, you're guessing.
Modern LLM systems need offline evaluation sets, online quality monitoring, prompt regression tests, and automatic rollback paths.
Build an LLM CI/CD Loop
Prompt Change →
Offline Evaluation →
Shadow Deployment →
Live Monitoring →
Auto Rollback if Degraded
This is DevOps for AI systems.
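A sketch of the offline-evaluation gate in that loop; the golden set, scoring rule, and threshold are placeholders for your own eval suite:

```python
EVAL_SET = [  # illustrative golden examples
    {"question": "What is BM25?", "must_contain": "ranking"},
]

def offline_eval(generate) -> float:
    """Score a candidate prompt/model against a golden set before deploying."""
    passed = sum(
        1 for ex in EVAL_SET
        if ex["must_contain"].lower() in generate(ex["question"]).lower()
    )
    return passed / len(EVAL_SET)

def ci_gate(generate, baseline_score: float) -> bool:
    # Block the deployment (or trigger rollback) if quality regresses
    return offline_eval(generate) >= baseline_score
```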
LLM systems are expensive if left unoptimized.
Refactor targets: token usage per request, context window bloat, model tier per task, and redundant calls.
Introduce response caching, prompt compression, cheaper models for routine steps, and batching where latency allows.
Cost optimization is architecture, not finance.
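One of the cheapest wins is a response cache keyed on the exact prompt, sketched minimally below; real systems also version the key by prompt template and handle eviction:

```python
import hashlib

_cache: dict[str, str] = {}

def cached_complete(llm, prompt: str, model: str = "example-model") -> str:
    # Key includes the model so upgrades don't serve stale answers
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = llm.complete(prompt)  # the interface sketched earlier
    return _cache[key]
```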
In enterprise environments, continuous refactoring becomes even more critical: compliance and audit requirements are stricter, multiple teams depend on the same pipelines, and the cost of a silent regression is far higher.
This is where companies often bring in specialists.
For example, firms like [Dextra Labs – AI Consulting & LLM Engineering Experts] help enterprises design scalable enterprise AI architecture, production-grade RAG pipelines, and robust AI agents with continuous evaluation baked in from day one.
Rather than just building demos, they focus on long-term maintainability, measurable quality, and operational reliability.
Because production AI is not a hackathon project.
We’re already seeing early attempts at self-refactoring AI: automated prompt optimization, self-tuning retrieval, models that critique their own outputs.
But until that becomes reliable, humans must design:
Refactorable-by-default LLM systems.
Before you scale your LLM system, ask: Can you change a prompt safely? Can you measure retrieval quality? Can you detect a regression before your users do? Can you roll back?
If not, refactor before you scale.
Continuous refactoring with LLMs isn’t optional.
It’s the difference between systems that quietly degrade and systems that continuously improve.
As LLM engineering matures, the teams that win won’t be the ones who ship first.
They’ll be the ones who refactor continuously.
2026-03-01 01:42:48
readme-to-index — My First GitHub Marketplace Release 🎉
Today I published my first GitHub Action to the Marketplace.
It’s called readme-to-index, and it does something very simple:
It turns your README.md into a clean, styled index.html.
That’s it.
No Jekyll.
No Ruby.
No themes.
No _config.yml.
No implicit behaviour.
Just Markdown → HTML → Done.
I have a lot of small projects.
Many of them already have good READMEs. In fact, for most of them, the README is the documentation.
So the obvious question is:
Why create a separate site when the README already exists?
GitHub Pages + Jekyll is great. But for small libraries and utilities, it can feel like overkill.
I wanted something:
So I built a GitHub Action that does exactly one thing:
README.md → index.html
Styled using Simple.css (by default, but you can use any stylesheet you like).
Converts README.md to index.html using Pandoc
Sets the page <title> from the first # heading
Your README remains the canonical source of truth.
Add this to your workflow:
- uses: davorg/readme-to-index@v1
  with:
    output: _site/index.html
Then deploy using the standard GitHub Pages artifact flow.
The action supports a few optional inputs:
readme
Path to the Markdown source file.
Default: README.md
output
Path to the generated HTML file.
Default: index.html
css_url
CSS stylesheet to include in the generated HTML.
Default: https://cdn.simplecss.org/simple.min.css
install_pandoc
Whether to install Pandoc automatically using apt-get.
Default: true
Set this to false if your workflow already installs Pandoc.
extra_pandoc_args
Additional arguments passed directly to the Pandoc command.
Example:
- uses: davorg/readme-to-index@v1
  with:
    extra_pandoc_args: "--toc"
For this to deploy as a website, you’ll need to enable GitHub Pages in your repository settings.
Go to:
Repository → Settings → Pages → Build and deployment
Set Source to GitHub Actions.
That’s it. The workflow will handle the rest.
Here’s a minimal working example:
name: Publish README to GitHub Pages

on:
  push:
    branches: [ main ]
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

jobs:
  pages:
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - uses: actions/checkout@v4
      - uses: davorg/readme-to-index@v1
        with:
          output: _site/index.html
      - uses: actions/configure-pages@v5
      - uses: actions/upload-pages-artifact@v3
        with:
          path: _site
      - id: deployment
        uses: actions/deploy-pages@v4
You absolutely can.
But this approach has two real advantages:
There’s no Ruby toolchain.
There are no implicit layouts.
There are no theme conventions to understand.
It takes one Markdown file and produces one HTML file.
That’s it.
Because it’s "just a step", you can use it in any workflow, combine it with other build steps, or deploy the output anywhere you like.
It doesn’t rely on GitHub Pages’ default behaviour.
It works wherever Pandoc runs.
Shipping something to the GitHub Marketplace feels surprisingly significant.
It’s a tiny tool.
It does one thing.
But it does it cleanly.
That’s the kind of tooling I like building.
If you’re interested:
👉 https://github.com/marketplace/actions/readme-to-index-html
Feedback welcome. Stars appreciated. Minimalism encouraged.
2026-03-01 01:39:26
This is a submission for the DEV Weekend Challenge: Community
I'm a mom. My son is three. For the past year, we've been building games together using AI. He tells me what the game should do, I type prompts into Claude, and we ship it. His portfolio lives at madladstudios.com. Nine games and counting.
Through this, I've connected with a growing community of parents who want to do the same thing: help their kids grow up using AI as a creative tool instead of being passive consumers. They're homeschoolers, tech-industry parents, AI-curious teachers. They all say the same thing:
"My kid's just a toddler. I don't know where to start."
That's the cold start problem. It's not that you don't know how to use Claude or ChatGPT. It's staring at a blank prompt box and wondering how to make it relevant to this little human who mostly wants to watch Helper Cars. What do you ask for? How specific should you be? What's even realistic for a toddler?
Pixel Foundery — a prompt generator for parents who want to build things with their kids using AI.
Three inputs: your kid's age, what they love, and a project type.
One output: a ready-to-paste prompt tailored to your kid, designed to produce a working single-file HTML project when pasted into any AI chatbot.
Hit "Generate," copy the prompt, paste it into Claude or ChatGPT, and you have something to build together in minutes.
The prompts are opinionated. They ask for a single HTML file, big tap targets, zero required reading, and instant feedback.
Because that's what actually works when you're building with a two-year-old. I know — I've done it.
Pick your kid's age, pick what they love, pick a project type, generate a prompt. Paste it into your AI tool of choice. Build something cool with your kid tonight.
Your kid can build apps before they can read. This is the tool that makes it easy.
Single HTML file. No frameworks. No dependencies. Fork it, open it, pick an age + interest + project type, and get a ready-to-paste prompt that produces a working app in any AI chatbot.
git clone https://github.com/meimakes/pixelfoundery.com.git
open index.html
That's it. Pick your kid's age, what they love, what to build. Copy the prompt. Paste into Claude / ChatGPT / Gemini. Build together.
25+ handcrafted prompt templates across 6 categories:
| Category | What it builds |
|---|---|
| 🎮 Game | Tap-to-jump, catch, whack-a-mole, memory, racing |
| 📖 Story | Choose-your-own-adventure, interactive storybooks, character creators |
| 🎨 Art | Finger painting, pixel art, stamp tools, coloring pages |
| 🔬 Science | Color mixing, plant growing, weather sim, physics playground |
| 🧮 Math | Counting, shapes, bigger/smaller, addition, patterns |
| 🎵 Music | Drum machine, piano, sound mixer, rhythm games, melody maker |
Every prompt…
The project exists in two versions, both honoring the same idea: a parent and a kid should be able to build something together in five minutes.
The repo: index.html — A single HTML file. No build step. No dependencies. No server. 25+ handcrafted prompt templates across 6 categories. Fork it, open it in a browser, done. The prompts themselves produce single-file HTML projects — it's turtles all the way down.
The live site: pixelfoundery.com — Built on Anything, upgraded with real-time AI generation via Claude Sonnet. Same interface, same inputs, but instead of drawing from templates, it creates unique prompts every time.
One version trusts the templates. The other trusts the model. Both produce the same thing: a prompt you can paste into any AI and build something real with your kid.
The static version is a single index.html — HTML, CSS, vanilla JS. No frameworks, no build step. The irony of a tool that generates single-file projects being a single-file project itself is not lost on me.
The live version is powered by Anything with Claude Sonnet generating prompts on the fly. Same UI, infinite variety. The template library serves as both the standalone experience and the fallback if the AI is unavailable.
Light and dark mode via prefers-color-scheme, fully responsive, mobile-first.
The prompts are the product. The UI took a few hours. Writing prompts that reliably produce good results from multiple AI models across six project categories and eight interest areas — that's where the real work was. Getting the specificity right so Claude, ChatGPT, and Gemini all produce something a 3-year-old can actually play with required a lot of iteration.
The gap this fills: Every existing resource for teaching kids to code either assumes coding knowledge, targets 8-year-olds, requires a $200 robot kit, or is too vague to be actionable. Pixel Foundery is just the missing first step — a good prompt, personalized to your kid.
Parents don't need "less screen time" advice. They need better screen time. My community is parents who want to sit next to their kids and make something together. That's a fundamentally different relationship with technology.
I'm also writing Raising Pixels — a newsletter about computational thinking for tiny humans, building with AI as a family, and raising kids who create more than they consume.
2026-03-01 01:38:51
I’ve been working on a React data grid called RGGrid, built specifically for internal tools and business applications.
Most grid libraries stop at sorting/filtering. In real apps I kept needing:
So I built RGGrid around those use cases.
Key features:
Would love feedback — especially around API design and plugin architecture.
Demo: Playground
If this project ends up helping you, you can support it here: Buy me a coffee