2026-05-01 04:10:41
RAG is not an out-of-the-box system. It is a pipeline of decisions, and every one of them matters.
A demo RAG can be built with three lines of LangChain or any other framework, but for it to work in production, there are five important choices you have to make carefully. If any one of them is sub-optimal, the whole system can quietly degrade.
I have seen many ways these systems break in practice. Your chunker may split an important fact across two chunks. Your retriever may pull the right document at rank 7 when your architecture only passes the top 5 to the model. Your generator may produce a citation that looks correct but does not exist in any source document. Each of these is a quiet failure. The user just sees a confident wrong answer.
This article walks through the five decisions you need to make to build a well-optimized RAG.
The first decision is whether to build a RAG in the first place.
This used to be obvious. Context windows were 4K to 32K tokens, so retrieval was the only way to fit a large knowledge base into the model. That has changed. Frontier models now support 200K to 2M token context windows, and context caching has dropped the cost of repeated input to roughly 10% of uncached tokens.
For small to medium corpora and repeat-query workloads, loading everything into context and caching it is often cheaper and simpler than retrieval.
Long context does not eliminate RAG; it changes when RAG is the right choice. A 40-page HR policy may fit in context for a single query, but a 50 GB internal wiki cannot.
RAG is still the right choice when the corpus is too large to cache, the content changes faster than you can re-cache it, or answers must be restricted to documents the user is allowed to see.
If your corpus fits in a cached context and your query volume is bounded, you may not need RAG. Just put the documents in the prompt and use context caching to keep costs down.
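As a rough sketch of the caching option, assuming the Anthropic Python SDK (other providers have similar caching APIs), an illustrative model name, and full_policy_text standing in for your corpus:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": full_policy_text,  # the whole small corpus, loaded elsewhere
            "cache_control": {"type": "ephemeral"},  # cache this block across calls
        }
    ],
    messages=[{"role": "user", "content": "How many vacation days do new hires get?"}],
)
print(response.content[0].text)

Repeat queries hit the cached prefix, so you pay the full input price once and the discounted rate afterwards.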
If you are doing RAG, the way you prepare and split your documents decides the quality of retrieval. Even the best reranker cannot fix a bad chunk.
Chunk size is something you should test and adjust, not pick once and forget. In many RAG systems, smaller chunks (200 to 400 tokens) with simple recursive splitters work better than the larger defaults. There is a balance: chunks that are too big add noise around the relevant sentence, and chunks that are too small lose the context around them. The default of 1000 tokens is usually not the best choice.
Semantic chunking groups sentences by meaning. It can improve retrieval on documents that mix several topics. The cost is that when documents change, you may need to redo embeddings around the cluster boundaries. It is good for stable corpora, but usually not ideal for streaming or constantly changing data.
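A minimal sketch of the idea, assuming sentence-transformers and a similarity threshold you would tune per corpus:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences, threshold=0.6):
    # Split wherever adjacent sentences diverge in meaning.
    embs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(np.dot(embs[i - 1], embs[i]))  # cosine similarity (vectors are normalized)
        if sim < threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks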
Tables and images in PDFs usually need a vision model to be parsed properly. If you only extract the PDF as text, tables collapse into broken runs of numbers with the column structure lost.
Example 1: pricing table problem
A PDF pricing table has columns like:
Plan | Price | Users | Features
A naive text extraction may turn it into:
Basic 10 5 Pro 30 20 Enterprise Custom Unlimited
The chunker can no longer tell which price belongs to which plan.
Example 2: vision parser benefit
A vision model (GPT-4o, Claude Sonnet, Gemini) can read the same table and return structured text:
Basic plan: $10, 5 users
Pro plan: $30, 20 users
Enterprise: custom price, unlimited users
Example 3: chart problem
A chart image often has no useful text inside the PDF. Plain text extraction skips it entirely.
Example 4: vision parser for charts
A vision model can describe the chart in plain language:
Revenue increased from January to June, with the largest jump in May.
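A sketch of the vision-parsing step, assuming the OpenAI Python SDK; the model name, file name, and prompt are illustrative:

import base64
from openai import OpenAI

client = OpenAI()

with open("pricing_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe this table as one line per row, keeping each value "
                     "attached to its column name. Describe any charts in plain language."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)  # structured text, ready for chunking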
Metadata is a first-class retrieval signal. For every chunk, store the source file, page number, section heading, author, and date. Filtering by metadata before searching embeddings often makes retrieval much cleaner.
Example 1: page number. If the answer comes from page 12 of a PDF, saving the page number lets you cite the exact source.
Example 2: section heading. A chunk from the "Refund Policy" section is more useful for a refund question than a random chunk from the same document.
Example 3: date filter. If the user asks about the latest pricing, you can first filter for recent documents and then search inside those chunks.
Example 4: source filter. If the user asks about HR policy, you can search only HR documents instead of the whole knowledge base.
In code, the difference looks like this:
# Bad: chunker ignores structure and drops metadata
chunks = text.split("\n")

# Better: recursive splitter, structural separators, metadata preserved
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=40,
    separators=["## ", "### ", "\n\n", "\n", " "],
)
chunks = [
    # Chunk is your own record type carrying the metadata alongside the text
    Chunk(text=c, source=doc.source, page=doc.page, date=doc.date)
    for c in splitter.split_text(doc.text)
]
Tune ingestion before you tune retrieval. The ceiling is set here.
Basic retrieval means using one embedding model, doing one similarity search, and returning the top few results. That used to be enough. For most production systems today it is not, because the single-vector assumption breaks on questions that are unclear or need information from multiple places.
Example 1: ambiguous question
User asks:
What is the policy for returns?
"Returns" could mean product returns, tax returns, or returning equipment. A simple search may pick the wrong meaning.
Example 2: multi-hop question
User asks:
Which customers had failed payments and later contacted support?
The system needs payment data and support ticket data. One similarity search will not connect both.
Example 3: single-vector problem
A single embedding represents the whole query as one meaning. Some questions contain multiple sub-questions, so one vector loses part of the intent.
Production RAG today usually combines several techniques.
Before searching, it can help to rewrite the user's question into a form that looks more like the document text. The user's question and the answer often mean the same thing in different words, and embedding search can miss that.
Example 1
Original question:
How do I cancel a subscription?
Document text:
To cancel, open Settings > Billing and select Manage Plan.
Rewritten query:
Cancel subscription settings billing manage plan
Example 2
Original question:
Can I get my money back?
Document text:
Refunds are available within 14 days of purchase.
Rewritten query:
refund policy money back purchase 14 days
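A minimal rewriting sketch, assuming the OpenAI Python SDK; the prompt wording is illustrative:

from openai import OpenAI

client = OpenAI()

def rewrite_query(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Rewrite the question as keyword-style search text "
                        "that matches documentation wording."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()

print(rewrite_query("Can I get my money back?"))  # e.g. "refund policy money back purchase"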
HyDE (hypothetical document embeddings) has the model first write a fake answer to the question, then embeds that fake answer and uses it for search. The fake answer may be wrong in detail, but it sits in the same part of the embedding space as the real answer, so search has a better chance of finding the right passage.
Example 1
Original question:
How do I cancel my subscription?
The model generates a fake answer:
To cancel your subscription, go to Settings, open Billing, and choose Cancel Plan.
Searching with this fake answer often finds the real document section:
Open Settings > Billing > Manage Plan to cancel your subscription.
Example 2
Original question:
What happens if a payment fails?
Fake answer:
If a payment fails, the system retries the charge and may pause the account.
This helps search find passages about failed payments, retries, billing status, and account suspension, even if the exact retry logic in the fake answer is wrong.
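A HyDE sketch, assuming the OpenAI Python SDK for both generation and embeddings; model names are illustrative:

from openai import OpenAI

client = OpenAI()

def hyde_embedding(question: str) -> list[float]:
    # Step 1: have the model write a plausible (possibly wrong) answer.
    fake = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short plausible answer to: {question}"}],
    ).choices[0].message.content
    # Step 2: search with the fake answer's vector, not the question's.
    return client.embeddings.create(
        model="text-embedding-3-small", input=fake,
    ).data[0].embedding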
Query decomposition breaks a compound question into smaller questions, searches each one separately, and combines the results.
Example
Original question:
Compare Stripe and Square on international fees and dispute handling.
Break it into:
- Stripe international fees
- Square international fees
- Stripe dispute handling
- Square dispute handling
Search each separately, then let the model write the comparison from the retrieved chunks, as in the sketch below. A single search for "Stripe Square international fees dispute handling" usually returns a vague comparison page and misses the specific fee and dispute sections.
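A decomposition sketch; llm and search are placeholders for your own model call and index lookup:

def decomposed_retrieve(question: str, k: int = 5):
    subs = llm(f"Split into standalone search queries, one per line:\n{question}")
    seen, merged = set(), []
    for sub in subs.splitlines():
        for doc in search(sub.strip(), k):
            if doc.id not in seen:  # dedupe across sub-queries
                seen.add(doc.id)
                merged.append(doc)
    return merged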
Hybrid search combines dense vector search (for meaning) with BM25 (for exact words). BM25 is useful for error codes, product SKUs, and any technical token where the exact string matters.
Example 1: error code
Error E1027 during checkout
Vector search may find general checkout problems. BM25 finds the exact code E1027.
Example 2: SKU lookup
Find details for SKU ABX-4421
BM25 matches the SKU exactly. Vector search may return a similar-looking product, which is not useful here.
Example 3: semantic match
How do I stop my subscription?
Vector search can match this with "Cancel your plan from Billing Settings", even though the wording is different.
Combine the scores from both with reciprocal rank fusion. A hybrid usually beats pure dense for any corpus with domain jargon.
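Reciprocal rank fusion itself is a few lines of plain Python; it combines rankings by rank position, so the dense and BM25 scores never need to be on the same scale:

def rrf(dense_ids, bm25_ids, k=60):
    scores = {}
    for ranking in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf(["d3", "d1", "d7"], ["d7", "d9", "d3"])
# d3 and d7 rise to the top because both retrievers rank them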

Two-stage retrieval first uses a fast model to pull 50 to 100 candidates, then a slower reranker scores the candidates more carefully and picks the top 5.
The reranker is slower per pair, but more accurate, because it scores the query and the passage together rather than as separate vectors. Common choices are Cohere Rerank and BGE Reranker.
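A two-stage sketch, assuming sentence-transformers and a placeholder candidate list from the first-stage retriever:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # The cross-encoder scores each (query, passage) pair jointly.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]

top5 = rerank("How do I stop my subscription?", candidate_chunks)  # 50-100 candidates in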
Filter by date ranges, tenant IDs, and document types before similarity search, not after. Post-filtering wastes the top-k on documents the user is not allowed to see.
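A pre-filtering sketch using Chroma's where clause; collection and field names are illustrative:

results = collection.query(
    query_texts=["What is the refund policy?"],
    n_results=5,
    where={
        "$and": [
            {"tenant_id": {"$eq": "acme"}},   # hard access boundary
            {"doc_type": {"$eq": "policy"}},  # narrow before scoring
        ]
    },
)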
If you can only add one upgrade over naive retrieval, add a reranker. It is the single change with the best payoff for the least work.
A basic RAG pipeline searches once and answers once. It works only if the search results are good. If the retriever brings back bad chunks, the model may still write a confident answer using that bad information. There is no checking step, so the system does not know whether the retrieved content was useful.
A better pattern checks the retrieved results before answering. If the results look weak, the system tries something else instead of generating a low-quality answer. There are a few common ways to implement this.
A small classifier labels each retrieved document as relevant, ambiguous, or irrelevant, the pattern popularized by corrective RAG (CRAG). If most are irrelevant, the system runs a different search (often web search) instead of generating from the bad context.
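A grader sketch; llm, web_search, retrieved_chunks, and q are placeholders:

def grade(question: str, chunk: str) -> str:
    return llm(
        "Label the document for this question as exactly one of: "
        f"relevant, ambiguous, irrelevant.\nQuestion: {question}\nDocument: {chunk}"
    ).strip().lower()

labels = [grade(q, c) for c in retrieved_chunks]
if labels.count("irrelevant") > len(labels) / 2:
    retrieved_chunks = web_search(q)  # fall back instead of generating from noise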

Self-RAG goes further: the model decides whether to retrieve at all on each generation step, and critiques its own output against the retrieved evidence using reflection tokens.
The RAG system runs as a workflow rather than a single shot. It searches, checks the results, and decides what to do next. If the results are good, it answers. If they are bad, it rewrites the query, runs web search, or escalates to a human. These loops are usually built on LangGraph, LlamaIndex Workflows, or a similar state machine.
The shape of the loop is:
query → retrieve → grade
  ├── good → generate answer
  └── bad → rewrite query or web search → retrieve → …
The downside of a loop is more latency and more tokens per query. The benefit is that the system can say "I don't know" when the evidence is weak, instead of guessing. This tradeoff is usually worth it in high-stakes domains like medical, legal, and finance, but often not in a casual chatbot.
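The loop reduces to a few lines of control flow; retrieve, grade_all, rewrite, web_search, and generate are placeholders for your own steps:

def answer(query: str, max_rounds: int = 3) -> str:
    q = query
    for _ in range(max_rounds):
        chunks = retrieve(q)
        if grade_all(q, chunks) == "good":
            return generate(q, chunks)
        q = rewrite(q)  # or: chunks = web_search(q)
    return "I don't know: the sources I found don't answer this."

Capping the rounds keeps latency bounded, and the fallback is an honest "I don't know" rather than a guess.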

You need to test the search part and the answer-writing part separately. If you only score the final answer, you cannot tell which half is broken. A good generator can produce a polished, confident-looking answer on top of bad retrieval, and you will not see the problem until a user reports it.
These are the metrics most teams use to measure RAG performance.
Context precision, context recall, MRR (mean reciprocal rank), and hit rate at k. The question these answer is: did the right documents show up, and how high in the ranking?
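MRR and hit rate at k are simple to compute once you have a labeled set of (query, gold document) pairs; search is a placeholder for your retriever:

def retrieval_metrics(eval_set, search, k=5):
    rr_total, hits = 0.0, 0
    for query, gold_id in eval_set:
        ids = [d.id for d in search(query, k)]
        if gold_id in ids:
            hits += 1
            rr_total += 1.0 / (ids.index(gold_id) + 1)  # reciprocal rank
    n = len(eval_set)
    return {"mrr": rr_total / n, f"hit_rate@{k}": hits / n}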
Faithfulness measures whether the answer stays inside the retrieved context. Answer relevancy measures whether the answer actually addresses the question that was asked. A faithful answer to the wrong question is still useless.
Score against a ground-truth answer set. This is slow to build and painful to maintain, but it is the only thing that tells you whether the full system actually works for users. Start with 50 queries and grow the set every time a real user reports a bad answer.
RAGAS, DeepEval, and Phoenix automate these metrics by using a stronger model to grade a weaker one. The judge has biases, often toward longer answers and certain phrasings. Calibrate it against human labels on a small sample before trusting the scores. Otherwise the judge's biases become your system's biases.
Several teams have written about how they apply these patterns in real systems. The useful lesson from each one is usually the constraint that shaped the architecture, not the architecture itself.
DoorDash Dasher support. DoorDash built a RAG system over its support articles and added two checking layers: a real-time guardrail that validates responses before they reach users, and a quality judge that monitors answers after the fact. The retrieval part was straightforward. The validation layer is what brought hallucinations down by about 90% after launch.
Royal Bank of Canada (Arcane). RBC built Arcane to help financial advisors search complex investment policies. The hard part was not picking a better embedding model. The hard part was normalizing semi-structured documents from many internal systems and connecting cross-references between policies at answer time.
LinkedIn customer support. LinkedIn combined RAG with a knowledge graph built from historical support cases. The graph preserves relationships that text chunking would lose, like shared root causes and linked resolutions. Retrieval pulls connected sub-graphs rather than isolated chunks. After six months in production, it cut median resolution time by 28.6%.
The common thread has nothing to do with the model or the vector store. Each system is a pipeline of deliberate decisions, and the decisions that mattered most were the ones shaped by a constraint specific to that team, not the ones a reference architecture would suggest.
A working RAG system is built from many small decisions, and each one has a quiet way of breaking the system if you choose it badly. That is why every decision needs to be made deliberately. The teams that ship well-performing RAG systems get there by recognizing that the embedding model is rarely the thing that matters most.
Case studies
DoorDash: Path to high-quality LLM-based Dasher support automation
RBC: Arcane, a RAG system for investment policy search and advisory at RBC (ZenML LLMOps Database)
2026-05-01 04:00:20
Data engineering is the process of designing and building systems for collecting, storing, and analyzing data at scale, foundational for data science and business intelligence initiatives.
In this listicle, you'll find some of the best data engineering courses and career paths that can help you jumpstart your data engineering journey!
How I learned to stop using pandas and love SQL.
Processing large data for cleansing, aggregation, or filtering is blazingly fast with the Polars DataFrame library in Python, thanks to its design.
Explore the evolution of DataOps in data engineering, its parallels with DevOps, challenges it addresses, and best practices. Transformative future of DataOps.
Standard Audiences: A product that extends the functionality of regular Audiences, one of the most flexible, powerful, and heavily leveraged tools on mParticle.
Here are two common errors that you'll want to watch out for when using the to_sql method to save a data frame into an Oracle database.
Let's see how Nessie, Dremio and MinIO work together to enhance data quality and collaboration in your data engineering workflows.
The following is a basic code snippet to save a DataFrame to an Oracle database using SQLAlchemy and pandas.
Metabase is a business intelligence tool for your organisation that plugs into various data sources so you can explore data and build dashboards. I'll aim to provide a series of articles on provisioning and building this out for your organisation. This article is about getting up and running quickly.
When deploying MinIO in virtualized environments, it’s important to make sure that the proper conditions are in place.
Aptible Enclave fortifies data security in DevOps with its secure infrastructure for database management.
Result: predictable costs, fewer incidents, reproducible jobs across environments.
Master key time series feature engineering techniques to enhance predictive models in finance, healthcare & more with our comprehensive guide.
Discover WarpStream, a powerful and user-friendly Kafka API-compatible data streaming platform designed to simplify your data infrastructure.
We built data governance for a world where humans read the warning labels. AI agents don't read. They just query. That gap is now a production risk.
MinIO includes several ways to replicate data so you can choose the best methodology to meet your needs.
Is Astronomy data science?
RAG fails less from the LLM and more from retrieval: bad chunking, weak metadata, embedding drift, and stale indexes. Fix the pipeline first.
Explore time series analysis: from cross-validation, decomposition, transformation to advanced modeling with ARIMA, Neural Networks, and more.
What started as a simple script evolved into a full-fledged data engineering and NLP pipeline that can process a decade's worth of legal decisions in minutes.
As we sit down for this exclusive interview, Leonid offers a rare glimpse into the intricate process of weaving the digital fabric that shapes our lives.
Cross-cluster replication (CCR) in Apache Doris is proven to be fast, stable, and easy to use. It secures a real-time data synchronization latency of 1 second.

When it comes to Big Data infrastructure on Google Cloud Platform, the most popular choices data architects need to consider today are Google BigQuery, a serverless, highly scalable, and cost-effective cloud data warehouse; Apache Beam-based Cloud Dataflow; and Dataproc, a fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
Why we chose to finally buy a unified data workspace (Atlan), after spending 1.5 years building our own internal solution with Amundsen and Atlas
Learn how to build an n8n workflow that processes text, stores data in two databases, and sends messages to Slack.
Since the big bang in the data technology landscape a decade and a half ago, which gave rise to technologies like Hadoop that cater to the four 'V's (volume, variety, velocity, and veracity), there has been an uptick in the use of databases with specialized capabilities for different types of data and usage patterns. You can now see companies using graph databases, time-series databases, document databases, and others for different customer and internal workloads.
Too lazy to scrape nlp data yourself? In this post, I’ll show you a quick way to scrape NLP datasets using Youtube and Python.
The BI interview hasn't caught up with the job. Here are 30 questions that reflect what it actually means to be a BI engineer in 2026.
Everything you've ever wanted to learn about OpenMetadata.
Is dbt kicking your butt? Take a look at SDF.
In this article, we cover how to use pipeline patterns in python data engineering projects. Create a functional pipeline, install fastcore, and other steps.
With each day, enterprises increasingly rely on data to make decisions.
What is Apache SeaTunnel, and can it help you with your data engineering?
See how Andrei Shcherbinin built production-ready ML systems with 12x faster attribution, 95% chatbot automation, and stronger monitoring.
MLOps is a set of practices and tools aimed at addressing the specific needs of engineers building models and moving them into production.
Monitor data quality with Amazon Deequ, InfluxDB, and Grafana in a Dockerized environment using Scala/Java and Apache Spark.
A new generation of AI-native data pipelines is emerging — built for unstructured data, dynamic schemas, and LLM-powered workloads.
Do we need a radical new approach to data warehouse technology? An immutable data warehouse starts with the data consumer SLAs and pipes data in pre-modeled.
In this post, I discuss the algorithms of a nested loop, hash join, and merge join in Python.
Explore how data engineering revolutionizes gaming with AI, AR/VR, blockchain, and more, enabling immersive experiences and shaping the industry's future.
An overview of dbc, an online open-source tool to facilitate adbc and apache arrow.
Apparently hot-cold data separation is hot now. Let's figure out why.
It doesn't matter if you are running background tasks, preprocessing jobs, or ML pipelines. Writing tasks is the easy part. The hard part is the orchestration: managing dependencies among tasks, scheduling workflows, and monitoring their execution is tedious.
Data augmentation is a technique used by practitioners to increase the data by creating modified data from the existing data.
Influenza Vaccines and Data Science in Biology
Round 1 of the R Systems BlogBook: Chapter 1 contest is now live! Showcase your expertise, participate, and win exciting prizes. Submit your entry today!
Learn about LanceDB and how it fits into a stack that allows you to more easily create your own LLM models
In "Towards Open Options Chains", Chris Chow presents his solution for collecting options data: a data pipeline with Airflow, PostgreSQL, and Docker.
In this article, we explore these challenges and present a strategic approach to optimize JOINs in BigQuery.
How to index academic research papers by extracting metadata (e.g., title, authors, abstract) for AI agents and AI workflows using LLMs and CocoIndex.
Introducing a data platform architecture framework that enables organizations to systematically design and implement scalable data platforms.
Every micro-interaction is silently recorded, analyzed, and monetized.
DevOps for Data is not about fixing pipelines or deploying models. It’s about designing systems that remain reliable, secure, and predictable.
Maximizing efficiency is about knowing how the data science puzzles fit together and then executing them.
LangChain is a crucial component for developing LLM applications. It helps with orchestration and acts as a building block.
Meet The Entrepreneur: Alon Lev, CEO, Qwak
Extracts, embeds, and stores multimodal PDF elements — text with SentenceTransformers and images with CLIP — in a vector database for unified semantic search.
Data trust starts and ends with communication. Here’s how best-in-class data teams are certifying tables as approved for use across their organization.
Is the data engineer still the "worst seat at the table?" Maxime Beauchemin, creator of Apache Airflow and Apache Superset, weighs in.
Large Language Models (LLMs) are artificial intelligence systems that learn human language from massive text databases.
Web 3 is loudly making rounds as a decentralized internet. How will this affect data control in general?
Data Version Control (DVC) is a data-focused version of Git. In fact, it’s almost exactly like Git in terms of features and workflows associated with it.
Bridging the gap between application developers and data scientists, demand for data engineers rose by up to 50% in 2020, especially due to increased investment in AI-based SaaS products.
PandasAI is an open-source tool that makes data analysis feel like a casual chat with a data-savvy friend.
This is a collaboration between Baolong Mao's team at JD.com and my team at Alluxio. The original article was published on Alluxio's blog. This article describes how JD built an interactive OLAP platform combining two open-source technologies: Presto and Alluxio.
Writing ML code as pipelines from the get-go reduces technical debt and increases velocity of getting ML in production.
Location-based information makes the field of geospatial analytics so popular today. Collecting useful data requires some unique tools covered in this blog.
Apache Kafka has gotten rather long in the tooth, is Apache Iggy the successor?
Modern distributed systems are all about tradeoffs. Performance, reliability, scalability, and consistency don't come for free—you always pay a price somewhere.
Integrating data engineering with AI has led to the popularity of modern data integration and the expertise required.
See how to leverage the Airflow ShortCircuitOperator to create data circuit breakers to prevent bad data from reaching your data pipelines.
Applying machine learning models at scale in production can be hard. Here's the four biggest challenges data teams face and how to solve them.
Get hands-on with Apache Iceberg by building a prototype data lakehouse on your laptop.
A brief run-through of DeltaStream and how it simplifies working with streaming data such as Kinesis and Apache Kafka, taking advantage of Apache Flink.
This post explains what a data connector is and provides a framework for building connectors that replicate data from different sources into your data warehouse
Learn how to build a multilingual text-to-audio converter using Python. This guide covers essential libraries, techniques, and best practices
What is a skills-based economy and how is LinkedIn moving from vision to implementation? There’s AI, taxonomy and ontology involved in building the Skills Graph
Learn how Apache Doris breaks down data silos for insurance firms, streamlining customer data integration and boosting efficiency.
mParticle & HackerNoon are excited to host a Growth Marketing Writing Contest. Here’s your chance to win money from a whopping $12,000 prize pool!
Discover different archetypes of data engineers and how their collaboration drives data-driven success.
The benefits that come with using Docker containers are well known: they provide consistent and isolated environments so that applications can be deployed anywhere - locally, in dev / testing / prod environments, across all cloud providers, and on-premise - in a repeatable way.
The team built a 27B parameter model that didn't just analyze biological data—it made a novel, wet-lab-validated scientific discovery
Firms increasingly make use of artificial intelligence (AI) infrastructures to host and manage autonomous workloads.
Learn how engineers think about reliability, scalability, and maintainability—by asking the right questions early.
This is the first completed webinar of our “Great Expectations 101” series. The goal of this webinar is to show you what it takes to deploy and run Great Expectations successfully.
This is not really an article, but more a set of notes about how we use dbt in our team.
handoff is a serverless data pipeline orchestration framework that simplifies the process of deploying ETL/ELT tasks to AWS Fargate.
Keen to delve into data contracts and discover how they can enhance your data quality? Join me as we explore Lyft's Verity data contract approach together!
A great guide, on how to learn Apache Airflow from scratch in 2024. This article covers basic concepts of Airflow and useful for Data Scientist, Data Engineers
Discover how serverless AI/ML pipelines streamline data engineering by automating scalable data processing and deployment without infrastructure management.
Discover how Apache DolphinScheduler's Worker tasks function within its distributed, open-source workflow scheduling system.
Learn how custom transformation logic enhances data indexing with AI, vector search, TF-IDF, metadata enrichment, and optimized document chunking.
Here are six important steps for setting goals for data teams.
The R Systems BlogBook contest, powered by HackerNoon, is coming soon! Get ready to share your experiences and win exciting prizes—stay tuned for more details.
MinIO is the perfect companion for Airflow because of its industry-leading performance and scalability, which puts every data-intensive workload within reach.
Learning about best data visualisation tools may be the first step in utilising data analytics to your advantage and the benefit of your company
Bigger context windows help, but not enough. Learn how Recursive Language Models improve long-context reasoning with better scaling and stable performance.
Learn to set up a robust data lakehouse environment with Apache Iceberg, Dremio, and Nessie for scalable SQL operations.
Explore Apache Flink and Spark in real-world business scenarios. Choose the right tool for your big data needs
In this tutorial, we built a real-time data dashboard using Airbyte and Streamlit, in Python programming language.
We compare the differences between Kafka and Pulsar, demonstrating how a logical next step for scalability when using Kafka is switching to Pulsar.
See how a federated data governance model address challenges of centralized systems by enabling flexibility, regulatory compliance, and innovation for business
Flatten nested JSON and XML dynamically in Spark using a recursive PySpark function for analytics-ready data without hardcoding.
Best practices for building a data team at a hypergrowth startup, from hiring your first data engineer to IPO.
Multimodal AI workloads are breaking Spark and Ray. See how Daft’s streaming model runs 7× faster and more reliably across audio, video, and image pipelines.
Apache Doris 2.1 just got a major speed boost with Arrow Flight SQL for up to 10x faster data transfers.
PyTorch Geometric Temporal is a deep learning library for neural spatiotemporal signal processing.
Write efficient and flexible data-pipelines in Python that generalise to changing requirements.
Scaling AI/ML Data Needs: Migrating On-Premise Data Engineering Workloads to Azure Cloud
My attempt to noodle around.
From creating and querying Iceberg tables to managing branches and snapshots with Nessie’s Git-like controls, you’ve seen how this stack can simplify complex da
I've worked on teams building ML-powered product features, everything from personalization to propensity paywalls. Meetings to find and get access to data consumed my time, other days it was consumed building ETLs to get and clean that data. The worst situations were when I had to deal with existing microservice oriented architectures. I wouldn't advocate that we stop using microservices, but if you want to fit in a ML project in an already in-place strict microservice oriented architecture, you're doomed.
How to handle changing data when the source system doesn't help.
Discover the importance of API security in the age of Generative AI. Learn how robust API protection ensures data integrity.
As organizations increasingly deploy AI systems for decision-making, ensuring both data and AI pipeline security becomes critical to safeguard integrity, trust.
Learn Kafka Schema Evolution: Understand, Manage & Scale Data Streams with Confluent Schema Registry. Essential for Data Engineers & Architects.
Ontologies organize data, enhance interoperability, and drive insights across domains with structured frameworks.
Migrating from Convox to Nomad and some AWS performance issues we encountered along the way thanks to Datadog
Everyone's demo uses 50 documents and a clean knowledge base. We had 14,000 files and a decade of conflicting policies.
Build for the decision, not the data. If you can't name the specific decision a dashboard is supposed to support, you're building a museum exhibit
Discover how CocoIndex transforms data orchestration with a pure Data Flow Programming model — ensuring traceable, immutable, and declarative pipelines for know
I tested 5 LLMs on 10 real SQL queries and graded them against actual data. Here's the scoreboard and the failure mode that should worry you most.
Multi-part series that will take you from beginner to expert in Delta Lake
Learn cost-effective Apache Airflow optimization for intermittent tasks. Explore Google Cloud automation, reducing idle time, and minimizing costs
Learn the basics of data engineering with a practical ETL pipeline project. Explore how weather, flight, city data are extracted, transformed, loaded into a DB.
Apache Doris provides a new data type, Variant, for semi-structured data analysis, which enables 8x faster query performance than JSON at 1/3 of the storage.
Super performant Rust data stack to prepare realtime data for AI at massive scale - CocoIndex & Qdrant
CocoIndex now supports Kuzu as a native graph database target, enabling real-time LLM-powered knowledge graphs with plug-and-play configuration.
High Availability in the cloud: why us-east-1 alone is not a strategy (it's a gamble)
These guides are designed to provide you with practical experience in working with Apache Iceberg.
Uncover the five essential skills every successful machine learning engineer should have. Boost your ML engineering career with these invaluable insights.
A 5-minute introduction to Redpanda. An API-compatible, simple, high-performance, and cost-effective drop-in replacement for Apache Kafka.
Data teams come in all different shapes and sizes. How do you build data observability into your pipeline in a way that suits your team structure? Read on.
In a nutshell, data reliability is a BIG challenge and there is a need for a solution that is easy to use, understand, and deploy, and also not hea
Elusion is a new contender that takes a fundamentally different approach to data engineering and analysis.
Comparing Apache Flink & Apache Spark in stream data processing. Exploring architectural nuances, applications, and key distinctions between the platforms.
In this article, I will talk about how I improved overall data processing efficiency by optimizing the choice and usage of data warehouses.
Stop duplicate records. Learn to build idempotent data pipelines in Databricks and Snowflake using partitioning, hashing, and atomic transactions.
Left-Shift Data Platform: How to overcome early stage startup challenges to be Data-Driven
In this first post in our 2-part ML Ops series, we are going to look at ML Ops and highlight how and why data quality is key to ML Ops workflows.
How advances in cryptography and decentralization are reshaping conventional data architectures.
This article presents the collaboration of Alibaba, Alluxio, and Nanjing University in tackling the problem of Deep Learning model training in the cloud. Various performance bottlenecks are analyzed with detailed optimizations of each component in the architecture. This content was previously published on Alluxio's Engineering Blog, featuring Alibaba Cloud Container Service Team's case study (White Paper here). Our goal was to reduce the cost and complexity of data access for Deep Learning training in a hybrid environment, which resulted in over 40% reduction in training time and cost.
Your model works in Jupyter but fails at 3 AM. Why data quality and observability are the silent killers of 85% of AI projects.
How to detect, capture, and propagate changes in source databases to target systems in a real-time, event-driven manner with Change Data Capture (CDC).
This article will discuss compression in the Big Data context, covering the types and methods of compression
Dive into Apache Iceberg catalogs for organizing data lakes like a pro, tackling challenges, and picking the right fit!
Learn how one badly‑timed analytics query can crash your production database, cost millions on Black Friday, and why data engineering exists to prevent it.
We built one data platform. Six users described six completely different systems. Here's what that gap costs, and why documentation won't fix it.
To connect to a database and query data, you need to begin by installing pandas and SQLAlchemy.
This article is a comprehensive directory of Apache Iceberg resources, including educational materials, tutorials, and hands-on exercises.
Learn how data engineering supports autonomous driving perception through annotation workflows, dataset augmentation, synthetic data generation, and versioning.
HarperDB is more than just a database, and for certain users or projects, HarperDB is not serving as a database at all. How can this be possible?
Next-gen AI ad platforms use vector databases, indexing, and privacy-aware AI for real-time optimization, boosting ad spend efficiency while staying compliant.
Learn how Apache Spark and Databricks implement Slowly Changing Dimensions (Types 0–6) to preserve history, scale analytics, and ensure accurate data modeling.
This is a Plain English Papers summary of a research paper called On Data Engineering for Scaling LLM Terminal Capabilities. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.
Large language models excel at discussing programming concepts, explaining terminal commands, and reasoning about file systems. Yet when asked to actually accomplish a task in a terminal, they fail spectacularly. They suggest nonsensical commands, misinterpret output, and give up at the first error. This gap between linguistic capability and practical competence has persisted despite rapid advances in model scale and architecture.
The industry's response has been predictable: build bigger models. Deploy models with more parameters, more training tokens, more compute. Yet recent work shows that even substantial models like Qwen3-32B achieve only 3.4% on Terminal-Bench 2.0, a standard benchmark for terminal task completion. This suggests the bottleneck isn't model capacity. It's something more fundamental: the training data itself.
A new paper approaches terminal agent capabilities through a different lens. Rather than chasing model scale or architectural innovations, the authors conducted a systematic study of data engineering practices for terminal agents. The conclusion challenges conventional wisdom: a carefully constructed dataset combined with strategic filtering and curriculum learning can teach an 8B parameter model to match the performance of models four to ten times larger trained on standard data.
The conventional story about AI progress emphasizes algorithmic breakthroughs and computational scale. What actually happens in practice is less glamorous. For embodied tasks, where models need to execute sequences of actions rather than simply generate text, what you train on matters far more than how much compute you throw at the problem.
This paper introduces three key contributions that make this shift possible. First, Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports both seed-based and skill-based task construction. Second, a comprehensive analysis of filtering strategies, curriculum learning approaches, and scaling behavior. Third, Terminal-Corpus, a large-scale open-source dataset of terminal interactions that demonstrates these principles work in practice.
The results vindicate this approach. Nemotron-Terminal models, trained on Terminal-Corpus and initialized from Qwen base models, achieve substantial performance jumps: the 8B version improves from 2.5% to 13.0%, the 14B version from 4.0% to 20.2%, and the 32B version from 3.4% to 27.4%. These aren't incremental improvements. They represent fundamental shifts in efficiency.
Manually creating thousands of high-quality terminal interactions would be prohibitively expensive. A human expert writing terminal task trajectories might produce a few per day. Building a dataset with enough diversity to teach genuine capability would require months of expert time and substantial cost. So the paper takes a different approach: systematize the process of generating diverse, realistic terminal tasks.
Terminal-Task-Gen operates in two phases. The first phase, Dataset Adaptation, takes existing benchmarks and task descriptions from sources like Terminal-Bench, then reformulates them as interactive terminal interactions. This provides a foundation but is limited in coverage. Few benchmarks exist for terminal tasks, and even those that do capture only a fraction of possible terminal operations.
The second phase, Synthetic Task Generation, is where the real leverage appears. The pipeline defines a Skill Taxonomy, a structured breakdown of terminal operations and concepts. These skills range from basic navigation (moving between directories, listing files) to more complex operations (understanding command output, iterating based on errors, chaining operations together). By combining skills from this taxonomy in different ways, the system generates novel terminal tasks that teach these skills systematically.

Overview of Terminal-Task-Gen combining Dataset Adaptation and Synthetic Task Generation. The pipeline takes benchmark data and a skill taxonomy, producing diverse terminal interaction trajectories.
The output is Terminal-Corpus, a dataset containing thousands of terminal interaction sequences. Unlike static benchmarks, these trajectories capture the dynamic nature of terminal interaction: the user issues a command, observes output, interprets that output, and adjusts their approach accordingly. This mimics how humans actually use terminals, which is critical because models trained on static problem-solution pairs often fail to handle unexpected outputs or errors.
Not all synthetic data improves model performance. Some generated tasks might be trivially easy, offering no learning signal. Others might be internally inconsistent, teaching the model to hallucinate plausible-sounding but incorrect commands. Still others might be so convoluted that they confuse rather than clarify patterns.
The paper systematically studies filtering strategies to distinguish high-signal examples from low-signal ones. The analysis reveals which filtering criteria actually correlate with downstream performance on Terminal-Bench 2.0. This matters because naive scaling, where you simply generate enormous amounts of data and train on all of it, typically underperforms careful curation.
Some trajectories might be rejected because they contain errors in their reasoning or incorrect command sequences. Others might be excluded because they're too similar to existing examples, offering little diversity. The filtering process is not arbitrary; it's grounded in empirical analysis of what data actually improves model performance.
This represents a fundamental insight about data engineering: curation is as important as generation. A smaller dataset of high-quality examples outperforms a larger dataset with noise. The specific filtering strategies used here would be context-dependent, but the principle is universal.
Once you have filtered, high-quality data, the question of how to present it during training becomes crucial. Not all orderings are equally effective.
Curriculum learning applies a simple principle: harder material is easier to learn when preceded by foundational material. A model learning terminal tasks benefits from first encountering simple interactions, then gradually progressing to more complex ones. This scaffolding makes learning more efficient than random sampling.
For terminal tasks, natural curriculum structures emerge. Basic navigation (changing directories, listing files) can serve as a foundation. File operations (copying, moving, deleting) build on that foundation. Multi-step reasoning tasks that require chaining commands together come later. Understanding command output and error recovery grow more sophisticated across the curriculum.
The paper studies how these curriculum principles apply to terminal agent training. Strategic ordering of examples during training improves both convergence speed and final performance compared to random shuffling. This is particularly important because terminal tasks have inherent sequential dependencies. You can't reasonably ask a model to debug a complex pipeline if it hasn't yet learned basic piping syntax.
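As a toy illustration of the ordering idea (not the paper's actual pipeline), assuming trajectories carry turn and token counts and train_step is your own training hook:

def curriculum_order(trajectories):
    # Proxy difficulty: interaction turns first, then token length.
    return sorted(trajectories, key=lambda t: (t["num_turns"], t["num_tokens"]))

ordered = curriculum_order(corpus)    # corpus: list of trajectory dicts
for i in range(0, len(ordered), 32):  # easy-to-hard mini-batches
    train_step(ordered[i : i + 32])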
Data engineers face a practical reality: training compute is limited. Generating more data costs compute to train on. At some point, marginal improvements from additional data diminish, and that compute would be better spent elsewhere.
The paper includes scaling experiments that reveal how performance improves as training data volume increases. These curves answer a crucial question: have we hit a plateau, or would additional data continue helping?

Impact of training data scale on model performance. Terminal-Bench 2.0 performance increases consistently with training data volume for both Qwen3-8B and Qwen3-14B.
The results show clear improvement patterns for both model sizes. Performance grows consistently with more data, though the growth rate eventually slows. The curves suggest that the models tested haven't yet hit a hard ceiling, but marginal returns are diminishing.
Understanding the composition of these trajectories helps explain the scaling behavior. The token distribution shows what length trajectories look like, while the turn distribution reveals how many interaction steps typical tasks involve.

Distribution of tokens in generated trajectories. This shows the length characteristics of synthetic terminal tasks.
These statistics matter because they determine training requirements. If typical trajectories require thousands of tokens, then a dataset of several million trajectories becomes gigabytes of data. Understanding these distributions helps practitioners plan data generation, training infrastructure, and budget allocation.
All of this methodology yields concrete results. An 8B model trained on Terminal-Corpus reaches 13.0% accuracy on Terminal-Bench 2.0, jumping from a baseline of 2.5%. The 14B model reaches 20.2% (from 4.0%), and the 32B model reaches 27.4% (from 3.4%). Scaling the baseline models without better data produces marginal improvements. Scaling the data engineering produces orders of magnitude improvement.
Most strikingly, the 8B model trained on Terminal-Corpus now matches or exceeds the performance of much larger models trained on standard data. This comparison shifts the entire conversation around terminal agents. You don't need a 70B parameter model to build a capable agent. You need thoughtful data engineering.
This work reveals something important about AI capabilities that the industry often overlooks. Sometimes the bottleneck isn't compute, it isn't model architecture, and it isn't algorithmic innovation. It's training data engineering.
For tasks where models need to execute, perceive feedback, and adapt, the quality and structure of training data becomes paramount. A model trained on synthetic trajectories that systematically cover the skill space, filtered for signal, and presented in a curriculum that respects task dependencies outperforms larger models trained haphazardly.
This has practical implications. Unlike model architecture research or compute scaling, data engineering is accessible. It doesn't require the largest clusters or the most specialized hardware. It requires systematic thinking about what signals teach capability, how to generate diverse examples, what examples to exclude, and how to present examples during training.
The open-sourcing of Nemotron-Terminal models and Terminal-Corpus accelerates this direction. Future work can build on this foundation, improving the pipeline further. The bottleneck moves from "how do we build capable terminal agents" to "how do we engineer training data even more effectively."
The broader lesson applies beyond terminal agents. Any task where models must execute actions, perceive outcomes, and adjust strategy benefits from this kind of data engineering thinking. As AI systems move from pure language understanding toward embodied AI, systematic approaches to training data quality become not an optimization, but a fundamental requirement.
Original post: Read on AIModels.fyi
I built one pipeline four times. The winner wasn’t the fastest tool; it was the one that failed loudly, stayed debuggable, and didn’t punish ops.
The practical use of Data Vault models, as illustrated through querying customer orders and analyzing product sales, demonstrates the methodology's flexibility.
In this blog, we’ll delve into the crucial role that data plays in machine learning and why it’s often said that in the world of AI, “data is king.”
Perfect dashboards don’t mean perfect systems. Explore how observability debt hides behind metrics, distorts truth, and weakens engineering judgment in 2025.
Learn the concepts of data profiling and how it can speed up the debugging the quality related incidents across the data stack.
Solve schema drift in analytical AI agents using sqldrift. Real-world validation on 255 BIRD queries achieves 94.1% success with automated LLM correction.
How to become a better data leader that the data engineers love?
How operational engineering—not infrastructure—determines whether cloud modernization delivers reliability in regulated financial data platforms.
Data lineage refers to the process of tracking data from its origin to its destination, including all transformations and movements in between. It is crucial fo
Predictive Modeling in Data Science is more like the answer to the question “What is going to happen in the future, based on known past behaviors?”
Enterprise data solutions—handling myriad data sources and massive data volume—are expensive. Stream processing reduces costs and brings real-time scalability.
In this article, we’ll investigate use cases for which data engineers may need to interact with NoSQL database, as well as the pros and cons.
Stop the "Small File Syndrome" in your Data Lake. Learn how to implement Compaction, Z-Ordering, and automated maintenance in Databricks and Snowflake.
Stop slow queries and high cloud costs. Learn advanced SQL tuning for Snowflake and Databricks, including Pruning, Join Salting, and Search Optimization.
PowerBI is shifting from "PBIX" to "PBIR". This article explains what actually changes, who benefits and how teams should prepare for the future without panic.
Features of specialized data types, beyond the integers and strings we use in everyday life, allow us to store and operate on complex data structures.
Learn about data transformation and discretization in data preprocessing. Explore normalization techniques, binning, and histograms.
This isn’t about saving bits—it’s about shaping history into a governed, trustworthy, searchable corpus for humans and AI.
A step-by-step walkthrough of building a real-time data pipeline to merge and synchronize MySQL data sources using Apache SeaTunnel.
Ask anyone in the data industry what’s hot and chances are “data mesh” will rise to the top of the list. But what is a data mesh and is it right for you?
Dashboards don’t represent actual state, models degrade unnoticed, and incidents show up as “weird numbers” instead of errors.
Apache Beam is a declarative programming model for large-scale data processing, not a service or framework like a REST API.
Data lakes are an essential component in building any future-proof data platform. In this article, we round up 7 reasons why you need a data lake.
SUPCON dumped siloed data tools for Apache SeaTunnel, and now core sync tasks run with zero failures!
AI is about to expose weak BI architecture. "DirectQuery" collapses under machine curiosity. Decision-aligned design is the only way forward.
An Apache NiFi cluster can process up to 50 GB of data per day, providing a balance between performance and cost-effectiveness.
Hidden cloud BI cost: data egress between platforms. Learn how “zombie data movement” quietly inflates analytics bills in modern BI architectures.
Supercharge AI-native data pipelines with a multi-X performance boost by batching.
How application and product engineering teams can implement data encryption to effectively address data vulnerability issues.
The Medallion Architecture is a framework that turns messy e-commerce data into business-ready insights.
Dive into the detailed features and architecture of Apache DolphinScheduler 3.1.9!
Traditional data lineage shows dependencies—not proof. Learn how Minimum Incident Lineage helps teams reproduce, audit, and resolve data incidents faster.
Learn everything you need to know about Data Engineering via these 96 free HackerNoon stories.
Noom helps you lose weight. We help you get a job at Noom. In today’s article, we’ll show you one of Noom’s hard SQL interview questions.
The generative AI hype continues, but are we aware of the potential risks we face daily as users? We should shift now from hype to more trust in AI.
This post is a deep dive into the inverted index and NGram BloomFilter index, providing a hands-on guide to applying them for various queries.
A small modern data stack that ETLs data from a PostgreSQL database into a ClickHouse database.
The article talks about how data analytics is evolving at workplaces, from traditional querying, Excel, and dashboards to natural language conversations.
Learn the impact of airflow on the data quality checks and why you should look for an alternative solution tool
Dashboards show what happened. SQL embeddings remember how you figured it out—and let AI start there next time instead of guessing from scratch.
This tutorial shows how Alibaba Cloud Container team runs PyTorch on HDFS using Alluxio under Kubernetes environment. The original Chinese article was published on Alibaba Cloud's engineering blog, then translated and published on Alluxio's Engineering Blog
Data encryption can enhance your security strategy, simplify system architecture, and provide lasting protection against breaches.
Data contracts define ownership, quality, SLAs, and context—preventing silent failures in pipelines, analytics, and AI systems.
Learn how to efficiently manage user access in Oracle databases for seamless data sharing and collaboration among departments.
This post will get into details about how a retail bank builds their fraud risk management platform based on Apache Doris and how it performs.

Wondering when to switch from Python to Spark? This practical guide breaks down the real differences, warning signs, and best use cases—so you know exactly when
Apache Arrow eliminates PySpark serialization bottlenecks. Learn how columnar, zero copy memory boosts Pandas, Spark, and UDF performance at scale.
In this guide, we will explore data analytics using PyArrow, a powerful library designed for efficient in-memory data processing with columnar storage.
Delight is an open-source an cross-platform monitoring dashboard for Apache Spark with memory & CPU metrics complementing the Spark UI and Spark History Server.
A brief description of the difference between Data Science and Data Engineering.
CocoIndex's layered concurrency control help you optimize data processing performance, prevent system overload, and ensure stable, efficient pipelines at scale
GitHub Actions is widely recognized as a powerful tool for automating tasks in software development.
Apache Spark 4.1 introduces significant architectural efficiencies designed to simplify Change Data Capture (CDC) and lifecycle management.
This article aims to provide a reference for non-tech companies who are seeking to empower their business with data analytics.
Agentic AI is transforming data engineering, requiring real-time pipelines, vector systems, and reliable data infrastructure.
A valuable asset for anyone looking to break into the Data Engineering field is understanding the different types of data and the Data Pipeline.
This article describes a large-scale data warehousing use case to provide a reference for data engineers who are looking for log analytic solutions.
Governance is the Gordian Knot to all Your Business Problems.
Amazon AI/ML Stack
Production ML fails less from bad models and more from weak data platforms. Here’s how ingestion, storage, and observability determine reliability.

Apache Doris 2.1.0's built-in Job Scheduler simplifies task automation with high efficiency, flexibility, and easy integration for seamless data management.
The worst nightmare of analytics managers is accidentally blowing up the data warehouse cost. How can we avoid receiving unexpectedly expensive bills?
Discover the top three areas data engineers can learn to leverage generative AI in 2025.
In "Towards Open Options Chains", Chris Chow presents his solution for collecting options data: a data pipeline with Airflow, PostgreSQL, and Docker.
In "Towards Open Options Chains", Chris Chow presents his solution for collecting options data: a data pipeline with Airflow, PostgreSQL, and Docker.
Here's what every AI practitioner must internalize.
R Systems Blogbook Round 2 is open! Submit your article on microservices observability or zero trust security between April 29–May 30, 2025.
Apache SeaTunnel now supports Metalake integration!
How Tabby built a scalable DWH on GCP: BigQuery core, Debezium→Pub/Sub near-real-time sync, layered data architecture and practical lessons for analytics.
Learn what ELT is, how it differs from ETL, and why modern data platforms use ELT for scalable, real-time data processing and analytics.
Discover the powerful synergy of Apache Iceberg and Dremio, revolutionizing data management and analytics.
95% of AI startups fail because their data breaks first. Here’s how real winners build solid data infrastructure using Bright Data to stay alive.
Build reliable Spring Boot APIs with centralized exception handling using @ControllerAdvice. Learn how to create clean, consistent, and scalable error responses
Learn how to tackle challenges, implement solutions, and streamline your ETL workflow for enhanced scalability and maintainability.
Article explaining the importance of speedy data analytics and implementation of robust data infrastructure to achieve the same with live streaming data.
A 25-day production test comparing single-model anomaly detection vs a 3-model ensemble, reducing false positives by 35% on 332K orders.
Cloud costs aren’t fixed by infrastructure tweaks. Learn how JIT compilation and code optimization cut costs and boost performance.
An overview of challenges with working on web3 data projects vs web2 based on personal experience.
Big Data Analytics has evolved into the modern organization’s most powerful compass.
AI systems fail quietly when data arrives unverified. Learn how strong validation, lineage checks, and drift monitoring prevent hidden anomalies.
This blog post is a refresh of a talk that James and I gave at Strata back in 2017. Why recap a 3-year-old conference talk? Well, the core ideas have aged well, we’ve never actually put them into writing before, and we’ve learned some new things in the meantime. Enjoy!
Unlock ML speed with expert tips on data pipeline development, cloud integration, and infrastructure planning from Google’s senior customer engineer, Abhijeet R
Learn how data contracts prevent schema drift and silent pipeline failures using Kafka, Schema Registry, and Great Expectations in modern data architectures.
Enterprise GenAI strategy will fail without data modernization. Legacy data warehouses can't support AI. Learn why you must migrate both data and business logic
Data observability monitors nulls, drift, and freshness, catching pipeline issues before they corrupt dashboards, models, or business decisions.
While ETL pipelines are often the first preference, ELT pipelines could very well be more advantageous to your particular use case.
Modern BI workloads demand more than star schemas. Learn when dimensional models work and when purpose-driven analytical tables improve performance.
Explore effective methods for calculating binomial proportion metrics like conversion rates and click-through rates.
Learn why EMR fails in multi-job environments. Discover why concurrent pipelines exhaust shared subnets and how to build a DynamoDB ledger to fix it.
Discover how to boost Apache Spark's query efficiency using data sketches for fast counts and intersections in large datasets. Essential for data pros!
Explore Kafka Streams: a Java library for building scalable, fault-tolerant stream processing apps. Learn how to simplify real-time data processing.
Overview of the modern data stack after interview 200+ data leaders. Decision Matrix for Benchmark (DW, ETL, Governance, Visualisation, Documentation, etc)
How behavioral data, long-tail economics, and A/B testing transformed guesswork into the engine behind modern digital businesses.
An excellent data architecture doesn’t just function; it empowers, elevating an organization’s innovation ability.
In databases, data update is to add, delete, or modify data. Timely data update is an important part of high quality data services.
Discover how to bridge the knowledge gap between data scientists and MLOps engineers with these three essential concepts.
A data leader reveals the hidden cost of success: Sunday panic attacks, the "savior complex," and the struggle to find rhythm in a chaotic role.
"Governance is a process problem wearing a tool costume." I tested 5 data catalogs against real data incidents. Here is what actually broke.
Tiered Locality is a feature led by my colleague Andrew Audibert at Alluxio. This article dives into the details of how tiered locality helps provide optimized performance and lower costs. The original article was published on Alluxio’s engineering blog
Explore the rise of multimodal AI, a new frontier in artificial intelligence that integrates text, images, audio, and video for a more holistic approach.
A deep dive into building a production-ready LLM cost and risk optimization system with token analytics, prompt risk detection, and real-time monitoring.
NetEase has replaced Elasticsearch and InfluxDB with Apache Doris in its monitoring and time series data analysis platforms, respectively
After speaking to hundreds of teams, I discovered ~80% of data issues aren’t covered by testing alone. Here are 4 layers to building a data reliability stack.
A real 99M-row benchmark reveals why Import Mode still outperforms Direct Lake in Microsoft Fabric and what the engine truth means for your BI architecture.
The 5 things every data analyst should know and why it is not Python, nor SQL
CocoIndex now officially supports custom targets — giving you the power to export data to any destination, whether it's a local file, cloud storage, a REST API.
Discover Apache Iceberg with a free guide, crash course, and video playlist. Learn efficient data management and processing for big data environments.
Dremio Auto-Ingest is a game-changing feature that simplifies the process of loading data into Apache Iceberg tables.
In "Towards Open Options Chains", Chris Chow presents his solution for collecting options data: a data pipeline with Airflow, PostgreSQL, and Docker.
Incremental design results in a working system at the end of implementation. On the other hand, iterative design produces a functioning system
Goldman Will Dominate Consumer Banking
How I reverse-engineered the APIs of India's quick-commerce giants (Blinkit, Zepto, Swiggy) to map 4,000+ hidden dark stores.
Discover how Apache Iceberg revolutionizes data lakehouse architecture with efficient table management and powerful features like schema evolution.
This week, HackerNoon features DataOps.live, the automation platform powering Snowflake, Roche, and enterprises building AI-ready data at scale.
See mParticle data events and attributes displayed in an eCommerce UI, and experiment with implementing an mParticle data plan yourself.
Stop slow ingestion and high costs. Learn advanced patterns for high-throughput data ingestion using Spark, Delta Lake, and Zero-Trust security.
An architectural analysis of identity discontinuity in multi-bank FX systems and why reconciliation failures are structural rather than operational.
Both events and entities have unique roles in data modeling, and understanding when to use each is crucial for building effective data platforms.
2021 Noonies Nominee General Interview with Veronika. Read for more on cloud services, data engineering, and python.
Today, I am going to cover why I consider data science as a team sport?
Learn 3 simple, effective methods to detect and handle outliers in your data. Improve analysis accuracy and make smarter decisions with clean datasets.
Most ML failures aren’t outages; they’re silent drifts. Trusting green dashboards hides data distortion. Smart pipelines stay skeptical.
A data engineer breaks down why lakehouse architecture isn’t the revolution it’s marketed as—and why data modeling, quality, and ownership matter far more.
Move beyond static passwords.As we move toward more decentralized systems, cryptographically proven identity becomes the only reliable anchor for trust
As the third largest e-commerce site in China, Vipshop processes large amounts of data collected daily to generate targeted advertisements for its consumers. In this article, guest author Gang Deng from Vipshop describes how to meet SLAs by improving struggling Spark jobs on HDFS by up to 30x, and optimize hot data access with Alluxio to create a reliable and stable computation pipeline for e-commerce targeted advertising.
CocoIndex continuously watches source changes and keeps derived data in sync, with low latency and minimal performance overhead.
Discover the power of data visualization with Plotly in Python. Learn to transform raw data into interactive, insightful visuals and create dynamic dashboard
How to supercharge data analytics workflows and build trust with metric layers, self service and AI-assisted analytics.
This hands-on guide walks you through a real production upgrade with clear steps, SQL scripts & troubleshooting tips.
Why fans distrust live sports score apps—and the UX, performance, and design signals that make real-time score platforms feel reliable.
How to handle updates in indexing pipelines without breaking consistency or reprocessing everything. Practical strategies from real-world systems.
Pattern matching allows for more intuitive and readable conditional logic by enabling the matching of complex data structures with minimal code.
Explore how AI-driven DevOps will reshape data engineering in 2026, from automation to smarter pipelines and faster insights.
An Introduction to the art and science of dimensional modeling with relational databases
This article covers 7 data engineering gotchas in an ML project. The list is sorted in descending order based on the number of times I've encountered each one.
Extracting data from existing databases is the Data Engineering team's complex task. Here are insights and tips to navigate these challenges and save time.
This case study describes how we built a custom library that combines data housed in disparate sources to acquire the insights we needed.
While data tools today are more powerful than ever, most organizations still find data platforms complex and costly to maintain.
Notebooks used to be a personal workspace: run a query, poke at a dataset, export a CSV, and move on. Now they’re becoming the default data UX for teams.
See how a hybrid architecture marries the best of the SaaS world and on-prem world for modern data stack software.
A hands-on guide to architecting unified, governed and AI-ready data platforms using open table formats, semantic layers and multicloud governance.
The art of building a large catalog of connectors is thinking in onion layers.
In this article, we will look into the specifics of Gen AI’s role in data engineering and see where it flourishes and where it requires enhancement
Convert Spark dataframe output/Hive/Impala console output to CSV with PySpark. Simple script to clean tables, save data, and streamline workflows. Try it now!
Apache Iceberg simplifies data management, but lacks built-in governance. Catalog-level access controls via Nessie or Polaris offer secure, centralized table ma
Join the discussion about various techniques for ensuring data privacy in data engineering.
If you are using Databricks serving endpoint, and you wish to export metrics to Datadog, you can face with some challenges in Datadog documentation.
Define the metric once, in one place, and every tool (and every AI agent) that queries it gets the same answer.
From simplifying data collection to enabling data-driven feature development, Customer Data Platforms (CDPs) have far-reaching value for engineers.
If you’re an engineer curious about transitioning towards the business side, don’t underestimate how transferable your toolkit is.
Maximize speed and minimize cloud costs. Learn advanced SQL tuning for Snowflake and Databricks using Pruning, Broadcast Joins, and Z-Ordering.
Learn how to write SQL that the query optimizer understands—reduce costs, avoid slow queries, and improve performance in Snowflake and Databricks.
An introduction to auto-increment columns in Apache Doris, usage, applicable scenarios, and implementation details.
Find out how to set up and work locally with the most granular demographics dataset that is out there.
AI in manufacturing fails without strong data pipelines. Learn why real-time, clean, connected data matters more than models for real results.
Migrating Presto workloads from a fully on-premise environment to cloud infrastructure has numerous benefits, including alleviating resource contention and reducing costs by paying for computation resources on an on-demand basis. In the case of Presto running on data stored in HDFS, the separation of compute in the cloud and storage on-premises is apparent since Presto’s architecture enables the storage and compute components to operate independently. The critical issue in this hybrid environment of Presto in the cloud retrieving HDFS data from an on-premise environment is the network latency between the two clusters.
Synthetic data is transforming AI by solving privacy, bias, and scalability challenges. Learn methods, use cases, and key risks.
Sometimes, we might not be able to afford a paid subscription on Slack. Here's a tutorial on how you can save and search through your Slack history for free.
This blog covers real-world use cases of businesses embracing machine learning and data engineering revolution to optimize their marketing efforts.
What's Deep Data Observability and how it's different from Shallow.
Put your organization on the path to consistent data quality with by adopting these six habits of highly effective data.
12/10/2024: Top 5 stories on the HackerNoon homepage!
CocoIndex can build and maintain a knowledge graph from a set of documents, using LLMs (like GPT-4o) to extract structured relationships between concepts.
Automate safe database copies for devs. MaskDump anonymizes emails & phones in huge SQL dumps via pipelines. Compare tools, see configs.
We have been working on CocoIndex - a real-time data framework for AI for a while, with lots of excitement from the community. We officially crossed 1k stars!
DPaaS solves the enterprise data scalability paradox with declarative policies, multi-plane architecture, and continuous reconciliation.
Self-serve systems are a big priority for data leaders, but what exactly does it mean? And is it more trouble than it's worth?
Despite AI/ML research focusing on unstructured data, tabular data remains the primary area of time and financial investment in the Data Integration world.
Stop duplicate records and broken data. Learn how a Digital Architect uses Atomicity and Idempotency to ensure financial integrity in the Lakehouse.
This article introduces Structured Data Management (Developer Preview) available in the latest Alluxio 2.1.0 release, a new effort to provide further benefits to SQL and structured data workloads using Alluxio. The original concept was discussed on Alluxio’s engineering blog. This article is part one of the two articles on the Structured Data Management feature my team worked on.
7/17/2025: Top 5 stories on the HackerNoon homepage!
1/29/2025: Top 5 stories on the HackerNoon homepage!
Congratulations, you’ve successfully implemented data testing in your pipeline!
4/3/2026: Top 5 stories on the HackerNoon homepage!
A company's self-healing pipeline failed to detect and fix a data quality issue.
Learn three practical methods to integrate Databend with SeaTunnel for scalable, real-time ETL.
Most organizations struggle with data scattered across multiple systems, inconsistent definitions and no clear ownership.
CocoIndex + ColPali enable fine-grained, patch-level visual search that sees layout, text, and objects—just like you do.
5/21/2025: Top 5 stories on the HackerNoon homepage!
4/29/2025: Top 5 stories on the HackerNoon homepage!
AI dashboards can turn unstable metric definitions into trusted operating decisions before teams agree on what the numbers actually mean.
CocoIndex now provides robust and flexible support for typed vector data — from simple numeric arrays to deeply nest multi-dimensional vectors.
Proper use of output variables can significantly improve workflow flexibility and maintainability.
From "class isolation" to "governable ClassLoaders with verifiable reclamation"; a phased proposal for fixing SeaTunnel's runtime resource boundaries.
In this blog, guest writer Derek Tan, Executive Director of Infra & Simulation at WeRide, describes how engineers leverage Alluxio as a hybrid cloud data gateway for applications on-premises to access public cloud storage like AWS S3.
Implementing tracking code based on an outdated version of your organization's data plan can result in time-consuming debugging, dirty data pipelines, an
I tried processing 430 million AML transactions on my laptop, which kept crashing, but account-level sampling solved it and changed my data engineering approach
In the previous article, I described the concept and design of the Structured Data Service in the Alluxio 2.1.0 release. This article will go through an example to demonstrate how it helps SQL and structured data workloads.
Fix your chunks, freshen your index, rerank before you generate, and actually instrument retrieval separately from generation.
Visit the /Learn Repo to find the most read blog posts about any technology.
2026-05-01 03:59:29
According to JPMorganChase, among their account holders who have crypto exchange-traded funds (ETFs), the median allocation to crypto ETFs constitutes approximately 4% of their total portfolio value. As cryptocurrencies are becoming more accessible and markets are also improving, there's been increased interest in crypto investing, especially after Bitcoin prices reached all-time highs in March and November 2024.
Those who are new to cryptocurrencies may be wondering what the best ways to buy them are. In this article, we'll present you with two options: spot markets and perpetual futures. We'll also discuss each one's pros and cons, and which you should pick for your personal circumstances and preferences.
The most straightforward way to buy and sell cryptocurrencies is to use spot markets. When you do a spot trade, you purchase a cryptocurrency and get immediate ownership of it at the current market price.
After the transaction is completed, you can hold, transfer, or sell that asset at any time. Also, you won't have to worry about:
For example, you'd buy Bitcoin on a spot market, and then store it in a wallet. You can then use it for transactions or hold it as part of your long-term investment strategy. You're not betting on price movements at all.
Because the transactions are so simple, spot trading is very appealing to beginners, as well as long-term investors. There's no liquidation risk either, so spot markets are less risky compared to derivatives (financial contracts) like perpetual futures; even if prices drop, you still retain ownership of your crypto. However, your downside (the potential for an investment to lose value) is directly tied to the crypto's price movement, so you may still experience losses if the market declines.
In general, spot trading is optimal for investors who believe in the long-term value of a cryptocurrency. It's also great for those who want a simpler and more transparent trading experience.
For readers interested in diving deeper into spot trading, here are some useful resources:
https://www.axi.com/int/blog/education/cryptocurrencies/spot-trading-cryptocurrency
Perpetual futures are a type of derivative contract that allows traders to speculate on the price of a cryptocurrency, but they don't have to actually own it. Traditional futures contracts usually have an expiration date, but as the name implies, perpetual futures don't; you can hold positions indefinitely.
By using a mechanism called the funding rate, perpetual futures contracts can track the price of an underlying asset (such as Bitcoin). That way, the contract price stays close to the spot price. Traders can go "long" (betting the price will increase) or "short" (betting the price will decrease). This makes perpetual futures highly flexible in both bullish and bearish markets.
There's lots of leverage with perpetual futures. Traders can borrow funds to increase their position size; on some platforms, they can borrow up to 100 times or more. This can amplify profits, but it also significantly increases risk.
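To make that concrete, here is a rough, illustrative calculation; the 10x figure, position size, and the linear profit-and-loss math are hypothetical simplifications, and real exchanges add fees, funding, and margin tiers on top:
const collateral = 1000;                 // USD margin posted (hypothetical)
const leverage = 10;                     // 10x position (hypothetical)
const notional = collateral * leverage;  // $10,000 of exposure
// A 5% move against the position loses 5% of the notional,
// which is $500, or half the collateral.
const adverseMove = 0.05;
const loss = notional * adverseMove;
// By the same logic, a move of roughly 100 / leverage percent
// (10% here) wipes out the entire margin, which is approximately
// where liquidation would occur on this simplified position.
console.log({ notional, loss, marginWipedAtPct: 100 / leverage });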
Since there are potential significant risks to be managed, perpetual futures are usually used by experienced traders, as they're more comfortable managing risk and monitoring positions actively. These contracts are also popular for:
There are two main differences between spot and perpetual futures: ownership and risk structure. You own the cryptocurrency outright with spot trading, while you're trading a contract that represents the asset's price (without holding the asset itself) with perpetual futures.
Another big difference is leverage; spot trading doesn't usually involve borrowing funds, so your exposure is limited only to the amount you invest. But with perpetual futures, traders can use leverage, and this can multiply both gains and losses.
Liquidity and trading strategies used also differ. Since perpetual futures markets often have higher liquidity and tighter spreads, they're attractive to active traders. They enable strategies like short selling, too (this isn't possible in traditional spot markets unless you use additional tools).
Lastly, perpetual futures have funding fees. These periodic payments are used between traders to keep the contract price aligned with the spot price.
| | Spot trading | Perpetual futures |
|----|----|----|
| Ownership | Own the crypto outright | Trading a contract that represents the crypto's price |
| Leverage | No | Yes |
| Liquidity/trading strategies | Lower liquidity and wider spreads | Higher liquidity and tighter spreads |
| Funding fees | No | Yes |
If your goal is to accumulate and hold cryptocurrency over time, then spot trading is the better option. Those who believe in the long-term growth of an asset will benefit directly from price appreciation by buying it on the spot market. This eliminates worries about liquidation or margin calls.
We'd also recommend that beginners start with spot trading. The lack of leverage and straightforward mechanics make it much easier to understand and manage. It's simple: buy low and sell high (or hold), and you won't have to monitor funding rates or manage collateral.
Spot trading is also particularly useful during uncertain or highly volatile market conditions. Because there's no risk of forced liquidation, you can ride out short-term price swings without losing your entire position. Those who prefer a more passive investment approach will find this a much safer option.
In addition, spot trading should be your pick if you plan to use your crypto for things like staking, payments, or transferring between wallets. This is because it's the only option that gives you actual ownership.
Perpetual futures are best suited for short-term traders who want to capitalize on price movements (either up or down). Are you actively monitoring the market, looking for opportunities to profit from volatility? Then use perpetual futures.
One of the biggest advantages here is the ability to short the market. This means that you can make a profit, even when prices are falling, which is especially useful during bear markets or corrections. Also, the leverage allows you to increase your exposure without using a large amount of capital.
If you want to do hedging, then perpetual futures are valuable, too. For example, if you hold a large amount of Bitcoin in spot, you can open a short position in perpetual futures. This offsets potential losses during a downturn.
You get a double-edged sword here, though; with these benefits come significant risks. High leverage can lead to rapid losses, and positions can be liquidated if the market moves against you.
Spot trading carries less structural risk than perpetual futures do. However, it still requires discipline, especially during market downturns. The key to long-term success with spot trading is setting realistic expectations and avoiding emotional decisions.
With perpetual futures, strict risk management is a must due to leverage and liquidation risk. To protect your capital, you should use tools like:
Another must-do is understanding funding rates. These periodic payments can either add to your profits or increase your costs, depending on market conditions; ignoring them can lead to unexpected losses over time.
Smart traders also diversify rather than putting all their capital into a single trade or asset. Spreading capital across different positions reduces the overall risk they take on.
Of course, different market conditions will call for different trading approaches. For instance, in a strong bull market, spot trading can be highly effective since holding assets may generate huge returns.
On the other hand, in sideways or choppy markets, perpetual futures trading can be the better choice since you have the ability to go both long and short. This lets you profit from smaller price movements, and you won't have to rely on a clear upward trend.
During bear markets, perpetual futures can give you chances to short the market. Spot traders might struggle, unless they're willing to hold through extended downturns. But this requires skill and discipline, so it's not recommended for beginners; volatility can lead to rapid reversals.
What's important is that you have a flexible strategy. For instance, you might combine the two types of trading.
This strategy can give you a balanced approach to cryptocurrency trading. Not only can you benefit from long-term growth, but you can also take advantage of short-term market movements.
A real-world example would be holding Bitcoin in a spot wallet as a long-term investment. In addition, you can use perpetual futures to trade price fluctuations. Together, this dual approach can help you maximize returns and reduce risk if used correctly.
Another strategy traders use is hedging. For example, if you're concerned about a potential price drop, then you can open a short position in perpetual futures. That way, you can offset losses in your spot holdings, which is especially useful during periods of uncertainty.
As expected, though, combining these strategies requires careful planning and a solid understanding of both markets. If you mismanage leverage or overtrade, then this can negate the benefits and increase risk.
However, if you manage to use this hybrid approach effectively, then it offers flexibility. You'll then be able to adapt to changing market conditions while maintaining a core investment position.
Q.1 What Is Liquidation, and Why Does It Matter in Perpetual Futures?
A- Liquidation happens when your position is automatically closed by the exchange because your losses have reached a level where your collateral can no longer support the trade. This is one of the biggest risks in perpetual futures trading.
If you're using high leverage, then even small price movements can trigger liquidation. So it's important to understand your liquidation price and maintain a sufficient margin. These things can help you avoid unexpected losses.
Q.2 Are Perpetual Futures Suitable for Beginners?
A- In general, perpetual futures aren't suitable for beginners, not unless they have proper education and practice. Yes, these contracts offer powerful tools (e.g., leverage and short selling), but there are also significant risks.
Beginners are better off starting with spot trading. This helps them build a foundational understanding of the market before moving into derivatives.
Q.3 How Do Funding Rates Impact Profitability?
A- Funding rates are periodic payments exchanged between long and short traders. If you're the one paying, it can eat into your profits over time, especially if you hold positions for long periods. On the other hand, if you're receiving the payments, it can add to your returns.
It's important to monitor these rates to help manage costs effectively.
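As a rough, illustrative example of how funding costs accumulate (the rate, interval, and position size below are hypothetical, and real rates float with market conditions):
const notional = 10000;       // $10,000 long position (hypothetical)
const fundingRate = 0.0001;   // 0.01% per funding interval (hypothetical)
const intervalsPerDay = 3;    // one payment every 8 hours
const days = 30;
// Paying this rate for a month costs about $90, roughly 0.9% of
// the position, before any price movement is even considered.
const totalFunding = notional * fundingRate * intervalsPerDay * days;
console.log(totalFunding); // 90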
Q.4 Can You Lose More Than Your Initial Investment in Perpetual Futures?
A. The good news is that on most modern crypto exchanges, there are liquidation mechanisms that are designed to prevent losses beyond your initial margin. However, you can still experience significant losses if there's extreme volatility or poor risk management.
Do note that some platforms also offer cross-margin. This is where multiple positions share collateral, which increases the overall risk.
Q.5 Is It Possible To Earn Passive Income With Either Method?
A. Yes, although spot trading offers more opportunities for passive income through:
Even though perpetual futures aren't typically designed for passive income, some traders try to earn funding fees under specific conditions.
Q.6 Which Is Better for Long-Term Wealth Building?
A. For most people, spot trading is better suited for long-term wealth building. You can accumulate and hold assets without the stress of active management or liquidation risk.
As stated before, perpetual futures are more appropriate for short-term trading. They also require ongoing attention and expertise, which is harder to sustain over the long term.
Q.7 How Do You Choose the Right Crypto Investment Method?
A. If you're new to crypto trading, then we'd suggest starting off with spot trading. But if you're experienced, have good risk management, and want short-term gains, then you might be better off with perpetual futures.
Once you've gotten familiar with the market, adopting a hybrid strategy can be a good idea. If done right, it can result in both short and long-term gains.
2026-05-01 03:26:52
If you spend enough time reading about prompt engineering on developer forums, you’ll inevitably run into "The Token Hack." It goes something like this: Stop feeding your Large Language Models JSON or Markdown. Switch everything to YAML. It’s denser, it drops the heavy syntax brackets, and it will instantly slash your API bills by 20%.
The theory has serious backing. It gained momentum following deep-dives into GPT tokenization, with developers pointing out how much of the LLM context window is wasted on JSON's endless curly braces and quotation marks. In the broader AI community, studies continue to emphasize how critical prompt engineering and the choice between structured formats like YAML and JSON are for overall model performance.
As a Principal Software Engineer working daily with multi-agent code generation pipelines, I wanted these savings. Our internal pipeline reads human-authored specification documents (SPECs) and orchestrates Claude agents to write our React and TypeScript code. The cost and latency of these invocations add up quickly.
The prevailing wisdom told me that converting our Markdown specs to YAML would optimize our cache-create tokens and speed up the agents.
I decided to test the theory. I took three production specifications of varying sizes, built an isolated evaluation harness using claude-sonnet-4-6, and spent exactly $23.18 across 90 automated trials to see how much money YAML would save us.
The result? I was dead wrong. And the reason why reveals a lot about how we actually need to optimize multi-agent prompts.
The most surprising finding was that for two out of our three specifications, YAML was actually larger than Markdown. My medium-sized spec (15.4 KB, 265 lines) bloated by 16.2% when converted to YAML.
Why? Because human-authored specs aren't just rigid data objects; they are heavily reliant on prose, paragraphs, and acceptance criteria. YAML requires strict structural overhead to remain mathematically lossless.
Look at what happens to a simple list of Markdown acceptance criteria when forced into valid, lossless YAML:
### Acceptance Criteria
* If the user clicks the "Save" button, the system should trigger a debounced API call to the `/users/update` endpoint.
* If the endpoint returns a 500 error, display the `NetworkErrorToast` component and retain the user's unsaved form data.
acceptanceCriteria:
- |
If the user clicks the "Save" button, the system should trigger a debounced API call to the `/users/update` endpoint.
- |
If the endpoint returns a 500 error, display the `NetworkErrorToast` component and retain the user's unsaved form data.
To maintain strict content equivalence, the YAML conversion had to introduce dictionary keys, sequence dashes, explicit block scalar markers (|), and multiple levels of indentation. When you apply this structural overhead to hundreds of lines of developer instructions, the "compactness" of YAML completely evaporates.
YAML only beat Markdown on my largest spec (44.2 KB) because that specific file happened to be dominated by massive Markdown tables with heavy whitespace padding, giving the indentation-based YAML an artificial edge.
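If you want to sanity-check the size claim on your own specs before paying for trials, a minimal sketch looks like this; it assumes the converted files sit side by side under ./specs, and the characters-divided-by-four figure is only a crude token heuristic, not a real tokenizer:
import fs from 'fs';
// Compare the raw size of the Markdown and YAML variants of each spec.
for (const specName of ['feature-onboarding', 'user-profile', 'core-orchestrator']) {
  const md = fs.readFileSync(`./specs/${specName}/SPEC.md`, 'utf8');
  const yaml = fs.readFileSync(`./specs/${specName}/SPEC.yaml`, 'utf8');
  const deltaPct = ((yaml.length - md.length) / md.length) * 100;
  console.log(specName, {
    mdChars: md.length,
    yamlChars: yaml.length,
    roughMdTokens: Math.round(md.length / 4),
    yamlDeltaPct: deltaPct.toFixed(1),
  });
}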
To ensure the test was fair, I couldn't just run a single prompt and eyeball the results. I needed to isolate the format-driven effects from the natural noise of the LLM.
I built a two-path methodology to test this:
Path 1: I used our actual code-generation prompt, but instructed the agent to stop after Phase 1 (reading the spec and planning the work breakdown). In a standard pipeline run, the actual code-generation phase introduces massive, unpredictable variance in output tokens. By halting the agent before it writes code, we measure the true reasoning and context-parsing costs of the file format without the volatile noise of multi-file coding loops.
Path 2: To strip out the noise entirely, I bypassed the default multi-agent system prompt and disabled file-access tools. I fed the spec directly into the user prompt and forced a highly deterministic JSON output.
Here is a simplified version of the Node.js harness I used to trigger the Claude CLI for the isolation runs:
import { execFileSync } from 'child_process';
import fs from 'fs';
/**
 * Isolated Evaluation Harness for YAML vs. Markdown
 * Targets: claude-sonnet-4-6
 */
function runIsolationEval(specName, format) {
  const specContent = fs.readFileSync(`./specs/${specName}/SPEC.${format}`, 'utf8');
  // Stripped-down extraction prompt to eliminate reasoning variance
  const prompt = `Read the spec below carefully. Output a single JSON object with exactly these keys: components, hooks, utils, featureFlags. Output ONLY the JSON object. No prose, no code fences.
---SPEC START---
${specContent}
---SPEC END---`;
  // Invoke the Claude CLI via execFileSync with an argument array.
  // Passing the prompt as a single argument, instead of interpolating
  // it into a shell string, avoids quoting bugs from the quotes and
  // newlines that real specs contain.
  const rawOutput = execFileSync('claude', [
    '-p', prompt,
    '--model', 'claude-sonnet-4-6',
    '--output-format', 'json',
    '--max-budget-usd', '1',
    '--tools', '',
    '--exclude-dynamic-system-prompt-sections',
    '--no-session-persistence',
  ], { encoding: 'utf8', maxBuffer: 16 * 1024 * 1024 });
  // --output-format json makes the CLI emit a single JSON object with
  // usage and cost metrics for the run.
  return JSON.parse(rawOutput);
}
To balance out any "cache warm-up" advantages, the format was alternated trial-by-trial. In total, I ran 90 trials across the three specs.
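For completeness, the alternation and aggregation logic was roughly the following sketch, reusing runIsolationEval from above; the exact metric fields you aggregate depend on the CLI's JSON output:
// Alternate formats within every round so neither gets a lasting
// cache warm-up advantage.
function runTrials(specName, trialsPerFormat = 10) {
  const results = { md: [], yaml: [] };
  for (let round = 0; round < trialsPerFormat; round++) {
    for (const format of ['md', 'yaml']) {
      results[format].push(runIsolationEval(specName, format));
    }
  }
  return results;
}
// Report medians rather than means: the median cost resists the heavy
// output-token outliers that individual trials produce.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}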
If YAML were inherently more efficient, we would expect to see a consistent drop in cache_creation_input_tokens and overall cost_usd across the board.
That is not what happened. Here are the median results from 10 trials per format on the isolated evaluation path:
| Spec Target | Format | Cache-Create Tokens | Output Tokens | Wall Time (s) | Cost per Run ($) | Cost Mean ± 1σ |
|----|----|----|----|----|----|----|
| Small (feature-onboarding) | MD | 55,212 | 766 | 15.8 | $0.2186 | $0.2186 ± 0.0001 |
| Small (feature-onboarding) | YAML | 55,388 | 732 | 15.5 | $0.2187 | $0.2187 ± 0.0001 |
| Medium (user-profile) | MD | 57,972 | 7,698 | 93.9 | $0.3508 | $0.3457 ± 0.0195 |
| Medium (user-profile) | YAML | 63,082 | 7,491 | 97.1 | $0.3446 | $0.3546 ± 0.0249 |
| Large (core-orchestrator) | MD | 63,815 | 1,736 | 30.6 | $0.2654 | $0.2669 ± 0.0088 |
| Large (core-orchestrator) | YAML | 64,222 | 1,911 | 32.7 | $0.2702 | $0.2731 ± 0.0096 |
[Note: Data from 60 isolated runs on claude-sonnet-4-6. Delta format: YAML vs. MD]
The Delta Breakdown:
Not only is the direction of the effect completely inconsistent, but the median cost difference between formats fell entirely within ±1 standard deviation of the natural, within-format trial variance.
In statistics, this is called a null result. In engineering, it means: Stop wasting your time converting these files.
If YAML isn't saving us tokens, where is the money actually going? To understand this, we need to look at how a modern, multi-agent code generation pipeline actually handles an LLM request.
Our pipeline leverages Anthropic's prompt caching. When a developer updates a specification, the orchestrator agent builds a massive context window before making the API call. Here is a breakdown of that "Input Token Stack":
| Layer | Estimated Token Count | Role |
|----|----|----|
| Global System Prompt | ~30,000 tokens | Core agent instructions and logic |
| Project Memory & Skills | ~15,000 tokens | Repository rules, CLAUDE.md, and skills |
| The Spec Document | 1,500 – 11,000 tokens | The actual feature requirements |
| User Prompt Wrappers | ~200 tokens | Current task instructions |
When I broke down the usage logs from my 90 trials, the token distribution was eye-opening. The spec file was a drop in the bucket.
Token Breakdown Per Run:
The math doesn't lie. The specification content accounted for only 5% to 25% of the total input. Even if transitioning to YAML magically saved us 10% to 20% on the spec size, it would only yield a 1% to 2% cost swing on the total run.
Furthermore, the natural variance of LLM outputs completely swallowed any minor format savings. Across my isolation tests, the model's output length for the exact same prompt varied wildly. On my medium-sized spec, the JSON output ranged from 5,517 to 9,824 tokens across trials of the exact same format.
You simply cannot extract a meaningful cost reduction signal when the noise floor of claude-sonnet-4-6's own output variance is massive enough to cause ±$0.02 swings on its own.
Beyond the raw LLM token costs, we have to consider the engineering reality of making a systemic format change. Switching a repository from .md to .yaml carries massive friction:
The verdict is clear: migrating to YAML carries a non-trivial engineering cost with absolutely no measurable runtime upside.
My $23 experiment proved that migrating your entire development organization from Markdown to YAML is a waste of engineering bandwidth. The structural overhead inflates your prose, and the system prompt dilutes any actual savings.
If you are a frontend or platform engineer trying to optimize an agentic pipeline, ignore the file format hype. Instead, pull these three levers:
The "YAML Manifesto" might hold true if you are building an LLM classifier that only ingests raw data schemas. But in the messy, real world of software engineering—where humans write prose-heavy feature specs, and agents require tens of thousands of tokens of project context to operate safely—Markdown is just fine.
Don't restructure your CI/CD pipelines to chase a theoretical token hack. Keep your specs readable, write better global instructions, and let the agents do their jobs.
2026-05-01 03:21:35
Larger context windows, longer token limits and massive memory systems promise better agents by giving them more information, yet without the right structure they can make agents less reliable.
In practice, unstructured or weakly related context can increase ambiguity, domain confusion and stale-pattern interference. Context that does not improve the decision is not intelligence. It is noise, cost and latency.
The rise of the Model Context Protocol (MCP) makes this problem even more urgent by dramatically expanding the context surface that must be governed.
The result: higher token costs, higher operational risk and slower adaptation, exactly the opposite of what most organizations expect when they “invest in memory.”
Reliability in production depends far less on how much the model can remember and far more on how clearly the system defines:
Structure before memory. Boundaries before execution.
Spending more on memory is often treated as the all-purpose solution for making AI better, and infinite context is frequently regarded as the holy grail of agentic AI. Yet, as noted above, more information without the right structure can make agents less reliable.
Most discussions focus on latency, compute cost and token spend. The common debate is about how much context we can afford. The more important question is whether the use case really needs that context, whether the added context improves the decision and whether the system has enough structure to prevent that context from distorting the outcome.
The deeper issue is architectural. When context grows without clear state, chronology, domain boundaries or execution rules, it can actively make results worse by creating more ambiguity, interference and error, while driving up token costs, latency and operational risk.
Human cognition offers a useful analogy. We do not push every input directly into long-term memory. Instead, we filter, prioritize and structure information before it shapes complex action. Agentic systems need the same discipline: structure before memory.
Context in agentic systems also grows through architecture: connected tools, external data sources, sub-agents, memory layers, retrieved documents, workflow state and protocols such as the Model Context Protocol (MCP). These integrations make agents more capable, but they also expand the context surface, creating more paths for stale information, domain mismatch, weak provenance or conflicting assumptions to influence reasoning and execution.
For leaders, this means memory strategy is not only an AI performance decision. It is an architecture, risk and cost decision.
It becomes even more dangerous once we add the fourth dimension: time.
Agentic AI, like any actor operating in the real world, does not deal only with static facts. It has to deal with sequences, transitions, recency, and changing state. Status is not just a snapshot. It is a moving timeline.
And yet many people assume current Large Language Models (LLMs) will naturally track the passage of time and give recent events the right priority without explicit structure or prompting, almost as if time-awareness were built in as common sense.
A simple exercise gives a glimpse of the problem. Tell an LLM you have a conference session tomorrow. Leave the chat open. The next day, ask it to write a social post promoting your talk. It may still invite people to attend “tomorrow,” even though the event has already passed and the later conversation makes clear that the talk already happened.
In a chatbot, that is a minor mistake. In a monitoring or operational system, the same failure mode becomes much more serious. If the system does not represent time explicitly, it can misread what is current, what is recent, and what is already obsolete, then act on the wrong assumption.
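One cheap mitigation is to resolve temporal facts deterministically outside the model and inject them explicitly, instead of hoping the LLM infers them. A minimal sketch, with a hypothetical event structure:
// Compute time relationships in code; never ask the model to infer
// "tomorrow" from a stale conversation.
function temporalPreamble(event, now = new Date()) {
  const status = event.startsAt > now ? 'UPCOMING' : 'ALREADY HAPPENED';
  return `Current time: ${now.toISOString()}\n` +
         `Event "${event.name}" at ${event.startsAt.toISOString()}: ${status}`;
}
// The conference-talk anecdote above, made explicit for the model:
console.log(temporalPreamble({
  name: 'conference session',
  startsAt: new Date('2026-04-30T15:00:00Z'),
}));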
There is an important paradox here. In monitoring, long-term history is extremely valuable for training, baselining, and pattern learning. It helps identify seasonality, recurring behaviors, likely incident classes, and normal operating ranges over time.
But that does not mean the same history should be injected directly into flat live context and allowed to compete with current state, recent changes, and fresh exceptions.
In these cases, the fallacy is often not retrieval failure, but reasoning bias. The model may see the exception, yet the sheer weight of surrounding context effectively outvotes it.
What helps a system learn better over time can make it decide worse in the moment if training memory and runtime context are treated as the same thing.
For live monitoring, current state, recent changes, exceptions, and the timeline that connects them must take priority over accumulated historical familiarity. Long-term history should support runtime decisions through structured baselines, trained models, or controlled retrieval, not as undifferentiated context competing with what is true now.
In many monitoring cases, the critical insight is not a single event, but the cause-and-effect relationship between changes that recur over time. To recognize that pattern reliably, the timeline has to be represented explicitly.
History should support the system. It should not override the current truth.
An agent whose retrieved knowledge, examples and operational history are dominated by SUSE Linux Enterprise Server 15 may continue to suggest SLES 15-era assumptions when SLES 16 is deployed, even as kernel behavior, security defaults, lifecycle rules or package-management assumptions change. More historical data does not make the agent smarter about the new version. It makes adaptation slower.
This example illustrates both a failure of recency (time) and domain recognition (the two versions represent meaningfully different operating environments). History tells us when a pattern was valid; domain tells us whether it applies at all.
Time is not the only dimension that flat context fails to preserve. Domain matters too.
In agentic systems, a request may be syntactically clear while still being ambiguous in scope. A general-purpose agent connected to multiple tools may understand the words in the request, yet still choose the wrong domain in which to solve it.
Ambiguity is not only about language. It is inherent to reality.
Much of the discussion around LLM uncertainty focuses on natural language ambiguity and prompting techniques. But in agentic systems, uncertainty also comes from the structure of the systems they operate in. The world is divided into distinct domains, operational scopes and knowledge realms. Systems, tools and APIs are designed to operate under specific assumptions within their own domain. When an agent crosses those boundaries without resolving them explicitly, a request may be syntactically clear and well phrased, yet still wrong in meaning.
As standardized tool protocols such as the Model Context Protocol (MCP) make it dramatically easier to connect agents to large numbers of capabilities and expose them to external context beyond the original user prompt, this domain ambiguity becomes even more dangerous.
In agentic AI, too much knowledge can become counterproductive.
A physicist who has deeply studied relativity and quantum models is not more intelligent simply because they hold more knowledge in memory. They are effective because they know the domain boundaries: they do not apply subatomic rules to planetary motion or use the wrong model for the wrong scale.
When those domains are mixed without clear boundaries in a flat, unstructured context, the physicist does not become a genius — they become a source of noise, applying the right math to the wrong reality.
The same pattern appears in agentic systems. Ask an agent in a container-related conversation to “check vulnerabilities,” and the failure may not come from bad tooling. It may come from selecting a Linux host vulnerability API instead of the Kubernetes or container security tool. The tools may both be clear and well defined for their intended purpose. The problem is not the tool descriptions. The problem is how tool domains are scoped and exposed to the agent. The relevant tool domain was assumed, while the agent never properly resolved which domain the request belonged to, or resolved it too late.
In agentic AI, hyperspecialization can hurt, too.
This is the 'Law of the Instrument' in action. There is an old adage that “If all you have is a hammer, everything looks like a nail”. In an agentic system, if the domain isn't strictly defined, an agent with security tools may view a simple latency issue as a DDoS attack. An agent focused on cost-optimization may view a critical security patch as an unnecessary expense. Without clear boundaries, the agent doesn't admit it’s the wrong specialist—it simply tries to solve the problem with the only 'hammer' it has.
Like with human experts, you do not call a physicist when you need a mechanic. Both may be competent, but if the domain is wrong, the answer can still be useless or harmful. Agentic systems need the same discipline.
The same pattern appears outside infrastructure. A function like calculateRetention(amount) may be technically valid while remaining semantically incomplete. Payroll tax, contractor withholding, dividend taxation, country-specific rules, and legal entity all change the meaning. The call can be correct in syntax and wrong in meaning. 
This is the domain dimension of the problem: more context does not help if the system does not first establish which domain owns the request and which domain each tool is meant to serve.
Once that happens in a read-only workflow, the result is a misleading answer. Once it happens in an execution workflow, it becomes a wrong action.
Reasoning can be probabilistic. Execution must be bounded and controlled.
A plausible interpretation may be acceptable in analysis. It is not acceptable as unbounded operational authority.
Capability without structure is not intelligence. It is uncontrolled optionality.
Time and domain failures are not only reliability problems. In agentic systems, they can quickly become security and compliance problems.
A stale signal can lead to an outdated risk decision. A wrong-domain interpretation can invoke the wrong tool or expose the wrong data. An MCP source without clear provenance and authority can introduce context that should never influence execution.
This is why sensitive data and operational tools require structure: classification, provenance, authority, auditability, approval boundaries, and clear rules for when the agent must stop and ask. (See the Domain Handshake and MCP governance patterns in the solutions section below.)
Related reading: I explore the security side of this argument in “Gateway Security Won’t Be Enough for MCP-Powered AI,” which explains why MCP-powered systems need enforcement closer to the tools, endpoints and execution paths, not only perimetral security at the gateway/proxy.
Context is everything the model can see. State is the authoritative truth of the system at a given moment.
Context is necessary for intelligent behavior. But context alone is not enough. Information only becomes useful when it is structured in a way that preserves meaning, priority, and scope. Without that structure, extra knowledge becomes noise.
The same structural failure appears across both time and domain. In time, older patterns can overwhelm the latest exception. In domain, familiar knowledge from one scope can bleed into another where it no longer applies. The problem is the same in both cases: flat context forces the model to reconstruct relevance probabilistically instead of receiving it through explicit structure.
And agentic AI does not only interpret reality. It acts on it. It changes systems, moves workflows forward, and turns one system state into another. In that setting, intelligence without explicit state is not enough. Reliable action requires reliable state, and often a clear chronology of state transitions.
If state, chronology, domain boundaries, and execution rules are not represented explicitly, adding more memory can make decisions worse instead of better.
This creates a paradox, and explains why the bigger-memory fallacy is so easy to believe. In a small, well-scoped context, adding relevant information usually improves results and reduces errors. But there is an inflection point where more data no longer improves judgment. It starts adding noise.
Beyond that point, the system may know more, yet become less able to identify what matters most — all while consuming significantly higher token spend and compute resources. In small contexts, the relevant signals remain dominant. In large, unstructured contexts, those signals begin to compete with historical volume, irrelevant patterns, and lower-priority information. That is when decision quality starts to degrade.
Small, well-scoped context often produces clearer and more reliable behavior. Large, unstructured context can do the opposite. It can drown important exceptions in historical noise, blur the boundaries between domains, and make the system sound informed while acting on the wrong interpretation.
That is the bigger-memory fallacy in agentic AI: assuming that more memory automatically means more intelligence.
If bigger memory is not the answer, the response is not to remove context. It is to structure it.

Use the smallest context that preserves the right truth. Separate state from context. Do not let a large flat context become the control plane for monitoring or execution. And do not rely on common sense. In critical systems, unstated assumptions are design failures.
The rule should be: Keep the current state explicit. Preserve chronology separately. Do not let history override present truth.
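As a minimal sketch of that rule (all field names here are hypothetical): the authoritative state and its chronology are machine-maintained structures outside the prompt, and only a scoped, current-first slice is rendered into context.
// Authoritative state lives outside the prompt and is updated by the
// system, never by the model. Field names are hypothetical.
const systemState = {
  current: { service: 'checkout-api', status: 'degraded', errorRate: 0.07 },
  chronology: [
    { at: '2026-04-30T21:05:00Z', change: 'deployed v2.14.1' },
    { at: '2026-04-30T21:12:00Z', change: 'error rate rose from 0.01 to 0.07' },
  ],
};
// Render only a scoped, current-first slice into the agent's context.
function renderRuntimeContext(state, maxEvents = 5) {
  const recent = state.chronology.slice(-maxEvents);
  return [
    `CURRENT STATE (authoritative): ${JSON.stringify(state.current)}`,
    'RECENT CHANGES (oldest to newest):',
    ...recent.map(e => `- ${e.at}: ${e.change}`),
  ].join('\n');
}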
Next, put a gate in front of the agent: a lightweight schema validator, rules engine, or fast routing LLM runs first. It parses the request, identifies candidate tools or domains, and flags overlaps or ambiguity.
Do not let the heavy, probabilistic core of the agent discover ambiguity by accident. Instead, enforce explicit ambiguity thresholds early. If multiple plausible actions exceed the acceptable threshold, the system must immediately trigger clarification or block progression.
This single guardrail dramatically reduces the volume of risky cases that ever reach the probabilistic core of the agent.
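A sketch of that guardrail follows; the scoring function and threshold are placeholders for whatever rules engine or small routing model you use:
// Resolve the domain, or refuse to proceed, before the heavy
// probabilistic agent ever sees the request.
function routeRequest(request, domains, scoreFn, threshold = 0.6) {
  const plausible = domains
    .map(domain => ({ domain, score: scoreFn(request, domain) }))
    .filter(c => c.score >= threshold)
    .sort((a, b) => b.score - a.score);
  if (plausible.length === 0) {
    return { action: 'clarify', reason: 'no domain meets the threshold' };
  }
  if (plausible.length > 1) {
    // Multiple plausible domains exceed the threshold: do not guess.
    return { action: 'clarify', candidates: plausible.map(c => c.domain.name) };
  }
  return { action: 'dispatch', domain: plausible[0].domain };
}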
Making agents reliable before execution
Before any agent is allowed to act, verify the following:
Rule of thumb: If any item above is unclear, do not let the probabilistic core decide. Escalate or block.
Organize agents like a tree of specialized subagents. Let each agent resolve what it safely can within its own scope, and escalate upward when clarification, rerouting, or broader context is needed. When context is insufficient, the correct behavior is not to infer. It is to ask.
This structure ensures that each component remains focused, but the true power of this hierarchy lies in a negotiated delegation process:
This “Domain Handshake” turns delegation from a risky, probabilistic best-guess into a negotiated, self-correcting process that preserves the system's structural integrity.
The Domain Handshake begins at enrollment, not execution.
The Model Context Protocol (MCP) is changing how agentic AI connects to external systems by standardizing the way tools, resources, and prompts are exposed to LLM applications — creating a more consistent integration model for external capabilities.
Crucially, MCP also expands the context surface: external servers can expose resources, prompts, tool metadata, and workflow-specific information that agents may request or consume during a workflow. If domain boundaries are weak, an agent may pull context from the wrong source, apply it to the wrong operational domain, or treat externally supplied context as more authoritative than it should be. As MCP-based architectures evolve toward more bidirectional interactions, provenance, authority, and domain scoping become even more critical.
In MCP-based architectures, external capabilities and context sources must not simply be exposed and trusted at runtime. When registering an MCP server, tool, resource, prompt, or sub-agent, the system must validate its Domain Contract against the existing hierarchy — acting as a “pre-compilation” check for your agent architecture.
Each capability or context source must explicitly declare its domain, the actions it can perform, the data it can expose, the authority level it requires, its provenance, and where its boundaries end.
If two tools or context sources claim overlapping subdomains, or if their descriptions are too vague to guarantee clear separation, the system flags the collision at enrollment time — not after the agent has already selected a tool or consumed the context.
MCP standardizes how capabilities and context are discovered and invoked, but it does not eliminate the need for architectural governance. A tool or context source exposed through MCP is only safe if its domain, authority, provenance, and execution boundaries are explicit.
Requiring this clarification before runtime keeps delegation simpler, reduces both tool-selection and context-selection ambiguity, and makes the overall agentic system significantly more predictable. MCP tells the agent what capabilities and context exist. The Domain Handshake determines whether they belong in the decision path.
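As a sketch of what that enrollment-time check can look like (the contract fields mirror the declaration list above, and the overlap detection is deliberately naive):
// Validate a Domain Contract at enrollment, before the capability can
// ever appear in an agent's decision path.
const REQUIRED_FIELDS = ['domain', 'actions', 'dataExposed', 'authority', 'provenance', 'boundaries'];
function enrollCapability(contract, registry) {
  const missing = REQUIRED_FIELDS.filter(field => contract[field] == null);
  if (missing.length > 0) {
    throw new Error(`enrollment rejected: missing fields ${missing.join(', ')}`);
  }
  // Naive collision check: a domain may be claimed by only one capability.
  const collision = registry.find(c => c.domain === contract.domain);
  if (collision) {
    throw new Error(`enrollment rejected: domain "${contract.domain}" already claimed by ${collision.name}`);
  }
  registry.push(contract);
  return contract;
}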
Accept that AI will make mistakes. Design for bounded error, not perfect intelligence. Not everything needs to be LLMized: use LLMs where interpretation and coordination add value, and rely on deterministic components where state, policy, and control matter most. Improving system-level error tolerance with guardrails is often a more achievable goal than chasing flawless model behavior. Systems with low error tolerance need tighter controls than systems that can absorb bounded mistakes.
Structure improves reliability, but it is not free. Add complexity only where the use case demands it. Not every expert agent needs timeline-aware reasoning, multi-domain routing, or complex hierarchies. In many cases, a small specialized agent operating on a clean state snapshot is enough.
Structure requires architecture, classification, routing, validation, guardrails, domain mapping, state management and clear contracts. In more complex systems, it may also require multi-level agent hierarchies. When sensitive data or operational tools enter the agent workflow, the architecture must also account for security, audit and compliance requirements. All of this can significantly increase memory use, compute cost, latency and operational complexity.
There is usually an inflection point. In small, well-scoped tasks, flat context can be the optimal solution: more context can improve reliability at relatively low cost because the relevant signals and patterns remain easy to identify. Beyond that point, additional context starts to create ambiguity, interference and domain confusion. The system then needs more structure to remain reliable, but that structure introduces its own cost.
As a practical check, match the structure to the most likely failure mode:
The KISS principle still applies: the best architecture is rarely the most sophisticated one. It is the simplest one that safely fits the task.
LLMs are useful in ambiguous situations because they can interpret incomplete language, compare plausible meanings, and generate candidate paths. In many messy, real-world scenarios, this flexibility is a feature, not a bug. But that same probabilistic nature also makes them fallible, which is why they should not become the sole authority for low-error-tolerance execution. The goal is not to build perfect AI, but to build AI systems that are resilient when uncertainty remains.
If your use case has zero tolerance for error, treat LLM agents as analysis or supervised assistants only. If a small margin of error is acceptable, bounded autonomous use may be appropriate, provided the right structural guardrails are in place. Treat these recommendations as tools to be applied judiciously based on your specific risk and cost profile.
A wrong answer in a chatbot is annoying. A wrong action in an agentic system is an operational problem.
That is the real weakness of the bigger-memory fallacy. More context does not automatically create better judgment. Without explicit structure, it can create more ambiguity, more interference, and more ways to be wrong.
In monitoring and operations, context is nothing without structure. Agentic AI needs explicit state, clear timelines, and ordered transitions, not just larger snapshots of accumulated information.
LLMs will be everywhere, but they shouldn't do everything. They are often best used as coordinators and interpreters in roles that need to deal with ambiguity, but they may be the wrong tool for state and policy execution in low-error-tolerance systems. Stop trying to LLMize every workflow. Save your massive probabilistic models for reasoning, and rely on strict deterministic guardrails for control.
The goal is not perfect AI. The goal is to build systems that reduce mistakes, bound execution, and survive the ones that still happen.
In agentic systems, simpler architecture is often not only cheaper, but safer.
As a practical pattern, favoring smaller, specialized subagents with narrow context and clear domains is more than a pragmatic choice. It may also be the more reliable approach.
In the age of rapidly proliferating MCP-based tool integrations, this principle matters more than ever: the easier it becomes to connect tools and expand context, the more rigorously we must define their domains, boundaries, provenance and authority.
And that means spending more on memory alone is not always the solution. Often, the better investment is in structure, analysis, and clear system requirements.
More context is not the answer. Better structure is.
2026-05-01 03:05:32
:::info Auto-Invest feature lets customers automatically invest their paycheck in digital assets or a USD Interest Account
Las Vegas, Nevada, USA
:::
Uphold, the modern infrastructure provider for on-chain finance, announces the launch of Auto-Invest, a new feature for its popular Direct Deposit service. The new feature lets customers automatically invest their paycheck across multiple digital assets or a USD Interest Account.
With Direct Deposit, customers receive all or part of their paycheck automatically and securely in their Uphold account. Auto-Invest lets customers buy up to ten assets automatically in a single step the moment their paycheck arrives. Customers choose from digital assets, a USD Interest Account, or metals, and then set the percentage they wish to allocate to each asset. Anything not assigned stays in their USD balance. Auto-Invest users earn 3% back in XRP on crypto trades over $500, and 2% back on trades below $500.[1]
Customers can change their settings, pause, stop, or reactivate Auto-Invest at any time, with changes taking effect on future paychecks.
“Auto-Invest removes the friction of building a portfolio: customers set it up once, and it goes to work the moment their paycheck arrives,” said Nancy Beaton, President at Uphold HQ. “It embodies our goal of making people’s everyday finances work harder.”
Uphold Auto-Invest is unavailable in New York, American Samoa, and the U.S. Virgin Islands.
Uphold is a financial technology company that believes on-chain services are the future of finance. It provides modern infrastructure for on-chain payments, banking and investments. Offering Consumer Services, Business Services and Institutional Trading, Uphold makes financial services easy and trustworthy for millions of customers in more than 140 countries.
Uphold integrates with more than 30 trading venues, including centralized and decentralized exchanges, to deliver superior liquidity, resilience and optimal execution. Uphold never loans out customer assets and is always 100% reserved.
The company pioneered radical transparency and uniquely publishes its assets and liabilities every 30 seconds on a public website (https://uphold.com/en-us/transparency).
Uphold is regulated in the U.S. by FinCEN and state regulators, is registered in the UK with the FCA, and is registered in Europe with the Financial Crime Investigation Service under the Ministry of the Interior of the Republic of Lithuania. Securities products and services are offered by Uphold Securities, Inc., a broker-dealer registered with the SEC and a member of FINRA and SIPC.
To learn more about Uphold’s products and services, visit uphold.com.
[1] Terms apply to the Auto-Invest XRP back promo.
:::warning Disclaimer: The information provided in this press release is not a solicitation for investment, nor is it intended as investment advice, financial advice, or trading advice. Investing involves risk, including the potential loss of capital. It is strongly recommended you practice due diligence, including consultation with a professional financial advisor, before investing in or trading cryptocurrency and securities. You are solely responsible for your investment decisions and assume all associated risks. Neither the media platform nor the publisher shall be held responsible for any fraudulent activities, misrepresentations, or financial losses arising from the content of this press release.
:::
:::tip This story was distributed as a release by Blockchain Wire under HackerNoon Business Blogging Program.
:::