
Exploratory Data Analysis: How to Read a Dataset

2026-04-28 16:24:05

Loading a dataset is not the same as understanding it.

You have done the loading. You have done the cleaning. You have merged tables and filtered rows and built charts one at a time.

But those were isolated skills. EDA is what happens when you use all of them together with a specific purpose. You are not just making charts. You are interrogating a dataset. Asking questions. Finding answers. Forming new questions from those answers.

Real data science looks like this: you load a dataset, you do not know what is in it, and forty minutes later you know its shape, its problems, its patterns, and you have three specific hypotheses to test with a model.

This post walks through that entire process on one real dataset, start to finish.

The Dataset

We will use the Housing dataset, available on Kaggle (search "House Prices Advanced Regression Kaggle"). It has 79 features and 1460 rows describing residential homes in Ames, Iowa. The target is the sale price.

If you cannot download it right now, use this simplified version to follow along:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

np.random.seed(42)
n = 500

df = pd.DataFrame({
    "SalePrice":    np.random.normal(180000, 50000, n).clip(50000, 500000),
    "GrLivArea":    np.random.normal(1500, 400, n).clip(500, 4000),
    "YearBuilt":    np.random.randint(1900, 2010, n),
    "OverallQual":  np.random.randint(1, 11, n),
    "GarageCars":   np.random.choice([0, 1, 2, 3], n, p=[0.1, 0.2, 0.5, 0.2]),
    "TotalBsmtSF":  np.random.normal(1000, 300, n).clip(0, 3000),
    "Neighborhood": np.random.choice(["A", "B", "C", "D", "E"], n),
    "HouseStyle":   np.random.choice(["1Story", "2Story", "1.5Fin"], n),
    "MasVnrArea":   np.random.exponential(100, n).clip(0, 1600),
})

# Inject some realism: missing basement values and a handful of extreme living-area outliers
df.loc[np.random.choice(n, 30, replace=False), "TotalBsmtSF"] = np.nan
df.loc[np.random.choice(n, 10, replace=False), "GrLivArea"]   = np.random.uniform(8000, 15000, 10)

Phase 1: First Contact

The very first thing. No charts yet. Just numbers and structure.

print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(f"Shape: {df.shape}")
print(f"Memory: {df.memory_usage(deep=True).sum() / 1024:.1f} KB\n")

print("Column Types:")
print(df.dtypes.value_counts())

print("\nFirst 5 rows:")
print(df.head())

print("\nBasic Statistics:")
print(df.describe().round(2))

What you are looking for at this stage:

How many rows and columns. A dataset with 500 rows and 79 features is very wide relative to its length. That matters for modeling.

Dtypes. Are numerical columns actually numerical? Are categorical columns stored as strings or codes?

The min and max values in describe(). An age of -5 or a salary of 10 billion tells you something went wrong. Spot it here.

The difference between mean and median (50%). Large differences signal skewness or outliers.
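The mean-versus-median check is easy to script. A minimal sketch against the simplified frame above; column names will differ on the full Kaggle data:

# Flag columns where the mean and median diverge sharply, a sign of skew or outliers.
numeric = df.select_dtypes(include=[np.number])
gap = ((numeric.mean() - numeric.median()).abs()
       / numeric.median().abs().replace(0, np.nan)).round(2)
print("Relative mean-median gap (largest first):")
print(gap.sort_values(ascending=False))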

Phase 2: Missing Values Map

missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
missing_df = pd.DataFrame({
    "missing_count": missing,
    "missing_pct":   missing_pct
}).query("missing_count > 0").sort_values("missing_pct", ascending=False)

print("Missing Values:")
print(missing_df)

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(missing_df.index, missing_df["missing_pct"], color="coral")
ax.set_xlabel("Missing %")
ax.set_title("Missing Values by Column")
ax.axvline(x=50, color="red", linestyle="--", label="50% threshold")
ax.legend()
plt.tight_layout()
plt.savefig("missing_values.png", dpi=150)
plt.show()

Columns with more than 50% missing are usually not worth imputing. They carry too little signal. Drop them or note them for investigation.

Columns with less than 5% missing are safe to impute with mean, median, or mode.

The pattern of missingness matters too. If MasVnrArea is missing, is it also missing in the same rows where MasVnrType is missing? Missing together suggests a structural relationship, not random noise.
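One quick way to check that pattern is to correlate the missing-value indicators themselves. This is a sketch; the simplified frame has gaps in only one column, so it only becomes interesting on the full dataset:

# Do columns tend to be missing together? Indicator correlations near 1 suggest structure.
miss_ind = df.isnull().astype(int)
miss_ind = miss_ind.loc[:, miss_ind.sum() > 0]   # keep only columns that have gaps
if miss_ind.shape[1] > 1:
    print(miss_ind.corr().round(2))
else:
    print("Only one column has missing values in this frame.")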

Phase 3: Target Variable First

Before anything else, understand what you are trying to predict.

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].hist(df["SalePrice"], bins=40, color="steelblue", edgecolor="white")
axes[0].set_title("SalePrice Distribution (Raw)")
axes[0].set_xlabel("Price")

axes[1].hist(np.log1p(df["SalePrice"]), bins=40, color="coral", edgecolor="white")
axes[1].set_title("SalePrice Distribution (Log)")
axes[1].set_xlabel("Log Price")

stats_text = (
    f"Mean:   ${df['SalePrice'].mean():,.0f}\n"
    f"Median: ${df['SalePrice'].median():,.0f}\n"
    f"Std:    ${df['SalePrice'].std():,.0f}\n"
    f"Skew:   {df['SalePrice'].skew():.2f}"
)
axes[2].text(0.5, 0.5, stats_text, transform=axes[2].transAxes,
             fontsize=12, va="center", ha="center",
             bbox=dict(boxstyle="round", facecolor="lightblue", alpha=0.5))
axes[2].set_title("Target Statistics")
axes[2].axis("off")

plt.tight_layout()
plt.savefig("target_distribution.png", dpi=150)
plt.show()

Sale prices are almost always right-skewed. A few very expensive homes pull the mean above the median. Log transformation often makes them more normally distributed, which helps many ML algorithms.

The skew value confirms this. Skew above 1 or below -1 usually means you should consider transforming the target.
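A minimal sketch of gating the transform on that threshold. On the simplified random data the skew is near zero, so nothing happens; on the real Ames data the branch fires:

# Transform the target only when the skew justifies it.
target_skew = df["SalePrice"].skew()
if abs(target_skew) > 1:
    df["SalePrice_log"] = np.log1p(df["SalePrice"])
    print(f"Skew {target_skew:.2f}: added log target (new skew {df['SalePrice_log'].skew():.2f})")
else:
    print(f"Skew {target_skew:.2f}: transformation probably unnecessary")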

Phase 4: Numerical Features Deep Dive

num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
num_cols = [c for c in num_cols if c != "SalePrice"]

n_cols = 3
n_rows = (len(num_cols) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 3.5))
axes = axes.flatten()

for i, col in enumerate(num_cols):
    axes[i].hist(df[col].dropna(), bins=30, color="steelblue", edgecolor="white", alpha=0.8)
    axes[i].set_title(col, fontsize=10)
    axes[i].set_xlabel("")

for j in range(i + 1, len(axes)):
    axes[j].set_visible(False)

plt.suptitle("Distribution of All Numerical Features", fontsize=14, y=1.01)
plt.tight_layout()
plt.savefig("feature_distributions.png", dpi=150, bbox_inches="tight")
plt.show()

You are looking for:

Skewed features that might need transformation.

Binary features disguised as continuous (only values 0 and 1).

Features with very low variance (almost all the same value). These add noise to models.

Bimodal distributions suggesting two subpopulations that might need to be modeled separately.
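The first three checks can be pre-screened numerically before you stare at every histogram. A sketch using the num_cols list defined above:

# Flag skewed, near-binary, and near-constant numeric features.
feature_report = pd.DataFrame({
    "skew":     df[num_cols].skew().round(2),
    "n_unique": df[num_cols].nunique(),
    "std":      df[num_cols].std().round(2),
})
print("Highly skewed (|skew| > 1):")
print(feature_report[feature_report["skew"].abs() > 1])
print("\nTwo or fewer distinct values (binary or near-constant):")
print(feature_report[feature_report["n_unique"] <= 2])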

Phase 5: Correlation Analysis

corr = df[num_cols + ["SalePrice"]].corr()

top_corr = corr["SalePrice"].abs().sort_values(ascending=False).drop("SalePrice")
print("Top correlations with SalePrice:")
print(top_corr.head(8).round(3))

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

top_features = top_corr.head(6).index.tolist()
sns.heatmap(
    df[top_features + ["SalePrice"]].corr(),
    annot=True, fmt=".2f", cmap="coolwarm",
    center=0, square=True, ax=axes[0]
)
axes[0].set_title("Correlation: Top Features vs Target")

top_corr.head(8).sort_values().plot(kind="barh", ax=axes[1], color="steelblue")
axes[1].set_title("Feature Correlation with SalePrice")
axes[1].set_xlabel("Absolute Correlation")
axes[1].axvline(x=0.5, color="red", linestyle="--", alpha=0.7)

plt.tight_layout()
plt.savefig("correlations.png", dpi=150)
plt.show()

Features with correlation above 0.5 with the target are strong candidates for your model.

Features highly correlated with each other (above 0.8) are redundant. Keeping both adds noise without adding information. You will need to choose one.
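A sketch that lists those redundant pairs directly, reusing the corr matrix computed above. On the simplified random data there will be few hits; on the real Ames data expect pairs such as garage size measured in cars versus square feet:

# Feature pairs whose mutual correlation exceeds 0.8 are redundancy candidates.
feat_corr = corr.drop("SalePrice").drop(columns="SalePrice").abs()
mask = np.triu(np.ones(feat_corr.shape, dtype=bool), k=1)   # upper triangle, no diagonal
pairs = feat_corr.where(mask).stack().sort_values(ascending=False)
print("Most correlated feature pairs:")
print(pairs.head(5).round(3))
print("\nRedundancy candidates (|r| > 0.8):")
print(pairs[pairs > 0.8].round(3))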

Phase 6: Scatter Plots for Top Features

top_4 = top_corr.head(4).index.tolist()

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for i, col in enumerate(top_4):
    axes[i].scatter(df[col], df["SalePrice"], alpha=0.4, color="steelblue", s=20)

    m, b = np.polyfit(df[col].fillna(df[col].median()), df["SalePrice"], 1)
    x_line = np.linspace(df[col].min(), df[col].max(), 100)
    axes[i].plot(x_line, m * x_line + b, color="red", linewidth=2)

    axes[i].set_xlabel(col)
    axes[i].set_ylabel("SalePrice")
    corr_val = df[[col, "SalePrice"]].corr().iloc[0, 1]
    axes[i].set_title(f"{col} vs SalePrice (r={corr_val:.2f})")
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("scatter_top_features.png", dpi=150)
plt.show()

Scatter plots reveal what correlation numbers cannot.

A linear relationship looks like a clean diagonal cloud. A curved relationship means linear regression will underfit and you need polynomial features or a tree-based model. Heteroscedasticity (cone-shaped scatter) means variance increases with the feature value, common in house prices.

Outliers stand out visually here. One point far above the trend line is not noise, it is a story. Investigate it.
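Investigating usually means pulling the underlying rows. A small sketch that reuses the same linear fit to surface the rows farthest from the trend:

# Rows farthest from the GrLivArea trend line, for manual inspection.
col = "GrLivArea"
x = df[col].fillna(df[col].median())
m, b = np.polyfit(x, df["SalePrice"], 1)
residuals = df["SalePrice"] - (m * x + b)
with_resid = df.assign(residual=residuals)
print("Far above the trend:")
print(with_resid.nlargest(3, "residual")[[col, "SalePrice", "residual"]])
print("\nFar below the trend:")
print(with_resid.nsmallest(3, "residual")[[col, "SalePrice", "residual"]])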

Phase 7: Categorical Features

cat_cols = df.select_dtypes(include=["object"]).columns.tolist()

fig, axes = plt.subplots(1, len(cat_cols), figsize=(14, 5))

for i, col in enumerate(cat_cols):
    order = df.groupby(col)["SalePrice"].median().sort_values(ascending=False).index
    sns.boxplot(
        data=df, x=col, y="SalePrice",
        order=order, ax=axes[i], palette="Set2"
    )
    axes[i].set_title(f"SalePrice by {col}")
    axes[i].set_xlabel("")
    axes[i].tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.savefig("categorical_analysis.png", dpi=150)
plt.show()

Box plots by category show you the median and spread of the target for each category value. If neighborhoods have very different median prices, neighborhood is a powerful feature. If house styles have similar medians with overlapping boxes, the style might not matter much.
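The same comparison as numbers you can paste into a report, using the cat_cols list from the cell above:

# Per-category median, spread, and count behind the box plots.
for col in cat_cols:
    summary = (df.groupby(col)["SalePrice"]
                 .agg(median="median",
                      iqr=lambda s: s.quantile(0.75) - s.quantile(0.25),
                      count="size")
                 .sort_values("median", ascending=False))
    print(f"\nSalePrice by {col}:")
    print(summary.round(0))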

Phase 8: Outlier Detection and Investigation

from scipy import stats

z_scores = np.abs(stats.zscore(df[["SalePrice", "GrLivArea"]].dropna()))
outlier_mask = (z_scores > 3).any(axis=1)

print(f"Outliers detected: {outlier_mask.sum()}")

fig, ax = plt.subplots(figsize=(10, 6))

normal = df[~df.index.isin(df[["SalePrice", "GrLivArea"]].dropna().index[outlier_mask])]
outliers = df[df.index.isin(df[["SalePrice", "GrLivArea"]].dropna().index[outlier_mask])]

ax.scatter(normal["GrLivArea"], normal["SalePrice"],
           alpha=0.5, color="steelblue", s=20, label="Normal")
ax.scatter(outliers["GrLivArea"], outliers["SalePrice"],
           color="red", s=80, marker="X", label="Outlier", zorder=5)
ax.set_xlabel("Above Grade Living Area (sq ft)")
ax.set_ylabel("Sale Price")
ax.set_title("Outlier Detection: GrLivArea vs SalePrice")
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("outliers.png", dpi=150)
plt.show()

The classic housing data problem: some very large houses sell for surprisingly low prices. These are often partial sales, auction sales, or non-standard transactions. For modeling normal market behavior, you might remove them. Document the decision.
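If you do decide to exclude them, here is a hedged way to act on it and keep the decision on record, reusing outlier_mask from the cell above:

# Drop the flagged rows and record why, so the choice is auditable later.
kept_index = df[["SalePrice", "GrLivArea"]].dropna().index[~outlier_mask]
df_clean = df.loc[kept_index]

decision_note = (
    f"Removed {outlier_mask.sum()} rows flagged as outliers (|z| > 3 on SalePrice or GrLivArea) "
    f"on {pd.Timestamp.today():%Y-%m-%d}. Rationale: likely non-standard sales; "
    "excluded from market-price modeling."
)
print(decision_note)
print(f"Rows before: {len(df)}, after: {len(df_clean)}")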

Phase 9: EDA Summary Report

After all that exploration, write down what you found.

summary = """
EDA FINDINGS SUMMARY
=====================

Dataset: Housing Prices
Shape: {} rows x {} columns

KEY FINDINGS:
1. Target (SalePrice) is right-skewed (skew={:.2f}). Log transform recommended.
2. {} columns have missing values. TotalBsmtSF has {:.1f}% missing.
3. Top predictors: OverallQual (r={:.2f}), GrLivArea (r={:.2f}), GarageCars (r={:.2f})
4. {} outliers detected in GrLivArea vs SalePrice. Investigate before modeling.
5. Neighborhoods show significant price variation. Include as categorical feature.

RECOMMENDED PREPROCESSING:
- Log transform SalePrice
- Impute TotalBsmtSF with median
- Encode Neighborhood and HouseStyle
- Remove or cap outliers in GrLivArea
- Consider polynomial features for GrLivArea (curved relationship)
""".format(
    df.shape[0], df.shape[1],
    df["SalePrice"].skew(),
    df.isnull().any(axis=0).sum(),
    df["TotalBsmtSF"].isnull().mean() * 100,
    df[["OverallQual", "SalePrice"]].corr().iloc[0,1],
    df[["GrLivArea", "SalePrice"]].corr().iloc[0,1],
    df[["GarageCars", "SalePrice"]].corr().iloc[0,1],
    outlier_mask.sum()
)

print(summary)

with open("eda_summary.txt", "w") as f:
    f.write(summary)

Always end EDA with a written summary. What you found, what it means, what you will do about it. This is the document that guides every preprocessing and modeling decision that follows.

The EDA Mindset

EDA is not a checklist. It is a conversation with your data.

Every chart raises a question. Why does this neighborhood have such a wide price range? Why are there so many houses with zero masonry veneer area? Why does the scatter for GrLivArea show two clusters?

Follow the questions. The answers tell you what your model needs to learn and what obstacles it will face.

The best data scientists are not the ones who build the fanciest models. They are the ones who understand their data so thoroughly that when the model behaves unexpectedly, they already have a hypothesis about why.

EDA is how you build that understanding.

A Blog That Shaped How People Think About EDA

Will Koehrsen wrote a piece on Towards Data Science called "A Gentle Introduction to Exploratory Data Analysis" using the exact Ames Housing dataset we referenced here. It became one of the most-read EDA tutorials in the data science community. His approach of treating EDA as hypothesis generation rather than just plotting everything influenced a generation of practitioners. Search "Will Koehrsen gentle introduction EDA Towards Data Science."

Try This

Create eda_practice.py.

Download the Ames Housing dataset from Kaggle (search "House Prices Advanced Regression Techniques Kaggle"). The full version has 79 features.

Run the complete EDA workflow from this post on the real dataset. All nine phases.

Then answer these specific questions using only code and charts:

Which five features have the strongest linear correlation with SalePrice?

Is the relationship between GrLivArea and SalePrice truly linear or does it curve at the high end?

Which neighborhoods have the highest median SalePrice? Which have the highest variance?

Are there any two features that are so correlated with each other that keeping both is redundant?

Write your own EDA summary report and save it to a text file.

What's Next

Phase 3 ends with one more post: a full data project pulling together everything from loading to EDA on a real dataset with a real question to answer. After that, Phase 4: SQL. Then Phase 5: dev tools. Then the real thing begins. Machine learning.

[2026-04-28] 5 Hidden Uses of the Claude Code CLI 🔥

2026-04-28 16:15:41

The conclusion up front: Claude Code is not just a terminal that can chat. Five built-in but little-documented capabilities (Codex Skills, cross-session memory, file scoping, structured output mode, and inline tool orchestration) are already sitting in your installation directory, and 90% of users have never touched them.

Thanks to @gaborcselle and @sethmlaird for surfacing these techniques in production.

Why You Already Have These Tools

This week's GitHub Trending data tells a clear story: developers spend real money on AI coding assistants and use only 20% of the features. The hottest repository of the week, a carefully curated list of Codex Skills, has only a few hundred stars, because most users do not even know a Skills system exists.

Today we break down the sparsely documented advanced capabilities of Claude Code. No extra packages to install, no API quota to burn. Just configuration files and CLI flags you already have locally.

Hidden Use #1: Codex Skills as Reusable Prompt Chains

Most developers use Claude Code as a one-off conversation. The Skills system lets you package multi-step workflows into reusable modules.

# List the skills currently available
claude --skills list

# Create a skill for your most common task pattern
mkdir -p ~/.claude/skills/my-team
cat > ~/.claude/skills/my-team/read-parse-prd.md << 'EOF'
You are a PRD analysis agent.

Given the PRD document located at {path}:
1. Extract all user stories as a list
2. Flag any ambiguous requirements with [NEEDS CLARIFICATION]
3. Output a JSON summary: {"stories": [], "open_questions": []}
EOF

# Invoke with: claude --skill my-team/read-parse-prd --path docs/prd.md

A skill file is just Markdown with {} placeholders. The model fills in the values at runtime. Think of it as your personal library of micro-agent templates.

Source: the trending GitHub repository "A curated list of practical Codex skills" aggregates 200+ skills drawn from real community use cases.

Hidden Use #2: Beads for Persistent Memory Across Sessions

This week's trending repository "Beads" describes itself as "adding a RAM stick to your coding agent's brain." The core observation: by default, every Claude Code session starts from scratch. Beads adds a lightweight semantic memory layer on top.

# beads_memory.py - attached to claude-code via --init-script
import beads, os

MEMORY_PATH = os.path.expanduser("~/.claude/memory_index.json")

def on_context_window_near_limit(context):
    # Called automatically when the context reaches 80% of capacity
    memory = beads.load(MEMORY_PATH)
    # Compress the facts from the current session into a semantic summary
    summary = beads.summarize(context.recent_messages, max_tokens=512)
    memory.append({"type": "session_summary", "content": summary})
    beads.save(memory, MEMORY_PATH)
    return memory

# Add to .claude/settings.json:
# { "init_scripts": ["beads_memory.py"] }

That is the pattern behind "Beads" trending this week: the agent remembers what it learned about your codebase three sessions ago.

HN discussion: this week's "GTFOBins" thread and related security posts on Hacker News are debating the risks of persistent memory in AI coding tools. Worth reading before you enable this.

Hidden Use #3: Precise File Scoping with --focus-path

Claude Code tends to wander in large repositories. The --focus-path flag restricts tool visibility to a specified directory tree.

# Claude can only see the auth/ directory and its submodules
claude --focus-path ./src/auth "Refactor the OAuth token refresh logic"

# Combine with --no-auto-context to prevent cross-module context pollution
claude --focus-path ./src/auth --no-auto-context "Fix the refresh token bug"

This keeps the model out of vendor/, node_modules/, and unrelated src/ subdirectories, and stops it from hallucinating calls that do not belong to the auth module.

Hidden Use #4: Structured Output Mode (--output-format json)

Most developers just pipe Claude Code's output to grep or jq and call it a day. But --output-format json gives you machine-readable, structured output directly.

# Get a dependency audit report in JSON format
claude --output-format json << 'EOF'
Analyze the package.json in the current directory. Return:
{
  "outdated": [{"package": "", "current": "", "latest": ""}],
  "security_issues": [{"package": "", "severity": "high|medium|low", "description": ""}],
  "update_recommendation": "safe_to_update|needs_review|major_bump"
}
EOF

This turns Claude Code into a programmable CLI tool rather than just a chat interface.

Hidden Use #5: Inline Tool Orchestration with the /tool Prefix

Inside an interactive Claude Code session, the /tool command lets you invoke any sub-agent or external script mid-conversation.

/tool run --name my-team/read-parse-prd --path docs/roadmap.md

This is different from /attach: /tool runs the skill as an independent reasoning process, collects its output, and then resumes the parent session. You can chain three or four tool calls in a row, each building on the previous result, and ask for a combined summary at the end.
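To make the chaining concrete, here is a sketch of what such a sequence might look like inside one session, reusing only the /tool syntax shown above. The second and third skill names are hypothetical placeholders, not skills that ship with Claude Code.

# Illustrative only: estimate-stories and draft-sprint-plan are hypothetical skills
/tool run --name my-team/read-parse-prd --path docs/roadmap.md
/tool run --name my-team/estimate-stories --path docs/roadmap.md
/tool run --name my-team/draft-sprint-plan --path docs/roadmap.md
Summarize the outputs of the three tools above into one sprint proposal.

Each /tool invocation returns to the parent conversation, so the plain-language request at the end can refer back to all three results.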

The Bigger Picture

This week's top Hacker News story, Microsoft and OpenAI ending their exclusivity and revenue-sharing arrangement, means more AI coding tools will compete on price and features. Claude Code's Skills system and memory extensions are a competitive moat: they let you extract more value from every API call instead of treating the tool as a mere chat wrapper.

GitHub Copilot's move to usage-based pricing is the market signal: compute is money. Skills and memory extensions directly lower the token burn per task, which shows up on your bill.

Related Reading

What is your most-used hidden Claude Code feature? Share it in the comments.

Boost Your Claude / AI Dev Workflow with These 4 Tools

2026-04-28 16:15:11

If you're working with AI-assisted development (especially Claude), these tools can significantly improve productivity, context management, and UI output quality.

1. Superpowers

GitHub: https://github.com/obra/superpowers

What it does:
Enhances Claude with structured prompts, reusable workflows, and better task orchestration.

Use cases:

  • Prompt engineering at scale
  • Reusable task templates
  • Structured dev workflows (debug, refactor, plan)

Pros:

  • Improves consistency of outputs
  • Saves time with reusable prompt packs
  • Works well with CLI-based Claude usage

Cons:

  • Requires initial setup and learning curve
  • Not fully plug-and-play for beginners

2. UI UX Pro Max Skill

GitHub: https://github.com/nextlevelbuilder/ui-ux-pro-max-skill

What it does:
A curated prompt/skill set to generate high-quality UI/UX designs from AI.

Use cases:

  • Landing page generation
  • Component design (React/Tailwind)
  • UX improvement suggestions

Pros:

  • Produces cleaner, modern UI output
  • Great for Tailwind + React workflows
  • Reduces design iteration time

Cons:

  • Output quality depends on prompt discipline
  • Not a design system replacement

3. Awesome Claude Code

GitHub: https://github.com/hesreallyhim/awesome-claude-code

What it does:
A curated list of tools, prompts, and resources for Claude-based coding.

Use cases:

  • Discovering new AI dev tools
  • Learning best practices
  • Expanding workflow stack

Pros:

  • Continuously updated ecosystem list
  • Saves research time
  • Good for both beginners and advanced users

Cons:

  • Not a tool itself (just a collection)
  • Quality varies across listed resources

4. Claude Mem

GitHub: https://github.com/thedotmack/claude-mem

What it does:
Adds a persistent memory layer for Claude, enabling context retention across sessions.

Use cases:

  • Long-term project context
  • Remembering architecture decisions
  • Maintaining state between prompts

Pros:

  • Reduces repetition in prompts
  • Improves continuity in complex projects
  • Useful for large-scale applications

Cons:

  • Requires setup and storage management
  • Risk of stale or outdated memory if not maintained

Final Thoughts

If you're serious about AI-assisted development:

  • Use Superpowers → for structured workflows
  • Use UI UX Pro Max Skill → for frontend/UI generation
  • Use Awesome Claude Code → for discovering tools
  • Use Claude Mem → for persistent context

Combining all four creates a much more powerful and scalable AI development workflow.

Feel free to fork, customize, and integrate these into your own stack.

DeepSeek-V4 Changes the Context Game for Agents — And Your Memory Architecture Should Adapt

2026-04-28 16:14:47

A million-token context window built specifically for agentic workloads. That's the feature in DeepSeek-V4 that stopped me mid-scroll this week — not because big context windows are new, but because this one is engineered for the exact failure mode that plagues every serious agent builder right now.

The Duct Tape Era of Agent Memory

Let's be honest about the state of agent architectures in 2026. Most production agents are held together with aggressive summarization, chunked context windows, and RAG pipelines that were originally designed for search, not for multi-step reasoning.

These patterns exist because we've been building agents under a hard constraint: 128K tokens, sometimes 200K if you're lucky. When your agent needs to reason across an entire codebase, navigate a 400-page contract set, or execute a multi-step plan spanning hundreds of tool calls, you hit that ceiling fast. So you compress. You summarize. You retrieve fragments and hope the model can reconstruct enough coherence to make good decisions.

It works — until it doesn't. And when it fails, it fails silently. The agent confidently acts on incomplete context, makes decisions based on lossy summaries, or retrieves the wrong chunk because the embedding similarity didn't capture the actual semantic dependency. You don't get an error message. You get a subtly wrong output that takes hours to debug.

What DeepSeek-V4 Actually Offers

DeepSeek-V4 ships with a native million-token context window that, according to Hugging Face's technical breakdown, is specifically optimized for agentic workloads. This isn't just a bigger number on a spec sheet. The architecture is designed to maintain reasoning coherence across the full window — meaning the model doesn't degrade catastrophically at token 900K the way many extended-context models do.

For agent builders, this changes the design calculus in a concrete way:

  • Full codebase reasoning: Instead of chunking a repository into fragments and hoping RAG retrieves the right file, you can feed the agent the entire codebase. It can trace dependencies, understand architectural patterns, and reason about cross-file implications natively.
  • End-to-end plan execution: Multi-step agents that make hundreds of tool calls can maintain their full execution history in context. No more summarizing previous steps and losing the nuance of why a particular decision was made.
  • Document-heavy workflows: Legal contracts, technical specifications, regulatory filings — domains where missing a clause on page 312 because it wasn't in your top-k retrieval results can be catastrophic.

This Doesn't Kill RAG — But It Reframes It

I'm not arguing that retrieval-augmented generation is dead. RAG still wins when your corpus is genuinely massive — tens of millions of tokens, entire knowledge bases, continuously updated data streams. You can't fit Wikipedia into a context window, and you shouldn't try.

But here's the reframe: RAG should be a scaling strategy, not a coping mechanism. Too many agent architectures use retrieval because the context window is too small, not because retrieval is the right abstraction for the problem. When your entire relevant context fits within a million tokens — and for a surprising number of real-world agent tasks, it does — native context is simpler, more reliable, and produces better reasoning.

The engineering complexity you save is significant. No embedding pipeline to maintain. No chunk-size tuning. No re-ranking layer to debug. No retrieval failures to handle gracefully. You replace an entire subsystem with a longer prompt.

The Benchmark You Should Run

If you're building or refining an agent memory system right now, here's what I'd actually do: take your current RAG-augmented agent, take the same task, and run it with the full context stuffed into DeepSeek-V4's window. Compare output quality, reasoning coherence, and — critically — failure modes. You might find that the simpler architecture wins outright for your use case.
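If you want a starting point for that comparison, here is a minimal, hedged harness sketch. It assumes nothing about your stack: you pass in your own RAG pipeline and your own full-context call (for example, a function that hits a DeepSeek-V4 endpoint) as plain Python callables; both names below are placeholders.

import json
import time
from typing import Callable

def compare_architectures(task: str, corpus: str,
                          run_rag_agent: Callable[[str], str],
                          run_full_context: Callable[[str], str]) -> dict:
    """Run the same task through a RAG pipeline and a full-context prompt, timing both.

    run_rag_agent and run_full_context are placeholders for your own pipelines.
    """
    t0 = time.time()
    rag_answer = run_rag_agent(task)
    rag_seconds = time.time() - t0

    t0 = time.time()
    # Full-context variant: stuff the entire corpus into the prompt.
    full_answer = run_full_context(f"{corpus}\n\nTask: {task}")
    full_seconds = time.time() - t0

    result = {
        "task": task,
        "rag": {"seconds": round(rag_seconds, 1), "answer_preview": rag_answer[:500]},
        "full_context": {"seconds": round(full_seconds, 1), "answer_preview": full_answer[:500]},
    }
    print(json.dumps(result, indent=2))
    return result

Grade the two answers side by side, by hand or with a judge model, and keep notes on the failure modes rather than just a score.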

Sometimes the best engineering decision is removing a system, not adding one.

Key Takeaways

  • Million-token native context changes the design calculus for agents — many tasks that currently require RAG or aggressive summarization can now be handled with full-context reasoning, reducing architectural complexity and silent failure modes.
  • RAG should be a scaling strategy, not a default — if your relevant context fits within a million tokens, benchmark native context before adding retrieval layers. Simpler architectures are easier to debug and often produce better results.
  • Test your assumptions empirically — run your current agent pipeline against a full-context baseline on DeepSeek-V4. The results might justify ripping out infrastructure you assumed was necessary.

If you're designing agent memory systems today, benchmark against million-token native context before reflexively reaching for retrieval. What agent architecture decisions would you revisit with a reliable million-token window?

274 AI Tools, One Database: Why I Treat Competitors as Curriculum

2026-04-28 16:10:04

This project has a feature called "AI University" — a database of 274 AI tools and services that users can learn from systematically. Here's the design philosophy behind it, and why I deliberately chose curriculum over competitive intelligence.

The Problem That Started It

AI tools are multiplying faster than anyone can track. I knew Claude, GPT-4, and Gemini. But what about MLflow, Ray, BentoML, and Feast? What's the difference between Hugging Face and Weights & Biases?

To go from "AI user" to "AI system designer," I needed structured knowledge of the tooling landscape. Building that knowledge into the product meant I could learn it myself while creating something useful for users.

The 12-Category Taxonomy

Category | Examples | ~Count
Foundation models / APIs | Claude, GPT-4, Gemini | ~40
LLM frameworks | LangChain, LlamaIndex | ~25
Fine-tuning | TRL, PEFT, Unsloth | ~20
MLOps / experiment tracking | MLflow, wandb, Neptune | ~30
Model serving | vLLM, TorchServe, BentoML | ~20
Observability | Arize Phoenix, TruLens | ~15
Vector databases | Pinecone, Weaviate, pgvector | ~20
AI agents | AutoGPT, CrewAI, Dify | ~25
Voice / video AI | ElevenLabs, Sora, Runway | ~30
Coding AI | Claude Code, Copilot, Codex | ~15
Multimodal | GPT-4V, Gemini Vision | ~20
Cloud ML platforms | SageMaker, Vertex AI, Azure ML | ~10

Data Schema

CREATE TABLE ai_university (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  company_name TEXT NOT NULL,
  category TEXT NOT NULL,
  description TEXT NOT NULL,
  key_features JSONB NOT NULL,
  github_stars TEXT,           -- "33k+" format
  difficulty_level INTEGER,    -- 1-10
  relevance_score INTEGER,     -- relevance to this project, 1-10
  official_url TEXT,
  content_md TEXT,
  created_at TIMESTAMPTZ DEFAULT now()
);

Two axes drive the learning paths:

  • difficulty_level: 1–3 = beginner (call the API), 7–10 = advanced (design distributed systems)
  • relevance_score: 9–10 = actively used in this project

Why Curriculum, Not Competitive Intelligence

The competitor framing creates a problem: If ElevenLabs is a "competitor," it becomes something to beat. The productive question isn't "how do we beat ElevenLabs?" — it's "why did ElevenLabs become the benchmark for Japanese voice quality, and what can we learn from that?"

Tools built by thousands of engineers and refined over years are free education. Treating them as enemies closes that channel.

What curriculum framing gives you:

  • Deep understanding of why wandb became the de facto ML experiment tracker → applies to product design decisions
  • Pattern extraction from successful products → better feature design
  • For users: a navigable map of the AI landscape, not just a product pitch

AI-Assisted Content Generation

Writing 274 company profiles manually is not realistic. The pipeline:

1. Company name + category → Claude generates description
2. GitHub stars → fetched via GitHub API
3. difficulty / relevance → scored against project context
4. Supabase migration → managed as SQL seed files

PS#3 instance handles this exclusively: 2–3 companies per session, accumulating daily.
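For a sense of what that pipeline looks like in code, here is a hedged Python sketch. The prompt wording, the model id, and the naive SQL quoting are assumptions to adapt to your own project; the GitHub API field and the Anthropic SDK calls are standard.

import requests
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def github_stars(repo: str) -> str:
    """repo is 'owner/name'; returns a display string like '33k+'."""
    stars = requests.get(f"https://api.github.com/repos/{repo}", timeout=10).json()["stargazers_count"]
    return f"{stars // 1000}k+" if stars >= 1000 else str(stars)

def generate_profile(company: str, category: str) -> dict:
    prompt = (f"Write a three-sentence profile of {company} ({category}) for an AI tooling "
              "curriculum, then list three key features.")
    msg = client.messages.create(
        model="claude-sonnet-4-5",   # substitute whatever model id you use
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return {"company_name": company, "category": category, "content_md": msg.content[0].text}

def to_seed_sql(profile: dict, stars: str) -> str:
    # Naive quoting for illustration only; use parameterized inserts in a real migration.
    return ("INSERT INTO ai_university (company_name, category, description, key_features, github_stars) "
            f"VALUES ('{profile['company_name']}', '{profile['category']}', "
            f"'{profile['content_md'][:200]}', '[]'::jsonb, '{stars}');")

# Example:
#   profile = generate_profile("MLflow", "MLOps / experiment tracking")
#   print(to_seed_sql(profile, github_stars("mlflow/mlflow")))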

Learning Paths by Axis

Beginner track (difficulty 1-3):
  Claude API → OpenAI API → Gemini API

Practical track (relevance 8-10):
  Supabase → Flutter → Firebase → GitHub Actions

Advanced track (difficulty 8-10):
  Ray/Anyscale → Kubeflow → Seldon Core

What 274 Tools Taught Me

LLM framework wars: LangChain vs LlamaIndex vs raw API. In 2026 production, raw API + thin wrapper is the most stable pattern. Heavy frameworks add abstraction cost without enough benefit at small scale.

MLOps convergence: Experiment tracking is wandb vs MLflow. OSS deployments → MLflow. Cloud integration → wandb.

Serving divergence: Real-time → vLLM. Batch → BentoML. Edge → Ollama. The use case determines the tool, and the tools have diverged accordingly.

Summary

Building "AI University" as curriculum rather than competitive analysis produced three outcomes:

  1. A structured AI tooling map for users — a genuinely useful learning resource
  2. Deep domain knowledge for me — better design decisions
  3. Content as an asset — SEO, search traffic, user acquisition

274 tools aren't enemies. They're teachers. Solo founders who treat the existing landscape as curriculum learn faster and build better than those who treat it as a threat.

Agile, After Agile

2026-04-28 16:08:50

First published at memex.ai

Extreme AI Programming #1 — Agile, After Agile

Kent Beck in 1999. A stuck team that hadn't shipped in months. The developer who is no longer a person. A manifesto without a successor. Vibe coding versus the instrument. About two years until the door closes.

Kent Beck published Extreme Programming Explained in 1999. I read it as a young engineer running a small team that kept getting lost in its own commitments and had not shipped anything in months. By the time I finished it, and later met Kent in person, it had shaped how I thought about the job for the next twenty-five years. Pair programming, TDD, short iterations. It helped seed the Agile movement that followed, which became the way most serious software was built and most serious engineering teams were run.

The world it was written for no longer exists.

Agile evolved through Scrum, Lean, Kanban and continuous delivery. Every adjustment refined the same underlying problem: how a group of humans working on the same software stays coordinated. None of us anticipated what would happen when the team members were not all human.

In any serious AI-native team today, the developer is increasingly not a person. Nor, often, is the tester. The developer is an agent (Claude Code, Codex, Copilot, Cursor, Windsurf, with Aider, Cline and OpenCode on the open-source side), and often it is several of them at once, each running its own conversation with a different member of the team, each producing work on its own cadence. The coordination problem has changed shape entirely. Agile assumed developers on one side of the table and customers on the other. That is no longer what the table looks like.

The Agile Manifesto, signed in 2001, needs a successor.

This series, Extreme AI Programming, is an unsubtle homage to Beck. Without the work he, Martin Fowler and the others did in the late nineties, there is nothing to build on here. It is an attempt to describe what a serious, professional discipline for building software with AI agents actually looks like, rooted in their work but reshaped for a world they could not yet see.

A short note on where this is coming from. I have spent more than thirty-five years running software development teams across the companies I have founded and exited. Today I am co-founder and CEO of Mindset AI, where for over a year we have been an AI-native company by design, and the practices I plan to write about here are the ones we have been evolving in real time. This is not observation from a distance.

Here is the thesis of the series, compressed.

Agile is, at heart, a coordination protocol between humans. When most of the implementation is being done by agents that never forget, never tire and never need to be motivated on a Monday morning, the shape of the coordination problem changes. The hard part is no longer keeping a team of engineers aligned. It is keeping humans, agents and the codebase itself aligned with the decisions the team has actually made, and making sure each agent acts on the current set of decisions rather than an old one, a half-remembered one, someone else's private one, or worse, no one's at all.

Instrument or slot machine?

There are two ways of working with AI. One treats it as a genuine instrument to be mastered. The other treats it as a slot machine.

The slot-machine version has a name. It is called vibe coding. Prompt, accept, ship. If the code runs, move on. If it doesn't, prompt again. It feels productive in the moment, and for a weekend project it is perfectly fine. As a professional practice it is quietly catastrophic.

You end up with codebases the team cannot evolve. Decisions the team does not remember making, or did not actually make. A continuous accumulation of incoherence nobody can see until it breaks something expensive.

The instrument version is harder. It requires that you be articulate.

Being technically articulate as a human has become, almost overnight, the most economically valuable skill in software. Not because humans have any mystical edge over machines, but because no agent does its best work without one telling it precisely what to build.

Agents are excellent at writing code when they know what they are meant to be writing. They are also quite capable of producing fluent, confident code that does precisely the wrong thing, and doing so very quickly. The same goes for the markdown documents they produce to describe the code. The difference between those outcomes is almost entirely the precision of the brief they were given. An agent is only as good as its specification, and only as good as the data it can see. The better the agent gets, the more unforgiving the gap between what you said and what you meant becomes.

The alternative, and the subject of most of what I plan to write, is a more disciplined way of working:

  • Intent set clearly up front.
  • Decisions captured as they are made.
  • Rules for how software gets built, written down somewhere every agent and every engineer on the team will actually read.
  • Plans reviewed before code is written, rather than code reviewed after the fact.

None of this is new in spirit. A reasonable reader could point out that it is roughly what good engineering has always looked like. The cast has changed, though, and the practices have to change with it.

Next week I want to look at the first response the industry has already reached for: the CLAUDE.md file, the cursor-rules file, and most recently Anthropic's Skills, which are the moment's hot topic. Every AI-native team has independently invented some version of these. They are useful, but they are band-aids, and the first implementations break in predictable ways.

Spoiler: Skills are not the answer. In their current form, they exacerbate the problem.

The harder question

There is a harder question underneath all of this, and the industry has been slow to talk about it openly.

The junior job market has, over the past year or so, all but disappeared. A senior engineer with a competent agent now ships what used to take a team of five. The obvious economic move is to hire the senior and skip the juniors. That move is being made everywhere, quietly.

The mechanism is depressing in its specifics: graduates submit hundreds, sometimes thousands, of applications to AI-driven ATS systems that reject them before any human reads their CVs.

The part that genuinely worries me is not, in itself, that our profession will shrink, even though it means good people will be out of work just as society needs more opportunities, not fewer. Professions reshape themselves all the time, and ours has done it before.

What makes the calculation work is articulation. The agent's output is only as good as the brief it is given, and writing that brief takes the kind of judgment a senior has spent years earning. A junior cannot yet bring that, not because they are incapable, but because the instinct takes time.

What worries me more is that the people currently being trained to enter the profession, in universities, on boot camps, in the first year of a graduate scheme, will arrive to find that the first rung of the ladder is no longer there. They will have done the work, paid the fees, learned the craft, and the jobs they were being prepared for will not exist in the form they were promised.

Unless those of us already inside the profession reshape it around what the next generation still has to bring, we will run on the experience we have now and stop restocking. The closing chapters of this series come back to what that reshape needs to look like in practice.

The timing is unforgiving: about two years, and after that the door does not reopen.

Why I'm writing this

That is the pessimistic version of the argument. The optimistic version is the reason I am writing this series at all.

If we find genuinely good ways of collaborating with machines, and if we build the new ceremonies and disciplines a team actually needs when several of its members do not sleep, then what we have been handed is not a diminishing of the profession but an enlargement of it.

The mechanical middle of the job, the typing, can be delegated. What is left is human creativity and human judgment, amplified by instruments that can turn a well-framed intention into working software in an afternoon. The point of this series is to work out, in public, how to build that discipline, so that we exploit what humans are actually good at while the machines do the work.

There is a great deal of talk at the moment about whether our profession is coming to an end. I don't believe a word of it.

For anyone who has been in software for a long time, this is the most interesting moment in the discipline since the arrival of the web. The barrier of programming language, the particular question of which one you happen to know, has quietly become less important than it has ever been. What matters now is clarity of intent and the ability to articulate what you want in enough detail that a competent instrument can produce it. That specific skill has always been scarce. For the first time, it is directly economically productive.

Does this need a new manifesto?

At the start I said the Agile Manifesto, signed in 2001, needs a successor. That leaves a question I have been turning over for a while: who should write it?

Over the last eighteen months, several people have already tried. None has yet had its Snowbird moment, but the conversation is well underway, and the candidates worth knowing are these.

  • Casey West, The Agentic Manifesto (November 2025). Modelled directly on the original ("while there is value in the items on the right, we value the items on the left more"), it pairs four new values with a five-phase Agentic Delivery Lifecycle. The central problem he names is the "determinism gap": the move from "did it do what I said" to "did it do what I wanted". The most-cited of the candidates so far, and the closest in shape to the original.

  • Shay Cohen at Wix Engineering, The AI Coding Agent Manifesto (April 2026). Five values written from inside an engineering team that has been living with agents for a year: contracts over conventions, verification over generation, vanilla over clever, types over tests, explicit over implicit. The most practitioner-shaped of the bunch.

  • Ry Walker and Jonathan Vanderford, The AIFSD Manifesto. Eleven principles, two of which carry most of the weight: "AI is your intern, not your boss", and "the human always has the last word". A hard line on responsibility, written for a moment when the temptation to delegate accountability is real.

There are others circulating, including Mircea Trofimciuc's earlier agenticmanifesto.org from May 2025, and a steady stream of essays calling for a successor without yet committing one to a fixed text. The list is open.

I take some comfort in not being alone on the deferral. Asked at Thoughtworks' twenty-fifth-anniversary retreat in February 2026 whether there should be a new manifesto for AI development, Martin Fowler said:

It's way too early. I don't have a lot of time for manifestos.

Of the original Snowbird signatories, he is the only one to have spoken publicly on the meta question.

My intention with this series is not to add another. The world does not need yet another manifesto written by someone who has not yet earned the seventeen co-signatories the original had. What I would rather do, in the spirit of open source, is back one of the existing candidates: read them carefully, write about them seriously, and put my weight behind the strongest of them. By the end of this series I will have made that choice explicit and said why.

Over the next few months, the series will cover what I think the new discipline actually contains. The roles and how they have recomposed, including which Agile ceremonies still make sense and which do not. The new first-class artefacts, which in my view are decisions, blueprints and execution plans. How to review work produced by an agent without becoming its babysitter. The economics of running a team where one engineer and three agents produce what five engineers used to, and the things that get harder rather than easier. And occasionally, because there is no point pretending otherwise, the commercial work I am doing with Mindset AI sits inside the argument.

The rhythm of the series will alternate between two registers. This first piece has been philosophical, sitting with the manifestos and the broader argument about the discipline. Next week is closer to the keyboard, practical: the artefacts every AI-native team has already reinvented half a dozen times in the last year, and why the first versions break in the same places. Both registers matter, and neither does the work alone.

If the argument resonates, I would be glad of the company. If it doesn't, I would be glad to hear why. I would rather be argued with now than wrong in print later.

— Barrie

I am co-founder and CEO of Mindset AI, where we are building Memex AI, a decision and knowledge layer for AI-native engineering teams. This series is the thinking that shapes our product. I will flag it explicitly when an article touches something we build. Most of it is simply where the industry is going, with or without us.