
83% Accuracy: How We Reverse Engineered Amazon's Dynamic Pricing Algorithm

2026-03-02 05:31:24

Hero Image - Reverse Engineering Amazon's Price Algorithm

Six months ago, we asked a simple question at Avluz.com: "Can we predict when Amazon will drop prices on products?" Today, our system forecasts price changes with 83% accuracy across 50,000 products, processing 7.3 price updates per product daily. But here's the thing—the journey to get here taught us more about e-commerce algorithms than any documentation ever could.

This isn't a theoretical post. This is the complete technical breakdown of how we built, tested, and deployed a system that reverse-engineered Amazon's dynamic pricing patterns to power our deal discovery and price tracking platform.

```python
                price = float(re.search(r'[\d,]+\.?\d*', price_text).group().replace(',', ''))
                break

        if not price:
            return None

        return {
            'asin': asin,
            'price': price,
            'timestamp': datetime.utcnow().isoformat(),
            'url': product_url,
            'availability': self._check_availability(soup)
        }

    except Exception as e:
        print(f"Error scraping {asin}: {str(e)}")
        return None

def _check_availability(self, soup):
    """Check if product is in stock"""
    availability = soup.select_one('#availability span')
    if availability:
        return 'in_stock' if 'In Stock' in availability.get_text() else 'out_of_stock'
    return 'unknown'
```

**Key Implementation Details:**
- Rotating user agents to avoid detection
- Multiple fallback selectors (Amazon changes HTML structure frequently)
- Availability tracking (critical for inventory-based pricing)
- Proper error handling and timeouts

We run this scraper every 2 hours for 50,000 products. That's 600,000 requests per day, which brings us to...
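The bullet points above gloss over the request layer; as a hedged sketch (the user-agent strings and retry policy here are illustrative stand-ins, not our production values), rotation with backoff can be as simple as:

```python
import random
import time
import urllib.request

# Hypothetical pool of user agents -- any realistic browser strings work here
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def fetch_with_rotation(url, max_retries=3, timeout=10):
    """Fetch a page with a rotating user agent and simple exponential backoff."""
    for attempt in range(max_retries):
        req = urllib.request.Request(
            url, headers={"User-Agent": random.choice(USER_AGENTS)}
        )
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.read()
        except Exception:
            time.sleep(2 ** attempt)  # back off between retries
    return None  # give up after max_retries attempts
```

In production you would add proxy rotation and per-domain rate limiting on top, but the rotation-plus-backoff shape stays the same.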

### Component 2: MongoDB Time-Series Database

![MongoDB Query Code](https://www.genspark.ai/api/files/s/C8BIaNAu?cache_control=3600)

Storing 600,000 price points daily requires efficient time-series storage. MongoDB's time-series collections were perfect for this.


```javascript
// MongoDB schema for price history
db.createCollection("price_history", {
  timeseries: {
    timeField: "timestamp",
    metaField: "product",
    granularity: "hours"
  }
});

// Aggregation pipeline for pattern detection
const priceTrends = await db.price_history.aggregate([
  {
    $match: {
      "product.asin": productAsin,
      timestamp: {
        $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) // Last 30 days
      }
    }
  },
  {
    $group: {
      _id: {
        hour: { $hour: "$timestamp" },
        dayOfWeek: { $dayOfWeek: "$timestamp" }
      },
      avgPrice: { $avg: "$price" },
      minPrice: { $min: "$price" },
      maxPrice: { $max: "$price" },
      priceChanges: { $sum: 1 },
      stdDev: { $stdDevPop: "$price" }
    }
  },
  {
    $sort: { "_id.dayOfWeek": 1, "_id.hour": 1 }
  }
]).toArray();

// Calculate price volatility
const volatility = priceTrends.map(trend => ({
  timeSlot: `${trend._id.dayOfWeek}-${trend._id.hour}`,
  volatilityScore: (trend.stdDev / trend.avgPrice) * 100,
  changeFrequency: trend.priceChanges
}));
```


**Database Performance:**
- **Write throughput**: 8,000 inserts/second
- **Query latency**: 45ms average for 30-day aggregations
- **Storage efficiency**: 2.4GB per million records (compressed)
- **Index strategy**: Compound index on (asin, timestamp)
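Those throughput and storage figures make capacity planning easy to sanity-check; a quick sketch using only the numbers quoted above:

```python
# Capacity estimate from the figures above: 600,000 price points per day,
# ~2.4 GB per million records after compression.
RECORDS_PER_DAY = 600_000
GB_PER_MILLION = 2.4

def storage_gb(days):
    """Compressed storage needed for the given retention window, in GB."""
    return RECORDS_PER_DAY * days / 1_000_000 * GB_PER_MILLION

# A 30-day window holds 18M records in roughly 43 GB
monthly = storage_gb(30)
```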

This powers the real-time analysis, but the magic happens in the ML model.

### Component 3: Machine Learning Model

![ML Model Code](https://www.genspark.ai/api/files/s/Ydli5Gd3?cache_control=3600)

The prediction model is where we encode everything we learned about Amazon's patterns.


```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

class PricePredictionModel:
    def __init__(self):
        self.model = RandomForestRegressor(
            n_estimators=200,
            max_depth=15,
            min_samples_split=10,
            random_state=42
        )
        self.scaler = StandardScaler()

    def engineer_features(self, price_history, product_metadata=None):
        """
        Create features from raw price data.
        Returns: feature matrix for the ML model.
        """
        product_metadata = product_metadata or {}
        df = pd.DataFrame(price_history)
        timestamps = pd.to_datetime(df['timestamp'])

        # Time-based features
        df['hour'] = timestamps.dt.hour
        df['day_of_week'] = timestamps.dt.dayofweek
        df['day_of_month'] = timestamps.dt.day
        df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
        df['is_month_end'] = (df['day_of_month'] >= 28).astype(int)

        # Price history features (12 samples per day at a 2-hour cadence)
        df['price_ma_24h'] = df['price'].rolling(window=12, min_periods=1).mean()
        df['price_ma_7d'] = df['price'].rolling(window=84, min_periods=1).mean()
        df['price_std_24h'] = df['price'].rolling(window=12, min_periods=1).std()
        df['price_change_rate'] = df['price'].pct_change()
        df['price_volatility'] = (df['price'].rolling(window=24).std()
                                  / df['price'].rolling(window=24).mean())

        # Competitor pricing features
        competitor_prices = product_metadata.get('competitor_prices')
        df['competitor_min'] = min(competitor_prices) if competitor_prices else df['price']
        df['price_vs_competitor'] = (df['price'] - df['competitor_min']) / df['competitor_min']

        # Inventory signals
        df['low_stock'] = int(product_metadata.get('stock_level', 100) < 10)

        # Demand indicators
        df['search_volume'] = product_metadata.get('search_trend', 0)
        df['sales_rank'] = product_metadata.get('sales_rank', 999999)

        # Drop non-numeric columns before scaling
        df = df.drop(columns=['timestamp', 'asin', 'url', 'availability'], errors='ignore')
        return df.fillna(0)

    def train(self, historical_data, future_prices):
        """Train the model on historical data"""
        X = self.engineer_features(historical_data)
        y = future_prices  # Price 24 hours ahead

        X_scaled = self.scaler.fit_transform(X)
        self.model.fit(X_scaled, y)

        return self.model.score(X_scaled, y)  # R² score

    def predict_next_price(self, recent_data, product_metadata):
        """Predict price for the next time window"""
        X = self.engineer_features(recent_data, product_metadata)
        X_scaled = self.scaler.transform(X)

        prediction = self.model.predict(X_scaled[-1:])
        confidence = self.model.score(X_scaled, X['price'])

        return {
            'predicted_price': float(prediction[0]),
            'confidence_score': float(confidence),
            'current_price': float(X['price'].iloc[-1]),
            'predicted_change': float(prediction[0] - X['price'].iloc[-1])
        }
```

**Model Performance Evolution:**

![Performance Chart](https://www.genspark.ai/api/files/s/0jlfX9Yq?cache_control=3600)

- **Month 1**: 47% accuracy (baseline linear regression)
- **Month 2**: 62% accuracy (added time features)
- **Month 3**: 71% accuracy (competitor price features)
- **Month 4**: 78% accuracy (demand signals integrated)
- **Month 5**: 81% accuracy (ensemble methods)
- **Month 6**: 83% accuracy (hyperparameter tuning)

### System Flow

![Process Flowchart](https://www.genspark.ai/api/files/s/1tIRoj7i?cache_control=3600)

The complete workflow:

1. **Scraper** collects price every 2 hours
2. **Detection** identifies if price changed
3. **Analysis** compares to historical patterns
4. **ML Prediction** forecasts next change
5. **Alert Generation** notifies users of opportunities
6. **Feedback Loop** improves model with results
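The six steps above can be sketched as a single orchestration function (all names and stand-in callables here are illustrative, not our production code):

```python
# Illustrative skeleton of the scrape -> detect -> predict -> alert loop.
# Each stage is a pluggable callable; the defaults below are stand-ins.

def run_cycle(asin, scrape, history, model, alert):
    """One pass of the deal-discovery loop for a single product."""
    snapshot = scrape(asin)                              # 1. collect current price
    previous = history.get(asin)
    changed = previous is None or snapshot != previous   # 2. detect a price change
    history[asin] = snapshot                             # 3. update the pattern store
    if not changed:
        return None
    prediction = model(asin, snapshot)                   # 4. forecast the next change
    alert(asin, prediction)                              # 5. notify users
    return prediction                                    # 6. feed back into retraining

# Minimal stand-ins to exercise the skeleton
sent = []
pred = run_cycle(
    "B000TEST",                                  # hypothetical ASIN
    scrape=lambda a: 19.99,
    history={},
    model=lambda a, p: round(p * 0.95, 2),       # toy "predict a 5% drop" model
    alert=lambda a, p: sent.append((a, p)),
)
```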

### Technology Stack

![Technology Stack](https://www.genspark.ai/api/files/s/MiqDXvGn?cache_control=3600)

**Frontend:**
- React with Material-UI for dashboard
- Chart.js for price visualization
- WebSocket for real-time updates

**Backend:**
- Node.js Express API
- Python Flask for ML serving
- Redis for caching hot predictions

**Data Layer:**
- MongoDB for time-series storage
- Redis for session state
- S3 for raw HTML archives

**Infrastructure:**
- AWS Lambda for scraping jobs
- CloudWatch for monitoring
- API Gateway for public API

## The Optimization Phase: From 62% to 83%

Getting from decent accuracy to production-grade prediction required obsessive optimization.

### Failed Optimizations

Let me be honest about what didn't work:

**Attempt 1: Deep Learning**
- Tried LSTM networks for time-series prediction
- Result: 58% accuracy (worse than Random Forest)
- Reason: Not enough data per product for deep learning

**Attempt 2: Real-Time Competitor Scraping**
- Added scraping of 10 competitor sites
- Result: Minimal accuracy improvement (2%)
- Cost: 4x infrastructure costs
- Decision: Removed from production

**Attempt 3: Review Sentiment Analysis**
- Hypothesis: Review sentiment predicts pricing
- Result: No correlation found
- Lesson: Focus on direct price signals

### Successful Optimizations

What actually moved the needle:

**1. Category-Specific Models** (+7% accuracy)


```python
# Instead of one model for all products
models = {
    'electronics': RandomForestRegressor(max_depth=20),
    'books': GradientBoostingRegressor(max_depth=10),
    'home': RandomForestRegressor(max_depth=15)
}
```


Electronics prices follow different patterns than books. Category-specific models captured these nuances.

**2. Temporal Cross-Validation** (+4% accuracy)


```python
# Time-based splits instead of random splits
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    model.fit(X[train_idx], y[train_idx])
    score = model.score(X[test_idx], y[test_idx])
```


Random cross-validation was leaking future information into training. Time-based splits fixed this.
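To make the leakage point concrete, here is a minimal pure-Python version of an expanding-window split (an illustration of the idea behind TimeSeriesSplit, not the sklearn implementation): every fold trains strictly on samples that come before the test window.

```python
def time_series_splits(n_samples, n_splits):
    """Expanding-window splits: train indices always precede test indices."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = list(range(0, k * fold))                # everything before the fold
        test = list(range(k * fold, (k + 1) * fold))    # the next chronological fold
        yield train, test

# With 12 samples and 3 splits, each fold trains strictly on earlier data
splits = list(time_series_splits(12, 3))
```

A random split would scatter future rows into the training set, letting the model "see" prices it is supposed to predict.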

**3. Feature Interaction Terms** (+3% accuracy)


```python
# Interaction between time and price volatility
df['evening_volatility'] = df['is_evening'] * df['price_volatility']
df['weekend_demand'] = df['is_weekend'] * df['search_volume']
```




The interaction between time of day and price volatility was more predictive than either feature alone.

### ROI Analysis

![ROI Calculation](https://www.genspark.ai/api/files/s/NEvBrARG?cache_control=3600)

**Investment:**
- Development: $12,000 (3 months, 2 engineers)
- Infrastructure: $300/month AWS costs
- Total Year 1: $15,600

**Returns:**
- Labor savings: $48,000/year (40 → 2 hours/week manual work)
- Better deal pricing: $18,000/year (improved conversion)
- Total Annual Return: $66,000

**ROI: 323% in first year**

This approach now powers Avluz.com's deal recommendation engine, helping millions of shoppers catch price drops at the perfect moment.

## The Lessons: What We'd Do Differently

Looking back, here's what we learned:

### Technical Lessons

**1. Start Simple, Add Complexity Incrementally**
Our initial attempt used complex deep learning. Random Forest with good features outperformed it. Start simple, add complexity only when needed.

**2. Data Quality > Data Quantity**
50,000 products with clean data beats 500,000 with noisy data. We spent 2 weeks cleaning outliers and anomalies. That investment paid off.

**3. Domain Knowledge > Algorithm Choice**
Understanding *why* Amazon prices change (inventory, competitors, demand) was more valuable than trying every ML algorithm. Talk to your domain experts.

### Business Lessons

**1. Predict, Don't Just React**
The shift from reactive ("price changed!") to predictive ("price will change in 6 hours") transformed our value proposition.

**2. User Trust Requires Transparency**
We show confidence scores with predictions. Users trust "83% confident" more than "will drop" without context.

**3. Continuous Learning Is Essential**
Amazon's algorithm evolves. Our model retrains weekly with new data. Static models decay rapidly in e-commerce.

## What's Next at Avluz.com

We're expanding this approach in three directions:

**1. Multi-Retailer Prediction**
Applying these techniques to Target, Walmart, and Best Buy. Early tests show 76% accuracy—close to Amazon's 83%.

**2. Bundle Optimization**
Predicting optimal times to buy multiple products together. "Wait 3 days for laptop, buy monitor today" type recommendations.

**3. Open Source Toolkit**
Planning to release our feature engineering library as open source. The ML community helped us; time to give back.

**4. Real-Time Alert Improvements**
Moving from email alerts to push notifications with WebSocket connections. Users get price drop alerts within 30 seconds.

## Recommendations for Engineering Teams

If you're building similar systems, here's my advice:

### For Individual Engineers

**Start with data collection:** You need 3-6 months of historical data before ML makes sense. Start scraping now.

**Use existing tools:** Don't build scrapers from scratch. Libraries like Scrapy and Selenium are battle-tested.

**Validate constantly:** Test predictions against holdout data weekly. Models drift faster than you expect.
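As an illustration of such a weekly check (the 5% tolerance and the metric here are assumptions for the sketch, not our production definition of accuracy), a holdout validation can be a few lines:

```python
def holdout_accuracy(predicted, actual, tolerance=0.05):
    """Share of predictions within a relative tolerance of the actual price."""
    hits = sum(
        abs(p - a) / a <= tolerance
        for p, a in zip(predicted, actual)
    )
    return hits / len(actual)

# Example weekly check against a small holdout set
acc = holdout_accuracy([10.0, 20.0, 31.0], [10.2, 20.1, 35.0])
```

Tracked week over week, a falling score is the earliest visible sign of model drift.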

### For Engineering Teams

**Invest in infrastructure early:** We initially underestimated storage and compute needs. MongoDB's time-series collections saved us.

**Build feedback loops:** Our model improves because we track actual vs. predicted prices and retrain.

**Consider ethics:** Price prediction can be used for price discrimination. We use it to help consumers, not exploit them.

### For Companies

**This is a marathon, not a sprint:** It took us 6 months to reach 83% accuracy. Budget for iterative improvement.

**Partner with domain experts:** Our deals team's insights were as valuable as our ML expertise.

**Prepare for maintenance:** Amazon changes their site structure monthly. Budget for ongoing scraper maintenance.

## Code Repository

Want to build your own price prediction system? Key resources:

- [MongoDB Time-Series Documentation](https://www.mongodb.com/docs/manual/core/timeseries-collections/)
- [scikit-learn RandomForest Guide](https://scikit-learn.org/stable/modules/ensemble.html#forest)
- [Scrapy Best Practices](https://docs.scrapy.org/en/latest/topics/practices.html)
- [AWS Lambda for Web Scraping](https://aws.amazon.com/blogs/compute/web-scraping-at-scale-with-aws-lambda/)

## Final Thoughts

Reverse engineering Amazon's pricing algorithm taught us that modern e-commerce is a real-time game. Prices adjust to demand, inventory, and competition within minutes. Static price tracking isn't enough—you need prediction.

The system we built processes 7.3 price changes per product per day across 50,000 products. That's 365,000 price updates daily. And we can predict the next change with 83% accuracy.

But here's the most important lesson: The algorithm is just the beginning. The real value comes from helping real people save money on products they actually want to buy. That's what drives us at [Avluz.com](https://avluz.com).

---

## 💬 Discussion Questions

1. Have you experimented with price prediction for e-commerce? What accuracy did you achieve?
2. What other factors do you think influence Amazon's pricing algorithm that we might have missed?
3. For large-scale web scraping, do you prefer Scrapy, Selenium, or cloud-based solutions?
4. How do you handle model drift when external algorithms (like Amazon's) are constantly evolving?

---

*Engineering insights from [Avluz.com](https://avluz.com) - Where millions discover deals, coupons, and price drops daily. Follow our engineering blog for more deep dives into e-commerce ML, real-time data processing, and scalable web scraping.*

🤑 A Price for Everyone 🤑

2026-03-02 05:25:33

This is a submission for the DEV Weekend Challenge: Community

The Community 👥

I'm an avid card player. You know, the kind that used to play once a week, and now maybe 5 times a year 🤭 I started playing Magic: The Gathering back in high school when Invasion was dropping. We would go to the top of the gym and use stacked gymnastics mats to play Magic all the time, crack new packs, and have the thrill of getting a foil. Cracking packs of Magic used to be so fun, knowing you might get a rare foil or some interesting mythic. Recently, Magic has kind of let my friends and me down. With them releasing IPs like TMNT, Spider-Man, and others, and with foils being in every pack, cracking a pack does not feel special anymore.

Enter Sorcery: Contested Realm, a card game that has that exact same thrill of opening packs and getting a foil maybe 1 out of 6 packs. What makes cracking packs so fun is getting that single rare card, or a curiosa, that could have HUGE value. This game is in its early stages too, but it already has cards valued at over $1,000 🤑 (curiosas, rare foils)!

Most people who collect TCG cards of value often know about TCGPlayer.com. This site sets the standard when it comes to pricing TCG cards on the market, and Sorcery: Contested Realm price data is already on it. Sorcery also has its own website for card information, deck building, and collection management. So what's the issue?

Let's go with S:CR (Sorcery: Contested Realm) for the rest of this entry. S:CR's website, Curiosa.io, is an incredible website. But some friends and I have always wanted to be able to price our collection and decks. I mean, how cool would that be?! To be able to check our collection and see how "rich" 😆 we are. But in all honesty, I always thought it would be cool if Curiosa.io built a way to check the price of each deck and keep up with your collection value.

What I Built 👾

Big brain thought: why not just do exactly this ourselves?! The ability to have price information and total value information on the website would be absolutely incredible. The site has been out 2+ years and no one has created anything to allow this.

So, let's go through the "app" I built (😵‍💫,😤,😂). I've been trying to think of ways I could implement this for our community of Sorcery card players, and I wanted this tool for everyone. Since TCGPlayer and Curiosa.io both run on React and a virtual DOM, the way they load information makes them almost impossible to scrape with static HTML parsing. So how can we scrape price data from TCGPlayer?

In comes MutationObserver. I had never thought to use this. After several attempts on my own trying to figure out how to scrape the data, I had Copilot help me understand how TCGPlayer loads its data and how I could grab that information. A MutationObserver watches the DOM and waits for certain data to appear before the scraper kicks in.

Now that we know how we're going to scrape the data, what do we build that can scrape it, match data on a page, and inject data into that page? A Chrome extension!! And this was my very first time creating an extension for Chrome.

In building the Chrome extension, I learned how to create a manifest for the Chrome extension environment and how the file structure for an extension works. I think one of the coolest things about extensions is how you can just refresh the extension and it grabs the newest bit of code with each new version. So let's break down the versions.

1.0.1 - Users have to scrape TCGPlayer for data, built inside the extension, link for scraping and all. Can see price data only for decks on Curiosa.io; card names are matching and price injection is happening. But at this point it's not accurate: we are just pulling the first card name we see, not considering foils or Alpha/Beta versions of the same card.

1.0.2-5 - Price data is now accurate for "Alpha/Beta" versions. Let me tell you, and excuse my French, but that was a bitch to handle. The way Curiosa.io works is via tRPC batch requests. With the extension I had to intercept the fetch request, create the same exact request for card information, and match the data for the set it's under. The set is shown when you click on a card, so I also had to set up the code to automate the clicks, show a loading screen, then inject the price data correctly.

1.0.6 - Now we have foils pulling in, very nicely as well. They have this nice yellow badge on the card name along with the price and all cards are now injecting correct set pricing.

1.0.7 - Added Total cards and Total value of each deck to the top of the deck.

1.0.8 - Created a collection value and used the same name match and price injection to show correct information within the collection.

1.0.9 - Created a link on the price badge that someone can click to go to the card's price information on TCGPlayer.com.

1.0.10 - Finessed the wording for the plugin, tested to make sure all price data was matching correctly in both decks and collection, and optimized the code some.

1.1.0 - Created a .crx file extension for testing.

Note: each user has to scrape the data manually each time to get updated prices and card information from TCGPlayer. I was afraid of hitting TCGPlayer with tons of scraping traffic daily from every loaded copy of the extension in Chrome, so I made the scraping happen manually.

Demo 🎥

This is a video demo of the Sorcery extension being used live.

Code 🧬

🔮 Sorcery Price Tracker — Chrome Extension

Automatically displays TCGPlayer market prices for Sorcery: Contested Realm cards on Curiosa.io, including deck totals.

File Structure

sorcery-price-tracker/
├── manifest.json          # Extension config + permissions
├── background.js          # Service worker: storage, canonical names, messaging
├── content_tcgplayer.js   # Scraper: runs on TCGPlayer, collects card prices
├── content_curiosa.js     # Injector: runs on Curiosa.io, displays prices
├── popup.html             # Extension popup UI
├── popup.js               # Popup logic
└── icons/                 # Add icon16.png, icon48.png, icon128.png here

Setup

  1. Open Chrome and go to chrome://extensions
  2. Enable Developer Mode (top right toggle)
  3. Click Load unpacked and select this folder
  4. The extension icon will appear in your toolbar

How to Use

Step 1 — Scrape Prices from TCGPlayer

  1. Click the extension icon → "Open TCGPlayer to Scrape"
  2. The TCGPlayer Sorcery listings page will open
  3. Scroll through the results — prices are saved automatically as cards load
  4. Navigate to page 2…

Here is a sample of the code I used to scrape TCGPlayer.com website for card data.

content_tcgplayer.js

```javascript
function observeDOM() {
  const observer = new MutationObserver(() => {
    clearTimeout(debounceTimer);
    debounceTimer = setTimeout(tryScrape, DEBOUNCE_MS);
  });
  observer.observe(document.body, { childList: true, subtree: true });
}

function tryScrape() {
  const results = scrapeCurrentPage();
  const count = Object.keys(results).length;
  if (count === 0) return;

  scrapedOnThisPage = true;
  const slug = getSetSlugFromUrl();
  updateBanner(`Scraped ${count} entries from ${getSetDisplayName(slug)} — saving...`);

  chrome.runtime.sendMessage(
    { type: 'SAVE_PRICES', payload: results },
    (response) => {
      if (response?.success) {
        updateBanner(`${getSetDisplayName(slug)}: ${count} entries saved (${response.count} total). Open another set's price guide to continue.`);
      }
    }
  );
}

// ─────────────────────────────────────────────
// Scrape the price guide table.
// Captures card name, price, AND the TCGPlayer
// url from the card.
// ─────────────────────────────────────────────
function scrapeCurrentPage() {
  const prices = {};
  const slug = getSetSlugFromUrl();
  const setDisplayName = getSetDisplayName(slug);

  const rows = document.querySelectorAll('.tcg-table-body tr');
  if (rows.length === 0) return prices;

  rows.forEach((row) => {
    const nameEl = row.querySelector('.pdp-url');
    const cells  = row.querySelectorAll('.tcg-table-body__cell');

    if (!nameEl || cells.length <= PRICE_COLUMN_INDEX) return;

    const rawName  = nameEl.innerText?.trim();
    const rawPrice = cells[PRICE_COLUMN_INDEX]?.innerText?.trim();

    if (!rawName || !rawPrice) return;
    if (rawPrice === '-' || rawPrice === '' || rawPrice === 'N/A') return;

    const priceFloat = parseFloat(rawPrice.replace(/[^0-9.]/g, ''));
    if (isNaN(priceFloat) || priceFloat === 0) return;

    // Capture the TCGPlayer PDP URL from the anchor tag
    const href = nameEl.href || nameEl.closest('a')?.href || null;
    const pdpUrl = href && href.startsWith('http') ? href : null;

    const foil = isFoilName(rawName);
    const baseName = foil ? stripFoilSuffix(rawName) : rawName;

    const priceObj = {
      display:   rawPrice,
      value:     priceFloat,
      set:       setDisplayName,
      setSlug:   slug,
      foil:      foil,
      url:       pdpUrl,
      source:    'tcgplayer',
      scrapedAt: Date.now(),
    };
```
  1. observeDOM() - Opens a new MutationObserver to watch the body of the page.
  2. tryScrape() - Creates a small banner modal at the bottom of the screen, updating its value, showing the data being scraped, and showing the exact number of entries stored.
  3. scrapeCurrentPage() - When on TCGPlayer.com, it attempts to scrape the data based on these selectors; if the classes match, it auto-scrapes the page while the extension is on.

For a better understanding of the file structure, here is what each file in this extension does:

  • manifest.json = Loads the extension data for Chrome
  • background.js = Runs in background, using SorceryAPI, to pull card data.
  • content_curiosa.js = Runs when on Curiosa.io website
  • content_tcgplayer.js = Runs when on TCGPlayer.com

How I Built It 🛠️

The way I created the extension was simply with JavaScript: it's the only thing it needed. I also used Claude Sonnet in the browser to help me understand the way TCGPlayer loaded its data and the way Curiosa.io was fetching its card information through tRPC requests.

Python to Rust to Proof: Cross-Validating a ZK System

2026-03-02 05:12:48

Three implementations of the same physics. Python floating-point. Python scaled-integer. Rust BigInt. They must agree bit-for-bit, because the ZK proof is only as correct as the reference implementation it was validated against.

Here's how we cross-validate a ZK system across languages and number representations.

The Three Implementations

Our zk-physics project maintains three implementations of the Rayleigh-Plesset simulation:

1. Python Float (reference/simulate.py)

The ground truth for physics. Uses NumPy float64 — the same representation a physicist would use. Human-readable, easy to modify, trusted.

```python
@dataclass
class SimulationParams:
    R0: float     # Equilibrium radius (m)
    P0: float     # Ambient pressure (Pa)
    Pa: float     # Acoustic amplitude (Pa)
    freq: float   # Frequency (Hz)
    gamma: float  # Polytropic exponent
    sigma: float  # Surface tension (N/m)
    mu: float     # Viscosity (Pa*s)
    rho: float    # Density (kg/m³)
    T0: float     # Initial temperature (K)
    dt: float     # Timestep (s)
```
dt: float # Timestep (s)

This implementation answers the question: "given these physical parameters, what does the simulation produce?"

2. Python Scaled-Integer (reference/simulate_scaled.py)

The bridge between float physics and field arithmetic. Uses Python's arbitrary-precision integers with SCALE = 10**30:

```python
SCALE = 10**30

def smul(a, b):
    return (a * b) // SCALE

def sdiv(a, b):
    return (a * SCALE) // b
```

Same physics, different arithmetic. This implementation answers: "does the scaled-integer representation produce the same results as floating-point?"
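To see the float/scaled agreement in miniature, the smul helper can be exercised directly (redefined here so the snippet is self-contained; the 1.5 × 2.25 operands are illustrative):

```python
SCALE = 10**30

def smul(a, b):
    return (a * b) // SCALE

def to_scaled(x):
    """Convert a float to the scaled-integer representation."""
    return int(x * SCALE)

# Multiply 1.5 * 2.25 both ways and compare
a, b = 1.5, 2.25
float_result = a * b  # exact in float64: 3.375
scaled_result = smul(to_scaled(a), to_scaled(b)) / SCALE

rel_err = abs(float_result - scaled_result) / float_result
```

The residual relative error comes from the float-to-scaled conversion, not from the scaled arithmetic itself, which is exact.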

3. Rust BigInt (src/witness.rs)

The actual witness generator. Uses num-bigint for arbitrary-precision integers:

```rust
fn smul(a: &BigInt, b: &BigInt) -> BigInt {
    (a * b) / scale()
}

fn sdiv(a: &BigInt, b: &BigInt) -> BigInt {
    (a * scale()) / b
}
```

This implementation answers: "does the Rust witness match the Python reference, and do its values satisfy the halo2 circuit constraints?"

The Shared Loop: Deduplication as Validation

A key architectural decision: the Rust witness uses a single simulate_inner() function for both the collapse preset and the acoustic driving preset:

```rust
fn simulate_inner(
    r0: &BigInt, p0: &BigInt, p_initial: &BigInt,
    dt: &BigInt, dt2: &BigInt, two_sigma: &BigInt,
    four_s: &BigInt, three_halves_s: &BigInt,
    rho: &BigInt, mu: &BigInt,
    r_start: BigInt, n_steps: usize,
    acoustic_dp: impl Fn(usize) -> BigInt,
) -> Vec<BigInt> {
    // Single Verlet integration loop
    // acoustic_dp returns zero for collapse, Pa*sin(wt) for acoustic
}
```

Before our code review, collapse and acoustic had separate simulation loops. This is a correctness hazard: fix a bug in one loop, forget the other. Deduplication means every physics improvement applies everywhere.

Cross-Validation: The Test Suite

The Python test suite (reference/test_simulate.py) runs both Python implementations and compares:

```python
def test_collapse_float_vs_scaled():
    """Float and scaled-integer must agree within tolerance."""
    params_float = CollapsePreset.float_params()
    params_scaled = CollapsePreset.scaled_params()

    trace_float = simulate_float(params_float, n_steps=100)
    trace_scaled = simulate_scaled(params_scaled, n_steps=100)

    for i in range(len(trace_float)):
        r_float = trace_float[i]
        r_scaled = trace_scaled[i] / SCALE
        assert abs(r_float - r_scaled) / r_float < 1e-10, \
            f"Step {i}: float={r_float}, scaled={r_scaled}"
```

The tolerance (10^-10 relative error) accounts for the precision difference between float64 (53-bit mantissa) and exact BigInt arithmetic. Over 100 steps, errors accumulate but stay well within this bound.

The Rust-to-Python comparison uses exported JSON traces:

```python
def test_rust_vs_python():
    """Rust BigInt witness must match Python scaled-integer exactly."""
    rust_trace = json.load(open("output/rust_trace.json"))
    python_trace = simulate_scaled(params, n_steps=len(rust_trace) - 1)

    for i in range(len(rust_trace)):
        assert rust_trace[i] == python_trace[i], \
            f"Step {i}: rust={rust_trace[i]}, python={python_trace[i]}"
```

Note: this comparison is exact (not approximate). Rust BigInt and Python int both do arbitrary-precision integer arithmetic. The same operations on the same inputs must produce identical results.

The Field Replay: Closing the Gap

Even with exact Rust-Python agreement, there's one more gap: the circuit uses field arithmetic (modular inverse), not integer arithmetic (truncating division). These differ whenever a product isn't exactly divisible by SCALE.
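The divergence is easy to demonstrate with a toy modulus (P and SCALE_TOY below are illustrative, far smaller than the real field): field "division" multiplies by a modular inverse and only agrees with truncating integer division when the product is an exact multiple of the scale.

```python
# Toy demonstration over a small prime field
P = 10_007        # a small prime modulus (illustrative)
SCALE_TOY = 100   # toy scale factor (illustrative)

# Field division: multiply by the modular inverse of SCALE_TOY mod P
scale_inv = pow(SCALE_TOY, -1, P)

def field_scaled_mul(a, b):
    return (a * b * scale_inv) % P

def int_scaled_mul(a, b):
    return (a * b) // SCALE_TOY

# Exactly divisible product (20 * 5 = 100): both arithmetics agree
exact = (field_scaled_mul(20, 5), int_scaled_mul(20, 5))

# Not exactly divisible (7 * 3 = 21): integer truncates to 0,
# field arithmetic gives a large nonzero residue instead
inexact = (field_scaled_mul(7, 3), int_scaled_mul(7, 3))
```

This is exactly why the field replay exists: the circuit lives in the left column, the witness generator in the right, and the two only coincide on exact multiples.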

The compute_public_inputs::<Fr>() function closes this gap by replaying the entire simulation using field operations:

```rust
pub fn compute_public_inputs<F: PrimeField>(witness: &SimulationWitness) -> Vec<F> {
    let s = F::from_u128(SCALE);
    let s_inv = s.invert().unwrap();
    // Load initial conditions as field elements
    let mut r_curr = F::from_u128(witness.r0);
    // Replay each step using field mul/div (not integer)
    // Extract final temperature and emission count
    // Return [r0, final_temp, total_emission]
}
```

This ensures the public inputs — the values the verifier checks — are computed the same way the circuit computes them. Integer arithmetic provides the witness; field arithmetic provides the verification target.

Validation Pyramid

The full validation chain looks like this:

Python float (ground truth physics)
↕ agree within 10^-10 relative error
Python scaled-integer (exact BigInt)
↕ agree exactly (same arithmetic)
Rust BigInt witness
↕ satisfies constraints
halo2 circuit (field arithmetic)
↕ matches
compute_public_inputs (field replay)
↕ verified by
PLONK verifier (pairing check)

Each layer validates the one below it. If any link breaks, we know exactly where the disagreement is:

  • Python float ≠ Python scaled: scaling error or precision loss

  • Python scaled ≠ Rust BigInt: implementation bug in witness generator

  • Rust witness ≠ circuit constraints: witness computation doesn't match gates

  • Public inputs ≠ circuit output: field vs. integer arithmetic mismatch

  • Verifier rejects: proof generation bug or transcript error

Lessons Learned

1. Three implementations is not redundant. Each catches different classes of bugs. The float version catches physics errors. The scaled-integer version catches precision errors. The Rust version catches implementation errors.

2. Exact comparison where possible, bounded comparison where necessary. Float-to-scaled comparisons need a tolerance. Scaled-to-BigInt comparisons should be exact. Mixing these up wastes debugging time.

3. Deduplication is a correctness strategy. The simulate_inner() refactor wasn't about code cleanliness — it was about ensuring both presets use the same physics. Every shared line of code is one fewer place for divergence to hide.

4. The field replay is non-optional. You cannot skip compute_public_inputs. Integer arithmetic and field arithmetic give different results. If your public inputs come from integer computation, your proofs will intermittently fail for no apparent reason.
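Lesson 2 in particular rewards a concrete sketch. The toy harness below (decay dynamics, SCALE, and step count are illustrative, not the project's actual physics) applies a bounded comparison across the float/integer boundary and an exact one within it:

```python
# Layered cross-validation on toy exponential decay (not the bubble physics);
# SCALE, the decay parameters, and the step count are all illustrative.
SCALE = 10**30
STEPS = 1000

# Layer 1: float "ground truth" (x' = -k*x with k=0.5, dt=0.001)
x = 2.0
for _ in range(STEPS):
    x = x - 0.5 * x * 0.001

# Layer 2: scaled-integer replay with exact arbitrary-precision ints
K, DT = 5 * 10**29, 10**27        # 0.5 and 0.001 at SCALE
X = 2 * SCALE
for _ in range(STEPS):
    X = X - (K * X // SCALE) * DT // SCALE

# Bounded comparison across the float/integer boundary...
rel_err = abs(x - X / SCALE) / x
assert rel_err < 1e-10, rel_err

# ...and exact comparison between runs sharing the same arithmetic,
# the analogue of the Python-scaled vs Rust-BigInt check:
X2 = 2 * SCALE
for _ in range(STEPS):
    X2 = X2 - (K * X2 // SCALE) * DT // SCALE
assert X2 == X                    # bit-for-bit, no tolerance
```

Mixing the two comparison modes up, as Lesson 2 warns, either hides real divergence behind a tolerance or burns hours chasing harmless float noise.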

This is Part 7 of our 8-part series. Part 6: Testing ZK Privacy covers the privacy testing framework. Final article: Part 8: Sonoluminescence — The Physics of Light from Nothing — the full science behind the simulation.

New here? Subscribe free for the full series.

Originally published at https://jacobian.ghost.io/python-to-rust-to-proof-cross-validating-a-zk-system/.

Subscribe to *Jacobian* for weekly ZK/privacy insights: jacobian.ghost.io

New here? Grab the free *ZK Privacy Patterns* guide.

Understanding Joins and Window Functions in SQL

2026-03-02 05:07:56

Introduction

In this article, we will explore two of the most powerful and widely used features in SQL: JOINs and Window Functions. We will begin by understanding what they are and how they work, and then walk through practical examples to see when, where, and why they are used in real-world scenarios.

Let's start with joins:

What are Joins?

In Structured Query Language (SQL), a join is a clause used to combine rows from two or more tables based on a related column between them. The purpose of a join is to bring together data that is spread across multiple tables into a single result set, providing a unified view.
Example: In a database with an orders table and a customers table, a join can answer questions such as which customer placed a given order.

Types of Joins

1. INNER JOIN

It combines two or more tables based on a specified common column with matching values and only returns the set of records that have a match in all the involved tables. The rows that don't have a match in the other table(s) are excluded from the result set.

Example:

Suppose we have two tables, customers and orders:

select  first_name, last_name
from article.customers
inner join article.orders on customers.customer_id = orders.customer_id;

Using an INNER JOIN on the common customer_id column present in both the customers and orders tables, we combine the two tables and get back only the records whose customer_id appears in both.
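If you want to run the join locally, here is a self-contained sketch using Python's built-in sqlite3 module; the table shapes and rows are invented stand-ins for the article's customers and orders tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY,
                            first_name TEXT, last_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ada', 'Lovelace'),
                                 (2, 'Alan', 'Turing'),
                                 (3, 'Grace', 'Hopper');
    INSERT INTO orders VALUES (10, 1, '2026-01-05'),
                              (11, 3, '2026-02-14');
""")

# INNER JOIN keeps only customers whose customer_id appears in orders
rows = conn.execute("""
    SELECT first_name, last_name
    FROM customers
    INNER JOIN orders ON customers.customer_id = orders.customer_id
""").fetchall()
print(rows)   # Alan has no order, so he is excluded
```

Swapping INNER JOIN for LEFT JOIN in the same query returns Alan too, with NULLs in the order columns, as described in the next section.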

2. LEFT JOIN

It retrieves all rows from the left (first) table and matching rows from the right (second) table. If a row in the left table has no corresponding match in the right table based on the join condition, the result will contain NULL values for the columns of the right table.

Example:
Using our previous tables

--LEFT JOIN
select first_name, last_name, order_date
from article.customers
left join article.orders on customers.customer_id = orders.customer_id;

We perform a left join of the customers table with the orders table on the customer_id condition. The query returns all the customers, but some have a NULL value in the order_date column because they do not appear in the orders table (they have never placed an order).

The left join is also known as the left outer join.

3. RIGHT JOIN

It retrieves all rows from the right (second) table and matching rows from the left (first) table. If a row in the right table has no corresponding match in the left table based on the join condition, the result will contain NULL values for the columns of the left table.

--RIGHT JOIN
select first_name, last_name, order_date
from article.customers
right join article.orders on customers.customer_id = orders.customer_id;

Now, on performing the RIGHT JOIN on our tables, we get all the rows from the right table (orders), matched with rows from the left table (customers) where possible. Compared to the LEFT JOIN, we lose the NULL rows: every order has a matching customer, while some customers never placed an order.

The right join is also known as the right outer join.

4. FULL OUTER JOIN

It returns all rows from both the left and right tables, combining matching records and using NULL values for columns where there is no match.

Example:

-- FULL OUTER JOIN
select first_name, last_name, email, phone_number,order_id, order_date, book_id
from article.customers
full outer join article.orders on customers.customer_id = orders.customer_id;

When we perform a FULL OUTER JOIN on the customers and orders tables, the query returns all the rows from both tables, filling in NULL values wherever there is no match.

Now, let's look at window functions:

What are Window Functions?

In SQL, window functions perform calculations across a set of table rows related to the current row without collapsing those rows into a single output value. Unlike aggregate functions such as SUM() and COUNT(), which reduce multiple rows to one, window functions return a value for each row in the original result set.
They operate over a window (specific set of rows) defined by the OVER() clause.

Key Components of SQL Window Functions

Select: This defines the columns you want from the table (the columns you select form your window).
Function: This is the window function you want to use.
Over Clause: This defines the partitioning and ordering of rows and can be applied with functions to compute aggregated values.
Partition by: This divides rows into partitions based on specified expressions. It is suitable for large datasets because it makes them simpler to manage.
Order by: This is a specified order expression to define the order in which rows will be processed within each partition.
Output column: This is the name you give to your output column.

Types of window functions in SQL

1. Aggregate window functions: They calculate aggregates over a window of rows while retaining the individual rows.
2. Ranking window functions: They rank rows within a partition based on specified criteria.
3. Value window functions: They assign to rows values taken from other rows (for example, LAG() and LEAD()). Their results can usually be replicated with nested queries, so they are less common than aggregate and ranking window functions.

Example:

Suppose we have an employees table containing information about employees in different departments.

Let's use the ranking window function RANK() (which assigns ranks within a partition, giving ties the same rank and skipping the following ranks), an OVER() clause to define the partitioning and ordering of rows using PARTITION BY and ORDER BY, and then name an output column that will contain our results.

In our table, we have used the RANK() function to rank employees by salary within each department. We applied (PARTITION BY department) to split the results by department, (ORDER BY salary DESC) to order each partition by descending salary, and output the RANK() in the rank_by_salary column.

After the ranking of the Finance department is complete, a new ranking starts for the HR department.

Here we use the aggregate window function AVG() to compute the average salary for each department and compare it with each employee's salary within that department.
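Both queries can be reproduced with Python's built-in sqlite3 module (window functions require SQLite 3.25 or newer). The employees rows below are invented, but the RANK() and AVG() clauses follow the article's description:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Amina',  'Finance', 90000),
        ('Brian',  'Finance', 75000),
        ('Carol',  'HR',      60000),
        ('David',  'HR',      60000),
        ('Esther', 'HR',      50000);
""")

# RANK() restarts in each department; ties share a rank, the next is skipped
rank_rows = conn.execute("""
    SELECT name, department, salary,
           RANK() OVER (PARTITION BY department
                        ORDER BY salary DESC) AS rank_by_salary
    FROM employees
""").fetchall()
for row in rank_rows:
    print(row)

# AVG() as a window function: each row keeps its own salary next to
# its department's average, instead of collapsing to one row per group
avg_rows = conn.execute("""
    SELECT name, salary,
           AVG(salary) OVER (PARTITION BY department) AS avg_dept_salary
    FROM employees
""").fetchall()
for row in avg_rows:
    print(row)
```

Carol and David tie at 60000, so both get rank 1 and Esther gets rank 3; rank 2 is skipped, which is the defining difference between RANK() and DENSE_RANK().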

To learn more on the other window functions, click here Window functions

Conclusion

In this article, we have gone through and understood what joins and window functions are. Through clear explanations and practical, real-world examples, you've learned what they are, the different types available, how they work, and how to write clean, efficient SQL queries to achieve your desired results. Mastering these tools will dramatically improve your ability to analyze, transform, and retrieve data with precision. Start applying them in your own projects today: the more you practice, the more natural they will feel.
Happy querying!

The Witness Problem: When BigInt Precision Breaks Your Proof

2026-03-02 05:07:31

The witness and the circuit computed different numbers. Both were "correct." Here's why that's terrifying — and the bugs we found that prove it.

What Is the Witness Problem?

In a ZK proof system, the witness is the prover's secret knowledge. The circuit is the set of constraints that verify the computation. They must agree on every intermediate value.

But here's the thing: the witness is computed by regular Rust code (BigInt arithmetic, floating-point conversions, integer division). The circuit operates in a 254-bit prime field (modular arithmetic, field inversion, no truncation). These are fundamentally different computational models.

When they disagree — even by a single bit in a single intermediate value — the proof is invalid. The prover can't generate a valid proof because the witness values don't satisfy the circuit's constraints.

We found three bugs in our witness generator that illustrate exactly how this happens.

Bug 1: The Precision Time Bomb

Our original to_scaled function was simple:

fn to_scaled(v: f64) -> BigInt {
    BigInt::from((v * 1e30) as i128)
}

This looks fine. Multiply by 10^30, cast to integer, wrap in BigInt. And it works perfectly for small values.

But f64 has a 53-bit mantissa — roughly 16 decimal digits of precision. When you multiply a typical physical value by 10^30, the result needs 30+ decimal digits. The cast to i128 captures whatever f64 can represent, and silently drops the rest.

For the surface tension parameter (σ = 0.0728 N/m):

f64:    0.0728 * 1e30 ≈ 7.28e28 (only the first ~16 digits are meaningful; the rest is rounding noise)
BigInt: 0.0728 * 1e30 = 72800000000000000000000000000 (exact)

That's a relative error of ~10^-16 — negligible for most applications, catastrophic for a ZK proof. The circuit computes with the exact field element. The witness provides a different value. The constraint a * b - c * S = 0 fails by a tiny amount. The proof is invalid.

The fix splits the conversion into two stages, each within f64's precision:

fn float_to_scaled_big(v: f64) -> BigInt {
    let int_part = v.trunc() as u64;
    let frac_part = v - int_part as f64;
    let int_scaled = BigInt::from(int_part) * BigInt::from(10u64).pow(30);
    let frac_hi = (frac_part * 1e15).round() as i64;
    let frac_scaled = BigInt::from(frac_hi) * BigInt::from(10u64).pow(15);
    int_scaled + frac_scaled
}

The integer part gets exact scaling (multiplication by 10^30 in BigInt — no precision loss). The fractional part gets 15 digits (within f64's capacity) and is scaled by 10^15 in BigInt. Combined: 30 digits of precision, matching the circuit exactly.
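The failure and the fix are easy to reproduce in Python, whose floats are the same IEEE-754 doubles as Rust's f64 and whose ints are exact big integers. The helper below mirrors float_to_scaled_big; the sigma value is the article's:

```python
sigma = 0.0728                      # surface tension, N/m

# Naive scaling: one float multiplication by 1e30, then truncation.
naive = int(sigma * 1e30)

# Exact target: 0.0728 * 10^30 computed with exact integers.
exact = 728 * 10**26

assert naive != exact               # ~16 digits survived; the rest is noise

# Two-stage fix: integer part scaled exactly in big-int arithmetic,
# fractional part captured at 15 digits (within f64's capacity).
def float_to_scaled_big(v: float) -> int:
    int_part = int(v)                       # trunc
    frac_part = v - int_part
    int_scaled = int_part * 10**30
    frac_hi = round(frac_part * 1e15)
    return int_scaled + frac_hi * 10**15

assert float_to_scaled_big(sigma) == exact  # 30 digits, matching the circuit
```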

Bug 2: The Silent Clamp

The witness generator had a safety clamp to prevent the bubble radius from going negative:

// BEFORE (BUGGY)
let r_min = BigInt::from(1u64) * scale() / BigInt::from(1_000_000u64);
if r_curr < r_min {
    r_curr = r_min.clone();
}

The intent was reasonable: if numerical instability drives the radius below a minimum, clamp it instead of producing nonsensical physics.

The problem: the circuit has no such clamp. The circuit computes R_next = 2R - R_prev + R''×dt² faithfully, even if the result is smaller than r_min. When the witness clamps but the circuit doesn't, they diverge. The proof fails.

Worse: the clamp was silent. No error, no warning. The witness just quietly produced different numbers than the circuit expected.

The fix replaces the clamp with an assertion:

// AFTER (CORRECT)
assert!(r_curr > BigInt::zero(),
        "Radius went non-positive at step {} — reduce timestep or check parameters",
        step);

If the simulation is physically unstable, we want to know immediately — not discover it as a mysterious proof failure.

Bug 3: The Informational Trace

The witness struct returned a radius trace:

pub struct SimulationWitness {
    pub r_trace: Vec<u128>,
    // ...
}

This looked like the circuit would use r_trace[i] to set the radius at each step. But in reality, the circuit only uses r_trace[0] (the initial radius). Every subsequent radius is computed by the circuit itself through constrained field arithmetic.

The values r_trace[1..N] were only used for display purposes — showing peak temperature and minimum radius in the CLI output. But a developer reading the code would naturally assume the trace feeds into the circuit. If they modified the trace (say, smoothing it for numerical stability), the witness would be wrong and the proof would fail — for no obvious reason.

The fix was documentation:

/// r_trace[0] is the initial radius loaded into the circuit. All subsequent
/// values r_trace[1..N] are informational only (used for peak temperature
/// and min radius display) — the circuit computes its own trace via
/// constrained field arithmetic starting from r_trace[0].
pub struct SimulationWitness {
    pub r_trace: Vec<u128>,
    // ...
}

This isn't a code change. It's a correctness boundary change. Making explicit which values the circuit trusts and which are advisory prevents an entire class of future bugs.

Field vs. Integer: The Deepest Subtlety

Even after fixing precision and clamping, there's a fundamental gap between the witness and the circuit.

The witness computes division as integer truncation:

c = floor(a * b / S)

The circuit computes division as field inversion:

c = a * b * S⁻¹ mod p

These give different results whenever a * b is not exactly divisible by S. The integer version truncates. The field version gives the exact multiplicative inverse — a completely different number.
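The divergence is easy to exhibit with small numbers; the prime and scale below are deliberately tiny stand-ins for the 254-bit modulus and S = 10^30:

```python
p = 2**61 - 1          # toy prime standing in for the field modulus
S = 1000               # toy SCALE
a, b = 7, 5            # a * b = 35, not divisible by S

# Witness side: integer truncating division
c_int = a * b // S                       # floor(35 / 1000) = 0

# Circuit side: multiplication by the modular inverse of S (Python 3.8+)
S_inv = pow(S, -1, p)
c_field = (a * b * S_inv) % p

assert c_int != c_field                  # the two models disagree...
assert (c_field * S) % p == (a * b) % p  # ...yet c_field is the exact field quotient
```

They agree only when a * b happens to be an exact multiple of S, which is why skipping the field replay produces proofs that fail intermittently rather than always.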

For public inputs, this matters: the values the verifier checks must be computed the same way the circuit computes them. Our compute_public_inputs<Fr>() function replays the entire simulation using field arithmetic:

pub fn compute_public_inputs<F: PrimeField>(witness: &SimulationWitness) -> Vec<F> {
    let s = F::from_u128(SCALE);
    let s_inv = s.invert().unwrap();
    // Replay the entire circuit's computation using field ops
    // ...
}

This guarantees the public inputs match what the circuit will produce, regardless of how the BigInt witness was computed.

The Meta-Lesson

Every ZK project has a witness problem. The specific bugs differ, but the pattern is always the same:

  • Two computational models (witness generator vs. circuit) must agree exactly

  • Silent divergence is the default — nothing tells you they disagree until proof generation fails

  • Precision mismatches compound across thousands of steps

  • Defensive programming (clamps, defaults, fallbacks) actively fights against correctness

The fix is always the same: make the witness generator compute the exact same operations as the circuit, using the same arithmetic. Where that's impossible (because the circuit uses field arithmetic and the witness uses integers), bridge the gap explicitly with field-replay functions.

And document the boundary. Make it impossible to confuse which values feed into the circuit and which are informational. The bugs that kill ZK projects aren't in the math — they're in the assumptions about which code produces which values.

This is Part 5 of our 8-part series. Part 4: 1,088 Bytes to Prove a Star covers the proof pipeline. Next: Part 6: Testing ZK Privacy — how to verify that proofs actually hide what they should.

New here? Subscribe free for the full series.

Originally published at https://jacobian.ghost.io/the-witness-problem-when-bigint-precision-breaks-your-proof/.


[DEV Weekend Challenge: Community] Kanoon Mera Kawach

2026-03-02 05:03:42

This is a submission for the DEV Weekend Challenge: Community

The Community

This project is built for the general public in India — everyday citizens who often have a very limited or poor understanding of laws, especially how they are interpreted and applied in real life. Misunderstandings frequently lead to exploitation, unnecessary legal troubles, or harassment by bad actors (landlords, employers, police, etc.). Many people hesitate to seek help due to complexity, cost, or fear, so this tool aims to democratize access to constitutional rights and basic legal knowledge for the masses.

What I Built

Kanoon Mera Kawach ("Law is My Shield") is a mobile app that lets users quickly look up provisions from the Indian Constitution, with plans to expand to the Indian Penal Code (IPC) and other key laws. It includes:

  • A clean, searchable law database
  • A social discussion forum where users can ask questions, share experiences, and discuss legal issues
  • Automatic grouping of discussions (similar to subreddits or community categories) to keep conversations organized and focused

The goal is to give the common person an accessible way to understand their rights, avoid common pitfalls, and engage in informed community discussions — reducing misinformation and empowering people to stand up for themselves.

Demo

Here’s a quick video walkthrough of the app in action:

(If the embed doesn't load, direct link: https://www.youtube.com/watch?v=KIqJ4L53XSk)

Code

The full source code is open on GitHub:

GitHub: Amit-Roy / kanoon-mera-kawach

Making legislation accessible to masses

Kanoon Mera Kavach 🛡️⚖️

"Law is the shield of the people"

A React Native + Expo mobile app that empowers Indian citizens to understand their legal rights, search Indian laws with plain-English explanations, report civic issues, and participate in community discussions — all through a premium glassmorphic interface.

⚠️ Disclaimer: This is NOT legal advice. This app provides general legal information only. Always consult a qualified lawyer or the relevant authority before taking any action based on information in this app.

Features

Core

  • Legal Search — Browse 2,300+ Indian Constitution articles with full text, plain-English summaries, and actionable guidance. Paginated virtual scrolling for smooth performance.
  • Discuss an Issue — Report incidents (police encounters, business fraud, government stonewalling/bribes) via a structured form. Posts auto-link to community topics.
  • Community Topics — Reddit-style discussion threads. Create topics, post replies, upvote, and comment.
  • Live News — Latest Indian legal/civic news via NewsAPI.org with automatic…

Feel free to clone, explore, or contribute!

How I Built It

  • Frontend/Mobile: Built with React Native using Expo for fast development and cross-platform support (iOS + Android).
  • Data Source: Currently pulls from the excellent open-source Constitution of India JSON by CivicTech India. Future plans include integrating the Indian Penal Code JSON and more statutes.
  • Visuals: All icons and graphics are custom SVGs generated with Claude AI to avoid any copyright concerns.
  • Backend & Auth: Firebase handles user authentication (email/social login), real-time database for discussions/comments, and storage for any future user-uploaded content.
  • Other: Focused on simplicity and offline-friendly design where possible (e.g., caching law texts).

This was a fun weekend sprint to prototype something meaningful for a real societal need. Would love feedback — especially ideas for expanding the law database or improving the discussion features!

Thanks for checking it out, and thanks to the DEV team for running this Community challenge!