2026-03-02 05:31:24
Six months ago, we asked a simple question at Avluz.com: "Can we predict when Amazon will drop prices on products?" Today, our system forecasts price changes with 83% accuracy across 50,000 products, processing 7.3 price updates per product daily. But here's the thing—the journey to get here taught us more about e-commerce algorithms than any documentation ever could.
This isn't a theoretical post. This is the complete technical breakdown of how we built, tested, and deployed a system that reverse-engineered Amazon's dynamic pricing patterns to power our deal discovery and price tracking platform.
### Component 1: The Web Scraper

```python
import re
import requests
from datetime import datetime
from bs4 import BeautifulSoup

# Amazon changes its HTML structure frequently, so we try several selectors
PRICE_SELECTORS = [
    '#priceblock_ourprice',
    '.a-price .a-offscreen',
    '#price_inside_buybox',
]

class AmazonScraper:
    def scrape_product(self, asin):
        """Fetch a product page and extract price + availability."""
        product_url = f'https://www.amazon.com/dp/{asin}'
        try:
            response = requests.get(product_url, timeout=10)
            soup = BeautifulSoup(response.text, 'html.parser')
            price = None
            for selector in PRICE_SELECTORS:
                element = soup.select_one(selector)
                if element:
                    price_text = element.get_text()
                    price = float(re.search(r'[\d,]+\.?\d*', price_text)
                                  .group().replace(',', ''))
                    break
            if not price:
                return None
            return {
                'asin': asin,
                'price': price,
                'timestamp': datetime.utcnow().isoformat(),
                'url': product_url,
                'availability': self._check_availability(soup)
            }
        except Exception as e:
            print(f"Error scraping {asin}: {str(e)}")
            return None

    def _check_availability(self, soup):
        """Check if product is in stock"""
        availability = soup.select_one('#availability span')
        if availability:
            return 'in_stock' if 'In Stock' in availability.get_text() else 'out_of_stock'
        return 'unknown'
```
**Key Implementation Details:**
- Rotating user agents to avoid detection
- Multiple fallback selectors (Amazon changes HTML structure frequently)
- Availability tracking (critical for inventory-based pricing)
- Proper error handling and timeouts
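User-agent rotation is straightforward to sketch with only the standard library. The UA strings and header set below are illustrative stand-ins, not our production pool:

```python
import random
import urllib.request

# Hypothetical UA pool -- the production list is larger and refreshed regularly.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0",
]

def build_headers():
    """Vary the request fingerprint by rotating the User-Agent per request."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

def build_request(url):
    """Build a request with a rotated User-Agent; fetching is left to the caller."""
    return urllib.request.Request(url, headers=build_headers())
```

Each call to `build_request` picks a fresh user agent, so consecutive requests don't share an identical fingerprint.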
We run this scraper every 2 hours for 50,000 products. That's 600,000 requests per day, which brings us to...
### Component 2: MongoDB Time-Series Database

Storing 600,000 price points daily requires efficient time-series storage. MongoDB's time-series collections were perfect for this.
```javascript
// MongoDB schema for price history
db.createCollection("price_history", {
  timeseries: {
    timeField: "timestamp",
    metaField: "product",
    granularity: "hours"
  }
});

// Aggregation pipeline for pattern detection
const priceTrends = await db.price_history.aggregate([
  {
    $match: {
      "product.asin": productAsin,
      timestamp: {
        $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) // Last 30 days
      }
    }
  },
  {
    $group: {
      _id: {
        hour: { $hour: "$timestamp" },
        dayOfWeek: { $dayOfWeek: "$timestamp" }
      },
      avgPrice: { $avg: "$price" },
      minPrice: { $min: "$price" },
      maxPrice: { $max: "$price" },
      priceChanges: { $sum: 1 },
      stdDev: { $stdDevPop: "$price" }
    }
  },
  {
    $sort: { "_id.dayOfWeek": 1, "_id.hour": 1 }
  }
]).toArray();

// Calculate price volatility
const volatility = priceTrends.map(trend => ({
  timeSlot: `${trend._id.dayOfWeek}-${trend._id.hour}`,
  volatilityScore: (trend.stdDev / trend.avgPrice) * 100,
  changeFrequency: trend.priceChanges
}));
```
**Database Performance:**
- **Write throughput**: 8,000 inserts/second
- **Query latency**: 45ms average for 30-day aggregations
- **Storage efficiency**: 2.4GB per million records (compressed)
- **Index strategy**: Compound index on (asin, timestamp)
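As a sketch in mongosh (field names follow the schema above), creating that compound index looks like:

```javascript
// Compound index: per-ASIN lookups with time-range scans
db.price_history.createIndex({ "product.asin": 1, "timestamp": 1 });
```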
This powers the real-time analysis, but the magic happens in the ML model.
### Component 3: Machine Learning Model

The prediction model is where we encode everything we learned about Amazon's patterns.
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

class PricePredictionModel:
    def __init__(self):
        self.model = RandomForestRegressor(
            n_estimators=200,
            max_depth=15,
            min_samples_split=10,
            random_state=42
        )
        self.scaler = StandardScaler()

    def engineer_features(self, price_history, product_metadata):
        """
        Create features from raw price data.
        Returns: feature matrix for ML model
        """
        df = pd.DataFrame(price_history)

        # Time-based features
        df['hour'] = pd.to_datetime(df['timestamp']).dt.hour
        df['day_of_week'] = pd.to_datetime(df['timestamp']).dt.dayofweek
        df['day_of_month'] = pd.to_datetime(df['timestamp']).dt.day
        df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
        df['is_month_end'] = (df['day_of_month'] >= 28).astype(int)

        # Price history features (12 samples/day at 2-hour intervals)
        df['price_ma_24h'] = df['price'].rolling(window=12, min_periods=1).mean()
        df['price_ma_7d'] = df['price'].rolling(window=84, min_periods=1).mean()
        df['price_std_24h'] = df['price'].rolling(window=12, min_periods=1).std()
        df['price_change_rate'] = df['price'].pct_change()
        df['price_volatility'] = (df['price'].rolling(window=24).std()
                                  / df['price'].rolling(window=24).mean())

        # Competitor pricing features
        competitor_prices = product_metadata.get('competitor_prices')
        df['competitor_min'] = min(competitor_prices) if competitor_prices else df['price']
        df['price_vs_competitor'] = (df['price'] - df['competitor_min']) / df['competitor_min']

        # Inventory signals
        df['low_stock'] = int(product_metadata.get('stock_level', 100) < 10)

        # Demand indicators
        df['search_volume'] = product_metadata.get('search_trend', 0)
        df['sales_rank'] = product_metadata.get('sales_rank', 999999)

        return df.fillna(0)

    def train(self, historical_data, product_metadata, future_prices):
        """Train the model on historical data."""
        X = self.engineer_features(historical_data, product_metadata)
        y = future_prices  # Price 24 hours ahead
        X_scaled = self.scaler.fit_transform(X)
        self.model.fit(X_scaled, y)
        return self.model.score(X_scaled, y)  # R² score

    def predict_next_price(self, recent_data, product_metadata):
        """Predict price for the next time window."""
        X = self.engineer_features(recent_data, product_metadata)
        X_scaled = self.scaler.transform(X)
        prediction = self.model.predict(X_scaled[-1:])
        confidence = self.model.score(X_scaled, recent_data['price'])
        return {
            'predicted_price': float(prediction[0]),
            'confidence_score': float(confidence),
            'current_price': float(recent_data['price'].iloc[-1]),
            'predicted_change': float(prediction[0] - recent_data['price'].iloc[-1])
        }
```
**Model Performance Evolution:**

- **Month 1**: 47% accuracy (baseline linear regression)
- **Month 2**: 62% accuracy (added time features)
- **Month 3**: 71% accuracy (competitor price features)
- **Month 4**: 78% accuracy (demand signals integrated)
- **Month 5**: 81% accuracy (ensemble methods)
- **Month 6**: 83% accuracy (hyperparameter tuning)
### System Flow

The complete workflow:
1. **Scraper** collects price every 2 hours
2. **Detection** identifies if price changed
3. **Analysis** compares to historical patterns
4. **ML Prediction** forecasts next change
5. **Alert Generation** notifies users of opportunities
6. **Feedback Loop** improves model with results
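The six steps above can be sketched as a single toy loop. The class and drop threshold here are hypothetical simplifications — in production each step is a separate service:

```python
from dataclasses import dataclass

@dataclass
class PricePoint:
    asin: str
    price: float

class DealPipeline:
    """Toy end-to-end loop: detect a change, then decide whether to alert."""
    def __init__(self, drop_threshold=0.05):
        self.drop_threshold = drop_threshold
        self.last_seen = {}   # asin -> last recorded price
        self.alerts = []

    def ingest(self, point: PricePoint):
        prev = self.last_seen.get(point.asin)
        self.last_seen[point.asin] = point.price
        if prev is None or prev == point.price:
            return None  # step 2: no change detected
        change = (point.price - prev) / prev
        if change <= -self.drop_threshold:  # step 5: alert on a real drop
            self.alerts.append((point.asin, prev, point.price))
        return change

pipe = DealPipeline()
pipe.ingest(PricePoint("B00EXAMPLE", 100.0))
pipe.ingest(PricePoint("B00EXAMPLE", 89.0))  # 11% drop -> alert
```

The same loop structure carries the feedback step: every alert's outcome goes back into the training set.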
### Technology Stack

**Frontend:**
- React with Material-UI for dashboard
- Chart.js for price visualization
- WebSocket for real-time updates
**Backend:**
- Node.js Express API
- Python Flask for ML serving
- Redis for caching hot predictions
**Data Layer:**
- MongoDB for time-series storage
- Redis for session state
- S3 for raw HTML archives
**Infrastructure:**
- AWS Lambda for scraping jobs
- CloudWatch for monitoring
- API Gateway for public API
## The Optimization Phase: From 62% to 83%
Getting from decent accuracy to production-grade prediction required obsessive optimization.
### Failed Optimizations
Let me be honest about what didn't work:
**Attempt 1: Deep Learning**
- Tried LSTM networks for time-series prediction
- Result: 58% accuracy (worse than Random Forest)
- Reason: Not enough data per product for deep learning
**Attempt 2: Real-Time Competitor Scraping**
- Added scraping of 10 competitor sites
- Result: Minimal accuracy improvement (2%)
- Cost: 4x infrastructure costs
- Decision: Removed from production
**Attempt 3: Review Sentiment Analysis**
- Hypothesis: Review sentiment predicts pricing
- Result: No correlation found
- Lesson: Focus on direct price signals
### Successful Optimizations
What actually moved the needle:
**1. Category-Specific Models** (+7% accuracy)
```python
models = {
    'electronics': RandomForestRegressor(max_depth=20),
    'books': GradientBoostingRegressor(max_depth=10),
    'home': RandomForestRegressor(max_depth=15)
}
```
Electronics prices follow different patterns than books. Category-specific models captured these nuances.
**2. Temporal Cross-Validation** (+4% accuracy)
```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    model.fit(X[train_idx], y[train_idx])
    score = model.score(X[test_idx], y[test_idx])
```
Random cross-validation was leaking future information into training. Time-based splits fixed this.
**3. Feature Interaction Terms** (+3% accuracy)
```python
df['evening_volatility'] = df['is_evening'] * df['price_volatility']
df['weekend_demand'] = df['is_weekend'] * df['search_volume']
```
The interaction between time of day and price volatility was more predictive than either feature alone.
### ROI Analysis

**Investment:**
- Development: $12,000 (3 months, 2 engineers)
- Infrastructure: $300/month AWS costs
- Total Year 1: $15,600
**Returns:**
- Labor savings: $48,000/year (40 → 2 hours/week manual work)
- Better deal pricing: $18,000/year (improved conversion)
- Total Annual Return: $66,000
**ROI: 323% in first year**
This approach now powers Avluz.com's deal recommendation engine, helping millions of shoppers catch price drops at the perfect moment.
## The Lessons: What We'd Do Differently
Looking back, here's what we learned:
### Technical Lessons
**1. Start Simple, Add Complexity Incrementally**
Our initial attempt used complex deep learning. Random Forest with good features outperformed it. Start simple, add complexity only when needed.
**2. Data Quality > Data Quantity**
50,000 products with clean data beats 500,000 with noisy data. We spent 2 weeks cleaning outliers and anomalies. That investment paid off.
**3. Domain Knowledge > Algorithm Choice**
Understanding *why* Amazon prices change (inventory, competitors, demand) was more valuable than trying every ML algorithm. Talk to your domain experts.
### Business Lessons
**1. Predict, Don't Just React**
The shift from reactive ("price changed!") to predictive ("price will change in 6 hours") transformed our value proposition.
**2. User Trust Requires Transparency**
We show confidence scores with predictions. Users trust "83% confident" more than "will drop" without context.
**3. Continuous Learning Is Essential**
Amazon's algorithm evolves. Our model retrains weekly with new data. Static models decay rapidly in e-commerce.
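A minimal sketch of that retraining trigger — the window size and error threshold here are illustrative, not our production values:

```python
from collections import deque

class DriftMonitor:
    """Flag a retrain when mean absolute % error over a window exceeds a limit."""
    def __init__(self, window=50, max_mape=0.10):
        self.errors = deque(maxlen=window)
        self.max_mape = max_mape

    def record(self, predicted, actual):
        self.errors.append(abs(predicted - actual) / actual)

    def needs_retrain(self):
        if not self.errors:
            return False
        return sum(self.errors) / len(self.errors) > self.max_mape

mon = DriftMonitor(window=3, max_mape=0.10)
for pred, act in [(95, 100), (80, 100), (70, 100)]:
    mon.record(pred, act)
# mean error ~= 0.18 > 0.10 -> time to retrain
```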
## What's Next at Avluz.com
We're expanding this approach in three directions:
**1. Multi-Retailer Prediction**
Applying these techniques to Target, Walmart, and Best Buy. Early tests show 76% accuracy—close to Amazon's 83%.
**2. Bundle Optimization**
Predicting optimal times to buy multiple products together. "Wait 3 days for laptop, buy monitor today" type recommendations.
**3. Open Source Toolkit**
Planning to release our feature engineering library as open source. The ML community helped us; time to give back.
**4. Real-Time Alert Improvements**
Moving from email alerts to push notifications with WebSocket connections. Users get price drop alerts within 30 seconds.
## Recommendations for Engineering Teams
If you're building similar systems, here's my advice:
### For Individual Engineers
**Start with data collection:** You need 3-6 months of historical data before ML makes sense. Start scraping now.
**Use existing tools:** Don't build scrapers from scratch. Libraries like Scrapy and Selenium are battle-tested.
**Validate constantly:** Test predictions against holdout data weekly. Models drift faster than you expect.
### For Engineering Teams
**Invest in infrastructure early:** We initially underestimated storage and compute needs. MongoDB's time-series collections saved us.
**Build feedback loops:** Our model improves because we track actual vs. predicted prices and retrain.
**Consider ethics:** Price prediction can be used for price discrimination. We use it to help consumers, not exploit them.
### For Companies
**This is a marathon, not a sprint:** It took us 6 months to reach 83% accuracy. Budget for iterative improvement.
**Partner with domain experts:** Our deals team's insights were as valuable as our ML expertise.
**Prepare for maintenance:** Amazon changes their site structure monthly. Budget for ongoing scraper maintenance.
## Code Repository
Want to build your own price prediction system? Key resources:
- [MongoDB Time-Series Documentation](https://www.mongodb.com/docs/manual/core/timeseries-collections/)
- [scikit-learn RandomForest Guide](https://scikit-learn.org/stable/modules/ensemble.html#forest)
- [Scrapy Best Practices](https://docs.scrapy.org/en/latest/topics/practices.html)
- [AWS Lambda for Web Scraping](https://aws.amazon.com/blogs/compute/web-scraping-at-scale-with-aws-lambda/)
## Final Thoughts
Reverse engineering Amazon's pricing algorithm taught us that modern e-commerce is a real-time game. Prices adjust to demand, inventory, and competition within minutes. Static price tracking isn't enough—you need prediction.
The system we built processes 7.3 price changes per product per day across 50,000 products. That's 365,000 price updates daily. And we can predict the next change with 83% accuracy.
But here's the most important lesson: The algorithm is just the beginning. The real value comes from helping real people save money on products they actually want to buy. That's what drives us at [Avluz.com](https://avluz.com).
---
## 💬 Discussion Questions
1. Have you experimented with price prediction for e-commerce? What accuracy did you achieve?
2. What other factors do you think influence Amazon's pricing algorithm that we might have missed?
3. For large-scale web scraping, do you prefer Scrapy, Selenium, or cloud-based solutions?
4. How do you handle model drift when external algorithms (like Amazon's) are constantly evolving?
---
*Engineering insights from [Avluz.com](https://avluz.com) - Where millions discover deals, coupons, and price drops daily. Follow our engineering blog for more deep dives into e-commerce ML, real-time data processing, and scalable web scraping.*
2026-03-02 05:25:33
This is a submission for the DEV Weekend Challenge: Community
I'm an avid card player. You know, the kind that used to play once a week, and now maybe 5 times a year 🤭 I started playing Magic: The Gathering back in high school when Invasion was dropping. We would go to the top of the gym, stack up gymnastics mats, and play Magic all the time, crack new packs, and get the thrill of pulling a foil. Cracking packs of Magic used to be so fun, knowing you might get a rare foil or some interesting mythic. Recently, Magic has kind of let my friends and me down. With them releasing IPs like TMNT, Spider-Man, and others, and with foils being in every pack, cracking a pack doesn't feel special anymore.
Enter Sorcery: Contested Realm, a card game that has that exact same thrill of opening packs and getting a foil maybe 1 out of 6 packs. What makes cracking packs so fun is getting that single rare card, or a Curiosa, that could have HUGE value. This game is in its early stages too, but it already has cards valued at over $1,000 🤑 (Curiosas, rare foils)!
Most people who collect TCG cards of value know about TCGPlayer.com. This site sets the standard for TCG card prices on the market, and Sorcery: Contested Realm price data is already on it. Sorcery also has its own website for card information, deck building, and collection management. So what's the issue?
Let's go with S:CR (Sorcery: Contested Realm) for the rest of this entry. S:CR's website, Curiosa.io, is incredible. But some friends and I have always wanted to be able to price our collection and decks. I mean, how cool would that be?! To be able to check our collection and see how "rich" 😆 we are. But in all honesty, I always thought it would be cool if Curiosa.io built a way to check the price of each deck and keep up with your collection value.
Big brain thought. Why not just do exactly this ourselves?! The ability to have price information and total value information on the website would be absolutely incredible. This site has been out 2+ years and no one has created anything to allow this.
So, let's go through the "app" I built (😵💫,😤,😂). I've been trying to think of ways I could bring this to our community of Sorcery card players, and I wanted this tool for everyone. Since TCGPlayer and Curiosa.io both run on React / virtual DOMs, the way they load information makes them almost impossible to scrape naively. So how can we scrape price data from TCGPlayer?
In comes MutationObserver. I had never thought to use this. After several attempts on my own trying to figure out how to scrape the data, I had Copilot help me understand how TCGPlayer loads its data and how I could grab that information. With MutationObserver, you observe the DOM and wait for certain data to appear before kicking off your logic.
Now that we know how we're going to scrape the data, how are we going to package this as something that will scrape the data, match data on a page, and inject data into that page? A Chrome extension!! And my very first time creating an extension in Chrome.
In building the Chrome extension, I learned how to create a manifest for the Chrome extension environment, and how the file structure for extensions works. I think one of the coolest things about extensions is how you can just refresh the extension and it grabs the newest bit of code with new versions. So let's break down the versions.
1.0.1 - Users have to scrape TCGPlayer for data, built inside the extension, link for scraping and all. Can see price data only for decks on Curiosa.io; card names are matching and price injection is happening. But at the moment, it's not accurate. We are just pulling the first card name we see, not considering foils and Alpha/Beta versions of the same card.
1.0.2-5 - Price data is now accurate for the "Alpha/Beta" versions. Let me tell you, and excuse my French, but that was a bitch to handle. The way Curiosa.io works is by tRPC batch requests. With the extension I had to intercept the fetch request, create the same exact request for card information, and match the data to the set it belongs to. The set is shown when you click on a card, so I also had to set up the code to automate the clicks, show a loading screen, then inject the price data correctly.
1.0.6 - Now we have foils pulling in, very nicely as well. They get a nice yellow badge on the card name along with the price, and all cards are now injecting correct set pricing.
1.0.7 - Added Total cards and Total value of each deck to the top of the deck.
1.0.8 - Created a collection value and used the same name matching and price injection to show correct information within the collection.
1.0.9 - Added a link on the price badge that someone can click to go to the card's price page on TCGPlayer.com.
1.0.10 - Finessed the wording for the plugin, tested to make sure all price data was matching correctly in both decks and collection, and optimized the code some.
1.1.0 - Created a .crx file extension for testing.
Note: Each user has to scrape the data manually to get updated prices and card information from TCGPlayer. I was afraid to hit TCGPlayer with tons of scrape traffic daily from every loaded extension, so I made the scraping happen manually.
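The fetch interception from 1.0.2-5 can be sketched like this. It's simplified: `installInterceptor` and the endpoint pattern are illustrative names, and the real extension also replays matching tRPC requests rather than just recording responses:

```javascript
// Wraps a fetch implementation so tRPC batch responses can be inspected
// before being handed back to the app. (Sketch -- endpoint pattern is illustrative.)
function installInterceptor(fetchImpl, sink) {
  return async function (input, init) {
    const response = await fetchImpl(input, init);
    const url = typeof input === "string" ? input : input.url;
    if (url.includes("/api/trpc/")) {
      const clone = response.clone(); // don't consume the app's copy of the body
      try {
        sink.push({ url, data: await clone.json() });
      } catch (_) {
        // non-JSON bodies are ignored
      }
    }
    return response; // the page sees the original response, untouched
  };
}

// In the content script one would do something like:
//   const sink = [];
//   window.fetch = installInterceptor(window.fetch.bind(window), sink);
```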
This is a video demo of the Sorcery Extension being live used.
Automatically displays TCGPlayer market prices for Sorcery: Contested Realm cards on Curiosa.io, including deck totals.
```
sorcery-price-tracker/
├── manifest.json         # Extension config + permissions
├── background.js         # Service worker: storage, canonical names, messaging
├── content_tcgplayer.js  # Scraper: runs on TCGPlayer, collects card prices
├── content_curiosa.js    # Injector: runs on Curiosa.io, displays prices
├── popup.html            # Extension popup UI
├── popup.js              # Popup logic
└── icons/                # Add icon16.png, icon48.png, icon128.png here
```
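For that structure, a minimal Manifest V3 file might look like the following — a sketch, so the exact permissions and match patterns in the real extension may differ:

```json
{
  "manifest_version": 3,
  "name": "Sorcery Price Tracker",
  "version": "1.1.0",
  "permissions": ["storage"],
  "host_permissions": ["https://www.tcgplayer.com/*", "https://curiosa.io/*"],
  "background": { "service_worker": "background.js" },
  "content_scripts": [
    { "matches": ["https://www.tcgplayer.com/*"], "js": ["content_tcgplayer.js"] },
    { "matches": ["https://curiosa.io/*"], "js": ["content_curiosa.js"] }
  ],
  "action": { "default_popup": "popup.html" }
}
```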
To test it, load the unpacked extension from chrome://extensions.
Here is a sample of the code I used to scrape TCGPlayer.com website for card data.
content_tcgplayer.js
```javascript
// Re-run the scrape shortly after the virtual DOM settles
let debounceTimer;
const DEBOUNCE_MS = 500; // assumption: the original delay value wasn't shown

function observeDOM() {
  const observer = new MutationObserver(() => {
    clearTimeout(debounceTimer);
    debounceTimer = setTimeout(tryScrape, DEBOUNCE_MS);
  });
  observer.observe(document.body, { childList: true, subtree: true });
}

function tryScrape() {
  const results = scrapeCurrentPage();
  const count = Object.keys(results).length;
  if (count === 0) return;
  scrapedOnThisPage = true;
  const slug = getSetSlugFromUrl();
  updateBanner(`Scraped ${count} entries from ${getSetDisplayName(slug)} — saving...`);
  chrome.runtime.sendMessage(
    { type: 'SAVE_PRICES', payload: results },
    (response) => {
      if (response?.success) {
        updateBanner(`${getSetDisplayName(slug)}: ${count} entries saved (${response.count} total). Open another set's price guide to continue.`);
      }
    }
  );
}
```
```javascript
// ─────────────────────────────────────────────
// Scrape the price guide table.
// Captures card name, price, AND the TCGPlayer
// url from the card.
// ─────────────────────────────────────────────
function scrapeCurrentPage() {
  const prices = {};
  const slug = getSetSlugFromUrl();
  const setDisplayName = getSetDisplayName(slug);
  const rows = document.querySelectorAll('.tcg-table-body tr');
  if (rows.length === 0) return prices;
  rows.forEach((row) => {
    const nameEl = row.querySelector('.pdp-url');
    const cells = row.querySelectorAll('.tcg-table-body__cell');
    if (!nameEl || cells.length <= PRICE_COLUMN_INDEX) return;
    const rawName = nameEl.innerText?.trim();
    const rawPrice = cells[PRICE_COLUMN_INDEX]?.innerText?.trim();
    if (!rawName || !rawPrice) return;
    if (rawPrice === '-' || rawPrice === '' || rawPrice === 'N/A') return;
    const priceFloat = parseFloat(rawPrice.replace(/[^0-9.]/g, ''));
    if (isNaN(priceFloat) || priceFloat === 0) return;
    // Capture the TCGPlayer PDP URL from the anchor tag
    const href = nameEl.href || nameEl.closest('a')?.href || null;
    const pdpUrl = href && href.startsWith('http') ? href : null;
    const foil = isFoilName(rawName);
    const baseName = foil ? stripFoilSuffix(rawName) : rawName;
    const priceObj = {
      display: rawPrice,
      value: priceFloat,
      set: setDisplayName,
      setSlug: slug,
      foil: foil,
      url: pdpUrl,
      source: 'tcgplayer',
      scrapedAt: Date.now(),
    };
    prices[rawName] = priceObj; // keyed by the raw (foil-suffixed) name
  });
  return prices;
}
```
So, with a better understanding of the file structure and code, here's how I actually built the extension:
I built the extension with plain JavaScript — it's the only thing it needed. I also used Claude Sonnet in the browser to help me understand how TCGPlayer loads its data and how Curiosa.io fetches its card information through tRPC requests.
2026-03-02 05:12:48
Three implementations of the same physics. Python floating-point. Python scaled-integer. Rust BigInt. They must agree bit-for-bit, because the ZK proof is only as correct as the reference implementation it was validated against.
Here's how we cross-validate a ZK system across languages and number representations.
Our zk-physics project maintains three implementations of the Rayleigh-Plesset simulation:
The float reference (reference/simulate.py)
The ground truth for physics. Uses NumPy float64 — the same representation a physicist would use. Human-readable, easy to modify, trusted.
```python
from dataclasses import dataclass

@dataclass
class SimulationParams:
    R0: float     # Equilibrium radius (m)
    P0: float     # Ambient pressure (Pa)
    Pa: float     # Acoustic amplitude (Pa)
    freq: float   # Frequency (Hz)
    gamma: float  # Polytropic exponent
    sigma: float  # Surface tension (N/m)
    mu: float     # Viscosity (Pa*s)
    rho: float    # Density (kg/m³)
    T0: float     # Initial temperature (K)
    dt: float     # Timestep (s)
```
This implementation answers the question: "given these physical parameters, what does the simulation produce?"
The scaled-integer bridge (reference/simulate_scaled.py)
The bridge between float physics and field arithmetic. Uses Python's arbitrary-precision integers with SCALE = 10**30:
```python
SCALE = 10**30

def smul(a, b):
    return (a * b) // SCALE

def sdiv(a, b):
    return (a * SCALE) // b
```
Same physics, different arithmetic. This implementation answers: "does the scaled-integer representation produce the same results as floating-point?"
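To make the representation concrete, here's a tiny self-contained example (the helpers are repeated so the snippet runs standalone; values chosen for readability):

```python
SCALE = 10**30

def smul(a, b):
    return (a * b) // SCALE

def sdiv(a, b):
    return (a * SCALE) // b

# Encode 1.5 and 2.0 as scaled integers, multiply, and check the result.
a = 3 * SCALE // 2   # represents 1.5
b = 2 * SCALE        # represents 2.0
product = smul(a, b) # represents 3.0
assert product == 3 * SCALE
```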
The Rust witness generator (src/witness.rs)
The actual witness generator. Uses num-bigint for arbitrary-precision integers:
```rust
fn smul(a: &BigInt, b: &BigInt) -> BigInt {
    (a * b) / scale()
}

fn sdiv(a: &BigInt, b: &BigInt) -> BigInt {
    (a * scale()) / b
}
```
This implementation answers: "does the Rust witness match the Python reference, and do its values satisfy the halo2 circuit constraints?"
A key architectural decision: the Rust witness uses a single simulate_inner() function for both the collapse preset and the acoustic driving preset:
```rust
fn simulate_inner(
    r0: &BigInt, p0: &BigInt, p_initial: &BigInt,
    dt: &BigInt, dt2: &BigInt, two_sigma: &BigInt,
    four_s: &BigInt, three_halves_s: &BigInt,
    rho: &BigInt, mu: &BigInt,
    r_start: BigInt, n_steps: usize,
    acoustic_dp: impl Fn(usize) -> BigInt,
) -> Vec<BigInt> {
    // Single Verlet integration loop
    // acoustic_dp returns zero for collapse, Pa*sin(wt) for acoustic
}
```
Before our code review, collapse and acoustic had separate simulation loops. This is a correctness hazard: fix a bug in one loop, forget the other. Deduplication means every physics improvement applies everywhere.
The Python test suite (reference/test_simulate.py) runs both Python implementations and compares:
```python
def test_collapse_float_vs_scaled():
    """Float and scaled-integer must agree within tolerance."""
    params_float = CollapsePreset.float_params()
    params_scaled = CollapsePreset.scaled_params()
    trace_float = simulate_float(params_float, n_steps=100)
    trace_scaled = simulate_scaled(params_scaled, n_steps=100)
    for i in range(len(trace_float)):
        r_float = trace_float[i]
        r_scaled = trace_scaled[i] / SCALE
        assert abs(r_float - r_scaled) / r_float < 1e-10, \
            f"Step {i}: float={r_float}, scaled={r_scaled}"
```
The tolerance (10^-10 relative error) accounts for the precision difference between float64 (53-bit mantissa) and exact BigInt arithmetic. Over 100 steps, errors accumulate but stay well within this bound.
The Rust-to-Python comparison uses exported JSON traces:
```python
def test_rust_vs_python():
    """Rust BigInt witness must match Python scaled-integer exactly."""
    rust_trace = json.load(open("output/rust_trace.json"))
    python_trace = simulate_scaled(params, n_steps=len(rust_trace) - 1)
    for i in range(len(rust_trace)):
        assert rust_trace[i] == python_trace[i], \
            f"Step {i}: rust={rust_trace[i]}, python={python_trace[i]}"
```
Note: this comparison is exact (not approximate). Rust BigInt and Python int both do arbitrary-precision integer arithmetic. The same operations on the same inputs must produce identical results.
Even with exact Rust-Python agreement, there's one more gap: the circuit uses field arithmetic (modular inverse), not integer arithmetic (truncating division). These differ whenever a product isn't exactly divisible by SCALE.
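A toy demonstration of the gap — the modulus and scale here are illustrative (the real circuit works over the BN254 scalar field with SCALE = 10^30):

```python
P = 10**9 + 7   # toy prime modulus (illustrative)
SCALE = 10**6   # smaller scale so the numbers stay readable

def int_smul(a, b):
    """Integer version: truncating division by SCALE."""
    return (a * b) // SCALE

def field_smul(a, b):
    """Field version: multiply by the modular inverse of SCALE."""
    scale_inv = pow(SCALE, -1, P)
    return (a * b * scale_inv) % P

# 3 * 5 = 15 is NOT divisible by SCALE: the two arithmetics disagree.
print(int_smul(3, 5))    # -> 0 (truncated away)
print(field_smul(3, 5))  # a large residue, nothing like 0

# When the product IS divisible by SCALE, they agree exactly.
assert int_smul(SCALE, 7) == field_smul(SCALE, 7) == 7
```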
The compute_public_inputs<Fr>() function closes this gap by replaying the entire simulation using field operations:
```rust
pub fn compute_public_inputs<F: PrimeField>(witness: &SimulationWitness) -> Vec<F> {
    let s = F::from_u128(SCALE);
    let s_inv = s.invert().unwrap();
    // Load initial conditions as field elements
    let mut r_curr = F::from_u128(witness.r0);
    // Replay each step using field mul/div (not integer)
    // Extract final temperature and emission count
    // Return [r0, final_temp, total_emission]
}
```
This ensures the public inputs — the values the verifier checks — are computed the same way the circuit computes them. Integer arithmetic provides the witness; field arithmetic provides the verification target.
The full validation chain looks like this:
```
Python float (ground truth physics)
  ↕ agree within 10^-10 relative error
Python scaled-integer (exact BigInt)
  ↕ agree exactly (same arithmetic)
Rust BigInt witness
  ↕ satisfies constraints
halo2 circuit (field arithmetic)
  ↕ matches
compute_public_inputs (field replay)
  ↕ verified by
PLONK verifier (pairing check)
```
Each layer validates the one below it. If any link breaks, we know exactly where the disagreement is:
- Python float ≠ Python scaled: scaling error or precision loss
- Python scaled ≠ Rust BigInt: implementation bug in witness generator
- Rust witness ≠ circuit constraints: witness computation doesn't match gates
- Public inputs ≠ circuit output: field vs. integer arithmetic mismatch
- Verifier rejects: proof generation bug or transcript error
1. Three implementations is not redundant. Each catches different classes of bugs. The float version catches physics errors. The scaled-integer version catches precision errors. The Rust version catches implementation errors.
2. Exact comparison where possible, bounded comparison where necessary. Float-to-scaled comparisons need a tolerance. Scaled-to-BigInt comparisons should be exact. Mixing these up wastes debugging time.
3. Deduplication is a correctness strategy. The simulate_inner() refactor wasn't about code cleanliness — it was about ensuring both presets use the same physics. Every shared line of code is one fewer place for divergence to hide.
4. The field replay is non-optional. You cannot skip compute_public_inputs. Integer arithmetic and field arithmetic give different results. If your public inputs come from integer computation, your proofs will intermittently fail for no apparent reason.
This is Part 7 of our 8-part series. Part 6: Testing ZK Privacy covers the privacy testing framework. Final article: Part 8: Sonoluminescence — The Physics of Light from Nothing — the full science behind the simulation.
New here? Subscribe free for the full series.
Originally published at https://jacobian.ghost.io/python-to-rust-to-proof-cross-validating-a-zk-system/.
2026-03-02 05:07:56
In this article, we will explore two of the most powerful and widely used features in SQL: JOINs and Window Functions. We will begin by understanding what they are and how they work, and then walk through practical examples to see when, where, and why they are used in real-world scenarios.
Let's start with joins:
In Structured Query Language (SQL), a join is a clause used to combine rows from two or more tables based on a related column between them. The purpose of a join is to retrieve data that is spread across multiple tables into a single result, providing a unified view.
Example: In a database with an orders and a customers table, a join can be used to answer questions such as which customer placed an order.
**INNER JOIN**
It combines two or more tables based on a specified common column with matching values and returns only the records that have a match in all the involved tables. Rows that don't have a match in the other table(s) are excluded from the result set.
Example:
Above are the customers and orders tables.
```sql
-- INNER JOIN
select first_name, last_name
from article.customers
inner join article.orders on customers.customer_id = orders.customer_id;
```
Using INNER JOIN based on the specified column customer_id on both the customers and orders table, we are able to join the tables, getting a result of only the records which have customer_ids on both tables.
**LEFT JOIN**
It retrieves all rows from the left (first) table and matching rows from the right (second) table. If a row in the left table has no corresponding match in the right table based on the join condition, the result will contain NULL values for the columns of the right table.
Example:
Using our previous tables
```sql
-- LEFT JOIN
select first_name, last_name, order_date
from article.customers
left join article.orders on customers.customer_id = orders.customer_id;
```
We perform a left join of the customers table with the orders table on the customer_id condition. The query returns all the customers, but some with a NULL value in the order_date column because they are not in the orders table (they have never made an order).
The left join is also known as the left outer join.
**RIGHT JOIN**
It retrieves all rows from the right (second) table and matching rows from the left (first) table. If a row in the right table has no corresponding match in the left table based on the join condition, the result will contain NULL values for the columns of the left table.
```sql
-- RIGHT JOIN
select first_name, last_name, order_date
from article.customers
right join article.orders on customers.customer_id = orders.customer_id;
```
Now, on performing the RIGHT JOIN on our tables, we get all the rows from the right table (orders), along with the matching customer information from the left table (customers). Compared to the LEFT JOIN, the NULL rows are gone: customers who have never placed an order have no row in the orders table, so they are excluded from the result.
The right join is also known as the right outer join.
The FULL OUTER JOIN returns all rows from both the left and right tables, combining matching records and using NULL values for columns where there is no match.
Example:
-- FULL OUTER JOIN
select first_name, last_name, email, phone_number,order_id, order_date, book_id
from article.customers
full outer join article.orders on customers.customer_id = orders.customer_id;
When we perform a FULL OUTER JOIN on the customers and orders tables, the query returns all the rows from both tables, filling in NULL values wherever there is no match.
Now, let's look at window functions:
In SQL, window functions perform calculations across a set of table rows related to the current row without merging those rows into a single output value. Unlike traditional aggregate functions such as SUM() and COUNT(), which reduce multiple rows to one, window functions return a value for each row in the original result set.
They operate over a window (specific set of rows) defined by the OVER() clause.
Select: This defines the columns you want to select from the table (the columns you select create your window).
Function: This is the window function you want to use.
Over Clause: This defines the partitioning and ordering of rows and can be applied with functions to compute aggregated values.
Partition by: This divides rows into partitions based on specified expressions. It is suitable for large datasets because it makes them simpler to manage.
Order by: This is a specified order expression to define the order in which rows will be processed within each partition.
Output column: This is the name you give to your output column.
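Putting those pieces together, the general shape of a window-function query looks like the following template (all names here are placeholders):

```sql
SELECT column1,
       window_function(expression) OVER (
           PARTITION BY partition_column
           ORDER BY order_column
       ) AS output_column
FROM table_name;
```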
1. Aggregate window functions: They calculate aggregates over a window of rows while retaining individual rows.
2. Ranking window functions: They provide rankings of rows within a partition based on specific criteria.
3. Value window functions: They assign values from other rows to the current row. It is usually possible to replicate these functions with two nested queries, so they are less common than aggregate and ranking window functions.
Example:
We have a table employees containing information of different employees from different departments.
Let's use a ranking window function RANK() (which provides a unique rank to each row while skipping duplicates), an OVER() clause to define the partitioning and ordering of rows using PARTITION BY and ORDER BY, then define an output column that will contain our results.
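A query along these lines (assuming the employees table has first_name, last_name, department, and salary columns) would produce the ranking described below:

```sql
SELECT first_name, last_name, department, salary,
       RANK() OVER (
           PARTITION BY department
           ORDER BY salary DESC
       ) AS rank_by_salary
FROM employees;
```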
In our table, we have used the RANK() function to rank employees by department. We applied PARTITION BY department to split the results into the different departments in our table, ORDER BY salary DESC to order the results by descending salary, and then output the RANK() value in the rank_by_salary column.
After the ranking of the Finance department is over, we can see a new ranking starting for the HR department.
Here we use the aggregate window function AVG() to get the average salary for each department and compare it to each employee's salary within that department.
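Such a query could look like the following sketch (again assuming department and salary columns on the employees table):

```sql
SELECT first_name, last_name, department, salary,
       AVG(salary) OVER (PARTITION BY department) AS avg_department_salary
FROM employees;
```

Because AVG() is used as a window function rather than with GROUP BY, every employee row is preserved, each annotated with its department's average.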
To learn more about the other window functions, see your database's documentation on window functions.
In this article, we have gone through what joins and window functions are. Through clear explanations and practical examples, you've learned the different types available, how they work, and how to write clean, efficient SQL queries to get the results you need. Mastering these tools will improve your ability to analyze, transform, and retrieve data with precision. Start applying them in your own projects; the more you practice, the more natural they will feel.
Happy querying!
2026-03-02 05:07:31
The witness and the circuit computed different numbers. Both were "correct." Here's why that's terrifying — and the bugs we found that prove it.
In a ZK proof system, the witness is the prover's secret knowledge. The circuit is the set of constraints that verify the computation. They must agree on every intermediate value.
But here's the thing: the witness is computed by regular Rust code (BigInt arithmetic, floating-point conversions, integer division). The circuit operates in a 254-bit prime field (modular arithmetic, field inversion, no truncation). These are fundamentally different computational models.
When they disagree — even by a single bit in a single intermediate value — the proof is invalid. The prover can't generate a valid proof because the witness values don't satisfy the circuit's constraints.
We found three bugs in our witness generator that illustrate exactly how this happens.
Our original to_scaled function was simple:
fn to_scaled(v: f64) -> BigInt {
    BigInt::from((v * 1e30) as i128)
}
This looks fine. Multiply by 10^30, cast to integer, wrap in BigInt. And it works perfectly for small values.
But f64 has a 53-bit mantissa — roughly 16 decimal digits of precision. When you multiply a typical physical value by 10^30, the result needs 30+ decimal digits. The cast to i128 captures whatever f64 can represent, and silently drops the rest.
For the surface tension parameter (σ = 0.0728 N/m):
f64: 0.0728 * 1e30 = 72800000000000004194304 (only 16 significant digits)
BigInt: 0.0728 * 1e30 = 72800000000000000000000000000 (exact)
That's a relative error of ~10^-16 — negligible for most applications, catastrophic for a ZK proof. The circuit computes with the exact field element. The witness provides a different value. The constraint a * b - c * S = 0 fails by a tiny amount. The proof is invalid.
The fix splits the conversion into two stages, each within f64's precision:
fn float_to_scaled_big(v: f64) -> BigInt {
    let int_part = v.trunc() as u64;
    let frac_part = v - int_part as f64;
    let int_scaled = BigInt::from(int_part) * BigInt::from(10u64).pow(30);
    let frac_hi = (frac_part * 1e15).round() as i64;
    let frac_scaled = BigInt::from(frac_hi) * BigInt::from(10u64).pow(15);
    int_scaled + frac_scaled
}
The integer part gets exact scaling (multiplication by 10^30 in BigInt — no precision loss). The fractional part gets 15 digits (within f64's capacity) and is scaled by 10^15 in BigInt. Combined: 30 digits of precision, matching the circuit exactly.
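The effect is easy to reproduce without any BigInt library. The sketch below uses u128 in place of BigInt (7.28 × 10^28 fits comfortably in 128 bits); the function names and the two-stage split mirror the idea above, but this is a standalone toy, not the production witness generator:

```rust
// Buggy one-shot conversion: the entire product is rounded in f64 first,
// so everything beyond ~16 significant digits is silently lost.
fn to_scaled_buggy(v: f64) -> u128 {
    (v * 1e30) as u128
}

// Two-stage conversion: each stage stays within f64's 53-bit mantissa,
// and the scaling by powers of ten happens in exact integer arithmetic.
fn to_scaled_fixed(v: f64) -> u128 {
    let int_part = v.trunc() as u128;
    let frac_part = v - v.trunc();
    let int_scaled = int_part * 10u128.pow(30);
    let frac_hi = (frac_part * 1e15).round() as u128; // 15 digits: within f64 precision
    int_scaled + frac_hi * 10u128.pow(15)
}

fn main() {
    // 0.0728 * 10^30, computed exactly in integer arithmetic.
    let exact: u128 = 728 * 10u128.pow(26);
    assert_ne!(to_scaled_buggy(0.0728), exact); // off in the low digits
    assert_eq!(to_scaled_fixed(0.0728), exact); // two-stage version matches exactly
    println!("buggy = {}\nfixed = {}", to_scaled_buggy(0.0728), to_scaled_fixed(0.0728));
}
```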
The witness generator had a safety clamp to prevent the bubble radius from going negative:
// BEFORE (BUGGY)
let r_min = BigInt::from(1u64) * scale() / BigInt::from(1_000_000u64);
if r_curr < r_min {
    r_curr = r_min.clone();
}
The intent was reasonable: if numerical instability drives the radius below a minimum, clamp it instead of producing nonsensical physics.
The problem: the circuit has no such clamp. The circuit computes R_next = 2R - R_prev + R''×dt² faithfully, even if the result is smaller than r_min. When the witness clamps but the circuit doesn't, they diverge. The proof fails.
Worse: the clamp was silent. No error, no warning. The witness just quietly produced different numbers than the circuit expected.
The fix replaces the clamp with an assertion:
// AFTER (CORRECT)
assert!(
    r_curr > BigInt::zero(),
    "Radius went non-positive at step {} — reduce timestep or check parameters",
    step
);
If the simulation is physically unstable, we want to know immediately — not discover it as a mysterious proof failure.
The witness struct returned a radius trace:
pub struct SimulationWitness {
    pub r_trace: Vec<u128>,
    // ...
}
This looked like the circuit would use r_trace[i] to set the radius at each step. But in reality, the circuit only uses r_trace[0] (the initial radius). Every subsequent radius is computed by the circuit itself through constrained field arithmetic.
The values r_trace[1..N] were only used for display purposes — showing peak temperature and minimum radius in the CLI output. But a developer reading the code would naturally assume the trace feeds into the circuit. If they modified the trace (say, smoothing it for numerical stability), the witness would be wrong and the proof would fail — for no obvious reason.
The fix was documentation:
/// r_trace[0] is the initial radius loaded into the circuit. All subsequent
/// values r_trace[1..N] are informational only (used for peak temperature
/// and min radius display) — the circuit computes its own trace via
/// constrained field arithmetic starting from r_trace[0].
pub struct SimulationWitness {
    pub r_trace: Vec<u128>,
    // ...
}
This isn't a code change. It's a correctness boundary change. Making explicit which values the circuit trusts and which are advisory prevents an entire class of future bugs.
Even after fixing precision and clamping, there's a fundamental gap between the witness and the circuit.
The witness computes division as integer truncation:
c = floor(a * b / S)
The circuit computes division as field inversion:
c = a * b * S⁻¹ mod p
These give different results whenever a * b is not exactly divisible by S. The integer version truncates. The field version gives the exact multiplicative inverse — a completely different number.
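A toy example makes the divergence concrete. The sketch below uses a tiny prime p = 97 in place of the 254-bit field, and Fermat's little theorem for the inverse; the values of a * b and S are made up for illustration:

```rust
// Modular exponentiation by squaring (sufficient for a toy-sized prime).
fn mod_pow(mut b: i64, mut e: i64, p: i64) -> i64 {
    let mut r = 1;
    b %= p;
    while e > 0 {
        if e & 1 == 1 {
            r = r * b % p;
        }
        b = b * b % p;
        e >>= 1;
    }
    r
}

// Field inverse via Fermat's little theorem: a^(p-2) ≡ a^(-1) (mod p).
fn mod_inv(a: i64, p: i64) -> i64 {
    mod_pow(a, p - 2, p)
}

fn main() {
    let (p, s) = (97i64, 10i64); // toy prime and scale factor
    let ab = 7i64;               // some product a * b, not divisible by s
    let witness_c = ab / s;                 // integer truncation
    let circuit_c = ab * mod_inv(s, p) % p; // field inversion
    assert_eq!(witness_c, 0);  // truncation throws the remainder away
    assert_eq!(circuit_c, 88); // the exact multiplicative inverse: a different number
}
```

The two results only coincide when the division is exact, which is why the public inputs must be replayed with field arithmetic rather than taken from the integer witness.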
For public inputs, this matters: the values the verifier checks must be computed the same way the circuit computes them. Our compute_public_inputs<Fr>() function replays the entire simulation using field arithmetic:
pub fn compute_public_inputs<F: PrimeField>(witness: &SimulationWitness) -> Vec<F> {
    let s = F::from_u128(SCALE);
    let s_inv = s.invert().unwrap();
    // Replay the entire circuit's computation using field ops
    // ...
}
This guarantees the public inputs match what the circuit will produce, regardless of how the BigInt witness was computed.
Every ZK project has a witness problem. The specific bugs differ, but the pattern is always the same:
- Two computational models (witness generator vs. circuit) must agree exactly
- Silent divergence is the default — nothing tells you they disagree until proof generation fails
- Precision mismatches compound across thousands of steps
- Defensive programming (clamps, defaults, fallbacks) actively fights against correctness
The fix is always the same: make the witness generator compute the exact same operations as the circuit, using the same arithmetic. Where that's impossible (because the circuit uses field arithmetic and the witness uses integers), bridge the gap explicitly with field-replay functions.
And document the boundary. Make it impossible to confuse which values feed into the circuit and which are informational. The bugs that kill ZK projects aren't in the math — they're in the assumptions about which code produces which values.
This is Part 5 of our 8-part series. Part 4: 1,088 Bytes to Prove a Star covers the proof pipeline. Next: Part 6: Testing ZK Privacy — how to verify that proofs actually hide what they should.
New here? Subscribe free for the full series.
Originally published at https://jacobian.ghost.io/the-witness-problem-when-bigint-precision-breaks-your-proof/.
2026-03-02 05:03:42
This is a submission for the DEV Weekend Challenge: Community
This project is built for the general public in India — everyday citizens who often have a very limited or poor understanding of laws, especially how they are interpreted and applied in real life. Misunderstandings frequently lead to exploitation, unnecessary legal troubles, or harassment by bad actors (landlords, employers, police, etc.). Many people hesitate to seek help due to complexity, cost, or fear, so this tool aims to democratize access to constitutional rights and basic legal knowledge for the masses.
Kanoon Mera Kawach ("Law is My Shield") is a mobile app that lets users quickly look up provisions from the Indian Constitution, with plans to expand to the Indian Penal Code (IPC) and other key laws.
The goal is to give the common person an accessible way to understand their rights, avoid common pitfalls, and engage in informed community discussions — reducing misinformation and empowering people to stand up for themselves.
Here’s a quick video walkthrough of the app in action:
(If the embed doesn't load, direct link: https://www.youtube.com/watch?v=KIqJ4L53XSk)
The full source code is open on GitHub:
"Law is the shield of the people"
A React Native + Expo mobile app that empowers Indian citizens to understand their legal rights, search Indian laws with plain-English explanations, report civic issues, and participate in community discussions — all through a premium glassmorphic interface.
⚠️ Disclaimer: This is NOT legal advice. This app provides general legal information only. Always consult a qualified lawyer or the relevant authority before taking any action based on information in this app.
Feel free to clone, explore, or contribute!
This was a fun weekend sprint to prototype something meaningful for a real societal need. Would love feedback — especially ideas for expanding the law database or improving the discussion features!
Thanks for checking it out, and thanks to the DEV team for running this Community challenge!