
Bcon Global Launches Non-Custodial Crypto Payment Gateway

2026-01-28 02:04:49

Introduction

Bcon Global has announced the launch of its non-custodial crypto payment gateway designed to help businesses accept crypto payments directly to their own wallets. The platform enables cryptocurrency payments without custody, KYC, or access to private keys, providing full control over funds.

Non-Custodial Crypto Payment Infrastructure

Bcon Global operates as a non-custodial crypto payment system, meaning the service does not store user funds or private keys. Merchants connect only a public wallet address, and all cryptocurrency payments are sent directly on-chain to their wallets.

This architecture removes intermediaries from the payment flow and allows businesses to accept crypto payments without relying on banks, custodial processors, or centralized exchanges.

Accept Crypto Payments on Websites and Applications

The crypto payment gateway allows online stores, SaaS platforms, and digital services to accept cryptocurrency payments on websites and applications. Payments can be received in hot wallets, cold storage wallets, or hardware wallets.

Bcon Global supports common wallet solutions such as MetaMask, Trust Wallet, Ledger, Trezor, Exodus, Electrum, and other wallets that provide public blockchain addresses.

API, Invoices, and Payment Tracking

Bcon Global provides a crypto payment API that includes functions for invoice generation, payment status tracking, and balance verification. Businesses can create invoices in fiat currency, while the system automatically calculates the crypto amount using real-time exchange rates.

Webhooks are available to notify merchants about payment confirmations and status changes, enabling automation and integration into existing systems.
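As a purely illustrative sketch of how a merchant might consume such webhook notifications: the endpoint path, payload fields (invoice_id, status, tx_hash), and signature header below are hypothetical placeholders rather than Bcon Global's documented schema; the actual field names and verification scheme come from the service's API documentation.

# Hypothetical merchant-side webhook receiver; field names and the signature
# header are illustrative placeholders, not Bcon Global's documented API.
import hashlib
import hmac

from flask import Flask, abort, jsonify, request

app = Flask(__name__)
WEBHOOK_SECRET = b"replace-with-your-shared-secret"


def mark_order_paid(invoice_id: str, tx_hash: str) -> None:
    # Placeholder for the merchant's own order-management logic.
    print(f"Invoice {invoice_id} paid in transaction {tx_hash}")


@app.post("/webhooks/crypto-payments")
def handle_payment_webhook():
    # If the gateway signs its webhook payloads, verify the signature first.
    signature = request.headers.get("X-Signature", "")
    expected = hmac.new(WEBHOOK_SECRET, request.data, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        abort(401)

    event = request.get_json()
    if event.get("status") == "confirmed":
        mark_order_paid(event["invoice_id"], event["tx_hash"])
    return jsonify({"received": True})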

E-commerce Integrations

The platform offers ready-to-use crypto payment plugins for e-commerce platforms, including WordPress and OpenCart. These integrations allow merchants to accept crypto payments without complex development or custom infrastructure.

For advanced use cases, developers can integrate directly through the API to build custom cryptocurrency payment flows.

Multi-Blockchain and Stablecoin Support

Bcon Global supports cryptocurrency payments across multiple blockchains, including Bitcoin, Ethereum, BNB Smart Chain, Solana, and TRON. Stablecoin payments, including USDT, USDC, USDS, and TUSD, are supported on major networks.

All transactions are transparent and can be verified using public blockchain explorers.

Pricing Model and Fees

The service applies a flat 1% fee only to payments processed through the Bcon Global API. Transactions received directly in the wallet, outside the service, are not affected. Fees are accounted for separately and do not interfere with wallet balances.

There are no monthly fees, setup costs, or minimum volumes.

Use Cases and Availability

Bcon Global is designed for businesses looking for a cryptocurrency payment processor without custody or KYC. Typical use cases include e-commerce stores, subscription services, freelancers, and developers accepting peer-to-peer crypto payments.

The platform is available globally and supports payments from users worldwide.

About Bcon Global

Bcon Global is a non-custodial crypto payment gateway focused on enabling direct cryptocurrency payments to user-controlled wallets. The service provides infrastructure for accepting crypto payments without intermediaries, custody, or identity verification.

Social Media Links

X|Facebook|Telegram

:::tip This story was published as a press release by Btcwire under HackerNoon’s Business Blogging Program.

:::


Researchers Use Diffusion Models to Simulate DOOM Gameplay in Real Time

2026-01-28 01:43:06

Table of Links

ABSTRACT

1 INTRODUCTION

2 INTERACTIVE WORLD SIMULATION

3 GAMENGEN

3.1 DATA COLLECTION VIA AGENT PLAY

3.2 TRAINING THE GENERATIVE DIFFUSION MODEL

4 EXPERIMENTAL SETUP

4.1 AGENT TRAINING

4.2 GENERATIVE MODEL TRAINING

5 RESULTS

5.1 SIMULATION QUALITY

5.2 ABLATIONS

6 RELATED WORK

7 DISCUSSION, ACKNOWLEDGEMENTS AND REFERENCES


2 INTERACTIVE WORLD SIMULATION

An Interactive Environment E consists of a space of latent states S, a space of partial projections of the latent space O, a partial projection function V : S → O, a set of actions A, and a transition probability function p(s | a, s′) such that s, s′ ∈ S, a ∈ A. For example, in the case of the game DOOM, S is the program’s dynamic memory contents, O is the rendered screen pixels, V is the game’s rendering logic, A is the set of key presses and mouse movements, and p is the program’s logic given the player’s input (including any potential non-determinism). Given an input interactive environment E and an initial state s0 ∈ S, an Interactive World Simulation is a simulation distribution function q(on | o<n, a≤n), with oi ∈ O, ai ∈ A. Given a distance metric between observations D : O × O → R, a policy π(an | o<n, a<n) over agent actions given past actions and observations, an initial state s0 ∈ S, and a rollout length N, the Interactive World Simulation objective is to minimize E(D(o_q^i, o_p^i)), where o_p^i ∼ p and o_q^i ∼ q are sampled observations from the environment and the simulation when enacting the agent’s policy π. Importantly, the conditioning actions for these samples are always obtained by the agent interacting with the environment E, while the conditioning observations can either be obtained from E (the teacher forcing objective) or from the simulation (the auto-regressive objective). We always train our generative model with the teacher forcing objective. Given a simulation distribution function q, the environment E can be simulated by auto-regressively sampling observations.

Figure 3: GameNGen method overview. v-prediction details are omitted for brevity.


3 GAMENGEN

GameNGen (pronounced “game engine”) is a generative diffusion model that learns to simulate the game under the settings of Section 2. In order to collect training data for this model, with the teacher forcing objective, we first train a separate model to interact with the environment. The two models (agent and generative) are trained in sequence. The entirety of the agent’s actions and observations corpus Tagent during training is maintained and becomes the training dataset for the generative model in a second stage. See Figure 3.

3.1 DATA COLLECTION VIA AGENT PLAY

Our end goal is to have human players interact with our simulation. To that end, the policy π as in Section 2 is that of human gameplay. Since we cannot sample from that directly at scale, we start by approximating it via teaching an automatic agent to play. Unlike a typical RL setup which attempts to maximize game score, our goal is to generate training data which resembles human play, or at least contains enough diverse examples, in a variety of scenarios, to maximize training data efficiency. To that end, we design a simple reward function, which is the only part of our method that is environment-specific (see Appendix A.3). We record the agent’s training trajectories throughout the entire training process, which includes different skill levels of play. This set of recorded trajectories is our Tagent dataset, used for training the generative model (see Section 3.2).

3.2 TRAINING THE GENERATIVE DIFFUSION MODEL

We now train a generative diffusion model conditioned on the agent’s trajectories Tagent (actions and observations) collected during the previous stage. We re-purpose a pre-trained text-to-image diffusion model, Stable Diffusion v1.4 (Rombach et al., 2022). We condition the model fθ on trajectories T ∼ Tagent, i.e. on a sequence of previous actions a<n and observations o<n, and remove all text conditioning. Specifically, to condition on actions, we simply learn an embedding Aemb from each action (e.g. a specific key press) into a single token and replace the cross-attention from the text with this encoded action sequence. To condition on observations (i.e. previous frames), we encode them into latent space using the auto-encoder ϕ and concatenate them in the latent channels dimension to the noised latents (see Figure 3). We also experimented with conditioning on these past observations via cross-attention but observed no meaningful improvements. We train the model to minimize the diffusion loss with velocity parameterization (Salimans & Ho, 2022b).

Figure 4: Auto-regressive drift. Top: we present every 10th frame of a simple trajectory with 50 frames in which the player is not moving. Quality degrades fast after 20-30 steps. Bottom: the same trajectory with noise augmentation does not suffer from quality degradation.
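As a rough sketch (not the authors’ code) of the conditioning scheme described in Section 3.2, with made-up tensor sizes: past-frame latents are stacked into the channel dimension of the noised latents, while learned action embeddings stand in for the text tokens that the U-Net’s cross-attention layers would normally attend to.

import torch
import torch.nn as nn

# Hypothetical sizes: batch, latent channels, latent height/width, context length, embedding dim
B, C, H, W, N_CTX, D = 2, 4, 64, 64, 8, 768

noised_latents = torch.randn(B, C, H, W)            # x_t being denoised
context_latents = torch.randn(B, N_CTX * C, H, W)   # encoded past frames ϕ(o<n), stacked per frame

# Observation conditioning: concatenate along the latent channel dimension
unet_input = torch.cat([noised_latents, context_latents], dim=1)  # (B, (N_CTX + 1) * C, H, W)

# Action conditioning: one learned token per discrete action, fed to cross-attention
action_embedding = nn.Embedding(num_embeddings=32, embedding_dim=D)  # plays the role of Aemb
past_actions = torch.randint(0, 32, (B, N_CTX))                      # a<n as integer ids
action_tokens = action_embedding(past_actions)                       # (B, N_CTX, D), replaces text tokens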


3.2.1 MITIGATING AUTO-REGRESSIVE DRIFT USING NOISE AUGMENTATION

The domain shift between training with teacher-forcing and auto-regressive sampling leads to error accumulation and fast degradation in sample quality, as demonstrated in Figure 4. To avoid this divergence due to auto-regressive application of the model, we corrupt context frames by adding a varying amount of Gaussian noise to encoded frames at training time, while providing the noise level as input to the model, following Ho et al. (2021). To that end, we sample a noise level α uniformly up to a maximal value, discretize it and learn an embedding for each bucket (see Figure 3). This allows the network to correct information sampled in previous frames, and is critical for preserving frame quality over time. During inference, the added noise level can be controlled to maximize quality, although we find that even with no added noise the results are significantly improved. We ablate the impact of this method in Section 5.2.2.
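As a minimal sketch of this augmentation (the maximal noise level and number of buckets here are assumed values, not the paper’s), the corruption applied to the encoded context frames at training time looks roughly like:

import torch

def corrupt_context(context_latents, max_noise: float = 0.7, n_buckets: int = 10):
    # Sample a noise level uniformly up to a maximum, discretize it into a bucket
    # (whose learned embedding is fed to the model), and corrupt the context latents.
    alpha = torch.rand(()) * max_noise
    bucket = torch.clamp((alpha / max_noise * n_buckets).long(), max=n_buckets - 1)
    noisy_context = context_latents + alpha * torch.randn_like(context_latents)
    return noisy_context, bucket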

3.2.2 LATENT DECODER FINE-TUNING

The pre-trained auto-encoder of Stable Diffusion v1.4, which compresses 8x8 pixel patches into 4 latent channels, results in meaningful artifacts when predicting game frames, which affect small details and particularly the bottom bar HUD (“heads up display”). To leverage the pre-trained knowledge while improving image quality, we train just the decoder of the latent auto-encoder using an MSE loss computed against the target frame pixels. It might be possible to improve quality even further using a perceptual loss such as LPIPS (Zhang et al., 2018), which we leave to future work. Importantly, note that this fine-tuning process happens completely separately from the U-Net fine-tuning, and that notably the auto-regressive generation isn’t affected by it (we only condition auto-regressively on the latents, not the pixels). Appendix A.2 shows examples of generations with and without fine-tuning the auto-encoder.
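A minimal sketch of this kind of decoder-only fine-tuning, assuming the Stable Diffusion VAE exposed through the diffusers AutoencoderKL interface (this is not the paper’s training code):

import torch
import torch.nn.functional as F

def finetune_decoder_step(vae, frames, optimizer):
    # frames: ground-truth game frames in pixel space, shape (B, 3, H, W)
    with torch.no_grad():                       # the encoder stays frozen
        latents = vae.encode(frames).latent_dist.sample()
    recon = vae.decode(latents).sample          # only decoder parameters receive gradients
    loss = F.mse_loss(recon, frames)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(vae.decoder.parameters(), lr=1e-5)  # decoder params only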

3.3 INFERENCE

3.3.1 SETUP

We use DDIM sampling (Song et al., 2022). We employ Classifier-Free Guidance (Ho & Salimans, 2022) only for the past observations condition o<n. We didn’t find guidance for the past actions condition a<n to improve quality. The weight we use is relatively small (1.5) as larger weights create artifacts which increase due to our auto-regressive sampling.

We also experimented with generating 4 samples in parallel and combining the results, with the hope of preventing rare extreme predictions from being accepted and to reduce error accumulation. We experimented both with averaging the samples and with choosing the sample closest to the median. Averaging performed slightly worse than single frame, and choosing the closest to the median performed only negligibly better. Since both increase the hardware requirements to 4 TPUs, we opt to not use them, but note that this might be an interesting area for future work.

3.3.2 DENOISER SAMPLING STEPS

During inference, we need to run both the U-Net denoiser (for a number of steps) and the auto-encoder. On our hardware configuration (a TPU-v5), a single denoiser step and an evaluation of the auto-encoder both take 10ms. If we ran our model with a single denoiser step, the minimum total latency possible in our setup would be 20ms per frame, or 50 frames per second. Usually, generative diffusion models, such as Stable Diffusion, don’t produce high quality results with a single denoising step, and instead require dozens of sampling steps to generate a high quality image. Surprisingly, we found that we can robustly simulate DOOM with only 4 DDIM sampling steps (Song et al., 2020). In fact, we observe no degradation in simulation quality when using 4 sampling steps vs 20 steps or more (see Appendix A.4). Using just 4 denoising steps leads to a total U-Net cost of 40ms (and total inference cost of 50ms, including the auto-encoder) or 20 frames per second. We hypothesize that the negligible impact on quality with few steps in our case stems from a combination of: (1) a constrained image space, and (2) strong conditioning by the previous frames.

Since we do observe degradation when using just a single sampling step, we also experimented with model distillation similarly to (Yin et al., 2024; Wang et al., 2023) in the single-step setting. Distillation does help substantially there (allowing us to reach 50 FPS as above), but still comes at some cost to simulation quality, so we opt to use the 4-step version without distillation for our method (see Appendix A.4). This is an interesting area for further research. We note that it is trivial to further increase the image generation rate substantially by parallelizing the generation of several frames on additional hardware, similarly to NVidia’s classic SLI Alternate Frame Rendering (AFR) technique. Similarly to AFR, the actual simulation rate would not increase and input lag would not reduce.

:::info Authors:

  1. Dani Valevski
  2. Yaniv Leviathan
  3. Moab Arar
  4. Shlomi Fruchter

:::

:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::


Diffusion Models Are Real-Time Game Engines

2026-01-28 01:43:00

:::info Authors:

  1. Dani Valevski
  2. Yaniv Leviathan
  3. Moab Arar
  4. Shlomi Fruchter

:::

Table of Links

ABSTRACT

1 INTRODUCTION

2 INTERACTIVE WORLD SIMULATION

3 GAMENGEN

3.1 DATA COLLECTION VIA AGENT PLAY

3.2 TRAINING THE GENERATIVE DIFFUSION MODEL

4 EXPERIMENTAL SETUP

4.1 AGENT TRAINING

4.2 GENERATIVE MODEL TRAINING

5 RESULTS

5.1 SIMULATION QUALITY

5.2 ABLATIONS

6 RELATED WORK

7 DISCUSSION, ACKNOWLEDGEMENTS AND REFERENCES

ABSTRACT

We present GameNGen, the first game engine powered entirely by a neural model that also enables real-time interaction with a complex environment over long trajectories at high quality. When trained on the classic game DOOM, GameNGen extracts gameplay and uses it to generate a playable environment that can interactively simulate new trajectories. GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of autoregressive generation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations help ensure stable auto-regressive generation over long trajectories, and decoder fine-tuning improves the fidelity of visual details and text.

Figure 1: A human player is playing DOOM on GameNGen at 20 FPS. See https://gamengen.github.io for multi-minute real-time videos of people playing with GameNGen.

1 INTRODUCTION

Computer games are manually crafted software systems centered around the following game loop: (1) gather user inputs, (2) update the game state, and (3) render it to screen pixels. This game loop, running at high frame rates, creates the illusion of an interactive virtual world for the player. Such game loops are classically run on standard computers, and while there have been many amazing attempts at running games on bespoke hardware (e.g. the iconic game DOOM has been run on kitchen appliances such as a toaster and a microwave, a treadmill, a camera, an iPod, and within the game of Minecraft, to name just a few examples), in all of these cases the hardware is still emulating the manually written game software as-is. Furthermore, while vastly different game engines exist, the game state updates and rendering logic in all are composed of a set of manual rules, programmed or configured by hand.

In recent years, generative models made significant progress in producing images and videos conditioned on multi-modal inputs, such as text or images. At the forefront of this wave, diffusion models became the de-facto standard in media (i.e. non-language) generation, with works like DallE (Ramesh et al., 2022), Stable Diffusion (Rombach et al., 2022) and Sora (Brooks et al., 2024). At a glance, simulating the interactive worlds of video games may seem similar to video generation. However, interactive world simulation is more than just very fast video generation. The requirement to condition on a stream of input actions that is only available throughout the generation breaks some assumptions of existing diffusion model architectures. Notably, it requires generating frames autoregressively which tends to be unstable and leads to sampling divergence (see section 3.2.1). Several important works (Ha & Schmidhuber, 2018; Kim et al., 2020; Bruce et al., 2024) (see Section 6) simulate interactive video games with neural models. Nevertheless, most of these approaches are limited in respect to the complexity of the simulated games, simulation speed, stability over long time periods, or visual quality (see Figure 2). It is therefore natural to ask:

Can a neural model running in real-time simulate a complex game at high quality?

In this work we demonstrate that the answer is yes. Specifically, we show that a complex video game, the iconic game DOOM, can be run on a neural network (an augmented version of the open Stable Diffusion v1.4 (Rombach et al., 2022)), in real-time, while achieving a visual quality comparable to that of the original game. While not an exact simulation, the neural model is able to perform complex game state updates, such as tallying health and ammo, attacking enemies, damaging objects, opening doors, and persist the game state over long trajectories. GameNGen answers one of the important questions on the road towards a new paradigm for game engines, one where games are automatically generated, similarly to how images and videos are generated by neural models in recent years. Key questions remain, such as how these neural game engines would be trained and how games would be effectively created in the first place, including how to best leverage human inputs. We are nevertheless extremely excited for the possibilities of this new paradigm.

Figure 2: GameNGen compared to prior state-of-the-art simulations of DOOM


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::


Alessio Vinassa on The Hidden Skill Behind Every Successful Entrepreneur

2026-01-28 01:20:02

Every entrepreneur talks about vision, resilience, and execution. But beneath all of these traits lies a quieter, less visible skill that ultimately determines success or failure: decision-making under uncertainty.

Markets shift, data is incomplete, and outcomes are rarely guaranteed. The leaders who endure are not those who wait for certainty—but those who learn how to think clearly without it.

According to Alessio Vinassa, serial entrepreneur and business advisor, uncertainty is not an obstacle to leadership—it is the environment in which leadership exists.

“If you wait for perfect information, you’re already behind,” Vinassa says.


“Entrepreneurship is the practice of making meaningful decisions with incomplete data.”

Why Uncertainty Is the Entrepreneur’s Default State

Unlike corporate environments with established processes and historical benchmarks, entrepreneurial decision-making often happens in real time. Founders must decide when to hire, when to pivot, when to say no, and when to move faster—all without knowing how markets, customers, or competitors will respond.

Vinassa emphasizes that uncertainty is not a phase entrepreneurs outgrow.

“The idea that uncertainty disappears as companies grow is a myth,” he explains. “It simply changes shape.”

At early stages, uncertainty revolves around product fit and survival. Later, it shifts toward leadership, culture, expansion, and reputation. The cognitive challenge remains the same: choosing a direction without guarantees.

Decision-Making Is a Cognitive Skill, Not Just Instinct

Many founders describe their decisions as “gut-driven,” but Vinassa argues that effective intuition is built—not innate.

“Strong instincts are trained through exposure, reflection, and pattern recognition,” he says. “They are not emotional impulses.”

Successful entrepreneurs develop internal frameworks that help them evaluate risk, weigh trade-offs, and act decisively without becoming reckless. These frameworks allow leaders to remain calm while others freeze or overreact.

Separating Signal from Noise

One of the greatest challenges in uncertain environments is information overload. Data, opinions, forecasts, and external pressure can cloud judgment.

Vinassa highlights the importance of filtering.

“Not all information deserves equal attention,” he notes. “Good decision-makers know what to ignore.”

Experienced leaders learn to prioritize first-order effects over speculation, focusing on what directly influences outcomes rather than chasing every possible scenario. This discipline reduces cognitive fatigue and improves consistency.

Reversibility vs. Irreversibility

One mental model Vinassa frequently references is the distinction between reversible and irreversible decisions.

“Most decisions are not permanent,” he explains. “Treating them as such creates unnecessary paralysis.”

Reversible decisions—such as testing a new strategy or experimenting with a process—should be made quickly and adjusted as needed. Irreversible decisions—those affecting reputation, ethics, or long-term trust—require deeper consideration.

Understanding this distinction allows entrepreneurs to move faster without becoming careless.

Managing Emotional Bias Under Pressure

Uncertainty often triggers fear, ego, or urgency. Vinassa believes emotional regulation is a critical but underestimated leadership skill.

“You’re not just managing a business—you’re managing your own psychology,” he says.

Effective decision-makers create distance between emotion and action. They pause, reflect, and seek perspective before committing. This does not mean avoiding risk, but approaching it with clarity rather than anxiety.

Building Confidence Without Certainty

Confidence in leadership does not come from knowing outcomes—it comes from trusting one’s process.

“Confidence is the belief that you can respond well, even if the decision doesn’t work out,” Vinassa explains.

Entrepreneurs who view decisions as experiments rather than verdicts are better equipped to adapt. Failure becomes feedback, not identity.

Decision-Making as a Team Sport

As organizations grow, decision-making must scale beyond the founder. Vinassa stresses the importance of building cultures that support distributed judgment.

“Strong leaders don’t make every decision,” he says. “They build systems that produce good decisions.”

Clear principles, aligned incentives, and psychological safety allow teams to navigate uncertainty together rather than bottlenecking leadership.

The Long-Term Advantage

Over time, entrepreneurs who master decision-making under uncertainty gain a powerful advantage. They move faster, recover quicker, and inspire confidence in others.

“People follow leaders who can stay grounded when outcomes are unclear,” Vinassa notes. “Calm is contagious.”

In volatile markets, this steadiness becomes a competitive differentiator.

A Skill Worth Developing

Decision-making under uncertainty is not glamorous, but it is foundational. It shapes strategy, culture, and outcomes more than any single idea.

As Vinassa puts it: “Ideas matter, but decisions determine destiny.”

Alessio Vinassa is a serial entrepreneur, business strategist, and thought leader focused on leadership, adaptability, and building resilient businesses in fast-changing global markets. His work centers on mentorship, innovation, and helping entrepreneurs navigate complexity with clarity and purpose.

For more information on Alessio and his work, visit his website or follow him across social media, including Facebook, Instagram, LinkedIn, X, Youtube, and Medium.

:::tip This story was published as a press release by Btcwire under HackerNoon’s Business Blogging Program.

:::


Your AI Model Isn’t Broken. Your Data Is

2026-01-28 01:10:34

The Hidden Tax of Dirty Data on AI Performance — and a Developer’s Shortcut to Fixing It

You’ve trained a customer segmentation model, but it’s clustering addresses from different states together. Your recommendation engine is suggesting products to customers in countries you don’t ship to. Your fraud detection system keeps flagging legitimate international transactions.

Sound familiar? Here’s the uncomfortable truth: Your model isn’t failing. Your data is.

As we discussed in my previous piece on data quality’s role in AI accuracy, garbage in = garbage out. But what we didn’t cover was the practical reality: cleaning data is notoriously painful. It’s the unglamorous, time-consuming work that derails AI projects before they even see production.

What if I told you there’s a shortcut? Not a theoretical framework, but actual APIs that solve 80% of your data quality problems before they ever reach your model.

The Real Cost of "Just Fixing It Ourselves"

I’ve been in those sprint planning meetings. The team agrees: “We need clean address data.” Then comes the estimate: 3-4 sprints to build validation logic, source international reference data, handle edge cases, maintain updates…

Let’s break down what “building it yourself” actually entails for common data points:

For address validation alone:

  • Building parsers for different international formats
  • Maintaining postal code databases across 240+ countries
  • Geocoding and standardization logic
  • Handling address changes and updates

For identity/contact data:

  • Phone number formatting and validation per country
  • Email syntax and deliverability checking
  • Name parsing and normalization

For demographic data:

  • Date/age validation
  • Gender categorization pitfalls
  • Cultural naming conventions

That’s months of development time — time spent rebuilding what already exists as robust, maintained services.

The Developer’s Dilemma: Build vs. Buy vs. Burnout

Most engineering teams face the same crossroads:

  1. Build from scratch (and become a data quality team instead of an AI team)
  2. Patch with regex (and watch edge cases pile up in production)
  3. Ignore it (and wonder why model accuracy degrades)

There’s a fourth option: Consume quality as a service.

This is where Melissa’s APIs changed my team’s workflow. What started as a “let’s try it for addresses” experiment turned into a comprehensive data quality strategy.

Real-World Integration: How Teams Are Doing This Today

Case 1: The E-Commerce Recommendation Engine Fix

The Problem: A mid-sized retailer’s product recommendation model was underperforming. Analysis showed 23% of customer addresses had formatting issues, causing incorrect regional clustering.

The Melissa Solution: They piped customer data through the Global Address Verification API during signup and before batch training runs.

python

# Simplified integration example
import requests
import pandas as pd

def clean_address_for_training(df):
    # Pre-process with Melissa before training
    for index, row in df.iterrows():
        response = requests.post(
            'https://api.melissa.com/v2/address/verify',
            json={
                'address': row['raw_address'],
                'country': row['country'],
                'apikey': YOUR_API_KEY
            }
        )
        if response.json()['valid']:
            df.at[index, 'clean_address'] = response.json()['standardized']
            df.at[index, 'latitude'] = response.json()['coordinates']['lat']
            df.at[index, 'longitude'] = response.json()['coordinates']['lon']
    return df

# Use clean data for geospatial features
df_clean = clean_address_for_training(raw_customer_data)

The Result: Regional clustering accuracy improved by 31%. Shipping cost predictions became significantly more accurate because distances were calculated from verified coordinates.

Case 2: The FinTech Fraud Detection Boost

The Problem: A payment processor’s fraud model had high false positives on international transactions due to inconsistent phone and identity data.

The Melissa Solution: They implemented a pre-processing pipeline using:

  1. Phone Verification API to validate and format numbers
  2. Global Name Verification to normalize customer names
  3. Email Verification to check deliverability

javascript

// Webhook-based verification for real-time applications
app.post('/api/transaction', async (req, res) => {
  const { customerPhone, customerEmail, billingAddress } = req.body;

  // Parallel verification calls
  const [phoneValid, emailValid, addressValid] = await Promise.all([
    melissa.verifyPhone(customerPhone),
    melissa.verifyEmail(customerEmail),
    melissa.verifyAddress(billingAddress)
  ]);

  // Create verified features for fraud model
  const verificationFeatures = {
    phone_score: phoneValid.confidence,
    email_score: emailValid.deliverability,
    address_match: addressValid.valid
  };

  // Pass enhanced data to ML model
  const fraudProbability = await fraudModel.predict({
    ...req.body,
    ...verificationFeatures
  });

  res.json({ risk_score: fraudProbability });
});

The Result: False positives decreased by 44% while catching 15% more actual fraud through better identity linking.

The Practical Integration Guide: Where Data Quality APIs Fit in Your ML Pipeline

Option 1: Pre-Training Batch Processing (Easiest Start)

python

# Your existing data preparation script, enhanced
import pandas as pd

import melissa_sdk

def prepare_training_data(csv_path):
    df = pd.read_csv(csv_path)

    # Initialize client
    client = melissa_sdk.Client(api_key=API_KEY)

    # Batch process critical fields
    df['clean_address'] = client.addresses.batch_verify(df['address'].tolist())
    df['valid_email'] = client.emails.batch_verify(df['email'].tolist())

    # Now train with clean data
    model.fit(df[['clean_address', 'valid_email', 'other_features']], df['target'])

Option 2: Real-Time Feature Engineering (Production-Ready)

For models making real-time predictions (credit scoring, recommendations, fraud detection), bake verification into your feature engineering pipeline:

python

# Feature engineering service
from flask import Flask, request, jsonify

app = Flask(__name__)

class VerifiedFeatures:
    def __init__(self):
        self.melissa = MelissaClient(API_KEY)

    def enrich(self, raw_data):
        features = {}

        # Address-based features
        addr = self.melissa.verify_address(raw_data['address'])
        features['address_valid'] = addr.valid
        features['address_type'] = addr.type  # Commercial, Residential, etc.
        features['geohash'] = self._to_geohash(addr.coordinates)

        # Phone-based features
        phone = self.melissa.verify_phone(raw_data['phone'])
        features['phone_carrier'] = phone.carrier
        features['phone_country'] = phone.country

        return {**raw_data, **features}

# Use in your prediction endpoint
@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    enriched_data = feature_service.enrich(data)
    prediction = model.predict(enriched_data)
    return jsonify({'prediction': prediction})

Option 3: The Hybrid Approach (Most Teams)

Most teams we work with use a combination (a rough sketch follows the list):

  1. Batch clean historical training data
  2. Real-time verify at inference time
  3. Periodic re-cleaning of training datasets
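
A rough sketch of that hybrid pattern, reusing the illustrative client wrapper from the earlier snippets (the class and method names are placeholders, not Melissa's published SDK):

python

# Hybrid pattern: batch-clean history, verify at inference, re-clean on a schedule.
# The client methods below are the same illustrative placeholders used above.
import pandas as pd


class HybridDataQuality:
    def __init__(self, client):
        self.client = client

    def batch_clean(self, df: pd.DataFrame) -> pd.DataFrame:
        # 1. One-off cleanup of historical training data
        df['clean_address'] = self.client.addresses.batch_verify(df['address'].tolist())
        df['valid_email'] = self.client.emails.batch_verify(df['email'].tolist())
        return df

    def verify_at_inference(self, record: dict) -> dict:
        # 2. Real-time verification of a single incoming record before scoring
        clean = self.client.addresses.batch_verify([record['address']])[0]
        return {**record, 'clean_address': clean}

    def periodic_reclean(self, csv_path: str, out_path: str) -> None:
        # 3. Scheduled re-cleaning, e.g. run weekly from a cron job or Airflow DAG
        df = self.batch_clean(pd.read_csv(csv_path))
        df.to_csv(out_path, index=False)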

Why This Isn't Just Another Vendor Pitch

I was skeptical too. The market is full of "data quality" tools that add complexity instead of reducing it. What changed my mind:

  1. The API-First Design: No enterprise sales calls needed. You can literally sign up at developer.melissa.com, get a key, and make your first call in 5 minutes.
  2. The Coverage: We needed to handle addresses in 14 countries initially. Melissa covered all of them plus 230 more we might expand into.
  3. The Accuracy Rates: For North American addresses, we consistently see 99%+ validation accuracy. International varies by country but stays above 95% for most developed nations.
  4. The Cost Math: When I calculated engineer-hours to build vs. API costs, it wasn't even close. At our scale, we'd need half a dedicated engineer to maintain what $X/month in API calls provides.

Your Actionable Checklist for Next Sprint

  1. Audit Your Training Data: Pick one model and check a 1000-row sample for address/phone/email validity rates (a rough audit sketch follows this checklist).
  2. Run a Cost-Benefit: Estimate engineering time to build vs. API costs. Use Melissa's pricing page for numbers.
  3. Prototype in an Hour: Pick one endpoint (start with Global Address Verification) and clean a sample dataset.
  4. Measure Impact: A/B test model performance with cleaned vs. raw data for a single feature.
  5. Decide Scope: Batch only? Real-time? Hybrid?
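
For step 1, a quick-and-dirty audit needs nothing more than pandas and a couple of crude heuristics. The column names and regexes below are assumptions for illustration; they only flag obviously malformed values and are no substitute for real verification:

python

# Rough audit of a sample: counts obviously malformed emails/phones/addresses.
# Column names ("email", "phone", "address") are assumed for this sketch.
import pandas as pd

EMAIL_RE = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
PHONE_RE = r"^\+?[\d\s\-().]{7,20}$"


def audit_sample(csv_path: str, n: int = 1000) -> pd.Series:
    df = pd.read_csv(csv_path)
    df = df.sample(n=min(n, len(df)), random_state=42)
    return pd.Series({
        "email_ok_rate": df["email"].astype(str).str.match(EMAIL_RE).mean(),
        "phone_ok_rate": df["phone"].astype(str).str.match(PHONE_RE).mean(),
        "address_nonempty_rate": df["address"].astype(str).str.strip().str.len().gt(5).mean(),
    })


print(audit_sample("customers.csv"))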

The Bottom Line for AI Teams

Data quality isn't a "nice-to-have" — it's your model's foundation. But foundation work shouldn't mean reinventing the wheel for every project.

The strategic shift isn't from "ignoring data quality" to "building everything in-house." It's from "building" to "orchestrating" — leveraging specialized tools so you can focus on what makes your AI unique.

Your next step: Pick one training dataset this week. Run it through verification for just one field (addresses or emails). Compare the "before" and "after" distributions. You'll see the noise removed from your signal immediately.

Then ask yourself: is data cleaning really where you want your team's innovation energy going?

Have you implemented data quality APIs in your ML pipeline? What was your experience? Share your stories (or horror stories) in the comments below.

:::tip Ready to experiment? Start with their developer portal: developer.melissa.com

:::



What Really Determines the Speed of Your PyTorch Code?

2026-01-28 01:02:27

Anyone who works with PyTorch model code starts asking the same questions:

Why is this taking so long? How do I make my training loop faster?

Whether you’re an ML engineer, a researcher or just decided to play around with a random ML repository over the weekend, you will eventually try to understand how to speed your code up.

However, before we can do that, we need to learn how to measure performance correctly, and then draw the right conclusions from those measurements. This article is about exactly that: properly benchmarking CUDA and PyTorch code.

Execution flow

Let’s use matrix multiplication on an H100 as the running example throughout.

However, before we think about measurements, let’s get the terminology straight. By “kernel” we mean a function (or really any operation) that runs on a GPU. Your code might have a lot of these, from tens to thousands of operations during each iteration of the training loop, for example:

  • torch.zeros((16, 16), device="cuda")
  • a + b, where a and b are tensors in GPU memory
  • a @ b
  • flash_attention(q, k, v)
  • and so on

Despite the kernel being executed on a GPU, its launch is controlled by the CPU. If we imagine a Python file with the training code, it alternates between “preparatory” work, such as for-loops or if-statements, and actual kernel launches. Because of this, the CPU runs ahead and whenever it hits a kernel, it schedules that kernel for execution on the GPU by adding it to a queue. When the next kernel gets added to the queue, there is no guarantee that the previous one has already finished.

CPU-to-GPU scheduling. The CPU triggers kernel launches and runs some additional processing work in between.

This allows the GPU to avoid sitting idle: it can always find work in that kernel queue (except when synchronization is explicitly called or when execution is CPU-bound).

The naïve approach and why it’s wrong

When we first think about measuring time in Python, the natural instinct is to reach for the time module and run something like this:

import time

import torch


def matmul(a, b):
    return a @ b


def get_data(m, n, k):
    a = torch.randn(m, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, k, device="cuda", dtype=torch.bfloat16)
    return a, b


def benchmark_naive(m, n, k, n_iters: int = 100):
    a, b = get_data(m, n, k)
    times = []
    for _ in range(n_iters):
        start = time.perf_counter_ns()
        matmul(a, b)
        end = time.perf_counter_ns()
        times.append(end - start)        
    # perf_counter_ns returns nanoseconds, convert to ms
    return sum(times) / n_iters / 1e6


small_shapes_time = benchmark_naive(16, 32, 16)
large_shapes_time = benchmark_naive(4096, 8192, 4096)

print(f"(16, 32) by (32, 16) matmul time: {small_shapes_time:.5f} ms")
print(f"(4096, 8192) by (8192, 4096) matmul time: {large_shapes_time:.5f} ms")

# (16, 32) by (32, 16) matmul time: 0.01415 ms
# (4096, 8192) by (8192, 4096) matmul time: 0.01350 ms

If we run this code on two different sets of shapes, (16, 32) × (32, 16) and (4096, 8192) × (8192, 4096), we’d see that there’s basically no difference in matmul time. In fact, in my run, it is actually faster to multiply the matrices whose dimensions are 256 times larger. This result is very suspicious and should make us think about what this code actually measures.

Actually, we never wait for the GPU matmul to finish. Since all the “preparatory” code runs on the CPU, time.perf_counter_ns() is actually measuring the time it takes for us to schedule the matmul kernel into the GPU queue and move on with our lives.

time.perf_counter measures the CPU overhead instead of the actual GPU time

What we can do is to add torch.cuda.synchronize() after the matmul to force the CPU to wait for the GPU. But that’s still not ideal, because we’re measuring elapsed wall-clock time on the CPU, which includes scheduling overhead. We do not isolate the actual GPU execution this way.

time.perf_counter-based measurement for matrix multiplication
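
For completeness, here is roughly what that synchronize-based variant looks like, reusing matmul and get_data from the snippet above. It fixes the “nothing was actually measured” problem, but still folds CPU scheduling overhead into the number:

import time

import torch


def benchmark_with_sync(m, n, k, n_iters: int = 100):
    a, b = get_data(m, n, k)
    torch.cuda.synchronize()              # make sure setup work has finished
    times = []
    for _ in range(n_iters):
        start = time.perf_counter_ns()
        matmul(a, b)
        torch.cuda.synchronize()          # wait for the GPU to actually finish
        end = time.perf_counter_ns()
        times.append(end - start)
    return sum(times) / n_iters / 1e6     # nanoseconds -> ms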


Measuring with CUDA events

The correct way to measure the GPU time instead of the CPU time is with CUDA events. These are markers we can insert directly into the GPU’s execution stream. The GPU records timestamps whenever it reaches each CUDA event, giving us the actual execution time on the device.

import torch


def matmul(a, b):
    return a @ b


def get_data(m, n, k):
    a = torch.randn(m, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, k, device="cuda", dtype=torch.bfloat16)
    return a, b


def benchmark_cuda_events(m, n, k, n_iters: int = 100):
    a, b = get_data(m, n, k)
    start_events = [torch.cuda.Event(enable_timing=True) for _ in range(n_iters)]
    end_events = [torch.cuda.Event(enable_timing=True) for _ in range(n_iters)]
    for it in range(n_iters):
        start_events[it].record()
        matmul(a, b)
        end_events[it].record()
    torch.cuda.synchronize()
    times = [start.elapsed_time(end) for start, end in zip(start_events, end_events)]
    return sum(times) / n_iters


small_shapes_time = benchmark_cuda_events(16, 32, 16)
large_shapes_time = benchmark_cuda_events(4096, 8192, 4096)

print(f"(16, 32) by (32, 16) matmul time: {small_shapes_time:.5f} ms")
print(f"(4096, 8192) by (8192, 4096) matmul time: {large_shapes_time:.5f} ms")

# (16, 32) by (32, 16) matmul time: 0.02093 ms
# (4096, 8192) by (8192, 4096) matmul time: 0.34025 ms

The events get inserted into the same stream as our matmul kernel. Setting enable_timing=True ensures that timestamps are recorded. The .elapsed_time method gives us the difference between the start and end timestamps.

Now we get real numbers, not absurdly fast results from the naive approach. From this benchmark, the matmul with large shapes takes at least 15x longer.

CUDA-events-based benchmarking measures the actual GPU time

It is also worth noting that it’s better to include warmup iterations at the beginning, because there are often one-time costs included in the first launch, such as the JIT compilation. We will include warmup in the next measurements.


The L2 cache

Another issue with our current measurement is the lack of an L2 cache flush. When we run the same kernel repeatedly with the same data (and this is exactly what we’re doing), that data stays in the GPU’s L2 cache.

In NVIDIA GPUs, there’s a memory hierarchy: HBM is the largest and slowest, followed by progressively smaller and faster memory units (L2 cache → L1 cache or Shared Memory → register memory). The L2 cache is a relatively large chunk of memory that reduces the main memory accesses.

| GPU | Generation | L2 per GPU | HBM per GPU |
|----|----|----|----|
| V100 | Volta | 6MB | 32GB |
| A100 | Ampere | 40MB | 80GB |
| H100 | Hopper | 50MB | 80GB |
| B200 | Blackwell | 126MB | 192GB |

Because of this cache, our measurements might show better results than what we’d see in production, where data changes between iterations. To get more realistic measurements, we’d need to flush the cache. One way to do that is to allocate a large buffer and update it at the beginning of each iteration. This is, of course, a heuristic rather than a guaranteed cache flush:

import torch


def flush_l2_cache():
    # On H100, the L2 cache is 50MB, so we allocate something a bit bigger
    cache_size = 60 * 1024 * 1024
    buffer = torch.zeros(cache_size // 4, dtype=torch.float32, device="cuda")
    buffer += 1
    del buffer


def matmul(a, b):
    return a @ b


def get_data(m, n, k):
    a = torch.randn(m, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, k, device="cuda", dtype=torch.bfloat16)
    return a, b


def benchmark_with_cache_flush(m, n, k, n_iters: int = 100):
    a, b = get_data(m, n, k)

    for _ in range(10):
        matmul(a, b)  # warmup
    torch.cuda.synchronize()

    start_events = [torch.cuda.Event(enable_timing=True) for _ in range(n_iters)]
    end_events = [torch.cuda.Event(enable_timing=True) for _ in range(n_iters)]
    for it in range(n_iters):
        flush_l2_cache()  # flush the L2 between iterations
        start_events[it].record()
        matmul(a, b)
        end_events[it].record()
    torch.cuda.synchronize()
    times = [start.elapsed_time(end) for start, end in zip(start_events, end_events)]
    return sum(times) / n_iters


small_shapes_time = benchmark_with_cache_flush(16, 32, 16)
large_shapes_time = benchmark_with_cache_flush(4096, 8192, 4096)

print(f"(16, 32) by (32, 16) matmul time: {small_shapes_time:.5f} ms")
print(f"(4096, 8192) by (8192, 4096) matmul time: {large_shapes_time:.5f} ms")

# (16, 32) by (32, 16) matmul time: 0.00634 ms
# (4096, 8192) by (8192, 4096) matmul time: 0.40575 ms

There is a noticeable difference between this result and the previous one, but re-running the code several times shifts the numbers a bit, and the difference is well within the expected run-to-run variance - so for this particular problem the cache did not have a significant impact.


Built-in solutions

Do we have to remember all of this just to benchmark something? The answer is no. In fact, there’s a library called Triton, which was originally developed by OpenAI back in 2021 for writing GPU kernels in a Pythonic way. Nowadays it has a built-in testing module with proper benchmarking utils, which are capable of doing all of the above on their own. Let’s take a look:

# Source: https://github.com/triton-lang/triton/blob/main/python/triton/testing.py#L127C1-L190C64

def do_bench(fn, warmup=25, rep=100, grad_to_none=None, quantiles=None, return_mode="mean"):
    """
    Benchmark the runtime of the provided function. By default, return the median runtime of :code:`fn` along with
    the 20-th and 80-th performance percentile.

    :param fn: Function to benchmark
    :type fn: Callable
    :param warmup: Warmup time (in ms)
    :type warmup: int
    :param rep: Repetition time (in ms)
    :type rep: int
    :param grad_to_none: Reset the gradient of the provided tensor to None
    :type grad_to_none: torch.tensor, optional
    :param quantiles: Performance percentile to return in addition to the median.
    :type quantiles: list[float], optional
    :param return_mode: The statistical measure to return. Options are "min", "max", "mean", "median", or "all". Default is "mean".
    :type return_mode: str
    """
    assert return_mode in ["min", "max", "mean", "median", "all"]

    di = runtime.driver.active.get_device_interface()

    fn()
    di.synchronize()

    cache = runtime.driver.active.get_empty_cache_for_benchmark()

    # Estimate the runtime of the function
    start_event = di.Event(enable_timing=True)
    end_event = di.Event(enable_timing=True)
    start_event.record()
    for _ in range(5):
        runtime.driver.active.clear_cache(cache)
        fn()
    end_event.record()
    di.synchronize()
    estimate_ms = start_event.elapsed_time(end_event) / 5

    # compute number of warmup and repeat
    n_warmup = max(1, int(warmup / estimate_ms))
    n_repeat = max(1, int(rep / estimate_ms))
    start_event = [di.Event(enable_timing=True) for i in range(n_repeat)]
    end_event = [di.Event(enable_timing=True) for i in range(n_repeat)]
    # Warm-up
    for _ in range(n_warmup):
        fn()
    # Benchmark
    for i in range(n_repeat):
        # we don't want `fn` to accumulate gradient values
        # if it contains a backward pass. So we clear the
        # provided gradients
        if grad_to_none is not None:
            for x in grad_to_none:
                x.grad = None
        # we clear the L2 cache before each run
        runtime.driver.active.clear_cache(cache)
        # record time of `fn`
        start_event[i].record()
        fn()
        end_event[i].record()
    # Record clocks
    di.synchronize()
    times = [s.elapsed_time(e) for s, e in zip(start_event, end_event)]
    return _summarize_statistics(times, quantiles, return_mode)

In addition to what we’ve covered, do_bench allows you to get several statistics out of the benchmark, not just the mean. For instance, one might request the median or the 99th percentile. There are other useful arguments, which are covered in the docstring.

To use this function, we can simply call:

import triton

bench_time = triton.testing.do_bench(lambda: matmul(a, b))
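
Based on the signature above, you can also ask for a different summary statistic or specific percentiles, again reusing matmul and get_data from the earlier snippets:

import triton

a, b = get_data(4096, 8192, 4096)

# Median instead of the default mean
median_ms = triton.testing.do_bench(lambda: matmul(a, b), return_mode="median")

# Or request specific percentiles, here the 20th, 50th and 80th
quantile_ms = triton.testing.do_bench(lambda: matmul(a, b), quantiles=[0.2, 0.5, 0.8])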


CPU-bound execution and CUDA graphs

For kernels with a very short execution time, the launch overhead might become so large that it exceeds the kernel time itself. The same happens when there’s significant CPU work (usually unoptimized) between kernel launches. This would severely affect our measurements.

GPU stays idle because of the CPU overhead

Let’s add a looong for-loop into our matmul function, since we know that for-loops are really slow on the CPU in Python:

def cpu_heavy_matmul(a, b):
    cnt = 0
    for _ in range(100_000):
        cnt += 1
    return a @ b

We’ll see that the (4096, 8192) by (8192, 4096) matmul time increased from 0.4ms to 2.2ms - more than 5x!

Now, depending on what your actual training run looks like, you might want to either eliminate the CPU overhead from the benchmarking results or not. For example, if your training is already CPU-bound, then the same problems would arise during the production run as well. Then, most likely, it does not make sense to get rid of the CPU time in your measurements. However, in normal training runs, the CPU should not stand in the way of the optimal training performance - usually the CPU runs ahead of the GPU. In that case, even if that one kernel has a significant overhead, it won’t alter the kernel’s runtime on the GPU. Then we’d want to eliminate the overhead from our measurements.

To do this, we can use CUDA graphs. CUDA graphs let us record a series of kernels once and replay them with minimal CPU involvement. In Triton, there’s triton.testing.do_bench_cudagraph, which provides the same benchmarking functionality on top of CUDA graphs. Using this for our new matmul, we get back the initial kernel time (before the CPU overhead was introduced).

a, b = get_data(4096, 8192, 4096)
triton.testing.do_bench_cudagraph(lambda: cpu_heavy_matmul(a, b))  # back to ~0.4ms!

Of course, CUDA graphs come with their own constraints (static shapes, static control flow and so on). They are best suited for isolating some chunks of the GPU performance rather than e2e training loops.
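
If you are curious what the Triton helper is doing under the hood, here is a minimal sketch of capturing and replaying a kernel sequence directly with torch.cuda.CUDAGraph (static inputs, reusing matmul and get_data from above):

import torch

a, b = get_data(4096, 8192, 4096)

# Warm up on a side stream before capture, as the capture API expects
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        matmul(a, b)
torch.cuda.current_stream().wait_stream(s)

# Capture the kernel sequence once...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = matmul(a, b)

# ...then replay it with almost no CPU involvement
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    graph.replay()
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / 100:.5f} ms per replay")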


Not only systems fail - looking at data

Sometimes even perfect benchmarking methodology doesn’t tell the whole story, because the production conditions are different from the isolated environment where we run the benchmark.

For example, let’s say we want to benchmark the grouped gemm kernel inside of a Mixture-of-Experts layer. In a typical MoE layer, tokens are dynamically routed to different experts based on router probabilities. As a result, the actual workload seen by each expert (and by the grouped gemm kernel) depends on how balanced the routing is.

Naively, we’d generate random router probabilities and artificially route tokens based on them. However, throughout training, our router might be imbalanced, utilizing several experts much more heavily than the others.

Let’s see how the underlying distribution of tokens routed to each expert affects our measurements.

import numpy as np
import torch
import triton


def sample_expert_assignments(
    seq_len: int,
    num_experts: int,
    top_k: int,
    use_beta: bool = True,
    alpha: float = 1.0,
    beta: float = 5.0,
) -> tuple[torch.Tensor, torch.Tensor]:
    if use_beta:
        expert_weights = np.random.beta(alpha, beta, num_experts)
        expert_weights = expert_weights / expert_weights.sum()
    else:
        expert_weights = np.ones(num_experts) / num_experts

    gumbel_noise = -np.log(-np.log(np.random.uniform(0, 1, (seq_len, num_experts))))

    log_weights = np.log(expert_weights + 1e-10)
    scores = log_weights[None, :] + gumbel_noise

    expert_indices = torch.from_numpy(np.argsort(scores, axis=1)[:, -top_k:])

    tokens_per_expert = torch.bincount(
        expert_indices.flatten(), minlength=num_experts
    ).to(torch.int32)

    return expert_indices, tokens_per_expert


def get_tokens(
    seq_len: int,
    hidden: int,
    num_experts: int,
    top_k: int,
    expert_indices: torch.Tensor,
    tokens_per_expert: torch.Tensor,
    device: str = "cuda",
) -> torch.Tensor:
    x_original = torch.randn((seq_len, hidden), dtype=torch.bfloat16, device=device)

    token_indices = torch.arange(seq_len, device=device)[:, None].expand(-1, top_k)
    expert_flat = expert_indices.to(device).flatten()
    token_flat = token_indices.flatten()

    token_sorted = token_flat[torch.argsort(expert_flat)]

    return x_original[token_sorted]


def get_tensors(
    seq_len: int = 1024,
    hidden: int = 4096,
    intermediate: int = 1536,
    num_experts: int = 128,
    top_k: int = 8,
    use_beta: bool = True,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    expert_indices, tokens_per_expert = sample_expert_assignments(
        seq_len,
        num_experts,
        top_k,
        use_beta=use_beta,
    )

    x = get_tokens(
        seq_len,
        hidden,
        num_experts,
        top_k,
        expert_indices,
        tokens_per_expert,
        device="cuda",
    )

    w = torch.randn(
        (num_experts, hidden, intermediate), dtype=torch.bfloat16, device="cuda"
    )

    offsets = torch.cumsum(tokens_per_expert, dim=0).to(
        dtype=torch.int32, device="cuda"
    )

    return x, w, offsets


def benchmark(
    seq_len: int = 1024,
    hidden: int = 4096,
    intermediate: int = 1536,
    num_experts: int = 128,
    top_k: int = 8,
    use_beta: bool = True,
    num_iters: int = 250,
):
    x, w, offsets = get_tensors(
        seq_len,
        hidden,
        intermediate,
        num_experts,
        top_k,
        use_beta,
    )
    return triton.testing.do_bench_cudagraph(
        lambda: torch._grouped_mm(x, w, offs=offsets), rep=num_iters
    )


params = {
    "seq_len": 4096,
    "hidden": 4096,
    "intermediate": 1536,
    "num_experts": 128,
    "top_k": 8,
}
uniform = benchmark(**params, use_beta=False)
beta = benchmark(**params, use_beta=True)

print(f"Uniform: {uniform:.5f} ms")
print(f"Beta: {beta:.5f} ms")

# Uniform: 0.96811 ms
# Beta: 1.02422 ms

With the uniform token assignment, the GroupedGEMM kernel is 6% faster compared to when we sample from the beta distribution (which we use to model imbalanced load)! In practice, this could lead us to seeing discrepancies between the performance during an actual training run vs. what we observed during benchmarking.


Putting everything together

To summarize:

  1. When writing the benchmarking code by hand, do not forget to:

     • Flush the L2 cache

     • Use CUDA events

     • Use torch.cuda.synchronize to wait for the completion of all GPU work

  2. Alternatively, use triton.testing.do_bench or triton.testing.do_bench_cudagraph for a built-in solution.

  3. Regardless of approach, do not forget about the underlying data you’re using.

All in all, your benchmark is only as useful as its relevance to what you’ll actually see in real production runs. If you care about CPU overhead, do not get rid of it. If you expect a specific data distribution during training, try to sample from it while benchmarking.