RSS preview of Blog of HackerNoon

How to Build Secure & Compressed Microservices in Symfony

2026-03-31 05:53:58

In the rapidly evolving landscape of modern web development, microservices have become the gold standard for building scalable, decoupled applications. But as your system grows, so does the complexity of how these isolated services communicate. Enter asynchronous messaging.

When dealing with high-throughput systems, two massive challenges inevitably emerge:

  1. Performance & Scale (how to handle millions of messages without burning through your infrastructure budget).
  2. Resiliency & Reliability (how to survive network hiccups, database locks, and API rate limits without dropping data).

With the release of the Symfony 7.4 ecosystem, the symfony/messenger component continues to be a developer’s best friend. And thanks to the CompressStamp, we now have a native, incredibly elegant way to crush bandwidth costs and supercharge queue performance.

In this deep dive, we are going to explore how to build a highly resilient, lightning-fast microservice architecture using Symfony Messenger, Redis, and advanced message stamping.

The Bottleneck: The “Fat Payload” Problem

Message queues are designed to be fast and lightweight. A classic architectural rule is to “send references, not data” (e.g., sending a user_id instead of the entire User object). However, in real-world microservices, this isn’t always possible.

Imagine you are building a reporting microservice, an invoice generator, or a system that bulk-syncs data to a third-party CRM. You are forced to pass massive JSON payloads, Base64-encoded file strings, or deeply nested arrays across the wire.

When these “fat payloads” hit your transport (Redis, Amazon SQS, etc.), three things happen:

  1. Memory Bloat: The transport stores everything you send it. Giant messages will trigger eviction policies or crash your instance entirely.
  2. Network Latency: Moving megabytes of data between your web nodes and your queue slows down your producers.
  3. Security Risks: Storing unencrypted PII or financial data in a queue violates compliance standards like GDPR or HIPAA.

A Custom Serialization Pipeline

Out of the box, Symfony Messenger serializes your message objects into plain JSON strings. To solve our performance and security bottlenecks, we are going to intercept this process.

By creating custom Stamps (metadata markers) and decorating the default Serializer, we can instruct Symfony to compress and encrypt specific messages right before they hit the transport, and to reverse the process the moment a worker picks them up.

Designing for Resiliency & Reliability

Speed means nothing if your system is fragile. Microservices fail. Third-party APIs go down. Databases lock. If your consumer throws an exception, you cannot afford to lose the message.

A resilient Symfony Messenger architecture relies on three pillars:

  1. Asynchronous Transports: Never make the user wait for a background task.
  2. Retry Strategies: Automatically re-queue failed messages with an exponential backoff (e.g., retry after 10 seconds, then 20 seconds, then 40 seconds).
  3. Failure Transports (Dead Letter Queues): If a message fails all retries, route it to a secure database queue where a developer can inspect it, fix the bug, and manually replay it (see the configuration sketch below).
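Before any code, here is how those three pillars typically translate into Messenger configuration. This is a minimal sketch, assuming a Redis transport named async, a Doctrine-backed failure transport (which additionally requires the symfony/doctrine-messenger package), and a MESSENGER_TRANSPORT_DSN environment variable; adjust the names and values to your own setup.

# config/packages/messenger.yaml
framework:
    messenger:
        # Pillar 3: messages that exhaust all retries land here for inspection and replay
        failure_transport: failed

        transports:
            # Pillar 1: asynchronous transport for background work
            async:
                dsn: '%env(MESSENGER_TRANSPORT_DSN)%'
                # Pillar 2: exponential backoff (10s, then 20s, then 40s)
                retry_strategy:
                    max_retries: 3
                    delay: 10000      # milliseconds before the first retry
                    multiplier: 2     # doubles the delay on each subsequent retry

            failed:
                dsn: 'doctrine://default?queue_name=failed'

Messages that end up on the failure transport can be listed and replayed later with the messenger:failed:show and messenger:failed:retry console commands.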

The Tech Stack

We are utilizing the current Symfony 7.4 LTS ecosystem alongside PHP 8.3+ (the serializer shown later uses typed class constants, which require PHP 8.3). Ensure you have the necessary PHP extensions installed on your server; the commands after this list show one way to install the packages and verify the extensions.

  1. symfony/messenger: Core message bus and worker tooling.
  2. symfony/redis-messenger: The official Redis transport for Messenger.
  3. ext-zlib: Native PHP extension required for gzdeflate()/gzinflate() compression.
  4. ext-openssl: Native PHP extension required for AES-256 encryption.
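Assuming a standard Symfony skeleton managed with Composer, the two Messenger packages can be pulled in as follows (zlib and OpenSSL ship with most PHP builds, but it is worth confirming they are loaded):

composer require symfony/messenger symfony/redis-messenger
php -m | grep -E 'zlib|openssl'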

Step-by-Step Implementation

Let’s build our secure, highly compressed “Invoice Generation” service.

The Message Class & Handlers

In modern PHP, we use strongly typed, read-only classes for our messages.

namespace App\Message;

/**
 * Represents a bulk invoice generation request.
 */
readonly class GenerateBulkInvoiceMessage
{
    public function __construct(
        public string $batchId,
        public array $invoiceData // Imagine this array contains thousands of nested rows
    ) {
    }
}

Using the #[AsMessageHandler] attribute, our worker expects to receive the fully hydrated, decompressed, and decrypted object. Our worker doesn’t need to know how the message was transported; it just handles the business logic.

namespace App\MessageHandler;

use App\Message\GenerateBulkInvoiceMessage;
use Symfony\Component\Messenger\Attribute\AsMessageHandler;
use Psr\Log\LoggerInterface;

#[AsMessageHandler]
readonly class GenerateBulkInvoiceMessageHandler
{
    public function __construct(
        private LoggerInterface $logger
    ) {
    }

    public function __invoke(GenerateBulkInvoiceMessage $message): void
    {
        $this->logger->info('Starting bulk invoice generation for batch.', [
            'batchId' => $message->batchId,
            'recordsCount' => count($message->invoiceData)
        ]);

        // Simulate heavy processing...
        // If this throws an exception, Symfony Messenger automatically catches it,
        // checks the retry_strategy in messenger.yaml and re-queues it in Redis!

        $this->logger->info('Bulk invoice generation completed successfully.');
    }
}
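Handlers only run while a worker is consuming the transport. Assuming the transport is named async (as in the configuration sketch earlier), a typical invocation looks like this:

php bin/console messenger:consume async -vv

In production, run workers under a process supervisor (systemd, Supervisor, etc.) and recycle them periodically with options such as --time-limit or --memory-limit so long-running PHP processes don’t accumulate state.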

Creating the Custom Stamps

In Symfony Messenger, stamps are simply DTOs that act as metadata. We will create two stamps: one for compression and one for security.

namespace App\Messenger\Stamp;

use Symfony\Component\Messenger\Stamp\StampInterface;

/**
 * A stamp indicating that the serialized message payload should be compressed.
 */
readonly class CompressStamp implements StampInterface
{
}

namespace App\Messenger\Stamp;

use Symfony\Component\Messenger\Stamp\StampInterface;

/**
 * A stamp indicating that the serialized message payload should be encrypted.
 */
readonly class SecureStamp implements StampInterface
{
}

The Custom Serializer

Instead of writing a serializer from scratch, we use the Decorator pattern to wrap Symfony’s default serializer. If the CompressStamp is present, we compress the JSON body using PHP’s native zlib extension; if the SecureStamp is present, we apply AES-256-CBC encryption via OpenSSL.

namespace App\Messenger\Serialization;

use App\Messenger\Stamp\CompressStamp;
use App\Messenger\Stamp\SecureStamp;
use Symfony\Component\Messenger\Envelope;
use Symfony\Component\Messenger\Exception\MessageDecodingFailedException;
use Symfony\Component\Messenger\Transport\Serialization\SerializerInterface;

/**
 * Serializer that independently handles compression and encryption.
 */
readonly class CompressSerializer implements SerializerInterface
{
    private const string COMPRESSED_HEADER = 'X-Compressed';
    private const string SECURED_HEADER = 'X-Secured';
    private const string CIPHER_ALGO = 'aes-256-cbc';

    public function __construct(
        private SerializerInterface $innerSerializer,
        private string $encryptionKey
    ) {
    }

    public function decode(array $encodedEnvelope): Envelope
    {
        $body = $encodedEnvelope['body'] ?? throw new MessageDecodingFailedException('Encoded envelope has no body.');

        // 1. Handle Decryption (must happen before decompression if both were used)
        if (isset($encodedEnvelope['headers'][self::SECURED_HEADER]) && 'true' === $encodedEnvelope['headers'][self::SECURED_HEADER]) {
            $decodedBody = base64_decode($body, true);
            if ($decodedBody === false) {
                throw new MessageDecodingFailedException('Failed to base64 decode the secured message body.');
            }

            $ivLength = openssl_cipher_iv_length(self::CIPHER_ALGO);
            $iv = substr($decodedBody, 0, $ivLength);
            $encryptedData = substr($decodedBody, $ivLength);

            $key = hash('sha256', $this->encryptionKey, true);
            $decryptedBody = openssl_decrypt($encryptedData, self::CIPHER_ALGO, $key, OPENSSL_RAW_DATA, $iv);

            if (false === $decryptedBody) {
                throw new MessageDecodingFailedException('Failed to decrypt the message body. Check your encryption key.');
            }

            $body = $decryptedBody;
        }

        // 2. Handle Decompression
        if (isset($encodedEnvelope['headers'][self::COMPRESSED_HEADER]) && 'true' === $encodedEnvelope['headers'][self::COMPRESSED_HEADER]) {
            $decompressedBody = gzinflate($body);

            if (false === $decompressedBody) {
                throw new MessageDecodingFailedException('Failed to decompress the message body.');
            }

            $body = $decompressedBody;
        }

        $encodedEnvelope['body'] = $body;

        return $this->innerSerializer->decode($encodedEnvelope);
    }

    public function encode(Envelope $envelope): array
    {
        $encodedEnvelope = $this->innerSerializer->encode($envelope);
        $body = $encodedEnvelope['body'];

        // 1. Handle Compression (Compress first for maximum efficiency before encryption)
        if (null !== $envelope->last(CompressStamp::class)) {
            $compressedBody = gzdeflate($body);
            if (false === $compressedBody) {
                throw new \RuntimeException('Failed to compress the message body.');
            }
            $body = $compressedBody;
            $encodedEnvelope['headers'][self::COMPRESSED_HEADER] = 'true';
        }

        // 2. Handle Encryption
        if (null !== $envelope->last(SecureStamp::class)) {
            if (empty($this->encryptionKey)) {
                throw new \LogicException('Cannot encrypt message: MESSENGER_ENCRYPTION_KEY is not set.');
            }

            $ivLength = openssl_cipher_iv_length(self::CIPHER_ALGO);
            $iv = random_bytes($ivLength);
            $key = hash('sha256', $this->encryptionKey, true);

            $encryptedBody = openssl_encrypt($body, self::CIPHER_ALGO, $key, OPENSSL_RAW_DATA, $iv);

            if (false === $encryptedBody) {
                throw new \RuntimeException('Failed to encrypt the message body.');
            }

            $body = base64_encode($iv . $encryptedBody);
            $encodedEnvelope['headers'][self::SECURED_HEADER] = 'true';
        }

        $encodedEnvelope['body'] = $body;

        return $encodedEnvelope;
    }
}

Wiring It Up

The first piece of wiring is registering our decorator in config/services.yaml.

services:    
    ...
    App\Messenger\Serialization\CompressSerializer:
        arguments:
            $innerSerializer: '@messenger.default_serializer'
            $encryptionKey: '%env(MESSENGER_ENCRYPTION_KEY)%'
    ...
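Registering the service on its own doesn’t route any traffic through it: the transport still has to be told to use this serializer, and the encryption key needs a value. A minimal sketch, extending the async transport from the earlier configuration (the DSN and key values below are placeholders, not part of the original project):

# config/packages/messenger.yaml
framework:
    messenger:
        transports:
            async:
                dsn: '%env(MESSENGER_TRANSPORT_DSN)%'
                serializer: 'App\Messenger\Serialization\CompressSerializer'

        routing:
            App\Message\GenerateBulkInvoiceMessage: async

# .env
MESSENGER_TRANSPORT_DSN=redis://localhost:6379/invoices
MESSENGER_ENCRYPTION_KEY=replace-with-a-long-random-secret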

Now, dispatching a secure, lightweight message is as simple as:

        // Dispatch the message, attaching both stamps for maximum security and efficiency
        $this->messageBus->dispatch($message, [
            new CompressStamp(),
            new SecureStamp()
        ]);
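For context, here is a minimal sketch of the dispatching side as a complete class, assuming constructor injection of MessageBusInterface (the class and method names are illustrative, not taken from the original project):

namespace App\Service;

use App\Message\GenerateBulkInvoiceMessage;
use App\Messenger\Stamp\CompressStamp;
use App\Messenger\Stamp\SecureStamp;
use Symfony\Component\Messenger\MessageBusInterface;

readonly class InvoiceBatchDispatcher
{
    public function __construct(
        private MessageBusInterface $messageBus
    ) {
    }

    public function dispatchBatch(string $batchId, array $invoiceData): void
    {
        $message = new GenerateBulkInvoiceMessage($batchId, $invoiceData);

        // Both stamps are attached, so the custom serializer compresses first, then encrypts
        $this->messageBus->dispatch($message, [
            new CompressStamp(),
            new SecureStamp()
        ]);
    }
}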

Benchmarking the Ultimate Pipeline

To truly understand the value and the trade-offs of this architecture, let’s look at a real-world benchmark. We simulated a high-throughput environment, dispatching 10,000 GenerateBulkInvoiceMessage objects to our Redis transport. Each message contained a fat array payload that, when serialized natively, equated to approximately 500KB per message.

Here are the results across a standard cloud environment (2 vCPUs, 4GB RAM):

+-------------------------+----------------+------------------+---------------------------+
| Metric                  | Baseline (Raw) | Compress + Stamp | Compress + Secure         |
+-------------------------+----------------+------------------+---------------------------+
| Total Redis Memory      | ~4.88 GB       | ~410 MB          | ~550 MB                   |
+-------------------------+----------------+------------------+---------------------------+
| Worker CPU Utilization  | ~15%           | ~22%             | ~38%                      |
+-------------------------+----------------+------------------+---------------------------+
| Avg. Time to Dispatch   | 42 seconds     | 35 seconds       | 48 seconds                |
+-------------------------+----------------+------------------+---------------------------+
| Avg. Time to Consume    | 58 seconds     | 61 seconds       | 74 seconds                |
+-------------------------+----------------+------------------+---------------------------+

Analyzing the Trade-offs

  1. The Memory Sweet Spot: Raw payloads consume a massive 4.88 GB of Redis RAM. Compression crushes this down to 410 MB. However, when we add the SecureStamp, memory creeps up slightly to 550 MB because the output of OpenSSL is binary, and storing it safely requires base64_encode(). Even with this overhead, you are saving 88% of your memory footprint!
  2. The CPU Tax: Security is never free. Adding AES-256 encryption pushes the worker’s CPU utilization up to 38%. The worker has to perform cryptographic math on every single message before unpacking it.
  3. Time to Process: The baseline takes 42 seconds to dispatch because pushing 4.88 GB over a network connection is incredibly slow. Compression speeds this up (35 seconds) by shifting the bottleneck from the network to the CPU. Adding encryption slows it back down slightly (48 seconds) due to the heavy OpenSSL processing.

Is the CPU tax worth it? If your payloads contain PII or financial records, sacrificing a bit of CPU time to ensure military-grade encryption while still saving 88% on your infrastructure bill is an architectural slam dunk.

Conclusion

Building modern microservices requires more than just pushing data into a queue. By extending Symfony’s Messenger component with custom Serializers and Stamps, you can take complete control over your message payloads.

You no longer have to choose between performance and security. By implementing this custom pipeline, you ensure that your message broker remains highly performant, remarkably cost-effective, and fully compliant with strict data security laws.

Source Code: You can find the full implementation and follow the project’s progress on GitHub: https://github.com/mattleads/CompressStamp

Let’s Connect!

If you found this helpful or have questions about the implementation, I’d love to hear from you. Let’s stay in touch and keep the conversation going.


What 55 Bluesky Power Users Think of Attie: Custom Feeds or Algorithmic Overlords?

2026-03-31 05:49:28

TL;DR: Bluesky’s launch of Attie, a natural-language AI agent designed to help users build custom feeds, has ignited a massive cultural divide across the "Atmosphere." While the tool was intended to democratize feed creation, it was met with a digital uprising, quickly becoming the second most-blocked account on the protocol—surpassed only by J.D. Vance. This curation of 55 power-user reactions captures the tension between "protocol purists" who view any AI integration as an invasion of their human-centric sanctuary and tech-optimists who believe the tool is a necessary step toward true user sovereignty and data discovery.

Pretext Does What CSS Can't — Measuring Text Before the DOM Even Exists

2026-03-31 04:01:47

Pretext is a new JS/TS library from Cheng Lou (React core, react-motion, Midjourney) that calculates multiline text height without DOM reflow. It splits work into a one-time prepare() pass via off-screen canvas, then a pure-arithmetic layout() hot path that runs in ~0.09ms. The result: accurate pre-render height prediction for virtualized lists, chat UIs, and canvas renderers — plus layout primitives CSS literally doesn't have, like finding the minimum width for a block of text at a fixed line count. Solid i18n, rigorous accuracy testing, 7k+ GitHub stars in days. Still early: no SSR, system-ui is broken on macOS, and the author himself caveats the 500x benchmark claim.

HatchIt Earns a 56 Proof of Usefulness Score by Building an AI Website Builder with Real Code Export

2026-03-31 03:46:29

HatchIt is an AI-powered website builder that generates production-ready React and Tailwind CSS code while allowing developers to export directly to GitHub. By eliminating vendor lock-in, it offers a developer-first alternative to traditional AI builders that trap users within proprietary platforms. Built with modern web infrastructure and designed for real-world use, HatchIt enables faster development without sacrificing code ownership or flexibility.

Summa Earns a 62 Proof of Usefulness Score by Building an AI-Powered Investor Matchmaking Platform

2026-03-31 03:36:09

Summa is an AI-powered platform that helps early-stage founders identify and connect with relevant investors using semantic search and NLP-based matching. By analyzing founder pitches against investor theses, it reduces the time and guesswork involved in fundraising. With early traction among founders and a growing dataset, Summa positions itself as a more efficient, data-driven alternative to traditional network-based fundraising.

I Gave 5 Frontier Models the Same Email Thread. Here's What They Missed.

2026-03-31 03:23:40

We gave five frontier models the same 31-message email thread, the same prompt, and the same job: tell us what was decided, who owns what, and what changed. None of them got all of it right.

One pulled a pricing decision from a forwarded internal discussion that had been reversed six messages later. One flattened two reply branches into a single story and quietly invented consensus where there was none. And one attributed a task to someone who never said "I'll handle it" because the sentence only appeared in quoted history from an earlier reply. By message #21, that same sentence had been duplicated 12 times across the thread by email clients quoting the full history on every reply.

Across all five models, 3 out of 5 listed a dropped integration as an agreed item. 4 out of 5 misidentified decision-makers. Every model confused "person who talks a lot" with "person who has authority."

GPT-5.4, Claude Sonnet 4.6, Gemini 3 Pro, Grok 4.20 Beta, and Mistral Large 3 are all capable of answering these questions correctly given the right input. Raw email is not the right input, and the structural reasons why are specific enough to document, which is what this post does.

The Test Setup

The thread is a real B2B SaaS deal negotiation (anonymized), spanning 3 weeks with 8 participants across both organizations. It includes a subject line change mid-thread, one participant CC'd halfway through who immediately starts giving opinions without the earlier context, a forwarded internal pricing discussion that accidentally went to the prospect, and three sub-threads that branched off when people replied to different messages in the chain.

When pulled from the Gmail API, this thread produced approximately 47,000 tokens of raw content. After deduplication and quoted text stripping, the actual unique content was about 11,000 tokens. That's a 4.3x bloat factor (the ratio of raw tokens to unique tokens, driven almost entirely by email clients quoting the full history on every reply). Typical for threads this length.

We fed each model the full raw content. Every model can handle the context length.

The question is what they do with it.

Prompt. Identical across all five models, no tools, no web access, temperature 0 where the API allows:


Read this email thread and return: (1) current decisions, (2) open action items with owners, (3) deadlines, (4) what changed during the thread, (5) any risks or contradictions. Use the JSON schema provided.

Models. GPT-5.4 (OpenAI API), Claude Sonnet 4.6 (Anthropic API), Gemini 3 Pro (Google Vertex AI), Grok 4.20 Beta (xAI API), Mistral Large 3 (La Plateforme). All tested March 2026. One model per lab.

Disclosure. I work at iGPT, which sells an email preprocessing API. The structured context in the second half of this test was generated by our product. The raw-input failures documented in the first half stand on their own regardless of how you choose to fix them.

The Most Revealing Failure from Each Model

The full scoring tables are at the end. What follows is the single most instructive miss from each model, chosen because it maps to a specific structural property of email that gets destroyed when you dump raw content into a context window.

GPT-5.4: Stale Forwarded Decision

In message #7, an internal pricing discussion was accidentally forwarded to the prospect. That forwarded chain contained an approved 15% discount from three weeks earlier. By message #13, the vendor's finance team had revised the discount to 12%. By message #19, the prospect had explicitly asked for "current" pricing.

GPT-5.4 reported the 15% discount as the agreed pricing. In the flattened text, the forwarded content sits inline with no structural marker distinguishing it from active conversation, and the older figure is stated with more confidence ("approved at 15%" versus "we're revising to 12%") which the model interprets as higher certainty. GPT-5.4 also performed worst on stale-history resistance overall, pulling from forwarded content on two other questions as well.

Failure class: Forwarded-chain staleness. Forwarded content appears inline in raw email with no structural boundary. The model treats old history as current conversation.

Claude Sonnet 4.6: Wrong Owner on Turn 18

In message #18, Priya (the vendor's Solution Architect) wrote "I'll send the POC scope document by Friday." Claude Sonnet 4.6 attributed this commitment to James (the Account Executive) who had written the most messages in the thread.

In flattened email text, the pronoun "I" appears dozens of times and refers to different people. Once the From: headers are buried in threading noise, the model uses name frequency as a proxy for speaker identity, and the person who talks the most gets credited with commitments they never made. That said, Sonnet 4.6 was the only model that produced zero hallucinated commitments from quoted text, and it was one of only two that flagged the CFO's silence as a risk signal.

Failure class: Multi-party pronoun ambiguity. "I'll handle it" gets attached to the wrong speaker because the model can't reliably map "I" across flattened turns without per-message participant metadata.

Gemini 3 Pro: Branch Merge Error

Between messages #14 and #20, the thread forked. David Kim (prospect VP Eng) replied to message #14 agreeing to a 30-day POC. Meanwhile, Lisa Park (Procurement, CC'd at message #12) replied to message #11 raising concerns about security certifications and suggesting the POC should wait until a compliance review was complete.

Gemini 3 Pro collapsed both branches into one narrative: "The team agreed to a 30-day POC pending compliance review." David agreed without conditions. Lisa wanted to delay. These are contradictory positions from different branches, and the model merged them into invented consensus. This was Gemini 3 Pro's worst dimension overall: it scored lowest on branch awareness across all five models, and it was the only one to also hallucinate a fifth action item from the dropped integration discussion.

Failure class: Thread fork blindness. The model collapses parallel branches into one linear story because flattened text can't represent non-linear conversation topology.

Grok 4.20 Beta: Overconfident Contradiction Summary

Grok 4.20 detected more risk signals than any other model. It flagged the CFO's silence, the competing vendor mention, the timeline pressure, and the accidental pricing leak. It was the only model to catch all four. But it described the situation as "increasingly adversarial" and rated the deal as "high risk, likely to stall," which over-indexes on the negatives without weighing the positive momentum.

More interesting was its handling of a cross-thread reference. The prospect's VP of Engineering mentioned in message #22 that "we're also looking at [Competitor]'s approach." Grok treated this as a direct comparison and began contrasting the vendor's capabilities against the competitor, hallucinating specifics about the competitor's offering that weren't in the thread. It filled the gap with plausible-sounding details because the thread referenced but didn't describe the competitor's product.

Failure class: Cross-thread relationship loss. A reference to external context triggers confabulation because the model has no access to the referenced material and fills the gap with fabrication.

Mistral Large 3: Quoted-Text Contamination

In message #9, James wrote to the internal team: "The client is open to the custom integration if we can deliver it within the POC timeline." By message #15, this had been discussed and quietly dropped.

In message #21, David Kim replied to an earlier message. His email client included the full quoted history below his reply, which meant James's message #9 about the integration appeared again as quoted text inside David's message. Mistral Large 3 treated this quoted appearance as a reaffirmation, listing the custom integration as an "active agreed item" and citing message #21 as the source, even though David's actual reply was about scheduling. This was Mistral's most distinctive failure: it was the only model to explicitly cite a quoted-text source as evidence for an active agreement.

Failure class: Quoted-text contamination. The model can't distinguish original statements from quoted copies. In a 20-message thread, every reply includes the full history below it, so the same sentence appears a dozen times.

What All Five Models Struggled With

The model-specific errors differ. The underlying failures don't.

Decision through silence. A custom integration was proposed, discussed for four messages (about 800 words), and then quietly dropped when the conversation moved to pricing. Never explicitly rejected, just abandoned. Three out of five models listed it as agreed. The discussion produces high attention weight because it's lengthy and detailed, and the models default to treating anything discussed at length as active unless they find explicit closure. Absence of closure is not a signal they can detect from raw text.

CFO silence as a risk signal. Rachel Torres was directly asked about pricing in messages #16 and #23. She didn't respond to either. Only Grok 4.20 and Sonnet 4.6 flagged this. The other three couldn't detect it, because in raw email content, silence is invisible. There is no message that says "I am choosing not to respond." Identifying meaningful absence requires understanding participant structure, not just content.

Authority versus verbosity. James the AE wrote the most messages. Four out of five models listed him as a decision-maker. Rachel the CFO wrote one message buried in a forwarded chain. Most models either missed her or couldn't determine her role. Participation frequency is a terrible proxy for organizational authority, and it's the best heuristic available from unstructured text.

Why Email Breaks Frontier Models

Every failure above maps to a structural property of email that gets destroyed when you flatten a thread into text. These aren't edge cases. They're the default state of any business thread longer than about 10 messages.

Quoted reply duplication. Every reply includes the full quoted history below it. The 20th message contains 19 copies of the first. The thread inflates from ~11,000 unique tokens to ~47,000 raw tokens, biasing attention toward earlier messages that appear more frequently.

Forwarded chain collapse. Forwarded content appears as one continuous block with no structural separation from the active thread. A statement from an internal discussion three weeks ago gets treated as current negotiation.

Participant identity loss. Strip From: headers and "I" refers to eight different people across 47,000 tokens of text. Attribution becomes a frequency-based guessing game.

Non-linear conversation topology. Three people replied to different messages, creating parallel sub-conversations. Linear processing treats these as one flow. The In-Reply-To headers encoding the actual conversation graph are exactly what flattening strips out.

Invisible absence. The most important signals were things that didn't happen: the integration dropped without rejection, the CFO who didn't respond. Detecting meaningful absence requires knowing who was asked what and whether they answered. Raw text can only process what's present.

Structure Beats Model Choice

We ran the same questions through iGPT's Context Engine, which performs thread reconstruction, deduplication, participant attribution, and temporal ordering before the content reaches the model. The output includes per-message metadata on who said what, when, who they were replying to, and what changed between messages.

We used the same five models, the same questions, and a different input structure.

| Metric | Raw Email (avg across 5 models) | Structured Context (avg) | Delta |
|----|----|----|----|
| Decision accuracy | ~42% | ~91% | +49 pts |
| Owner attribution | ~48% | ~94% | +46 pts |
| Deadline extraction | ~56% | ~89% | +33 pts |
| Stale-history resistance | ~35% | ~88% | +53 pts |
| Branch awareness | ~30% | ~85% | +55 pts |
| Contradiction detection | ~38% | ~82% | +44 pts |

The composite accuracy improvement averaged 29 percentage points across all five models. But look at the structural metrics: stale-history resistance jumped 53 points. Branch awareness jumped 55. Those are the failures caused by how email is formatted, not by what people write in it.

The spread between models on raw input was about 8 percentage points, best to worst. The spread between raw input and structured input on the same model was 29 points. The preprocessing gap is more than 3x the model gap.

Changing the model moves accuracy a few points. Changing the input moves it by dozens.

You can build this preprocessing layer yourself: thread reconstruction, quoted text detection, signature stripping, MIME parsing, participant resolution, conversation topology mapping. There are open-source libraries that handle parts of this (email-reply-parser, flanker, mailparser). Budget 6 to 12 months if you want the full stack reliable across the range of email clients people actually use.

Or you can use an API that handles it end-to-end. iGPT's Context Engine does this in a single endpoint: raw email goes in, clean thread structure comes out with who said what, when, and what actually changed. That's what we used as the "structured context" input in the test above.

pip install igptai
from igptai import IGPT

igpt = IGPT(api_key="IGPT_API_KEY", user="user_123")

res = igpt.recall.ask(
    input="What decisions were made, who owns what, and what changed?",
    quality="cef-1-reasoning",
    output_format="json"
)
# → structured JSON: decisions, action items with owners,
#   deadlines, risk signals, source citations per claim

If you're processing real deal threads through any third-party layer, the compliance question matters. iGPT offers zero data retention for inference and is working toward SOC 2 and GDPR alignment, but whatever you use, verify it meets your security requirements before routing live email through it.

The principle holds regardless of implementation. If you're spending time evaluating which model is "better at email," you're optimizing the wrong variable.

Full Scoring Tables

Decision-Makers

Ground truth: David Kim (VP Eng, technical decision-maker), Rachel Torres (CFO, budget authority, one message in forwarded chain).

| Model | Identified Decision-Makers | Assessment |
|----|----|----|
| GPT-5.4 | David Kim, James Chen, Rachel Torres | Partial. James is the AE, not a decision-maker. |
| Claude Sonnet 4.6 | David Kim, Rachel Torres, Lisa Park | Partial. Lisa was CC'd for compliance review only. |
| Gemini 3 Pro | David Kim, James Chen | Partial. Same James error. Missed Rachel entirely. |
| Grok 4.20 Beta | David Kim, Rachel Torres | Closest. Missed that Rachel's authority is specifically budget. |
| Mistral Large 3 | David Kim, Rachel Torres, James Chen | Partial. Same James misidentification. |

Agreements

Ground truth: (1) 30-day POC with scope doc. (2) Enterprise tier at 12% discount (revised from 15%), pending CFO approval. (3) Dedicated solutions engineer. NOT agreed: custom integration (proposed, discussed, dropped).

| Model | Agreements Found | Critical Error |
|----|----|----|
| GPT-5.4 | POC, 15% discount, custom integration | Wrong discount (stale). Listed dropped integration. |
| Claude Sonnet 4.6 | POC, 12% discount, solutions engineer | Missed CFO approval condition. |
| Gemini 3 Pro | POC pending compliance, 15% discount, integration, solutions engineer | Merged branches. Wrong discount. Listed dropped integration. |
| Grok 4.20 Beta | POC, 12% discount, solutions engineer | Correct items. Over-qualified with risk caveats. |
| Mistral Large 3 | POC, 15% discount, custom integration | Wrong discount. Listed dropped integration citing quoted text. |

Action Items + Attribution

Ground truth: (1) Priya: POC scope doc by Friday. (2) David Kim: provision staging env. (3) James: schedule tech deep-dive. (4) Rachel Torres: final pricing approval (outstanding).

| Model | Items Extracted | Attribution Errors |
|----|----|----|
| GPT-5.4 | All four items | Attributed POC scope doc to James instead of Priya. |
| Claude Sonnet 4.6 | POC scope, staging env, tech deep-dive | Missed Rachel's outstanding approval. Attributed POC doc to James. |
| Gemini 3 Pro | Four items + invented "follow up on integration" | Hallucinated fifth item from dropped discussion. |
| Grok 4.20 Beta | All four items | Correct on 3 of 4. Swapped David/James on staging env. |
| Mistral Large 3 | POC scope, tech deep-dive, pricing approval | Missed staging env. Correct attribution on rest. |

Risk Assessment

Ground truth signals: (1) CFO silence on two direct pricing questions. (2) Accidental pricing leak. (3) Competing vendor mentioned by name. (4) POC timeline overlaps Q4 freeze.

| Model | Risk Level | Signals Caught | What It Missed |
|----|----|----|----|
| GPT-5.4 | Medium | Competitor, timeline | CFO silence, pricing leak |
| Claude Sonnet 4.6 | Medium-high | CFO silence, competitor | Pricing leak, Q4 freeze |
| Gemini 3 Pro | Low-medium | None specific ("positive momentum") | 3 of 4 signals |
| Grok 4.20 Beta | High | All four | Over-indexed. Called it "adversarial." |
| Mistral Large 3 | Medium | Competitor, timeline | CFO silence, pricing leak |

Methodology

Models tested: GPT-5.4 (OpenAI API), Claude Sonnet 4.6 (Anthropic API), Gemini 3 Pro (Google Vertex AI), Grok 4.20 Beta (xAI API), Mistral Large 3 (La Plateforme). All March 2026.

Thread source: Real anonymized B2B SaaS deal negotiation, 31 messages, 8 participants, 3 weeks. PII replaced. Thread structure preserved.

Scope: This is a single-thread evaluation, not a statistical benchmark. The failure classes documented here are structural and reproducible across threads of similar complexity, but the specific accuracy percentages should be read as directional, not definitive. We chose this thread because it contains all five structural patterns (forwarding, branching, CC changes, quoted duplication, implicit abandonment) in one conversation, not because it was optimized for any particular tool.

Evaluation: Manual scoring by two evaluators against ground truth established by someone with full deal context (not the author). Partial credit awarded.

Structured context: iGPT recall/ask endpoint, cef-1-reasoning tier. Same five models, same prompt, same rubric.

Hallucinated commitments from quoted text: GPT-5.4: 1. Sonnet 4.6: 0. Gemini 3 Pro: 2. Grok 4.20: 1. Mistral Large 3: 1.

Model volatility: Model behavior changes with updates, and several models tested were in active iteration at the time (notably Grok 4.20 Beta). The specific numbers in this evaluation are a snapshot. The structural email properties that cause these failures do not change with model updates. If you rerun this test after a model refresh, the numbers will shift but the pattern will hold: raw email input degrades accuracy on thread-dependent questions regardless of model version.


iGPT's Email Intelligence API handles thread reconstruction, sender attribution, temporal ordering, and conversation topology mapping. docs.igpt.ai
