2025-11-12 20:27:54

What. A. Week. It wasn't just the preceding weeks of technical pain as we tried to work out how to get this data loaded, it was all the subsequent queries we had to deal with too. Some of them are totally understandable, whilst others just resulted in endless facepalms 🤦♂️ But we got there in the end with the worst of it just being a 24-hour period where we ended up on a SpamCop block list, for reasons I still don't understand. We are still on the very tail end of sending individual notifications, so there may be more to update in the next vid, but at least that one will be from home with sunshine, good coffee and a slower pace 😊
2025-11-05 14:41:43

I hate hyperbolic news headlines about data breaches, but for the "2 Billion Email Addresses" headline to be hyperbolic, it'd need to be exaggerated or overstated - and it isn't. It's rounded up from the more precise number of 1,957,476,021 unique email addresses, but other than that, it's exactly what it sounds like. Oh - and 1.3 billion unique passwords, 625 million of which we'd never seen before either. It's the most extensive corpus of data we've ever processed, by a significant margin.
Edit: Just to be crystal clear about the origin of the data and the role of Synthient (who you’ll read about in the next paragraph): this data came from numerous locations where cybercriminals had published it. Synthient (run by Ben during his final year of college) indexed that data and provided it to Have I Been Pwned solely for the purpose of notifying victims. He’s the good guy shining a light on the bad guys, so keep that in mind as you read on. (Some of the feedback Ben has received is exactly what I foreshadowed in the final paragraph of this post.)
A couple of weeks ago, I wrote about the 183M unique email addresses that Synthient had indexed in their threat intelligence platform and then shared with us. I explained that this was only part of the corpus of data they'd indexed, and that it didn't include the credential stuffing records. Stealer log data is obtained by malware running on infected machines. In contrast, credential stuffing lists usually originate from other data breaches where email addresses and passwords are exposed. They're then bundled up, sold, redistributed, and ultimately used to log in to victims' accounts. And not just the accounts they were initially breached from, either: because people reuse the same password over and over again, the data from one breach is frequently usable on completely unrelated sites. A breach of a forum to comment on cats often exposes data that can then be used to log in to the victim's shopping, social media and even email accounts. In that regard, credential stuffing data becomes "the keys to the castle".
Let me run through how we verified the data, what you can do about it and for the tech folks, some of the hoops we had to jump through to make processing this volume of data possible.
The first person whose data I verified was easy - me 😔 An old email address I've had since the 90s has been in credential stuffing lists before, so it wasn't too much of a surprise. Furthermore, I found a password associated with my address, which I'd definitely used many eons ago, and it was about as terrible as you'd expect from that era. However, none of the other passwords associated with my address were familiar. They certainly looked like passwords that other people might have feasibly used, but I'm pretty sure they weren't mine. One was even just an IP address from Perth on the other side of the country, which is infeasible as a password I would have used, yet eerily close to home. I mean, of all the places in the world an IP address could have appeared from, it had to be somewhere in my own country I've been many times before...
Moving on to HIBP subscribers, I reached out to a handful and asked for support verifying the data. I chose a mix of subscribers with many who'd never been involved in any data breach we'd ever seen before; my experience above suggested that there's recycled data in there, and we had previously verified that when investigating those other incidents. However, is the all-new stuff legitimate? The very first response I received was exactly what I was looking for:
#1 is an old password that I don't use anymore. #2 is a more recent password. Thanks for the heads up, I've gone and changed the password for every critical account that used either one.
Perfectly illustrating most people's behaviour with passwords, #2 referred to above was just #1 with two exclamation marks at the end!! (Incidentally, these were simple six and eight-character passwords, and neither of them was in Pwned Passwords either.) He had three passwords in total, which also means one of them, like with my data, was not familiar. However, the most important thing here is that this example perfectly illustrates why we put the effort into processing data like this: #2 was a real, live password that this guy was actively using, and it was sitting right next to his email address, being passed around among criminals. However, through this effort, that credential pair has now become useless, which is precisely what we're aiming for with this exercise, just a couple of billion times over.
The second respondent only had one password against their address:
Yes that was a password I used for many years for what I would call throw away or unimportant accounts between 20 and 10 years ago
That was also only eight characters, but this time, we'd seen it in Pwned Passwords many times before. And the observation about the password's age was consistent with my own records, so there's definitely some pretty old data in there.
The following response was not at all surprising:
I am familiar with that password... I used it almost 10 years ago... and cannot recall the last time I used it.
That was on a corporate account, too, and the owner of the address duly forwarded my email to the cybersecurity team for further investigation. The single password associated with this lady's email address had a massive nine characters, and also hadn't previously appeared in Pwned Passwords.
Next up was a respondent who replied inline to my questions, so I'll list them below with the corresponding answers:
Is this familiar? Yes
Have you ever used it in the past? Yes and is still on some accounts I do not use any longer.
And if so, how long ago? Unfortunately, it is still on some active accounts that I have just made a list of to change or close immediately.
This individual's eight-character password with uppercase, lowercase, numbers and a "special" character also wasn't in Pwned Passwords. Similarly, as with the earlier response, that password was still in active use, posing a real risk to the owner. It would pass most password complexity criteria and slip through any service using Pwned Passwords to block bad ones, so again, this highlights why it was so important for us to process the data.
The next person had three different passwords against rows with their email address, and they came back with a now common response:
Yes, these are familiar, last used 10 years ago
We'd actually seen all three of them in Pwned Passwords before, many times each. Another respondent with precisely the kind of gamer-like passwords you'd expect a kid to use (one of which we hadn't seen before), also confirmed (I think?) their use:
maybe when i was a kid lol
Responses that weren't an emphatic "yes, that's my data" were scarce. The two passwords against one person's name were both in Pwned Passwords (albeit only once each), yet it's entirely possible that neither of them had been used by this specific individual before. It's also possible they'd forgotten a password they'd used more than a decade ago, or it may have even been automatically assigned to them by the service that was subsequently breached. Put it down as a statistical anomaly, but I thought it was worth mentioning to highlight that being in this data set isn't a guarantee of a genuine password of yours being exposed. If your email address is found in this corpus then that's real, of course, so there must be some truth in the data, but it's a reminder that when data is aggregated from so many different sources over such a long period of time, there's going to be some inconsistencies.
As a brief recap, we load passwords into the service we call Pwned Passwords. When we do so, there is absolutely no association between the password and the email address it appeared next to. This is for both your protection and ours; can you imagine if HIBP was pwned? It's not beyond the realm of possibility, and the impact of exposing billions of credential pairs that can immediately unlock an untold number of accounts would be catastrophic. It's highly risky, and completely unnecessary when you can search for standalone passwords anyway without creating the risk of it being linked back to someone.
Think about it: if you have a password of "Fido123!" and you find it's been previously exposed (which it has), it doesn't matter if it was exposed against your email address or someone else's; it's still a bad password because it's named after your dog followed by a very predictable pattern. If you have a genuinely strong password and it's in Pwned Passwords, then you can walk away with some confidence that it really was yours. Either way, you shouldn't ever use that password again anywhere, and Pwned Passwords has done its job.
Checking the service is easy, anonymous and depending on your level of technical comfort, can be done in several different ways. Here's a copy and paste from the last Synthient blog post:

My vested interest in 1Password aside, Watchtower is the easiest, fastest way to understand your potential exposure in this incident. And in case you're wondering why I have so many vulnerable and reused passwords, it's a combination of the test accounts I've saved over the years and the 4-digit PINs some services force you to use. Would you believe that every single 4-digit number ever has been pwned?! (If you're interested, the ABC has a fantastic infographic using a heatmap based on HIBP data that shows some very predictable patterns for 4-digit PINs.)
It pains me to say it, but I have to, given the way the stealer logs made ridiculous, completely false headlines a couple of weeks ago:
This story has suddenly gained *way* more traction in recent hours, and something I thought was obvious needs clarifying: this *is not* a Gmail leak, it simply has the credentials of victims infected with malware, and Gmail is the dominant email provider: https://t.co/S75hF4T1es
— Troy Hunt (@troyhunt) October 27, 2025
There are 32 million different email domains in this latest corpus, of which gmail.com is one. It is, of course, the largest and has 394 million unique email addresses on it. In other words, 80% of the data in this corpus has absolutely nothing to do with Gmail, and the 20% of Gmail addresses have absolutely nothing to do with any sort of security vulnerability on Google's behalf. There - now let reporting sanity prevail!
I wanted to add this just to highlight how painful it has been to deal with this data. This corpus is nearly 3 times the size of the previous largest breach we'd loaded, and HIBP is many times larger than it was in 2019 when we loaded the Collection #1 data. Taking 2 billion records and adding the ones we hadn't already seen in the existing 15 billion corpus, whilst not adversely impacting the live system serving millions of visitors a day, was very non-trivial. Managing the nuances of SQL Server indexes such that we could optimise both inserts and queries is not my idea of fun, and it's been a pretty hard couple of weeks if I'm honest. It's also been a very expensive period as we turned the cloud up to 11 (we run on Azure SQL Hyperscale, which we maxed out at 80 cores for almost two weeks).
A simple example of the challenge is that after loading all the email addresses up into a staging table, we needed to create SHA1 hashes of each. Normally, that would involve something to the effect of "update table set column = sha1(email)" and you're done. That crashed completely, so we ended up doing "insert into new table select email, sha1(email)". But on other occasions the breach load required us to do updates on other columns (with no hash creation), which, on multiple occasions, we had to kill after a day or more of execution with no end in sight. So, we ended up batching in loops (usually 1M records at a time), reporting on progress along the way so we had some idea of when it would actually finish. It was a painful process of trial, waiting ages, error then taking a completely different approach.
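To make the "batching in loops" idea a bit more concrete, here's a minimal sketch in Python. It is not what we actually ran (that was T-SQL against SQL Server); it uses an in-memory SQLite database as a stand-in, and the table names (staging, hashed), the helper names and the tiny demo batch size are all assumptions made up for illustration. The shape is the point: walk the key range in fixed-size chunks, insert into a new table rather than updating in place, commit each chunk and report progress as you go.

```python
import hashlib
import sqlite3


def sha1_hex(email: str) -> str:
    """Return the hex SHA-1 of an email address (UTF-8 encoded, no other normalisation)."""
    return hashlib.sha1(email.encode("utf-8")).hexdigest()


def hash_in_batches(conn, batch_size=1_000_000, report=print):
    """Copy staging rows into a new table with their hash, one key range at a time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hashed (id INTEGER PRIMARY KEY, email TEXT, sha1 TEXT)"
    )
    total = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
    done = 0
    last_id = 0
    while True:
        # Seek on the primary key so each batch is a cheap range read.
        rows = conn.execute(
            "SELECT id, email FROM staging WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        conn.executemany(
            "INSERT INTO hashed (id, email, sha1) VALUES (?, ?, ?)",
            [(row_id, email, sha1_hex(email)) for row_id, email in rows],
        )
        conn.commit()  # keep each transaction small
        last_id = rows[-1][0]
        done += len(rows)
        report(f"{done}/{total} rows hashed")
    return done


def build_demo_db(n=10):
    """Create a throwaway in-memory staging table with n hypothetical addresses."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO staging (email) VALUES (?)",
        [(f"user{i}@example.com",) for i in range(n)],
    )
    conn.commit()
    return conn
```

Calling hash_in_batches(build_demo_db(10), batch_size=4) processes the rows in three chunks and prints a progress line after each one, which is exactly the "some idea of when it would actually finish" feedback we were after.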
Notifying our subscribers is another problem. We have 5.9 million of them, and 2.9 million are in this data 🫨 Simply sending that many emails at once is hard. It's not so much hard in terms of firing them off, rather it's hard in terms of not ending up on a reputation naughty list or having mail throttled by the receiving server. That's happened many times in the past when loading large, albeit much smaller corpuses; Gmail, for example, suddenly sees a massive spike and slows down the delivery to inboxes. Not such a biggy for sending breach notices, but a major problem for people trying to sign into their dashboard who can no longer receive the email with the "magic" link.
What we've done to address that for this incident is to slow down the delivery of emails for the individual breach notification. Whilst I'd originally intended to send the emails at a constant rate over the period of a week, someone listening to me on my Friday live stream had a much better suggestion:
the strategy I've found to best work with large email delivery is to look at the average number of emails you've sent over the last 30 days each time you want to ramp up, and then increase that volume by around 50% per day until you've worked your way through the queue
Which makes a lot of sense, and stacked up as I did more research (thanks Joe!). So, here's what our planned delivery schedule now looks like:

That's broken down by hour, increasing in volume by 1.015 times per hour, such that the emails are spread out in a similar, gradually increasing cadence. On a daily basis, that works out at a 45% increase in each 24-hour period, within Joe's suggested 50% threshold. Plus, we obviously have all the other mechanisms such as a dedicated IP, properly configured DKIM, DMARC and SPF, only emailing double-opted-in subscribers and spam-filter-friendly message body construction. So, it could be days before you receive a notification, or just run a haveibeenpwned.com search on demand if you're impatient.
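For anyone wanting to play with the numbers, here's a rough Python sketch of that kind of ramp. The 1.015 hourly multiplier and the 2.9 million queue size come from this post; the starting volume of 1,000 emails in the first hour and the function name are purely hypothetical, so treat the output as illustrative rather than our actual schedule.

```python
def hourly_schedule(first_hour, total, growth=1.015):
    """Return per-hour send counts that grow by `growth` each hour until `total` is queued."""
    schedule = []
    remaining = total
    rate = float(first_hour)
    while remaining > 0:
        # Never send more than is left in the queue; the final hour may be smaller.
        batch = min(remaining, max(1, round(rate)))
        schedule.append(batch)
        remaining -= batch
        rate *= growth
    return schedule


# Compounding the hourly multiplier over a day gives the day-on-day growth factor.
daily_growth = 1.015 ** 24

if __name__ == "__main__":
    plan = hourly_schedule(first_hour=1_000, total=2_900_000)
    print(f"{len(plan)} hours (~{len(plan) / 24:.1f} days), daily growth x{daily_growth:.2f}")
```

Compounding exactly 1.015 over 24 hours comes out at about 1.43, so the roughly 45% daily figure quoted above suggests the real multiplier was a touch higher than the rounded 1.015. Either way, it sits inside the 50% threshold.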
We've sent all the domain notification emails instantly because, by definition, they're going to a very wide range of different mail servers; it's just the individual ones we're drip-feeding.
Lastly, if you've integrated Pwned Passwords into your service, you'll now see noticeably larger response sizes. The numbers I mentioned in the opening paragraph increase the size of each hash range by an average of about 50%, which will push responses from about 26kb to 40kb. That's when brotli compressed, so obviously, make sure you're making requests that make the most of the compression.
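If you haven't integrated it yet, the range search is simple enough to sketch in a few lines of Python. This is a minimal example rather than a reference client: the function names and the User-Agent string are made up for illustration, and there's no retry or error handling. Only the first five characters of the SHA-1 hash ever leave your machine, and the matching is done locally against the returned suffixes.

```python
import hashlib
import urllib.request

RANGE_URL = "https://api.pwnedpasswords.com/range/{prefix}"


def split_hash(password):
    """Return (first 5 hex chars, remaining 35) of the uppercase SHA-1 of the password."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(body, suffix):
    """Find `suffix` in a range response made of SUFFIX:COUNT lines; 0 if absent."""
    for line in body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


def pwned_count(password):
    """Query the range API for the password's hash prefix and return its prevalence."""
    prefix, suffix = split_hash(password)
    request = urllib.request.Request(
        RANGE_URL.format(prefix=prefix),
        headers={"User-Agent": "pwned-passwords-example"},  # hypothetical client name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8")
    return count_in_range(body, suffix)
```

Note that urllib doesn't negotiate compression on its own, so this sketch downloads the uncompressed response. A real integration would use an HTTP client that sends an Accept-Encoding header and decodes the reply, which is how you get the smaller response sizes mentioned above.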
This data is now searchable in HIBP as the Synthient Credential Stuffing Threat Data. It's an entirely separate corpus from that previous Synthient data I mentioned earlier; they're discrete datasets with some crossover, but obviously, this one is significantly larger. And, of course, all the passwords are now searchable per the Pwned Passwords guidance above.
If I could close with one request: this was an extremely laborious, time-consuming and expensive exercise for us to complete. We've done our best to verify the integrity of the data and make it searchable in a practical way while remaining as privacy-centric as possible. Sending as many notifications as we have will inevitably lead to a barrage of responses from people wanting access to complete rows of data, grilling us on precisely where it was obtained from or, believe it or not, outright abusing us. Not doing those things would be awesome, and I suggest instead putting the energy into getting a password manager, making passwords strong and unique (or even better, using passkeys where available), and turning on multi-factor auth. That would be an awesome outcome for all 😊
Edit: I've closed off comments on this blog post. As you'll see below, there was a constant stream of questions that have already been answered in the post itself, plus some comments that were starting to verge on precisely what I predicted in the last para above. Reading, responding and engaging is time-consuming and at this point, all the answers are already here both above and below this edit in the comments.
2025-11-03 16:23:37

The 2 billion email address stealer log breach I talk about this week is almost ready to go at the time of writing. It's been massively time-consuming, massively expensive (we turned the cloud up to 11) and enormously frustrating. I've written about why in the draft blog post, but once you get to the point of inserting billions of records into a system with lots of billions of records, most of the usual ways we'd do things totally go out the window. I think we're looking at Wednesday morning Europe time to get that out (oh, also, hi from Oslo!) so definitely stay for that one. Next week, I'll talk about how much subsequent support pain this breach is going to cause us 😭
2025-10-27 17:09:21

Tracking down bugs in software is a pain that all of us who write code must bear. When we're talking about outright errors in a web page, you typically have something to get you started (such as output in the console), but that wasn't the case here:
Sure! Reboots don't help :) Here are the two error screens which show up. pic.twitter.com/w2dmZcVyHk
— Peter Vogel (@PeterVogel) July 11, 2025
That's on a Chromebook, and it's the first user report we had about the issue back in early July. The initial problem this presented is that there are not a lot of people running around with devices we could test on. But there are enough people using them that we had multiple similar reports, so we were well beyond just giving people like Peter a bit of "works on my machine", and moving on. But the "SIGILL" error means that something pretty low-level has happened and, as you can see from the screen grab, you can't exactly just pop open the dev tools and peek at what's broken in the site when it can't even load in the first place.
However, after months of making no progress whilst the occasional Chromium user popped their head up and reported exactly the same problem, the answer finally emerged:
Reading MDN docs I don't find a directive 'report-sha256', so tried only removing that, and no crash.
— Mark : 1x Software Artisan (@virullius) October 24, 2025
Uh... shouldn't a browser just ignore a directive it doesn't recognise? (And incidentally, report-sha256 is documented in CSP level 3.) But the timing was awful coincidental with when we added that exact directive, only just before people started reporting problems:
Wow, good sleuthing! The timing of when this first began aligns with this commit from @stebets. I've just dropped it, does it look ok now? Also CC'ing @Scott_Helme - you seen this before mate? Bug was logged here: https://t.co/wiKpdmSxhU pic.twitter.com/m2nsDtAMjB
— Troy Hunt (@troyhunt) October 24, 2025
Getting to the title of this post, we almost worked this out ourselves, we just didn't look at data that was right in front of our eyes. Here it is:

This is Report URI's crash report graph, and until June, we'd had a good run! Crash reports are super cool because your customers' browsers automatically generate them, and with just a little tweaking of your response headers, you can easily turn your customers into automatic crash reporting bots! Report URI's value proposition (disclosure: I have a working relationship with them) is that it can receive those reports and create graphs like you see above. We just weren't watching the reports closely enough, hence the "almost" in the title.
I wanted to write this short post because sometimes, the answer is right in front of your eyes, and if we'd looked at what in hindsight is a really obvious place to check, we would have nailed this months ago. So, turn on crash reporting, and pay attention to it!
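As a rough illustration of the "little tweaking of your response headers" part, here's a small Python sketch that builds a Reporting-Endpoints header. My understanding of the Reporting API is that crash reports go to the endpoint named "default", but the exact header your setup needs may differ (some configurations still use the older Report-To header), and the URL below is a placeholder rather than a real collection address.

```python
def reporting_endpoints_header(endpoints):
    """Build a Reporting-Endpoints header value from a name -> URL mapping."""
    return ", ".join(f'{name}="{url}"' for name, url in endpoints.items())


# Hypothetical endpoint URL; substitute the one your reporting service gives you.
headers = {
    "Reporting-Endpoints": reporting_endpoints_header(
        {"default": "https://reports.example.com/crash"}
    )
}
```

Add that header to your responses in whatever web framework you use, then actually look at what comes back.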
2025-10-26 05:32:46

It was the Synthient threat data that ate most of my time this week, and it continues to do so now, the weekend after recording this video. Data like this is equal parts enormously damaging to victims and frustratingly noisy to process. I have to be confident enough that it's new enough, legit enough and impactful enough to justify loading and that the value presented to breach victims sufficiently offsets the inevitable chorus of "what am I meant to do with this, tell me exactly what password was exposed for my record". It's an expensive exercise too; we're currently running an Azure SQL Hyperscale database at 80 cores to analyse the ~2 billion credential stuffing email addresses in this corpus. That's 2 billion unique email addresses too 😮 More on that in the next video, let's just work out if it's going to go live in the system first.
2025-10-22 03:20:01

Where is your data on the internet? I mean, outside the places you've consciously provided it, where has it now flowed to and is being used and abused in ways you've never expected? The truth is that once the bad guys have your data, it often replicates over and over again via numerous channels and platforms. If you're able to aggregate enough of it en masse, you end up with huge volumes of "threat intelligence data", to use the industry buzzword. And that's precisely what Ben from Synthient has done, and then sent it to Have I Been Pwned (HIBP).
Ben is in his final year of college in the US and is carving out a niche in threat intelligence. He's written up a deeper dive in The Stealer Log Ecosystem: Processing Millions of Credentials a Day, but the headline gives you a sense of the volumes. Have a read of that post and you'll see Ben is pulling data from various sources, including social media, forums, Tor and, of course, Telegram. He's managed to aggregate so much of it that by the time he sent it to us, it was rather sizeable:

That's 3.5 terabytes of data, with the largest file alone being 2.6TB and, combined, they contain 23 billion rows. It's a vast corpus, and if we were attempting to compete with recent hyperbolic headlines about breach sizes, this would be one of the largest. But I'm not going to play the "mine is bigger than yours" game because it makes no sense once you start analysing the data. Part of what makes the data so large is that we're actually looking at both stealer logs and credential stuffing lists, so let's assess them separately, starting with those stealer logs.
Stealer logs are the product of infostealers, that is, malware running on infected machines and capturing credentials entered into websites on input. The output of those stealer logs is primarily three things: the website address, the email address and the password.
Someone logging into Gmail, for example, ends up with their email address and password captured against gmail.com, hence the three parts. Due to the fact that stealer logs are so heavily recycled (they're posted over and over again to the sorts of channels Ben monitors), the first thing we always do is try to get a sense of how much is genuinely new:

This is the output of a little PowerShell script we use to gauge where the email addresses in a new breach corpus have been seen before. Especially when there's a suspicion that data might have been repurposed from elsewhere, it's really useful to run them against the HIBP API and see what comes back. What the output above tells us is that after checking a sample of 94k of them, 92% had been previously seen, mostly in stealer log corpuses we'd loaded in the past. This is an empirical demonstration of what I wrote in the opening paragraph - "it often replicates over and over again" - and as you can see, most of what has been seen before was in the ALIEN TXTBASE stealer logs.
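The script itself is PowerShell and isn't reproduced here, but the idea is easy to sketch. The Python below is an assumption-laden approximation, not our actual tooling: it looks up each sampled address via the HIBP breachedaccount endpoint (which needs an API key), then tallies how many addresses were previously seen and in which breaches. The function names and the user agent string are hypothetical, and rate limiting is deliberately left out.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from collections import Counter

API_URL = "https://haveibeenpwned.com/api/v3/breachedaccount/{account}"


def fetch_breach_names(email, api_key):
    """Return the names of breaches an address appears in (empty list if none)."""
    request = urllib.request.Request(
        API_URL.format(account=urllib.parse.quote(email)),
        headers={"hibp-api-key": api_key, "user-agent": "breach-sample-check-example"},
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return [breach["Name"] for breach in json.load(response)]
    except urllib.error.HTTPError as error:
        if error.code == 404:  # the API answers 404 for an address it has never seen
            return []
        raise


def summarise(emails, lookup):
    """Tally how many sampled addresses were seen before, and in which breaches."""
    seen = 0
    per_breach = Counter()
    for email in emails:
        names = lookup(email)
        if names:
            seen += 1
            per_breach.update(names)
    total = len(emails)
    pct = round(100 * seen / total) if total else 0
    return {"checked": total, "seen": seen, "seen_pct": pct, "per_breach": per_breach}
```

With a real key you'd call something like summarise(sample, lambda e: fetch_breach_names(e, api_key)) and read off the percentage and the most common breach names. Keeping the lookup as a parameter also means the tallying logic can be tested without touching the network.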
Back to the console output again, and having previously seen 92% of addresses also means we haven't seen 8% of the addresses. That's 8% of a considerable number, too: we found 183M unique email addresses across Ben's stealer log data, so we're talking about 14M+ addresses that have never surfaced in HIBP. (The final number once the entire data set was loaded into HIBP was 91% pre-existing, with 16.4M previously unseen addresses in any data breach, not just stealer logs.) But as with everything we load, the question has to be asked: Is it legit? Can you trust the shady criminals who publish this data not to fill it with junk? The only way to know for sure is to ask the legitimate owners of the data, so I reached out to a bunch of our subscribers and sought their support in verifying.
One of the respondents was already concerned there could be something wrong with his Gmail account and sure enough, he had one stealer log entry for "https://accounts.google.com/signin/challenge/pwd/1" with a, uh, "suboptimal" password:
Yes I can confirm that was an accurate password on my gmail account a few months ago
Another respondent who offered support had somewhat of a recognisable pattern in the sites he'd been visiting:

To his credit, he responded and confirmed that the list did indeed contain sites he'd visited, which also included online casinos, crypto websites and VPN services:
They all look like websites I have used and some still do use
As it turns out, he also had two other email addresses in the corpus of data, both with the same collection of passwords used on the first address he replied from. They also both aligned to services based on the same TLD as the other email address which suggested which country he's located in. (Incidentally, the online privacy offered by VPNs kinda falls apart when there's malware on your machine watching every site you visit and recording your credentials.)
Even without a response from a subscriber, it's still easy to get a sense of the legitimacy of the data in a privacy-preserving fashion (i.e. not logging in with their credentials!) just by testing enumeration vectors. For example, one subscriber had an account at ShopBack in the Philippines which offers what I'll refer to as "account enumeration as a service":

I simply added some characters in front of the email address and ShopBack happily confirmed that address didn't exist. However, remove the invalid characters and there's a very different response:

All of these little "tells" add up; another subscriber had a high prevalence of Greek websites they used, showing exactly the sort of pattern you'd expect to see for someone from that corner of the world. Another had various online survey sites they'd used, and like our "assandfurious" friend from earlier, a clear pattern emerged consistent with the apparent interests of the address's owner. Time and time again, the data checked out, so we loaded it. Those 183M email addresses are now searchable in HIBP, and the passwords are also searchable in Pwned Passwords, which has become rather popular:
Pwned Passwords just served 17.45 billion requests in 30 days 🤯 That's an *average* of 6,733 requests per second, but at our peak, we're hitting 42k per second in a 1-minute block. Crazy numbers! Made possible by @Cloudflare 😎 pic.twitter.com/Io6u1PiqJf
— Troy Hunt (@troyhunt) October 17, 2025
The website addresses are also now searchable, either in the stealer log section of your personal dashboard or by verified domain owners using the API. You'll find this data named "Synthient Stealer Log Threat Data" in HIBP, but stealer logs are only part of the Synthient story - the small part!
Ben's data also contained credential stuffing lists. Unlike stealer logs, which are the product of malware on the victim's machine, credential stuffing lists are typically aggregated from other places where email address and password pairs are obtained. For example, from data breaches where the passwords are either stored in plain text or protected with easily crackable hashing algorithms. Those lists are then used to access the other accounts of victims where they've reused their passwords.
Quick sidenote: Credential stuffing lists can be enormously damaging because they contain the keys to so many different services. Not only are they the gateway to so many takeovers of social media accounts, email addresses and other valuable personal resources, they're also responsible for many subsequent very serious data breaches. The 2017 Uber breach was attributed to previously breached employee credentials. Five years later, and the same approach provided the initial access to Uber again, after which MFA-bombing sealed the deal. Then there was the 23andMe breach in 2023, which was also traced back to credential stuffing. Similar but different was when Dunkin' Donuts had 20k customer details exposed in a show of how multifaceted this style of attack is: they were subsequently sued for not having sufficient controls to stop hackers from simply logging in with victims' legitimate credentials. It's wild; it's the attack that just keeps on giving.
Ever since loading Collection #1 in 2019, I have been extra cautious about dealing with credential stuffing lists. The 400+ comments on that blog post will give you just a little taste of how much attention that exercise garnered. Frankly, it was a significant contributor to the feeling that it was all getting a bit too much, leading to the decision that HIBP needed to find another home (which fortunately, never eventuated). The primary issue with credential stuffing lists is that we can't attribute a given row to a specific source website or data breach, and we don't offer a service to look up credential pairs. As you'll see from many of the comments on that post, I had angry people upset that, without knowing specifically which password was exposed in the list, the knowledge that they were in there was not actionable. I disagree, because by loading those passwords into Pwned Passwords, there are now three easy ways to check if you're using a vulnerable one:

My vested interest in 1Password aside, Watchtower is the easiest, fastest way to understand your potential exposure in this incident. And in case you're wondering why I have so many vulnerable and reused passwords, it's a combination of the test accounts I've saved over the years and the 4-digit PINs some services force you to use. Would you believe that every single 4-digit number ever has been pwned?! (If you're interested, the ABC has a fantastic infographic using a heatmap based on HIBP data that shows some very predictable patterns for 4-digit PINs.)
As of the time of publishing this blog post, only the stealer logs have been loaded, and as mentioned earlier, the data in HIBP has been called "Synthient Stealer Log Threat Data". We intend to load the credential stuffing data as a separate corpus next week and call it "Synthient Credential Stuffing Threat Data", assuming it's sufficiently new and the accuracy is confirmed with our subscribers! We're doing this in two parts simply because of the scale of the data and the fact that we want to break it into two discrete corpuses given the data originates via different means. I'll revise this blog post accordingly after we finish our analysis.
Something that is becoming more evident as we load more stealer logs is that treating them as a discrete "breach" is not an accurate representation of how these things work. The truth is that, unlike a single data breach such as Ashley Madison, Dropbox, or the many other hundreds already in HIBP, stealer logs are more of a firehose of data that's just constantly spewing personal info all over the place. That, combined with the duplication of previously seen data, means that we need a rethink on this model. The data itself is still on point, but I'd like to see HIBP better reflect that firehose analogy and provide a constant stream of new data. Until then, Synthient's Threat Data will still sit in HIBP and be searchable in all the usual ways.