
Closer to the Edge: Hyperscaling Have I Been Pwned with Cloudflare Workers and Caching

2024-11-21 15:35:59


I've spent more than a decade now writing about how to make Have I Been Pwned (HIBP) fast. Really fast. Fast to the extent that sometimes, it was even too fast:

The response from each search was coming back so quickly that the user wasn’t sure if it was legitimately checking subsequent addresses they entered or if there was a glitch.

Over the years, the service has evolved to use emerging techniques that not only make things fast, but also make them scale better under load, increase availability and, sometimes, even drive down cost. For example, 8 years ago now I started rolling the most important services over to Azure Functions, "serverless" code that was no longer bound by logical machines and would just scale out to whatever volume of requests was thrown at it. And just last year, I turned on Cloudflare cache reserve to ensure that all cacheable objects remained cached, even under conditions where they previously would have been evicted.

And now, the pièce de résistance, the coolest performance thing we've done to date (and it is now "we", thank you Stefán): just caching the whole lot at Cloudflare. Everything. Every search you do... almost. Let me explain, firstly by way of some background:

When you hit any of the services on HIBP, the first place the traffic goes from your browser is to one of Cloudflare's 330 "edge nodes":

[Image: map of Cloudflare's 330 edge nodes around the world]

As I sit here writing this on the Gold Coast on Australia's eastern seaboard, any request I make to HIBP hits that edge node on the far right of the Aussie continent, which is just up the road in Brisbane. The capital city of our great state of Queensland is just a short jet ski away, about 80km as the crow flies. Before now, every single time I searched HIBP from home, my request bytes would travel up the wire to Brisbane and then take a giant 12,000km trip to Seattle, where the Azure Function in the West US Azure data centre would query the database before sending the response 12,000km back west to Cloudflare's edge node, then the final 80km down to my Surfers Paradise home. But what if it didn't have to be that way? What if that data was already sitting on the Cloudflare edge node in Brisbane? And the one in Paris, and the one in... well, I'm not even sure where all those blue dots are, but what if it was everywhere? Several awesome things would happen:

  1. You'd get your response much faster as we've just shaved off more than 99% of the distance the bytes need to travel.
  2. The availability would massively improve as there are far fewer nodes for the traffic to traverse through, plus when a response is cached, we're no longer dependent on the Azure Function or underlying storage mechanism.
  3. We'd save on Azure Function execution costs, storage account hits and especially egress bandwidth (which is very expensive).

In short, pushing data and processing "closer to the edge" benefits both our customers and ourselves. But how do you do that for 5 billion unique email addresses? (Note: as of today, HIBP reports over 14 billion breached accounts; the number of unique email addresses is lower because, on average, each breached address has appeared in multiple breaches.) To answer this question, let's recap on how the data is queried:

  1. Via the front page of the website. This hits a "unified search" API which accepts an email address and uses Cloudflare's Turnstile to prohibit automated requests not originating from the browser.
  2. Via the public API. This endpoint also takes an email address as input and then returns all breaches it appears in.
  3. Via the k-anonymity enterprise API. This endpoint is used by a handful of large subscribers such as Mozilla and 1Password. Instead of searching by email address, it implements k-anonymity and searches by hash prefix.

Let's delve into that last point further because it's the secret sauce to how this whole caching model works. In order to provide subscribers of this service with complete anonymity over the email addresses being searched for, the only data passed to the API is the first six characters of the SHA-1 hash of the full email address. If this sounds odd, read the blog post linked to in that last bullet point for full details. The important thing for now, though, is that it means there are a total of 16^6 different possible requests that can be made to the API, which is just over 16 million. Further, we can transform the first two use cases above into k-anonymity searches on the server side, as that simply involves hashing the email address and taking those first six characters.
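To make that transform concrete, here's a minimal sketch of deriving the six-character prefix using the Web Crypto API available in Cloudflare Workers. The function name and the normalisation step are illustrative assumptions, not HIBP's actual code:

    // Sketch only: SHA-1 the address and keep the first six hex characters.
    async function hashPrefix(email: string): Promise<string> {
      // Assumption: trimming/lower-casing before hashing; HIBP's exact normalisation may differ.
      const bytes = new TextEncoder().encode(email.trim().toLowerCase());
      const digest = await crypto.subtle.digest("SHA-1", bytes);
      const hex = [...new Uint8Array(digest)]
        .map(b => b.toString(16).padStart(2, "0"))
        .join("")
        .toUpperCase();
      return hex.slice(0, 6); // six hex characters, so 16^6 (~16.7M) possible values
    }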

In summary, this means we can boil the entire searchable database of email addresses down to the following:

  1. AAAAAA
  2. AAAAAB
  3. AAAAAC
  4. ...about 16 million other values...
  5. FFFFFD
  6. FFFFFE
  7. FFFFFF

That's a large albeit finite list, and that's what we're now caching. So, here's what a search via email address looks like:

  1. Address to search: [email protected]
  2. Full SHA-1 hash: 567159D622FFBB50B11B0EFD307BE358624A26EE
  3. Six char prefix: 567159
  4. API endpoint: https://[host]/[path]/567159
  5. If hash prefix is cached, retrieve result from there
  6. If hash prefix is not cached, query origin and save to cache
  7. Return result to client
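Sketched as worker code, with placeholder host names and none of the real parameter handling (domain filtering, response truncation), that flow looks roughly like this:

    // Sketch only: look up a hash prefix in the edge cache, falling back to the origin.
    export default {
      async fetch(request: Request, env: unknown, ctx: ExecutionContext): Promise<Response> {
        const prefix = new URL(request.url).pathname.split("/").pop()!; // e.g. "567159"
        const cacheKey = new Request(`https://example-origin.test/range/${prefix}`); // placeholder origin
        const cache = caches.default;

        // Step 5: if the hash prefix is cached, serve it straight from the edge
        let response = await cache.match(cacheKey);
        if (!response) {
          // Step 6: otherwise query the origin, then save the result to the edge cache
          const originResponse = await fetch(cacheKey);
          response = new Response(originResponse.body, originResponse);
          response.headers.set("Cache-Control", "public, max-age=31536000");
          ctx.waitUntil(cache.put(cacheKey, response.clone()));
        }

        // Step 7: return the result to the client
        return response;
      },
    };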

K-anonymity searches obviously go straight to step four, skipping the first few steps as we already know the hash prefix. All of this happens in a Cloudflare worker, so it's "code on the edge" creating hashes, checking cache then retrieving from the origin where necessary. That code also takes care of handling parameters that transform queries, for example, filtering by domain or truncating the response. It's a beautiful, simple model that's all self-contained within a worker and a very simple origin API. But there's a catch - what happens when the data changes?

There are two events that can change cached data, one is simple and one is major:

  1. Someone opts out of public searchability and their email address needs to be removed. That's easy, we just call an API at Cloudflare and flush a single hash prefix.
  2. A new data breach is loaded and there are changes to a large number of hash prefixes. In this scenario, we flush the entire cache and start populating it again from scratch.

The second point is kind of frustrating as we've built up this beautiful collection of data all sitting close to the consumer where it's super fast to query, and then we nuke it all and go from scratch. The problem is it's either that or we selectively purge what could be many millions of individual hash prefixes, which you can't do:

For Zones on Enterprise plan, you may purge up to 500 URLs in one API call.

And:

Cache-Tag, host, and prefix purging each have a rate limit of 30,000 purge API calls in every 24 hour period.

We're giving all this further thought, but it's a non-trivial problem and a full cache flush is both easy and (near) instantaneous.
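For illustration, those two purge paths map onto Cloudflare's cache purge API roughly as follows; the zone ID, API token and URL shape are all placeholders:

    // Sketch only: call Cloudflare's purge_cache endpoint with either a file list or purge_everything.
    const CF_API = "https://api.cloudflare.com/client/v4";

    async function purgeCache(zoneId: string, token: string, body: object): Promise<void> {
      await fetch(`${CF_API}/zones/${zoneId}/purge_cache`, {
        method: "POST",
        headers: { "Authorization": `Bearer ${token}`, "Content-Type": "application/json" },
        body: JSON.stringify(body),
      });
    }

    // Opt-out: evict a single hash prefix (placeholder URL)
    await purgeCache("ZONE_ID", "API_TOKEN", { files: ["https://example-api.test/range/567159"] });

    // New breach: flush the whole zone and let it repopulate from scratch
    await purgeCache("ZONE_ID", "API_TOKEN", { purge_everything: true });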

Enough words, let's get to some pictures! Here's a typical week of queries to the enterprise k-anonymity API:

[Graph: a typical week of queries to the enterprise k-anonymity API]

This is a very predictable pattern, largely due to one particular subscriber regularly querying their entire customer base each day. (Sidenote: most of our enterprise-level subscribers use callbacks, such that we push updates to them via webhook when a new breach impacts their customers.) That's the total volume of inbound requests, but the really interesting bit is the requests that hit the origin (blue) versus those served directly by Cloudflare (orange):

[Graph: requests hitting the origin (blue) versus those served by Cloudflare (orange)]

Let's take the lowest blue data point towards the end of the graph as an example:

[Graph: detail of the lowest origin (blue) data point towards the end of the week]

At that time, 96% of requests were served from Cloudflare's edge. Awesome! But look at it only a little bit later:

[Graph: the same view a little later, immediately after the cache flush]

That's when I flushed cache for the Finsure breach, and 100% of traffic started being directed to the origin. (We're still seeing 14.24k hits via Cloudflare as, inevitably, some requests in that 1-hour block were to the same hash range and were served from cache.) It then took a whole 20 hours for the cache to repopulate to the extent that the hit:miss ratio returned to about 50:50:

[Graph: the cache hit:miss ratio returning to about 50:50 over the following 20 hours]

Look back towards the start of the graph and you can see the same pattern from when I loaded the DemandScience breach. This all does pretty funky things to our origin API:

[Graph: traffic to the origin API, spiking sharply after each cache flush]

That last sudden increase is more than a 30x traffic increase in an instant! If we hadn't been careful about how we managed the origin infrastructure, we would have built a literal DDoS machine. Stefán will write later about how we manage the underlying database to ensure this doesn't happen, but even still, whilst we're dealing with the cyclical traffic patterns seen in that first graph above, I know that the best time to load a breach is later in the Aussie afternoon, when the traffic is a third of what it is first thing in the morning. This helps smooth out the rate of requests to the origin such that by the time the traffic is ramping up, more of the content can be returned directly from Cloudflare. You can see that in the graphs above; that big peaky block towards the end of the last graph is pretty steady, even though the inbound traffic in the first graph over the same period of time increases quite significantly. It's like we're trying to race the increasing inbound traffic by building ourselves up a buffer in cache.

Here's another angle to this whole thing: now more than ever, loading a data breach costs us money. For example, by the end of the graphs above, we were cruising along at a 50% cache hit ratio, which meant we were only paying for half the Azure Function executions, egress bandwidth and underlying SQL database usage we would have been otherwise. Flushing cache and suddenly sending all the traffic to the origin doubles our cost. Waiting until we're back at a 90% cache hit ratio literally increases those costs 10x when we flush (at 90%, only one request in ten reaches the origin, so a flush briefly sends ten times as much traffic there). If I were to be completely financially ruthless about it, I would need to either load fewer breaches or bulk them together such that a cache flush is only ejecting a small amount of data anyway, but clearly, that's not what I've been doing 😄

There's just one remaining fly in the ointment...

Of those three methods of querying email addresses, the first is a no-brainer: searches from the front page of the website hit a Cloudflare Worker, which validates the Turnstile token and returns a result. Easy. However, the other two models (the public and enterprise APIs) have the added burden of validating the API key against Azure API Management (APIM), and the only place that exists is in the West US origin service. What this means for those endpoints is that before we can return search results from a location that may be just a short jet ski ride away, we need to go all the way to the other side of the world to validate the key and ensure the request is within the rate limit. We do this in the lightest possible way, with barely any data transiting the request to check the key, plus we do it asynchronously, in parallel with pulling the data back from the origin service if it isn't already in cache. In other words, we're as efficient as humanly possible, but we still cop a massive latency burden.
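A simplified sketch of that parallelism: kick off the key check and the range lookup at the same time, then wait on both before responding. The endpoint names here are hypothetical; the hibp-api-key header is the one the public API documents:

    // Sketch only: validate the API key (at the origin) in parallel with the data lookup.
    async function handleApiSearch(prefix: string, apiKey: string): Promise<Response> {
      const keyCheck = fetch("https://example-origin.test/validate-key", { // hypothetical endpoint
        headers: { "hibp-api-key": apiKey },
      });
      const dataLookup = lookupRange(prefix); // cache-then-origin flow from the earlier sketch

      const [keyResponse, data] = await Promise.all([keyCheck, dataLookup]);
      if (!keyResponse.ok) {
        // Unauthorised or over the rate limit: discard the data and surface the origin's verdict
        return new Response(null, { status: keyResponse.status });
      }
      return data;
    }

    // Placeholder standing in for the cache/origin lookup shown earlier
    async function lookupRange(prefix: string): Promise<Response> {
      return fetch(`https://example-origin.test/range/${prefix}`);
    }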

Doing API management at the origin is super frustrating, but there are really only two alternatives. The first is to distribute our APIM instance to other Azure data centres, and the problem with that is we need a Premium instance of the product. We presently run on a Basic instance, which means we're talking about a 19x increase in price just to unlock that ability. But that's just to go Premium; we then need at least one more instance somewhere else for this to make sense, which means we're talking about a 28x increase. And every region we add amplifies that even further. It's a financial non-starter.

The second option is for Cloudflare to build an API management product. This is the killer piece of this puzzle, as it would put all the checks and balances within the one edge node. It's a suggestion I've put forward on many occasions now, and who knows, maybe it's already in the works, but it's a suggestion I make out of a love of what the company does and a desire to go all-in on having them control the flow of our traffic. I did get a suggestion this week about rolling what is effectively a "poor man's API management" within workers, and it's a really cool suggestion, but it gets hard when people change plans or when we want to apply quotas to APIs rather than rate limits. So c'mon Cloudflare, let's make this happen!

Finally, just one more stat on how powerful serving content directly from the edge is: I shared this stat last month for Pwned Passwords which serves well over 99% of requests from Cloudflare's cache reserve:

That's about 3,900 requests per second, on average, non-stop for 30 days. It's obviously way more than that at peak; just a quick glance through the last month and it looks like about 17k requests per second in a one-minute period a few weeks ago:

[Graph: Pwned Passwords requests per second over the last month, peaking around 17k/s]

But it doesn't matter how high it is, because I never even think about it. I set up the worker, I turned on cache reserve, and that's it 😎

I hope you've enjoyed this post. Stefán and I will be doing a live stream on this topic at 06:00 AEST Friday morning for this week's regular video update, and it'll be available for replay immediately after. It's also embedded here for convenience.

Weekly Update 426

2024-11-17 10:39:54


I have absolutely no problem at all talking about the code I've screwed up. Perhaps that's partly because, after three decades of writing software (and doing some meaningful stuff along the way), I'm not particularly concerned about showing my weaknesses. And this week, I screwed up a bunch of stuff: database queries that weren't resilient to SQL database scale changes, partially completed breach notifications I didn't notice until it was too late to easily fix, and some queries that performed so badly they crashed the entire breach notification process after loading the massive DemandScience incident. Fortunately, none of them had any impact of note; we fixed them all, re-ran the processes, and now we're more resilient than ever 😄

Oh - and if you like this style of content, this coming Friday, Stefán and I will do a joint live stream on all sorts of other bits about how HIBP now runs.


References

  1. Sponsored by: 1Password Extended Access Management: Secure every sign-in for every app on every device.
  2. Elon Musk is right (I hate cookie warnings, but I'm entertained by people losing their minds "because Elon")
  3. The Hot Topic breach went into HIBP (that's another 57M email addresses right there)
  4. There are also now 122M more records in HIBP courtesy of the DemandScience breach (it's publicly aggregated data, but it's still a breach)

Inside the DemandScience by Pure Incubation Data Breach

2024-11-13 17:59:27


Apparently, before a child reaches the age of 13, advertisers will have gathered more than 72 million data points on them. I knew I'd seen a metric about this sometime recently, so I went looking for "7,000", which perfectly illustrates how unaware we are of the extent of data collection on all of us. I started Have I Been Pwned (HIBP) in the first place because I was surprised at where my data had turned up in breaches. 11 years and 14 billion breached records later, I'm still surprised!

Jason (not his real name) was also recently surprised at where his data had appeared. He found it in a breach of a service called "Pure Incubation", a company whose records had appeared on a popular hacking forum earlier this year:

When Jason found his email address and other info in this corpus, he had the same question so many others do when their data turns up in a place they've never heard of before - how? Why?! So, he asked them:

I seem to have found my email in your data breach. I am interested in finding how my information ended up in your database.

To their credit, he got a very comprehensive answer, which I've included below:

[Image: DemandScience's response to Jason, explaining his data was aggregated from publicly available sources]

Well, that answers the "how" part of the equation; they've aggregated data from public sources. And the "why" part? It's the old "data is the new oil" analogy that recognises how valuable our info is, and as such, there's a market for it. There are lots of terms used to describe what DemandScience does, including "B2B demand generation", "buyer intelligence solutions provider", "empowering technology companies to accelerate ROI", "supercharging pipelines" and "account intelligence". Or, to put it in a more lay-person-friendly fashion, they sell data on people.

DemandScience is what we refer to as a "data aggregator" in that they combine identity data from multiple locations, bundle it up, and then sell it. Occasionally, data aggregators end up having sizeable data breaches; before today, HIBP already contained Adapt (9M records), Data & Leads (44M records), Exactis (132M records), Factual (2M records), and You've Been Scraped (66M records). According to DemandScience, "none of our current operational systems were exploited", yet simultaneously, "the leaked data originated from a system that has been decommissioned". So, it's a breach of an old system.

Does it matter? I mean, if it's just public data, should people care? Jason cared, at least enough to make the original enquiry and for DemandScience to look him up and realise he's not in their current database. Still, he existed in the breached one (I later sent Jason his record from the breach, and he confirmed the accuracy). As I often do in these cases, I reached out to a bunch of recent HIBP subscribers in the breach and asked them three simple questions:

  1. Is the data about you accurate and if not, which bits are wrong?
  2. Is this data you would consider to be in the public domain already?
  3. Would you expect to be notified about your data being used in this fashion, and consequently appearing in a breach?

The answers were all the same: the data is accurate, it's already in the public domain, and people aren't too concerned about it appearing in this breach. Well that was easy 🙂 However...

There are two nuances that aren't captured here. The first is that this is valuable data - that's why DemandScience sells it! It comes back to that "new oil" analogy: if you have enough of it, you can charge good money for it. Companies typically use data such as this to do precisely the sort of catchphrasey stuff the company talks about, primarily around maximising revenue from their customers by understanding them better.

The second nuance is that whilst this data may already be in the public domain, did the owners of it expect it to be used in this fashion? For example, if you publish your details in a business directory, is your expectation that this info may then be sold to other companies to help them upsell you on their products? Probably not. And if, like many of the records in the data, someone's row is accompanied by their LinkedIn profile, would they expect that data to be matched and sold? I suggest the responses would likely be split here, and that in itself is an important observation: how we view the sensitivity of our data and the impact of it being exposed (whether personal or business) is extremely personal. Some people take the view of "I have nothing to hide", whilst others become irate if even just their email address is exposed.

Whilst considering how to add more insights to this blog post, I thought I'd do a quick check on just one more email address:

"54543060",,"0","TROY","HUNT","PO BOX 57",,"WEST RYDE",,,"AU","61298503333",,,,"[email protected]","pfizer.com","PFIZER INC",,"250-499","$50 - 99 Million","Healthcare, Pharmaceuticals and Biotech","VICE PRESIDENT OF INFORMATION TECHNOLOGY","VP Level","2834",,"Senior Management (SVP/GM/Director)","IT",,"1","GemsTarget INTL","GEMSTARGET_INTL_648K_10.17.18",,,,,,,,,"18/10/2018 05:12:39","5/10/2021 16:47:56","PFIZER.COM",,,,,"IT Management General","Information Technology"

I'll be entirely transparent and honest here - my exact words after finding this were "motherfucker!" True story, told uncensored here because I want to impress on the audience how I feel when my data turns up somewhere publicly. And I do feel like it's "my" data; it's certainly my name and even though it's my old Pfizer email address I've not used for almost a decade now, that also has my name in it. My job title is also there... and it's completely wrong! I never had a VP-level role, even though the other data around my tech role is at least in the vicinity of being correct. But other than the initial shock of finding myself in yet another data breach, personally, I'm in the same boat as the HIBP subscribers I contacted, and this doesn't bother me too much. But I also agree with the following responses I received to my third question:

I think it is useful to be notified of such breaches, even if it is just to confirm no sensitive data has been compromised. As I said, our IT department recently notified me that some of my data was leaked and a pre-emptive password reset was enforced as they didn't know what was leaked. 
It would be good to see it as an informational notification in case there's an increase in attack attempts against my email address.
I would like to opt-out of here to reduce the SPAM and Phishing emails.

That last one seems perfectly reasonable, and fortunately, DemandScience does have a link on their website to Do Not Sell My Information:

[Image: DemandScience's "Do Not Sell My Information" form, which only caters to California residents]

Dammit! If, like me, you're part of the 99.5% of the world that doesn't live in California, then apparently this form isn't for you. However, they do list [email protected] on that page, which is the same address Jason was communicating with above. Chances are, if you want to remove your data then that's where to start.

There were almost 122M unique email addresses in this corpus and those have now been added to HIBP. Treat this as informational; I suspect that for most people, it won't bother them, whilst others will ask for their data not to be sold (regardless of where they live in the world). But in all likelihood, there will be more than a handful of domain subscribers who take issue with that volume of people data sitting there in one corpus easily downloadable via a clear web hacking forum. For example, mine was just one of many tens of thousands of Pfizer email addresses, and that sort of thing is going to raise the ire of some folks in corporate infosec capacities.

One last comment: there was a story published earlier this year titled Our Investigation of the Pure Incubation Ventures Leak, and in there they refer to "encrypted passwords" being present in the data. Many of the files do contain a column with bcrypt hashes (which is definitely not encryption), but given the way in which this data was collated, I can see no evidence whatsoever that these are password hashes. As such, I haven't listed "Passwords" as one of the compromised data classes in HIBP, and if you find yourself in this breach, I wouldn't be at all worried about this.

Weekly Update 425

2024-11-09 15:15:01


This was a much longer than usual update, largely due to the amount of time spent discussing the Earth 2 incident. As I said in the video (many times!), the amount of attention this has garnered from both Earth 2 users and the company itself is incommensurate with the impact of the incident. It's a nothing-burger: email addresses and usernames, that's it, and of course, their association with the service, which may lead to some very targeted spam or phishing attempts. It's still a breach by any reasonable definition of the term, but it should have been succinctly summarised and disclosed to impacted parties, with everyone moving on to more important things in life a few moments later. And that's exactly what I'm going to do right now 😊


References

  1. Sponsored by: Report URI: Guarding you from rogue JavaScript! Don’t get pwned; get real-time alerts & prevent breaches #SecureYourSite
  2. Speaking of giving a nothing-burger incident more attention than it deserves, the Earth 2 Twitter screed hasn't done them any favours (something something Streisand effect)
  3. Data breach disclosure 101: How to succeed after you've failed (7 years on, this is still the guidance I give breached orgs)

Weekly Update 424

2024-11-03 15:33:01


I have really clear memories of listening to the Stack Overflow podcast in the late 2000s and hearing Jeff and Joel talk about the various challenges they were facing and the things they did to overcome them. I suddenly thought of that when realising how long this week's video ran for, with no real plan other than to talk about our HIBP backlog. People seem to love this in the same way I loved listening to those guys a decade and a half ago. I'll do one of these with Stefán as well over the course of this month; let us know what you'd like to hear about 😊


References

  1. Sponsored by: 1Password Extended Access Management: Secure every sign-in for every app on every device.