
Speeding up Game Sentence Miner (GSM) Statistics by 200%

2025-10-19 10:35:33


Over the course of a weekend in between job interviews, I decided to speed up the loading of statistics in one of my favourite apps, GSM.

GSM is an application designed to make it easy to turn games into flashcards. It records the screen with OBS and uses OCR / Whisper to get text from it. You then hover over a word to look it up in a dictionary, click "add to Anki", and GSM sends the full spoken line from the game plus a gif of the game to your Anki card.

[Image: GSM does a lot more, but this is as succinct as I can make it]
GitHub - bpwhelan/GameSentenceMiner: An All-in-One immersion toolkit for learning Languages through games and other visual media.

GSM has a statistics page, contributed by me. Every time you read something in-game, it is added to a database, and I generate statistics from that data.

These stats take a while to load.

  • /stats takes 6 seconds
  • /anki takes around 40 seconds
  • /overview takes around 4 seconds

And I added /overview because /stats was too slow!

👾
Note: this whole app is a Windows .exe that serves Flask entirely locally. These times are absurd for a local app!

This blog post talks about how I spent my weekend improving the loading speed of the site by around 200%.

[Image: Left == new, right == old]

Why do the statistics take so long to load?

The entire database is one very long table called game_lines.

Every single time a game produces a line of text, that is recorded in game_lines with some statistics.

Each line looks like this:

e726c5f5-7d59-11f0-b39e-645d86fdbc49 NEKOPARA vol.3 「もう1回、同じように……」 C:\Users\XXX\AppData\Roaming\Anki2\User 1\collection.media\GSM 2025-08-20 10-35-15.avif C:\Users\XXX\AppData\Roaming\Anki2\User 1\collection.media\NEKOPARAvol.3_2025-08-20-10-35-28-515.opus 1755648553.21247 ebd4b051-27aa-4957-9b50-3495d1586ec1

Or in a more readable version:

🗂 Entry ID: e726c5f5-7d59-11f0-b39e-645d86fdbc49
🕒 Timestamp: 2025-08-20 10:35:15
🔊 Audio: NEKOPARAvol.3_2025-08-20-10-35-28-515.opus
🖼 Screenshot: GSM 2025-08-20 10-35-15.avif

🦜 Game Line: "「もう1回、同じように……」"

📁 File Paths:
C:\XXX\AppData\Roaming\Anki2\User 1\collection.media\GSM 2025-08-20 10-35-15.avif
C:\XXX\AppData\Roaming\Anki2\User 1\collection.media\NEKOPARAvol.3_2025-08-20-10-35-28-515.opus

🧩 Original Game Name: NEKOPARA vol.3
🧠 Translation Source: NULL
🪶 Internal Ref ID: ebd4b051-27aa-4957-9b50-3495d1586ec1
📆 Epoch Timestamp: 1755648553.21247

Then, to calculate statistics, we query every game line.

For me this takes around 10 seconds.

If you play a lot of games it can take around 1 minute....

All the statistics you have seen so far are calculated from this data alone. There are some easy ones, like:

  • How many characters of this game have I read?
  • How long have I spent playing it?
  • What's the most I've read in a day?

But in the Japanese learning community there is 1 important bit of data everyone wants.

How many characters do I read per hour on average? What is my reading speed?

This is important because we know how many characters are in a game; if we know our reading speed, we can work out how much of a slog it will be.

On a site like Jiten.moe we can insert our reading speed into the settings and see how long it'll take to read something.

At my very nooby reading speed of 2900 characters / hour, it'll take me 550 hours of non-stop reading to play Fate/Stay Night.


Although this is one of the most famous visual novels of all time and has been made into numerous anime, spending 550 hours slogging through it does not seem good.

Knowing my reading speed allows me to pick games / visual novels that I can do in a few weeks rather than a year or more.

🤔
Most people choose to read smaller visual novels / games to keep their interest high, higher interest means you will read more, which means you will improve more and your reading speed will go up 📈

Now looking at our data there is no easy way to calculate this, right? Games do not tell you "Oh yeah in this Call of Duty dialogue you read at this pace".

Other similar sites like ExStatic calculate this:

[Image: ExStatic's statistics page]
ExStatic is like GSM, except you need to use another program to hook into a game's memory to get the game lines sent to ExStatic. This is called a "texthooker". You can also do this with GSM, but explaining how texthookers and OCR work is a blog post for another day.

But interestingly they have sorta the same data as us.

  • Game: Reverb
  • Line: 「ホタカさーん!ホタカさーん!」
  • Timestamp: 1755612879.053

But let's say 4 game lines come in. Each one is 10 characters long.

They come in every 15 minutes.

So our average reading speed is 40 characters per hour.

But then the next day, 24 hours later, we read another line of 10 characters.

Now our average reading speed is skewed much lower, because in our code it looks like it took us 24 hours to read 10 characters.

The absence of data is data itself here, but how is everyone in the Japanese learning community handling this?


Everyone sets an AFK Timer.

If a new line does not arrive within the timer, it assumes you are AFK, and that time stops counting towards your stats.

This may seem uninteresting now, but this powers many of our design choices later on.
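
To make the AFK timer concrete, here's a minimal sketch of AFK-aware active-time calculation, assuming a 120-second cutoff (the real timer is user-configurable) and a plain list of epoch timestamps:

AFK_TIMER_SECONDS = 120  # assumed cutoff; the actual timer is user-configurable

def active_reading_seconds(timestamps: list[float]) -> float:
    """Sum the gaps between consecutive lines, skipping gaps longer than the AFK timer."""
    total = 0.0
    for prev, curr in zip(timestamps, timestamps[1:]):
        gap = curr - prev
        if gap <= AFK_TIMER_SECONDS:
            total += gap  # actively reading
        # else: assume the reader was AFK, contribute nothing
    return total

# The 24-hour gap from the earlier example now contributes zero time, so
# reading speed = total characters / (active_reading_seconds(...) / 3600)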

What should we do?

We have a couple of things we can do to speed up the loading of the stats site.

  • Batch all API calls into one

Currently we fetch all the game lines multiple times and calculate the stats that way. It's not as clear-cut as 1 bar graph == 1 DB call. It's more like each section grabs all of game_lines and alters it to work for that section.

This makes a lot of sense, but sadly it doesn't work so well.

I've already tried this:

unify api calls to one by bee-san · Pull Request #192 · bpwhelan/GameSentenceMiner

Ignore my bad PR etiquette. We talked more about this in the Discord. I don't want to write conventional commits + nice PRs for a very niche tool 😅

Firstly, it only saves around a second of time. We still have to pull all the game lines no matter what.

Secondly, this makes calculating statistics much harder and more rigid. We had one API call, and then we calculated every possible statistic out of that one call and put it into a dictionary.

It's a bit... hardcore...

We basically had one 1,200-line function which calculated every stat and then fed it to each statistic.

We could have broken it up, but to save 1 second of time only? For all that work? Surely there's a faster way.

  • Move statistics out of the page

We've already done this as a little hack. We moved many important statistics from the /statistics page to the /overview page.

This improves loading because instead of loading every stat, we now only load important ones.


Obviously a hack... but it worked.... Load speed went from 7 seconds to 4... Still bad... 🤢

  • Pre-calculate stats

Do we really need to calculate stats on the fly?

What if we were to pre-calculate all of our statistics and then present them to the user?

The final option, pre-calculating stats, is what we will be doing.

🥞 Rolling up stats

Every time GSM runs, let's pre-calculate all previous days' stats for the user and then calculate just today's.

This will save us a lot of time.

Specifically our algorithm will now look like:

  • When GSM runs, roll up all previous days stats to a table
  • When we query statistics, use the pre-rolled up table for all previous data
  • And then we calculate today's stats and add it to the rolled up stats
"Why calculate today's stats on the fly at all? Why not turn each game_line into a rolled up stats and add it to today's rollup?"

By Jove, a great question!

When GSM receives a line of text from a game it does a lot of processing to make it appear on screen etc, so why not precalculate stats there and then?

This makes a lot of sense!

BUTTTT.....

The absence of data is data!

Each game line looks exactly like this:

🗂 Entry ID: e726c5f5-7d59-11f0-b39e-645d86fdbc49
🕒 Timestamp: 2025-08-20 10:35:15
🔊 Audio: NEKOPARAvol.3_2025-08-20-10-35-28-515.opus
🖼 Screenshot: GSM 2025-08-20 10-35-15.avif
🦜 Game Line: "「もう1回、同じように……」"
📁 File Paths:
C:\XXX\AppData\Roaming\Anki2\User 1\collection.media\GSM 2025-08-20 10-35-15.avif
C:\XXX\AppData\Roaming\Anki2\User 1\collection.media\NEKOPARAvol.3_2025-08-20-10-35-28-515.opus
🧩 Original Game Name: NEKOPARA vol.3
🧠 Translation Source: NULL
🪶 Internal Ref ID: ebd4b051-27aa-4957-9b50-3495d1586ec1
📆 Epoch Timestamp: 1755648553.21247

In the moment this imaginary rollup function only has this data.

When we calculate stats, we are looking at the past. We can see where the absences are to calculate the AFK time.

But in the moment, we don't know if the next game line will be 120 seconds or more later.

Therefore we cannot roll up today's stats, because we cannot tell when a user takes an extended break away from the text.

What stats do we pre-calculate?

The next big question is "okay, what do we actually calculate?"

There are 2 types of stats:

  • Raw stats, like characters read
  • Calculated stats that require more than a single bit of data, like average characters per hour for a specific game.

I made an initial list, booted up Claude, and asked it to confirm my list and see if it thought anything else was important.

Together we made this list:

_fields = [
    'date',                           # str — date
    'total_lines',                    # int — total number of lines read
    'total_characters',               # int — total number of characters read
    'total_sessions',                 # int — number of reading sessions
    'unique_games_played',            # int — distinct games played
    'total_reading_time_seconds',     # float — total reading time (seconds)
    'total_active_time_seconds',      # float — total active reading time (seconds)
    'longest_session_seconds',        # float — longest session duration
    'shortest_session_seconds',       # float — shortest session duration
    'average_session_seconds',        # float — average session duration
    'average_reading_speed_chars_per_hour',  # float — average reading speed (chars/hour)
    'peak_reading_speed_chars_per_hour',     # float — fastest reading speed (chars/hour)
    'games_completed',                # int — number of games completed
    'games_started',                  # int — number of games started
    'anki_cards_created',             # int — Anki cards generated
    'lines_with_screenshots',         # int — lines that include screenshots
    'lines_with_audio',               # int — lines that include audio
    'lines_with_translations',        # int — lines that include translations
    'unique_kanji_seen',              # int — unique kanji encountered
    'kanji_frequency_data',           # str — kanji frequency JSON
    'hourly_activity_data',           # str — hourly activity (JSON)
    'hourly_reading_speed_data',      # str — hourly reading speed (JSON)
    'game_activity_data',             # str — per-game activity (JSON)
    'games_played_ids',               # str — list of game IDs (JSON)
    'max_chars_in_session',           # int — most characters read in one session
    'max_time_in_session_seconds',    # float — longest single session (seconds)
    'created_at',                     # float — record creation timestamp
    'updated_at'                      # float — last update timestamp
]

Then using this list we can calculate stats like:

  • Average session length
  • Reading time per game
  • etc etc...

We don't need to calculate every single thing, just have enough data to calculate it all in the moment.

If we pre-calculated things like total_active_time_seconds / total_sessions, the abstraction would become too much.

Like come on, we don't need a whole database column just to divide two numbers 😂

In GSM you can also see your stats data in a date range:

[Screenshot: viewing stats for a custom date range]

So we have all these columns, and each row is 1 day of stats. That way we can easily calculate stats for any date range.

And we just need a special case for today to calculate today's stats.
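
As a sketch, the read path could look something like this (the rollup_stats table name and the calculate_today_from_game_lines helper are illustrative, not GSM's actual code):

import sqlite3
from datetime import date, timedelta

def stats_for_range(conn: sqlite3.Connection, start: date, end: date) -> list[dict]:
    """Serve pre-rolled rows for past days; today is computed live since it can't be rolled up."""
    cols = ("date", "total_characters", "total_active_time_seconds")
    last_rolled = min(end, date.today() - timedelta(days=1))  # rollups exist up to yesterday
    rows = conn.execute(
        f"SELECT {', '.join(cols)} FROM rollup_stats WHERE date BETWEEN ? AND ?",
        (start.isoformat(), last_rolled.isoformat()),
    ).fetchall()
    stats = [dict(zip(cols, row)) for row in rows]
    if end >= date.today():
        stats.append(calculate_today_from_game_lines(conn))  # hypothetical live-path helper
    return stats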

How do we run this?

GSM is a Windows executable, not a fully-fledged server.

It could be run every couple of minutes, or once every couple of months.

We need this code to successfully roll up stats regardless of when it runs, and we need it to be conservative in when it runs.

What we need is some kind of Cron system...

I added a new Database table called cron.

This table just stores information about tasks that GSM wants to run regularly.

[Screenshot: the cron table]

We store some simple data (a sketch of a possible schema follows this list):

  • ID
  • Name
  • Description
  • The last time it ran
  • The next time it runs
  • If it's enabled or not
  • When the cron was created
  • And the schedule it runs on
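
In SQLite terms, the table might look roughly like this (a sketch mirroring the list above, not necessarily GSM's exact schema):

import sqlite3

conn = sqlite3.connect("gsm.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS cron (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        description TEXT,
        last_run    REAL,               -- epoch seconds, NULL until first run
        next_run    REAL NOT NULL,      -- epoch seconds
        enabled     INTEGER DEFAULT 1,  -- 0/1; SQLite has no BOOLEAN type
        created_at  REAL NOT NULL,
        schedule    TEXT NOT NULL       -- 'once', 'daily', 'weekly', ...
    )
""")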

Then when we start GSM, it:

  • Runs a query to get all cron jobs that need to run now
SELECT * FROM {cls._table} WHERE enabled=1 AND next_run <= ? ORDER BY next_run ASC

Loop through our list and run a basic if statement to see if one of our crons needs to run:

for cron in due_crons:
    detail = {
        'name': cron.name,
        'description': cron.description,
        'success': False,
        'error': None
    }
    try:
        if cron.name == 'jiten_sync':
            from GameSentenceMiner.util.cron.jiten_update import update_all_jiten_games
            result = update_all_jiten_games()

            # Mark as successfully run
            CronTable.just_ran(cron.id)
            executed_count += 1
            detail['success'] = True
            detail['result'] = result

            logger.info(f"✅ Successfully executed {cron.name}")
            logger.info(f"   Updated: {result['updated_games']}/{result['linked_games']} games")
    except Exception as e:
        # A failing job shouldn't stop the remaining crons from running
        detail['error'] = str(e)
        logger.error(f"❌ Cron job {cron.name} failed: {e}")

If it needs to run, we import that file and run it (cron jobs are just Python files we import and run; it's really simple).

We then run the command just_ran.

This command:

  1. sets last_run to current time
  2. calculates next_run based on the schedule type (weekly, monthly etc)
if cron.schedule == 'once':
    # For one-time jobs, disable after running
    cron.enabled = False
    cron.next_run = now  # Set to now since it won't run again
    logger.debug(f"Cron job '{cron.name}' completed (one-time job) and has been disabled")
elif cron.schedule == 'daily':
    next_run_dt = now_dt + timedelta(days=1)
    cron.next_run = next_run_dt.timestamp()
    logger.debug(f"Cron job '{cron.name}' completed, next run scheduled for {next_run_dt}")
elif cron.schedule == 'weekly':
    next_run_dt = now_dt + timedelta(weeks=1)
    cron.next_run = next_run_dt.timestamp()
    logger.debug(f"Cron job '{cron.name}' completed, next run scheduled for {next_run_dt}")
  3. updates the cron entry in the database

This is just a super simple way to make GSM run tasks on a schedule without running every single time the app starts.

With all of these changes, our API speed is now....

  • 6 seconds -> 0.5 seconds!

But the webpage itself still loads in 3.5 seconds.

Google Lighthouse

Google Lighthouse rates our website as a 37.

It complains about some simple things like:

  • Preloading CSS / HTML
  • No compression
  • No caching

So what I did was:

  • Set rel=preload for important css
  • Added the flask-compress dependency to compress the Flask payload using Brotli (a minimal setup sketch follows this list). I read this HN comment thread on Brotli vs zstd and I believe Brotli makes the most sense for now.
😅
Brotli in a local open source program? Isn't that overkill?

🤓 achtkually no! The /api/stats endpoint returns a massive JSON payload containing all the stats (rolled up and today's) that's parsed by the frontend into pretty charts. Compressing it makes total sense.

Also, GSM works on a network level too. You may wish to host it on a beefy computer and use something like Moonlight to play the game on your phone, and then look up stats on your phone too.
  • Cached the CSS, since it changes very infrequently. Since GSM is not a server, users have to manually click "update" to update the app. At most that happens once every 3 days, so we use a 3-day cache here.
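
Here's a minimal sketch of the compression and caching changes, assuming Flask-Compress's standard config keys and a blanket Cache-Control header for CSS responses:

from flask import Flask
from flask_compress import Compress

app = Flask(__name__)
app.config["COMPRESS_ALGORITHM"] = ["br", "gzip"]  # prefer Brotli, fall back to gzip
Compress(app)  # transparently compresses responses, including the big /api/stats JSON

@app.after_request
def cache_css(response):
    # CSS only changes when the user manually updates GSM (~every 3 days at most),
    # so a 3-day cache (259200 seconds) is safe
    if response.mimetype == "text/css":
        response.headers["Cache-Control"] = "public, max-age=259200"
    return response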

This led to our lighthouse score becoming 89, with the speed going from 3.5 seconds to 1.4 seconds.

[Screenshots: Lighthouse scores before and after]

Very speedy!

Conclusion

We successfully doubled the loading speed of the statistics site, but more importantly, here are some key takeaways.

  • The uses of data are so varied that one person's "meh" data is another person's core product. We need to look at our data flow and our application to decide the best approach; for example, not rolling up today's stats so the AFK handling stays correct.
  • There are a lot of arguments about Brotli vs zstd. For a local open source program, either works.
  • Lighthouse has become a lot more useful since I last used it back in 2018.

JJ Method for Japanese

2025-10-15 14:11:33


TL;DR

Grade in Anki on readings not meanings

For N3+ learners who have already built up a good foundation in the language: Anki/Mining

  • Anki is a tool to learn readings, not meanings.
    • Anki is fantastic at drilling binary (yes/no, true/false) situations where a card has one to a few correct answers and the rest are incorrect. Meanings/usage, however, are much more nuanced (not binary) and should be learned through immersion (addressed more below)
    • use a pass/fail extension to speed up your reps (pass/fail based only on whether you got the reading correct, not the meaning. Exceptions include: rare words you like and/or if you mine <5 cards and do <5 mins of Anki/day.)
    • add images to the back of the card to boost retention; may be time-consuming, so optional but recommended
  • Anki doesn’t make you better at the language, it makes you better at learning the language.
  • Vocab cards → sentence cards
    • Removing all potential hints from the front of the card that your brain might use to “shortcut/cheat” its way to the reading is ideal. Vocab cards ONLY have the target word on the front so as you rep anki cards, your brain is forced to recall the answer purely based on the word. With sentence cards, your brain can/probably will subconsciously start to use the length of the sentence, the position of the target word within the sentence, the beginning few words of the sentence, and other pieces of unique info to that card to help you get to the answer.
  • Learn readings/kanji through words not individual kanji study/rtk
    • Since this is aimed at N3+ it might be too late for this warning, but while RTK/individual kanji study seems like a shortcut to learning how to read, it proves time and time again to be a waste of time, or at best less efficient than studying through vocabulary.
    • individual kanji study will not help you learn which readings are the most common for each kanji, nor will it help you know when to read a kanji like 木 as も / き, and at higher levels, words such as 大海 (たいかい) vs 大地 will stump you (this isn't a huge deal and you can learn Japanese just fine through immersion, but it's something to keep in mind if you intend to read more difficult things)
    • If you encounter a word in immersion that you can't read and it's not a name, then mine it if there's a new kanji or a new reading of a kanji you know. If you had trouble reading the word but you knew the kanji, then you can choose whether or not to mine it
  • By using anki to learn readings, you’re able to replace the 50+% of your cards that you would’ve learned through immersion anyways (wasted time) with cards that actually build reading skill and make immersion a smoother experience (smoother gains)
  • Overall philosophy on actually acquiring the language:
    • the more you can rely on immersion as the main source of learning for any aspect of Japanese (besides kanji readings), the better. Immersion is the only requirement in language learning; even Anki is optional (but highly recommended).
    • Enjoyment is the best tool you have for learning as fast as possible, and a little difficulty is optimal. Aim for 85-90% enjoyment and 10% challenge (adjust based on your own tolerances)
    • Acquisition of words (fully learning nuanced meaning) happens through immersion and seeing said word 100s of times in different contexts.
      • If you over-rely on Anki to learn words (like trying to make your Anki database your entire vocabulary) you can run into traps like not being able to learn words without Anki, drilling the same unnatural/rare use case of a word, etc.
      • If you encounter a word you can read but don't immediately know the meaning of in your immersion, make sure to give yourself a second or two to guess what the meaning might be based on the context, then look it up. This second or two of struggle facilitates learning, and you should be wary of falling into the habit of immediately shift-hovering over words you don't know before allowing that struggle to happen.
      • If you can’t read the word either, the same “struggle a bit before you look up” philosophy applies but try guessing the reading as well, and make sure to mine the card afterwards based on the criteria mentioned earlier.
    • If you do enough immersion, what you do in Anki doesn't really matter, but it still means that you're potentially wasting many hours every month/year and possibly even creating friction in your learning, since you might over-rely on Anki to actually learn the language rather than just using it for your reading abilities
  • JLPT, quizzes, and quiz grinding: If you are aiming to work or study abroad in Japan, then do what you need to do to pass JLPT and keep up your immersion. If your goal is to acquire the language, then there are a few important points to keep in mind:
    • JLPT tests, rank quizzes, and other things like it should be checkpoints in your language learning journey, not goals.
      • Try them once in a while if you’d like to get a feel for your improvement. If you find them fun, then feel free to do them a little more often, but don’t treat it as immersion or a source of learning.
      • Anyone can grind practice material for tests or quiz grind to get discord roles, but achieving these milestones through immersion will result in a far more balanced understanding of the language.
    • Aiming mainly for a specific test or rank can lead to an unhealthy relationship with the language and the language learning community. As mentioned earlier, learning Japanese should be fun, and the reward/fulfillment should come from enjoying content in Japanese.
    • Tests/ranks exist to compare you against other people, which isn’t necessarily a bad thing unless you start to crave the test/rank more than the knowledge and the experience someone else has. Ranks don’t mean anything other than the fact that we’re all on the same language learning journey and some people happen to be further up the trail for one reason or another. The only person you should compare yourself to is yourself from the past.
    • Quiz grinding is a very common thing in Japanese learning communities (less so in town thanks to how the quiz system is designed/enforced) and generally people who grind don't end up achieving the credibility and status they hoped for, because they lack knowledge outside of the quizzes they grinded (e.g. dijter, who grinded to a rank natives would struggle to achieve, and ended up failing JLPT N1/R1).
  • Outputting: outputting is a topic that I can’t have a fully formed opinion on because of my hafu upbringing, so take the following with more salt than the other sections, but I still believe what I’m saying here:
    • Input will always help grow your outputting abilities, even if you’re not explicitly practicing output (with diminishing returns).
      • If you can’t even realize when what you’re saying is gibberish then you’re not at a high enough comprehension level to progress in output.
      • Having a high comprehension level means improving your output will be much easier and more efficient.
      • There is a point where you start to become better at output through adding in output practice along with your immersion, but I’m not really sure when that is.
    • A popular strategy for developing output is to not do any output practice (silent period) for ~1 year and then start to incorporate it as you feel necessary. I think for most people that don’t have irl Japanese friends this is a good strategy, but many people will find that early output works well for them so experiment.
    • The main reason outputting is effective at improving output is because it's the most concentrated and intensive form of input. Unlike immersion, there are real social consequences if you misunderstand what the other person is saying, so you're engaged in a very different way compared to reading or watching something. Additionally, you have the opportunity to train recall while you're outputting, and receive real-time feedback.
  • COMMON IMMERSION LEARNING PITFALLS: immersion learners often reach a point where their comprehension is very good and their output is relatively bad. As mentioned earlier, this is not a bad place to be, but unlike immersion, improving output requires experiencing failure in front of a native and possibly other learners. Humans have an innate fear of failure, and combined with the disconnect they can feel between their speaking and comprehension levels, it's easy to be scared and not touch output at all.
  • Once again, have fun. If outputting doesn't interest you, you don't need it; you can always re-visit it later

Designing Bit.ly

2025-10-03 08:07:01


Problem: Design a Tiny URL / Bit.ly competitor

Given a long URL, create a short URL.

Given a short URL, redirect to a long URL.

Requirements

  • Very high read volume; likely 80% of all reads are for the same 20% of links
  • Not so much write volume
  • Needs low latency
  • High availability

Flow

  • When a user goes to bit.ly/ajdjasjd they should be redirected to the original URL
  • A user should be able to create URLs

Let's start with creating a URL

Making shorter URLs

What we need to do:

  1. Given a URL
  2. Encode the URL so it's shorter
  3. Store in a key:value database of short URL -> Original URL
  4. Return short URL

Database

We should probably use a key-value store for this, like DynamoDB, because NoSQL is very fast to read and our data structure is obvious.

Since we do not want to store stats, we can just store the 2 values.

Our database will look like:

CREATE TABLE urls (
    original_url TEXT NOT NULL,
    short_url    TEXT NOT NULL
);

Then we can index on short_url, as users will be looking it up the most.

TODO: can nosql be indexed...?

We should create replicas of the database across multiple regions to allow better read performance later on.

We can use a master-slave config, as this is easier to implement.

The user writes to a single DB (the master), and that master writes to the other replicas (or the replicas pull from it. TODO: which one is it?)

Because the DB is read heavy and not write heavy, we do not have to worry so much about writes failing.

Encoding scheme

For our encoding scheme, let's hash the URL.

This is because:

  • Hashing means users cannot work out the original URL without clicking it
  • Hashing is very fast
  • Hashing will restrict the URL to a specific length; if we base58-encoded the URL itself, the result could grow longer for very long URLs

TODO: should we use base58 or hashing?

We can then return the short URL to the user.
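
Here's a sketch of the hash-then-truncate idea, assuming SHA-256 and a 7-character base62 slug (collision handling is omitted, and we'd need it in practice):

import hashlib
import string

ALPHABET = string.digits + string.ascii_letters  # 62 characters

def shorten(url: str, length: int = 7) -> str:
    """Hash the URL, then keep the first few base62 digits of the digest."""
    n = int.from_bytes(hashlib.sha256(url.encode()).digest(), "big")
    chars = []
    while n and len(chars) < length:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(chars)

short = shorten("https://example.com/some/very/long/path")  # deterministic 7-char slug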

Reading a URL

Now we go onto the other part, taking the shortened URL and turning it back into the original URL.

HTTP response

We should always use a 301 HTTP response (permanent redirect). This is because many internet providers, DNS services, CDNs, etc. will see this and cache it on their side, potentially making it faster for the user.
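
In Flask terms the redirect handler is tiny, and the 301 status code is the important part (the resolve lookup is a hypothetical helper, sketched in the cache section below):

from flask import Flask, redirect

app = Flask(__name__)

@app.get("/<slug>")
def follow(slug: str):
    original = resolve(slug)  # hypothetical cache/DB lookup, sketched below
    if original is None:
        return "Not found", 404
    # 301 = moved permanently, so browsers, CDNs, etc. may cache the mapping
    return redirect(original, code=301)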

Cache

When the user clicks the URL, the first thing we should do is look for the result in the cache.

TODO: how do we make sure cache is local to user physically? Do we need the load balancer first?

Maybe we should put the load balancer first.

  1. So the user clicks the URL
  2. it goes to our load balancer
  3. Our load balancer randomly selects a cache
  4. If it's in the cache, return it
  5. Otherwise the cache fetches it from the DB and returns that

Our caching solution could be as simple as Redis, which is a very fast in-memory key-value store.

TODO: how do we make sure load balancer is local to the user? Do we want them to connect to LB from america?

With Redis you can also persist the cache to disk, to avoid rebuilding it each time.
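
A cache-aside lookup might look like this sketch, assuming redis-py and a hypothetical fetch_from_db helper:

import redis

r = redis.Redis(host="localhost", port=6379)

def fetch_from_db(short_url: str) -> str | None:
    ...  # hypothetical DB lookup: SELECT original_url FROM urls WHERE short_url = ?

def resolve(short_url: str) -> str | None:
    cached = r.get(short_url)
    if cached:
        return cached.decode()  # cache hit, no DB round trip
    original = fetch_from_db(short_url)
    if original:
        r.set(short_url, original, ex=86400)  # populate cache with a 1-day TTL (assumed)
    return original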

YouTube

REST API


How long is the URL?

How many URLs per second? 1000 per second = 31.5 billion URLs a year

10-to-1 reads to writes = ~315 billion reads per year

How many characters can we use? Base62

62 chars

If we want to have 7 characters in our URL, that's 62^7, which is about 3.5 trillion possible URL combos

TODO: why dont we just hash the original URL?

Using ZooKeeper as a counter service, so our URL will just be an encoded counter.

ZooKeeper is good for storing this metadata.
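
For the counter approach, base62-encoding is only a few lines; here's a sketch (the counter-range allocation from ZooKeeper is assumed and not shown):

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    """Encode a non-negative counter as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

print(encode_base62(62**7 - 1))  # 'ZZZZZZZ', the largest 7-character slug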

Statistics for reading Japanese

2025-09-09 15:01:22


Hi friends!

I just released my first ever web-dev project!

There's a tool called Game Sentence Miner which uses OCR to grab text from your screen, and it lets you look that text up in a dictionary.

GitHub - bpwhelan/GameSentenceMiner: An All-in-One immersion toolkit for learning Languages through games and other visual media.

It writes this to a database file:

  • The line of text
  • Where you read it
  • At what time you read it

I thought....

"What statistics can I learn from this data?"

So I went out to make my first ever web-dev project, a statistics page!

PLEASE remember I am not a web dev! I do SecDevOps! Pls dont judge me too hard for my terrible design 😂 💛s


Notably this has a few cool features:

  • Daily overview of your reading
  • An overview of the current game you're playing
  • An overview of all games you've played

Because I have both the line of text and when it came in, I can guess a lot of things about your reading.

If you read a bit of text and then go away for 2 minutes I assume you are AFK, so if you press "next line of text" within 2 minutes I assume you are actively reading.

This lets me calculate a bunch of stuff.


I made a GitHub-style heatmap of your reading, along with streaks and average time spent reading per day.

I have a bunch more stats like:

  • How fast do you read, over time?
  • How many hours per day do you spend reading?
  • How many characters of text do you read over time?

As well as a Kanji heatmap:

[Image: Kanji heatmap]

Every time you read a sentence that contains kanji, each of those kanji gets +1. The more times you read a kanji, the closer to cyan it gets.

If you read a kanji 500 times, I assume you know it really well so it becomes cyan.
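
Counting kanji like this is straightforward. Here's a sketch, using the common CJK Unified Ideographs range check and the 500-read threshold mentioned above:

from collections import Counter

def is_kanji(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"  # CJK Unified Ideographs block

kanji_counts: Counter[str] = Counter()

def record_line(sentence: str) -> None:
    # every kanji occurrence in the line gets +1
    kanji_counts.update(ch for ch in sentence if is_kanji(ch))

record_line("「もう1回、同じように……」")
well_known = {k for k, n in kanji_counts.items() if n >= 500}  # these render cyan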

I also added a screenshot feature using html2canvas so you can take nice screenshots like the one I took above :)

Anki Integration

I integrated with Anki too


So if you see a kanji a lot while reading, but that kanji is not in any of your Anki cards, you can see that here!

And if you wonder "hmmm. When have I ever read 松?"

You can click on the kanji and go to a live search of every sentence you have ever read:

[Screenshot: live search of every sentence you have ever read]

So you can see exactly where you have read this kanji before.

Data Cleanup


You can also deduplicate text: if the same sentence appears within 5 minutes of itself in the same game, I assume it's a duplicate and you can delete it.
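
Here's a sketch of that dedup rule (5-minute window, same game; the field names are assumptions):

DEDUP_WINDOW_SECONDS = 300  # 5 minutes

def find_duplicates(lines: list[dict]) -> list[dict]:
    """Flag lines whose exact text repeats within 5 minutes in the same game."""
    last_seen: dict[tuple[str, str], float] = {}  # (game, text) -> newest timestamp
    duplicates = []
    for line in sorted(lines, key=lambda l: l["timestamp"]):
        key = (line["game"], line["text"])
        prev = last_seen.get(key)
        if prev is not None and line["timestamp"] - prev <= DEDUP_WINDOW_SECONDS:
            duplicates.append(line)  # candidate for deletion
        else:
            last_seen[key] = line["timestamp"]
    return duplicates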

You can also clean up text using Regex:

[Screenshot: regex-based text cleanup]

Or delete entire games you don't care about.

Web Dev Stuff

This is made using Flask, Jinja2 for templates, HTML, CSS and JS.

I do use Chart.js:

Chart.js
Simple yet flexible JavaScript charting library for the modern web

but tbh it's so simple it's hardly an advanced framework like Svelte or whatever.

The most advanced thing I use is Flexbox, but even that's not perfect.

If you look here:


The Kanji in these boxes are aligned horizontally but not vertically.

This is because I couldn't get Flexbox to align them vertically here.

I Googled it and tbh every answer was a wall of archaic wizard poetry, and the answers that were not were something like "bro flexbox is so out of date bro use css grid if your capacitor is within the sync of the flux device, otherwise you need to dynamically load the static content via the HTML 5.0 spec bro"

like i dont care that much someone else can fix this its open source

20 Popular System Design Interview Questions

2025-09-01 09:59:31


1. Distributed Metrics Logging & Aggregation System
Companies: Google, Facebook, Amazon, eBay, Datadog, Atlassian

2. Distributed Stream Processing System (like Kafka)
Companies: Amazon, Microsoft, Wise, Confluent

3. Key-Value Store
Companies: Apple, Google, Canva, Avalara, Rubrik, OpenDoor

4. K Most Shared Articles in Time Windows (24h, 1h, 5m)
Companies: LinkedIn, Facebook, Twitter

5. API Rate Limiter
Companies: Amazon, Atlassian, Uber, Patreon, Microsoft, Stripe, Headway, Reputation.com, Pinterest

6. Performance Metrics Collection from Thousands of Servers
Companies: Google, Datadog, Amazon, eBay, LinkedIn

7. Google Calendar (or similar scheduling system)
Companies: Google, LinkedIn

8. Distributed Queue (like RabbitMQ)
Companies: Amazon, Apple, Instacart

9. Google Analytics – User Dashboard & Data Pipeline
Companies: Microsoft, Facebook, Qualtrics, Google

10. Sorting Large Data Sets at Scale
Companies: Google, Microsoft

11. Top-K Elements (App Store Rankings, Bestsellers, etc.)
Companies: Amazon, Bloomberg, Facebook, Pinterest

12. Dropbox / Google Drive (Cloud Storage & Sync)
Companies: Dropbox, Facebook, Google, Amazon, Microsoft, Oracle Cloud (OCI)

13. Job Scheduler
Companies: Google, Amazon, Microsoft, DoorDash, Netflix, Atlassian

14. Notification Service at Scale
Companies: Google, Pinterest, OCI, StubHub, Amazon, Airbnb, Instacart

15. Surge Pricing System (e.g., Uber, Lyft)
Companies: Uber, Lyft

16. Netflix – Limit Concurrent Screens per User
Companies: FAANG

17. ETA & Location Sharing Between Driver & Rider
Companies: Uber, FAANG

18. Hotel Booking System (Availability, Reservation, Booking)
Companies: Amazon, Square, Booking.com

19. A/B Testing Platform (like Optimizely)
Companies: Affirm, FAANG

20. Price Alert System (Amazon products, stock/crypto prices)
Companies: Facebook, Bloomberg, Coinbase, Swyftx, Trade Republic