
The HackerNoon Newsletter: Could the Soaring FTSE 100 Mean It's Time to Open a SIPP? (1/1/2026)

2026-01-02 00:02:07

How are you, hacker?


🪐 What’s happening in tech today, January 1, 2026?


The HackerNoon Newsletter brings the HackerNoon homepage straight to your inbox. On this day, Ireland joined with Great Britain to form the United Kingdom in 1801, the euro became the official currency of 12 European countries in 2002, the first transcontinental phone call was made in 1915, and we present you with these top-quality stories. From We Asked 14 Tech Bloggers Why They Write. Here's What They Said to The 10 Most Interesting C# Bugs We Found in Open Source in 2025, let's dive right in.

We Asked 14 Tech Bloggers Why They Write. Here's What They Said


By @scynthiadunlop [ 14 Min read ] 14 expert tech bloggers share why they started writing and why they continue. Read More.

The 10 Most Interesting C# Bugs We Found in Open Source in 2025


By @akiradoko [ 15 Min read ] If you'd like to check whether your project has similar issues, now's the time to use a static analyzer. Read More.

The Vibe Coding Hangover: What Happens When AI Writes 95% of your code?


By @tigranbs [ 8 Min read ] Y Combinator reports that 25% of its W25 projects have codebases that are 95% AI-generated. Read More.

Could the Soaring FTSE 100 Mean It's Time to Open a SIPP?


By @dmytrospilka [ 4 Min read ] Could it be time for pension savers to embrace the boom by making the switch to a SIPP? Read More.


🧑‍💻 What happened in your world this week?

It's been said that writing can help consolidate technical knowledge, establish credibility, and contribute to emerging community standards. Feeling stuck? We've got you covered ⬇️⬇️⬇️


ANSWER THESE GREATEST INTERVIEW QUESTIONS OF ALL TIME


We hope you enjoy this week's worth of free reading material. Feel free to forward this email to a nerdy friend who'll love you for it. See you on Planet Internet! With love, The HackerNoon Team ✌️


How to Become Really Good at Prompt Engineering

2026-01-01 15:50:54

Last month, I spent 3 hours trying to write a decent cold email template.

Three. Whole. Hours.

The AI kept spitting out generic garbage that sounded like every other “Hey [FIRST_NAME], hope this email finds you well”…

Then I changed one thing in my prompt.

One thing.

Suddenly, the AI was writing emails that actually sounded human, referenced specific connection points, and had personality.

My reply rates jumped tremendously!

That moment?

That's when prompt engineering stopped feeling like a skill and started feeling almost like cheating.

Wait, This Actually Works?

Here's the thing about prompt engineering that seems obvious: it's about getting really, really good at asking for exactly what you want.

Most of us suck at it, because it's not that easy.

It clicked when I started to build this site using Cursor.

My first attempts were disasters:

My terrible prompt:

"Create my homepage and style with stunning and aesthetic visuals"

The result:

Generic, ugly, messy code nobody would ever be able to customize. 🤮

My improved prompt:

"You're a senior web designer developer with deep knowledge in UI/UX. You are building my personal blog with me, a good fellow unfamiliar with our codebase (Astro Framework). Based in Astro conventions and best practices, create practical assets, components like UI and sections based in astro files. The final result should be a template that experienced developers could use and customize easily..." 

The result:

Actually useful and clean Astro files, at least better and more organized than before. (CSS files are still meh, though) 😅

The difference? I stopped asking the AI to write generic code and started asking it to be an experienced developer helping a colleague build his humble project.

The Five Rules That Changed Everything

1. Stop Being Polite to Robots

I used to write prompts like I was asking a favor: “Could you please maybe help me write a blog post about SEO?”.

Now I’m direct: “Write a 1,200-word blog post for marketing developers who want to understand technical SEO. Include code examples and explain why site speed actually matters for conversion rates, not just rankings.”

The AI doesn’t have feelings. It has algorithms. Feed those algorithms exactly what they need.

2. Context is Everything

Bad prompt:

"Write a LinkedIn post about growth marketing."

Good prompt:

"I'm a Marketing Engineer at a YC startup. Write a LinkedIn post sharing one specific growth hack I discovered while scaling our user base from 1K to 10K. Make it tactical, not theoretical. My audience is other growth marketers and technical founders."

The second prompt works because the AI knows:

  • Who I am (Marketing Engineer, YC startup)
  • What I want (specific growth hack)
  • The context (1K to 10K scale)
  • My audience (growth marketers, technical founders)
  • The tone (tactical, not theoretical)

3. Show, Don’t Just Tell

Instead of saying “write in a conversational tone,” I give examples:

"Write like this: Here's the thing nobody talks about with A/B testing: most marketers get so excited about statistical significance that they forget to check if the difference actually matters. I've seen teams celebrate a 2% lift on a metric that generates $50/month. Congrats, you just spent three weeks optimizing for an extra dollar a month"

The AI learns from the example and matches that specific style.

4. Constrain to Liberate

Counterintuitive but true: the more constraints you give, the more creative the output.

Instead of:

"Help me with marketing automation."

Try:

"I need a 7-email drip sequence for SaaS trial users who haven't logged in after day 3. Each email should be under 100 words, focus on one specific feature, include a clear and valuable CTA, sounding like it's coming from a helpful teammate, not a sales robot."

Constraints force creativity within boundaries.

5. Iterate Like Your Conversion Rate Depends on It

My best prompts are never first drafts. I treat prompt engineering like optimizing ad copy (test, measure, refine, repeat).

First attempt usually gets me 60% of what I want. Then I say:

  • “Make it more specific”
  • “Add a real example from the industry”
  • “Remove the corporate buzzwords”
  • “Include the technical details a developer would care about”

Each iteration gets closer to perfect.

The “Cheating” Factor

Here’s why prompt engineering feels like cheating: I’m getting expert-level outputs on topics I’m still learning about.

I needed to ship a free Astro template. Instead of spending hours reading documentation, I just:

  • Watched some videos to learn best conventions (playlist)
  • Downloaded transcripts of some important lessons
  • Fed the AI the Astro docs + useful tutorials
  • Was really specific that I wanted clean code and a replicable template

The Marketing Engineer Advantage

Here’s what I’ve learned being caught between marketing and engineering teams: both sides are already using AI, but they’re using it differently.

Marketers use AI for content: social posts, email copy, blog outlines.

Engineers use AI for code: debugging, documentation, optimization.

As a Marketing Engineer, I am trying to use AI to translate between worlds:

  • Converting technical features into benefit-driven copy
  • Turning marketing requirements into technical specifications
  • Building automation workflows to help both teams

The prompt engineering skills transfer directly. Whether I'm asking AI to debug a Python script or write an email sequence, it's the same core skill: being incredibly specific about what I want.

The Real Superpower

Prompt engineering isn’t actually about AI. It’s about getting incredibly good at articulating exactly what you want.

That's why I believe that to get better we need to always be learning, reading, and discovering something, and writing our thoughts down somewhere.

This is exactly how I built this blog, by applying prompt engineering to create content that ranks well and helps readers.

And the specificity skill will transfer everywhere:

  • Better briefs for your design team
  • Clearer requirements for developers
  • More effective communication with stakeholders
  • Stronger positioning for your products

So yes, prompting well feels like cheating.

It's just the latest skill that does.

What’s your best prompting win?


Want to see prompt engineering in action? Check out how I used these techniques to build this blog with perfect SEO scores and create content that ranks.

How to Structure Dagger Components So Your Build System Doesn’t Hate You

2026-01-01 15:23:20

This document was adapted from Dagger Directives for a monorepo that uses Bazel Build, and is provided for ease of use in other organizations.

Contents

This document is extensive, and while each directive is simple, the broader architecture they promote may be unclear; therefore, an end-to-end example is provided to aid comprehension, and the underlying architectural rationale is included to link the individual directives to broader engineering principles. The sections are presented for ease of reference, with directives first; however, readers are encouraged to begin with whichever section they find most helpful.

Terminology

The following definitions apply throughout this document:

  • Component: An interface annotated with @Component
  • Module: A class or interface annotated with @Module
  • Scope: A custom annotation meta-annotated with @Scope

Component Structure and Scoping

Directives for how components are defined, scoped, and related to one another.

Practice: Library-Provided Components

Libraries and generic utilities should provide components that expose their functionality and declare their component dependencies instead of only providing raw classes/interfaces.

Positive Example: A Networking library provides a NetworkingComponent that exposes an OkHttpClient binding and depends on a CoroutinesComponent.

Negative Example: A Networking library that provides various interfaces and classes, but no component, and requires downstream consumers to define modules and components to wire them together.

This approach transforms Dagger components from details of the downstream application into details of upstream libraries. Instead of forcing consumers to understand a library's internal structure (and figure out how to instantiate objects), library authors provide complete, ready-to-use components that can be composed together and used to instantiate objects. It is analogous to plugging in a finished appliance instead of assembling a kit of parts: consumers just declare a dependency on the component (e.g. a fridge), supply the upstream components (e.g. electricity), and get the fully configured objects they need without ever seeing the wiring (e.g. cold drinks). This approach scales well, at the cost of more boilerplate.

Practice: Narrow Scoped Components

Components should export a minimal set of bindings, accept only the dependencies they require to operate (i.e. with @BindsInstance), and depend only on the components they require to operate.

Positive Example: A Feature component that depends only on Network and Database components, exposes only its public API (e.g. FeatureUi), and keeps its internal bindings hidden.

Negative Example: A Feature component that depends on a monolithic App component (which itself goes against the practice), exposes various bindings that could exist in isolation (e.g. FeatureUi, Clock, NetworkPorts and RpcBridge, IntegerUtil), and exposes its internal bindings.

This allows consumers to compose functionality with granular precision, reduces unnecessary configuration (i.e. passing instances/dependencies that are not used at runtime), and optimizes build times. This approach is consistent with the core tenets of the Interface Segregation Principle in that it ensures that downstream components can depend on the components they need, without being forced to depend on unnecessary components.

Practice: Naked Component Interfaces

Components should be defined as plain interfaces ("naked interfaces") without Dagger annotations, and then extended by annotated interfaces for production, testing, and other purposes. Downstream components should target the naked interfaces in their component dependencies instead of the annotated interfaces.

Example:

// Definition
interface FooComponent {
  fun foo(): Foo
}

// Production Implementation
@Component(modules = [FooModule::class]) interface ProdFooComponent : FooComponent

// Testing Implementation
@Component(modules = [FakeFooModule::class]) interface TestFooComponent : FooComponent {
  fun fakeFoo(): FakeFoo
}

@Component(dependencies = [FooComponent::class])
interface BarComponent {
  @Component.Builder
  interface Builder {
    fun consuming(fooComponent: FooComponent): Builder
    fun build(): BarComponent
  }
}

This ensures Dagger code follows general engineering principles (separation of interface and implementation). While Dagger components are interfaces, the presence of a `@Component` annotation implicitly creates an associated implementation (the generated Dagger code); therefore, depending on an annotated component forces a dependency on its implementation (at the build system level), and implicitly forces test code to depend on production code. By separating them, consumers can depend on a pure interface without needing to include the Dagger implementation in their class path, thereby preventing leaky abstractions, optimising build times, and directly separating production and test code into discrete branches.

Standard: Custom Scope Required

Components must be bound to a custom Dagger scope.

Example:

@FooScope
@Component
interface ProdFooComponent : FooComponent {
  fun foo(): Foo
}

Unscoped bindings can lead to subtle bugs where expensive objects are recreated or shared state is lost. Explicit lifecycle management ensures objects are retained only as long as needed, thereby preventing these issues.

Standard: Module Inclusion Restrictions

Components must only include modules defined within their own package or its subpackages; however, they must never include modules from a subpackage if another component is defined in an intervening package.

Example:

Given the following package structure:

src
├── a
│   ├── AComponent
│   ├── AModule
│   ├── sub1
│   │   └── Sub1Module
│   └── sub2
│       ├── Sub2Component
│       └── sub3
│           └── Sub3Module
└── b 
    └── BModule

AComponent may include AModule (same package) and Sub1Module (subpackage with no intervening component), but not Sub3Module (intervening Sub2Component in a.sub2) or BModule (not a subpackage of a).

This enforces strict architectural layering and prevents dependency cycles (spaghetti code), thereby ensuring proper component boundaries and maintainability.

Practice: Dependencies Over Subcomponents

Component dependencies should be used instead of subcomponents.

Example: Foo depends on Bar via @Component(dependencies = [Bar::class]) rather than using @Subcomponent.

While subcomponents are a standard feature of Dagger, prohibiting them favors a flat composition-based component graph, thereby reducing cognitive load, allowing components to be tested in isolation, and creating a more scalable architecture.

Practice: Cross-Package Dependencies

Components may depend on components from any package.

Example: Foo in a.b can depend on Bar in x.y.z.

Allowing components to depend on each other regardless of location promotes reuse, thereby fostering high cohesion within packages.

Standard: Component Suffix

Components must include the suffix Component in their name.

Positive example: ConcurrencyComponent

Negative example: Concurrency

This clearly distinguishes the component interface from the functionality it provides and prevents naming collisions.

Standard: Scope Naming Convention

The name of the custom scope associated with a component must inherit the name of the component (minus "Component") with "Scope" appended.

Example: FooComponent is associated with FooScope.

Consistent naming allows contributors to immediately associate a scope with its component, thereby preventing conflicts and reducing split-attention effects.

Standard: Builder Naming

Component builders must be called `Builder`.

Example:

@Component
interface FooComponent {
  @Component.Builder
  interface Builder {
    @BindsInstance fun binding(bar: Bar): Builder
    fun build(): FooComponent
  }
}

Standardizing builder names allows engineers to predict the API surface of any component, thereby reducing the mental overhead when switching between components.

Standard: Binding Function Naming

Component builder functions that bind instances must be called `binding`; however, when bindings use qualifiers, the qualifier must be appended.

Example:

@Component
interface ConcurrencyComponent {
  @Component.Builder
  interface Builder {
    // Unqualified
    @BindsInstance fun binding(bar: Bar): Builder

    // Qualified
    @BindsInstance fun bindingIo(@Io scope: CoroutineScope): Builder
    @BindsInstance fun bindingMain(@Main scope: CoroutineScope): Builder  

    fun build(): ConcurrencyComponent
  }
}

Explicit naming immediately clarifies the mechanism of injection (instance binding vs component dependency), thereby preventing collisions when binding multiple instances of the same type.

Standard: Dependency Function Naming

Component builder functions that set component dependencies must be called `consuming`.

Example:

@Component(dependencies = [Bar::class])
interface FooComponent {
  @Component.Builder interface Builder {
    fun consuming(bar: Bar): Builder
    fun build(): FooComponent
  }
}

Distinct naming clearly separates structural dependencies (consuming) from runtime data (binding), thereby making the component's initialization logic self-documenting.

Standard: Provision Function Naming

Component provision functions must be named after the type they provide (in camelCase). However, when bindings use qualifiers, the qualifier must be appended to the function name.

Example:

@Component
interface FooComponent {
  // Unqualified
  fun bar(): Bar

  // Qualified
  @Io fun genericIo(): Generic
  @Main fun genericMain(): Generic
}

This ensures consistency and predictability in the component's public API.

Factory Functions

Requirements for the factory functions that instantiate components for ease of use.

Standard: Factory Function Required

Components must have an associated factory function that instantiates the component.

Example:

@Component(dependencies = [Quux::class])
interface FooComponent { /* ... */ }

fun fooComponent(quux: Quux = DaggerQuux.create(), qux: Qux): FooComponent = DaggerFooComponent.builder() 
    .consuming(quux)
    .binding(qux)
    .build()

This integrates cleanly with Kotlin, thereby significantly reducing the amount of manual typing required to instantiate components.

Exception: Components that are file private may exclude the factory function (e.g. components defined in tests for consumption in the test only).

Standard: Default Component Dependencies

Factory functions must supply default arguments for parameters that represent component dependencies.

Example: fun fooComponent(quux: Quux = DaggerQuux.create(), ...)

Providing defaults for dependencies allows consumers to focus on the parameters that actually vary, thereby improving developer experience and reducing boilerplate.

Practice: Production Defaults

The default arguments for component dependency parameters in factory functions should be production components, even when the component being assembled is a test component.

Example: fun testFooComponent(quux: Quux = DaggerQuux.create(), ...)

This ensures tests exercise real production components and behaviours as much as possible, thereby reducing the risk of configuration drift between test and production environments.

Practice: Factory Function Location

Factory functions should be defined as top-level functions in the same file as the component.

Example: fooComponent() function in same file as FooComponent interface.

Co-locating the factory with the component improves discoverability.

Practice: Factory Function Naming

Factory function names should match the component, but in lower camel case.

Example: FooComponent component has fun fooComponent(...) factory function.

This ensures factory functions can be matched to components easily.

Practice: Default Non-Component Parameters

Factory functions should supply default arguments for parameters that do not represent component dependencies (where possible).

Example: fun fooComponent(config: Config = Config.DEFAULT, ...)

Sensible defaults allow consumers to only specify non-standard configuration when necessary, thereby reducing cognitive load.

Modules and Build Targets

Directives regarding Dagger modules and their placement in build targets.

Standard: Separate Module Targets

Modules must be defined in separate build targets to the objects they provide/bind.

Example: BarModule in separate build target from Baz implementation.

Separating implementation from interface/binding prevents changing an implementation from invalidating the cache of every consumer of the interface, thereby improving build performance. Additionally, it ensures consumers can depend on individual elements independently (crucial for Hilt) and allows granular binding overrides in tests.

Standard: Dependency Interfaces

Modules must depend on interfaces rather than implementations.

Example: BarModule depends on Baz interface, not BazImpl.

This enforces consistency with the dependency inversion principle, thereby decoupling the module and its bindings from concrete implementations.

Testing Patterns

Patterns for defining components used in testing to ensure testability.

Standard: Test Component Extension

Test components must extend production components.

Example: interface TestFooComponent : FooComponent

Tests should operate on the same interface as production code (Liskov Substitution), thereby ensuring that the test environment accurately reflects production behavior.

Practice: Additional Test Bindings

Test components should export additional bindings.

Example: TestFooComponent component extends FooComponent and additionally exposes fun testHelper(): TestHelper.

Exposing test-specific bindings allows tests to inspect internal state or inject test doubles without compromising the public production API, thereby facilitating white-box testing where appropriate.

Rationale

The directives in this document work together to promote an architectural pattern for Dagger that follows foundational engineering best practices and principles, which in turn supports sustainable development and improves the contributor experience. The core principles are:

  • Interface Segregation Principle (ISP): Downstream consumers should be able to depend on the minimal API required for their task without being forced to consume/configure irrelevant objects. This reduces cognitive overhead for both maintainers and consumers, and lowers computational costs at build time and runtime. It's supported by directives such as the "Narrow Scoped Components" practice, which calls for small granular components instead of large God Objects, and the "Dependencies Over Subcomponents" practice, which encourages composition over inheritance.
  • Dependency Inversion Principle: High-level elements should not depend on low-level elements; instead, both should depend on abstractions. This reduces the complexity and scope of changes by allowing components to evolve independently and preventing unnecessary recompilation (in a complex build system such as Bazel). It's supported by the "Naked Component Interfaces" directive, which requires the use of interfaces rather than implementations, and the "Module Inclusion Restrictions" standard, which enforces strict architectural layering.
  • Abstraction and Encapsulation: Complex subsystems should expose simple, predictable interfaces that hide their internal complexity and configuration requirements. This allows maintainers and consumers to use and integrate components without deep understanding of the implementation. It's supported by the "Factory Function Required" standard, which encourages simple entry points, and "Default Component Dependencies", which provides sensible defaults to eliminate "Builder Hell".
  • Liskov Substitution Principle (LSP): Objects of a superclass must be replaceable with objects of its subclasses without breaking the application. This ensures test doubles can be seamlessly swapped in during tests, thereby improving testability without requiring changes to production code, and ensuring as much production code is tested as possible. It's supported by the "Test Component Extension" standard, which mandates that test components inherit from production component interfaces.
  • Compile-Time Safety (Poka-Yoke): The system is designed to prevent misconfiguration errors (i.e. "error-proofing"). By explicitly declaring component dependencies in the component interface, Dagger enforces their presence at compile time, and fails if they are missing. This gives consumers a clear, immediate signal of exactly what is missing or misconfigured, rather than failing obscurely at runtime. It's supported by the "Library-Provided Components" practice, which mandates fully declared dependencies, and the "Factory Function Required" standard, which mechanically ensures all requirements are satisfied effectively.

Overall, this architecture encourages and supports granular, maintainable components that can be evolved independently and composed together into complex structures. Components serve as the public API for utilities, the integration system that ties elements together within utilities, and the composition system that combines utilities together. For upstream utility maintainers, this reduces boilerplate and reduces the risk of errors; for downstream utility consumers, this creates an unambiguous and self-documenting API that can be integrated without knowledge of implementation details; and for everyone, it distributes complexity across the codebase and promotes high cohesion (i.e. components defined nearest to the objects they expose). Altogether, this fosters sustainable development by reducing cognitive and computational load. The disadvantages of this approach and a strategy for mitigation are discussed in the [future work](#future-work) appendix.

End to End Example

The following example demonstrates a complete Dagger setup and usage that adheres to all the directives in this document. It features upstream (User) and downstream (Profile) components, separate modules for production and testing (including fake implementations), and strict separation of interface and implementation via naked component interfaces.

User Feature

Common elements:

/** Custom Scope */
@Scope @Retention(AnnotationRetention.RUNTIME)
annotation class UserScope

/** Domain Interface */
interface User

/** Naked Component */
interface UserComponent {
  fun user(): User
}

Production elements:

/** Real Implementation */
@UserScope class RealUser @Inject constructor() : User

/** Production Module */
@Module
interface UserModule {

  @Binds
  fun bind(impl: RealUser): User

  companion object {
    @Provides
    fun provideTimeout() = 5000L
  }
}

/** Production Component */
@UserScope
@Component(modules = [UserModule::class])
interface ProdUserComponent : UserComponent {
  @Component.Builder
  interface Builder {
    fun build(): ProdUserComponent
  }
}

/** Production Factory Function */
fun userComponent(): UserComponent = DaggerProdUserComponent.builder().build()

Test elements:

/** Fake Implementation */
@UserScope class FakeUser @Inject constructor() : User

/** Fake Module */
@Module
interface FakeUserModule {
  @Binds
  fun bind(impl: FakeUser): User
}

/** Test Component */
@UserScope
@Component(modules = [FakeUserModule::class])
interface TestUserComponent : UserComponent {
  fun fakeUser(): FakeUser

  @Component.Builder
  interface Builder {
    fun build(): TestUserComponent
  }
}

/** Test Factory Function */
fun testUserComponent(): TestUserComponent = DaggerTestUserComponent.builder().build()

Profile Feature

Common elements:

/** Custom Scope */
@Scope @Retention(AnnotationRetention.RUNTIME)
annotation class ProfileScope

/** Domain Interface */
interface Profile

/** Naked Component */
interface ProfileComponent {
  fun profile(): Profile
}

Production elements:

/** Real Implementation */
@ProfileScope class RealProfile @Inject constructor(
  val user: User,
  private val id: ProfileId
) : Profile {
  data class ProfileId(val id: String)
}

/** Production Module */
@Module
interface ProfileModule {
  @Binds
  fun bind(impl: RealProfile): Profile
}

/** Production Component */
@ProfileScope
@Component(dependencies = [UserComponent::class], modules = [ProfileModule::class])
interface ProdProfileComponent : ProfileComponent {
  @Component.Builder
  interface Builder {
    fun consuming(user: UserComponent): Builder
    @BindsInstance fun binding(id: ProfileId): Builder
    fun build(): ProdProfileComponent
  }
}

/** Production Factory Function */
fun profileComponent(
  user: UserComponent = userComponent(),
  id: ProfileId = ProfileId("prod-id")
): ProfileComponent = DaggerProdProfileComponent.builder().consuming(user).binding(id).build()

Test elements:

/** Test Component */
@ProfileScope
@Component(dependencies = [UserComponent::class], modules = [ProfileModule::class])
interface TestProfileComponent : ProfileComponent {
  @Component.Builder
  interface Builder {
    fun consuming(user: UserComponent): Builder
    @BindsInstance fun binding(id: ProfileId): Builder
    fun build(): TestProfileComponent
  }
}

/** Test Factory Function */
fun testProfileComponent(
  user: UserComponent = userComponent(),
  id: ProfileId = ProfileId("test-id")
): TestProfileComponent = DaggerTestProfileComponent.builder().consuming(user).binding(id).build()

Usage

Example of production component used in production application:

class Application {
  fun main() {
    // Automatically uses production implementations (RealUser, RealProfile)
    val profile = profileComponent().profile()
    // ...
  }
}

Example of production profile component used with test user component in a test:

@Test
fun testProfileWithFakeUser() {
  // 1. Setup: Create the upstream test component (provides FakeUser)
  val fakeUserComponent = testUserComponent()
  val fakeUser = fakeUserComponent.fakeUser()

  // 2. Act: Inject it into the downstream test component
  val prodProfileComponent = profileComponent(user = fakeUserComponent)
  val profile = prodProfileComponent.profile()

  // 3. Assert: Verify integration
  assertThat(profile.user).isEqualTo(fakeUser)
}

Future Work

The main disadvantage of the pattern this document encodes is the need for a final downstream assembly of components, which can become boilerplate-heavy in deep graphs. For example:

fun main() {
  // Level 1: Base component
  val core = coreComponent()

  // Level 2: Depends on Core
  val auth = authComponent(core = core)
  val data = dataComponent(core = core)

  // Level 3: Depends on Auth, Data, AND Core
  val feature = featureComponent(auth = auth, data = data, core = core)

  // Level 4: Depends on Feature, Auth, AND Core
  val app = appComponent(feature = feature, auth = auth, core = core)
}

A tool to reduce this boilerplate has been designed, and implementation is tracked by this issue.

Transformers, Finally Explained

2026-01-01 15:20:01

After spending months studying transformer architectures and building LLM applications, I realized something: most explanations are either overwhelming or missing important details. This article is my attempt to bridge that gap — explaining transformers the way I wish someone had explained them to me.

For an intro to what a large language model (LLM) is, refer to this article I published previously.

By the end of this lesson, you will be able to look at any LLM architecture diagram and understand what is happening.

This is not just academic knowledge — understanding the Transformer architecture will help you make better decisions about model selection, optimize your prompts, and debug issues when your LLM applications behave unexpectedly.

How to Read This Lesson: You don't need to absorb everything in one read. Skim first, revisit later—this lesson is designed to compound over time. The concepts build on each other, so come back as you need deeper understanding.

What You Will Learn

  • The complete Transformer architecture from input to output
  • How positional encodings let models understand word order
  • The difference between encoder-only, decoder-only, and encoder-decoder models
  • Why layer normalization and residual connections matter
  • How to read and interpret architecture diagrams
  • Practical implications for choosing the right model type

Don't worry if some of these terms sound unfamiliar—we'll explain each concept step by step, starting with the basics. By the end of this lesson, these technical terms will make perfect sense, even if you're new to machine learning architecture.


The Big Picture

Let's start with a simple analogy. Imagine you're reading a book and trying to understand a sentence:

"The animal didn't cross the street because it was too tired."

To understand this, your brain does several things:

  1. Recognizes the words - You know what "animal", "street", and "tired" mean
  2. Understands word order - "The animal was tired" means something different from "Tired was the animal"
  3. Connects related words - You figure out that "it" refers to "animal", not "street"
  4. Grasps the overall meaning - The animal's tiredness caused it to not cross

A Transformer does something remarkably similar, but using math. Let me give you a simple explanation of how it works:

What goes in: Text broken into pieces (called tokens)

What's a token? Think of tokens as the basic building blocks that language models understand:

  • Sometimes a token is a full word (like "cat" or "the")
  • Sometimes it's part of a word (like "under" and "stand" for "understand")
  • Even punctuation marks and spaces can be their own tokens
  • For example, "I love AI!" might be split into tokens: ["I", " love", " AI", "!"]

What happens inside: The model processes this text through several stages (we'll explore each in detail):

  1. Converts words to numbers (because computers only understand math)
  2. Adds information about word positions (1st word, 2nd word, etc.)
  3. Figures out which words are related to each other
  4. Builds deeper understanding by repeating this process many times

What comes out: Depends on what you need:

  • Understanding text (aka: encoding): A mathematical representation that captures meaning (useful for: "Is this email spam?" or "Find similar articles")
  • Generating text (aka: decoding): Prediction of what word should come next (useful for: ChatGPT, code completion, translation)

Think of a Transformer like an assembly line where each station refines the product. Raw materials (words) enter, each station adds something (position info, relationships, meaning), and the final product emerges more polished at each step.

A Quick Visual Journey

Here's how text flows through a Transformer:

Transformer Architecture: Text Processing Flow

The diagram shows how a simple sentence like "The cat sat on the mat" gets processed through the transformer architecture - from tokenization to final output. The key steps include embedding the tokens into vectors, adding positional information, applying self-attention to understand relationships between words, and repeating the attention and processing steps multiple times to refine understanding.

Modern LLMs repeat the attention and processing steps many times:

  • Small models: 12 repetitions (like BERT)
  • Large models: 120+ repetitions (like GPT-4)
  • Each repetition = one "layer" that deepens understanding

Now let's walk through each step in detail, starting from the very beginning.


Step 1: Tokenization and Embeddings

Before the model can process text, it needs to solve two problems: breaking text into pieces (tokenization) and converting those pieces into numbers (embeddings).

Part A: Tokenization - Breaking Text Into Pieces

The Problem: How do you break text into manageable chunks? You might think "just split by spaces into words," but that's too simple.

Why not just use words?

Consider these challenges:

  • "running" and "runs" are related, but treating them as completely separate words wastes the model's capacity
  • New words like "ChatGPT" appear constantly - you can't have infinite vocabulary
  • Different languages don't use spaces (Chinese, Japanese)

The solution: Subword Tokenization

Modern models break text into subwords - pieces smaller than words but larger than individual characters. Think of it like Lego blocks: instead of needing a unique piece for every possible structure, you reuse common blocks.

Simple example:

Text: "I am playing happily"

Split by spaces (naive approach):
["I", "am", "playing", "happily"]
Problem: Need separate entries for "play", "playing", "played", "player", "plays"...

Subword tokenization (smart approach):
["I", "am", "play", "##ing", "happy", "##ly"]
Better: Reuse "play" and "##ing" for "playing", "running", "jumping"
        Reuse "happy" and "##ly" for "happily", "sadly", "quickly"

Why this matters - concrete examples:

  1. Handling related words:
  • "unhappiness" → ["un", "##happy", "##ness"]
  • Now the model knows: "un" = negative, "happy" = emotion, "ness" = state
  • When it sees "uncomfortable", it recognizes "un" means negative!
  2. Handling rare/new words:
  • Imagine the word "unsubscribe" wasn't in training
  • Model breaks it down: ["un", "##subscribe"]
  • It can guess meaning from pieces it knows: "un" (undo) + "subscribe" (join)
  3. Vocabulary efficiency:
  • 50,000 tokens can represent millions of word combinations
  • Like having 1,000 Lego pieces that make infinite structures

Real example of tokenization impact:

Input: "The animal didn't cross the street because it was tired"

Tokens (what the model actually sees):
["The", "animal", "didn", "'", "t", "cross", "the", "street", "because", "it", "was", "tired"]

Notice:
- "didn't" → ["didn", "'", "t"] (split to handle contractions)
- Each token gets converted to numbers (embeddings) next
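
If you're curious how this works mechanically, here is a tiny, self-contained sketch of greedy longest-match (WordPiece-style) tokenization. The vocabulary and the ## continuation marker are made up for illustration; real tokenizers learn vocabularies of tens of thousands of pieces from data:

# Toy WordPiece-style tokenizer: greedily match the longest piece found
# in a tiny, hand-made vocabulary. Purely illustrative.
VOCAB = {"I", "am", "play", "##ing", "##ed", "un", "[UNK]"}

def tokenize_word(word):
    """Split one word into the longest known pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark continuation pieces
            if candidate in VOCAB:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]                   # no known piece fits
        pieces.append(match)
        start = end
    return pieces

def tokenize(text):
    return [piece for word in text.split() for piece in tokenize_word(word)]

print(tokenize("I am playing"))   # ['I', 'am', 'play', '##ing']
print(tokenize("played"))         # ['play', '##ed']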

Part B: Embeddings - Converting Tokens to Numbers

The Problem: Computers don't understand tokens. They only work with numbers. So how do we convert "cat" into something a computer can process?

Understanding Dimensions with a Simple Analogy

Before we dive in, let's understand what "dimensions" mean with a familiar example:

Describing a person in 3 dimensions:

  • Height: 5.8 feet
  • Weight: 150 lbs
  • Age: 30 years

These 3 numbers (dimensions) give us a mathematical way to represent a person. Now, what if we want to represent a word mathematically?

Describing a word needs way more dimensions:

To capture everything about the word "cat", we need hundreds of numbers:

  • Dimension 1: How "animal-like" is this word? (0.9 - very animal-like)
  • Dimension 2: How "small" is this? (0.7 - fairly small)
  • Dimension 3: How "domestic" is it? (0.8 - very domestic)
  • Dimension 4: How "fluffy" is this? (0.6 - somewhat fluffy)
  • … (and hundreds more capturing different aspects)

Modern models use 768 to 4096 dimensions because words are complex! But here's the key: you don't need to understand what each dimension represents. The model figures this out during training.

How Words Get Converted to Numbers

Let's walk through a concrete example:

# This is a simplified embedding table (real ones have thousands of words)
# Each word maps to a list of numbers (a "vector")
embedding_table = {
    "The": [0.1, 0.3, ..., 0.2],
    "cat": [0.2, -0.5, 0.8, ..., 0.1],    # 768 numbers total
    "dog": [0.3, -0.4, 0.7, ..., 0.2],    # Notice: similar to "cat"!
    "sat": [0.4, 0.2, ..., 0.3],
    "bank": [0.9, 0.1, -0.3, ..., 0.5],   # Very different from "cat"
}

# When we input a sentence:
sentence = "The cat sat"

# Step 1: Break into tokens
tokens = ["The", "cat", "sat"]

# Step 2: Look up each token's vector
embedded = [
    embedding_table["The"],   # Gets: [0.1, 0.3, ..., 0.2]  (768 numbers)
    embedding_table["cat"],   # Gets: [0.2, -0.5, ..., 0.1] (768 numbers)  
    embedding_table["sat"],   # Gets: [0.4, 0.2, ..., 0.3]  (768 numbers)
]

# Result: We now have 3 vectors, each with 768 dimensions
# The model can now do math with these!

Where Does This Table Come From?

Great question! The embedding table isn't written by hand. Here's how it's created:

  1. Start with random numbers: Initially, every word gets random numbers
  • "cat" → [0.43, 0.12, 0.88, …] (random)
  • "dog" → [0.71, 0.05, 0.33, …] (random)
  2. Training adjusts these numbers: As the model trains on billions of text examples, it learns:
  • "cat" and "dog" appear in similar contexts → Their numbers become similar
  • "cat" and "bank" appear in different contexts → Their numbers stay different
  3. After training: Words with similar meanings have similar number patterns
  • "cat" → [0.2, -0.5, 0.8, …]
  • "dog" → [0.3, -0.4, 0.7, …] ← Very similar to "cat"!
  • "happy" → [0.5, 0.8, 0.3, …]
  • "joyful" → [0.6, 0.7, 0.4, …] ← Similar to "happy"!

Why This Matters

These embeddings capture word relationships mathematically:

  • "king" - "man" + "woman" ≈ "queen" (this actually works with the vectors!)
  • Similar words cluster together in this high-dimensional space
  • The model can now reason about word meanings using math
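
Here is a tiny numpy sketch of that "king - man + woman ≈ queen" arithmetic. The 3-dimensional vectors are invented just for illustration; real embeddings have hundreds of learned dimensions, and the analogy only holds approximately:

import numpy as np

# Toy 3-dimensional "embeddings" (made up; real vectors are learned from data)
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.8, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "bank":  np.array([0.1, 0.2, 0.0]),
}

def cosine(a, b):
    """Similarity of direction: close to 1.0 = very similar meaning."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king" - "man" + "woman" lands close to "queen" in this toy space
result = emb["king"] - emb["man"] + emb["woman"]
print(cosine(result, emb["queen"]))   # 1.0 here (very similar)
print(cosine(result, emb["bank"]))    # ~0.39 (much less similar)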

Key Insight: Embeddings as Parameters

When we say GPT-3 has 175 billion parameters, where are they? A significant chunk lives in the embedding table.

What happens in the embedding layer:

  1. Each token in your vocabulary (like "cat" or "the") gets its own vector of numbers
  2. These numbers ARE the parameters - they're what the model learns during training
  3. For a model with 50,000 tokens and 1,024 dimensions per token, that's 51.2 million parameters just for embeddings

Example: If "cat" = token #847, the model looks up row #847 in its embedding table and retrieves a vector like [0.2, -0.5, 0.7, …] with hundreds or thousands of numbers. Each of these numbers is a parameter that was optimized during training.

This is why embeddings contain so much "knowledge" - they encode the meaning and relationships between words that the model learned from massive amounts of text.


Step 2: Adding Position Information

The Problem: After converting words to numbers, we have another issue. Look at these two sentences:

  1. "The cat sat"
  2. "sat cat The"

They have the same words, just in different order. But right now, the model sees them as identical because it just has three vectors with no order information!

Real-world example:

  • "The dog bit the man" vs "The man bit the dog"
  • Same words, completely different meanings!

Transformers process all words at the same time (unlike reading left-to-right), so we need to explicitly tell the model: "This is word #1, this is word #2, this is word #3."

How We Add Position Information

Think of it like adding page numbers to a book. Each word gets a "position tag" added to its embedding.

Simple Example:

# We have our word embeddings from Step 1:
word_embeddings = [
    [0.1, 0.3, 0.2, ...],  # "The" (768 numbers)
    [0.2, -0.5, 0.1, ...], # "cat" (768 numbers)
    [0.4, 0.2, 0.3, ...],  # "sat" (768 numbers)
]

# Now add position information:
position_tags = [
    [0.0, 0.5, 0.8, ...],  # Position 1 tag (768 numbers)
    [0.2, 0.7, 0.4, ...],  # Position 2 tag (768 numbers)  
    [0.4, 0.9, 0.1, ...],  # Position 3 tag (768 numbers)
]

# Combine them (add the numbers together):
final_embeddings = [
    [0.1+0.0, 0.3+0.5, 0.2+0.8, ...],  # "The" at position 1
    [0.2+0.2, -0.5+0.7, 0.1+0.4, ...], # "cat" at position 2
    [0.4+0.4, 0.2+0.9, 0.3+0.1, ...],  # "sat" at position 3
]

# Now each word carries both:
# - What the word means (from embeddings)
# - Where the word is located (from position tags)

How Are Position Tags Created?

The original Transformer paper used a mathematical pattern based on sine and cosine waves. You don't need to understand the math — just know that:

  1. Each position gets a unique pattern - Position 1 gets one pattern, position 2 gets another, etc.
  2. The pattern encodes relative distance - The model can figure out "word 5 is 2 steps after word 3"
  3. It works for any length - The mathematical pattern can extend beyond what the model saw during training, so a model trained on 100-word sentences can still understand word positions in much longer documents (e.g. 1,000 words)
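
Here is a minimal numpy sketch of that sine/cosine pattern, following the formula from the original paper. The sizes below are tiny example values, not real model sizes:

import numpy as np

def positional_encoding(max_len, d_model):
    """Sine/cosine position tags from the original Transformer paper.

    Each position gets a unique pattern of d_model numbers; nearby
    positions get similar patterns, which lets the model judge
    relative distance."""
    positions = np.arange(max_len)[:, None]         # 0, 1, 2, ...
    dims = np.arange(0, d_model, 2)[None, :]        # even dimensions
    angle_rates = 1.0 / (10000 ** (dims / d_model))
    angles = positions * angle_rates

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dims: sine
    pe[:, 1::2] = np.cos(angles)                    # odd dims: cosine
    return pe

# Tag 3 positions with 8-dimensional patterns (a real model uses 768+)
tags = positional_encoding(max_len=3, d_model=8)

# The tags are simply added to the word embeddings, as in the example above
word_embeddings = np.random.rand(3, 8)              # stand-ins for "The", "cat", "sat"
final_embeddings = word_embeddings + tags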

Modern Improvement: Rotary Position Embeddings (RoPE)

Newer models like Llama and Mistral use an improved approach called RoPE (Rotary Position Embeddings).

Simple analogy: Think of a clock face with moving hands:

Word at position 1: Clock hand at 12 o'clock (0°)
Word at position 2: Clock hand at 1 o'clock (30°)
Word at position 3: Clock hand at 2 o'clock (60°)
Word at position 4: Clock hand at 3 o'clock (90°)
...

RoPE: Position Encoding as Clock Rotation

How this connects to RoPE: Just like the clock hands rotate to show different times, RoPE literally rotates each word's embedding vector based on its position. Word 1 gets rotated 0°, word 2 gets rotated 30°, word 3 gets rotated 60°, and so on. This rotation encodes position information directly into the word vectors themselves.
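
A minimal 2-D sketch of that rotation idea. Real RoPE rotates many two-number pairs inside each vector, each at a different speed; the 30° step here just mirrors the clock analogy:

import numpy as np

def rotate(vec_2d, position, degrees_per_step=30):
    """Rotate a 2-D slice of a word vector by an angle that depends on
    its position -- the 'clock hand' idea."""
    theta = np.deg2rad(position * degrees_per_step)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    return rotation @ vec_2d

word = np.array([1.0, 0.0])          # the same word embedding...
at_12_oclock = rotate(word, 0)       # ...at position 1 (0 degrees)
at_1_oclock = rotate(word, 1)        # ...at position 2 (30 degrees)
at_6_oclock = rotate(word, 6)        # ...at position 7 (180 degrees)

# Nearby positions stay similar, distant positions diverge:
print(at_12_oclock @ at_1_oclock)    # ~0.87 (close together)
print(at_12_oclock @ at_6_oclock)    # ~-1.0 (far apart)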

Why this works:

  • Words next to each other have clock hands that are close (12 o'clock vs 1 o'clock)
  • Words far apart have very different clock positions (12 o'clock vs 6 o'clock)
  • Just by looking at the clock hands, the model can tell:
  • Where each word is: "This word is at the 5 o'clock position"
  • How far apart words are: "These two words are 3 hours apart"

Why this matters in practice:

  • Better performance on long documents
  • Enables "context extension" tricks (train on 4K words, use with 32K words)
  • More natural understanding of word distances

Key takeaway: Position encoding ensures the model knows "The cat sat" is different from "sat cat The". Without this, word order would be lost!


Step 3: Understanding Which Words Are Related (Attention)

This is the magic that makes Transformers work! Let's understand it with a story.

The Dinner Party Analogy

Imagine you're at a dinner party with 10 people. Someone mentions "Paris" and you want to understand what they mean:

  1. You scan the room (looking at all other conversations)
  2. You notice someone just said "France" and another said "Eiffel Tower"
  3. You connect the dots - "Ah! They're talking about Paris the city, not Paris Hilton"
  4. You gather information from those relevant conversations

Attention does exactly this for words in a sentence!

Example

Let's process this sentence:

"The animal didn't cross the street because it was too tired."

When the model processes the word "it", it needs to figure out: What does "it" refer to?

Step 1: The word "it" asks questions

  • "I'm a pronoun. Who do I refer to? I'm looking for nouns that came before me."

Step 2: All other words offer information

  • "The" says: "I'm just an article, not important"
  • "animal" says: "I'm a noun! I'm a subject! Pay attention to me!"
  • "didn't" says: "I'm a verb helper, not what you're looking for"
  • "street" says: "I'm a noun too, but I'm the location, not the subject"
  • "tired" says: "I describe a state, might be relevant"

Step 3: "it" calculates relevance scores

  • "animal": 0.45 (45% relevant - very high!)
  • "street": 0.08 (8% relevant - somewhat relevant)
  • "tired": 0.15 (15% relevant - moderately relevant)
  • All others: ~0.02 (2% each - barely relevant)

Step 4: "it" gathers information The model now knows: "it" = mostly "animal" + a bit of "tired" + tiny bit of others

How This Works Mathematically

The model creates three versions of each word:

  1. Query (Q): "What am I looking for?"
  • For "it": Looking for nouns, subjects, things that can be tired
  2. Key (K): "What do I contain?"
  • For "animal": I'm a noun, I'm the subject, I can get tired
  • For "street": I'm a noun, but I'm an object/location
  3. Value (V): "What information do I carry?"
  • For "animal": Carries the actual meaning/features of "animal"

The matching process:

# Simplified example (real numbers would be 768-dimensional)

# Word "it" creates its Query:
query_it = [0.8, 0.3, 0.9]  # Looking for: subject, noun, living thing

# Word "animal" has this Key:
key_animal = [0.9, 0.4, 0.8]  # Offers: subject, noun, living thing

# How well do they match? Multiply and sum:
relevance = (0.8×0.9) + (0.3×0.4) + (0.9×0.8)
          = 0.72 + 0.12 + 0.72  
          = 1.56  # High match!

# Compare with "street":
key_street = [0.1, 0.4, 0.2]  # Offers: not-subject, noun, non-living thing
relevance = (0.8×0.1) + (0.3×0.4) + (0.9×0.2)
          = 0.08 + 0.12 + 0.18
          = 0.38  # Lower match

# Convert to percentages (this is what "softmax" does):
# "animal" gets 45%, "street" gets 8%, etc.

Where Does The Formula Come From?

You might see this formula in papers:

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V

What it means in plain English:

  1. Q × K^T: Match each word's Query against all other words' Keys (like our multiplication above)
  2. / √d_k: Scale down the numbers (prevents them from getting too big)
  3. softmax: Convert to percentages that add up to 100%
  4. × V: Gather information from relevant words based on those percentages

Where it comes from: Researchers from Google Brain discovered in 2017 that this mathematical formula effectively models how words should pay attention to each other. It's inspired by information retrieval (like how search engines find relevant documents).

You don't need to memorize this! Just remember: attention = figuring out which words are related and gathering information from them.
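
If you'd like to see the formula run, here is a minimal numpy version. The Q, K, and V matrices are random stand-ins for three words with four dimensions each; in a real model they come from learned projections of the word embeddings:

import numpy as np

def softmax(scores):
    """Turn raw scores into percentages that sum to 1 along the last axis."""
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # match every Query against every Key
    weights = softmax(scores)         # convert matches to percentages
    return weights @ V, weights       # gather information from the Values

# 3 toy "words", 4 dimensions each (random numbers, purely illustrative)
np.random.seed(0)
Q = np.random.rand(3, 4)   # what each word is looking for
K = np.random.rand(3, 4)   # what each word offers
V = np.random.rand(3, 4)   # the information each word carries

updated_words, attention_weights = attention(Q, K, V)
print(attention_weights.round(2))   # each row sums to 1.0, just like the
                                    # percentage table in the walkthrough below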

Complete Example Walkthrough

Let's see attention in action with actual numbers:

Sentence: "The animal didn't cross the street because it was tired"

When processing "it", the attention mechanism calculates:

Word         Relevance Score    What This Means
─────────────────────────────────────────────────────────
"The"        →  2%              Article, not important
"animal"     → 45%              Main subject! Likely referent
"didn't"     →  3%              Verb helper, not the focus
"cross"      →  5%              Action, minor relevance
"the"        →  2%              Article again
"street"     →  8%              Object/location, somewhat relevant
"because"    →  2%              Connector word
"it"         → 10%              Self-reference (checking own meaning)
"was"        →  8%              Linking verb, somewhat relevant  
"tired"      → 15%              State description, quite relevant
                ─────
Total        = 100%              (Scores sum to 100%)

Result: The model now knows "it" primarily refers to "animal" (45%), with some connection to being "tired" (15%). This understanding gets encoded into the updated representation of "it".

How does this actually update "it"? The model takes a weighted average of all words' Value vectors using these percentages:

# Each word has a Value vector (what information it contains)
value_animal = [0.9, 0.2, 0.8]  # Contains: mammal, four-legged, animate
value_tired = [0.1, 0.3, 0.9]   # Contains: state, adjective, fatigue
value_street = [0.2, 0.8, 0.1]  # Contains: place, concrete, inanimate
# ... (other words)

# Updated representation of "it" = weighted combination
new_it = (45% × value_animal) + (15% × value_tired) + (8% × value_street) + ...
       = (0.45 × [0.9, 0.2, 0.8]) + (0.15 × [0.1, 0.3, 0.9]) + ...
       = [0.52, 0.19, 0.61]  # Now "it" carries meaning from "animal" + "tired"

The word "it" now has a richer representation that includes information from "animal" (heavily weighted) and "tired" (moderately weighted), helping the model understand the sentence better.

Why "Multi-Head" Attention?

Simple analogy: When you read a sentence, you notice multiple things simultaneously:

  • Grammar relationships (subject → verb)
  • Meaning relationships (dog → animal)
  • Reference relationships (it → what does "it" mean?)
  • Position relationships (which words are nearby?)

Multi-head attention lets the model do the same thing! Instead of one attention mechanism, models use 8 to 128 different attention "heads" running in parallel.

Example with the sentence "The fluffy dog chased the cat":

  • Head 1 might focus on: "dog" ↔ "chased" (subject-verb)
  • Head 2 might focus on: "fluffy" ↔ "dog" (adjective-noun)
  • Head 3 might focus on: "chased" ↔ "cat" (verb-object)
  • Head 4 might focus on: nearby words (local context)
  • Head 5 might focus on: animate things (dog, cat)

Important: These specializations aren't programmed! During training, different heads naturally learn to focus on different relationships. Researchers discovered this by analyzing trained models—it emerges automatically.

How they combine:

# Each head produces its own understanding:
head_1_output = attention_head_1(text)  # Finds subject-verb
head_2_output = attention_head_2(text)  # Finds adjective-noun
head_8_output = attention_head_8(text)  # Finds other patterns

# Combine all heads into a rich understanding:
final_output = combine([head_1_output, head_2_output, ..., head_8_output])

# Now each word has information from all types of relationships!

Why this matters: Having multiple attention heads is like having multiple experts analyze the same text from different angles. The final result is much richer than any single perspective.
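
Building on the single-head sketch above, here is a minimal numpy version of multi-head attention. The projection matrices are random stand-ins (in a real model they are learned, which is how heads end up specializing), and the sizes are toy values:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(x, num_heads, rng):
    """Run several attention heads side by side, then stitch them together."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head gets its own projections (random here, learned in practice)
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        head_outputs.append(attention(x @ Wq, x @ Wk, x @ Wv))
    combined = np.concatenate(head_outputs, axis=-1)   # concatenate all heads
    Wo = rng.standard_normal((d_model, d_model))       # final mixing matrix
    return combined @ Wo

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))    # 6 tokens, 8-dimensional embeddings (toy sizes)
out = multi_head_attention(x, num_heads=2, rng=rng)
print(out.shape)                   # (6, 8): same shape, richer representation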

Step 4: Processing the Information (Feed-Forward Network)

After attention gathers information, each word needs to process what it learned. This is where the Feed-Forward Network (FFN) comes in.

Simple analogy:

  • Attention = Gathering ingredients from your kitchen
  • FFN = Actually cooking with those ingredients

What happens:

After "it" gathered information that it refers to "animal" and relates to "tired", the FFN processes this:

# Simplified version
def process_word(word_vector):
    # Step 1: Expand to more dimensions (gives more room to think)
    bigger = expand(word_vector)     # 768 numbers → 3072 numbers

    # Step 2: Apply complex transformations (the "thinking")
    processed = activate(bigger)      # Non-linear processing

    # Step 3: Compress back to original size
    result = compress(processed)      # 3072 numbers → 768 numbers

    return result

What's it doing? Let's trace through a concrete example using our sentence:

Example: Processing "it" in "The animal didn't cross the street because it was tired"

After attention, "it" has gathered information showing it refers to "animal" (45%) and relates to "tired" (15%). Now the FFN enriches this understanding:

Step 1 - What comes in:

Vector for "it" after attention: [0.52, 0.19, 0.61, ...]
This already knows: "it" refers to "animal" and connects to "tired"

Step 2 - FFN adds learned knowledge:

Think of the FFN as having millions of pattern detectors (neurons) that learned from billions of text examples. When "it" enters with its current meaning, specific patterns activate:

Input pattern: word "it" + animal reference + tired state

FFN recognizes patterns:
- Pattern A activates: "Pronoun referring to living creature" → Strengthens living thing understanding
- Pattern B activates: "Subject experiencing fatigue" → Adds physical/emotional state concept  
- Pattern C activates: "Reason for inaction" → Links tiredness to not crossing
- Pattern D stays quiet: "Object being acted upon" → Not relevant here

What the FFN is really doing: It's checking "it" against thousands of patterns it learned during training, like:

  • "When a pronoun refers to an animal + there's a state like 'tired', the pronoun is the one experiencing that state"
  • "Tiredness causes inaction" (learned from millions of examples)
  • "Animals get tired, streets don't" (learned semantic knowledge)

Step 3 - What comes out:

Enriched vector: [0.61, 0.23, 0.71, ...]
Now contains: pronoun role + animal reference + tired state + causal link (tired → didn't cross)

The result: The model now has a richer understanding: "it" isn't just referring to "animal"—it understands the animal is tired, and this tiredness is causally linked to why it didn't cross the street.

Here's another example showing how the FFN resolves ambiguity in word meanings:

Example - "bank":

  • Input sentence: "I sat on the river bank"
  • After attention: "bank" knows it's near "river" and "sat"
  • FFN adds: bank → shoreline → natural feature → place to sit
  • Output: Model understands it's a river bank (not a financial institution!)

Think of FFN as the model's "knowledge base" where millions of facts and patterns are stored in billions of network weights (the connections between neurons). Unlike attention (which gathers context from other words), FFN applies learned knowledge to that context.

It's the difference between:

  • Attention: "What words are nearby?" → Finds "river" and "sat"
  • FFN: "What does 'bank' mean here?" → Applies knowledge: must be shoreline, not finance

Key insight:

  • Attention = figures out which words are related
  • FFN = applies knowledge and reasoning to those relationships

Modern improvement: Newer models use something called "SwiGLU" instead of older activation functions. It provides better performance, but the core idea remains: process the gathered information to extract deeper meaning.
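
If you want to see the expand → activate → compress pattern in actual code, here's a rough NumPy sketch of both the classic FFN and a SwiGLU-style variant. This is my sketch of the idea, not any particular model's exact formulation, and the weight matrices are random placeholders rather than trained values:

import numpy as np

d, d_ff = 768, 3072                         # hidden size and expanded size
x = np.random.randn(d)                      # one word's vector, fresh from attention

# Classic FFN: expand → activate → compress
W1 = np.random.randn(d, d_ff)
W2 = np.random.randn(d_ff, d)
classic = np.maximum(0, x @ W1) @ W2        # ReLU in the middle, back to 768 numbers

# SwiGLU-style FFN: a "gate" decides how much of each expanded value passes through
W_gate = np.random.randn(d, d_ff)
W_up   = np.random.randn(d, d_ff)
W_down = np.random.randn(d_ff, d)
swish = lambda z: z / (1 + np.exp(-z))      # SiLU/Swish activation
swiglu = (swish(x @ W_gate) * (x @ W_up)) @ W_down

print(classic.shape, swiglu.shape)          # both come back as (768,)

The gate lets the model learn to let some expanded features through untouched and suppress others, which in practice tends to train a bit better.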


Step 5: Two Important Tricks (Residual Connections & Normalization)

These might sound technical, but they solve simple problems. Let me explain with everyday analogies.

Residual Connections: The "Don't Forget Where You Started" Trick

The Problem: Imagine you're editing a document. You make 96 rounds of edits. By round 96, you've completely forgotten what the original said! Sometimes the original information was important.

The Solution: Keep a copy of the original and mix it back in after each edit.

In the Transformer:

# Start with a word's representation
original = [0.2, 0.5, 0.8, ...]  # "cat" representation

# After attention + processing, we get changes
changes = [0.1, -0.2, 0.3, ...]  # What we learned

# Residual connection: Keep the original + add changes
final = original + changes
      = [0.2+0.1, 0.5-0.2, 0.8+0.3, ...]
      = [0.3, 0.3, 1.1, ...]  # Original info preserved!

Better analogy: Think of editing a photo:

  • Without residual: Each filter completely replaces the image (after 50 filters, original is lost)
  • With residual: Each filter adds to the image (original always visible + 50 layers of enhancements)

Why this matters: Deep networks (96-120 layers) need this. Otherwise, information from early layers disappears by the time you reach the end.

Layer Normalization: The "Keep Numbers Reasonable" Trick

The Problem: Imagine you're calculating daily expenses:

  • Day 1: ₹500
  • Day 2: ₹450
  • Day 3: ₹520
  • Then suddenly Day 4: ₹50,00,00,000 (a bug in your calculator!)

The huge number breaks everything.

The Solution: After each step, check if numbers are getting too big or too small, and adjust them to a reasonable range.

What normalization does:

Before normalization:

Word vectors might be:
"the":  [0.1, 0.2, 0.3, ...]
"cat":  [5.2, 8.9, 12.3, ...]      ← Too big!
"sat":  [0.001, 0.002, 0.001, ...] ← Too small!

After normalization:

"the":  [0.1, 0.2, 0.3, ...]
"cat":  [0.4, 0.6, 0.8, ...]      ← Scaled down to reasonable range
"sat":  [0.2, 0.4, 0.1, ...]      ← Scaled up to reasonable range

How it works (simplified):

import numpy as np

def layer_norm(vector):
    # 1. Calculate the average and spread of this word's numbers
    average = vector.mean()
    spread = vector.std()

    # 2. Adjust so the average is about 0 and the spread is about 1
    #    (the tiny 1e-5 avoids dividing by zero)
    return (vector - average) / (spread + 1e-5)

# Now all numbers are in a similar range!

Why this matters:

  • Prevents numbers from exploding or vanishing
  • Makes training faster and more stable
  • Like cruise control for your model's internal numbers

Key takeaway: These two tricks (residual connections + normalization) are like safety features in a car—they keep everything running smoothly even when the model gets very deep (many layers).
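
To see how the two tricks slot into a real layer, here's a tiny runnable sketch of one Transformer layer in the order attention → add & normalize → FFN → add & normalize. The fake_attention and fake_ffn functions are stand-ins for the real sub-layers, and layer_norm repeats the helper from above so the snippet runs on its own:

import numpy as np

def layer_norm(v):
    return (v - v.mean()) / (v.std() + 1e-5)

def transformer_layer(x, attention_fn, ffn_fn):
    # Attention, then "Add & Normalize": keep x, add what attention learned
    x = layer_norm(x + attention_fn(x))
    # FFN, then "Add & Normalize" again: keep x, add what the FFN computed
    x = layer_norm(x + ffn_fn(x))
    return x

# Stand-in sub-layers just to show the wiring; the real ones are the steps above
fake_attention = lambda v: 0.1 * v
fake_ffn = lambda v: 0.2 * v

x = np.random.randn(768)            # one word's vector
for _ in range(12):                 # a small 12-layer model
    x = transformer_layer(x, fake_attention, fake_ffn)
print(x.shape)                      # still (768,): same size, richer meaning in a real model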


Three Types of Transformer Models

Transformers come in three varieties, like three different tools in a toolbox. Each is designed for specific jobs.

Type 1: Encoder-Only (BERT-style) - The "Understanding" Expert

Think of it like: A reading comprehension expert who thoroughly understands text but can't write new text.

How it works: Sees the entire text at once, looks at relationships in all directions (words can look both forward and backward).

Training example:

Show it: "The [MASK] sat on the mat"
It learns: "The cat sat on the mat"

By filling in blanks, it learns deep understanding!
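
You can try this fill-in-the-blank behavior yourself with an off-the-shelf BERT model, for example through the Hugging Face transformers library (assuming it's installed; the model downloads on first run, and the exact guesses may differ):

from transformers import pipeline

# A small BERT-style model trained with the fill-in-the-blank objective
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for guess in unmasker("The [MASK] sat on the mat."):
    print(f"{guess['token_str']:>8}  (confidence: {guess['score']:.2f})")

# Top guesses are usually words like "cat" or "dog": knowledge it picked up
# purely by filling in blanks during training.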

Real-world uses:

  • Email spam detection: "Is this email spam or legitimate?"
    • Needs: Deep understanding of the entire email
    • Example: Gmail's spam filter
  • Search engines: "Find documents similar to this query"
    • Needs: Understanding what documents mean
    • Example: Google Search understanding your query
  • Sentiment analysis: "Is this review positive or negative?"
    • Needs: Understanding the overall tone
    • Example: Analyzing customer feedback

Popular models: BERT, RoBERTa (used by many search engines)

Key limitation: Can understand and classify text, but cannot generate new text. It's like a reading expert who can't write.

Type 2: Decoder-Only (GPT-style) - The "Writing" Expert

Think of it like: A creative writer who generates text one word at a time, always building on what came before.

How it works: Processes text from left to right. Each word can only "see" previous words, not future ones (because future words don't exist yet during generation!).

Training example:

Show it: "The cat sat on the"
It learns: Next word should be "mat" (or "floor", "chair", etc.)

By predicting next words billions of times, it learns to write!

Why only look backward? Because when generating text, future words don't exist yet—you can only use what you've written so far. It's like writing a story one word at a time: after "The cat sat on the", you can only look back at those 5 words to decide what comes next.

When predicting "sat":
  Can see: "The", "cat"  ← Use these to predict
  Cannot see: "on", "the", "mat"  ← Don't exist yet during generation
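
Mechanically, this "only look backward" rule is enforced with a causal mask: attention scores that point at future words are blanked out before the softmax. Here's a rough NumPy sketch with random placeholder scores:

import numpy as np

words = ["The", "cat", "sat", "on", "the", "mat"]
n = len(words)

scores = np.random.randn(n, n)            # raw attention scores: every word vs every word

# Causal mask: row i may only look at columns 0..i (itself and earlier words)
mask = np.tril(np.ones((n, n), dtype=bool))
scores = np.where(mask, scores, -np.inf)  # future positions get minus infinity

# Softmax turns scores into percentages; minus infinity becomes exactly 0% attention
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights[2], 2))  # row for "sat": only "The", "cat", "sat" get non-zero weight

Everything above the diagonal ends up with exactly 0% attention, which is the mechanical meaning of "future words don't exist yet."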

Real-world uses:

  • ChatGPT / Claude: Conversational AI assistants
    • Task: Generate helpful responses to questions
    • Example: "Explain quantum physics simply" → generates explanation
  • Code completion: GitHub Copilot
    • Task: Complete your code as you type
    • Example: You type def calculate_ → it suggests the rest
  • Content creation: Blog posts, emails, stories
    • Task: Generate coherent, creative text
    • Example: "Write a product description for…" → generates description

Popular models: GPT-4, Claude, Llama, Mistral (basically all modern chatbots)

Why this is dominant: These models can both understand AND generate, making them incredibly versatile. This is what you use when you chat with AI.

Type 3: Encoder-Decoder (T5-style) - The "Translator" Expert

Think of it like: A two-person team: one person reads and understands (encoder), another person writes the output (decoder).

How it works:

  1. Encoder (the reader): Thoroughly understands the input, looking in all directions
  2. Decoder (the writer): Generates output one word at a time, consulting the encoder's understanding

Training example:

Input (to encoder):  "translate English to French: Hello world"
Output (from decoder): "Bonjour le monde"

Encoder understands English, Decoder writes French!
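
Here's roughly what that looks like with a small off-the-shelf T5 model through the Hugging Face transformers library (assuming it's installed; the exact output wording can vary):

from transformers import pipeline

# t5-small was trained on exactly this kind of "task prefix" input
translator = pipeline("translation_en_to_fr", model="t5-small")

result = translator("Hello world")
print(result[0]["translation_text"])   # something like "Bonjour le monde"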

Real-world uses:

  • Translation: Google Translate
    • Task: Convert text from one language to another
    • Example: English → Spanish, preserving meaning
  • Summarization: News article summaries
    • Task: Read long document (encoder), write short summary (decoder)
    • Example: 10-page report → 3-sentence summary
  • Question answering:
    • Task: Read document (encoder), generate answer (decoder)
    • Example: "Based on this article, what caused…?" → generates answer

Popular models: T5, BART (less common nowadays)

Why less popular now: Decoder-only models (like GPT) turned out to be more versatile—they can do translation AND chatting AND coding, all in one architecture. Encoder-decoder models are more specialized.

Quick Decision Guide: Which Type Should You Use?

Need to understand/classify text? → Encoder (BERT)

  • Spam detection
  • Sentiment analysis
  • Search/similarity
  • Document classification

Need to generate text? → Decoder (GPT)

  • Chatbots (ChatGPT, Claude)
  • Code completion
  • Creative writing
  • Question answering
  • Content generation

Need translation/summarization only? → Encoder-Decoder (T5)

  • Language translation
  • Document summarization
  • Specific input→output transformations

Not sure? → Use Decoder-only (GPT-style)

  • Most versatile
  • Can handle both understanding and generation
  • This is what most modern AI tools use

Bottom line: If you're building something today, you'll most likely use a decoder-only model (like GPT, Claude, Llama) because they're the most flexible and powerful.


Scaling the Architecture

Now that you understand the components, let us see how they scale:

What Gets Bigger?

As models grow from small to large, here's what changes:

| Component | Small (125M params) | Medium (7B params) | Large (70B params) |
|----|----|----|----|
| Layers (depth) | 12 | 32 | 80 |
| Hidden size (vector width) | 768 | 4,096 | 8,192 |
| Attention heads | 12 | 32 | 64 |

Key insights:

1. Layers (depth) - This is how many times you repeat Steps 3 & 4

  • Each layer = one pass of Attention (Step 3) + FFN (Step 4)
  • Small model with 12 layers = processes the sentence 12 times
  • Large model with 80 layers = processes the sentence 80 times
  • Think of it like editing a document: more passes = more refinement and deeper understanding

Example: Processing "it" in our sentence:

  • Layer 1: Figures out "it" refers to "animal"
  • Layer 5: Understands the tiredness connection
  • Layer 15: Grasps the causal relationship (tired → didn't cross)
  • Layer 30: Picks up subtle implications (the animal wanted to cross but couldn't)

2. Hidden size (vector width) - How many numbers represent each word

  • Bigger vectors = more "memory slots" to store information
  • 768 dimensions vs 8,192 dimensions = like having 768 notes vs 8,192 notes about each word
  • Larger hidden size lets the model capture more nuanced meanings and relationships

3. Attention heads - How many different perspectives each layer examines

  • 12 heads = looking at the sentence in 12 different ways simultaneously
  • 64 heads = 64 different ways (grammar, meaning, references, dependencies, etc.)
  • More heads = catching more types of word relationships in parallel

Where do the parameters live?

Surprising fact: The Feed-Forward Network (FFN) actually takes up most of the model's parameters, not the attention mechanism!

Why? In each layer:

  • Attention parameters: relatively small (mostly for Q, K, V transformations)
  • FFN parameters: huge (expands 4,096 dimensions to 16,384 then back, with millions of learned patterns)

In large models, FFN parameters typically outnumber attention parameters by roughly 2-4x, depending on the architecture. That's where the "knowledge" is stored!

Why Self-Attention is Expensive: The O(N²) Problem

Simple explanation: Every word needs to look at every other word. If you have N words, that's N × N comparisons.

Concrete example:

3 words: "The cat sat"
- "The" looks at: The, cat, sat (3 comparisons)
- "cat" looks at: The, cat, sat (3 comparisons)
- "sat" looks at: The, cat, sat (3 comparisons)
Total: 3 × 3 = 9 comparisons

6 words: "The cat sat on the mat"
- Each of 6 words looks at all 6 words
Total: 6 × 6 = 36 comparisons (4x more for 2x words!)

12 words:
Total: 12 × 12 = 144 comparisons (16x more for 4x words!)

The scaling problem:

| Sentence Length | Attention Calculations | Growth Factor |
|----|----|----|
| 512 tokens | 262,144 | 1x |
| 2,048 tokens | 4,194,304 | 16x more |
| 8,192 tokens | 67,108,864 | 256x more |

Why this matters: Doubling the length doesn't double the work—it quadruples it! This is why:

  • Long documents are expensive to process
  • Context windows have hard limits (memory/compute)
  • New techniques are needed for longer contexts

Solutions being developed:

  • Flash Attention: Clever memory tricks to compute attention faster
  • Sliding window attention: Each word only looks at nearby words (not all words)
  • Sparse attention: Skip some comparisons that matter less

These tricks help models handle longer texts without paying the full quadratic cost!
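
To get a feel for the sliding-window idea, here's a small NumPy sketch comparing how many comparisons a full causal mask allows versus a windowed one. The window size here is an arbitrary choice for illustration:

import numpy as np

n, window = 8, 2   # 8 words; each word may look at itself and the 2 words before it

# Full causal attention: everything at or before position i is allowed
full_causal = np.tril(np.ones((n, n), dtype=int))

# Sliding-window attention: only positions i-window .. i are allowed
idx = np.arange(n)
sliding = ((idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] <= window)).astype(int)

print("Full causal comparisons:   ", full_causal.sum())   # grows like N² / 2
print("Sliding window comparisons:", sliding.sum())        # grows like N × (window + 1)

With a fixed window, doubling the text roughly doubles the allowed comparisons instead of quadrupling them.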


Understanding the Complete Architecture Diagram

Important: This diagram represents the universal Transformer architecture. All Transformer models (BERT, GPT, T5) follow this basic structure, with variations in how they use certain components.

Let's walk through the complete flow step by step:

Complete transformer architecture flow

Detailed Walkthrough with Example

Let's trace "The cat sat" through this architecture:

Step 1: Input Tokens

Your text: "The cat sat"
Tokens: ["The", "cat", "sat"]

Step 2: Embeddings + Position

"The" → [0.1, 0.3, ...] + position_1_tag → [0.1, 0.8, ...]
"cat" → [0.2, -0.5, ...] + position_2_tag → [0.4, -0.2, ...]
"sat" → [0.4, 0.2, ...] + position_3_tag → [0.8, 0.5, ...]

Now each word is a 768-number vector with position info!

Step 3: Through N Transformer Layers (repeated 12-120 times)

Each layer does this:

Step 4a: Multi-Head Attention

- Each word looks at all other words
- "cat" realizes it's the subject
- "sat" realizes it's the action "cat" does
- Words gather information from related words

Step 4b: Add & Normalize

- Add original vector back (residual connection)
- Normalize numbers to reasonable range
- Keeps information stable

Step 4c: Feed-Forward Network

- Process the gathered information
- Apply learned knowledge
- Each word's vector gets richer

Step 4d: Add & Normalize (again)

- Add vector from before FFN (another residual)
- Normalize again
- Ready for next layer!

After going through all N layers, each word's representation is incredibly rich with understanding.

Step 5: Linear + Softmax

Take the final word's vector: [0.8, 0.3, 0.9, ...]

Convert to predictions for EVERY word in vocabulary (50,000 words):
"the"    → 5%
"a"      → 3%
"on"     → 15%  ← High probability!
"mat"    → 12%
"floor"  → 8%
...
(All probabilities sum to 100%)

Step 6: Output

Pick the most likely word: "on"

Complete sentence so far: "The cat sat on"

Then repeat the whole process to predict the next word!
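
That repeat-and-append loop is the heart of text generation. Here's a stripped-down sketch of greedy decoding, where model_forward stands in for the whole Transformer stack described above:

import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate(tokens, model_forward, vocab, steps=5):
    for _ in range(steps):
        logits = model_forward(tokens)              # one score per word in the vocabulary
        probs = softmax(logits)                     # convert scores to probabilities
        next_word = vocab[int(np.argmax(probs))]    # greedy: pick the most likely word
        tokens = tokens + [next_word]               # append it, feed the longer text back in
    return tokens

# Tiny stand-in "model" and vocabulary, just to show the loop in action
vocab = ["the", "cat", "sat", "on", "mat"]
fake_model = lambda tokens: np.random.randn(len(vocab))

print(generate(["The", "cat", "sat"], fake_model, vocab))

Real models usually sample from the probabilities instead of always taking the top word (that's what settings like temperature and top-p control), but the loop itself is the same.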

How the Three Model Types Use This Architecture

Now that you've seen the complete flow, here's how each model type uses it differently:

1. Encoder-Only (BERT):

  • Uses: Steps 1-4 (everything except the final output prediction)
  • Attention: Bidirectional - each word sees ALL other words (past AND future)
  • Training: Fill-in-the-blank ("The [MASK] sat" → predict "cat")
  • Purpose: Rich understanding for classification, search, sentiment analysis

2. Decoder-Only (GPT, Claude, Llama):

  • Uses: All steps 1-6 (the complete flow we just walked through)
  • Attention: Causal/Unidirectional - each word only sees PAST words
  • Training: Next-word prediction ("The cat sat" → predict "on")
  • Purpose: Text generation, chatbots, code completion

3. Encoder-Decoder (T5):

  • Uses: TWO stacks - one encoder (steps 1-4), one decoder (full steps 1-6)
  • Encoder: Bidirectional attention to understand input
  • Decoder: Causal attention to generate output, also attends to encoder
  • Training: Input→output mapping ("translate: Hello" → "Bonjour")
  • Purpose: Translation, summarization, transformation tasks

Three Model Types (encoder, decoder, encoder-decoder) Transformer Architecture

The key difference: Same architecture blocks, different attention patterns and how they're connected!

Additional Key Insights

It's a loop: For generation, this process repeats. After predicting "on", the model adds it to the input and predicts again.

The "N" matters:

  • Small models: N = 12 layers
  • GPT-3: N = 96 layers
  • GPT-4: N = 120+ layers
  • More layers = deeper understanding but slower/more expensive

This is universal: Whether you're reading a research paper about a new model or trying to understand GPT-4, this diagram applies. The core architecture is the same!


Practical Implications

Understanding the architecture helps you make better decisions:

1. Context Window Limitations

The context window is not just a number—it is a hard architectural limit. A model trained on 4K context cannot magically understand 100K tokens without modifications (RoPE interpolation, fine-tuning, etc.).

2. Why Position Matters

Tokens at the beginning and end of context often get more attention (primacy and recency effects). If you have critical information, consider its placement in your prompt.

3. Layer-wise Understanding

Early layers capture syntax and basic patterns. Later layers capture semantics and complex reasoning. This is why techniques like layer freezing during fine-tuning work—early layers transfer well across tasks.

4. Attention is Expensive

Every extra token in your prompt increases compute quadratically. Be concise when you can.


Key Takeaways

  • Transformers process all tokens in parallel, using positional encoding to preserve order
  • Self-attention lets each token gather information from all other tokens
  • Multi-head attention captures different types of relationships simultaneously
  • Residual connections and layer normalization enable training very deep networks
  • Encoder-only models (BERT) excel at understanding; decoder-only (GPT) at generation
  • Modern LLMs are decoder-only with causal masking
  • Context window limitations come from O(n²) attention complexity
  • Understanding architecture helps you write better prompts and choose appropriate models


We Asked 14 Tech Bloggers Why They Write. Here's What They Said

2026-01-01 15:16:34

We interviewed a dozen(ish) expert tech bloggers over the past year to share perspectives and tips beyond Writing for Developers. The idea: ask everyone the same set of questions and hopefully see an interesting range of responses emerge. They did.

You can read all the interviews here. We’ll continue the interview series (and maybe publish some book spinoff posts too). But first, we want to pause and compare how the first cohort of interviewees responded to specific questions.

Here's how everyone answered the question “Why did you start blogging – and why do you continue?”

Aaron Francis

I started blogging as a way to get attention for a product that I was working on. And while that product never really worked out, I started getting interest from people that wanted me to come work for them, either as a freelancer or as a full-time employee. And since then, I have realized what a cheat code it is to have a public body of work that people can just passively stumble upon. It’s like having a bunch of people out there advocating for you at all times, even while you’re sleeping.


antirez

I don’t know exactly, but in general, I want to express my interest in things I like, in my passions. It was not some kind of calculation where I said: oh, well, blogging would benefit my career. I just needed to do it.


Charity Majors

I started writing at a big life inflection point -- the brief period after I left Facebook but before I started Honeycomb. I had started giving talks, and found it surprisingly rewarding, but I’m not an extrovert and I’ve always considered myself more of a writer-thinker than a talker-thinker, so I thought I might as well give it a try.

There are very few things in life that I am prouder of than the body of writing I have developed over the past 10 years. I have had a yearly goal of publishing about one longform piece of writing per month. I don’t think I’ve ever actually hit that goal, but some years I have come close! When I look back over things I have written, I feel like I can see myself growing up, my mental health improving, I’m getting better at taking the long view, being more empathetic, being less reactive… I’ve never graduated from anything in my life, so to me, my writing kind of externalizes the progress I’ve made as a human being. It’s meaningful to me.


Eric Lippert

I started my blog, which is now at ericlippert.com, more than 20 years ago. I worked for the Developer Division at Microsoft at the time. As developers working on tools for other developers just like us, we felt a lot of empathy for our customers and were always looking for feedback on what their wants and needs were. But the perception of Microsoft by those customers was that the company was impersonal, secretive, uncommunicative, and uncaring. When blogs started really taking off in the early 2000s, all employees were encouraged to reach out to customers and put a more human, open, and empathetic face on the company, and I really went for that opportunity.

I was on the scripting languages team at the time, and our public-facing documentation was sparse. Our documentation was well-written and accurate, but there was only so much work that our documentation “team” – one writer – could do. I decided to focus my blog on the stuff that was missing from the documentation: odd corners of the language, why we’d made certain design decisions over others, that sort of thing. My tongue-in-cheek name for it was “Fabulous Adventures In Coding.” At its peak, I think it was the second most popular blog on MSDN that was run by an individual rather than a team.

For most of the last decade I’ve been at Facebook, which discourages employees blogging about their work, so my rate of writing dropped off precipitously then. And since leaving Facebook a couple years ago, I haven’t blogged much at all. I do miss it, and I might pick it up again this winter. I really enjoy connecting with an audience.

Editor’s note: He’s now writing a book.


fasterthanlime

I genuinely cannot remember why I started, because I’ve been blogging for about 15 years! That’s just what the internet was like back then? It wasn’t weird for people to have their own website — it was part of maintaining your online identity. We’re starting to see that come back in that post-Twitter era, folks value having their own domain name more, and pick up blogging again.

I can say, though, that in 2019 I started a Patreon to motivate me to take writing more seriously — I’m reluctant to call it “blogging” at this point because some of my longer articles are almost mini-books! Some can take a solid hour to go through. At the time, I was sick of so many articles glossing over particulars: I made it my trademark to go deep into little details, and not to be afraid to ask questions.


Glauber Costa

I started blogging years ago at ScyllaDB. I was initially forced to do it, but I ended up really enjoying it. [This is strikingly similar to Sarna’s “Stockholm Syndrome” story in Chapter 1 of the book].

I’ve always liked teaching people and I saw that technical blogging was a way to do that…at scale. As I was learning new things, often working with previously unexplored technologies and challenges, blogging gave me this opportunity to teach a large audience of people about what I discovered.

I kept doing it because it actually works. It really does reach a lot of people. And it’s very rewarding when you find that your blog is getting people to think differently, maybe even do something differently.


Gunnar Morling

I started blogging for a couple of reasons really. First, it just helps me to take note of things I learned and which I might want to refer to again in the future, like how to prevent ever-growing replication slots in Postgres. I figured, instead of writing things like that down just for myself, I could make these notes available on a blog so others could benefit from them, too. Then, I like to explore and write about technologies such as Java, OpenJDK Flight Recorder, or Apache Kafka. Some posts also distill the experience from many years, for instance discussing how to do code reviews or writing better conference talk abstracts. Oftentimes, folks will add their own thoughts in a comment, from which I can then learn something new again–so it's a win for everyone.

Another reason for blogging is to spread the word about things I've been working on, such as kcctl 🐻, a command-line client for Kafka Connect. Such posts will, for instance, introduce a new release or discuss a particular feature or use case and they help to increase awareness for a project and build a community around it. Or they might announce efforts such as the One Billion Row coding challenge I did last year. Finally, some posts are about making the case for specific ideas, say, continuous performance regression testing, or how Kafka Connect could be re-envisioned in a more Kubernetes-native way.

Overall — and this is why I keep doing it — blogging is a great way for me to express my thoughts, ideas, and learnings, and share them with others. It allows me to get feedback and input on new projects, and it’s an opportunity for helping as well as learning from others.


Jeff Atwood

I love blogging – it’s how I got to where I am. Steve McConnell‘s book Code Complete is what inspired me to start blogging. His voice was just so human. Instead of the traditional chest-thumping about “My algorithm is better than your algorithm,” it was about “Hey, we’re all fallible humans writing software for other fallible humans.” I thought, “Oh my God, this is humanistic computing.” I loved it! I knew I had to write like that too. That’s what launched me on my journey.

Now more than ever, I think it’s important to realize that we’ve given everyone a Gutenberg printing press that reaches every other human on the planet. At first blush, that sounds amazing. Wow, everybody can talk to everybody! But then the terror sets in: Oh, my God. Everybody can talk to everybody – this is a nightmare.

I think blogs are important because it’s a structured form of writing. Sadly, chat tends to dominate now. I want people to articulate their thoughts, to really think about what they’re saying – structure it, have a story with a beginning, a middle, and an end. It doesn’t have to be super long. However, chat breaks everything up into a million pieces. You have these interleaved conversations where people are just typing whatever pops into their brain, sometimes with 10 people doing that at the same time. How do you create a narrative out of this? How do you create a coherent story out of chat?

I think blogging is a better mental exercise. Tell the story of what happened to you, and maybe we can learn from it. Maybe you can learn from your own story, perhaps from the whole rubber ducking aspect of it. As you’re explaining it to yourself, you’re also creating a public artifact that can benefit others who might have the same problem or a related story. And it’s your story – what’s unique about you. I want to hear about you as a person – your unique experience and what you’ve done and what you’ve seen. That’s what makes humanity great. And I think blogs are an excellent medium for that.

There’s certainly a place for video, there’s a place for chat. These tools all have their uses, but use the appropriate tool for the appropriate job. I think blogs are a very, very versatile tool in terms of median length, telling a story, and sharing it with the public.

If you look at the history of humanity, the things that have really changed the world have been in writing – books, novels, opinion pieces, even blogs. The invention of language was important, but the invention of writing was so much bigger. With writing, you didn’t have to depend on one person being alive long enough to tell the story over and over. You could write it down and then it could live on forever.

I encourage everyone to write, even if you write only for yourself. I think it’s better if you write in public because you can get feedback that way. You can learn so much from the feedback – learn that others feel the same way, learn about aspects you didn’t think about, etc. But it’s scary. I get it – people are afraid of putting themselves out there. Write for just you if you want, but write… just write.


Matt Butcher

I first started blogging in earnest back in the early 2000s because I was at university and wanted to share that experience with the rest of my family. Somewhere along the line, I started a second blog to record (mainly for myself) the technologies and tools I was learning.

These early tech blog posts were pretty basic. I’d learn a new sed trick, and write a couple hundred words about it. I’d try a new code editor, take a screenshot, and write my basic impressions. Embarrassingly, sometimes my blog posts were terribly inaccurate. I once wrote one on optimizing tree walking algorithms that was totally wrong. But I just updated it later with a note that said I’d learned more and now realized there were better ways of doing things.

In those early days, I never used any analytics or anything. I had no idea if anyone ever read what I wrote. Then one day, a friend of mine got really into SEO and asked if I would set up Google Analytics and share with him so he could learn a bit. I was utterly shocked by what we learned: My blog had a ton of traffic, and some of the most basic posts (like the one about sed) were perennially popular.

I’ve blogged on and off since then. These days, I mainly post on the Fermyon blog. And those posts are more theoretical than my early how-to focused posts.


Phil Eaton

My shameless goal when I started blogging in 2015 was to become a regular on the front of Hacker News (HN) because I got the sense that it would be good for my career. And I enjoyed the type of posts that made a great fit for HN; posts that were a little crazy yet taught you something useful.

After becoming a manager in 2017, I realized the importance and value of writing for the sake of communicating and my focus on writing shifted from “writing about zany explorations” to “writing as a means for teaching myself a topic, or solidifying my understanding of the topic.”

I started to notice, observing both myself and coworkers, that we developers let so many educational opportunities pass without recording the results. What a waste!

Not only is writing about what you learn good for your own understanding and your team’s understanding and for the internet’s understanding, it’s good marketing for you and the company you work for. Good marketing in the sense that when people see someone write a useful blog post, they think “that person is cool, and the company they work for must also be cool; I want to work with them or work for that company or buy from that company.”

So there’s this confluence of reasons that make blogging so obviously worth the time.


Preston Thorpe

Honestly, I don’t think I made a conscious decision to start blogging. I just remember being involved in open source projects and my story wasn’t really out there yet. And it felt odd that nobody knew my situation.

I didn’t want to just flat out randomly tell people that I’m incarcerated. So I decided that I’d write my story down for anyone to find if they came across my profile. I really did not expect for many people to actually read it, so it was pretty shocking to see it on the front page of HN a couple days after it was published.

I try my best to keep writing, although I don’t write as often as I should. Writing an in-depth technical blog post about a feature built or a problem solved allows me to fully absorb and understand it even better than just implementing it, so this is another reason I feel like I will continue to write. It serves as both personal motivation to more deeply understand something I am working on, as well as a way to share that knowledge with others.


Sam Rose

I started blogging about tech in my final year of university. The earliest posts on http://samwho.dev, the ones from 2011, were written at this time. I’d heard that having an online presence would help me get a job, so I started writing about the things I was learning.

I wrote sporadically for years, most of my posts only getting a trickle of traffic, but I did have a few modestly successful ones. https://samwho.dev/blog/the-birth-and-death-of-a-running-program/ did well and even ended up as part of the Georgia Tech CS2110 resources list. One of the lecturers, who has since retired, emailed me in 2013 asking if he could use it. I was concerned because the post had swearing in it, but he said “swearing is attention getting and helps the reader stay alert.”

The blog posts I’ve become known for, the ones with lots of visual and interactive bits, started in the first half of 2023. I’d long admired the work of https://ciechanow.ski/ and wanted to see if I could apply his style to programming. It’s working well so far!

As for why I continue, I’ve been gainfully employed for a long time now, so my initial motivations for writing are long gone. I think my blog does help when I have conversations with employers, but that’s not the goal anymore.

I have this dream of being a teacher. I’ve dabbled in many forms of teaching: teaching assistant in university for some of my lecturers, mentoring in commercial and personal capacities, moderating learning communities, volunteering at bootcamps and kids’ groups. What if I could just… teach for a living?

I’m trying to make use of the attention these blog posts are getting to see if I can make steps towards doing just that.


Tanel Poder

Over the years, I had accumulated a number of useful scripts and techniques for troubleshooting the common OS & database problems I had encountered. At first, I created the blog (on June 18, 2007) as a lookup table for my future self. I uploaded all my open source tools to my blog and wrote articles about the more interesting troubleshooting scenarios. When I visited a customer to solve a problem, we could just copy & paste relevant scripts and commands from my blog. That way, I didn’t have to show up with a USB stick and try to get it connected to the internal network first.

Why do I continue? There’s so much cool stuff and interesting problems to write about. When writing, you have to do additional research to make sure your understanding is good enough. That’s the fun part. Systems are getting more complex, so you need to find new ways to “stay systematic” when troubleshooting problems and not go with trial & error and guesswork. These kinds of topics are my favorite, how to measure the right things, from the right angle, at the right time – so that even new unforeseen problems would inevitably reveal their causes to you via numbers and metrics, without any magic needed.

What makes me really happy is when people contact me and say that they were able to solve completely different problems that I hadn’t even seen, on their own, with the aid of the tools and systematic troubleshooting advice I have in my blog.


Thorsten Ball

I published my first blog post on thorstenball.com in 2012. It’s this one about implementing autocompletion using Redis. I don’t know exactly why I started the blog, but, looking back, I think the main motivation was to share something I was proud of. It was a cool bit of code, it took me a bit to figure out, I learned a lot in the process, and I wanted to share the excitement.

At that time, I was also a junior software developer, having recently finished my first internship, trying to switch from studying philosophy into being an engineer and, I think, there was also a bit of “my blog can be a CV” aspect to it.

Back then, a friend had told me: you don’t need a degree to get a job as a software engineer, all you need to do is to show that you can do the job, because, trust me, he said, there’s a lot of people who have degrees but can’t do the job.

I figured that having a blog with which I can share what I learned, what I did, well, that’s a way to show that I can do the job. Now, I don’t think a lot of recruiters have read my blog, but I still believe there’s something to it: you’re sharing with the world what you do, what you learned, how you think — that’s a good thing in and of itself, and even if someone only takes a brief look at your blog before they interview you, I think that can help.

But, I also have to admit that I’ve been writing on the internet in one form or another, since the early 2000s, when I was a teenager. I had personal websites and blogs since I was 14 years old. I shared tutorials on web forums. There’s just something in me that makes me want to share stuff on the internet.

Nowadays I mostly write my newsletter, Register Spill, which I see as a different form of blogging, and for that newsletter, I have a few reasons:

  • I enjoy the writing. Well, okay, I enjoy having written. But, in general: I’m proud of writing something that’s good.

  • I’ve enjoyed tweeting a lot, but in the past few years the social media landscape has become so fractured that I decided to create a place of my own, a place where people can subscribe and follow me, where I can potentially take their emails and send them newsletters even if the platform decides to shut down.

  • Writing is thinking. I like sitting down and ordering my thinking in order to write something. The feelings of “I want to write something” and “I want to really think through this topic and share it” are similar for me.


Book info


This Python “Auto-Painter” Creates a New Universe Every Time You Run It

2026-01-01 15:13:12

There’s a strange feeling that washes over you when you witness something you’ve created take on a life of its own. It’s not just pride; it’s a deep, almost philosophical resonance. That’s the feeling I’ve carried since my “Auto-Painter Robot Brain” completed its first masterpiece. What began as a simple coding exercise evolved into a profound exploration of art, logic, and the very nature of creation itself.

This isn’t just an “art generator.” It’s a small-scale model of a universe, born from a single moment, unfolding its entire complex existence over a million perfect, logical strokes.

The Genesis: Building the Robot’s Mind

My goal was to design a Python-based “robot brain” that could autonomously generate abstract digital artwork. It needed a canvas, a set of tools, and a way to make “creative” decisions.

  1. The Digital Canvas: Using Pygame, I set up an 800x600 pixel window. This was the empty void awaiting creation.
  2. The Robot's Toolkit: I endowed my robot with:
    • A Color Palette: A carefully chosen selection of blues, yellows, oranges, and reds.
    • Diverse Shapes: It could draw circles, rectangles, and polygons ranging from 3 (triangles) to 13 sides.
    • The Power of Transparency: Each shape could be semi-transparent, allowing for rich, layered effects.
    • Gradients: A key feature, giving shapes a dynamic, flowing look as they transition between two randomly chosen colors.
  3. The "Big Bang" — A Unique Seed: This was crucial. Instead of simple pseudo-randomness, I tapped into my computer's hardware entropy pool using os.urandom. This meant that the very first decision — the "seed" for all subsequent random choices — was a unique snapshot of my computer's internal activity at that precise moment. Every time the script runs, a new "universe" is born, guaranteed to be different.
  4. The Laws of Physics: The core of the robot's "brain" consisted of simple, deterministic functions. It would randomly choose a color (or two for a gradient), a position on the canvas, a size, and a shape type. If it chose a polygon, it would randomly select its number of sides. (A condensed, runnable sketch of these rules follows right after this list.)
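
Here is that brain condensed to a sketch that runs in seconds: 50 strokes instead of a million, no gradients, and a shortened log. The palette, sizes, and file names are illustrative choices rather than the exact values from my script:

import os
import random

import pygame

# The "Big Bang": seed the random generator from hardware entropy
seed = os.urandom(16)
random.seed(seed)

pygame.init()
WIDTH, HEIGHT = 800, 600
canvas = pygame.Surface((WIDTH, HEIGHT))      # off-screen canvas instead of a window
canvas.fill((20, 20, 30))

PALETTE = [(30, 60, 160), (240, 200, 60), (230, 120, 40), (200, 40, 40)]

with open("story.txt", "w") as log:
    log.write(f"seed: {seed.hex()}\n")
    for stroke in range(1, 51):               # 50 strokes here instead of 1,000,000
        color = random.choice(PALETTE) + (random.randint(40, 160),)   # RGBA, semi-transparent
        x, y = random.randrange(WIDTH), random.randrange(HEIGHT)
        size = random.randint(10, 120)
        shape = random.choice(["circle", "rectangle", "polygon"])

        layer = pygame.Surface((WIDTH, HEIGHT), pygame.SRCALPHA)      # transparent layer
        if shape == "circle":
            pygame.draw.circle(layer, color, (x, y), size)
        elif shape == "rectangle":
            pygame.draw.rect(layer, color, pygame.Rect(x, y, size, size))
        else:
            sides = random.randint(3, 13)
            points = [(x + random.randint(-size, size), y + random.randint(-size, size))
                      for _ in range(sides)]
            pygame.draw.polygon(layer, color, points)
        canvas.blit(layer, (0, 0))                                    # layer it onto the canvas

        log.write(f"{stroke}: {shape} at ({x}, {y}), size {size}, color {color}\n")

pygame.image.save(canvas, "masterpiece.png")

Crank the stroke count back up to 1,000,000, add the gradient shapes, and log every field, and you are back to the full two-hour version.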

The Act of Creation: A Million Strokes

Once initialized, the robot began its work. The process was set to run for 1 million strokes. For over two hours, this autonomous artist diligently layered shapes, colors, and gradients onto the digital canvas.

Each decision, each placement, each color choice was a direct, logical consequence of that initial "Big Bang" seed. There was no human intervention, no second-guessing, just the relentless, perfect execution of its programmed laws.

The final artwork, a dense tapestry of overlapping forms and colors, is a visual record of this entire journey.



The Unveiling: Art, Logic, and Perfect Provenance

Upon completion, the robot’s work wasn’t just a single image. It delivered two profound artifacts:

  1. The Masterpiece (.png): The final abstract image itself.
  2. The “Story” (.txt): A meticulously detailed log file. This file records every single one of the million strokes, detailing its number, shape type, exact position, size, whether it was a gradient, its specific colors, and if applicable, its number of sides.

This is where the true significance of the project resonated with me.

Why This Isn’t “Just an Auto-Painter”

  • A “Perfect” Universe in a Bottle: This project functions as a self-contained, deterministic universe. From that singular “Big Bang” seed, its entire existence (the 1 million strokes) was pre-ordained. What appears chaotic to the human eye is, from a logical standpoint, a flawless, inevitable unfolding of events. There were no mistakes, no second thoughts — just pure, perfect execution of its foundational laws.
  • The Translation of Art into Logic: I didn’t paint this image. I built a system that understood how to paint based on my rules. I translated my artistic intuition (what makes good composition, pleasing colors, interesting forms) into pure logic. The robot became a proxy for my creative process, automating the very act of artistic generation itself.
  • The Ultimate Provenance: Every piece of art has a story. This robot generated its own. The log file is the complete, verifiable “artist’s statement,” a diary of every single creative decision made. It doesn’t just show the finished product; it provides the entire history of its creation, proving its unique origin and validating every stroke’s “intentionality” within its own system.

This project redefined my understanding of art. It’s not just about the final image, but the elegance of the system that created it. It’s a testament to the beauty of logic, the power of algorithms, and the profound parallels between a coded process and the very universe we inhabit — a single starting point, unfolding into a complex, perfect, and unrepeatable reality.
