2026-04-07 16:46:05
Most beginners use <ul> and <ol> interchangeably and wonder why their pages feel broken. There is one nesting mistake in particular that silently destroys screen reader accessibility — and almost nobody talks about it. This post covers the five biggest HTML list errors and how to fix them fast.
You learned HTML lists in about 20 minutes. Slap some <li> tags inside a <ul>, done. Job finished. Ship it.
Except users are bouncing. Your content feels cluttered. And a screen reader is announcing your navigation menu like it is reading a legal disclaimer at 1.5x speed.
Sound familiar? You are not alone. HTML lists for beginners look deceptively simple on the surface. Underneath, there are five specific mistakes that quietly strangle your user experience before anyone even reads your content.
Let us fix that.
Before diving into mistakes, here is what most tutorials skip: HTML lists are not a visual tool. They are a semantic structure tool. When you wrap content in <ul> or <ol>, you are telling the browser, the search engine, and the screen reader: these items belong together and have a relationship.
That matters enormously for SEO and accessibility.
<!-- This is just visual formatting -->
<p>• Eggs • Milk • Flour</p>
<!-- This is semantic structure -->
<ul>
<li>Eggs</li>
<li>Milk</li>
<li>Flour</li>
</ul>
Google can parse the second one. A screen reader can announce it properly. Your future self can maintain it without crying.
Using <ul> When Order Actually Matters
This is the most common HTML list mistake beginners make. Unordered lists (<ul>) are for items where sequence is irrelevant — pizza toppings, feature lists, ingredient collections.
Ordered lists (<ol>) are for steps, rankings, and sequences where position has meaning.
<!-- Wrong: Steps presented as unordered -->
<ul>
<li>Add the eggs</li>
<li>Preheat the oven</li>
<li>Mix the flour</li>
</ul>
<!-- Right: Sequential steps use ol -->
<ol>
<li>Preheat the oven to 180C</li>
<li>Mix the flour and eggs</li>
<li>Bake for 25 minutes</li>
</ol>
Why does this matter for UX? Because <ol> auto-numbers your items. Add a step in the middle and the numbers update automatically. No manual editing. No numbering errors. Screen readers also announce positions like "2 of 5," which guides users through processes clearly.
Nested lists are where beginners either shine or completely fall apart. The rule almost nobody mentions: nested lists must live inside an <li> element, not directly inside <ul> or <ol>.
<!-- Wrong: List nested directly inside ul -->
<ul>
<li>Monday</li>
<ul>
<li>9AM Meeting</li>
</ul>
</ul>
<!-- Right: Nested list lives inside the li -->
<ul>
<li>Monday
<ul>
<li>9AM Meeting</li>
<li>10AM Coffee (mandatory)</li>
</ul>
</li>
</ul>
The wrong version technically renders in most browsers. It also produces invalid HTML, confuses assistive technology, and causes unpredictable styling behavior across browsers. Valid structure is not optional when you care about real users.
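If you want to catch invalid nesting in a build step rather than by eye, a small checker is enough. Here is a minimal sketch using Python's standard-library html.parser; it assumes tags are explicitly closed, so treat it as a rough lint, not a full validator:

```python
from html.parser import HTMLParser

class ListNestingChecker(HTMLParser):
    """Flags <ul>/<ol> elements placed directly inside another list
    instead of inside an <li>. Assumes tags are explicitly closed."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.violations = []

    def handle_starttag(self, tag, attrs):
        # A list opening while a list is on top of the stack is the bug
        if tag in ("ul", "ol") and self.stack and self.stack[-1] in ("ul", "ol"):
            self.violations.append(f"<{tag}> nested directly inside <{self.stack[-1]}>")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

def check(html):
    checker = ListNestingChecker()
    checker.feed(html)
    return checker.violations

print(check("<ul><li>Monday</li><ul><li>9AM</li></ul></ul>"))   # one violation
print(check("<ul><li>Monday<ul><li>9AM</li></ul></li></ul>"))   # []
```

Run it over your templates and the invalid pattern from the example above is flagged immediately, while the correct version passes clean.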
Here is the one that catches experienced beginners off guard. When you set list-style: none in CSS to remove bullets for navigation menus, Safari and VoiceOver actually stop announcing the element as a list.
You killed the semantic meaning with one line of CSS.
The fix is a single attribute:
<ul role="list" style="list-style: none;">
<li>Home</li>
<li>About</li>
<li>Contact</li>
</ul>
Adding role="list" restores the accessibility announcement. Most tutorials on HTML lists for beginners never mention this. It is one of the most impactful fixes you can make in 10 seconds.
Using <br> Tags Instead of Proper List Structure
This one is a habit carried over from early HTML days. Beginners sometimes fake lists using line breaks:
<!-- Please do not do this -->
<p>
Feature 1<br>
Feature 2<br>
Feature 3
</p>
<!-- Do this instead -->
<ul>
<li>Feature 1</li>
<li>Feature 2</li>
<li>Feature 3</li>
</ul>
The <br> version gives you zero semantic value. Search engines cannot identify list items. Screen readers cannot count items or navigate between them. It is the HTML equivalent of writing an Excel spreadsheet in Notepad.
Nesting lists is powerful. But if you find yourself four levels deep, the problem is almost never that your content is genuinely that complex. It is a sign that your information architecture needs rethinking, not more nesting.
A good rule of thumb: if you are going past two levels of nesting, consider whether some of those sub-items deserve their own section with a heading instead.
Use <ol> for sequential steps, <ul> for unordered collections.
Nest lists inside <li> elements, never directly inside <ul> or <ol>.
Add role="list" when removing bullets with CSS to preserve screen reader behavior.
Never fake list structure with <br> tags.
Keep nesting to two levels; past that, reach for headings instead.
These five fixes will immediately improve your HTML structure, accessibility scores, and user experience. But there is one more technique covered in the full post — customizing list bullets beyond basic CSS — that completely transforms how your lists look and behave across devices.
Want the complete guide with interactive examples and the full bullet customization walkthrough? Read the full post at Drive Coding: https://drivecoding.com/html-lists-5-critical-mistakes-killing-your-ux/
Originally published at Drive Coding
2026-04-07 16:41:51
A lot of modern software architecture—microservices, event-driven systems, CQRS—is not born from deeply understanding the domain. It is what teams reach for when the existing application has become a mess: nobody really knows what’s happening where anymore, behavior is unpredictable, and making changes feels risky and expensive. Instead of asking “What does this concept actually mean and where does it truly belong?”, they ask “How do we split this?”
That is where a lot of modern architecture begins.
Not in necessity.
Not in insight.
But in the growing discomfort of trying to manage software that was never modeled well in the first place.
And because the resulting system still runs in production, the cost of that move often remains invisible for years.
That is one of the most expensive traps in software.
A lot of developers today are highly fluent in frameworks.
They know how to build controllers, services, repositories, DTOs, entities, integrations, and configuration.
From the outside, that often looks like competence.
But that kind of fluency can be deeply misleading.
Because building software out of familiar framework-shaped parts is not the same thing as designing software well.
The real questions are different:
What is the actual business concept here?
What belongs together?
What behavior is intrinsic to the domain?
What is a real boundary, and what is just an implementation detail?
What rules should be explicit in the model rather than implied by orchestration?
Real domain modeling is not about applying a catalog of patterns. It is the disciplined, often uncomfortable work of discovering what belongs together, what behavior is intrinsic, and expressing those concepts as clearly and cohesively as possible—whether that lives in modules, functions, or simple objects. The goal is conceptual integrity, not architectural ceremony.
Without those questions, software tends to take on a very predictable shape: fat service classes, anemic entities, persistence-first design, procedural workflows, business logic smeared across layers.
The code works. The endpoints return data. The database persists state.
But the system has not really been designed.
It has been assembled.
And that difference matters far more than most teams realize.
The cost of poor design does not usually show up immediately. At first, the system still feels manageable. A few controllers. A few services. A few repositories. Everything is still “clean.”
But over time, something starts to happen. Business rules accumulate. Exceptions pile up. New requirements interact with old assumptions. Concepts that looked simple turn out to be related in ways the software never captured.
And because there is no strong domain model holding those concepts together, the complexity has nowhere coherent to go. So it leaks—into service methods, orchestration flows, integration glue, persistence logic, special-case conditionals, “helper” abstractions, and coordination code.
At that point, the team starts feeling something very real:
Nobody understands the whole thing anymore.
And that is the crucial moment.
Because once a system becomes cognitively overwhelming, the team has two options:
Reduce the complexity by improving the model.
Reduce the scope of the confusion by splitting it apart.
A lot of teams choose Option B.
This is where architecture often stops being a design choice and starts becoming a coping mechanism.
When the internal model is weak, teams still need some way to create order. And distribution gives them one.
So they introduce microservices, event-driven architecture, CQRS, separate read models, ownership boundaries, queues, and asynchronous coordination.
Distribution, CQRS, and event-driven architecture can have legitimate uses in rare cases of extreme scale or unavoidable organizational boundaries. But in the vast majority of systems, they are not introduced because the domain demands them. They are introduced because the internal model is too weak to provide clarity. What looks like sophisticated architecture is often just confusion hiding behind cleaner service boundaries.
What they are really doing is this:
They are trying to create externally, through distribution, the boundaries they failed to create internally, through design.
And that can work. At least for a while.
A smaller service does feel easier to understand than a large monolith. A separate read model does reduce some friction. A queue does create some local decoupling.
But none of that means the software has become conceptually better. It often just means the confusion has been sliced into smaller containers.
That trade is where the real damage happens.
Because distribution absolutely can create local context. A team can say, “This service owns billing.” And that does help.
But it is a much weaker form of clarity than a real domain model. A service boundary can tell you where code lives. A good model can tell you what something is, what it means, what rules govern it, what its lifecycle is, and what relationships are essential.
Those are very different levels of understanding.
And when teams use distribution to manufacture context, they often gain short-term manageability at the cost of long-term agility. Because now the system starts paying the distribution tax: network failure, eventual consistency, contract drift, duplicated concepts, duplicated logic, coordination overhead, deployment complexity, operational burden, and fractured causality.
And perhaps most importantly: lost refactorability.
When the model is strong and cohesive, changing your mind usually means a local refactor—sometimes even a delightful collapse of concepts. When boundaries have been hardened into services, the same insight triggers contracts, versioning, migration scripts, and cross-team coordination. The cost of learning is no longer paid in thought, but in infrastructure and politics.
And in software, changing your mind is not a failure. It is the job.
This is where badly structured software reveals itself. Not when it is first deployed. Not when the first endpoints work. Not when the dashboards are green. But when the business itself becomes better understood.
Because that is what always happens. Sooner or later, the business learns: these two concepts are actually one thing, this workflow was modeled incorrectly, this rule has important exceptions, this distinction is more important than we thought, or this process should not exist at all.
That is normal. That is what software is supposed to accommodate.
A coherent domain model makes that kind of change survivable. A fragmented, distributed, weakly modeled system makes it expensive.
Note that “coherent domain model” here does not mean the tactical patterns that became associated with DDD—entities, repositories, aggregates, and the rest. Those often added their own accidental complexity. Real modeling is simpler and deeper: it is the ongoing work of refining ubiquitous language and discovering natural conceptual boundaries so that new business insight can be absorbed with minimal violence to the existing code.
Because now the insight has to travel through APIs, queues, read models, event contracts, deployment boundaries, ownership lines, duplicated rules, and partial consistency guarantees. What should have been a conceptual refactor becomes a cross-system negotiation.
And that is where the bill arrives. Not because the domain was inherently impossible. But because the architecture froze yesterday’s misunderstandings into today’s structure.
That is one of the worst things software can do.
The most dangerous part is that this kind of architecture often looks successful. The system runs. Users use it. The company makes money. So the architecture gets treated as validated.
But “it works” is one of the weakest standards in software. A system running in production proves only that it is viable enough to survive. It does not prove that it is cheap to change, conceptually sound, structurally coherent, or good at absorbing new understanding.
Most teams never get to experience how different software feels when:
Concepts have a single, obvious home instead of being smeared across services
Rules are explicit and enforceable rather than scattered in orchestration and glue code
New business understanding leads to a clean refactor instead of distributed coordination
The system invites insight instead of resisting change
Without that contrast, the pain of weak modeling hidden behind distribution gets normalized as “just how complex software is.”
Often, it is not. Often, it is just the cost of weak design hidden behind architecture.
Much of today’s distributed architecture is not the result of domain insight. It is compensation for the conceptual clarity that was never built into the model. By reaching for separation instead of deeper understanding, teams gain local manageability at the expense of long-term coherence and cheap evolution.
The problem is that the original lack of clarity doesn’t disappear — it just gets distributed. In the end, the same confusion that made the monolith unmaintainable will make the distributed system fail just as hard, only now it’s far more expensive and painful to fix.
This is why so much “sophisticated” architecture is, in truth, just sophisticated coping.
2026-04-07 16:35:55
A developer’s guide to labels, context changes, error handling, and prevention based on the DHS Section 508 Trusted Tester conformance process.
Labels are layered: Visual presence (5.A.), descriptive quality (5.B.), and programmatic association (5.C.)
Context changes require warning: Unexpected behavior disorients users, but warnings make changes acceptable.
Errors need both identification and remediation: knowing something is wrong is only the first step; users also need to know how to fix it.
Important submissions demand safeguards: legal, financial, and data-modifying forms require at least one of the following — a review step before submission, a way to reverse the submission, or error checking that lets the user correct mistakes before final submission.
A cheat sheet for all Trusted Tester testing and evaluation steps is also in the Resources Drive folder.
Test IDs 5.A.–5.C. cover form labels, splitting them into three smaller questions: Is the label visible? Does it make sense? Is it programmatically correct?
Note that the Trusted Tester process builds on WCAG 2.0. Because of this, I’ve included comparisons between WCAG 2.0 and 2.2 in each study group session. For all WCAG success criteria cited in relation to the topic 5 forms test IDs, neither the content nor the level has changed between 2.0 and 2.2, so I link only to 2.2.
Every form field must have a visual label or instruction that remains visible, even when the field receives focus. So no, placeholder text that disappears on focus does not count as a label. But table headers can count!
Note that this test only checks presence, not accuracy. For 5.A. we only check whether a label is there, not whether it makes sense.
Placeholder text is not a label. If it disappears on focus, it fails 5A.
Labels must sufficiently describe the purpose and data requirements of each field. Users should know what data format is expected (phone number format, date format, etc.), and required field indicators must be clear. For both, clarification in text is also acceptable. Button labels must describe the function clearly, and error messages alone cannot substitute for upfront label clarity.
5.C.: Programmatic Labels (WCAG 1.3.1 Info and Relationships, WCAG 4.1.2 Name, Role, Value)
This is where we go beyond what’s visible on the screen: Form elements must have programmatic associations that assistive technologies can read. To determine this, the Trusted Tester process uses the ANDI tool: ANDI output must include all relevant instructions and cues. For general web accessibility testing purposes, you can also use the developer tools of your browser to inspect the label. Accessible names don’t need to match visual labels word for word, but they must still be adequate.
Radio buttons and checkboxes must be programmatically associated with the questions they are supposed to answer. In contrast, dropdown options themselves are not part of the form field description.
In the study group, our fail example was a form with multiple phone number input boxes (spaced to mimic the US phone number format of 3-3-4 digits) all reading as “phone” without distinguishing the segments for area code (3 digits), prefix (3 digits) and line number (4 digits).
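You can catch the crudest version of this failure before ANDI ever enters the picture. Here is a rough, hypothetical Python heuristic (not part of the Trusted Tester process) that flags inputs with no aria-label and no <label for> pointing at their id; it deliberately ignores aria-labelledby, title, and wrapping <label> elements:

```python
from html.parser import HTMLParser

class LabelChecker(HTMLParser):
    """Collects <label for="..."> targets and <input> attributes,
    so unlabeled inputs can be reported. A rough heuristic only."""
    def __init__(self):
        super().__init__()
        self.label_targets = set()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label" and "for" in attrs:
            self.label_targets.add(attrs["for"])
        elif tag == "input":
            self.inputs.append(attrs)

def unlabeled_inputs(html):
    checker = LabelChecker()
    checker.feed(html)
    # An input passes if it has aria-label or a <label for> match on its id
    return [a for a in checker.inputs
            if "aria-label" not in a and a.get("id") not in checker.label_targets]

form = '<label for="area">Area code</label><input id="area"><input id="prefix">'
print(unlabeled_inputs(form))  # only the "prefix" input is flagged
```

In the phone-number example from the study group, all three boxes would surface here with identical or missing accessible names, which is exactly the signal you want before running the formal test.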
The next 2 Test IDs deal with what happens when users interact with form elements, and making sure they’re never surprised by unexpected behavior.
Changing form field values should not trigger unexpected changes of context.
To the diligent study group regular, this may sound familiar! We had a very similar test in the last session on Keyboard Access and Focus (in, well, the focus part). The concept is the same; only the test ID name differs, and the scope is limited to form elements instead of every keyboard-accessible element on the page.
Users must be warned before a form automatically submits, new windows open, focus shifts, or the page redirects. “Unexpected” is the key word here: expected changes with warnings are acceptable.
The iconic fail example from the study group was a radio button selection for our birth year that immediately redirects to a Wikipedia page, without warning.
Users must be notified of any form-related changes on the same page.
Our fail example: A ticket availability message appearing without live regions, focus movement, or any dialog notification. How are we supposed to know the show’s already sold out then?
The last three test IDs ensure errors are not only detected but also effectively communicated and corrected.
Automatically detected errors must be identified and described in text, not just with color or icons. The specific field in error must be identified: no vague “invalid input”, be precise about which field is causing trouble.
Common failure: The error message consisting of only a red “X” icon or red outline without accompanying text description. Can we, as web design and development professionals, please move on already from sensory-dependent error handling? It has been plaguing UI accessibility for so long.
Note that this test does not require correction suggestions (that’s a spoiler for 5.G.)
We have the error. Now what? When errors are detected, guidance must be provided on how to correct them. Provide suggestions on how to remedy the error, unless they would jeopardize security or purpose.
How could a suggestion jeopardize security or purpose? Great question! While most errors live outside our happy path, in certain scenarios they are part of the game. For example, if you are playing a guessing game, suggesting the correct answer would defeat the point.
Online tests and exams are also exceptions because automated error suggestions here would go against the whole concept of testing your knowledge!
Pass Examples: Password requirements that are shown upfront, or specifying acceptable ranges for input (e.g. “input a number between 1 and 10” for customer satisfaction surveys).
Fail Example: A form asked us to input our work experience and simply answered with “Error: 5 is not acceptable” without specifying what unit or format is expected.
Important transactions (we’re talking about legal, financial, and data-modifying context in particular) must allow review, confirmation, or reversal.
2026-04-07 16:35:45
We need to talk about the absolute junk flooding our feeds. If I see one more "text-in-text-out wrapper" masquerading as a technical breakthrough, I'm going to lose it.
The landscape has shifted entirely in 2026. We are no longer querying standalone endpoints. If your stack isn't utilizing Model Context Protocol (MCP) to fetch real-time datastore context, or leveraging Swarm Orchestration to run parallel execution threads, you're building legacy software. 
Most of you are wasting hours testing tools that operate in total isolation. You drop a PDF in, you get a summary out. That was cool in 2023. Today, we need integrated systems. The problem is finding these components. GitHub Trending is a mess of abandoned repos, and Twitter/X is mostly hype for closed ecosystems. How do you actually find the building blocks for modern architecture?
Look at what the top engineers are actually deploying right now:
This exact frustration is why I built SeekAITool.com. I was sick of sifting through directories filled with obsolete UI layers.
SeekAITool isn't a dump of "prompt interfaces." It's an infrastructure map designed specifically to filter for these exact architectural trends.
We index the actual technical capabilities, not the marketing fluff. It’s a tool stack filter for developers who actually build robust systems.
Here is what a modern connection looks like when hooking up a specialized MCP server (like one you'd find indexed on SeekAITool) straight into your local swarm runner:
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Example: Spinning up a specialized Postgres MCP server found via the "MCP-Ready" filter on SeekAITool
server_params = StdioServerParameters(
    command="npx",
    args=["-y", "@seekaitool/postgres-mcp", "postgresql://dev:pass@localhost/main_db"]
)

async def init_swarm_node():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Instantly exposing database schema and context to the reasoning engine
            resources = await session.list_resources()
            print(f"[+] Swarm Node initialized. Attached context modules: {resources}")
            # Ready for multi-agent vibe coding based on live DB state

asyncio.run(init_swarm_node())
2026-04-07 16:34:07
Accessibility isn't a feature you add at the end — it's a quality signal that correlates with better code overall. Well-structured, accessible apps tend to have better semantics, better keyboard behavior, and better performance. Here's the practical implementation.
Every accessibility problem that seems hard is often just a semantic HTML problem:
// Bad -- div soup, no semantics
<div className='header'>
  <div className='nav'>
    <div onClick={handleClick}>Home</div>
    <div onClick={handleClick}>About</div>
  </div>
</div>

// Good -- semantic HTML, keyboard accessible automatically
<header>
  <nav aria-label='Main navigation'>
    <ul>
      <li><a href='/'>Home</a></li>
      <li><a href='/about'>About</a></li>
    </ul>
  </nav>
</header>
Landmark elements (<header>, <nav>, <main>, <aside>, <footer>) give screen reader users a way to jump between sections.
ARIA attributes add semantic meaning when HTML alone isn't enough. But they don't add behavior — you must implement keyboard interaction yourself:
// Custom dropdown -- requires ARIA + keyboard handling
import { useId, useState } from 'react'

function Dropdown({ label, options, value, onChange }) {
  const [isOpen, setIsOpen] = useState(false)
  const listboxId = useId()

  return (
    <div>
      <button
        aria-haspopup='listbox'
        aria-expanded={isOpen}
        aria-controls={listboxId}
        onClick={() => setIsOpen(!isOpen)}
        onKeyDown={(e) => {
          if (e.key === 'Escape') setIsOpen(false)
          if (e.key === 'ArrowDown') { setIsOpen(true); /* focus first option */ }
        }}
      >
        {value || label}
      </button>
      {isOpen && (
        <ul id={listboxId} role='listbox' aria-label={label}>
          {options.map(opt => (
            <li
              key={opt.value}
              role='option'
              aria-selected={value === opt.value}
              onClick={() => { onChange(opt.value); setIsOpen(false) }}
              tabIndex={0}
            >
              {opt.label}
            </li>
          ))}
        </ul>
      )}
    </div>
  )
}
The rule: if a native HTML element does what you need (<select>, <button>, <a>), use it. ARIA is for custom widgets.
// Dialog -- trap focus inside when open
import { useEffect, useRef } from 'react'

function Dialog({ isOpen, onClose, children }) {
  const dialogRef = useRef<HTMLDivElement>(null)

  useEffect(() => {
    if (!isOpen) return
    // Focus the dialog on open
    dialogRef.current?.focus()

    // Trap focus
    const focusable = dialogRef.current?.querySelectorAll(
      'button, [href], input, select, textarea, [tabindex]:not([tabindex="-1"])'
    )
    const first = focusable?.[0] as HTMLElement
    const last = focusable?.[focusable.length - 1] as HTMLElement

    const trap = (e: KeyboardEvent) => {
      if (e.key !== 'Tab') return
      if (e.shiftKey ? document.activeElement === first : document.activeElement === last) {
        e.preventDefault()
        ;(e.shiftKey ? last : first)?.focus()
      }
    }
    document.addEventListener('keydown', trap)
    return () => document.removeEventListener('keydown', trap)
  }, [isOpen])

  return (
    <div
      ref={dialogRef}
      role='dialog'
      aria-modal='true'
      tabIndex={-1}
      onKeyDown={(e) => e.key === 'Escape' && onClose()}
    >
      {children}
    </div>
  )
}
WCAG AA requires 4.5:1 contrast for normal text, 3:1 for large text:
// Check contrast in your Tailwind config
// Gray-500 on white: 3.95:1 -- FAILS AA for normal text
// Gray-700 on white: 8.59:1 -- passes AAA
// Don't rely on color alone to convey information
// Bad: red = error (invisible to colorblind users)
// Good: red + icon + error message text
<div className='flex items-center gap-2 text-red-600'>
  <AlertCircle className='w-4 h-4' aria-hidden='true' />
  <span role='alert'>Email is invalid</span>
</div>
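Those contrast numbers come from WCAG's relative-luminance definition, so you can verify a palette programmatically instead of trusting comments in a config. A small Python checker implementing the WCAG 2.x formula (sRGB linearization, then (L1 + 0.05) / (L2 + 0.05)):

```python
def _channel(c8):
    # Linearize one 8-bit sRGB channel per the WCAG relative-luminance definition
    c = c8 / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def luminance(hex_color):
    r, g, b = (int(hex_color.lstrip("#")[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _channel(r) + 0.7152 * _channel(g) + 0.0722 * _channel(b)

def contrast(fg, bg):
    # Contrast ratio: lighter luminance over darker, each offset by 0.05
    l1, l2 = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

print(round(contrast("#000000", "#ffffff"), 2))  # 21.0 (the maximum)
print(round(contrast("#6b7280", "#ffffff"), 2))  # Tailwind gray-500 on white
```

Wire a function like this into a design-token test and a failing gray never ships: assert that every text/background pair clears 4.5 (or 3.0 for large text).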
npm install -D @axe-core/react jest-axe
import { render } from '@testing-library/react'
import { axe, toHaveNoViolations } from 'jest-axe'

expect.extend(toHaveNoViolations)

it('has no accessibility violations', async () => {
  const { container } = render(<LoginForm />)
  const results = await axe(container)
  expect(results).toHaveNoViolations()
})
Run axe in CI. Catch regressions before they reach users.
The AI SaaS Starter at whoffagents.com ships with semantic HTML structure, proper ARIA labels on all interactive elements, and axe-core integrated into the test suite. $99 one-time.
2026-04-07 16:33:46
Your GPU has 8 GB of VRAM. The model you want to run needs 14 GB. What now?
This is the most common wall people hit when running LLMs locally. Cloud APIs don't care about your hardware — local inference does. Understanding VRAM is the difference between smooth 40 tok/s responses and your system grinding to a halt.
I've spent months optimizing local AI setups and building tools around Ollama. Here's everything I've learned about making large models fit on consumer hardware.
When you load a model into your GPU, every single parameter needs to live in VRAM during inference. A 7B parameter model in full FP16 precision needs roughly:
7 billion × 2 bytes = ~14 GB VRAM
That's already more than most consumer GPUs. An RTX 4060 has 8 GB. An RTX 4070 has 12 GB. Even an RTX 4090 tops out at 24 GB.
So how do people run 70B models on a single GPU? Quantization.
Quantization reduces the precision of model weights. Instead of 16 bits per parameter, you use 4 or 8 bits. Here's the practical breakdown:
| Quant Level | Bits/Param | 7B Model Size | 13B Model Size | 70B Model Size |
|---|---|---|---|---|
| FP16 | 16 | ~14 GB | ~26 GB | ~140 GB |
| Q8_0 | 8 | ~7.5 GB | ~14 GB | ~70 GB |
| Q6_K | 6 | ~5.5 GB | ~10.5 GB | ~54 GB |
| Q5_K_M | 5 | ~4.8 GB | ~9 GB | ~48 GB |
| Q4_K_M | 4 | ~4.1 GB | ~7.5 GB | ~40 GB |
| Q3_K_M | 3 | ~3.3 GB | ~6 GB | ~32 GB |
| Q2_K | 2 | ~2.7 GB | ~5 GB | ~26 GB |
The sweet spot for most people: Q4_K_M. You lose minimal quality compared to FP16 while cutting memory usage by 75%.
The model weights aren't the only thing eating your VRAM. You also need memory for the KV cache (which grows with context length), CUDA overhead, and whatever the OS reserves on the GPU.
Real formula:
Total VRAM needed = Model weights + KV cache + CUDA overhead + OS reservation
Example for Llama 3.1 8B Q4_K_M, 8K context:
~4.1 GB + ~0.8 GB + ~0.4 GB + ~0.3 GB = ~5.6 GB
This is why a 4 GB quantized model doesn't actually run on a 4 GB GPU.
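The back-of-the-envelope math is easy to script. A minimal Python sketch using the overhead figures from the example above (the defaults are ballpark assumptions, not measurements; your KV cache in particular scales with context length):

```python
def model_weights_gb(params_billions, bits_per_param):
    """Weights only: parameters x bits per parameter, converted to GB."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

def total_vram_gb(weights_gb, kv_cache_gb=0.8, cuda_overhead_gb=0.4, os_reserved_gb=0.3):
    """Weights + KV cache + CUDA overhead + OS reservation (ballpark defaults)."""
    return weights_gb + kv_cache_gb + cuda_overhead_gb + os_reserved_gb

print(f"7B at FP16: ~{model_weights_gb(7, 16):.1f} GB of weights")  # ~14.0
print(f"8B Q4_K_M total: ~{total_vram_gb(4.1):.1f} GB on the GPU")  # ~5.6
```

Plug in your GPU's capacity and you can rule models in or out before downloading a single gigabyte.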
Ollama handles most of this automatically, but you can tune it:
# Check which models are loaded and their VRAM usage
ollama ps
# Set context size (lower = less VRAM)
ollama run llama3.1 --ctx-size 4096
# Force CPU-only inference (when GPU VRAM is full)
OLLAMA_GPU_LAYERS=0 ollama run llama3.1
# Partial GPU offloading — put some layers on GPU, rest on CPU
OLLAMA_GPU_LAYERS=20 ollama run mixtral
# Set how long models stay in VRAM (default: 5 min)
OLLAMA_KEEP_ALIVE=10m ollama run llama3.1
# Unload all models from VRAM immediately
curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "keep_alive": 0}'
When a model doesn't fit entirely in VRAM, you split it between GPU and CPU. The key insight: the first and last layers matter most for speed.
# Check total layers in a model
ollama show llama3.1 --modelfile | grep -i layer
# For a 32-layer model on an 8 GB GPU with Q4_K_M:
# Start with 24 GPU layers, adjust based on actual usage
OLLAMA_GPU_LAYERS=24 ollama run llama3.1:8b-q4_K_M
Monitor with nvidia-smi while generating:
# Watch VRAM usage in real-time
watch -n 0.5 nvidia-smi
# Or just the memory line
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
Rule of thumb: If VRAM usage is at 95%+ during generation, reduce GPU layers by 2-3. You want ~500 MB headroom for the KV cache to grow during long conversations.
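Instead of guessing a starting layer count, you can ballpark it. This is a hypothetical helper, not an Ollama feature — just arithmetic that assumes layers are roughly equal in size and reserves the ~500 MB headroom from the rule of thumb above:

```python
def gpu_layers_to_offload(vram_gb, model_size_gb, total_layers, headroom_gb=0.5):
    """Estimate how many layers fit on the GPU, assuming roughly
    equal-sized layers and reserving headroom for KV cache growth."""
    per_layer_gb = model_size_gb / total_layers
    usable_gb = vram_gb - headroom_gb
    return max(0, min(total_layers, int(usable_gb / per_layer_gb)))

# 32-layer Q4_K_M model (~4.1 GB) on an 8 GB GPU: the whole model fits
print(gpu_layers_to_offload(8, 4.1, 32))    # 32
# Same GPU with a ~10.5 GB 13B quant: partial offload
print(gpu_layers_to_offload(8, 10.5, 32))   # 22
```

Treat the result as a starting point, then watch nvidia-smi and nudge the count down if you creep past ~95% usage during long generations.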
Running multiple models simultaneously (say, a chat model + a code model) doubles your VRAM needs. Strategies:
1. Sequential loading with aggressive timeouts:
# Unload after 30 seconds of inactivity
OLLAMA_KEEP_ALIVE=30s
2. Mix model sizes intentionally:
Instead of two 7B models, pair a 7B with a 1.5B:
llama3.1:8b-q4_K_M (~4.1 GB)
qwen2.5:1.5b (~1 GB)
3. CPU offload your secondary model entirely:
# Run the smaller model on CPU while the main model uses GPU
OLLAMA_GPU_LAYERS=0 ollama run qwen2.5:1.5b
Here's a trick I built into Locally Uncensored — when doing A/B comparisons between models, you don't need both loaded simultaneously. The app sends the same prompt sequentially: load Model A, generate, unload, load Model B, generate, display side-by-side.
Sequential comparison is ~2x slower than parallel, but it means you can compare a 13B model against a 7B model on an 8 GB GPU. On a 24 GB card, you could compare two 70B quantized models that would otherwise need 48 GB together.
If you're doing this manually via API:
# Send to model A
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Explain quantum tunneling",
"keep_alive": 0
}'
# Model A unloads, then send to model B
curl http://localhost:11434/api/generate -d '{
"model": "gemma2:9b",
"prompt": "Explain quantum tunneling",
"keep_alive": 0
}'
The keep_alive: 0 is crucial — it tells Ollama to unload immediately after generation.
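The same sequential dance is easy to script. Here's a minimal Python sketch against Ollama's /api/generate endpoint (standard library only; the model names are just the ones from the example and assume you've pulled them):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model, prompt):
    # keep_alive: 0 tells Ollama to unload the model right after generating,
    # freeing VRAM for the next model in the comparison
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": 0}

def generate(model, prompt):
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Usage (requires a running Ollama server):
#   answer_a = generate("llama3.1:8b", "Explain quantum tunneling")
#   answer_b = generate("gemma2:9b", "Explain quantum tunneling")
```

Because each call unloads before the next one loads, the peak VRAM requirement is whichever single model is largest, not the sum of both.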
Based on real-world testing:
4-6 GB VRAM (GTX 1660, RTX 3050):
8 GB VRAM (RTX 4060, RTX 3070):
12 GB VRAM (RTX 4070, RTX 3060 12GB):
16 GB VRAM (RTX 4070 Ti, RTX 5060 Ti):
24 GB VRAM (RTX 4090, RTX 3090):
Save this as vram-watch.sh:
#!/bin/bash
# Monitor VRAM + Ollama loaded models
while true; do
  clear
  echo "=== GPU VRAM ==="
  nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu \
    --format=csv,noheader,nounits | \
    awk -F', ' '{printf "%s: %s/%s MB (GPU: %s%%)\n", $1, $2, $3, $4}'
  echo ""
  echo "=== Loaded Models ==="
  ollama ps 2>/dev/null || echo "Ollama not running"
  echo ""
  echo "Press Ctrl+C to exit"
  sleep 2
done
keep_alive prevents models from squatting on your VRAM.
nvidia-smi and ollama ps are your best friends.
The local AI space is moving fast. Models are getting more efficient (Gemma 2 is wild for its size), quantization methods are improving, and tools like Ollama keep abstracting away the complexity. Understanding VRAM management means you'll always know how to squeeze maximum performance out of whatever hardware you have.
If you want a GUI that handles all of this automatically — model management, VRAM-aware loading, A/B comparisons — check out Locally Uncensored. It's MIT-licensed and built specifically for running local AI without the headaches.
What GPU are you running local models on? Drop your setup in the comments — I'm always curious what hardware people are working with.