Ollama on RTX 3060 12GB
The RTX 3060 12GB is a strong entry point for local LLMs when you treat VRAM as a shared budget rather
than a fixed fit-or-no-fit threshold. Most slow setups fail because context was set too high, not because
the model family was inherently wrong.
Practical sweet spot: 7B to 9B models are easy to keep fast, while 13B to 14B can work well with tighter
context and strict headroom.
The most common pattern on this card is: a setup feels great on day one, then slows down once chat history
grows, a second tab opens, or an API call runs in parallel. That is usually a context and concurrency issue,
not a model-quality issue.
12GB Memory Budget Model
In practical terms, you are balancing three levers at once:
- Increase model size, and context headroom shrinks.
- Increase context, and concurrency tolerance shrinks.
- Keep both high, and offload risk jumps.
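The three levers above can be sketched as a crude fit-or-spill check. All GiB figures are illustrative assumptions for a typical 8B Q4 download, not measurements of any specific model:

```python
# Back-of-envelope VRAM budget for a 12 GB card. The figures are rough
# assumptions: ~4.9 GiB for 8B Q4 weights, ~0.5 GiB KV cache at 4K context,
# ~1 GiB reserved for CUDA/driver/desktop overhead.

def fits_in_vram(weights_gib, kv_cache_gib, overhead_gib=1.0, vram_gib=12.0):
    """Return (fits, headroom_gib) for a rough fit-or-spill check."""
    used = weights_gib + kv_cache_gib + overhead_gib
    return used <= vram_gib, vram_gib - used

# An 8B Q4 model at 4K context leaves comfortable headroom:
fits, headroom = fits_in_vram(weights_gib=4.9, kv_cache_gib=0.5)
print(fits, round(headroom, 1))  # True, several GiB to spare

# A 14B Q4 (~8.5 GiB) with a large cache (~3 GiB) blows the budget:
print(fits_in_vram(weights_gib=8.5, kv_cache_gib=3.0)[0])  # False
```

Moving any one lever changes the headroom the other two have to share, which is why the same model can feel fast in one profile and sluggish in another.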
Model Picks That Usually Work Well
If you only change one setting while troubleshooting, change num_ctx. On 12GB it usually has a
larger real-world impact than moving from one 8B family to another.
Context Is the Main Performance Lever
On GPUs with less than 24 GiB of VRAM, Ollama defaults to a 4K context. On 12GB, that default is usually the right first step.
Increase only when your real workload needs it.
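One way to pin that choice down explicitly is a Modelfile, so the context does not drift between runs. The base model tag below is just an example; substitute whatever you actually run:

```
# Modelfile: set context explicitly instead of relying on the default.
FROM llama3.1:8b
PARAMETER num_ctx 4096
```

Build it once with `ollama create my-8b -f Modelfile`, then `ollama run my-8b` always starts with the same context budget.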
Why CPU Spill Feels Like a Cliff
If you must offload, spilling a small slice of the weights is generally less painful than forcing the KV cache
off the GPU. The KV cache is touched at every generated token, so once it leaves VRAM, token cadence can become
visibly stuttery, with bursts and pauses instead of smooth generation.
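A rough estimate shows why the KV cache grows so quickly with context. The architecture numbers below are the published Llama-3-8B shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries); treat them as one example, not a universal constant:

```python
# Rough KV-cache size estimate for a GQA model at fp16 cache precision.
# Defaults match the published Llama-3-8B shape; other models differ.

def kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V each store n_kv_heads * head_dim values per layer per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx * per_token

for ctx in (4096, 8192, 16384):
    print(ctx, round(kv_cache_bytes(ctx) / 2**30, 2), "GiB")
# 4096  -> 0.5 GiB
# 8192  -> 1.0 GiB
# 16384 -> 2.0 GiB
```

Doubling the context doubles the cache, and on a 12GB card that doubling comes straight out of the headroom that keeps everything on the GPU.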
Concurrency Multiplies Context Allocation
A stable single-chat profile can become unstable instantly when you run two or four sessions in parallel.
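The reason is that each parallel session (for example under `OLLAMA_NUM_PARALLEL`) gets its own context allocation, so KV-cache cost scales linearly with session count. A small sketch, using the same illustrative GiB assumptions as above:

```python
# How many parallel sessions fit before the per-session KV caches alone
# exhaust the budget. Figures are illustrative assumptions: weights for
# an 8B Q4 (~4.9 GiB) vs a 14B Q4 (~8.5 GiB), ~1 GiB overhead reserved.

def max_parallel(vram_gib, weights_gib, per_session_kv_gib, overhead_gib=1.0):
    """Sessions that fit after weights and overhead are paid once."""
    free = vram_gib - weights_gib - overhead_gib
    return max(0, int(free // per_session_kv_gib))

# 8B Q4 with ~1 GiB of KV cache per session on a 12 GB card:
print(max_parallel(12.0, 4.9, 1.0))  # 6
# 14B Q4 with the same per-session cache:
print(max_parallel(12.0, 8.5, 1.0))  # 2
```

Weights are paid once, but context is paid per session, which is why a profile that is stable alone can spill the moment a second client connects.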
Three Practical Profiles
- Fast daily driver: 8B model, Q4, num_ctx=4096, single session.
- Coding step-up: 14B code model, Q4, num_ctx=4096, no background GPU-heavy apps.
- Long-session mode: 8B or 9B model, num_ctx=8192, low parallelism.
On Windows desktops, keep extra margin for VRAM overhead. If you are pushing limits, minimal Linux setups
typically provide more predictable usable headroom.
Practical Setup Rules
- Start with 8B to 9B at Q4 for daily speed.
- Use 14B only with moderate context and clear headroom.
- On Windows, leave extra VRAM margin for desktop overhead.
- If throughput collapses, reduce num_ctx first.
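That last rule can be mechanized as a crude helper: walk num_ctx down until the estimated KV cache fits the free VRAM. The cache-per-4K figure is an assumption you would measure for your own model (roughly 0.5 GiB per 4K tokens for many 8B models at fp16 cache precision):

```python
# Crude "reduce num_ctx first" helper. kv_gib_per_4k is an assumed,
# model-specific cost you would measure; the default suits many 8Bs.

def shrink_ctx(free_vram_gib, kv_gib_per_4k=0.5, start_ctx=16384, floor=2048):
    """Halve the context until its estimated KV cache fits free VRAM."""
    ctx = start_ctx
    while ctx > floor and (ctx / 4096) * kv_gib_per_4k > free_vram_gib:
        ctx //= 2
    return ctx

# With ~1.2 GiB of free VRAM, a 16K ambition settles at 8K:
print(shrink_ctx(free_vram_gib=1.2))  # 8192
```

The point is the order of operations: halving context frees VRAM immediately and predictably, while swapping model families changes several variables at once.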