Ollama on RTX 3060 12GB

RTX 3060 12GB is a strong local-LLM entry point when you treat VRAM as a shared budget instead of a fixed fit-or-no-fit threshold. Most slow setups fail because context was set too high, not because the model family was inherently wrong.

Practical sweet spot: 7B to 9B models are easy to keep fast, while 13B to 14B can work well with tighter context and strict headroom.

The most common pattern on this card is: a setup feels great on day one, then slows down once chat history grows, a second tab opens, or an API call runs in parallel. That is usually a context and concurrency issue, not a model-quality issue.

12GB Memory Budget Model

| Memory bucket | Behavior | Operational implication |
| --- | --- | --- |
| Model weights | Mostly fixed by model size and quantization | Choose 7B to 14B carefully and reserve headroom |
| KV cache | Scales with context length and is accessed every token | Largest practical performance lever on 12GB |
| System overhead | OS, drivers, desktop apps, and VRAM fragmentation | Usable VRAM is always lower than the card label |

In practical terms, you are balancing three levers at once. If you increase model size, context headroom shrinks. If you increase context, concurrency tolerance shrinks. If you keep both high, offload risk jumps.
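The three buckets can be sketched numerically. This is a rough sanity check, not a measurement: the architecture figures (32 layers, 8 KV heads, head dim 128, fp16 KV cache, roughly matching Llama 3.1 8B) and the ~4.5 bits-per-weight figure for a Q4-class quant are illustrative assumptions, and the 1.5 GiB system overhead is a guess for a desktop setup.

```python
def kv_cache_bytes(num_ctx, n_layers=32, n_kv_heads=8, head_dim=128, elem_bytes=2):
    """Rough fp16 KV-cache size: a K and a V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * elem_bytes * num_ctx

def weight_bytes(params_billion, bits_per_weight=4.5):
    """Rough quantized weight size for a Q4-class quantization."""
    return params_billion * 1e9 * bits_per_weight / 8

GIB = 1024 ** 3
usable = 12 * GIB - 1.5 * GIB          # assume ~1.5 GiB OS/driver overhead
used = weight_bytes(8) + kv_cache_bytes(8192)
print(f"8B weights + 8K-ctx KV: {used / GIB:.1f} GiB of {usable / GIB:.1f} GiB usable")
# -> 8B weights + 8K-ctx KV: 5.2 GiB of 10.5 GiB usable
```

Even with generous rounding, the shape of the result holds: doubling context adds roughly a fixed GiB-scale cost on top of the weights, which is why num_ctx is the lever that moves first.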

Model Picks That Usually Work Well

| Use case | Model | Typical size range | Why it fits 12GB |
| --- | --- | --- | --- |
| General assistant | Llama 3.1 | 8B | Strong quality-per-VRAM for daily chat and drafting |
| General assistant | Gemma 2 | 9B | Efficient response quality on 12GB with moderate context |
| Coding | Qwen2.5 Coder | 7B to 14B | Code-focused quality, with 14B often the practical ceiling |
| Multilingual writing | Qwen2.5 | 7B to 14B | Strong multilingual and long-form behavior if context is controlled |
| Reasoning | DeepSeek-R1 | 7B to 14B | Useful reasoning family when you can budget extra compute |

If you only change one setting while troubleshooting, change num_ctx. On 12GB it usually has a larger real-world impact than moving from one 8B family to another.

Context Is the Main Performance Lever

On GPUs with less than 24 GiB, Ollama defaults to a 4K context. On 12GB, that default is usually the right first step. Increase only when your real workload needs it.

| Goal | Suggested num_ctx | Spill risk |
| --- | --- | --- |
| Fast, stable interactive usage | 4096 | Low |
| Longer sessions | 8192 | Medium |
| Long docs or tool-heavy prompts | 16384 | High |
| Huge context experiments | 32768+ | Very high |
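You can set num_ctx per request through Ollama's REST API rather than changing it globally. The endpoint and the `options.num_ctx` field come from Ollama's documented API; the model tag and host below are assumptions you should adjust to your setup.

```python
import json
import urllib.request

def generate_request(model, prompt, num_ctx=4096, host="http://localhost:11434"):
    """Build a non-streaming /api/generate request with an explicit num_ctx."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context override
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = generate_request("llama3.1:8b", "Summarize KV cache in one line.", num_ctx=8192)
# With a running Ollama server:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["response"])
```

Keeping num_ctx in the request, instead of baking it into a Modelfile, makes it easy to run short interactive chats at 4096 and reserve 8192+ for the sessions that genuinely need it.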

Why CPU Spill Feels Like a Cliff

| Execution mode | Observed behavior | User impact |
| --- | --- | --- |
| Fully on GPU | Fast and predictable token throughput | Best interactive experience |
| Small weight spill | Noticeable slowdown with uneven latency | Sometimes usable, but less responsive |
| KV cache spill | Hot-path memory moves over PCIe per token | Often a severe performance cliff |

If you must offload, offloading a small part of weights is generally less painful than forcing KV cache off GPU.

That is because KV cache is touched at every generated token. Once it leaves VRAM, token cadence can become visibly stuttery, with bursts and pauses instead of smooth generation.
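The quickest way to see whether a model has spilled is the PROCESSOR column of `ollama ps`, which reports either "100% GPU" or a CPU/GPU split. The sample output below is illustrative (column widths and wording may vary between Ollama versions), and the parsing is deliberately crude:

```python
SAMPLE = """\
NAME           ID            SIZE     PROCESSOR          UNTIL
qwen2.5:14b    abc123def456  10 GB    24%/76% CPU/GPU    4 minutes from now
"""

def has_cpu_spill(ps_output):
    """Return True if any loaded model reports a CPU share in `ollama ps` output."""
    for line in ps_output.splitlines()[1:]:
        if "CPU" in line:
            return True
    return False

print(has_cpu_spill(SAMPLE))  # True -> part of the model left VRAM
```

If you see a split, reduce num_ctx first; as a second lever, Ollama's `num_gpu` option caps how many layers go to the GPU, which lets you choose a small weight spill over a KV cache spill.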

Concurrency Multiplies Context Allocation

| Context per request | Parallel requests | Effective KV allocation (tokens) |
| --- | --- | --- |
| 4096 | 1 | 4096 |
| 4096 | 2 | 8192 |
| 4096 | 4 | 16384 |

A stable single-chat profile can become unstable instantly when you run two or four sessions in parallel.
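The arithmetic behind the table is just multiplication, but it is worth making explicit because the server, not the client, decides the parallelism (via the `OLLAMA_NUM_PARALLEL` environment variable):

```python
def effective_kv_tokens(num_ctx, parallel):
    """Total KV-cache token slots the server must back for parallel requests."""
    return num_ctx * parallel

for parallel in (1, 2, 4):
    print(f"{parallel} x 4096 -> {effective_kv_tokens(4096, parallel)} KV tokens")
```

On 12GB, a practical rule is to budget num_ctx assuming your realistic worst-case parallelism, not the single-chat case; setting `OLLAMA_NUM_PARALLEL=1` is a legitimate way to protect an interactive profile.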

Three Practical Profiles

On Windows desktops, keep extra margin for VRAM overhead. If you are pushing limits, minimal Linux setups typically provide more predictable usable headroom.

Practical Setup Rules
