Ollama on RTX 3060 12GB

RTX 3060 12GB is a strong local-LLM entry point when you treat VRAM as a shared budget instead of a fixed fit-or-no-fit threshold. Most slow setups fail because context was set too high, not because the model family was inherently wrong.

Practical sweet spot: 7B to 9B models are easy to keep fast, while 13B to 14B can work well with tighter context and strict headroom.

The most common pattern on this card is: a setup feels great on day one, then slows down once chat history grows, a second tab opens, or an API call runs in parallel. That is usually a context and concurrency issue, not a model-quality issue.

12GB Memory Budget Model

| Memory bucket | Behavior | Operational implication |
| --- | --- | --- |
| Model weights | Mostly fixed by model size and quantization | Choose 7B to 14B carefully and reserve headroom |
| KV cache | Scales with context length and is accessed every token | Largest practical performance lever on 12GB |
| System overhead | OS, drivers, desktop apps, and VRAM fragmentation | Usable VRAM is always lower than the card label |

In practical terms, you are balancing three levers at once. If you increase model size, context headroom shrinks. If you increase context, concurrency tolerance shrinks. If you keep both high, offload risk jumps.
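The three buckets can be sketched numerically. This is a rough sanity check, not a measurement: the architecture figures (32 layers, 8 KV heads, head dim 128, fp16 KV cache, roughly matching Llama 3.1 8B) and the ~4.5 bits-per-weight figure for a Q4-class quant are illustrative assumptions, and the 1.5 GiB system overhead is a guess for a desktop setup.

```python
def kv_cache_bytes(num_ctx, n_layers=32, n_kv_heads=8, head_dim=128, elem_bytes=2):
    """Rough fp16 KV-cache size: a K and a V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * elem_bytes * num_ctx

def weight_bytes(params_billion, bits_per_weight=4.5):
    """Rough quantized weight size for a Q4-class quantization."""
    return params_billion * 1e9 * bits_per_weight / 8

GIB = 1024 ** 3
usable = 12 * GIB - 1.5 * GIB          # assume ~1.5 GiB OS/driver overhead
used = weight_bytes(8) + kv_cache_bytes(8192)
print(f"8B weights + 8K-ctx KV: {used / GIB:.1f} GiB of {usable / GIB:.1f} GiB usable")
# -> 8B weights + 8K-ctx KV: 5.2 GiB of 10.5 GiB usable
```

Even with generous rounding, the shape of the result holds: doubling context adds roughly a fixed GiB-scale cost on top of the weights, which is why num_ctx is the lever that moves first.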

Model Picks That Usually Work Well

| Use case | Model | Typical size range | Why it fits 12GB |
| --- | --- | --- | --- |
| General assistant | Llama 3.1 | 8B | Strong quality-per-VRAM for daily chat and drafting |
| General assistant | Gemma 2 | 9B | Efficient response quality on 12GB with moderate context |
| Coding | Qwen2.5 Coder | 7B to 14B | Code-focused quality, with 14B often the practical ceiling |
| Multilingual writing | Qwen2.5 | 7B to 14B | Strong multilingual and long-form behavior if context is controlled |
| Reasoning | DeepSeek-R1 | 7B to 14B | Useful reasoning family when you can budget extra compute |

If you only change one setting while troubleshooting, change num_ctx. On 12GB it usually has a larger real-world impact than moving from one 8B family to another.

Context Is the Main Performance Lever

On GPUs with less than 24 GiB, Ollama defaults to a 4K context. On 12GB, that default is usually the right first step. Increase only when your real workload needs it.

| Goal | Suggested num_ctx | Spill risk |
| --- | --- | --- |
| Fast, stable interactive usage | 4096 | Low |
| Longer sessions | 8192 | Medium |
| Long docs or tool-heavy prompts | 16384 | High |
| Huge context experiments | 32768+ | Very high |
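You can set num_ctx per request through Ollama's REST API rather than changing it globally. The endpoint and the `options.num_ctx` field come from Ollama's documented API; the model tag and host below are assumptions you should adjust to your setup.

```python
import json
import urllib.request

def generate_request(model, prompt, num_ctx=4096, host="http://localhost:11434"):
    """Build a non-streaming /api/generate request with an explicit num_ctx."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context override
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = generate_request("llama3.1:8b", "Summarize KV cache in one line.", num_ctx=8192)
# With a running Ollama server:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["response"])
```

Keeping num_ctx in the request, instead of baking it into a Modelfile, makes it easy to run short interactive chats at 4096 and reserve 8192+ for the sessions that genuinely need it.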

Why CPU Spill Feels Like a Cliff

| Execution mode | Observed behavior | User impact |
| --- | --- | --- |
| Fully on GPU | Fast and predictable token throughput | Best interactive experience |
| Small weight spill | Noticeable slowdown with uneven latency | Sometimes usable, but less responsive |
| KV cache spill | Hot-path memory moves over PCIe per token | Often a severe performance cliff |

If you must offload, offloading a small part of weights is generally less painful than forcing KV cache off GPU.

That is because KV cache is touched at every generated token. Once it leaves VRAM, token cadence can become visibly stuttery, with bursts and pauses instead of smooth generation.
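The quickest way to see whether a model has spilled is the PROCESSOR column of `ollama ps`, which reports either "100% GPU" or a CPU/GPU split. The sample output below is illustrative (column widths and wording may vary between Ollama versions), and the parsing is deliberately crude:

```python
SAMPLE = """\
NAME           ID            SIZE     PROCESSOR          UNTIL
qwen2.5:14b    abc123def456  10 GB    24%/76% CPU/GPU    4 minutes from now
"""

def has_cpu_spill(ps_output):
    """Return True if any loaded model reports a CPU share in `ollama ps` output."""
    for line in ps_output.splitlines()[1:]:
        if "CPU" in line:
            return True
    return False

print(has_cpu_spill(SAMPLE))  # True -> part of the model left VRAM
```

If you see a split, reduce num_ctx first; as a second lever, Ollama's `num_gpu` option caps how many layers go to the GPU, which lets you choose a small weight spill over a KV cache spill.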

Concurrency Multiplies Context Allocation

| Context per request | Parallel requests | Effective KV allocation (tokens) |
| --- | --- | --- |
| 4096 | 1 | 4096 |
| 4096 | 2 | 8192 |
| 4096 | 4 | 16384 |

A stable single-chat profile can become unstable instantly when you run two or four sessions in parallel.
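The arithmetic behind the table is just multiplication, but it is worth making explicit because the server, not the client, decides the parallelism (via the `OLLAMA_NUM_PARALLEL` environment variable):

```python
def effective_kv_tokens(num_ctx, parallel):
    """Total KV-cache token slots the server must back for parallel requests."""
    return num_ctx * parallel

for parallel in (1, 2, 4):
    print(f"{parallel} x 4096 -> {effective_kv_tokens(4096, parallel)} KV tokens")
```

On 12GB, a practical rule is to budget num_ctx assuming your realistic worst-case parallelism, not the single-chat case; setting `OLLAMA_NUM_PARALLEL=1` is a legitimate way to protect an interactive profile.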

Three Practical Profiles

On Windows desktops, keep extra margin for VRAM overhead. If you are pushing limits, minimal Linux setups typically provide more predictable usable headroom.

Practical Setup Rules
