Ollama on RTX 4060 8GB

The RTX 4060 has strong compute, but its 8GB of VRAM is the limiting factor for local LLMs. The most reliable path is smaller models with deliberate context sizing, not large models with hidden CPU offload.

Practical sweet spot: 3B to 4B for maximum responsiveness, or 7B to 8B at Q4 with moderate context.

On 8GB, performance often feels binary: either fully on GPU and smooth, or partially spilled and suddenly slow. There is usually not much middle ground.
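You can check which side of that line you are on with `ollama ps`: the PROCESSOR column reports the GPU/CPU split. The output below is illustrative (name, ID, and sizes are made up; exact columns vary slightly by version):

```
$ ollama ps
NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    a1b2c3d4e5f6    6.2 GB    100% GPU     4 minutes from now
```

Anything other than 100% GPU (for example 48%/52% CPU/GPU) means the model has spilled, and token throughput will drop sharply.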

What Fits Comfortably on 8GB

| Use case | Model | Typical size range | Why it works on 8GB |
|---|---|---|---|
| Ultra-light and fast | Phi-3 Mini | 3B to 4B | Low VRAM pressure and room for longer context |
| General assistant | Llama 3.1 | 8B | Strong baseline quality at Q4 with moderate context |
| General chat/summaries | Gemma 2 | 2B to 9B | Smaller variants are particularly stable on 8GB |
| Coding | Qwen2.5 Coder | 7B | Good coding output without pushing VRAM as hard as 14B |
| Multilingual writing | Qwen2.5 | 7B | Useful multilingual quality if context stays controlled |
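As a rough sanity check, weight footprint at Q4 can be estimated from parameter count. This is a rule-of-thumb sketch, assuming about 4.5 bits per weight (Q4_0-style, including quantization block overhead); real files vary by quant variant:

```python
def q4_weights_gb(params_billion: float) -> float:
    """Approximate VRAM footprint of the weights at Q4 quantization,
    assuming ~4.5 bits per weight including block overhead."""
    return params_billion * 1e9 * 4.5 / 8 / 1024**3

for p in (3, 7, 8, 14):
    print(f"{p}B at Q4 ≈ {q4_weights_gb(p):.1f} GB")
```

On this estimate an 8B model needs roughly 4.2 GB for weights alone, which is why it fits with moderate context, while 14B (roughly 7.3 GB) leaves almost nothing for KV cache and runtime overhead on an 8GB card.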

A 14B model can sometimes run on 8GB with aggressive tradeoffs, but this is usually the point where latency becomes inconsistent.

If your goal is dependable throughput, a smaller model running entirely on the GPU usually beats a larger model with offload in real workflows.

Context Strategy for 8GB

On this tier, context is often the deciding factor between smooth GPU inference and cliff-like slowdown. Set num_ctx explicitly instead of depending on changing defaults.

| Goal | Suggested num_ctx | Spill risk |
|---|---|---|
| Fast and consistent | 2048 to 4096 | Low |
| Longer sessions | 4096 to 8192 | Medium |
| Long documents | 8192 to 16384 | High |
| Extreme context tests | 16384+ | Very high |
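The spill-risk column tracks KV cache growth, which scales linearly with num_ctx. A minimal sketch, assuming an FP16 K/V cache and a Llama-3.1-8B-style shape (32 layers, 8 KV heads with GQA, head dim 128; these shape numbers are assumptions for illustration, not values read from Ollama):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                num_ctx: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors, per layer, per KV head, per token, at FP16."""
    return (2 * n_layers * n_kv_heads * head_dim
            * num_ctx * bytes_per_elem) / 1024**3

for ctx in (2048, 4096, 8192, 16384):
    print(f"num_ctx={ctx:>5}: ≈ {kv_cache_gb(32, 8, 128, ctx):.2f} GB KV cache")
```

At 16384 tokens the cache alone is about 2 GB on top of 4+ GB of Q4 weights, which is where the "very high" spill risk comes from. Ollama can also quantize the KV cache depending on server settings, which shifts these numbers down.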

One practical trap: Ollama's default num_ctx can change between releases, so a setup tuned against the default can shift behavior after an upgrade. Setting it explicitly keeps results reproducible.
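One way to pin it is in a Modelfile, so every run of that model tag inherits the setting (llama3.1:8b here is just an example base):

```
# Modelfile
FROM llama3.1:8b
PARAMETER num_ctx 4096
```

Build it once with `ollama create llama31-8b-4k -f Modelfile`, or pass the same value per request through the API's options field (`{"num_ctx": 4096}`).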

Concurrency Is a Hidden Failure Mode

Effective context allocation scales with parallel requests. A setup that is stable in one chat can spill when you open multiple sessions.

| Context per request | Parallel requests | Effective allocation |
|---|---|---|
| 4096 | 1 | 4096 |
| 4096 | 2 | 8192 |
| 4096 | 4 | 16384 |
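The multiplication above translates directly into KV cache gigabytes. A sketch reusing an FP16, 8B-style per-token figure (an assumption, as is the premise that each parallel slot, per OLLAMA_NUM_PARALLEL, gets its own num_ctx-sized context):

```python
# FP16 K+V per token for an assumed 8B-style shape:
# 2 (K and V) * 32 layers * 8 KV heads * 128 head dim * 2 bytes = 128 KiB
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2

def total_kv_gb(num_ctx: int, num_parallel: int) -> float:
    """KV cache the server must budget when each parallel slot
    is allocated its own num_ctx-sized context."""
    return num_ctx * num_parallel * KV_BYTES_PER_TOKEN / 1024**3

for n in (1, 2, 4):
    print(f"num_ctx=4096, parallel={n}: ≈ {total_kv_gb(4096, n):.1f} GB KV cache")
```

With roughly 4 GB of Q4 weights already resident, the jump from 0.5 GB of KV cache at one request to 2.0 GB at four parallel requests is enough to push an 8GB card over the edge.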

This is why a setup that worked yesterday can fail today when you add tabs, open another chat, or expose an API endpoint with parallel requests.

Three Profiles That Work in Practice

If you need longer context without spill, reducing model size is usually a better trade than forcing larger models into mixed CPU/GPU execution.

8GB Operational Rules
