Ollama on RTX 4060 8GB

The RTX 4060 has strong compute, but its 8GB of VRAM is the limiting factor for local LLMs. The most reliable path is smaller models with deliberate context sizing, not large models with hidden CPU offload.

Practical sweet spot: 3B to 4B for maximum responsiveness, or 7B to 8B at Q4 with moderate context.

On 8GB, performance often feels binary: either fully on GPU and smooth, or partially spilled and suddenly slow. There is usually not much middle ground.
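You can check which side of that line you are on with `ollama ps`: the PROCESSOR column reports the GPU/CPU split. The output below is illustrative (name, ID, and sizes are made up; exact columns vary slightly by version):

```
$ ollama ps
NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    a1b2c3d4e5f6    6.2 GB    100% GPU     4 minutes from now
```

Anything other than 100% GPU (for example 48%/52% CPU/GPU) means the model has spilled, and token throughput will drop sharply.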

What Fits Comfortably on 8GB

| Use case | Model | Typical size range | Why it works on 8GB |
|---|---|---|---|
| Ultra-light and fast | Phi-3 Mini | 3B to 4B | Low VRAM pressure and room for longer context |
| General assistant | Llama 3.1 | 8B | Strong baseline quality at Q4 with moderate context |
| General chat/summaries | Gemma 2 | 2B to 9B | Smaller variants are particularly stable on 8GB |
| Coding | Qwen2.5 Coder | 7B | Good coding output without pushing VRAM as hard as 14B |
| Multilingual writing | Qwen2.5 | 7B | Useful multilingual quality if context stays controlled |
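As a rough sanity check, weight footprint at Q4 can be estimated from parameter count. This is a rule-of-thumb sketch, assuming about 4.5 bits per weight (Q4_0-style, including quantization block overhead); real files vary by quant variant:

```python
def q4_weights_gb(params_billion: float) -> float:
    """Approximate VRAM footprint of the weights at Q4 quantization,
    assuming ~4.5 bits per weight including block overhead."""
    return params_billion * 1e9 * 4.5 / 8 / 1024**3

for p in (3, 7, 8, 14):
    print(f"{p}B at Q4 ≈ {q4_weights_gb(p):.1f} GB")
```

On this estimate an 8B model needs roughly 4.2 GB for weights alone, which is why it fits with moderate context, while 14B (roughly 7.3 GB) leaves almost nothing for KV cache and runtime overhead on an 8GB card.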

A 14B model can sometimes run on 8GB with aggressive tradeoffs, but this is usually the point where latency becomes inconsistent.

If your goal is dependable throughput, a smaller model running entirely on the GPU usually beats a larger model with offload in real workflows.

Context Strategy for 8GB

On this tier, context is often the deciding factor between smooth GPU inference and cliff-like slowdown. Set num_ctx explicitly instead of depending on changing defaults.

| Goal | Suggested num_ctx | Spill risk |
|---|---|---|
| Fast and consistent | 2048 to 4096 | Low |
| Longer sessions | 4096 to 8192 | Medium |
| Long documents | 8192 to 16384 | High |
| Extreme context tests | 16384+ | Very high |
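The spill-risk column tracks KV cache growth, which scales linearly with num_ctx. A minimal sketch, assuming an FP16 K/V cache and a Llama-3.1-8B-style shape (32 layers, 8 KV heads with GQA, head dim 128; these shape numbers are assumptions for illustration, not values read from Ollama):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                num_ctx: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors, per layer, per KV head, per token, at FP16."""
    return (2 * n_layers * n_kv_heads * head_dim
            * num_ctx * bytes_per_elem) / 1024**3

for ctx in (2048, 4096, 8192, 16384):
    print(f"num_ctx={ctx:>5}: ≈ {kv_cache_gb(32, 8, 128, ctx):.2f} GB KV cache")
```

At 16384 tokens the cache alone is about 2 GB on top of 4+ GB of Q4 weights, which is where the "very high" spill risk comes from. Ollama can also quantize the KV cache depending on server settings, which shifts these numbers down.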

One practical trap: Ollama's default num_ctx can change between releases, so a setup tuned against the default can shift behavior after an upgrade. Setting it explicitly keeps results reproducible.
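One way to pin it is in a Modelfile, so every run of that model tag inherits the setting (llama3.1:8b here is just an example base):

```
# Modelfile
FROM llama3.1:8b
PARAMETER num_ctx 4096
```

Build it once with `ollama create llama31-8b-4k -f Modelfile`, or pass the same value per request through the API's options field (`{"num_ctx": 4096}`).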

Concurrency Is a Hidden Failure Mode

Effective context allocation scales with parallel requests. A setup that is stable in one chat can spill when you open multiple sessions.

| Context per request | Parallel requests | Effective allocation |
|---|---|---|
| 4096 | 1 | 4096 |
| 4096 | 2 | 8192 |
| 4096 | 4 | 16384 |
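The multiplication above translates directly into KV cache gigabytes. A sketch reusing an FP16, 8B-style per-token figure (an assumption, as is the premise that each parallel slot, per OLLAMA_NUM_PARALLEL, gets its own num_ctx-sized context):

```python
# FP16 K+V per token for an assumed 8B-style shape:
# 2 (K and V) * 32 layers * 8 KV heads * 128 head dim * 2 bytes = 128 KiB
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2

def total_kv_gb(num_ctx: int, num_parallel: int) -> float:
    """KV cache the server must budget when each parallel slot
    is allocated its own num_ctx-sized context."""
    return num_ctx * num_parallel * KV_BYTES_PER_TOKEN / 1024**3

for n in (1, 2, 4):
    print(f"num_ctx=4096, parallel={n}: ≈ {total_kv_gb(4096, n):.1f} GB KV cache")
```

With roughly 4 GB of Q4 weights already resident, the jump from 0.5 GB of KV cache at one request to 2.0 GB at four parallel requests is enough to push an 8GB card over the edge.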

This is why a setup that worked yesterday can fail today when you add tabs, open another chat, or expose an API endpoint with parallel requests.

Three Profiles That Work in Practice

If you need longer context without spill, reducing model size is usually a better trade than forcing larger models into mixed CPU/GPU execution.

8GB Operational Rules
