Ollama on RTX 4060 8GB
The RTX 4060 has strong compute, but its 8GB of VRAM is the limiting factor for local LLMs. The most
reliable path is smaller models with deliberate context sizing, not large models with hidden CPU offload.
Practical sweet spot: 3B to 4B for maximum responsiveness, or 7B to 8B at Q4 with moderate context.
On 8GB, performance often feels binary: either fully on GPU and smooth, or partially spilled and suddenly slow.
There is usually not much middle ground.
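To make that cliff concrete, here is a rough fit check. Every number is an assumption for illustration: a Llama-3-8B-like architecture (32 layers, 8 KV heads, head dimension 128) quantized to roughly 4.6 GiB at Q4, FP16 KV cache, and about 1 GiB of runtime overhead. Real figures vary by model and Ollama version.

```python
# Rough VRAM fit check for an 8GB card. All numbers are illustrative
# assumptions, not measurements.

GIB = 1024**3

def kv_cache_bytes(num_ctx: int, n_layers: int = 32,
                   n_kv_heads: int = 8, head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    # Keys + values: 2 tensors per layer, each num_ctx x n_kv_heads x head_dim.
    return 2 * n_layers * num_ctx * n_kv_heads * head_dim * bytes_per_elem

def fits_8gb(weights_gib: float, num_ctx: int, overhead_gib: float = 1.0) -> bool:
    total = weights_gib * GIB + kv_cache_bytes(num_ctx) + overhead_gib * GIB
    return total <= 8 * GIB

# A ~4.6 GiB Q4 8B model with 4096 context leaves room on 8GB...
print(fits_8gb(4.6, 4096))    # True
# ...but stretching the same model to 32768 context does not.
print(fits_8gb(4.6, 32768))   # False
```

Under these assumptions, 4096 tokens of KV cache costs about 0.5 GiB, while 32768 tokens costs about 4 GiB, which is the difference between fitting comfortably and spilling to CPU.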
What Fits Comfortably on 8GB
A 14B model can sometimes run on 8GB with aggressive tradeoffs, but that is usually the point where latency becomes inconsistent.
If your goal is dependable throughput, smaller models at higher stability usually beat larger models with
offload in real workflows.
Context Strategy for 8GB
On this tier, context is often the deciding factor between smooth GPU inference and cliff-like slowdown.
Set num_ctx explicitly instead of depending on defaults. Defaults can change between Ollama releases, and
pinning the value keeps behavior stable across upgrades instead of inheriting shifting defaults.
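One way to pin the value is per request, through the `options` field of Ollama's `/api/generate` JSON API. The sketch below only builds the request body (no network call); the model tag is an example.

```python
# Sketch: pin num_ctx per request via Ollama's /api/generate options
# field, rather than relying on the server default.
import json

def build_generate_request(model: str, prompt: str, num_ctx: int) -> str:
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Explicit context size: survives default changes across releases.
        "options": {"num_ctx": num_ctx},
    }
    return json.dumps(payload)

body = build_generate_request("llama3.1:8b", "Hello", 4096)
print(body)
```

The same effect can be baked into a model permanently with `PARAMETER num_ctx 4096` in a Modelfile.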
Concurrency Is a Hidden Failure Mode
Effective context allocation scales with parallel requests: a configuration that is stable in a single chat
can spill once several sessions run at once. This is why a setup that worked yesterday can fail today when
you open extra tabs, start another chat, or expose an API endpoint that serves parallel requests.
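The arithmetic is simple but easy to overlook. The sketch below assumes Ollama reserves one KV-cache allocation per parallel slot (controlled by `OLLAMA_NUM_PARALLEL`), and uses the illustrative figure of roughly 0.5 GiB per slot for an 8B-class model at num_ctx=4096; exact behavior and sizes depend on the Ollama version and model.

```python
# Illustration: total KV-cache allocation grows with parallel slots, so
# the same num_ctx can fit at parallel=1 and spill at parallel=4.

GIB = 1024**3
KV_PER_SLOT = 0.5 * GIB  # assumed per-slot cache at num_ctx=4096, 8B model

def total_kv(parallel: int) -> float:
    # One full context allocation per concurrent request slot.
    return parallel * KV_PER_SLOT

for parallel in (1, 2, 4):
    print(f"parallel={parallel}: {total_kv(parallel) / GIB:.1f} GiB of KV cache")
```

At parallel=4 the cache alone has grown from 0.5 GiB to 2 GiB, which on an 8GB card can be the entire remaining headroom.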
Three Profiles That Work in Practice
- General chat profile: 7B to 8B model, Q4, num_ctx=4096, low parallelism.
- Coding profile: Qwen2.5 Coder 7B or Phi-3 Mini, num_ctx=4096.
- Long-document profile: smaller model (2B to 4B), num_ctx=8192 to 16384.
If you need longer context without spill, reducing model size is usually a better trade than forcing larger
models into mixed CPU/GPU execution.
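The three profiles can be kept as a small option table in a launcher script. The sketch below is one way to do it; the model tags are examples of what each profile might map to, not prescriptions, and the returned dict is shaped to plug into the `options` field of an Ollama API request.

```python
# The three profiles above as reusable option sets. Model tags are
# example choices; swap in whatever you have pulled locally.

PROFILES = {
    "chat":     {"model": "llama3.1:8b-instruct-q4_K_M", "num_ctx": 4096},
    "coding":   {"model": "qwen2.5-coder:7b",            "num_ctx": 4096},
    "long-doc": {"model": "gemma2:2b",                   "num_ctx": 16384},
}

def options_for(task: str) -> dict:
    # Only num_ctx goes in "options"; the model tag is a top-level field
    # in Ollama's API request.
    return {"num_ctx": PROFILES[task]["num_ctx"]}

print(options_for("long-doc"))  # {'num_ctx': 16384}
```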
8GB Operational Rules
- Use Q4 as the first quant target for 7B to 8B models.
- Drop context before dropping model quality when latency spikes.
- Keep overlays, browser tabs, and GPU-heavy apps closed while serving.
- Treat usable headroom as well under 8GB in real desktop conditions, since the OS and display stack also hold VRAM.
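The "drop context before dropping model quality" rule can be expressed as a simple degradation ladder: when latency spikes, step num_ctx down first, and only change models once there is no smaller context left to try. The ladder values below are illustrative.

```python
# Sketch of the "drop context first" rule. Ladder values are illustrative.

CTX_LADDER = [8192, 4096, 2048]

def next_step(current_ctx: int) -> str:
    # Find the next smaller context; only when none is left do we
    # recommend changing the model itself.
    smaller = [c for c in CTX_LADDER if c < current_ctx]
    if smaller:
        return f"reduce num_ctx to {smaller[0]}"
    return "switch to a smaller model"

print(next_step(8192))  # reduce num_ctx to 4096
print(next_step(2048))  # switch to a smaller model
```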