Ollama on RTX 5090 (32GB)

The RTX 5090 changes local inference mainly through its 32GB of VRAM: you can keep larger models and longer contexts on GPU more often than on 24GB cards. Raw speed helps, but staying fully on GPU is still the primary predictor of user-perceived performance.

The core mindset on the 5090 is still budget management: weights + KV cache + overhead must fit in VRAM together. The card is fast enough that when it slows down, it is usually because fit was lost, not because the GPU is weak.
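The budget above can be sketched as a simple fit check. This is a minimal illustration, not Ollama's internal logic; the overhead figure and byte counts are assumptions for the example.

```python
# Rough single-GPU fit check: weights + KV cache + overhead vs. available VRAM.
# The 2 GiB overhead default is an illustrative assumption, not a measured value.

GIB = 1024 ** 3

def fits_on_gpu(weight_bytes: int, kv_cache_bytes: int,
                overhead_bytes: int = 2 * GIB,
                vram_bytes: int = 32 * GIB) -> bool:
    """True if the whole budget stays on a 32 GiB card."""
    return weight_bytes + kv_cache_bytes + overhead_bytes <= vram_bytes

# Example: ~18 GiB of quantized weights plus ~10 GiB of KV cache fits,
# but the same weights with ~14 GiB of KV cache spills.
print(fits_on_gpu(18 * GIB, 10 * GIB))  # True
print(fits_on_gpu(18 * GIB, 14 * GIB))  # False
```

The point of the check is the sum: any one term can look fine in isolation while the total quietly crosses 32 GiB.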

Context Defaults and Why They Matter

| VRAM tier | Default context |
| --- | --- |
| Under 24 GiB | 4K |
| 24 to 48 GiB | 32K |
| 48 GiB or more | 256K |

The 5090 sits in the 24 to 48 GiB tier, so the default context is usually 32K. Treat that as a capability, not a fixed setting for every model.

For very large checkpoints, starting at 16K and stepping up is usually safer than beginning at 32K and trying to debug a sudden spill.
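A back-of-envelope KV cache estimate shows why the 16K-to-32K step matters: for a fixed model, KV cache scales linearly with context, so doubling context doubles one of the largest budget items. The layer and head counts below are assumptions typical of Llama-70B-class GQA models, not values taken from any specific checkpoint.

```python
# KV cache size for a grouped-query-attention transformer:
# 2 (keys + values) * layers * kv_heads * head_dim * bytes/element * tokens.
# Defaults are illustrative 70B-class assumptions (80 layers, 8 KV heads, fp16).

def kv_cache_bytes(ctx: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx

GIB = 1024 ** 3
print(round(kv_cache_bytes(16_384) / GIB, 1))  # 5.0 (GiB at 16K)
print(round(kv_cache_bytes(32_768) / GIB, 1))  # 10.0 (GiB at 32K)
```

Under these assumptions, going from 16K to 32K costs roughly 5 GiB of extra VRAM before the model generates a single token, which is exactly the kind of sudden spill that is hard to debug after the fact.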

Model Picks for 32GB Workloads

| Model | Size class | Best for | Starting quant | Starting context |
| --- | --- | --- | --- | --- |
| Llama 3.3 | 70B class | Large-model general assistant | Q4 | Start 16K, then test 32K |
| Qwen2.5 | 72B class | Multilingual and long-form | Q4 | Start 16K, then test 32K |
| Mixtral 8x22B | MoE 8x22B | High-quality long-context workflows | Q4 | 32K |
| Command R+ | 104B class | Instruction-heavy tool workflows | Q3 | 8K to 16K |
| Qwen2.5 VL | 72B vision-language | Document and vision tasks | Q4 | 8K to 16K |

The 32GB advantage is not just model size. It is the ability to keep larger working contexts and multi-step agent-style prompts on GPU without spilling to system RAM.
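For the weights side of the budget, a crude estimate from parameter count and effective bits per weight is often enough to know whether a quant is even a candidate. Real GGUF files add metadata and mix precisions per tensor, so treat these numbers as rough bounds, not file sizes.

```python
# Hedged weights-only size estimate: parameters * bits/weight / 8.
# The 4.5 bits/weight figure is an assumed effective rate for Q4-class quants.

def weight_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate quantized weight footprint in GiB for params_b billion params."""
    return params_b * 1e9 * bits_per_weight / 8 / (1024 ** 3)

print(round(weight_gib(32, 4.5), 1))  # ~16.8 GiB: a 32B Q4 leaves ample KV room
print(round(weight_gib(70, 4.5), 1))  # ~36.7 GiB: a 70B Q4 is tight against 32 GiB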

Practical Profiles for RTX 5090

| Profile | Model band | Context plan | Primary objective |
| --- | --- | --- | --- |
| A: Maximum responsiveness | 14B to 32B | 16K to 32K | Lowest latency with plenty of tool headroom |
| B: Single-GPU big model | 70B to 72B | 16K first, then 32K | High quality while staying on one GPU |
| C: Long-context agents | MoE or strong mid-size models | 32K | Long history and retrieval without offload |

Profile B is where the 5090 shines: single-GPU large-model runs that are impractical or brittle on smaller cards. Profile A is often better for product workflows where latency consistency matters more than headline model size.
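The profiles above map naturally onto per-request options in the Ollama API, where `num_ctx` sets the context length. The model tags and context values below are illustrative picks, not prescriptions; the sketch only builds the request payload rather than sending it.

```python
# Sketch: profiles A-C as Ollama /api/generate payloads.
# num_ctx is the Ollama option controlling context length per request.
# Model tags and context sizes here are example choices, not requirements.

PROFILES = {
    "A": {"model": "qwen2.5:32b", "options": {"num_ctx": 16_384}},
    "B": {"model": "llama3.3:70b", "options": {"num_ctx": 16_384}},
    "C": {"model": "mixtral:8x22b", "options": {"num_ctx": 32_768}},
}

def payload(profile: str, prompt: str) -> dict:
    """Build a /api/generate request body for the chosen profile."""
    p = PROFILES[profile]
    return {"model": p["model"], "prompt": prompt, "options": p["options"]}

print(payload("B", "Summarize this document.")["options"]["num_ctx"])  # 16384
```

Keeping context in the request options (rather than baking it into a Modelfile) makes the 16K-first, then-32K testing plan a one-line change per run.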

Pros and Cons in Real Use

| Pros | Cons |
| --- | --- |
| 32GB makes 70B/72B single-GPU runs far more practical | Still not a 48GB-class card for every giant model + huge context combo |
| The 32K default context tier supports long sessions and agent workflows | A large default context can backfire on very heavy models |
| More headroom for multitasking and parallel chats | High cost, power, and thermals compared to smaller cards |

Where 5090 Setups Still Fail

32GB Operating Rules
