Ollama on RTX 5090 (32GB)

The RTX 5090 changes local inference mainly through its 32GB of VRAM: you can keep larger models and longer contexts on GPU more often than on 24GB cards. Raw speed helps, but staying fully on GPU is still the primary predictor of user-perceived performance.

The core mindset on the 5090 is still budget management: weights + KV cache + overhead must fit in VRAM together. The card is fast enough that when it slows down, it is usually because fit was lost, not because the GPU is weak.
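The budget above can be sketched as a simple fit check. This is a minimal illustration, not Ollama's internal logic; the overhead figure and byte counts are assumptions for the example.

```python
# Rough single-GPU fit check: weights + KV cache + overhead vs. available VRAM.
# The 2 GiB overhead default is an illustrative assumption, not a measured value.

GIB = 1024 ** 3

def fits_on_gpu(weight_bytes: int, kv_cache_bytes: int,
                overhead_bytes: int = 2 * GIB,
                vram_bytes: int = 32 * GIB) -> bool:
    """True if the whole budget stays on a 32 GiB card."""
    return weight_bytes + kv_cache_bytes + overhead_bytes <= vram_bytes

# Example: ~18 GiB of quantized weights plus ~10 GiB of KV cache fits,
# but the same weights with ~14 GiB of KV cache spills.
print(fits_on_gpu(18 * GIB, 10 * GIB))  # True
print(fits_on_gpu(18 * GIB, 14 * GIB))  # False
```

The point of the check is the sum: any one term can look fine in isolation while the total quietly crosses 32 GiB.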

Context Defaults and Why They Matter

| VRAM tier | Default context |
| --- | --- |
| Under 24 GiB | 4K |
| 24 to 48 GiB | 32K |
| 48 GiB or more | 256K |

The 5090 sits in the 24 to 48 GiB tier, so the default context is usually 32K. Treat that as a capability, not a fixed setting for every model.

For very large checkpoints, starting at 16K and stepping up is usually safer than beginning at 32K and trying to debug a sudden spill.
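A back-of-envelope KV cache estimate shows why the 16K-to-32K step matters: for a fixed model, KV cache scales linearly with context, so doubling context doubles one of the largest budget items. The layer and head counts below are assumptions typical of Llama-70B-class GQA models, not values taken from any specific checkpoint.

```python
# KV cache size for a grouped-query-attention transformer:
# 2 (keys + values) * layers * kv_heads * head_dim * bytes/element * tokens.
# Defaults are illustrative 70B-class assumptions (80 layers, 8 KV heads, fp16).

def kv_cache_bytes(ctx: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx

GIB = 1024 ** 3
print(round(kv_cache_bytes(16_384) / GIB, 1))  # 5.0 (GiB at 16K)
print(round(kv_cache_bytes(32_768) / GIB, 1))  # 10.0 (GiB at 32K)
```

Under these assumptions, going from 16K to 32K costs roughly 5 GiB of extra VRAM before the model generates a single token, which is exactly the kind of sudden spill that is hard to debug after the fact.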

Model Picks for 32GB Workloads

| Model | Size class | Best for | Starting quant | Starting context |
| --- | --- | --- | --- | --- |
| Llama 3.3 | 70B class | Large-model general assistant | Q4 | Start 16K, then test 32K |
| Qwen2.5 | 72B class | Multilingual and long-form | Q4 | Start 16K, then test 32K |
| Mixtral 8x22B | MoE 8x22B | High-quality long-context workflows | Q4 | 32K |
| Command R+ | 104B class | Instruction-heavy tool workflows | Q3 | 8K to 16K |
| Qwen2.5 VL | 72B vision-language | Document and vision tasks | Q4 | 8K to 16K |

The 32GB advantage is not just model size. It is the ability to keep larger working contexts and multi-step agent-style prompts on GPU without spilling to system RAM.
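For the weights side of the budget, a crude estimate from parameter count and effective bits per weight is often enough to know whether a quant is even a candidate. Real GGUF files add metadata and mix precisions per tensor, so treat these numbers as rough bounds, not file sizes.

```python
# Hedged weights-only size estimate: parameters * bits/weight / 8.
# The 4.5 bits/weight figure is an assumed effective rate for Q4-class quants.

def weight_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate quantized weight footprint in GiB for params_b billion params."""
    return params_b * 1e9 * bits_per_weight / 8 / (1024 ** 3)

print(round(weight_gib(32, 4.5), 1))  # ~16.8 GiB: a 32B Q4 leaves ample KV room
print(round(weight_gib(70, 4.5), 1))  # ~36.7 GiB: a 70B Q4 is tight against 32 GiB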

Practical Profiles for RTX 5090

| Profile | Model band | Context plan | Primary objective |
| --- | --- | --- | --- |
| A: Maximum responsiveness | 14B to 32B | 16K to 32K | Lowest latency with plenty of tool headroom |
| B: Single-GPU big model | 70B to 72B | 16K first, then 32K | High quality while staying on one GPU |
| C: Long-context agents | MoE or strong mid-size models | 32K | Long history and retrieval without offload |

Profile B is where the 5090 shines: single-GPU large-model runs that are impractical or brittle on smaller cards. Profile A is often better for product workflows where latency consistency matters more than headline model size.
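The profiles above map naturally onto per-request options in the Ollama API, where `num_ctx` sets the context length. The model tags and context values below are illustrative picks, not prescriptions; the sketch only builds the request payload rather than sending it.

```python
# Sketch: profiles A-C as Ollama /api/generate payloads.
# num_ctx is the Ollama option controlling context length per request.
# Model tags and context sizes here are example choices, not requirements.

PROFILES = {
    "A": {"model": "qwen2.5:32b", "options": {"num_ctx": 16_384}},
    "B": {"model": "llama3.3:70b", "options": {"num_ctx": 16_384}},
    "C": {"model": "mixtral:8x22b", "options": {"num_ctx": 32_768}},
}

def payload(profile: str, prompt: str) -> dict:
    """Build a /api/generate request body for the chosen profile."""
    p = PROFILES[profile]
    return {"model": p["model"], "prompt": prompt, "options": p["options"]}

print(payload("B", "Summarize this document.")["options"]["num_ctx"])  # 16384
```

Keeping context in the request options (rather than baking it into a Modelfile) makes the 16K-first, then-32K testing plan a one-line change per run.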

Pros and Cons in Real Use

| Pros | Cons |
| --- | --- |
| 32GB makes 70B/72B single-GPU runs far more practical | Still not a 48GB-class card for every giant model + huge context combo |
| The 32K default context tier supports long sessions and agent workflows | A large default context can backfire on very heavy models |
| More headroom for multitasking and parallel chats | High cost, power, and thermals compared to smaller cards |

Where 5090 Setups Still Fail

32GB Operating Rules
