Collin D. Johnson

LLM VRAM Calculator

May 2026 · interactive tool

A back-of-the-envelope calculator for figuring out whether a given language model will fit in your GPU's VRAM — for inference, LoRA, or QLoRA fine-tuning.

This tool assumes ~90% of total VRAM is usable, leaving headroom for CUDA context, the framework runtime, and any other GPU workloads (display, encoder, etc.). For multi-GPU setups, enter the VRAM of a single card if the model fits on one, or the combined VRAM if you're using tensor/pipeline parallelism.

In custom mode, the KV cache is estimated from a sub-linear approximation calibrated for modern GQA models; model presets use the exact per-token KV size of the selected architecture.

[Interactive controls: GPU VRAM (8–96 GB), model size (0.1B–120B parameters), quantization (2–32 bits), context length (512–200K tokens).]
Estimated total memory needed: 15.9 GB
Usable VRAM (90%, leaves headroom for CUDA + framework overhead): 21.6 GB
Verdict: Tight but works

Memory breakdown

Model weights: 14.0 GB
KV cache (context): 888 MB
Training overhead: — (inference)
Activations + buffer: 1.0 GB

How it works

Memory usage breaks down into four buckets: model weights, the KV cache (which grows with context length), training overhead (zero for inference), and activations plus a working buffer.

Math reference

weights_gb     = params_billions × (bits / 8)

# Model preset (exact, from the selected model's attention architecture):
kv_cache_gb    = (2 × n_layers × n_kv_heads × head_dim × 2)
                 × context_length × (bits / 16) / 1e9

# Custom mode (sub-linear approximation, calibrated for modern GQA):
kv_cache_gb    = √params_billions × 40000
                 × context_length × (bits / 16) / 1e9

training_gb    = weights_gb × { 0 inference, 0.6 LoRA, 0.4 QLoRA, 3 full }
activations_gb = 1 if inference else 2
total_gb       = weights_gb + kv_cache_gb + training_gb + activations_gb
usable_gb      = total_vram_gb × 0.9
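
As a minimal Python sketch of the same arithmetic (the function name, parameter names, and mode strings are illustrative, not the tool's actual code; the formulas are the ones listed above):

def estimate_total_gb(params_b, bits, context_len, mode="inference",
                      n_layers=None, n_kv_heads=None, head_dim=None):
    # Weights: one value per parameter at the chosen bit width.
    weights_gb = params_b * (bits / 8)

    if n_layers and n_kv_heads and head_dim:
        # Model preset: exact per-token KV bytes (K and V, 2 bytes each at fp16),
        # later rescaled if the cache is stored at a lower bit width.
        kv_per_token = 2 * n_layers * n_kv_heads * head_dim * 2
    else:
        # Custom mode: sub-linear approximation calibrated for modern GQA models.
        kv_per_token = (params_b ** 0.5) * 40_000
    kv_cache_gb = kv_per_token * context_len * (bits / 16) / 1e9

    # Fine-tuning overhead as a multiple of the weight footprint.
    overhead = {"inference": 0, "lora": 0.6, "qlora": 0.4, "full": 3}[mode]
    training_gb = weights_gb * overhead

    activations_gb = 1 if mode == "inference" else 2
    return weights_gb + kv_cache_gb + training_gb + activations_gb

For example, a 7B model at 16 bits with an 8K context in custom mode works out to roughly 14 + 0.87 + 0 + 1 ≈ 15.9 GB for inference, in the same ballpark as the reading shown above.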

Limitations

These are estimates, not guarantees. Real memory usage depends on framework choice (vLLM, llama.cpp, PyTorch, TensorRT-LLM), batch size, attention variant (FlashAttention, paged attention), and whatever else is running on the GPU. Treat "comfortable fit" as accurate, "tight but works" as "test before you commit," and "won't fit" as a hard wall.

The KV cache estimate assumes the cache is stored at the same precision as the weights, which is common when running 4-bit weights with a 4-bit KV cache to fit long contexts in VRAM. Most production setups keep the KV cache in fp16/bf16 regardless of weight precision; if you're running fp16 KV with 4-bit weights, multiply the reported KV cache by 4.
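
If you want to correct for that in the sketch above, the adjustment is a single rescaling (variable names are illustrative):

# KV cache kept in fp16/bf16 while the weights stay quantized:
# undo the (bits / 16) scaling the calculator applied to the KV term.
kv_cache_fp16_gb = kv_cache_gb * (16 / bits)   # x4 when bits = 4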

DeepSeek V3.1, R1, and V4 use Multi-Head Latent Attention (or its V4 sparse-index successor), which compresses KV substantially — they're listed with their effective per-token KV size. Qwen 3.6 (27B and 35B-A3B) interleaves Gated DeltaNet linear-attention layers with full-attention layers in a 3:1 ratio, so only 1-in-4 layers grow KV with context; the listed sizes reflect that.
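
As a rough sketch of how that interleaving folds into the preset formula from the math reference (variable names follow the formulas above; the 3:1 split is the one just described), only the full-attention layers contribute KV that scales with context:

# Qwen-style 3:1 interleaving: roughly 1 in 4 layers uses full attention,
# and only those layers add KV that grows with context length. The
# linear-attention layers keep a constant-size state, not modeled here.
full_attn_layers = n_layers // 4
kv_per_token     = 2 * full_attn_layers * n_kv_heads * head_dim * 2
kv_cache_gb      = kv_per_token * context_length * (bits / 16) / 1e9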