LLM VRAM Calculator
May 2026 · interactive tool
A back-of-the-envelope calculator for figuring out whether a given language model will fit in your GPU's VRAM — for inference, LoRA, or QLoRA fine-tuning.
This tool assumes ~90% of total VRAM is usable, leaving headroom for CUDA context, the framework runtime, and any other GPU workloads (display, encoder, etc.). For multi-GPU setups, enter the VRAM of a single card if the model fits on one, or the combined VRAM if you're using tensor/pipeline parallelism.
In custom mode, the KV cache is estimated from a sub-linear approximation calibrated for modern GQA models; model presets use the exact per-architecture formula.
Memory breakdown
How it works
Memory usage breaks down into four buckets:
- Model weights — parameters × bytes_per_parameter. A 7B model in bfloat16 (2 bytes) is ~14 GB. For MoE models, all experts must be resident, so total params (not active) drives weight memory.
- KV cache — Attention state cached during inference. Scales linearly with context length but only sub-linearly with model size, since it depends on n_layers × n_kv_heads × head_dim rather than total parameters. Modern GQA architectures (Llama-3+, Qwen 2.5+, Mistral) cache far less per token than legacy MHA at the same parameter count.
- Training overhead — Only present during fine-tuning. ~0.6× weights for LoRA, ~0.4× for QLoRA, ~3× for full fine-tuning (gradients + Adam optimizer states + activations).
- Activations + buffer — A flat 1-2 GB of working memory.
Math reference
weights_gb = params_billions × (bits / 8)
# Model preset (exact, from the selected model's attention architecture):
kv_cache_gb = (2 × n_layers × n_kv_heads × head_dim × 2)
× context_length × (bits / 16) / 1e9
# Custom mode (sub-linear approximation, calibrated for modern GQA):
kv_cache_gb = √params_billions × 40000
× context_length × (bits / 16) / 1e9
training_gb = weights_gb × { 0 inference, 0.6 LoRA, 0.4 QLoRA, 3 full }
activations_gb = 1 if inference else 2
total_gb = weights_gb + kv_cache_gb + training_gb + activations_gb
usable_gb = total_vram_gb × 0.9
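The formulas above can be sketched as a small Python function. This is an illustrative reimplementation, not the tool's actual source; the function and argument names are made up here, and the constants (multipliers, the 40000 calibration factor, the 0.9 headroom) are taken directly from the formulas above.

```python
import math

# Training-overhead multipliers from the math reference.
TRAIN_MULT = {"inference": 0.0, "lora": 0.6, "qlora": 0.4, "full": 3.0}

def kv_cache_gb_exact(n_layers, n_kv_heads, head_dim, context_length, bits):
    # 2 (K and V) x layers x kv heads x head_dim x 2 bytes (fp16 baseline),
    # rescaled by bits/16 for other KV cache precisions.
    return (2 * n_layers * n_kv_heads * head_dim * 2) * context_length * (bits / 16) / 1e9

def kv_cache_gb_custom(params_billions, context_length, bits):
    # Custom mode: sub-linear approximation calibrated for modern GQA models.
    return math.sqrt(params_billions) * 40000 * context_length * (bits / 16) / 1e9

def total_gb(params_billions, bits, context_length, mode="inference", arch=None):
    weights = params_billions * (bits / 8)
    if arch is not None:
        # arch = (n_layers, n_kv_heads, head_dim) from a model preset
        kv = kv_cache_gb_exact(*arch, context_length, bits)
    else:
        kv = kv_cache_gb_custom(params_billions, context_length, bits)
    training = weights * TRAIN_MULT[mode]
    activations = 1.0 if mode == "inference" else 2.0
    return weights + kv + training + activations

def fits(total, vram_gb):
    return total <= vram_gb * 0.9  # ~90% of total VRAM is usable
```

For example, a 7B model in bf16 at an 8192-token context in custom mode comes out to roughly 14 GB of weights plus ~0.87 GB of KV cache plus 1 GB of activations, about 15.9 GB total: a fit on a 24 GB card (21.6 GB usable) but not on a 16 GB one (14.4 GB usable).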
Limitations
These are estimates, not guarantees. Real memory usage depends on framework choice (vLLM, llama.cpp, PyTorch, TensorRT-LLM), batch size, attention variant (FlashAttention, paged attention), and whatever else is running on the GPU. Treat "comfortable fit" as accurate, "tight but works" as "test before you commit," and "won't fit" as a hard wall.
The KV cache estimate assumes the cache is stored at the same precision as the weights — common when running 4-bit weights with a 4-bit KV cache to fit long contexts in VRAM. Most production setups keep the KV cache in fp16/bf16 regardless of weight precision; if you're running fp16 KV with 4-bit weights, multiply the reported KV cache by 4.
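That correction is a one-line rescale. A minimal sketch, assuming the calculator reported the KV cache at weight precision (the function name is illustrative):

```python
def kv_fp16_adjusted(reported_kv_gb, weight_bits):
    # The calculator reports KV at weight precision; if the cache actually
    # stays in fp16/bf16, scale the reported number back up by 16/weight_bits.
    return reported_kv_gb * (16 / weight_bits)
```

With 4-bit weights this multiplies the reported KV cache by 4, matching the rule of thumb above; with 8-bit weights the factor is 2.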
DeepSeek V3.1, R1, and V4 use Multi-Head Latent Attention (or its V4 sparse-index successor), which compresses KV substantially — they're listed with their effective per-token KV size. Qwen 3.6 (27B and 35B-A3B) interleaves Gated DeltaNet linear-attention layers with full-attention layers in a 3:1 ratio, so only 1-in-4 layers grow KV with context; the listed sizes reflect that.