LLM VRAM Calculator
May 2026 · interactive tool
A back-of-the-envelope calculator for figuring out whether a given language model will fit in your GPU's VRAM — for inference, LoRA, or QLoRA fine-tuning.
This tool assumes ~90% of total VRAM is usable, leaving headroom for CUDA context, the framework runtime, and any other GPU workloads (display, encoder, etc.). For multi-GPU setups, enter the VRAM of a single card if the model fits on one, or the combined VRAM if you're using tensor/pipeline parallelism.
In custom mode, the KV cache is estimated from a sub-linear approximation calibrated for modern GQA models; model presets use the exact per-architecture formula.
Memory breakdown
How it works
Memory usage breaks down into four buckets:
- Model weights — parameters × bytes_per_parameter. A 7B model in bfloat16 (2 bytes) is ~14 GB. For MoE models, all experts must be resident, so total params (not active) drives weight memory.
- KV cache — Attention state cached during inference. Scales linearly with context length but only sub-linearly with model size, since it depends on n_layers × n_kv_heads × head_dim rather than total parameters. Modern GQA architectures (Llama-3+, Qwen 2.5+, Mistral) cache far less per token than legacy MHA at the same parameter count.
- Training overhead — Only present during fine-tuning. ~0.6× weights for LoRA, ~0.4× for QLoRA, ~3× for full fine-tuning (gradients + Adam optimizer states + activations).
- Activations + buffer — A flat 1-2 GB of working memory.
Math reference
weights_gb = params_billions × (bits / 8)
# Model preset (exact, from the selected model's attention architecture):
kv_cache_gb = (2 × n_layers × n_kv_heads × head_dim × 2)
× context_length × (bits / 16) / 1e9
# Custom mode (sub-linear approximation, calibrated for modern GQA):
kv_cache_gb = √params_billions × 40000
× context_length × (bits / 16) / 1e9
training_gb = weights_gb × { 0 inference, 0.6 LoRA, 0.4 QLoRA, 3 full }
activations_gb = 1 if inference else 2
total_gb = weights_gb + kv_cache_gb + training_gb + activations_gb
usable_gb = total_vram_gb × 0.9
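The formulas above can be sketched as a small Python function. This is an illustrative reimplementation, not the tool's actual source; the function and argument names are made up here, and the constants (multipliers, the 40000 calibration factor, the 0.9 headroom) are taken directly from the formulas above.

```python
import math

# Training-overhead multipliers from the math reference.
TRAIN_MULT = {"inference": 0.0, "lora": 0.6, "qlora": 0.4, "full": 3.0}

def kv_cache_gb_exact(n_layers, n_kv_heads, head_dim, context_length, bits):
    # 2 (K and V) x layers x kv heads x head_dim x 2 bytes (fp16 baseline),
    # rescaled by bits/16 for other KV cache precisions.
    return (2 * n_layers * n_kv_heads * head_dim * 2) * context_length * (bits / 16) / 1e9

def kv_cache_gb_custom(params_billions, context_length, bits):
    # Custom mode: sub-linear approximation calibrated for modern GQA models.
    return math.sqrt(params_billions) * 40000 * context_length * (bits / 16) / 1e9

def total_gb(params_billions, bits, context_length, mode="inference", arch=None):
    weights = params_billions * (bits / 8)
    if arch is not None:
        # arch = (n_layers, n_kv_heads, head_dim) from a model preset
        kv = kv_cache_gb_exact(*arch, context_length, bits)
    else:
        kv = kv_cache_gb_custom(params_billions, context_length, bits)
    training = weights * TRAIN_MULT[mode]
    activations = 1.0 if mode == "inference" else 2.0
    return weights + kv + training + activations

def fits(total, vram_gb):
    return total <= vram_gb * 0.9  # ~90% of total VRAM is usable
```

For example, a 7B model in bf16 at an 8192-token context in custom mode comes out to roughly 14 GB of weights plus ~0.87 GB of KV cache plus 1 GB of activations, about 15.9 GB total: a fit on a 24 GB card (21.6 GB usable) but not on a 16 GB one (14.4 GB usable).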
Limitations
These are estimates, not guarantees. Real memory usage depends on framework choice (vLLM, llama.cpp, PyTorch, TensorRT-LLM), batch size, attention variant (FlashAttention, paged attention), and whatever else is running on the GPU. Treat "comfortable fit" as accurate, "tight but works" as "test before you commit," and "won't fit" as a hard wall.
The KV cache estimate assumes the cache is stored at the same precision as the weights — common when running 4-bit weights with a 4-bit KV cache to fit long contexts in VRAM. Most production setups keep the KV cache in fp16/bf16 regardless of weight precision; if you're running fp16 KV with 4-bit weights, multiply the reported KV cache by 4.
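That correction is a one-line rescale. A minimal sketch, assuming the calculator reported the KV cache at weight precision (the function name is illustrative):

```python
def kv_fp16_adjusted(reported_kv_gb, weight_bits):
    # The calculator reports KV at weight precision; if the cache actually
    # stays in fp16/bf16, scale the reported number back up by 16/weight_bits.
    return reported_kv_gb * (16 / weight_bits)
```

With 4-bit weights this multiplies the reported KV cache by 4, matching the rule of thumb above; with 8-bit weights the factor is 2.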
DeepSeek V3.1, R1, and V4 use Multi-Head Latent Attention (or its V4 sparse-index successor), which compresses KV substantially — they're listed with their effective per-token KV size. Qwen 3.6 (27B and 35B-A3B) interleaves Gated DeltaNet linear-attention layers with full-attention layers in a 3:1 ratio, so only 1-in-4 layers grow KV with context; the listed sizes reflect that.