The Problem
LLM parameters (weights) dominate memory usage. An 80B parameter model at 32-bit float precision requires ~320GB of RAM, and frontier models rumored to have 1T+ parameters would need multiple terabytes. Quantization compresses weights into smaller integer formats so capable models can run on consumer hardware.
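The arithmetic behind those figures is just parameter count times bytes per parameter. A quick sketch (rough decimal GB, weights only, ignoring activations and KV cache):

```python
# Memory needed just to store the weights, ignoring activations, KV cache, etc.
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    return num_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

for bits in (32, 16, 8, 4):
    print(f"80B params @ {bits:>2}-bit: {weight_memory_gb(80e9, bits):.0f} GB")
# 320 GB, 160 GB, 80 GB, 40 GB
```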
Most LLM weights cluster near zero. This makes quantization effective: the precision loss from rounding such small values is minor, and only a few outlier weights cause trouble.
How It Works
Basic principle: Map floating-point values into a smaller integer range using a scale factor. At inference time, dequantize on the fly before computation.
Symmetric quantization: Scale around zero. Find the largest absolute value and divide it by the maximum integer value to get the scale factor. Simple, but wastes range when the data isn't centered on zero.
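A minimal sketch of symmetric int8 quantization in NumPy (function names are illustrative, not from any particular library):

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int = 8):
    """Map floats to signed integers centered on zero using a single scale factor."""
    qmax = 2 ** (bits - 1) - 1            # 127 for int8
    scale = np.abs(x).max() / qmax        # largest magnitude maps to qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_symmetric(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.normal(0, 0.02, size=4096).astype(np.float32)
q, scale = quantize_symmetric(weights)
restored = dequantize_symmetric(q, scale)
print(f"scale={scale:.6f}  mean abs error={np.abs(weights - restored).mean():.6f}")
```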
Asymmetric quantization: Find the actual data range (min and max) and map it onto the full integer range. Adds a zero-point offset. ~10% lower error than symmetric at the same bit depth.
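The asymmetric variant, sketched the same way: one unsigned integer range plus a zero-point offset so the full [min, max] of the data is used (again, names are illustrative):

```python
import numpy as np

def quantize_asymmetric(x: np.ndarray, bits: int = 8):
    """Map the actual [min, max] range of the data onto the full unsigned range."""
    qmax = 2 ** bits - 1                        # 255 for uint8
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / qmax
    zero_point = int(round(-x_min / scale))     # the integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_asymmetric(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale
```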
Block-wise quantization: Critical in practice. Quantizing all weights at once fails badly because a few outlier values inflate the scale and compress everything else toward zero. Instead, quantize 32–256 weights at a time, with each block getting its own scale (and zero-point, if asymmetric). The per-block scales are stored alongside the weights, so there is a tradeoff: smaller blocks contain outliers better but add more overhead.
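A sketch of why this matters, under the same illustrative setup: a single outlier forces a huge per-tensor scale and rounds typical weights to zero, while block-wise quantization confines the damage to one block (the block size of 32 is arbitrary):

```python
import numpy as np

def quantize_blocks(x: np.ndarray, block_size: int, bits: int = 8):
    """Symmetric quantization with one scale per block of `block_size` weights.
    block_size = len(x) is equivalent to quantizing the whole tensor at once."""
    qmax = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, block_size)          # assumes len(x) % block_size == 0
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_blocks(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

weights = np.random.normal(0, 0.02, size=4096).astype(np.float32)
weights[0] = 8.0                                # one large outlier

for block_size in (4096, 32):                   # whole tensor vs. 32-weight blocks
    q, scales = quantize_blocks(weights, block_size)
    err = np.abs(weights - dequantize_blocks(q, scales)).mean()
    print(f"block_size={block_size:>4}: mean abs error={err:.5f}")
```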
Quality Benchmarks (Qwen 3.5 9B)
| Format | Perplexity | KL Divergence | Benchmark Performance |
|---|---|---|---|
| 16-bit (baseline) | — | 0 | 100% |
| 8-bit symmetric | Near-identical | Very low | Slightly better (noise) |
| 4-bit (various) | Small degradation | Moderate | ~90% |
| 2-bit | Significant | High | ~0% (model breaks) |
The quality cliff exists but isn't linear: 16-bit to 8-bit is nearly free, and 16-bit to 4-bit retains ~90% quality. 2-bit is unusable for most models.
Evaluation Methods
Three ways to measure quantization impact:
- Perplexity — Lower is better. Measures how confident the model is in its token predictions. Simple, but only considers the probability of the correct token.
- KL Divergence — Measures how much the full token probability distribution changes after quantization. More comprehensive than perplexity, but harder to interpret intuitively (no natural scale; only comparable within the same model). Both metrics are sketched in code after this list.
- Benchmarks and vibe testing — Run standard benchmarks (GPQA, MMLU) and just ask questions. Less rigorous, but contextualizes the other scores. For critical use cases, build task-specific evaluations.
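A small sketch of both metrics computed from token probabilities; the arrays are stand-in values, not outputs from a real model:

```python
import numpy as np

def perplexity(correct_token_logprobs: np.ndarray) -> float:
    """exp of the average negative log-probability assigned to the correct tokens."""
    return float(np.exp(-correct_token_logprobs.mean()))

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(P || Q) between the full-precision (P) and quantized (Q) distributions."""
    eps = 1e-12                                  # guard against log(0)
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

# Toy distributions over a 5-token vocabulary at one position.
p_full  = np.array([0.70, 0.15, 0.08, 0.05, 0.02])   # full-precision model
p_quant = np.array([0.65, 0.18, 0.09, 0.05, 0.03])   # quantized model
print(f"KL divergence at this position: {kl_divergence(p_full, p_quant):.4f}")

# Perplexity over a made-up sequence of correct-token probabilities.
logprobs = np.log(np.array([0.70, 0.40, 0.55, 0.25]))
print(f"Perplexity: {perplexity(logprobs):.2f}")
```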
Practical Notes
- 8-bit is almost always worth it — Nearly zero quality loss for a 2x memory reduction and a speed improvement
- 4-bit is a good default for running locally — Tools like llama.cpp offer Q4 variants of most popular models
- Speed improves with quantization (usually) — Less data to move through GPU memory. 8-bit and 4-bit are meaningfully faster than 16-bit on most hardware
- Outlier weights are the enemy — Block-wise quantization exists to contain them; tools like AWQ and GPTQ handle outliers more cleverly
Advanced Methods
- Post-Training Quantization (PTQ) — Applied after training. What most users do. What this page covers.
- Quantization-Aware Training (QAT) — Simulates quantization during training so the model learns weights that quantize well. More expensive, but higher quality output.
- AWQ (Activation-aware Weight Quantization) — Preserves salient weights; better than naive PTQ at 4-bit
- GPTQ — Layer-wise quantization using approximate second-order information; commonly used for 4-bit
- Other efficiency methods: Parameter pruning, knowledge distillation
Tooling
- llama.cpp — De facto standard for running quantized models locally. GGUF format. Provides perplexity, KL divergence, and speed benchmarking tools.
- Ollama — User-friendly wrapper around llama.cpp
- HuggingFace — Hub for quantized model variants
Sources
- "Quantization from the ground up" — Sam Rose (Apr 2026) (link)