Model Quantization: Smaller, Faster, Almost as Good
Quantization shrinks a model by storing its numbers with less precision. Done well, it cuts memory and cost dramatically while barely touching quality. Here are the intuition and the tradeoffs.
A large language model is, at bottom, a vast collection of numbers — its weights. By default each of those numbers is stored at high precision, which is accurate but expensive: the model takes a great deal of memory, and moving all those numbers around is much of what makes inference slow and costly. Quantization is the deceptively simple idea that you do not need all that precision, and that storing the numbers more coarsely buys you enormous savings for surprisingly little loss.
The core intuition
Think of a weight as a measurement. You could record someone’s height as 1.823456 metres, but for almost every purpose 1.82 metres is just as useful and takes far less to write down. Quantization does the same to a model’s weights: it maps high-precision numbers onto a much smaller set of possible values. Go from 16 bits per weight down to 8 and the model’s memory footprint falls by half; go down to 4 and it falls by three-quarters. A model that demanded expensive high-memory hardware suddenly fits on something far cheaper.
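To make the mapping concrete, here is a minimal sketch of one common scheme, symmetric "absmax" quantization with a single scale for the whole tensor. The function names and the five example weights are illustrative assumptions, not taken from any particular library, and production methods typically use finer-grained scales.

```python
def quantize(weights, bits):
    """Map floats onto signed integer codes plus one shared scale (symmetric absmax)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest grid point and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.82, -0.31, 0.05, -1.27, 0.44]     # made-up example values

for bits in (8, 4):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes: {codes}  worst error: {worst:.4f}")
```

Only the small integer codes and the one scale need to be stored, which is where the memory saving comes from. The worst-case rounding error is half a grid step, so it grows as the number of bits shrinks.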
Why a little imprecision rarely hurts
The surprising part is how little quality you lose. Neural networks are robust to small perturbations in their weights: training is noisy and leaves a great deal of redundancy, so very few individual weights are critical. The small rounding errors that quantization introduces mostly average out across billions of weights. For 8-bit quantization the quality drop is often imperceptible; at 4 bits it becomes measurable but, with the right techniques, remains small enough that the cost savings clearly win for most applications.
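The averaging-out effect can be seen in a small simulation. This is a sketch under loud assumptions: the weights are synthetic Gaussian samples rather than weights from a real model, and the rounding is the simplest symmetric scheme. Real weight distributions have heavier tails, which is part of why 4-bit needs more careful methods.

```python
import random


def fake_quantize(weights, bits):
    """Round each weight to a symmetric grid, then map it straight back to a float."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


rng = random.Random(0)                          # fixed seed for repeatability
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]   # synthetic weights

report = {}
for bits in (8, 4):
    approx = fake_quantize(weights, bits)
    errors = [q - w for w, q in zip(weights, approx)]
    mean_abs_error = sum(abs(e) for e in errors) / len(errors)
    net_drift = abs(sum(errors)) / len(errors)  # signed errors largely cancel
    report[bits] = (mean_abs_error, net_drift)
    print(f"{bits}-bit: mean |error| = {mean_abs_error:.6f}, net drift = {net_drift:.6f}")
```

The per-weight error is larger at 4 bits than at 8, but in both cases the net drift is far smaller than the typical individual error, because positive and negative rounding errors largely cancel.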
The tradeoff curve
Quantization is a dial, not a switch, and the dial has a knee:
- 8-bit is close to free: large memory savings, quality almost indistinguishable from the original. This is the easy default for most deployments.
- 4-bit roughly halves the footprint again and is where much of the excitement is. It needs more careful methods to preserve quality, but those methods are now mature and widely used.
- Below 4-bit, quality degrades faster and the techniques get exotic. This is the research frontier, worth it only when memory is the binding constraint.
The right setting is the most aggressive one that still passes your evaluation. That last clause matters: never quantize on faith.
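That rule can be written down directly. In this sketch the scores are hypothetical placeholders for numbers you would measure on your own evaluation, and `max_drop` is an assumed tolerance you would set for your application.

```python
def most_aggressive_passing(scores_by_bits, baseline_bits=16, max_drop=0.01):
    """Return the smallest bit width whose score stays within max_drop of the baseline."""
    baseline = scores_by_bits[baseline_bits]
    passing = [
        bits
        for bits, score in scores_by_bits.items()
        if baseline - score <= max_drop
    ]
    return min(passing)      # the baseline itself always passes, so this is never empty


# Hypothetical evaluation scores per bit width, higher is better.
measured = {16: 0.812, 8: 0.810, 4: 0.801, 3: 0.764}

chosen = most_aggressive_passing(measured, max_drop=0.015)
print(f"Deploy at {chosen} bits")
```

With a tighter tolerance the same data would select 8 bits instead, which is the point: the answer comes from the measurement and the tolerance, not from the bit width's reputation.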
Where the savings show up
The payoff is concrete. A quantized model fits on smaller, cheaper accelerators, so your hardware bill drops. It often runs faster, because memory bandwidth, not raw compute, is the bottleneck for much LLM inference, and a quantized model moves fewer bits through memory for every token. And it makes on-device deployment possible: a 4-bit model can run on a phone or laptop where the full-precision version never could, unlocking the privacy and latency benefits of local inference.
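A back-of-envelope calculation shows the scale of both effects. The 7-billion-parameter model and the 1 TB/s memory bandwidth below are hypothetical round numbers. The speed figure assumes every weight is read once per generated token and ignores the KV cache, activations, quantization metadata, and batching, so treat it as a rough ceiling rather than a prediction.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights alone."""
    return num_params * bits_per_weight / 8


def max_tokens_per_second(num_params, bits_per_weight, bandwidth_bytes_per_s):
    """Ceiling on decode speed if each token requires reading every weight once."""
    return bandwidth_bytes_per_s / weight_bytes(num_params, bits_per_weight)


NUM_PARAMS = 7e9       # hypothetical model size
BANDWIDTH = 1e12       # hypothetical accelerator: 1 TB/s memory bandwidth

for bits in (16, 8, 4):
    gigabytes = weight_bytes(NUM_PARAMS, bits) / 1e9
    ceiling = max_tokens_per_second(NUM_PARAMS, bits, BANDWIDTH)
    print(f"{bits:>2}-bit: {gigabytes:.1f} GB of weights, at most ~{ceiling:.0f} tokens/s")
```

Halving the bits halves the bytes and doubles the bandwidth-bound ceiling. Real speedups are usually smaller, since they depend on kernel support and the cost of converting weights back to a higher precision for the arithmetic.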
Measure, do not assume
Quantization’s effect is task-dependent. A model that holds up beautifully on summarisation might lose accuracy exactly where your application needs it most: on the arithmetic, the rare edge case, the long-context reasoning. The discipline is simple: quantize, then run your real evaluation suite against the smaller model and compare the results with the original, task by task. Treat the precision level as one more parameter to tune against measured quality, not a setting to choose from a blog post. When it passes, you have made your system cheaper and faster for essentially nothing, one of the rare free lunches in production AI.
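A per-task comparison is easy to automate. The sketch below assumes you already have a score per task for both models; the task names, the scores, and the tolerance are all hypothetical.

```python
def regressions(baseline, quantized, tolerance=0.01):
    """Return {task: drop} for each task where the quantized model lost more than tolerance."""
    drops = {task: baseline[task] - quantized[task] for task in baseline}
    return {task: drop for task, drop in drops.items() if drop > tolerance}


# Hypothetical per-task scores, higher is better.
full_precision = {"summarisation": 0.840, "arithmetic": 0.710, "long_context": 0.650}
four_bit = {"summarisation": 0.838, "arithmetic": 0.660, "long_context": 0.610}

failed = regressions(full_precision, four_bit)
if failed:
    for task, drop in sorted(failed.items()):
        print(f"{task}: dropped by {drop:.3f}")
else:
    print("No task regressed beyond tolerance")
```

Checking each task separately matters because an average over all tasks can hide a sharp drop on the one your application depends on.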