vector quantization
Compressing high-dimensional vectors into shorter codes (product, scalar, or binary) so that ANN indexes fit in cache or memory; trades a small recall cost for a large memory reduction.
Vector quantization compresses high-dimensional vectors into shorter codes so that ANN indexes occupy a fraction of the original memory. The motivation is simple arithmetic: a 768-dim float32 vector is 3 KB, so 100 million vectors is ~300 GB uncompressed, too big for any realistic DRAM budget and catastrophically slow when paged. Quantization pushes that down to single-digit GB with acceptable recall loss.
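The arithmetic above can be checked directly; the 96-byte figure assumes the PQ configuration discussed below (M=96 subvectors at 8 bits each), which is illustrative, not prescriptive:

```python
# Memory footprint of 100M 768-dim float32 vectors, raw vs. PQ-compressed.
DIM, N = 768, 100_000_000

raw_bytes_per_vec = DIM * 4              # float32 = 4 bytes per dimension
raw_total_gb = N * raw_bytes_per_vec / 1e9

pq_bytes_per_vec = 96                    # M=96 codes x 8 bits = 96 bytes
pq_total_gb = N * pq_bytes_per_vec / 1e9

print(raw_bytes_per_vec)  # 3072 bytes, i.e. 3 KB per vector
print(raw_total_gb)       # 307.2 GB raw
print(pq_total_gb)        # 9.6 GB after PQ: a 32x reduction
```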
The main techniques:
- Product Quantization (PQ): split each vector into M subvectors, run k-means on each subvector independently, and replace each subvector with the index of its nearest centroid. A 768-dim float32 vector (3 KB) becomes 96 bytes at M=96, nbits=8. Distance computation uses precomputed lookup tables, which are fast on modern CPUs.
- Scalar Quantization (SQ): quantize each float32 dimension into int8 or int4 based on a per-dimension range. Simpler and faster than PQ, typically ~4x memory reduction with small recall loss.
- Binary quantization: binarize each dimension (sign bit); distance via Hamming (popcount). 32x memory reduction but significant recall loss — usually a first-pass filter followed by a re-rank on the true vectors.
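The PQ pipeline (per-subvector k-means, code assignment, LUT-based asymmetric distance) can be sketched in a few dozen lines. This is a toy numpy version for intuition, with made-up function names and a naive k-means; real systems use optimized libraries such as Faiss:

```python
import numpy as np

def train_pq(X, M, k=256, iters=10, seed=0):
    """Train one k-means codebook per subvector block. Returns (M, k, d//M)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ds = d // M
    codebooks = np.empty((M, k, ds), dtype=np.float32)
    for m in range(M):
        sub = X[:, m * ds:(m + 1) * ds]
        cent = sub[rng.choice(n, k, replace=False)]      # init from data points
        for _ in range(iters):
            # assign each subvector to its nearest centroid, recompute means
            d2 = ((sub[:, None, :] - cent[None]) ** 2).sum(-1)
            assign = d2.argmin(1)
            for j in range(k):
                pts = sub[assign == j]
                if len(pts):
                    cent[j] = pts.mean(0)
        codebooks[m] = cent
    return codebooks

def pq_encode(X, codebooks):
    """Replace each subvector with the uint8 index of its nearest centroid."""
    M, k, ds = codebooks.shape
    codes = np.empty((X.shape[0], M), dtype=np.uint8)
    for m in range(M):
        sub = X[:, m * ds:(m + 1) * ds]
        d2 = ((sub[:, None, :] - codebooks[m][None]) ** 2).sum(-1)
        codes[:, m] = d2.argmin(1)
    return codes

def adc_distances(q, codes, codebooks):
    """Asymmetric distance: build a (M, k) LUT of squared distances from the
    query's subvectors to every centroid, then score each code by table lookup."""
    M, k, ds = codebooks.shape
    lut = np.empty((M, k), dtype=np.float32)
    for m in range(M):
        lut[m] = ((codebooks[m] - q[m * ds:(m + 1) * ds]) ** 2).sum(-1)
    return lut[np.arange(M), codes].sum(1)

# Toy usage: 16-dim vectors, M=4 subvectors, tiny k=16 codebooks.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 16)).astype(np.float32)
cb = train_pq(X, M=4, k=16)
codes = pq_encode(X, cb)                  # 1000 x 4 bytes instead of 1000 x 64
dists = adc_distances(X[0], codes, cb)
print(dists.argmin())                     # 0: the query finds itself
```

Note that search never decompresses the database: each candidate costs M table lookups and adds, which is the property the LUT remark above refers to.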
Trade-offs:
- Quantization is recall-reducing. The cost is workload-dependent; for typical RAG embeddings, PQ at M=96, nbits=8 loses 2–5% recall@10.
- Decode cost matters: PQ LUTs, SQ scale-and-shift, binary popcount. Modern CPUs handle all three in AVX2/AVX-512, which is why quantization is usually a throughput win despite the extra compute.
- Quantization pairs with structural indexes: IVF-PQ, HNSW-PQ. The quantization shrinks the payload; the index does the search.
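The SQ scale-and-shift decode mentioned in the trade-offs is a one-liner in each direction. A minimal per-dimension int8 sketch, with illustrative function names (symmetric variants and int4 packing omitted):

```python
import numpy as np

def sq_train(X):
    """Per-dimension min/max range; returns (scale, offset) for uint8 mapping."""
    lo, hi = X.min(0), X.max(0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0                  # guard against constant dimensions
    return scale, lo

def sq_encode(X, scale, lo):
    # map each dimension onto [0, 255] and round to the nearest code
    return np.clip(np.round((X - lo) / scale), 0, 255).astype(np.uint8)

def sq_decode(codes, scale, lo):
    # the "scale + shift" decode: one multiply and one add per dimension
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 768)).astype(np.float32)
scale, lo = sq_train(X)
codes = sq_encode(X, scale, lo)
X_hat = sq_decode(codes, scale, lo)

print(codes.nbytes / X.nbytes)               # 0.25: the ~4x reduction
print(float(np.abs(X - X_hat).max()))        # error bounded by half a step
```

The reconstruction error per dimension is at most half the quantization step, which is why SQ's recall loss stays small when the per-dimension ranges are tight.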
The modern vector-DB memory equation is structural index + quantization; picking either alone is leaving an order of magnitude on the table.