vector quantization
Compressing high-dimensional vectors into shorter codes (product, scalar, or binary) so that ANN indexes fit in cache or memory; trades a small recall cost for a large memory reduction.
Vector quantization compresses high-dimensional vectors into shorter codes so that ANN indexes occupy a fraction of the original memory. The motivation is simple arithmetic: a 768-dim float32 vector is 3 KB, so 100 million vectors is ~300 GB uncompressed, too big for any realistic DRAM budget and catastrophically slow when paged. Quantization pushes that down to single-digit GB with acceptable recall loss.
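The arithmetic above can be checked directly; the 96-byte figure assumes the PQ configuration discussed below (M=96 subvectors at 8 bits each), which is illustrative, not prescriptive:

```python
# Memory footprint of 100M 768-dim float32 vectors, raw vs. PQ-compressed.
DIM, N = 768, 100_000_000

raw_bytes_per_vec = DIM * 4              # float32 = 4 bytes per dimension
raw_total_gb = N * raw_bytes_per_vec / 1e9

pq_bytes_per_vec = 96                    # M=96 codes x 8 bits = 96 bytes
pq_total_gb = N * pq_bytes_per_vec / 1e9

print(raw_bytes_per_vec)  # 3072 bytes, i.e. 3 KB per vector
print(raw_total_gb)       # 307.2 GB raw
print(pq_total_gb)        # 9.6 GB after PQ: a 32x reduction
```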
The main techniques:
- Product Quantization (PQ): split each vector into M subvectors, run k-means on each subvector independently, and replace each subvector with the index of its nearest centroid. A 768-dim float32 vector (3 KB) becomes 96 bytes at M=96, nbits=8. Distance computation uses precomputed lookup tables, which are fast on modern CPUs.
- Scalar Quantization (SQ): quantize each float32 dimension into int8 or int4 based on a per-dimension range. Simpler and faster than PQ, typically ~4x memory reduction with small recall loss.
- Binary quantization: binarize each dimension (sign bit); distance via Hamming (popcount). 32x memory reduction but significant recall loss — usually a first-pass filter followed by a re-rank on the true vectors.
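The PQ pipeline (per-subvector k-means, code assignment, LUT-based asymmetric distance) can be sketched in a few dozen lines. This is a toy numpy version for intuition, with made-up function names and a naive k-means; real systems use optimized libraries such as Faiss:

```python
import numpy as np

def train_pq(X, M, k=256, iters=10, seed=0):
    """Train one k-means codebook per subvector block. Returns (M, k, d//M)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ds = d // M
    codebooks = np.empty((M, k, ds), dtype=np.float32)
    for m in range(M):
        sub = X[:, m * ds:(m + 1) * ds]
        cent = sub[rng.choice(n, k, replace=False)]      # init from data points
        for _ in range(iters):
            # assign each subvector to its nearest centroid, recompute means
            d2 = ((sub[:, None, :] - cent[None]) ** 2).sum(-1)
            assign = d2.argmin(1)
            for j in range(k):
                pts = sub[assign == j]
                if len(pts):
                    cent[j] = pts.mean(0)
        codebooks[m] = cent
    return codebooks

def pq_encode(X, codebooks):
    """Replace each subvector with the uint8 index of its nearest centroid."""
    M, k, ds = codebooks.shape
    codes = np.empty((X.shape[0], M), dtype=np.uint8)
    for m in range(M):
        sub = X[:, m * ds:(m + 1) * ds]
        d2 = ((sub[:, None, :] - codebooks[m][None]) ** 2).sum(-1)
        codes[:, m] = d2.argmin(1)
    return codes

def adc_distances(q, codes, codebooks):
    """Asymmetric distance: build a (M, k) LUT of squared distances from the
    query's subvectors to every centroid, then score each code by table lookup."""
    M, k, ds = codebooks.shape
    lut = np.empty((M, k), dtype=np.float32)
    for m in range(M):
        lut[m] = ((codebooks[m] - q[m * ds:(m + 1) * ds]) ** 2).sum(-1)
    return lut[np.arange(M), codes].sum(1)

# Toy usage: 16-dim vectors, M=4 subvectors, tiny k=16 codebooks.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 16)).astype(np.float32)
cb = train_pq(X, M=4, k=16)
codes = pq_encode(X, cb)                  # 1000 x 4 bytes instead of 1000 x 64
dists = adc_distances(X[0], codes, cb)
print(dists.argmin())                     # 0: the query finds itself
```

Note that search never decompresses the database: each candidate costs M table lookups and adds, which is the property the LUT remark above refers to.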
Trade-offs:
- Quantization is recall-reducing. The cost is workload-dependent; for typical RAG embeddings, PQ at M=96, nbits=8 loses 2–5% recall@10.
- Decode cost matters: PQ LUTs, SQ scale-and-shift, binary popcount. Modern CPUs handle all three in AVX2/AVX-512, which is why quantization is usually a throughput win despite the extra compute.
- Quantization pairs with structural indexes: IVF-PQ, HNSW-PQ. The quantization shrinks the payload; the index does the search.
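The SQ scale-and-shift decode mentioned in the trade-offs is a one-liner in each direction. A minimal per-dimension int8 sketch, with illustrative function names (symmetric variants and int4 packing omitted):

```python
import numpy as np

def sq_train(X):
    """Per-dimension min/max range; returns (scale, offset) for uint8 mapping."""
    lo, hi = X.min(0), X.max(0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0                  # guard against constant dimensions
    return scale, lo

def sq_encode(X, scale, lo):
    # map each dimension onto [0, 255] and round to the nearest code
    return np.clip(np.round((X - lo) / scale), 0, 255).astype(np.uint8)

def sq_decode(codes, scale, lo):
    # the "scale + shift" decode: one multiply and one add per dimension
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 768)).astype(np.float32)
scale, lo = sq_train(X)
codes = sq_encode(X, scale, lo)
X_hat = sq_decode(codes, scale, lo)

print(codes.nbytes / X.nbytes)               # 0.25: the ~4x reduction
print(float(np.abs(X - X_hat).max()))        # error bounded by half a step
```

The reconstruction error per dimension is at most half the quantization step, which is why SQ's recall loss stays small when the per-dimension ranges are tight.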
The modern vector-DB memory equation is structural index + quantization; picking either alone is leaving an order of magnitude on the table.