HNSW-RaBitQ Index
HNSW with RaBitQ Quantization
An advanced graph-based index that combines the HNSW graph structure with the RaBitQ quantization algorithm, delivering dramatically lower memory usage while maintaining state-of-the-art search quality.
Platform Requirement: HNSW-RaBitQ is currently supported only on x86_64 with AVX2 or higher instruction set support. It is not available on ARM architectures.
How It Works
HNSW-RaBitQ combines two techniques to achieve high recall with minimal memory:
- HNSW graph for navigation
- Same multi-layer graph structure as the standard HNSW index: sparse upper layers for fast long-range jumps, dense lower layers for fine-grained local search.
- RaBitQ for distance estimation
- During training, the algorithm clusters vectors, learns centroids, and applies a random rotation. Each vector is then quantized into a compact binary representation (configurable from 1 to 9 bits per dimension).
- During search, the query vector is transformed using the same rotation, and distances to candidate nodes are estimated using SIMD instructions on the quantized data instead of full FP32 distance computations.
- The key insight: graph construction uses original FP32 vectors for maximum accuracy, while search uses quantized vectors for speed and low memory. This separation ensures graph quality is never compromised by quantization.
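The quantize-then-estimate idea above can be sketched in a few lines. The snippet below is a deliberately simplified illustration (1-bit sign codes after a random rotation), not the actual RaBitQ encoding, which also stores per-vector scale factors and supports multi-bit codes:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# A random orthonormal rotation, shared between indexing and querying.
# RaBitQ applies such a rotation so quantization error spreads evenly
# across dimensions; this sketch then keeps only 1 bit (the sign) per dim.
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

def quantize(v):
    """1-bit code: the sign pattern of the rotated vector."""
    return (rotation @ v) > 0

def estimated_similarity(query, code):
    """Fraction of dimensions whose rotated signs agree with the stored
    code; monotonically related to the angle between the two vectors."""
    return np.mean(quantize(query) == code)

x = rng.standard_normal(dim)               # indexed vector
near = x + 0.1 * rng.standard_normal(dim)  # small perturbation of x
far = rng.standard_normal(dim)             # unrelated vector

code = quantize(x)  # this compact code is stored instead of FP32 data
print(estimated_similarity(near, code), estimated_similarity(far, code))
```

Even at 1 bit per dimension, the sign-agreement score cleanly separates a near-duplicate of the indexed vector from an unrelated one, which is why quantized estimates are good enough to drive graph traversal.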
Why This Matters
In a standard HNSW index, every search step requires loading full FP32 vectors to compute exact distances. For 768-dimensional vectors, that's 3 KB per node, and a single query may visit thousands of nodes.
With RaBitQ quantization at 7 bits per dimension, each vector shrinks to roughly 672 bytes, a ~4.6x reduction. At the extreme end, 1-bit quantization compresses each vector to just 96 bytes, a ~32x reduction. Since HNSW graph traversal is memory-bound, smaller vectors translate directly to faster search and a lower memory footprint.
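The arithmetic behind these figures is straightforward:

```python
def quantized_bytes(dim, bits):
    """Storage for one vector quantized at `bits` bits per dimension."""
    return dim * bits // 8

dim = 768
fp32_bytes = dim * 4  # 4 bytes per FP32 dimension: 3072 B, i.e. 3 KB per node
for bits in (7, 1):
    q = quantized_bytes(dim, bits)
    print(f"{bits} bits/dim: {q} B per vector ({fp32_bytes / q:.1f}x smaller)")
```

For 768 dimensions this prints 672 B at 7 bits (~4.6x) and 96 B at 1 bit (32x), matching the figures above.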
When to Use HNSW-RaBitQ?
- ✅ Production systems needing fast search and high recall with controlled memory budgets
- ✅ Massive high-dimensional datasets: billion-scale vectors with 1536+ dimensions that would consume terabytes of RAM in FP32 format
- ✅ Workloads on x86_64 servers with AVX2/AVX-512 support
Best Practice: Use HNSW-RaBitQ when you want HNSW-quality search without the memory overhead.
The total_bits parameter controls the accuracy-memory trade-off. Recall figures reported in the paper show that 7 bits achieves ~99% recall, 5 bits ~95%, and 4 bits ~90%, all without raw-vector reranking. Going as low as 1 bit maximizes compression at the cost of lower recall.
Advantages
- ✨ Dramatically lower memory: quantized vectors are up to 32x smaller than FP32, reducing the active index size
- ✨ Fast distance estimation: RaBitQ estimates similarity metrics efficiently using bitwise operations
- ✨ Promising recall without re-ranking: graph construction uses the original vectors, preserving graph quality, and RaBitQ provides an asymptotically optimal error bound, so quantized distance estimates order candidates reliably
Trade-offs
- ⚠️ x86_64 only: requires AVX2 or AVX-512; ARM is not supported
- ⚠️ Training overhead: requires a KMeans training step before index construction, adding to build time
- ⚠️ Dimension constraints: only supports vectors between 64 and 4095 dimensions
Key Parameters
Tuning Tip:
Start with the defaults (total_bits=7, num_clusters=16). Adjust ef first for recall/latency trade-offs at query time. Only reduce total_bits if you need to cut memory further and can tolerate slightly lower recall.
Index-Time Parameters
| Parameter | Description | Tuning Guidance |
|---|---|---|
| metric_type | Similarity metric used to compare vectors | Choose based on how your embeddings were trained |
| m | Max neighbors per node. The maximum number of bidirectional links created for each node during graph construction | • Higher m → ✨ better recall and graph connectivity ⚠️ more memory usage and higher latency for both indexing and search |
| ef_construction | Index-time candidate pool size. Determines how many neighboring candidates the algorithm considers when inserting a new vector into the graph | • Higher ef_construction → ✨ better graph quality and higher recall ⚠️ longer index build time (does not affect query speed) |
| total_bits | RaBitQ quantization bits per dimension. Controls the precision of the binary encoding | Controls the accuracy-memory trade-off. Lower values save more memory but reduce accuracy |
| num_clusters | Number of KMeans clusters. Used during the RaBitQ training phase to partition the vector space | • More clusters can capture finer distribution patterns • Higher values increase recall slightly |
| sample_count | Training sample count. Number of vectors sampled for KMeans training (0 = use all vectors) | Default is 0. Set a smaller value (e.g. 5,000,000) to speed up training and reduce memory usage on very large datasets |
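Putting the index-time parameters together, a configuration following the tuning tip above might look like the mapping below. Only the parameter names and the documented defaults (total_bits=7, num_clusters=16, sample_count=0) come from this page; the m, ef_construction, and metric-string values are illustrative assumptions, and the exact client call that consumes such a mapping depends on your deployment:

```python
# Illustrative index-time configuration. Parameter names are from the
# table above; m, ef_construction, and the metric string are example
# choices, not documented defaults.
index_params = {
    "metric_type": "IP",     # inner product; match your embeddings' training
    "m": 32,                 # max bidirectional links per node (example value)
    "ef_construction": 200,  # index-time candidate pool size (example value)
    "total_bits": 7,         # default: ~99% recall without reranking
    "num_clusters": 16,      # default KMeans cluster count
    "sample_count": 0,       # 0 = train KMeans on all vectors
}
print(index_params)
```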
Query-Time Parameters
| Parameter | Description | Tuning Guidance |
|---|---|---|
| ef | Query-time candidate pool size. Determines how many potential neighbors are explored at each step during graph traversal at query time | • Higher ef → ✨ higher recall ⚠️ higher query latency |
| radius | Distance (similarity) threshold for range-based filtering; only documents satisfying the threshold are returned | Example: • With inner product (MetricType.IP), set radius=0.6 to keep only results with score > 0.6 ✅ Use when: you want to filter out low-quality matches 🚫 Skip when: you want all top-k results, regardless of quality |
| is_linear | Forces a brute-force linear search instead of using the index | 🐌 Very slow for large datasets! ✅ Only use for: debugging, tiny collections, or verifying index accuracy |
| is_using_refiner | Enables exact score refinement: recomputes exact FP32 distances for top candidates after quantized search | ✅ Turn on: when you need maximum precision ⚠️ Adds latency due to full-precision re-scoring |
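A matching query-time configuration could be sketched the same way. Again, only the parameter names come from the table above; the ef value is an illustrative starting point, and radius=0.6 mirrors the inner-product example in the table:

```python
# Illustrative query-time configuration (names from the table above;
# values are example choices, not documented defaults).
query_params = {
    "ef": 100,                 # raise for higher recall, at higher latency
    "radius": 0.6,             # with IP, keep only results with score > 0.6
    "is_linear": False,        # brute-force scan only for debugging
    "is_using_refiner": True,  # re-score top candidates with exact FP32 distances
}
print(query_params)
```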
HNSW vs HNSW-RaBitQ
| Aspect | HNSW | HNSW-RaBitQ |
|---|---|---|
| Quantization method | Standard quantization via quantize_type (supports INT8, FP16, INT4) | Built-in RaBitQ binary quantization (1 to 9 bits per dimension) |
| Memory per vector (768-dim) | ~3 KB (FP32), ~1.5 KB (FP16), ~768 B (INT8), ~384 B (INT4) | ~864 B (9-bit) to ~96 B (1-bit) |
| Search distance computation | Exact FP32 (or quantized integer/fp16 arithmetic when quantized) | Fast distance estimation via RaBitQ |
| Recall (typical) | ~99%+ (FP32), slightly lower when quantized | Up to 99% (tunable via total_bits) |
| Graph construction quality | FP32 or quantized | FP32 |
| Refiner support | ✅ Recomputes exact FP32 distances for quantized results | ✅ Same: recomputes exact FP32 distances |
| Platform | All supported platforms | x86_64 only (AVX2 or higher required) |
| Best for | General-purpose production; use with INT8/FP16 for moderate memory savings | Maximum compression with RaBitQ-specific optimizations on x86_64 |
How to choose: If you need cross-platform compatibility, use the standard HNSW index with quantize_type set to INT8 or FP16, which already provides significant memory savings. Choose HNSW-RaBitQ when you are on x86_64 and want RaBitQ's specialized binary quantization, which can offer better accuracy-compression trade-offs at comparable memory levels.