HNSW-RaBitQ Index

HNSW with RaBitQ Quantization

An advanced graph-based index that combines the HNSW graph structure with the RaBitQ quantization algorithm — delivering dramatically lower memory usage while maintaining state-of-the-art search quality.

Platform Requirement: HNSW-RaBitQ is currently supported only on x86_64 with AVX2 or higher instruction set support. It is not available on ARM architectures.

How It Works

HNSW-RaBitQ combines two techniques to achieve high recall with minimal memory:

  • HNSW graph for navigation πŸͺœ
    • Same multi-layer graph structure as the standard HNSW index β€” sparse upper layers for fast long-range jumps, dense lower layers for fine-grained local search.
  • RaBitQ for distance estimation πŸ”
    1. During training, the algorithm clusters vectors and learns centroids. Each vector is then quantized into a compact binary representation (as few as configurable 1–9 bits per dimension).
    2. During search, the query vector is transformed using the same rotation, and distances to candidate nodes are estimated using SIMD instructions on the quantized data β€” instead of full FP32 distance computations.
    3. The key insight: graph construction uses original FP32 vectors for maximum accuracy, while search uses quantized vectors for speed and low memory. This separation ensures graph quality is never compromised by quantization.
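The rotate-then-binarize encoding of step 1 and the estimator of step 2 can be sketched in a few lines of NumPy. This is an illustrative toy at 1 bit per dimension (no KMeans clustering, no bit packing, no SIMD), not the engine's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy dimensionality

# Shared random rotation: the Q factor of a Gaussian matrix is orthogonal.
P, _ = np.linalg.qr(rng.standard_normal((d, d)))

def encode(v):
    """1-bit code: sign bits of the rotated unit vector, plus a stored
    per-vector scale that debiases the distance estimator."""
    r = P @ (v / np.linalg.norm(v))
    code = r > 0                                # 1 bit per dimension
    c = np.where(code, 1.0, -1.0) / np.sqrt(d)  # unit-norm code direction
    return code, r @ c                          # (bits, scale)

def estimate_ip(code, scale, q):
    """Estimate <v/|v|, q> from the stored bits and scale only."""
    c = np.where(code, 1.0, -1.0) / np.sqrt(d)
    return scale * (c @ (P @ q))

# The estimate tracks the exact inner product across many vectors.
q = rng.standard_normal(d)
vs = rng.standard_normal((500, d))
exact = np.array([(v / np.linalg.norm(v)) @ q for v in vs])
est = np.array([estimate_ip(*encode(v), q) for v in vs])
print(np.corrcoef(exact, est)[0, 1])  # strongly correlated
```

Even at 1 bit per dimension the estimated and exact inner products are strongly correlated, which is why candidate ordering during graph traversal remains reliable.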

Why This Matters

In a standard HNSW index, every search step requires loading full FP32 vectors to compute exact distances. For 768-dimensional vectors, that's 3 KB per node — and a single query may visit thousands of nodes.

With RaBitQ quantization at 7 bits per dimension, each vector shrinks to roughly 672 bytes — a ~4.6x reduction. At the extreme end, 1-bit quantization compresses each vector to just 96 bytes — a ~32x reduction. Since HNSW graph traversal is memory-bound, smaller vectors translate directly to faster search and a lower memory footprint.
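The arithmetic behind these figures is simple: dimensions times bits per dimension, divided by 8. A quick check of the numbers quoted above (code sizes only, ignoring graph overhead):

```python
def code_bytes(dim: int, bits_per_dim: int) -> float:
    """Size of one quantized vector code in bytes (codes only, no graph overhead)."""
    return dim * bits_per_dim / 8

fp32 = code_bytes(768, 32)   # 3072.0 bytes (~3 KB)
b7   = code_bytes(768, 7)    # 672.0 bytes
b1   = code_bytes(768, 1)    # 96.0 bytes
print(fp32 / b7, fp32 / b1)  # ~4.57x and 32.0x reductions
```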

When to Use HNSW-RaBitQ?

  • βœ… Production systems needing fast search and high recall with controlled memory budgets
  • βœ… Massive high-dimensional datasets β€” billion-scale vectors with 1536+ dimensions that would consume terabytes of RAM in FP32 format
  • βœ… Workloads on x86_64 servers with AVX2/AVX-512 support

Best Practice: Use HNSW-RaBitQ when you want HNSW-quality search without the memory overhead.

The total_bits parameter controls the accuracy–memory trade-off. Recall figures reported in the RaBitQ paper show that 7 bits achieves ~99% recall, 5 bits ~95%, and 4 bits ~90% — all without raw-vector reranking. Going as low as 1 bit maximizes compression at the cost of lower recall.
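One practical way to apply this trade-off is to pick the largest total_bits whose codes still fit a given memory budget. The helper below is purely illustrative (pick_total_bits is not part of any API) and counts code bytes only:

```python
def pick_total_bits(dim: int, num_vectors: int, budget_bytes: float):
    """Largest supported total_bits (9 down to 1) whose quantized codes fit
    the budget. The HNSW graph itself needs additional memory on top."""
    for bits in range(9, 0, -1):
        if num_vectors * dim * bits / 8 <= budget_bytes:
            return bits
    return None

# 100M 768-dim vectors under a 40 GiB budget for codes:
print(pick_total_bits(768, 100_000_000, 40 * 2**30))  # → 4
```

With that budget, 4 bits fits (~38.4 GB of codes) while 5 bits (~48 GB) does not, so the helper would suggest accepting ~90% recall or raising the budget.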

Advantages

  1. ✨ Dramatically lower memory — Quantized vectors are up to 32x smaller than FP32, shrinking the active index size
  2. ✨ Fast distance estimation — RaBitQ estimates similarity metrics with high efficiency using bitwise SIMD operations on the compact codes
  3. ✨ Promising recall without re-ranking — Graph construction uses the original vectors, preserving graph quality, and RaBitQ provides an asymptotically optimal error bound for reliable candidate ordering and reranking

Trade-offs

  1. ⚠️ x86_64 only — Requires AVX2 or AVX-512; ARM is not supported
  2. ⚠️ Training overhead — Requires a KMeans training step before index construction, adding to build time
  3. ⚠️ Dimension constraints — Only supports vectors between 64 and 4095 dimensions
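The platform and dimension constraints can be checked up front, before any index build is attempted. A hypothetical guard (the function names are illustrative, not a real API; a production check would also confirm AVX2 support via cpuid rather than architecture alone):

```python
import platform

def dim_supported(dim: int) -> bool:
    """RaBitQ only accepts vectors with 64 to 4095 dimensions."""
    return 64 <= dim <= 4095

def arch_supported() -> bool:
    """Architecture gate only; AVX2 detection itself needs a cpuid check."""
    return platform.machine().lower() in ("x86_64", "amd64")

print(dim_supported(768))  # True
print(dim_supported(32))   # False: below the 64-dimension minimum
```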

Key Parameters

Tuning Tip: Start with the defaults (total_bits=7, num_clusters=16). Adjust ef first for recall/latency trade-offs at query time. Only reduce total_bits if you need to cut memory further and can tolerate slightly lower recall.

Index-Time Parameters

  • metric_type — Similarity metric used to compare vectors.
    Choose based on how your embeddings were trained.
  • m — Max neighbors per node: the maximum number of bidirectional links created for each node during graph construction.
    Higher m → ✨ better recall and graph connectivity, ⚠️ more memory usage and higher latency for both indexing and search.
  • ef_construction — Index-time candidate pool size: how many neighboring candidates the algorithm considers when inserting a new vector into the graph.
    Higher ef_construction → ✨ better graph quality and higher recall, ⚠️ longer index build time (does not affect query speed).
  • total_bits — RaBitQ quantization bits per dimension: controls the precision of the binary encoding.
    Governs the accuracy–memory trade-off; lower values save more memory but reduce accuracy.
  • num_clusters — Number of KMeans clusters used during the RaBitQ training phase to partition the vector space.
    More clusters can capture finer distribution patterns and increase recall slightly.
  • sample_count — Training sample count: the number of vectors sampled for KMeans training (0 = use all vectors).
    Default is 0. Set a smaller value (e.g. 5,000,000) to speed up training and reduce memory usage on very large datasets.
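Taken together, an index-time configuration might be bundled as the mapping below. The values are illustrative starting points (total_bits and num_clusters follow the tuning tip above; m and ef_construction are common HNSW choices, not documented product defaults), and the exact API for passing them is omitted here:

```python
# Illustrative parameter bundle; not a real API call, just the knobs
# from the table above gathered in one place.
index_params = {
    "metric_type": "IP",     # match how the embeddings were trained
    "m": 16,                 # max bidirectional links per node
    "ef_construction": 200,  # index-time candidate pool size
    "total_bits": 7,         # RaBitQ bits per dimension (1-9)
    "num_clusters": 16,      # KMeans clusters for RaBitQ training
    "sample_count": 0,       # 0 = train on all vectors
}
```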

Query-Time Parameters

  • ef — Query-time candidate pool size: how many potential neighbors are explored at each step during graph traversal at query time.
    Higher ef → ✨ higher recall, ⚠️ higher query latency.
  • radius — Distance (similarity) threshold for range-based filtering; only documents satisfying the threshold are returned.
    Example: with inner product (MetricType.IP), set radius=0.6 to keep only results with score > 0.6.
    ✅ Use when you want to filter out low-quality matches. 🚫 Skip when you want all top-k results, regardless of quality.
  • is_linear — Forces a brute-force linear search instead of using the index.
    🐌 Very slow for large datasets! ✅ Only use for debugging, tiny collections, or verifying index accuracy.
  • is_using_refiner — Enables exact score refinement: recomputes exact FP32 distances for top candidates after the quantized search.
    ✅ Turn on when you need maximum precision; ⚠️ adds latency due to full-precision re-scoring.
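For an inner-product metric, the radius semantics reduce to a simple score filter, sketched here with plain tuples (the real query path applies this inside the engine, not in user code):

```python
def apply_radius(hits, radius):
    """Keep only hits whose IP score exceeds the threshold (higher = better)."""
    return [(doc_id, score) for doc_id, score in hits if score > radius]

hits = [("a", 0.91), ("b", 0.58), ("c", 0.75)]
print(apply_radius(hits, 0.6))  # [('a', 0.91), ('c', 0.75)]
```

Note that for distance-like metrics such as L2, the comparison direction flips: smaller scores are better, so a radius filter keeps scores below the threshold.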

HNSW vs HNSW-RaBitQ

  • Quantization method — HNSW: standard quantization via quantize_type (INT8, FP16, INT4). HNSW-RaBitQ: built-in RaBitQ binary quantization (1–9 bits per dimension).
  • Memory per vector (768-dim) — HNSW: ~3 KB (FP32), ~1.5 KB (FP16), ~768 B (INT8), ~384 B (INT4). HNSW-RaBitQ: ~864 B (9-bit) down to ~96 B (1-bit).
  • Search distance computation — HNSW: exact FP32 (or quantized integer/FP16 arithmetic when quantized). HNSW-RaBitQ: fast distance estimation via RaBitQ.
  • Recall (typical) — HNSW: ~99%+ (FP32), slightly lower when quantized. HNSW-RaBitQ: up to ~99% (tunable via total_bits).
  • Graph construction quality — HNSW: FP32 or quantized. HNSW-RaBitQ: always FP32.
  • Refiner support — Both: ✅ recomputes exact FP32 distances for quantized results.
  • Platform — HNSW: all supported platforms. HNSW-RaBitQ: x86_64 only (AVX2 or higher required).
  • Best for — HNSW: general-purpose production; use with INT8/FP16 for moderate memory savings. HNSW-RaBitQ: maximum compression with RaBitQ-specific optimizations on x86_64.

How to choose: If you need cross-platform compatibility, use the standard HNSW index with quantize_type set to INT8 or FP16 — this already provides significant memory savings. Choose HNSW-RaBitQ when you are on x86_64 and want RaBitQ's specialized binary quantization, which can offer better accuracy-compression trade-offs at comparable memory levels.
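The decision rule above can be summarized as a small helper. recommend_index is purely illustrative; the returned names mirror this page, not a concrete API:

```python
def recommend_index(arch: str, want_max_compression: bool) -> dict:
    """Mirror the guidance above: RaBitQ on x86_64 when compression matters
    most, otherwise standard HNSW with INT8/FP16 quantization."""
    if arch == "x86_64" and want_max_compression:
        return {"index": "HNSW-RaBitQ", "total_bits": 7}
    return {"index": "HNSW", "quantize_type": "INT8"}

print(recommend_index("arm64", True))   # ARM falls back to standard HNSW
print(recommend_index("x86_64", True))  # x86_64 can use HNSW-RaBitQ
```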