
Embedding

This page introduces Zvec's embedding function system for converting text into vector representations. It provides multiple out-of-the-box implementations and supports custom extensions to integrate your own models.

Current Support: Zvec currently supports text modality embeddings only. Support for other modalities (images, audio, etc.) may be added in future releases.

Note for users in mainland China: To download models from Hugging Face more reliably, configure the mirror endpoint before running Python:

export HF_ENDPOINT=https://hf-mirror.com

Dependencies: To run the examples in this document, install the following packages first:

pip install openai dashscope dashtext sentence-transformers

Overview

Zvec's embedding system provides ready-to-use embedding functions to convert text into vector representations for similarity search.

Embedding Function Types

Type | Implementation | Description
Local Dense | DefaultLocalDenseEmbedding | Uses Sentence Transformers with the all-MiniLM-L6-v2 model (384 dimensions, ~80MB)
Local Sparse | DefaultLocalSparseEmbedding | Uses the SPLADE naver/splade-cocondenser-ensembledistil model (~100MB)
BM25 | BM25EmbeddingFunction | BM25 algorithm using the DashText SDK (local computation, no API key needed)
Qwen Dense | QwenDenseEmbedding | Uses the Qwen Dashscope API
Qwen Sparse | QwenSparseEmbedding | Uses the Qwen Dashscope API
OpenAI Dense | OpenAIDenseEmbedding | Uses the OpenAI API

Dense Embedding

Dense embeddings capture semantic meaning in fixed-length continuous vectors.

1. DefaultLocalDenseEmbedding - Local Dense Embedding

Uses the Sentence Transformers library with the all-MiniLM-L6-v2 model to generate 384-dimensional dense vectors.

Model Details:

  • Model: all-MiniLM-L6-v2 (HuggingFace) or iic/nlp_gte_sentence-embedding_chinese-small (ModelScope for Chinese)
  • Dimensions: 384
  • Size: ~80MB

from zvec.extension import DefaultLocalDenseEmbedding

# Basic usage (international users)
embedding_func = DefaultLocalDenseEmbedding()
vector = embedding_func.embed("Hello, world!")
print(f"Dimensions: {len(vector)}")  # 384

# Chinese users: recommended to use ModelScope
embedding_func = DefaultLocalDenseEmbedding(model_source="modelscope")
vector = embedding_func.embed("你好,世界!")

# Batch processing
texts = ["Text 1", "Text 2", "Text 3"]
vectors = [embedding_func.embed(text) for text in texts]

# Semantic similarity computation
import numpy as np
v1 = embedding_func.embed("The cat sits on the mat")
v2 = embedding_func.embed("A cat is resting on the mat")
similarity = np.dot(v1, v2)  # Normalized vectors, dot product = cosine similarity
print(f"Similarity: {similarity:.4f}")

2. QwenDenseEmbedding - Dashscope API Dense Embedding

Uses Qwen's Dashscope embedding API.

Note: Requires Dashscope API key, and dimension must be specified explicitly.

from zvec.extension import QwenDenseEmbedding

# API key required
embedding_func = QwenDenseEmbedding(
    api_key="your-dashscope-api-key",
    model="text-embedding-v4",   # Optional, uses latest model by default
    dimension=256,               # Required: embedding dimension
)

vector = embedding_func.embed("Vector database")
print(f"Dimensions: {embedding_func.dimension}")  # 256

3. OpenAIDenseEmbedding - OpenAI API Dense Embedding

Uses OpenAI's embedding API.

from zvec.extension import OpenAIDenseEmbedding

embedding_func = OpenAIDenseEmbedding(
    api_key="your-openai-api-key",
    model="text-embedding-3-small",  # Optional, uses latest model by default
    dimension=256,                   # Required: embedding dimension
)

vector = embedding_func.embed("Vector database")

Sparse Embedding

Sparse embeddings represent text with high-dimensional sparse vectors, ideal for lexical matching.

1. DefaultLocalSparseEmbedding - Local Sparse Embedding

Uses the SPLADE model to generate sparse vectors, suitable for lexical matching and hybrid retrieval.

Model Details:

  • Model: naver/splade-cocondenser-ensembledistil
  • Size: ~100MB
  • Output: Sparse dictionary format

from zvec.extension import DefaultLocalSparseEmbedding

# Query embedding (for search queries)
query_embedding = DefaultLocalSparseEmbedding(encoding_type="query")
query_vec = query_embedding.embed("machine learning algorithms")

# Document embedding (for document indexing)
doc_embedding = DefaultLocalSparseEmbedding(encoding_type="document")
doc_vec = doc_embedding.embed("Machine learning is a subfield of artificial intelligence")

# Sparse vector format: {dimension_index: weight}
print(f"Non-zero dimensions: {len(query_vec)}")
print(f"First 5 dimensions: {list(query_vec.items())[:5]}")

# Clear model cache
DefaultLocalSparseEmbedding.clear_cache()
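Sparse vectors in this {dimension_index: weight} dictionary format can be compared with a simple sparse dot product: only dimensions present in both vectors contribute. A minimal standalone sketch (sparse_dot is an illustrative helper, not part of Zvec):

```python
def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors stored as {index: weight} dicts.

    Iterates the smaller dict and looks up matching indices in the larger
    one, so cost is proportional to the smaller vector's non-zero count.
    """
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)


query_vec = {100: 0.5, 250: 1.2, 500: 0.8}
doc_vec = {100: 0.9, 250: 0.4, 999: 1.0}
print(sparse_dot(query_vec, doc_vec))  # 0.5*0.9 + 1.2*0.4 = 0.93
```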

2. BM25EmbeddingFunction - DashText SDK BM25 Sparse Embedding

Uses DashText's local BM25 encoder for lexical matching. No API key or network connectivity required.

Two Options:

  • Built-in encoder (recommended for general use): Pre-trained models for Chinese (language="zh") and English (language="en")
  • Custom encoder: Train on your own corpus for domain-specific terminology with BM25 parameters (b, k1)

from zvec.extension import BM25EmbeddingFunction

# Option 1: Using built-in encoder (no corpus needed)
# For Chinese query encoding
bm25_query_zh = BM25EmbeddingFunction(language="zh", encoding_type="query")
query_vec = bm25_query_zh.embed("深度学习神经网络")

# For Chinese document encoding
bm25_doc_zh = BM25EmbeddingFunction(language="zh", encoding_type="document")
doc_vec = bm25_doc_zh.embed("机器学习是人工智能的重要分支")

# For English query encoding
bm25_query_en = BM25EmbeddingFunction(language="en", encoding_type="query")
query_vec_en = bm25_query_en.embed("deep learning neural networks")

# Option 2: Using custom corpus for better domain accuracy
corpus = [
    "Machine learning is an important branch of artificial intelligence",
    "Deep learning uses neural networks",
    "Natural language processing handles text data"
]

bm25_custom = BM25EmbeddingFunction(
    corpus=corpus,
    encoding_type="query",
    b=0.75,   # Document length normalization
    k1=1.2    # Term frequency saturation
)

query_vec = bm25_custom.embed("deep learning neural networks")
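For intuition on what the b and k1 parameters control, here is a standalone sketch of the standard per-term BM25 weight (this illustrates the general formula, not DashText's exact implementation): k1 caps how much repeated occurrences of a term can add (term-frequency saturation), while b scales the penalty for longer-than-average documents.

```python
import math


def bm25_term_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard BM25 weight of one term in one document.

    tf: term frequency in the document; df: number of documents containing
    the term; n_docs: corpus size. k1 controls term-frequency saturation,
    b controls document-length normalization.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


# Raising tf from 1 to 10 increases the weight far less than 10x,
# because k1 saturates the term-frequency contribution.
print(bm25_term_weight(tf=1, df=3, n_docs=100, doc_len=80, avg_doc_len=100))
print(bm25_term_weight(tf=10, df=3, n_docs=100, doc_len=80, avg_doc_len=100))
```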

3. QwenSparseEmbedding - Dashscope API Sparse Embedding

Requires Dashscope API key. Visit Dashscope Console to get your API key.

from zvec.extension import QwenSparseEmbedding

embedding_func = QwenSparseEmbedding(
    api_key="your-dashscope-api-key",
    dimension=256,  # Required: embedding dimension for the Dashscope API
)
sparse_vec = embedding_func.embed("sparse vector")

Custom Implementation Guide

Learn how to create your own embedding functions.

Custom Embedding Functions

Zvec provides protocol base classes and framework-specific base classes for custom embeddings:

Protocol Base Classes:

  • DenseEmbeddingFunction[T]: Protocol for dense embeddings
  • SparseEmbeddingFunction[T]: Protocol for sparse embeddings

Framework-Specific Base Classes:

  • SentenceTransformerFunctionBase: Base class for Sentence Transformers models (in sentence_transformer_function.py)
  • QwenFunctionBase: Base class for Qwen Dashscope API (in qwen_function.py)

Example 1: Custom Dense Embedding from Scratch

from zvec.extension import DenseEmbeddingFunction
from zvec.common.constants import TEXT, DenseVectorType
import numpy as np


class MyCustomDenseEmbedding(DenseEmbeddingFunction[TEXT]):
    """Custom dense embedding function example"""
    
    def __init__(self, model_name: str = "custom-model", **kwargs):
        self._model_name = model_name
        self._dimension = 768  # Custom dimension
        self._extra_params = kwargs
        # Initialize your model
        self._model = self._load_model()
    
    @property
    def dimension(self) -> int:
        """Return embedding vector dimension"""
        return self._dimension
    
    @property
    def extra_params(self) -> dict:
        """Return extra parameters"""
        return self._extra_params
    
    def _load_model(self):
        """Load your custom model"""
        # Implement your model loading logic here
        # e.g., return YourModelClass.from_pretrained(self._model_name)
        pass
    
    def embed(self, input: str) -> DenseVectorType:
        """
        Generate dense embedding vector
        
        Args:
            input: Input text
            
        Returns:
            DenseVectorType: List of floats, length = self.dimension
        """
        # Input validation
        if not isinstance(input, str):
            raise TypeError(f"Expected str, got {type(input).__name__}")
        
        input = input.strip()
        if not input:
            raise ValueError("Input cannot be empty")
        
        # Generate embedding using your model
        # embedding = self._model.encode(input)
        # return embedding.tolist()
        
        # Example: return random vector
        return np.random.randn(self._dimension).tolist()
    
    def __call__(self, input: str) -> DenseVectorType:
        """Make the function callable"""
        return self.embed(input)


# Use custom embedding
custom_emb = MyCustomDenseEmbedding(model_name="my-model")
vector = custom_emb.embed("Test text")
print(f"Dimensions: {len(vector)}")

Example 2: Custom Sparse Embedding from Scratch

from zvec.extension import SparseEmbeddingFunction
from zvec.common.constants import TEXT, SparseVectorType


class MyCustomSparseEmbedding(SparseEmbeddingFunction[TEXT]):
    """Custom sparse embedding function example"""
    
    def __init__(self, vocab_size: int = 30000, **kwargs):
        self._vocab_size = vocab_size
        self._extra_params = kwargs
        self._tokenizer = self._load_tokenizer()
    
    @property
    def extra_params(self) -> dict:
        return self._extra_params
    
    def _load_tokenizer(self):
        """Load tokenizer"""
        # Implement your tokenizer loading logic
        pass
    
    def embed(self, input: str) -> SparseVectorType:
        """
        Generate sparse embedding vector
        
        Args:
            input: Input text
            
        Returns:
            SparseVectorType: Dictionary {dimension_index: weight}, contains only non-zero values
        """
        if not isinstance(input, str):
            raise TypeError(f"Expected str, got {type(input).__name__}")
        
        input = input.strip()
        if not input:
            raise ValueError("Input cannot be empty")
        
        # Implement your sparse embedding logic
        # tokens = self._tokenizer.tokenize(input)
        # sparse_vec = self._compute_sparse_representation(tokens)
        
        # Example: return simple term frequency vector
        sparse_vec = {
            100: 0.5,
            250: 1.2,
            500: 0.8
        }
        
        # Ensure sorted by index
        return dict(sorted(sparse_vec.items()))
    
    def __call__(self, input: str) -> SparseVectorType:
        return self.embed(input)


# Use custom sparse embedding
sparse_emb = MyCustomSparseEmbedding(vocab_size=50000)
sparse_vec = sparse_emb.embed("Test text")
print(f"Non-zero dimensions: {len(sparse_vec)}")

Example 3: Using SentenceTransformerFunctionBase

If you want to use a different Sentence Transformers model, you can inherit from SentenceTransformerFunctionBase:

from zvec.extension.sentence_transformer_function import SentenceTransformerFunctionBase
from zvec.extension import DenseEmbeddingFunction
from zvec.common.constants import TEXT, DenseVectorType
from typing import Literal


class CustomSentenceTransformerEmbedding(
    SentenceTransformerFunctionBase, 
    DenseEmbeddingFunction[TEXT]
):
    """Using custom Sentence Transformer model"""
    
    def __init__(
        self,
        model_name: str = "all-mpnet-base-v2",  # Use a different model
        model_source: Literal["huggingface", "modelscope"] = "huggingface",
        normalize_embeddings: bool = True,
        **kwargs
    ):
        # Initialize base class
        SentenceTransformerFunctionBase.__init__(
            self, 
            model_name=model_name,
            model_source=model_source,
        )
        
        self._normalize_embeddings = normalize_embeddings
        self._extra_params = kwargs
        
        # Load model and get dimension
        model = self._get_model()
        self._dimension = model.get_sentence_embedding_dimension()
    
    @property
    def dimension(self) -> int:
        return self._dimension
    
    @property
    def extra_params(self) -> dict:
        return self._extra_params
    
    def embed(self, input: str) -> DenseVectorType:
        if not isinstance(input, str):
            raise TypeError(f"Expected str, got {type(input).__name__}")
        
        input = input.strip()
        if not input:
            raise ValueError("Input cannot be empty")
        
        model = self._get_model()
        embedding = model.encode(
            input,
            convert_to_numpy=True,
            normalize_embeddings=self._normalize_embeddings
        )
        
        return embedding.tolist()
    
    def __call__(self, input: str) -> DenseVectorType:
        return self.embed(input)


# Use custom model
# Use larger MPNet model (768 dimensions)
custom_emb = CustomSentenceTransformerEmbedding(
    model_name="all-mpnet-base-v2"
)
vector = custom_emb.embed("High-quality text embedding")
print(f"Dimensions: {len(vector)}")  # 768

# Use multilingual model
multilingual_emb = CustomSentenceTransformerEmbedding(
    model_name="paraphrase-multilingual-MiniLM-L12-v2"
)

Example 4: Using QwenFunctionBase

If you want to implement custom embeddings using Qwen Dashscope API:

from zvec.extension.qwen_function import QwenFunctionBase
from zvec.extension import DenseEmbeddingFunction
from zvec.common.constants import TEXT, DenseVectorType


class CustomQwenEmbedding(QwenFunctionBase, DenseEmbeddingFunction[TEXT]):
    """Custom Qwen embedding implementation"""
    
    def __init__(
        self,
        api_key: str,
        model: str = "text-embedding-v3",
        **kwargs
    ):
        # Initialize base class with API key
        QwenFunctionBase.__init__(self, api_key=api_key)
        
        self._model = model
        self._extra_params = kwargs
        self._dimension = None  # Will be set after first call
    
    @property
    def dimension(self) -> int:
        if self._dimension is None:
            # Get dimension from first embedding call
            test_result = self.embed("test")
            self._dimension = len(test_result)
        return self._dimension
    
    @property
    def extra_params(self) -> dict:
        return self._extra_params
    
    def embed(self, input: str) -> DenseVectorType:
        if not isinstance(input, str):
            raise TypeError(f"Expected str, got {type(input).__name__}")
        
        input = input.strip()
        if not input:
            raise ValueError("Input cannot be empty")
        
        # Use the base class's _embed_text method
        result = self._embed_text(
            text=input,
            model=self._model
        )
        
        return result
    
    def __call__(self, input: str) -> DenseVectorType:
        return self.embed(input)


# Use custom Qwen embedding
custom_qwen_emb = CustomQwenEmbedding(
    api_key="your-dashscope-api-key",
    model="text-embedding-v3"
)
vector = custom_qwen_emb.embed("Custom Qwen embedding")

Best Practices

Follow these patterns to build effective search pipelines.

1. Hybrid Search (Multi-Vector Retrieval)

Combine dense and sparse embeddings for best retrieval performance:

from zvec.extension import (
    DefaultLocalDenseEmbedding,
    DefaultLocalSparseEmbedding,
    RrfReRanker
)

# Create embedding functions
dense_emb = DefaultLocalDenseEmbedding()
sparse_emb = DefaultLocalSparseEmbedding(encoding_type="query")

# Query text
query = "What is a vector database"

# Generate both embeddings
dense_vec = dense_emb.embed(query)
sparse_vec = sparse_emb.embed(query)

# Fuse results using RRF
rrf_ranker = RrfReRanker(topn=3)

# Retrieve using both vectors separately (pseudo-code)
final_results = zvec.collection.query(
    vectors=[
        VectorQuery("dense", vector=dense_vec),
        VectorQuery("sparse", vector=sparse_vec),
    ],
    topk=10,
    reranker=rrf_ranker,
)
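To see what the RRF reranker is doing under the hood, here is a standalone sketch of Reciprocal Rank Fusion (illustrative only, not Zvec's implementation): each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, so documents ranked highly by both the dense and sparse retrievers rise to the top.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked result lists with Reciprocal Rank Fusion.

    rankings: list of ranked lists of document ids (best first).
    k=60 is the commonly used smoothing constant from the original RRF paper.
    Returns all document ids ordered by fused score, best first.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# doc1 appears near the top of both lists, so it wins the fusion.
dense_hits = ["doc3", "doc1", "doc7"]
sparse_hits = ["doc1", "doc9", "doc3"]
print(rrf_fuse([dense_hits, sparse_hits]))
```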

2. Network Configuration for Chinese Users

For users in mainland China, configure network settings to download models reliably:

import os
from zvec.extension import DefaultLocalDenseEmbedding

# Option 1: Use ModelScope
embedding = DefaultLocalDenseEmbedding(model_source="modelscope")

# Option 2: Use Hugging Face mirror in Python
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
embedding = DefaultLocalDenseEmbedding(model_source="huggingface")

Important Notes

Key Considerations:

  1. Model Download: Models will be downloaded on first use. Ensure network connectivity.
  2. Memory Management: Local models consume memory. Call clear_cache() to release memory after use.
  3. API Rate Limiting: When using API-based functions (Qwen, OpenAI), be mindful of quotas and rate limits.
  4. Thread Safety: Embedding functions are thread-safe and can be used in multi-threaded environments.
  5. Text Only: Currently, Zvec only supports text modality embeddings. Support for other modalities may be added in future releases.
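Regarding note 3, a common way to handle transient API failures and rate limits is retrying with exponential backoff. A minimal, library-agnostic sketch (with_retries is a hypothetical helper, not part of Zvec; wrap your embed call in it):

```python
import random
import time


def with_retries(fn, max_attempts=4, base_delay=0.5, retry_on=(Exception,)):
    """Call fn(), retrying with exponential backoff plus jitter on failure.

    Delays grow as base_delay * 2**attempt (0.5s, 1s, 2s, ...) with up to
    100 ms of random jitter to avoid synchronized retry bursts. The last
    failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))


# Usage sketch: vector = with_retries(lambda: embedding_func.embed(text))
```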

Explore the source code for further implementation details.