Embedding
This page introduces Zvec's embedding function system for converting text into vector representations. It provides multiple out-of-the-box implementations and supports custom extensions to integrate your own models.
Current Support: Zvec currently supports text modality embeddings only. Support for other modalities (images, audio, etc.) may be added in future releases.
Note for users in mainland China: To download models from Hugging Face more reliably, configure the mirror endpoint before running Python:
export HF_ENDPOINT=https://hf-mirror.com
Dependencies: To run the examples in this document, install the following packages first:
pip install openai dashscope dashtext sentence-transformers
Overview
Zvec's embedding system provides ready-to-use embedding functions to convert text into vector representations for similarity search.
Embedding Function Types
| Type | Implementation | Description |
|---|---|---|
| Local Dense | DefaultLocalDenseEmbedding | Uses Sentence Transformers with all-MiniLM-L6-v2 model (384 dimensions, ~80MB) |
| Local Sparse | DefaultLocalSparseEmbedding | Uses SPLADE naver/splade-cocondenser-ensembledistil model (~100MB) |
| BM25 | BM25EmbeddingFunction | BM25 algorithm using DashText SDK (local computation, no API key needed) |
| Qwen Dense | QwenDenseEmbedding | Uses Qwen Dashscope API |
| Qwen Sparse | QwenSparseEmbedding | Uses Qwen Dashscope API |
| OpenAI Dense | OpenAIDenseEmbedding | Uses OpenAI API |
Dense Embedding
Dense embeddings capture semantic meaning in fixed-length continuous vectors.
1. DefaultLocalDenseEmbedding - Local Dense Embedding
Uses the Sentence Transformers library with the all-MiniLM-L6-v2 model to generate 384-dimensional dense vectors.
Model Details:
- Model: all-MiniLM-L6-v2 (HuggingFace) or iic/nlp_gte_sentence-embedding_chinese-small (ModelScope for Chinese)
- Dimensions: 384
- Size: ~80MB
from zvec.extension import DefaultLocalDenseEmbedding
# Basic usage (international users)
embedding_func = DefaultLocalDenseEmbedding()
vector = embedding_func.embed("Hello, world!")
print(f"Dimensions: {len(vector)}") # 384
# Chinese users: recommended to use ModelScope
embedding_func = DefaultLocalDenseEmbedding(model_source="modelscope")
vector = embedding_func.embed("你好,世界!")
# Batch processing
texts = ["Text 1", "Text 2", "Text 3"]
vectors = [embedding_func.embed(text) for text in texts]
# Semantic similarity computation
import numpy as np
v1 = embedding_func.embed("The cat sits on the mat")
v2 = embedding_func.embed("A cat is resting on the mat")
similarity = np.dot(v1, v2) # Normalized vectors, dot product = cosine similarity
print(f"Similarity: {similarity:.4f}")
2. QwenDenseEmbedding - Dashscope API Dense Embedding
Uses Qwen's Dashscope embedding API.
Note: Requires a Dashscope API key, and the dimension must be specified explicitly.
from zvec.extension import QwenDenseEmbedding
# API key required
embedding_func = QwenDenseEmbedding(
    api_key="your-dashscope-api-key",
    model="text-embedding-v4",  # Optional, uses latest model by default
    dimension=256,  # Required: embedding dimension
)
vector = embedding_func.embed("Vector database")
print(f"Dimensions: {embedding_func.dimension}")  # 256
3. OpenAIDenseEmbedding - OpenAI API Dense Embedding
Uses OpenAI's embedding API.
from zvec.extension import OpenAIDenseEmbedding
embedding_func = OpenAIDenseEmbedding(
    api_key="your-openai-api-key",
    model="text-embedding-3-small",  # Optional, uses latest model by default
    dimension=256,  # Required: embedding dimension
)
vector = embedding_func.embed("Vector database")
Sparse Embedding
Sparse embeddings represent text with high-dimensional sparse vectors, ideal for lexical matching.
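Because a sparse vector stores only its non-zero entries, a lexical match score between a query and a document reduces to a dot product over the dimensions they share. The following is a minimal sketch, assuming the {dimension_index: weight} dictionary format used by the sparse embedding functions on this page; the helper name sparse_dot is hypothetical and not part of Zvec.

```python
def sparse_dot(query_vec: dict[int, float], doc_vec: dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {dimension_index: weight}."""
    # Iterate over the smaller dict so the cost depends on the shorter vector.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[i] for i, w in query_vec.items() if i in doc_vec)


query = {100: 0.5, 250: 1.2}
doc = {250: 2.0, 500: 0.8, 900: 0.1}
score = sparse_dot(query, doc)  # only dimension 250 is shared
print(f"Score: {score:.2f}")
```

Dimensions that appear in only one of the two vectors contribute nothing, which is why sparse scores reward exact term overlap.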
1. DefaultLocalSparseEmbedding - Local Sparse Embedding
Uses the SPLADE model to generate sparse vectors, suitable for lexical matching and hybrid retrieval.
Model Details:
- Model: naver/splade-cocondenser-ensembledistil
- Size: ~100MB
- Output: Sparse dictionary format
from zvec.extension import DefaultLocalSparseEmbedding
# Query embedding (for search queries)
query_embedding = DefaultLocalSparseEmbedding(encoding_type="query")
query_vec = query_embedding.embed("machine learning algorithms")
# Document embedding (for document indexing)
doc_embedding = DefaultLocalSparseEmbedding(encoding_type="document")
doc_vec = doc_embedding.embed("Machine learning is a subfield of artificial intelligence")
# Sparse vector format: {dimension_index: weight}
print(f"Non-zero dimensions: {len(query_vec)}")
print(f"First 5 dimensions: {list(query_vec.items())[:5]}")
# Clear model cache
DefaultLocalSparseEmbedding.clear_cache()
2. BM25EmbeddingFunction - DashText SDK BM25 Sparse Embedding
Uses DashText's local BM25 encoder for lexical matching. No API key or network connectivity required.
Two Options:
- Built-in encoder (recommended for general use): Pre-trained models for Chinese (language="zh") and English (language="en")
- Custom encoder: Train on your own corpus for domain-specific terminology with BM25 parameters (b, k1)
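To make the roles of b and k1 concrete, here is a minimal sketch of the textbook BM25 term weight. It is not DashText's implementation; the function name bm25_term_weight and the smoothed IDF variant used here are assumptions chosen for illustration.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 weight of one term in one document (illustrative only)."""
    # Smoothed inverse document frequency; this variant is never negative.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly long documents are penalized (b=0 disables it).
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding weight.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


w1 = bm25_term_weight(tf=1, doc_len=10, avg_doc_len=10, n_docs=3, doc_freq=1)
w5 = bm25_term_weight(tf=5, doc_len=10, avg_doc_len=10, n_docs=3, doc_freq=1)
w_long = bm25_term_weight(tf=1, doc_len=40, avg_doc_len=10, n_docs=3, doc_freq=1)
print(w1, w5, w_long)
```

Five occurrences score higher than one but far less than five times as much (saturation via k1), and the same term in a document four times the average length scores lower (length normalization via b).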
from zvec.extension import BM25EmbeddingFunction
# Option 1: Using built-in encoder (no corpus needed)
# For Chinese query encoding
bm25_query_zh = BM25EmbeddingFunction(language="zh", encoding_type="query")
query_vec = bm25_query_zh.embed("深度学习神经网络")
# For Chinese document encoding
bm25_doc_zh = BM25EmbeddingFunction(language="zh", encoding_type="document")
doc_vec = bm25_doc_zh.embed("机器学习是人工智能的重要分支")
# For English query encoding
bm25_query_en = BM25EmbeddingFunction(language="en", encoding_type="query")
query_vec_en = bm25_query_en.embed("deep learning neural networks")
# Option 2: Using custom corpus for better domain accuracy
corpus = [
    "Machine learning is an important branch of artificial intelligence",
    "Deep learning uses neural networks",
    "Natural language processing handles text data",
]
bm25_custom = BM25EmbeddingFunction(
    corpus=corpus,
    encoding_type="query",
    b=0.75,  # Document length normalization
    k1=1.2,  # Term frequency saturation
)
query_vec = bm25_custom.embed("deep learning neural networks")
3. QwenSparseEmbedding - Dashscope API Sparse Embedding
Uses Qwen's Dashscope API to generate sparse vectors. Requires a Dashscope API key, which you can obtain from the Dashscope Console.
from zvec.extension import QwenSparseEmbedding
embedding_func = QwenSparseEmbedding(
    api_key="your-dashscope-api-key",
    dimension=256,  # Required by the Dashscope API: embedding dimension
)
sparse_vec = embedding_func.embed("sparse vector")
Custom Implementation Guide
Learn how to create your own embedding functions.
Custom Embedding Functions
Zvec provides protocol base classes and framework-specific base classes for custom embeddings:
Protocol Base Classes:
- DenseEmbeddingFunction[T]: Protocol for dense embeddings
- SparseEmbeddingFunction[T]: Protocol for sparse embeddings
Framework-Specific Base Classes:
- SentenceTransformerFunctionBase: Base class for Sentence Transformers models (in sentence_transformer_function.py)
- QwenFunctionBase: Base class for Qwen Dashscope API (in qwen_function.py)
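A protocol class describes the members an embedding function must expose without tying it to a concrete implementation. The sketch below illustrates the general idea with Python's typing.Protocol; DenseEmbedder and ToyEmbedding are hypothetical names for illustration only and are not Zvec's actual definitions.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class DenseEmbedder(Protocol):
    """Hypothetical protocol: anything with a dimension and an embed() method."""

    @property
    def dimension(self) -> int: ...

    def embed(self, input: str) -> List[float]: ...


class ToyEmbedding:
    """Satisfies DenseEmbedder structurally, without inheriting from it."""

    dimension = 3

    def embed(self, input: str) -> List[float]:
        # Toy features: character count, word count, vowel count.
        vowels = sum(ch in "aeiouAEIOU" for ch in input)
        return [float(len(input)), float(len(input.split())), float(vowels)]


emb = ToyEmbedding()
vector = emb.embed("Hello world")
print(isinstance(emb, DenseEmbedder), len(vector) == emb.dimension)
```

The examples that follow subclass Zvec's protocol classes explicitly, which additionally lets a type checker verify the method signatures.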
Example 1: Custom Dense Embedding from Scratch
from zvec.extension import DenseEmbeddingFunction
from zvec.common.constants import TEXT, DenseVectorType
from typing import Optional
import numpy as np
class MyCustomDenseEmbedding(DenseEmbeddingFunction[TEXT]):
    """Custom dense embedding function example"""
    def __init__(self, model_name: str = "custom-model", **kwargs):
        self._model_name = model_name
        self._dimension = 768  # Custom dimension
        self._extra_params = kwargs
        # Initialize your model
        self._model = self._load_model()
    @property
    def dimension(self) -> int:
        """Return embedding vector dimension"""
        return self._dimension
    @property
    def extra_params(self) -> dict:
        """Return extra parameters"""
        return self._extra_params
    def _load_model(self):
        """Load your custom model"""
        # Implement your model loading logic here
        # e.g., return YourModelClass.from_pretrained(self._model_name)
        pass
    def embed(self, input: str) -> DenseVectorType:
        """
        Generate dense embedding vector
        Args:
            input: Input text
        Returns:
            DenseVectorType: List of floats, length = self.dimension
        """
        # Input validation
        if not isinstance(input, str):
            raise TypeError(f"Expected str, got {type(input).__name__}")
        input = input.strip()
        if not input:
            raise ValueError("Input cannot be empty")
        # Generate embedding using your model
        # embedding = self._model.encode(input)
        # return embedding.tolist()
        # Example: return random vector
        return np.random.randn(self._dimension).tolist()
    def __call__(self, input: str) -> DenseVectorType:
        """Make the function callable"""
        return self.embed(input)
# Use custom embedding
custom_emb = MyCustomDenseEmbedding(model_name="my-model")
vector = custom_emb.embed("Test text")
print(f"Dimensions: {len(vector)}")
Example 2: Custom Sparse Embedding from Scratch
from zvec.extension import SparseEmbeddingFunction
from zvec.common.constants import TEXT, SparseVectorType
from typing import Dict
class MyCustomSparseEmbedding(SparseEmbeddingFunction[TEXT]):
    """Custom sparse embedding function example"""
    def __init__(self, vocab_size: int = 30000, **kwargs):
        self._vocab_size = vocab_size
        self._extra_params = kwargs
        self._tokenizer = self._load_tokenizer()
    @property
    def extra_params(self) -> dict:
        return self._extra_params
    def _load_tokenizer(self):
        """Load tokenizer"""
        # Implement your tokenizer loading logic
        pass
    def embed(self, input: str) -> SparseVectorType:
        """
        Generate sparse embedding vector
        Args:
            input: Input text
        Returns:
            SparseVectorType: Dictionary {dimension_index: weight}, contains only non-zero values
        """
        if not isinstance(input, str):
            raise TypeError(f"Expected str, got {type(input).__name__}")
        input = input.strip()
        if not input:
            raise ValueError("Input cannot be empty")
        # Implement your sparse embedding logic
        # tokens = self._tokenizer.tokenize(input)
        # sparse_vec = self._compute_sparse_representation(tokens)
        # Example: return simple term frequency vector
        sparse_vec = {
            100: 0.5,
            250: 1.2,
            500: 0.8,
        }
        # Ensure sorted by index
        return dict(sorted(sparse_vec.items()))
    def __call__(self, input: str) -> SparseVectorType:
        return self.embed(input)
# Use custom sparse embedding
sparse_emb = MyCustomSparseEmbedding(vocab_size=50000)
sparse_vec = sparse_emb.embed("Test text")
print(f"Non-zero dimensions: {len(sparse_vec)}")
Example 3: Using SentenceTransformerFunctionBase
If you want to use a different Sentence Transformers model, you can inherit from SentenceTransformerFunctionBase:
from zvec.extension.sentence_transformer_function import SentenceTransformerFunctionBase
from zvec.extension import DenseEmbeddingFunction
from zvec.common.constants import TEXT, DenseVectorType
from typing import Literal, Optional
class CustomSentenceTransformerEmbedding(
    SentenceTransformerFunctionBase,
    DenseEmbeddingFunction[TEXT]
):
    """Using custom Sentence Transformer model"""
    def __init__(
        self,
        model_name: str = "all-mpnet-base-v2",  # Use a different model
        model_source: Literal["huggingface", "modelscope"] = "huggingface",
        normalize_embeddings: bool = True,
        **kwargs
    ):
        # Initialize base class
        SentenceTransformerFunctionBase.__init__(
            self,
            model_name=model_name,
            model_source=model_source,
        )
        self._normalize_embeddings = normalize_embeddings
        self._extra_params = kwargs
        # Load model and get dimension
        model = self._get_model()
        self._dimension = model.get_sentence_embedding_dimension()
    @property
    def dimension(self) -> int:
        return self._dimension
    @property
    def extra_params(self) -> dict:
        return self._extra_params
    def embed(self, input: str) -> DenseVectorType:
        if not isinstance(input, str):
            raise TypeError(f"Expected str, got {type(input).__name__}")
        input = input.strip()
        if not input:
            raise ValueError("Input cannot be empty")
        model = self._get_model()
        embedding = model.encode(
            input,
            convert_to_numpy=True,
            normalize_embeddings=self._normalize_embeddings
        )
        return embedding.tolist()
    def __call__(self, input: str) -> DenseVectorType:
        return self.embed(input)
# Use custom model
# Use larger MPNet model (768 dimensions)
custom_emb = CustomSentenceTransformerEmbedding(
    model_name="all-mpnet-base-v2"
)
vector = custom_emb.embed("High-quality text embedding")
print(f"Dimensions: {len(vector)}")  # 768
# Use multilingual model
multilingual_emb = CustomSentenceTransformerEmbedding(
    model_name="paraphrase-multilingual-MiniLM-L12-v2"
)
Example 4: Using QwenFunctionBase
If you want to implement custom embeddings using Qwen Dashscope API:
from zvec.extension.qwen_function import QwenFunctionBase
from zvec.extension import DenseEmbeddingFunction
from zvec.common.constants import TEXT, DenseVectorType
from typing import Optional
class CustomQwenEmbedding(QwenFunctionBase, DenseEmbeddingFunction[TEXT]):
    """Custom Qwen embedding implementation"""
    def __init__(
        self,
        api_key: str,
        model: str = "text-embedding-v3",
        **kwargs
    ):
        # Initialize base class with API key
        QwenFunctionBase.__init__(self, api_key=api_key)
        self._model = model
        self._extra_params = kwargs
        self._dimension = None  # Will be set after first call
    @property
    def dimension(self) -> int:
        if self._dimension is None:
            # Get dimension from first embedding call
            test_result = self.embed("test")
            self._dimension = len(test_result)
        return self._dimension
    @property
    def extra_params(self) -> dict:
        return self._extra_params
    def embed(self, input: str) -> DenseVectorType:
        if not isinstance(input, str):
            raise TypeError(f"Expected str, got {type(input).__name__}")
        input = input.strip()
        if not input:
            raise ValueError("Input cannot be empty")
        # Use the base class's _embed_text method
        result = self._embed_text(
            text=input,
            model=self._model
        )
        return result
    def __call__(self, input: str) -> DenseVectorType:
        return self.embed(input)
# Use custom Qwen embedding
custom_qwen_emb = CustomQwenEmbedding(
    api_key="your-dashscope-api-key",
    model="text-embedding-v3"
)
vector = custom_qwen_emb.embed("Custom Qwen embedding")
Best Practices
Follow these patterns to build effective search pipelines.
1. Hybrid Search (Multi-Vector Retrieval)
Combine dense and sparse embeddings for best retrieval performance:
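Reciprocal rank fusion (RRF), the strategy that the reranker in the Zvec example is named after, merges several ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below is a minimal standalone version for intuition only; the function name rrf_fuse and the constant k=60 are assumptions and may differ from what RrfReRanker does internally.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, topn=3):
    """Fuse ranked lists of document IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by document ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:topn]]


dense_hits = ["doc_a", "doc_b", "doc_c"]   # ranked results from the dense vector
sparse_hits = ["doc_b", "doc_d", "doc_a"]  # ranked results from the sparse vector
fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, and because only ranks are used, the dense and sparse scores never need to be on a comparable scale.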
from zvec.extension import (
    DefaultLocalDenseEmbedding,
    DefaultLocalSparseEmbedding,
    RrfReRanker,
)
# Create embedding functions
dense_emb = DefaultLocalDenseEmbedding()
sparse_emb = DefaultLocalSparseEmbedding(encoding_type="query")
# Query text
query = "What is a vector database"
# Generate both embeddings
dense_vec = dense_emb.embed(query)
sparse_vec = sparse_emb.embed(query)
# Fuse results using RRF
rrf_ranker = RrfReRanker(topn=3)
# Retrieve using both vectors separately (pseudo-code)
final_results = zvec.collection.query(
    vectors=[
        VectorQuery("dense", vector=dense_vec),
        VectorQuery("sparse", vector=sparse_vec),
    ],
    topk=10,
    reranker=rrf_ranker,
)
2. Network Configuration for Chinese Users
For users in mainland China, configure network settings to download models reliably:
import os
from zvec.extension import DefaultLocalDenseEmbedding
# Option 1: Use ModelScope
embedding = DefaultLocalDenseEmbedding(model_source="modelscope")
# Option 2: Use Hugging Face mirror in Python (set this as early as possible, before any model is loaded)
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
embedding = DefaultLocalDenseEmbedding(model_source="huggingface")
Important Notes
Key Considerations:
- Model Download: Models will be downloaded on first use. Ensure network connectivity.
- Memory Management: Local models consume memory. Call clear_cache() to release memory after use.
- API Rate Limiting: When using API-based functions (Qwen, OpenAI), be mindful of quotas and rate limits.
- Thread Safety: Embedding functions are thread-safe and can be used in multi-threaded environments.
- Text Only: Currently, Zvec only supports text modality embeddings. Support for other modalities may be added in future releases.
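The rate-limiting and thread-safety notes above can be combined into a small batching helper. This is a minimal sketch under stated assumptions: embed_fn stands in for any embedding function's embed method, and the names embed_with_retry and embed_batch, the retry count, and the delays are all hypothetical. Real code should catch the specific rate-limit exception raised by the API client rather than Exception.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def embed_with_retry(embed_fn, text, max_retries=3, base_delay=0.01):
    """Call embed_fn(text), retrying with exponential backoff on failure."""
    for attempt in range(max_retries + 1):
        try:
            return embed_fn(text)
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: surface the original error.
            time.sleep(base_delay * (2 ** attempt))


def embed_batch(embed_fn, texts, max_workers=4):
    """Embed texts concurrently; results keep the order of the input."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda t: embed_with_retry(embed_fn, t), texts))


# Stand-in embedding function so the sketch runs without a model or API key.
def fake_embed(text):
    return [float(len(text))]


vectors = embed_batch(fake_embed, ["a", "bb", "ccc"])
print(vectors)
```

Keeping max_workers small is a simple way to stay under an API quota, since it caps the number of requests in flight at any moment.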
Related Documentation
Explore the source code and implementation details: