Create a New Collection
To create a new, empty collection in Zvec, you need to define the following:
- Schema — the structural blueprint of your data, specifying scalar fields and vector embeddings.
- Collection options (optional) — runtime settings that control how the collection behaves when opened (e.g., read-only mode).
Once defined, you call create_and_open() to create the collection on disk and get a ready-to-use Collection handle.
If a collection already exists at the specified path, create_and_open() will raise an error to prevent accidental overwrites.
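Because create_and_open() refuses to reuse an existing path, it can help to check the target location before calling it. The sketch below uses only the Python standard library. The helper name ensure_fresh_path is hypothetical and not part of Zvec, and it assumes that "already exists" simply means something is present at that path.

```python
from pathlib import Path
import tempfile


def ensure_fresh_path(path: str) -> Path:
    """Return the path if nothing exists there yet; otherwise raise.

    Hypothetical helper, not part of Zvec: it only mirrors the documented
    rule that a collection is never created over an existing path.
    """
    target = Path(path)
    if target.exists():
        raise FileExistsError(f"refusing to create a collection over {target}")
    return target


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    fresh = ensure_fresh_path(f"{tmp}/collections/products")
    first_check_ok = not fresh.exists()   # nothing has been created yet
    fresh.mkdir(parents=True)             # simulate a collection being created
    try:
        ensure_fresh_path(str(fresh))
        second_check_raised = False
    except FileExistsError:
        second_check_raised = True
```

A check like this lets a script report a clear message, or pick a different path, before any files are written.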
Step 1: Define the Schema
A collection schema (CollectionSchema) defines the structure that every document inserted into the collection must conform to.
The schema in Zvec is dynamic: you can add or remove scalar fields and vectors at any time without rebuilding the collection.
CollectionSchema has three parts:
- name: An identifier for the collection.
- fields: A list of scalar fields.
- vectors: A list of vector fields.
1. Collection Name
A human-readable identifier for your collection. This name is used internally for reference and logging.
2. Scalar Fields
Scalar fields store non-vector (i.e., structured) data — such as strings, numbers, booleans, or arrays.
Each field is defined using FieldSchema with the following properties:
- 🔤 name: A unique string identifier for the field within the collection.
- 🧬 data_type: The type of data stored, e.g., STRING, INT64, or array types like ARRAY_STRING.
- ⭕ nullable (optional): Whether the field is allowed to have no value (defaults to False).
- 🚀 index_param (optional): Enables fast filtering by creating an inverted index via InvertIndexParam.
Tip:
Only add an index to fields you plan to filter on. Unindexed fields save storage and write overhead.
If you do enable indexing, you can optionally activate performance-enhancing (but storage-costly) features:
- enable_range_optimization=True → faster range queries (e.g., price > 100)
- enable_extended_wildcard=True → complex string pattern matching (e.g., name LIKE 'abc%def')
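To see why an index speeds up filtering and why it costs storage, here is a toy inverted index in plain Python. It is a conceptual sketch only and does not reflect how Zvec implements InvertIndexParam. The sample documents and the build_inverted_index helper are made up for illustration.

```python
from collections import defaultdict

# Made-up documents with two scalar fields.
docs = {
    "p1": {"category": "electronics", "price": 250},
    "p2": {"category": "books", "price": 15},
    "p3": {"category": "electronics", "price": 80},
}


def build_inverted_index(documents, field):
    """Map each distinct field value to the set of document ids holding it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        index[doc[field]].add(doc_id)
    return index


# Extra structure kept alongside the data: this is the storage and write cost.
category_index = build_inverted_index(docs, "category")

# Indexed filter: a single dictionary lookup.
indexed_hits = category_index["electronics"]

# Unindexed filter: every document has to be inspected.
scanned_hits = {
    doc_id for doc_id, doc in docs.items() if doc["category"] == "electronics"
}
```

Both approaches return the same documents. The index only changes how much work the filter does, which is why it pays off solely on fields you actually filter on.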
3. Vectors (Embeddings)
A vector is defined using VectorSchema with the following properties:
- 🔤 name: A unique string identifier for the vector within the collection.
- 🧬 data_type: The numeric format of the vector.
  - Dense vectors: VECTOR_FP32, VECTOR_FP16, etc.
  - Sparse vectors: SPARSE_VECTOR_FP32, SPARSE_VECTOR_FP16.
- 📐 dimension: Required for dense vectors. The number of dimensions.
- 🚀 index_param: Configures the vector index type and similarity metric.
Configure the vector index via the index_param field using one of the following:
- FlatIndexParam(...) for Flat index
- HnswIndexParam(...) for HNSW index
- IVFIndexParam(...) for IVF index
In index_param you can also specify:
- metric_type: COSINE, L2, or IP (inner product). Ensure your metric matches how your embeddings were trained!
- quantize_type (optional): Compress vectors to reduce index size and speed up search (with slight recall trade-off).
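The choice of metric matters because the three options can rank the same candidates differently. The following plain-Python sketch computes each metric by hand on made-up two-dimensional vectors. It illustrates the standard mathematical definitions, not Zvec's internal implementation.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction_long = [10.0, 0.0]  # same direction as the query, large magnitude
close_but_rotated = [0.9, 0.3]     # near the query in space, slightly rotated

# Cosine ignores magnitude, so the long vector is a perfect match.
cos_long = cosine_similarity(query, same_direction_long)
cos_rot = cosine_similarity(query, close_but_rotated)

# L2 measures straight-line distance, so the nearby vector wins instead.
l2_long = l2_distance(query, same_direction_long)
l2_rot = l2_distance(query, close_but_rotated)

# Inner product rewards magnitude, so the long vector wins by a wide margin.
ip_long = inner_product(query, same_direction_long)
ip_rot = inner_product(query, close_but_rotated)
```

Here COSINE and IP prefer the long vector while L2 prefers the nearby one, so a metric that differs from the one used to train the embedding model can reorder results.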
Full Schema Example
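A schema combining the pieces above might look like the following sketch. The class names and keyword arguments come from this page. The zvec.DataType and zvec.MetricType enum namespaces, the field names, and the collection name are assumptions made for illustration, so check them against the API reference for your installed version.

```python
try:
    import zvec  # guarded so the sketch can be loaded without Zvec installed
except ImportError:
    zvec = None


def build_collection_schema():
    """Return an example CollectionSchema with two scalar fields and two vectors.

    Assumption: data types and metrics are exposed as zvec.DataType and
    zvec.MetricType; the field and collection names are illustrative.
    """
    if zvec is None:
        raise RuntimeError("zvec is not installed")

    return zvec.CollectionSchema(
        name="articles",
        fields=[
            # Indexed, with range optimization for filters such as publish_year > 2020.
            zvec.FieldSchema(
                name="publish_year",
                data_type=zvec.DataType.INT64,
                index_param=zvec.InvertIndexParam(enable_range_optimization=True),
            ),
            # Display-only field: nullable and left unindexed.
            zvec.FieldSchema(
                name="title",
                data_type=zvec.DataType.STRING,
                nullable=True,
            ),
        ],
        vectors=[
            # Dense embedding searched through an HNSW index with cosine similarity.
            zvec.VectorSchema(
                name="dense_embedding",
                data_type=zvec.DataType.VECTOR_FP32,
                dimension=768,
                index_param=zvec.HnswIndexParam(metric_type=zvec.MetricType.COSINE),
            ),
            # Sparse vector: no dimension; index_param is not set in this sketch.
            zvec.VectorSchema(
                name="sparse_embedding",
                data_type=zvec.DataType.SPARSE_VECTOR_FP32,
            ),
        ],
    )
```

With Zvec installed, collection_schema = build_collection_schema() would give the object passed as schema in Step 3.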
Step 2: Configure Collection Options
The CollectionOption lets you control runtime behavior when creating the collection:
- read_only: Opens the collection in read-only mode. Attempts to write will raise an error.
  Note: read_only must be set to False when calling create_and_open(), since creation requires writing files to disk.
- enable_mmap: Uses memory-mapped I/O for faster access (defaults to True). This trades slightly higher memory cache usage for improved performance.
import zvec
collection_option = zvec.CollectionOption(read_only=False, enable_mmap=True)
Step 3: Create and Open the Collection
With your schema and options ready, call create_and_open() to create the collection at the desired path:
import zvec
collection = zvec.create_and_open(
    path="/path/to/my/collection",
    schema=collection_schema,
    option=collection_option,
)
The returned collection object is immediately ready for inserting documents, running queries, or managing data.
Real-World Example: 🛒 Product Search
This schema models a multi-modal product search system, combining visual, textual, and structured metadata for rich retrieval:
🗂️ Scalar Fields: For Filtering & Display
- category (array of strings, indexed): Enables queries like category CONTAIN_ANY ("electronics", "headphones") to find products that belong to either "electronics" or "headphones" (or both).
- price (integer, indexed with range optimization): Supports fast range queries such as price > 100.
- in_stock (boolean, indexed): Enables instant filtering by availability (e.g., "only show items in stock").
- image_url and description are stored but not indexed, since they're only used for display.
📐 Vector Embeddings: For Semantic Relevance
- Two dense vectors capture semantic meaning:
  - image_vec: 512-dimensional embeddings from product images (e.g., via a vision model).
  - description_vec: 768-dimensional embeddings from product descriptions (e.g., from a language model), stored with quantization.
- One sparse vector, keywords_sparse, for keyword matching, enabling hybrid sparse-dense search.
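Before translating this layout into Zvec's schema classes, it can be written down and sanity-checked as plain data. In the sketch below, FieldSpec, VectorSpec, and validate are hypothetical stand-ins rather than Zvec classes. The concrete type names chosen for the integer, boolean, and vector fields (INT64, BOOL, VECTOR_FP32, SPARSE_VECTOR_FP32) are assumptions, since the description above gives only the general kind of each field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:  # hypothetical stand-in, not zvec.FieldSchema
    name: str
    data_type: str
    indexed: bool = False
    range_optimized: bool = False


@dataclass(frozen=True)
class VectorSpec:  # hypothetical stand-in, not zvec.VectorSchema
    name: str
    data_type: str
    dimension: Optional[int] = None
    quantized: bool = False


fields = [
    FieldSpec("category", "ARRAY_STRING", indexed=True),
    FieldSpec("price", "INT64", indexed=True, range_optimized=True),  # assumed width
    FieldSpec("in_stock", "BOOL", indexed=True),                      # assumed type name
    FieldSpec("image_url", "STRING"),     # display only, unindexed
    FieldSpec("description", "STRING"),   # display only, unindexed
]

vectors = [
    VectorSpec("image_vec", "VECTOR_FP32", dimension=512),
    VectorSpec("description_vec", "VECTOR_FP32", dimension=768, quantized=True),
    VectorSpec("keywords_sparse", "SPARSE_VECTOR_FP32"),
]


def validate(field_specs, vector_specs):
    """Check the rules stated on this page; return a list of problems found."""
    problems = []
    for group_name, group in (("field", field_specs), ("vector", vector_specs)):
        names = [spec.name for spec in group]
        if len(names) != len(set(names)):
            problems.append(f"duplicate {group_name} name")
    for f in field_specs:
        # Range optimization is an option of the index, so it needs an index.
        if f.range_optimized and not f.indexed:
            problems.append(f"{f.name}: range optimization requires an index")
    for v in vector_specs:
        is_dense = v.data_type.startswith("VECTOR_")
        if is_dense and v.dimension is None:
            problems.append(f"{v.name}: dense vectors require a dimension")
    return problems


problems = validate(fields, vectors)
```

An empty problems list means the layout respects the naming, indexing, and dimension rules described earlier.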