Create a New Collection

To create a new, empty collection in Zvec, you need to define the following:

  • Schema — the structural blueprint of your data, specifying scalar fields and vector embeddings.
  • Collection options (optional) — runtime settings that control how the collection behaves when opened (e.g., read-only mode).

Once defined, you call create_and_open() to create the collection on disk and get a ready-to-use Collection handle.

If a collection already exists at the specified path, create_and_open() will raise an error to prevent accidental overwrites.


Step 1: Define the Schema

A collection schema (CollectionSchema) defines the structure that every document inserted into the collection must conform to.

The schema in Zvec is dynamic: you can add or remove scalar fields and vectors at any time without rebuilding the collection.

CollectionSchema has three parts:

  1. name: An identifier for the collection.
  2. fields: A list of scalar fields.
  3. vectors: A list of vector fields.

1. Collection Name

A human-readable identifier for your collection. This name is used internally for reference and logging.

2. Scalar Fields

Scalar fields store non-vector (i.e., structured) data — such as strings, numbers, booleans, or arrays.

Each field is defined using FieldSchema with the following properties:

  1. 🔤 name: A unique string identifier for the field within the collection.
  2. 🧬 data_type: The type of data stored — e.g., STRING, INT64, or array types like ARRAY_STRING.
  3. nullable (optional): Whether the field is allowed to have no value (defaults to False).
  4. 🚀 index_param (optional): Enables fast filtering by creating an inverted index via InvertIndexParam.

Tip:
Only add an index to fields you plan to filter on. Leaving a field unindexed saves storage and reduces write overhead.

If you do enable indexing, you can optionally activate performance-enhancing (but storage-costly) features:

  • enable_range_optimization=True → faster range queries (e.g., price > 100)
  • enable_extended_wildcard=True → complex string pattern matching (e.g., name LIKE 'abc%def')
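To make the indexing trade-off concrete, here is a minimal, library-independent sketch of what an inverted index does conceptually: it maps each field value to the set of document ids that hold it, so an equality filter becomes a lookup instead of a full scan. The names build_inverted_index and docs are hypothetical, and the sketch does not reflect Zvec's internals or API.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each value of `field` to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        value = doc.get(field)
        if value is not None:  # a nullable field may have no value
            index[value].add(doc_id)
    return index


# Hypothetical documents keyed by id (illustration only).
docs = {
    1: {"category": "audio", "price": 120},
    2: {"category": "video", "price": 80},
    3: {"category": "audio", "price": 45},
}

category_index = build_inverted_index(docs, "category")

# Equality filter: one dictionary lookup, no scan over all documents.
audio_ids = sorted(category_index["audio"])
print(audio_ids)  # [1, 3]
```

The extra structure is also where the cost comes from: every insert has to update the index, and the index itself takes space, which is why unindexed fields are cheaper to store and write.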

3. Vectors (Embeddings)

A vector is defined using VectorSchema with the following properties:

  1. 🔤 name: A unique string identifier for the vector within the collection.
  2. 🧬 data_type: The numeric format of the vector.
  3. 📐 dimension: Required for dense vectors — the number of dimensions.
  4. 🚀 index_param: Configures the vector index type and similarity metric.

Configure the vector index through the index_param field, which selects the index type. Within index_param you can also specify:

  • metric_type:
    COSINE, L2, or IP (inner product). Make sure the metric matches the one your embedding model was trained with.
  • quantize_type (optional):
    Compresses vectors to reduce index size and speed up search, at the cost of a slight drop in recall.
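The choice of metric matters because the three metrics can rank the same candidates differently. The following plain-Python sketch (no Zvec calls; all function and variable names are illustrative) computes each metric by hand for two candidates and shows that cosine similarity and inner product favor one vector while L2 distance favors the other.

```python
import math


def inner_product(a, b):
    """IP: higher means more similar; sensitive to vector length."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """L2: Euclidean distance; lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """COSINE: compares direction only; higher means more similar."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction_long = [10.0, 0.0]  # same direction as the query, much longer
nearby_off_axis = [1.0, 1.0]       # close in space, different direction

for name, vec in [("long", same_direction_long), ("nearby", nearby_off_axis)]:
    print(
        name,
        "cosine=%.3f" % cosine_similarity(query, vec),
        "l2=%.3f" % l2_distance(query, vec),
        "ip=%.3f" % inner_product(query, vec),
    )
```

Cosine and inner product rank the long vector first, while L2 ranks the nearby vector first. If the embeddings are normalized to unit length, the three metrics produce the same ranking, which is one reason to check how the embedding model expects its vectors to be compared.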

Full Schema Example


Step 2: Configure Collection Options

CollectionOption lets you control runtime behavior when the collection is created and opened:

  • read_only: Opens the collection in read-only mode. Attempts to write will raise an error.

    Note: read_only must be set to False when calling create_and_open(), since creation requires writing files to disk.

  • enable_mmap: Uses memory-mapped I/O for faster access (defaults to True). This trades slightly higher memory cache usage for improved performance.

Collection option:

```python
import zvec

# Creation writes files to disk, so read_only must be False here.
collection_option = zvec.CollectionOption(read_only=False, enable_mmap=True)
```

Step 3: Create and Open the Collection

With your schema and options ready, call create_and_open() to create the collection at the desired path:

Create and open a collection:

```python
import zvec

# collection_schema comes from Step 1 and collection_option from Step 2.
collection = zvec.create_and_open(
    path="/path/to/my/collection",
    schema=collection_schema,
    option=collection_option,
)
```

The returned collection object is immediately ready for inserting documents, running queries, or managing data.


The full schema example models a multi-modal product search system that combines visual, textual, and structured metadata for rich retrieval:

🗂️ Scalar Fields: For Filtering & Display

  • category (array of strings, indexed):
    Enables queries like category CONTAIN_ANY ("electronics", "headphones") to find products that belong to either "electronics" or "headphones" (or both).
  • price (integer, indexed with range optimization):
    Supports fast range queries such as price > 100.
  • in_stock (boolean, indexed):
    Enables instant filtering by availability (e.g., "only show items in stock").
  • image_url and description are stored but not indexed, since they're only used for display.
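The semantics of these filters can be illustrated in plain Python. This is only a sketch of what the conditions mean, evaluated over hypothetical in-memory records; it is not Zvec's filter syntax or query API, and contain_any is an illustrative helper name.

```python
def contain_any(values, candidates):
    """True if the array field shares at least one element with `candidates`."""
    return not set(values).isdisjoint(candidates)


# Hypothetical product records (illustration only).
products = [
    {"id": 1, "category": ["electronics", "headphones"], "price": 150, "in_stock": True},
    {"id": 2, "category": ["electronics"], "price": 90, "in_stock": True},
    {"id": 3, "category": ["furniture"], "price": 300, "in_stock": False},
]

# Equivalent in spirit to:
#   category CONTAIN_ANY ("electronics", "headphones") AND price > 100 AND in_stock
matches = [
    p["id"]
    for p in products
    if contain_any(p["category"], {"electronics", "headphones"})
    and p["price"] > 100
    and p["in_stock"]
]
print(matches)  # [1]
```

Product 2 matches the category condition but fails the price range, and product 3 fails both the category and availability conditions, so only product 1 remains.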

📐 Vector Embeddings: For Semantic Relevance

  • Two dense vectors capture semantic meaning:
    • image_vec: 512-dimensional embeddings from product images (e.g., via a vision model).
    • description_vec: 768-dimensional embeddings from product descriptions (e.g., from a language model), stored with quantization.
  • One sparse vector keywords_sparse for keyword matching, enabling hybrid sparse-dense search.
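As a rough illustration of how sparse and dense signals can be combined, the sketch below stores a sparse vector as a {dimension_index: weight} dictionary, scores keyword overlap with a sparse dot product, and blends it with a dense cosine similarity. The weighted-sum fusion and the alpha parameter are assumptions chosen for illustration; the source does not describe how Zvec fuses sparse and dense results.

```python
import math


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension_index: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b[i] for i, weight in a.items() if i in b)


def dense_cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def hybrid_score(dense_sim, sparse_sim, alpha=0.7):
    """Hypothetical fusion: weighted sum of dense and sparse similarity."""
    return alpha * dense_sim + (1 - alpha) * sparse_sim


query_sparse = {3: 1.0, 17: 0.5}
doc_sparse = {3: 2.0, 42: 1.0}

sparse_sim = sparse_dot(query_sparse, doc_sparse)  # only dimension 3 overlaps
dense_sim = dense_cosine([1.0, 0.0], [1.0, 0.0])   # identical direction
score = hybrid_score(dense_sim, sparse_sim, alpha=0.5)
print(sparse_sim, dense_sim, score)
```

Only non-zero dimensions are stored and compared, which is what makes sparse vectors practical for large keyword vocabularies.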