Full-text search finds documents by matching text content, ranked by BM25 relevance scoring. It supports natural language queries, exact phrase matching, and boolean operators.
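To make "BM25 relevance scoring" more concrete, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not Zvec's implementation: the `bm25_scores` helper, the naive lowercase/whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with textbook Okapi BM25.

    Assumes `docs` is a non-empty list of non-empty strings.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "machine learning with vector search",
    "deep learning for images",
    "cooking recipes for the weekend",
]
scores = bm25_scores("machine learning", docs)
```

In this toy corpus the first document scores highest because it contains both query terms, and the third scores zero because it contains neither.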

Prerequisites

This guide assumes you have opened a collection and have a collection object ready.

Defining an FTS Field

To enable full-text search on a field, add a FieldSchema with an FtsIndexParam as its index_param. The field must be of type STRING.

Define an FTS field

import zvec

fts_field = zvec.FieldSchema(  
    name="content",
    data_type=zvec.DataType.STRING,
    nullable=False,
    index_param=zvec.FtsIndexParam(
        tokenizer_name="standard",   # Tokenizer to use
        filters=["lowercase"],       # Token filters applied after tokenization
    ),
)

`FtsIndexParam` Parameters

Parameter	Type	Default	Description
`tokenizer_name`	`str`	`"standard"`	The tokenizer used to split text into tokens. Options: `"standard"`, `"whitespace"`, `"jieba"`. See Tokenizers.
`filters`	`list[str]`	`["lowercase"]`	Token filters applied in sequence after tokenization. See Token Filters.
`extra_params`	`str`	`""`	JSON string for tokenizer- and filter-specific configuration. See the configuration sections for each tokenizer and token filter.
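Because extra_params is a JSON string, it can be easier to build it from a dictionary than to write the quoting by hand. The sketch below uses only the standard library; the keys come from the configuration tables on this page, and placing a tokenizer key and a filter key in the same object is an assumption rather than documented behavior.

```python
import json

# Keys are taken from the tokenizer and stemmer configuration tables.
# Combining them in one JSON object is an assumption.
extra = {"max_token_length": 64, "stemmer_lang": "english"}
extra_params = json.dumps(extra)

# The resulting string would then be passed as
# zvec.FtsIndexParam(..., extra_params=extra_params).
print(extra_params)
```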

Zvec supports FTS-only collections — you can create a collection with text fields and no vector fields at all.

Performing Full-Text Search

Zvec provides two query modes for full-text search, both using the Fts object within a Query:

Match String — natural language input, automatically tokenized
Query String — advanced expression syntax with boolean operators

Match String

Use match_string for natural language queries. The input is plain text, so no special syntax or escaping is needed. It is tokenized with the tokenizer configured on the field, and the resulting tokens are combined using the default operator (OR unless you change it).

Match string query

from zvec.model.param.query import Fts, Query

result = collection.query(  
    queries=Query(
        field_name="content",
        fts=Fts(match_string="machine learning"),  
    ),
    topk=5,
)
print(result)

With the default OR operator, this returns documents containing either "machine" or "learning" (or both), ranked by BM25 relevance.

Query String

Use query_string for advanced queries with explicit boolean operators, required/excluded terms, and exact phrase matching.

Query string with operators

from zvec.model.param.query import Fts, Query

result = collection.query(  
    queries=Query(
        field_name="content",
        fts=Fts(query_string='+learning -neural "vector search"'),  
    ),
    topk=5,
)
print(result)

This query requires "learning", excludes "neural", and matches the exact phrase "vector search".
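As an illustration of what "exact phrase" means here (same word order, adjacent positions), the following toy check works on already-tokenized text. The `contains_phrase` helper is hypothetical and says nothing about how Zvec stores positions internally.

```python
def contains_phrase(doc_tokens, phrase_tokens):
    """Return True if phrase_tokens occur in doc_tokens in order and adjacent."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


phrase = "vector search".split()
in_order = contains_phrase("fast vector search engine".split(), phrase)
swapped = contains_phrase("search vector index".split(), phrase)
separated = contains_phrase("vector based search".split(), phrase)
```

Only the first document matches: the second has the words in the wrong order, and the third has another word between them.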

query_string and match_string are mutually exclusive — you must provide exactly one of them in each Fts object.

Query Syntax Reference

The following operators are supported in query_string expressions:

Syntax	Meaning	Example
`term`	Match a single term	`vector`
`"phrase"`	Match an exact phrase (word order and adjacency)	`"machine learning"`
`+term`	Term must appear in the document	`+vector`
`-term`	Term must not appear in the document	`-slow`
`a AND b`	Both terms must match	`vector AND search`
`a OR b`	Either term can match	`vector OR embedding`
`a NOT b`	Match `a` but exclude documents matching `b`	`learning NOT deep`
`(expr)`	Group sub-expressions	`(vector OR embedding) AND search`
`+(expr)`	Group must match	`+(vector OR embedding)`
`-(expr)`	Group must not match	`-(slow AND outdated)`

Operator precedence: AND and NOT bind tighter than OR, so a OR b AND c is parsed as a OR (b AND c). Adjacent terms without an explicit operator are combined using the default_operator setting (default: OR).

For complex queries mixing multiple operators, use () to make grouping explicit and avoid unexpected results.

Leading negation is not supported: NOT term and a standalone -term each need at least one positive term alongside them. Write a NOT b, or pair -term with a positive term (e.g., a -b).
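When query strings are assembled from user input, a small builder can keep the syntax rules above in one place. The sketch below is a hypothetical helper, not part of Zvec. It puts positive clauses first and refuses to produce a purely negative query. It does not escape special characters, since escaping rules are not covered on this page, and it assumes that clause order does not change the meaning of the query.

```python
def build_query_string(must=(), must_not=(), should=(), phrases=()):
    """Assemble a query_string from term lists (hypothetical helper)."""
    positive = (
        [f"+{term}" for term in must]
        + list(should)
        + [f'"{phrase}"' for phrase in phrases]
    )
    if not positive:
        # Leading negation is not supported, so require a positive clause.
        raise ValueError("query needs at least one positive term or phrase")
    negative = [f"-{term}" for term in must_not]
    # Positive clauses go first so the expression never starts with a negation.
    return " ".join(positive + negative)


query = build_query_string(
    must=["learning"],
    must_not=["neural"],
    phrases=["vector search"],
)
```

The resulting string could then be passed as Fts(query_string=query).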

Query Parameters

Parameter	Description
`topk`	The number of top-scoring documents to return.
`filter`	An optional SQL-like boolean expression to restrict results. See conditional filtering for details.
`output_fields`	An optional list of scalar field names to include in results. If omitted, all scalar fields are returned.

Default Operator

By default, adjacent bare terms in both match_string and query_string are combined with OR. To change this to AND, pass an FtsQueryParam via the param field:

Using AND as the default operator

import zvec
from zvec.model.param.query import Fts, Query

result = collection.query(
    queries=Query(
        field_name="content",
        fts=Fts(match_string="machine learning"),
        param=zvec.FtsQueryParam(default_operator="AND"),  
    ),
    topk=5,
)

With default_operator="AND", this returns only documents containing both "machine" and "learning".

Explicit operators (AND, OR, +, -) in a query_string are not affected by default_operator — it only controls how adjacent bare terms are combined.
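The effect of the default operator on bare terms can be pictured with a toy matcher over lowercase, whitespace-split text. This is an illustrative model only; the `matches` function is hypothetical and ignores ranking, tokenizer settings, and explicit operators.

```python
def matches(query, doc, default_operator="OR"):
    """Toy model: does `doc` match the bare terms in `query`?"""
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    if default_operator == "AND":
        return query_terms <= doc_terms   # every term must be present
    return bool(query_terms & doc_terms)  # any one term is enough


docs = ["machine learning basics", "learning to cook", "vector search"]
or_hits = [d for d in docs if matches("machine learning", d, "OR")]
and_hits = [d for d in docs if matches("machine learning", d, "AND")]
```

With OR, the first two documents match; with AND, only the first one does.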

Combining FTS with Scalar Filters

You can combine full-text search with scalar filters to narrow results:

FTS with scalar filter

from zvec.model.param.query import Fts, Query

result = collection.query(
    queries=Query(
        field_name="content",
        fts=Fts(match_string="machine learning"),
    ),
    filter="category = 'tech'",  
    topk=5,
)

FTS and vector search are mutually exclusive within one query route. A single Query / ZVecQuery should not set both fts and vector / id; use separate query routes with re-ranking, or run separate queries and merge results in your application.
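If you run the FTS query and the vector query separately, one common way to merge the two ranked lists in application code is reciprocal rank fusion (RRF). RRF is chosen here purely for illustration; this page does not prescribe a merge strategy. The sketch assumes you have already extracted document IDs from each result in rank order (how to do that is not shown on this page), and `k=60` is a conventional constant, not a Zvec setting.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


fts_ids = ["d1", "d2", "d3"]     # hypothetical IDs from the FTS query
vector_ids = ["d3", "d1", "d4"]  # hypothetical IDs from the vector query
merged = reciprocal_rank_fusion([fts_ids, vector_ids])
```

Documents that appear near the top of both lists rise to the top of the merged list, without requiring the BM25 and vector scores to be on the same scale.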

Tokenizers

Zvec provides three built-in tokenizers. The tokenizer is configured per-field via FtsIndexParam.

Tokenizer	Name	Description
Standard	`"standard"`	Implements Unicode UAX #29 word boundaries and behaves similarly to Elasticsearch's standard tokenizer. Best for most English-like languages. (Default)
Whitespace	`"whitespace"`	Splits text on whitespace only (spaces, tabs, newlines). Preserves punctuation within tokens.
Jieba	`"jieba"`	Chinese word segmentation using cppjieba. Supports mixed Chinese/English text.

Standard Tokenizer

The default tokenizer. It implements Unicode UAX #29 word boundaries and behaves similarly to Elasticsearch's standard tokenizer. For CJK ideographs, standard emits single-character tokens; use jieba when you need Chinese word-level search.

Standard tokenizer

zvec.FtsIndexParam(tokenizer_name="standard", filters=["lowercase"])

Configuration (via extra_params JSON):

Key	Type	Default	Description
`max_token_length`	`int`	`255`	Maximum token length. Tokens exceeding this length are discarded.
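The "discarded" behavior can be pictured as a filter that drops over-long tokens entirely instead of truncating them. This is a sketch under assumptions: the `drop_long_tokens` helper is hypothetical, and it counts characters, whereas this page does not say whether Zvec measures length in characters or bytes.

```python
def drop_long_tokens(tokens, max_token_length=255):
    """Keep only tokens whose length does not exceed the limit."""
    return [t for t in tokens if len(t) <= max_token_length]


tokens = ["search", "x" * 300, "engine"]
kept = drop_long_tokens(tokens)
```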

Whitespace Tokenizer

Splits text only on whitespace characters. Useful when punctuation should be preserved in tokens.

Whitespace tokenizer

zvec.FtsIndexParam(tokenizer_name="whitespace", filters=["lowercase"])
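To see why the choice between whitespace and word-boundary tokenization matters, the following standard-library sketch splits the same text both ways. The regular expression is only a rough stand-in for word-boundary tokenization; real UAX #29 rules differ, for example around apostrophes and decimal points, so Zvec's standard tokenizer may produce different tokens.

```python
import re

text = "Don't panic: state-of-the-art search, v2.0!"

# Whitespace-style splitting keeps punctuation attached to tokens.
whitespace_tokens = text.split()

# Rough approximation of word-boundary splitting: runs of letters,
# digits, and underscores. Not equivalent to UAX #29.
word_tokens = re.findall(r"\w+", text)
```

With whitespace splitting, a query for "search" would not match the token "search," unless the query carries the same punctuation.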

Jieba Tokenizer

Chinese word segmentation tokenizer. Also handles mixed Chinese/English text.

The Python SDK bundles a default Jieba dictionary — the Jieba tokenizer works out of the box with no extra configuration. You only need to set jieba_dict_dir if you want to use a custom dictionary.

Jieba tokenizer

zvec.FtsIndexParam(
    tokenizer_name="jieba",
    filters=["lowercase"],
    # Optional: only needed to point at a custom dictionary directory.
    extra_params='{"jieba_dict_dir": "/path/to/jieba/dict"}',
)

Configuration (via extra_params JSON):

Key	Type	Default	Description
`jieba_dict_dir`	`str`	—	Directory containing `jieba.dict.utf8` and `hmm_model.utf8`. Can also be set via `ZVEC_JIEBA_DICT_DIR` environment variable.
`user_dict_path`	`str`	—	Path to a custom user dictionary file.
`cut_mode`	`str`	`"search"`	Segmentation mode: `"search"` (fine-grained, recommended for search), `"mix"`, `"full"`, or `"hmm"`. See cppjieba documentation for details on each mode.

jieba_dict_dir resolution order (first non-empty value wins):

1. Per-field extra_params in FtsIndexParam
2. ZVEC_JIEBA_DICT_DIR environment variable
3. Global default set via zvec.init(jieba_dict_dir=...) or zvec.set_default_jieba_dict_dir()
4. Built-in dictionary bundled with the Python SDK (set automatically on import zvec)
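The resolution order above can be modeled as a "first non-empty value wins" lookup. The function below is a hypothetical illustration of that rule, not Zvec's code; the `bundled` placeholder path and the parameter names are assumptions.

```python
import os


def resolve_jieba_dict_dir(field_value=None, global_default=None,
                           bundled="/hypothetical/bundled/dict", environ=None):
    """Return the first non-empty candidate, mirroring the documented order."""
    environ = os.environ if environ is None else environ
    candidates = [
        field_value,                         # per-field extra_params
        environ.get("ZVEC_JIEBA_DICT_DIR"),  # environment variable
        global_default,                      # global default
        bundled,                             # dictionary bundled with the SDK
    ]
    for candidate in candidates:
        if candidate:  # skips both None and empty strings
            return candidate
    return None


env = {"ZVEC_JIEBA_DICT_DIR": "/from/env"}
from_field = resolve_jieba_dict_dir(field_value="/from/field", environ=env)
from_env = resolve_jieba_dict_dir(global_default="/from/global", environ=env)
from_global = resolve_jieba_dict_dir(global_default="/from/global", environ={})
```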

Token Filters

Token filters are applied in sequence after tokenization. The same filter configuration is used for both indexing and querying.

Filter	Description
`"lowercase"`	Converts tokens to Unicode lowercase.
`"ascii_folding"`	Folds Unicode characters into ASCII equivalents.
`"stemmer"`	Normalizes word forms using a Snowball stemmer.

For English text, use lowercase and stemmer so that case and word forms do not affect matching. For text that contains diacritics, you can also add ascii_folding so that accented and unaccented forms match.

zvec.FtsIndexParam(
    tokenizer_name="standard",
    filters=["lowercase", "stemmer"],
    extra_params='{"stemmer_lang": "english"}',
)

Stemmer configuration (via extra_params JSON):

Key	Type	Default	Description
`stemmer_lang`	`str`	`"english"`	Snowball language or algorithm name. For example, set it to `"porter"` for behavior close to Elasticsearch's default English stemmer.
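To illustrate how filters run in sequence and why the same chain must apply at index and query time, here is a standard-library sketch of lowercase and an approximate ASCII folding. It is not Zvec's implementation: the function names are hypothetical, and folding via Unicode decomposition only handles characters that decompose into a base letter plus combining marks.

```python
import unicodedata


def lowercase(tokens):
    return [t.lower() for t in tokens]


def ascii_folding(tokens):
    """Approximate folding: decompose, then drop combining marks."""
    folded = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFKD", t)
        folded.append(
            "".join(c for c in decomposed if not unicodedata.combining(c))
        )
    return folded


def apply_filters(tokens, filters):
    # Filters run in list order; each one sees the previous one's output.
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


chain = [lowercase, ascii_folding]
index_tokens = apply_filters(["Café", "Résumé"], chain)
query_tokens = apply_filters(["CAFE"], chain)
```

Because both sides pass through the same chain, the query token "CAFE" ends up identical to the indexed form of "Café".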

Constraints

FTS and vector search are mutually exclusive within one query route — a single Query / ZVecQuery should not set both fts and vector / id.
query_string and match_string are mutually exclusive in a single Fts object.
FTS fields do not support alter column.
Leading negation (NOT term or standalone -term) is not supported — at least one positive term is required.
