
Full-Text Search

Full-text search finds documents by matching their text content and ranks the matches by BM25 relevance score. It supports natural language queries, exact phrase matching, and boolean operators.


Prerequisites

This guide assumes you have opened a collection and have a collection object ready.


Defining an FTS Field

To enable full-text search on a field, add a FieldSchema with an FtsIndexParam as its index_param. The field must be of type STRING.

Define an FTS field
import zvec

fts_field = zvec.FieldSchema(  
    name="content",
    data_type=zvec.DataType.STRING,
    nullable=False,
    index_param=zvec.FtsIndexParam(
        tokenizer_name="standard",   # Tokenizer to use
        filters=["lowercase"],       # Token filters applied after tokenization
    ),
)

FtsIndexParam Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| tokenizer_name | str | "standard" | The tokenizer used to split text into tokens. Options: "standard", "whitespace", "jieba". See Tokenizers. |
| filters | list[str] | ["lowercase"] | Token filters applied in sequence after tokenization. Currently only "lowercase" is available. |
| extra_params | str | "" | JSON string for tokenizer-specific configuration (e.g., Jieba dictionary path). |

Zvec supports FTS-only collections — you can create a collection with text fields and no vector fields at all.


Zvec provides two query modes for full-text search, both using the Fts object within a Query:

  1. Match String — natural language input, automatically tokenized
  2. Query String — advanced expression syntax with boolean operators

Match String

Use match_string for natural language queries. The input is plain text — no special syntax or escaping is needed. It is tokenized using the same tokenizer configured on the field, and tokens are combined using the default operator (OR by default).

Match string query
from zvec.model.param.query import Fts, Query

result = collection.query(  
    queries=Query(
        field_name="content",
        fts=Fts(match_string="machine learning"),  
    ),
    topk=5,
)
print(result)

With the default OR operator, this returns documents containing either "machine" or "learning" (or both), ranked by BM25 relevance.
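
To build intuition for how OR matching and BM25 ranking interact, here is a minimal pure-Python sketch. It is not Zvec's implementation: the tokenizer is a plain lowercase split on whitespace, and the constants k1 = 1.2 and b = 0.75 are commonly used BM25 values assumed here for illustration, not values documented for Zvec.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25 (OR semantics)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in set(query.lower().split()):
            if tf[term] == 0:
                continue  # OR semantics: a missing term simply adds nothing
            df = sum(1 for other in tokenized if term in other)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "machine learning with vectors",  # contains both query terms
    "learning to cook",               # contains one query term
    "gardening tips",                 # contains neither
]
scores = bm25_scores("machine learning", docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)  # [0, 1, 2]
```

The document containing both terms scores highest, the document containing only "learning" still matches with a lower score, and the unrelated document scores zero.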

Query String

Use query_string for advanced queries with explicit boolean operators, required/excluded terms, and exact phrase matching.

Query string with operators
from zvec.model.param.query import Fts, Query

result = collection.query(  
    queries=Query(
        field_name="content",
        fts=Fts(query_string='+learning -neural "vector search"'),  
    ),
    topk=5,
)
print(result)

This query requires "learning", excludes "neural", and matches the exact phrase "vector search".

query_string and match_string are mutually exclusive — you must provide exactly one of them in each Fts object.


Query Syntax Reference

The following operators are supported in query_string expressions:

| Syntax | Meaning | Example |
|---|---|---|
| term | Match a single term | vector |
| "phrase" | Match an exact phrase (word order and adjacency) | "machine learning" |
| +term | Term must appear in the document | +vector |
| -term | Term must not appear in the document | -slow |
| a AND b | Both terms must match | vector AND search |
| a OR b | Either term can match | vector OR embedding |
| a NOT b | Match a but exclude documents matching b | learning NOT deep |
| (expr) | Group sub-expressions | (vector OR embedding) AND search |
| +(expr) | Group must match | +(vector OR embedding) |
| -(expr) | Group must not match | -(slow AND outdated) |

Operator precedence: AND / NOT bind tighter than OR. Adjacent terms without an explicit operator are combined using the default_operator setting (default: OR).

For complex queries mixing multiple operators, use () to make grouping explicit and avoid unexpected results.
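
To see why explicit grouping matters, the sketch below models each term as the set of document IDs that contain it and evaluates the same three terms under two groupings. The posting lists are made up for illustration, and Python set operators stand in for the query engine.

```python
# Hypothetical posting lists: term -> IDs of documents containing it.
postings = {
    "vector": {1, 2},
    "embedding": {3},
    "search": {2, 3},
}

# vector OR embedding AND search
# AND binds tighter than OR, so this reads as: vector OR (embedding AND search)
implicit = postings["vector"] | (postings["embedding"] & postings["search"])

# (vector OR embedding) AND search
# Parentheses force the OR to be evaluated first.
grouped = (postings["vector"] | postings["embedding"]) & postings["search"]

print(sorted(implicit))  # [1, 2, 3]
print(sorted(grouped))   # [2, 3]
```

Document 1 contains "vector" but not "search", so it matches only the ungrouped form.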

Leading negation is not supported — both NOT term and standalone -term require at least one positive term. Use a NOT b or combine -term with positive terms (e.g., a -b).
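
When query strings are assembled from user input, one way to respect this rule is to check for a positive term before building the expression. The helper below is a hypothetical sketch, not part of the Zvec API; it assumes terms are plain words and performs no escaping.

```python
def build_query_string(must=(), should=(), phrases=(), must_not=()):
    """Assemble a query_string from plain-word terms (hypothetical helper).

    Assumes terms contain no quotes, parentheses, or operator characters;
    no escaping is attempted.
    """
    positives = [f"+{t}" for t in must]
    positives += list(should)
    positives += [f'"{p}"' for p in phrases]
    if not positives:
        # Covers both the empty query and the negation-only case.
        raise ValueError("at least one positive term is required")
    negatives = [f"-{t}" for t in must_not]
    return " ".join(positives + negatives)

query_string = build_query_string(
    must=["learning"], phrases=["vector search"], must_not=["neural"]
)
print(query_string)  # +learning "vector search" -neural
```

The resulting string can then be passed as Fts(query_string=...).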


Query Parameters

| Parameter | Description |
|---|---|
| topk | The number of top-scoring documents to return. |
| filter | An optional SQL-like boolean expression to restrict results. See conditional filtering for details. |
| output_fields | An optional list of scalar field names to include in results. If omitted, all scalar fields are returned. |

Default Operator

By default, adjacent bare terms in both match_string and query_string are combined with OR. To change this to AND, pass an FtsQueryParam via the param field of the Query:

Using AND as the default operator
import zvec
from zvec.model.param.query import Fts, Query

result = collection.query(
    queries=Query(
        field_name="content",
        fts=Fts(match_string="machine learning"),
        param=zvec.FtsQueryParam(default_operator="AND"),  
    ),
    topk=5,
)

With default_operator="AND", this returns only documents containing both "machine" and "learning".

Explicit operators (AND, OR, +, -) in a query_string are not affected by default_operator — it only controls how adjacent bare terms are combined.
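
The effect of the default operator on bare terms can be pictured with a small pure-Python sketch. It models documents as token sets and is only an illustration of the matching rule described above, not Zvec's query engine.

```python
def matches(query_tokens, doc_tokens, default_operator="OR"):
    """Return True if a document satisfies a query made of bare terms."""
    hits = [token in doc_tokens for token in query_tokens]
    if default_operator == "AND":
        return all(hits)   # every term must be present
    if default_operator == "OR":
        return any(hits)   # one present term is enough
    raise ValueError(f"unsupported operator: {default_operator!r}")

query = ["machine", "learning"]
doc_a = {"machine", "learning", "basics"}
doc_b = {"learning", "guitar"}

print(matches(query, doc_a, "OR"), matches(query, doc_b, "OR"))    # True True
print(matches(query, doc_a, "AND"), matches(query, doc_b, "AND"))  # True False
```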


Combining FTS with Scalar Filters

You can combine full-text search with scalar filters to narrow results:

FTS with scalar filter
from zvec.model.param.query import Fts, Query

result = collection.query(
    queries=Query(
        field_name="content",
        fts=Fts(match_string="machine learning"),
    ),
    filter="category = 'tech'",  
    topk=5,
)

FTS and vector search are mutually exclusive within one query route. A single Query / ZVecQuery should not set both fts and vector / id; use separate query routes with re-ranking, or run separate queries and merge results in your application.
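
If you run separate FTS and vector queries and merge in your application, one common merging technique is reciprocal rank fusion (RRF). The sketch below is an assumption about how such a merge could look, not a method prescribed by Zvec. It takes plain lists of document IDs; extracting IDs from Zvec results is not shown, and k = 60 is a conventional RRF constant chosen for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it contains.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)

fts_ids = ["d1", "d2", "d3"]     # hypothetical IDs from an FTS query, best first
vector_ids = ["d3", "d1", "d4"]  # hypothetical IDs from a vector query, best first

merged = reciprocal_rank_fusion([fts_ids, vector_ids])
print(merged)  # ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both lists rise to the top, because RRF rewards agreement between rankings without needing the two score scales to be comparable.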


Tokenizers

Zvec provides three built-in tokenizers. The tokenizer is configured per-field via FtsIndexParam.

| Tokenizer | Name | Description |
|---|---|---|
| Standard | "standard" | Splits text on non-alphanumeric characters. Best for most English-like languages. (Default) |
| Whitespace | "whitespace" | Splits text on whitespace only (spaces, tabs, newlines). Preserves punctuation within tokens. |
| Jieba | "jieba" | Chinese word segmentation using cppjieba. Supports mixed Chinese/English text. |

Standard Tokenizer

The default tokenizer. Splits text into tokens at non-alphanumeric character boundaries and discards delimiters.

Standard tokenizer
import zvec
zvec.FtsIndexParam(tokenizer_name="standard", filters=["lowercase"])

Configuration (via extra_params JSON):

| Key | Type | Default | Description |
|---|---|---|---|
| max_token_length | int | 255 | Maximum token length. Tokens exceeding this length are discarded. |

Whitespace Tokenizer

Splits text only on whitespace characters. Useful when punctuation should be preserved in tokens.

Whitespace tokenizer
import zvec
zvec.FtsIndexParam(tokenizer_name="whitespace", filters=["lowercase"])
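
The difference between the standard and whitespace tokenizers is easiest to see side by side. The functions below are rough pure-Python approximations written for illustration; they treat only ASCII letters and digits as alphanumeric, which is a simplifying assumption, and they are not the tokenizers Zvec ships.

```python
import re

def standard_like(text):
    """Split on runs of non-alphanumeric characters and drop the delimiters."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", text) if token]

def whitespace_like(text):
    """Split on whitespace only, keeping punctuation inside tokens."""
    return text.split()

text = "full-text search, e.g. user_id=42"

print(standard_like(text))
# ['full', 'text', 'search', 'e', 'g', 'user', 'id', '42']
print(whitespace_like(text))
# ['full-text', 'search,', 'e.g.', 'user_id=42']
```

With whitespace-style splitting, a query for "search" would not match the token "search," unless the query text carries the same punctuation, which is worth keeping in mind when choosing a tokenizer.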

Jieba Tokenizer

Chinese word segmentation tokenizer. Also handles mixed Chinese/English text.

The Python SDK bundles a default Jieba dictionary — the Jieba tokenizer works out of the box with no extra configuration. You only need to set jieba_dict_dir if you want to use a custom dictionary.

Jieba tokenizer
import zvec

zvec.FtsIndexParam(
    tokenizer_name="jieba",
    filters=["lowercase"],
    # Only needed for a custom dictionary; the bundled one is used otherwise.
    extra_params='{"jieba_dict_dir": "/path/to/jieba/dict"}',
)

Configuration (via extra_params JSON):

| Key | Type | Default | Description |
|---|---|---|---|
| jieba_dict_dir | str | | Directory containing jieba.dict.utf8 and hmm_model.utf8. Can also be set via ZVEC_JIEBA_DICT_DIR environment variable. |
| user_dict_path | str | | Path to a custom user dictionary file. |
| cut_mode | str | "search" | Segmentation mode: "search" (fine-grained, recommended for search), "mix", "full", or "hmm". See cppjieba documentation for details on each mode. |
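
Because extra_params is a JSON string, generating it with json.dumps avoids quoting mistakes in hand-written JSON. The helper below is a hypothetical convenience function, not part of the Zvec API; it uses only the keys listed in the table above and rejects unknown cut_mode values early.

```python
import json

ALLOWED_CUT_MODES = {"search", "mix", "full", "hmm"}

def jieba_extra_params(jieba_dict_dir=None, user_dict_path=None, cut_mode="search"):
    """Build the extra_params JSON string for a Jieba FTS field (hypothetical helper)."""
    if cut_mode not in ALLOWED_CUT_MODES:
        raise ValueError(f"unknown cut_mode: {cut_mode!r}")
    params = {"cut_mode": cut_mode}
    # Optional keys are omitted when unset so the defaults apply.
    if jieba_dict_dir:
        params["jieba_dict_dir"] = jieba_dict_dir
    if user_dict_path:
        params["user_dict_path"] = user_dict_path
    return json.dumps(params)

extra_params = jieba_extra_params(user_dict_path="/path/to/user.dict", cut_mode="mix")
print(extra_params)
# {"cut_mode": "mix", "user_dict_path": "/path/to/user.dict"}
```

The returned string can be passed as the extra_params argument of FtsIndexParam.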

jieba_dict_dir resolution order (first non-empty value wins):

  1. Per-field extra_params in FtsIndexParam
  2. ZVEC_JIEBA_DICT_DIR environment variable
  3. Global default set via zvec.init(jieba_dict_dir=...) or zvec.set_default_jieba_dict_dir()
  4. Built-in dictionary bundled with the Python SDK (set automatically on import zvec)
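
The resolution order above amounts to a "first non-empty value wins" lookup. The function below mirrors that logic in plain Python for illustration. Its name and parameters are hypothetical and it does not call into Zvec; the env argument exists only so the behaviour can be tested without touching the real environment.

```python
import os

def resolve_jieba_dict_dir(field_value="", global_default="", bundled="", env=None):
    """Return the first non-empty dictionary directory, in documented order."""
    env = os.environ if env is None else env
    candidates = [
        field_value,                         # 1. per-field extra_params
        env.get("ZVEC_JIEBA_DICT_DIR", ""),  # 2. environment variable
        global_default,                      # 3. global default
        bundled,                             # 4. dictionary bundled with the SDK
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return ""

print(resolve_jieba_dict_dir(
    global_default="/opt/dict",
    bundled="/sdk/dict",
    env={"ZVEC_JIEBA_DICT_DIR": "/env/dict"},
))  # /env/dict
```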

Token Filters

Token filters are applied in sequence after tokenization. Currently, only one filter is available:

| Filter | Description |
|---|---|
| "lowercase" | Converts all tokens to ASCII lowercase. |

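"ASCII lowercase" is read here as folding only the letters A to Z and leaving every other character untouched. That reading is an assumption based on the description above, and the sketch below illustrates it in pure Python rather than reproducing Zvec's filter.

```python
def ascii_lowercase(token):
    """Lowercase only ASCII letters A-Z; leave all other characters as-is."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in token
    )

print(ascii_lowercase("Zvec"))   # zvec
print(ascii_lowercase("ÉCOLE"))  # École
print("ÉCOLE".lower())           # école (full Unicode lowercasing, for contrast)
```

Under this reading, tokens that differ only in the case of non-ASCII letters would be indexed as distinct terms.
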
Constraints

  • FTS and vector search are mutually exclusive within one query route — a single Query / ZVecQuery should not set both fts and vector / id.
  • query_string and match_string are mutually exclusive in a single Fts object.
  • FTS fields do not support alter column.
  • Leading negation (NOT term or standalone -term) is not supported — at least one positive term is required.
