Full-Text Search
Full-text search finds documents by matching their text content and ranks the results by BM25 relevance. It supports natural language queries, exact phrase matching, and boolean operators.
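For intuition about the ranking, the following is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not Zvec's implementation: the function name `bm25_scores`, the constants `k1=1.2` and `b=0.75`, and the whitespace tokenization are illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook BM25.

    k1 and b are common illustrative defaults, not confirmed Zvec values.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "machine learning with vectors".split(),
    "deep learning".split(),
    "cooking recipes".split(),
]
scores = bm25_scores(["machine", "learning"], docs)
print(scores)
```

In this toy corpus the first document matches both query terms and scores highest, the second matches only "learning", and the third scores zero.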
Prerequisites
This guide assumes you have opened a collection and have a collection object ready.
Defining an FTS Field
To enable full-text search on a field, add a FieldSchema with an FtsIndexParam as its index_param. The field must be of type STRING.
import zvec
fts_field = zvec.FieldSchema(
    name="content",
    data_type=zvec.DataType.STRING,
    nullable=False,
    index_param=zvec.FtsIndexParam(
        tokenizer_name="standard",  # Tokenizer to use
        filters=["lowercase"],  # Token filters applied after tokenization
    ),
)
FtsIndexParam Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| tokenizer_name | str | "standard" | The tokenizer used to split text into tokens. Options: "standard", "whitespace", "jieba". See Tokenizers. |
| filters | list[str] | ["lowercase"] | Token filters applied in sequence after tokenization. Currently only "lowercase" is available. |
| extra_params | str | "" | JSON string for tokenizer-specific configuration (e.g., Jieba dictionary path). |
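Because extra_params is a JSON string rather than a dictionary, one low-risk way to produce it is to build a dictionary and serialize it with the standard library. The sketch below uses only the standard library; the keys come from the Jieba configuration table in the Tokenizers section, and the path is the same placeholder used elsewhere in this guide.

```python
import json

# Keys taken from the Jieba configuration table in this guide.
jieba_options = {
    "jieba_dict_dir": "/path/to/jieba/dict",  # placeholder path
    "cut_mode": "search",
}

# json.dumps handles quoting and escaping, so the result is valid JSON.
extra_params = json.dumps(jieba_options)
print(extra_params)  # {"jieba_dict_dir": "/path/to/jieba/dict", "cut_mode": "search"}
```

The resulting string can then be passed as the extra_params argument of FtsIndexParam.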
Zvec supports FTS-only collections — you can create a collection with text fields and no vector fields at all.
Performing Full-Text Search
Zvec provides two query modes for full-text search, both using the Fts object within a Query:
- Match String — natural language input, automatically tokenized
- Query String — advanced expression syntax with boolean operators
Match String
Use match_string for natural language queries. The input is plain text — no special syntax or escaping is needed. It is tokenized using the same tokenizer configured on the field, and tokens are combined using the default operator (OR by default).
from zvec.model.param.query import Fts, Query
result = collection.query(
    queries=Query(
        field_name="content",
        fts=Fts(match_string="machine learning"),
    ),
    topk=5,
)
print(result)
With the default OR operator, this returns documents containing either "machine" or "learning" (or both), ranked by BM25 relevance.
Query String
Use query_string for advanced queries with explicit boolean operators, required/excluded terms, and exact phrase matching.
from zvec.model.param.query import Fts, Query
result = collection.query(
    queries=Query(
        field_name="content",
        fts=Fts(query_string='+learning -neural "vector search"'),
    ),
    topk=5,
)
print(result)
This query requires "learning", excludes "neural", and matches the exact phrase "vector search".
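To make "exact phrase" concrete: a phrase matches only when its tokens appear in the document in the same order and directly next to each other. The following toy model illustrates that rule on whitespace-split tokens. The helper `contains_phrase` is hypothetical and says nothing about how Zvec implements phrase matching internally.

```python
def contains_phrase(doc_tokens, phrase_tokens):
    """Return True if phrase_tokens occur contiguously and in order."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


phrase = "vector search".split()
print(contains_phrase("fast vector search engine".split(), phrase))   # True
print(contains_phrase("search vector engine".split(), phrase))        # False (wrong order)
print(contains_phrase("vector based search engine".split(), phrase))  # False (not adjacent)
```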
query_string and match_string are mutually exclusive — you must provide exactly one of them in each Fts object.
Query Syntax Reference
The following operators are supported in query_string expressions:
| Syntax | Meaning | Example |
|---|---|---|
| term | Match a single term | vector |
| "phrase" | Match an exact phrase (word order and adjacency) | "machine learning" |
| +term | Term must appear in the document | +vector |
| -term | Term must not appear in the document | -slow |
| a AND b | Both terms must match | vector AND search |
| a OR b | Either term can match | vector OR embedding |
| a NOT b | Match a but exclude documents matching b | learning NOT deep |
| (expr) | Group sub-expressions | (vector OR embedding) AND search |
| +(expr) | Group must match | +(vector OR embedding) |
| -(expr) | Group must not match | -(slow AND outdated) |
Operator precedence: AND / NOT bind tighter than OR. Adjacent terms without an explicit operator are combined using the default_operator setting (default: OR).
For complex queries mixing multiple operators, use () to make grouping explicit and avoid unexpected results.
Leading negation is not supported — both NOT term and standalone -term require at least one positive term. Use a NOT b or combine -term with positive terms (e.g., a -b).
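When query strings are assembled from user input, a small helper can enforce the "at least one positive term" rule before the query is sent. The sketch below is a hypothetical helper, not part of the Zvec API. It assumes a quoted phrase counts as a positive clause, and it does not escape special characters inside terms.

```python
def build_query_string(required=(), excluded=(), phrases=(), optional=()):
    """Assemble a query_string from term lists (hypothetical helper)."""
    positive = [f"+{t}" for t in required]
    positive += [f'"{p}"' for p in phrases]
    positive += list(optional)
    if not positive:
        # Leading negation is not supported, so refuse exclusion-only queries.
        raise ValueError("query needs at least one positive term or phrase")
    # Positive clauses go first so the string never starts with a negation.
    negative = [f"-{t}" for t in excluded]
    return " ".join(positive + negative)


query = build_query_string(
    required=["learning"], excluded=["neural"], phrases=["vector search"]
)
print(query)  # +learning "vector search" -neural
```

Calling `build_query_string(excluded=["slow"])` raises ValueError instead of producing the unsupported query `-slow`.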
Query Parameters
| Parameter | Description |
|---|---|
| topk | The number of top-scoring documents to return. |
| filter | An optional SQL-like boolean expression to restrict results. See conditional filtering for details. |
| output_fields | An optional list of scalar field names to include in results. If omitted, all scalar fields are returned. |
Default Operator
By default, adjacent bare terms in both match_string and query_string are combined with OR. To change this to AND, pass a FtsQueryParam via the param field:
import zvec
from zvec.model.param.query import Fts, Query
result = collection.query(
    queries=Query(
        field_name="content",
        fts=Fts(match_string="machine learning"),
        param=zvec.FtsQueryParam(default_operator="AND"),
    ),
    topk=5,
)
With default_operator="AND", this returns only documents containing both "machine" and "learning".
Explicit operators (AND, OR, +, -) in a query_string are not affected by default_operator — it only controls how adjacent bare terms are combined.
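The difference between the two default operators can be illustrated with a toy model that needs no collection. The function `matches` below is hypothetical: it lowercases and splits on whitespace, which only approximates the configured tokenizer, and it ignores ranking.

```python
def matches(doc_text, query_text, default_operator="OR"):
    """Return True if the document satisfies the bare-term query (toy model)."""
    doc_tokens = set(doc_text.lower().split())
    query_tokens = query_text.lower().split()
    if default_operator == "AND":
        # Every bare term must be present.
        return all(t in doc_tokens for t in query_tokens)
    # OR: a single matching term is enough.
    return any(t in doc_tokens for t in query_tokens)


docs = {
    "a": "Machine learning basics",
    "b": "Learning to cook",
    "c": "Gardening tips",
}

or_hits = [k for k, text in docs.items() if matches(text, "machine learning", "OR")]
and_hits = [k for k, text in docs.items() if matches(text, "machine learning", "AND")]
print(or_hits)   # ['a', 'b']
print(and_hits)  # ['a']
```

Switching to AND trades recall for precision: document "b" mentions only "learning" and drops out.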
Combining FTS with Scalar Filters
You can combine full-text search with scalar filters to narrow results:
from zvec.model.param.query import Fts, Query
result = collection.query(
    queries=Query(
        field_name="content",
        fts=Fts(match_string="machine learning"),
    ),
    filter="category = 'tech'",
    topk=5,
)
FTS and vector search are mutually exclusive within one query route. A single Query / ZVecQuery should not set both fts and vector / id; use separate query routes with re-ranking, or run separate queries and merge results in your application.
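If you choose to merge results in application code, reciprocal rank fusion (RRF) is one common technique. It is named here as an option, not as what Zvec's re-ranking does. The sketch assumes you have already extracted ranked document ids from each result set; how to read ids from a Zvec result is not covered in this guide, so the id lists below are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; k=60 is a conventional constant."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


fts_ids = ["doc3", "doc1", "doc7"]     # hypothetical ids from an FTS query
vector_ids = ["doc1", "doc9", "doc3"]  # hypothetical ids from a vector query
merged = reciprocal_rank_fusion([fts_ids, vector_ids])
print(merged)  # ['doc1', 'doc3', 'doc9', 'doc7']
```

RRF uses only rank positions, so it avoids having to put BM25 scores and vector distances on a common scale.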
Tokenizers
Zvec provides three built-in tokenizers. The tokenizer is configured per-field via FtsIndexParam.
| Tokenizer | Name | Description |
|---|---|---|
| Standard | "standard" | Splits text on non-alphanumeric characters. Best for most English-like languages. (Default) |
| Whitespace | "whitespace" | Splits text on whitespace only (spaces, tabs, newlines). Preserves punctuation within tokens. |
| Jieba | "jieba" | Chinese word segmentation using cppjieba. Supports mixed Chinese/English text. |
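The practical difference between the standard and whitespace tokenizers is easiest to see on text that contains punctuation. The functions below are rough pure-Python approximations, not Zvec's tokenizers. In particular, treating "alphanumeric" as Unicode letters and digits, and measuring max_token_length in characters, are assumptions.

```python
import re


def standard_tokenize(text, max_token_length=255):
    """Approximate the standard tokenizer: split on non-alphanumeric characters."""
    # [^\W_]+ matches runs of letters and digits (underscore excluded).
    tokens = re.findall(r"[^\W_]+", text)
    # Tokens longer than max_token_length are discarded, not truncated.
    return [t for t in tokens if len(t) <= max_token_length]


def whitespace_tokenize(text):
    """Approximate the whitespace tokenizer: split on whitespace only."""
    return text.split()


text = "state-of-the-art C++ search, v2.0"
print(standard_tokenize(text))
# ['state', 'of', 'the', 'art', 'C', 'search', 'v2', '0']
print(whitespace_tokenize(text))
# ['state-of-the-art', 'C++', 'search,', 'v2.0']
```

Under these assumptions, the whitespace tokenizer keeps "C++" and "v2.0" searchable as single tokens, but a trailing comma stays attached to "search,".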
Standard Tokenizer
The default tokenizer. Splits text into tokens at non-alphanumeric character boundaries and discards delimiters.
zvec.FtsIndexParam(tokenizer_name="standard", filters=["lowercase"])
Configuration (via extra_params JSON):
| Key | Type | Default | Description |
|---|---|---|---|
| max_token_length | int | 255 | Maximum token length. Tokens exceeding this length are discarded. |
Whitespace Tokenizer
Splits text only on whitespace characters. Useful when punctuation should be preserved in tokens.
zvec.FtsIndexParam(tokenizer_name="whitespace", filters=["lowercase"])
Jieba Tokenizer
Chinese word segmentation tokenizer. Also handles mixed Chinese/English text.
The Python SDK bundles a default Jieba dictionary — the Jieba tokenizer works out of the box with no extra configuration. You only need to set jieba_dict_dir if you want to use a custom dictionary.
zvec.FtsIndexParam(
    tokenizer_name="jieba",
    filters=["lowercase"],
    extra_params='{"jieba_dict_dir": "/path/to/jieba/dict"}',
)
Configuration (via extra_params JSON):
| Key | Type | Default | Description |
|---|---|---|---|
| jieba_dict_dir | str | none | Directory containing jieba.dict.utf8 and hmm_model.utf8. Can also be set via ZVEC_JIEBA_DICT_DIR environment variable. |
| user_dict_path | str | none | Path to a custom user dictionary file. |
| cut_mode | str | "search" | Segmentation mode: "search" (fine-grained, recommended for search), "mix", "full", or "hmm". See cppjieba documentation for details on each mode. |
jieba_dict_dir resolution order (first non-empty value wins):
- Per-field extra_params in FtsIndexParam
- ZVEC_JIEBA_DICT_DIR environment variable
- Global default set via zvec.init(jieba_dict_dir=...) or zvec.set_default_jieba_dict_dir()
- Built-in dictionary bundled with the Python SDK (set automatically on import zvec)
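The resolution order above amounts to "take the first non-empty candidate". The following sketch models that logic so you can reason about which directory wins. `resolve_jieba_dict_dir` is a hypothetical function, not a Zvec API, and all paths are placeholders. The environment is passed in as a dictionary to keep the example deterministic; in real code you would pass `os.environ`.

```python
import json


def resolve_jieba_dict_dir(extra_params="", env=None, global_default="", bundled=""):
    """Return the first non-empty candidate, in the documented order."""
    env = env or {}
    options = json.loads(extra_params) if extra_params else {}
    candidates = [
        options.get("jieba_dict_dir", ""),   # 1. per-field extra_params
        env.get("ZVEC_JIEBA_DICT_DIR", ""),  # 2. environment variable
        global_default,                      # 3. global default
        bundled,                             # 4. dictionary bundled with the SDK
    ]
    return next((c for c in candidates if c), "")


resolved = resolve_jieba_dict_dir(
    extra_params='{"cut_mode": "search"}',  # no per-field directory set
    env={"ZVEC_JIEBA_DICT_DIR": "/opt/dict"},
    global_default="/srv/dict",
    bundled="/bundled/dict",
)
print(resolved)  # /opt/dict
```

Here the field's extra_params sets only cut_mode, so the environment variable takes precedence over the global default and the bundled dictionary.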
Token Filters
Token filters are applied in sequence after tokenization. Currently, only one filter is available:
| Filter | Description |
|---|---|
| "lowercase" | Converts all tokens to ASCII lowercase. |
Constraints
- FTS and vector search are mutually exclusive within one query route: a single Query / ZVecQuery should not set both fts and vector / id.
- query_string and match_string are mutually exclusive in a single Fts object.
- FTS fields do not support alter column.
- Leading negation (NOT term or standalone -term) is not supported; at least one positive term is required.