turbopuffer

rag Mar 26, 2026 3 min read

turbopuffer is a third-party vector database that supports lexical, vector, and hybrid search. use this if you already have data in turbopuffer or want vector/hybrid retrieval beyond BM25. castform does not provide turbopuffer access; you’ll need your own api key.

when to use turbopuffer

  • you already have data indexed in turbopuffer
  • you need vector or hybrid search (e.g., for code, jargon-heavy content, or multilingual corpora where keyword matching alone falls short)
  • you want lexical search without embeddings (turbopuffer supports BM25 natively)

1. create your corpus

use TpufChunkSource to upload and index your documents:

from trainer.corpus.turbopuffer.source import TpufChunkSource

# lexical-only (no embeddings needed)
source = TpufChunkSource(
    api_key="tpuf_...",
    namespace="my-docs",
)
source.populate_from_folder("./docs/")

to enable vector and hybrid search, pass an embedding function:

from trainer.corpus.turbopuffer.source import TpufChunkSource
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

source = TpufChunkSource(
    api_key="tpuf_...",
    namespace="my-docs",
    embed_fn=model.encode,
)
source.populate_from_folder("./docs/")
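embed_fn can be any callable that maps text to a fixed-length vector; it doesn't have to come from sentence-transformers. a minimal sketch of a stand-in embedding function for wiring things up locally (hypothetical and for testing only — hash-derived vectors carry no semantic meaning, so real search quality requires a real embedding model):

```python
import hashlib


def toy_embed(text: str) -> list[float]:
    """Deterministic stand-in embedding: hash the text into a
    fixed-length vector. Good for plumbing tests, not real search."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # 32 bytes
    # map each byte from [0, 255] to [-1.0, 1.0]
    return [b / 127.5 - 1.0 for b in digest]


vec = toy_embed("hello turbopuffer")
```

any callable with this shape can be passed as embed_fn in place of model.encode; just use the same function for both corpus population and search so query and document vectors live in the same space.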

you can also upload pre-chunked data:

source.populate_from_chunks(chunks)

search modes

mode     requires embed_fn   description
lexical  no                  BM25 keyword matching
vector   yes                 approximate nearest neighbor with embeddings
hybrid   yes                 reciprocal rank fusion of lexical + vector
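hybrid mode merges the lexical and vector result lists with reciprocal rank fusion: each document's score is the sum of 1 / (k + rank) over the lists it appears in, so documents ranked well by both retrievers rise to the top. a minimal sketch of that scoring (the function and the common k = 60 constant are illustrative, not the library's actual implementation):

```python
def rrf_merge(lexical_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked id lists by reciprocal rank fusion:
    score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in (lexical_ids, vector_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# "b" appears near the top of both lists, so it outranks "a",
# which tops only the lexical list
merged = rrf_merge(["a", "b", "c"], ["b", "d", "a"])  # → ["b", "a", "d", "c"]
```

because RRF works on ranks rather than raw scores, it needs no score normalization between BM25 and vector distances.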

2. generate QA data

pass your TpufChunkSource to the QA generation pipeline as usual. it implements the standard ChunkSource interface:

from trainer.qa_generation.cgft_models import CgftPipelineConfig, PlatformConfig, CorpusConfig, TargetsConfig
from trainer.qa_generation.cgft_pipeline import CgftPipeline

cfg = CgftPipelineConfig(
    platform=PlatformConfig(api_key="sk_..."),
    corpus=CorpusConfig(corpus_name="my-docs", docs_path="./docs/"),
    targets=TargetsConfig(total_samples=200),
)
cfg.resolve_api_keys()

pipeline = CgftPipeline(cfg, corpus_source=source)
result = pipeline.run()

3. train with SearchEnv

create a TpufSearch client and pass it to SearchEnv:

from trainer.corpus.turbopuffer.search import TpufSearch
from trainer.envs.search_env import SearchEnv
from trainer.trainer.pipeline import train

search = TpufSearch(
    api_key="tpuf_...",
    namespace="my-docs",
    embed_fn=model.encode,  # same embed_fn used for corpus
)

experiment_id = train(
    env_class=SearchEnv,
    env_args={"search": search},
    train_dataset=result["train_dataset"],
    eval_dataset=result["eval_dataset"],
    prefix="tpuf-search",
    api_key="sk_...",
)

TpufSearch parameters

parameter        default              description
api_key          required             turbopuffer API key
namespace        required             turbopuffer namespace
region           "aws-us-east-1"      turbopuffer region
content_attr     None                 metadata fields to concatenate as content
embed_fn         None                 embedding function; required for vector/hybrid modes
vector_attr      "vector"             attribute name for vector storage
distance_metric  "cosine_distance"    distance metric for vector search

notes

  • hybrid search uses client-side reciprocal rank fusion (RRF) to merge lexical and vector results
  • TpufSearch is pickle-safe: it stores only connection parameters and reconstructs the SDK client lazily after unpickling
  • TpufChunkSource and TpufSearch are separate: the source handles corpus creation and QA generation, the search client handles training
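the pickle-safety note above describes a common pattern: keep only plain connection parameters in the pickled state and rebuild the live SDK client on first use. a sketch of that pattern in a generic class (illustrative only, not the library's actual code — the dict stands in for a real SDK client object):

```python
import pickle


class LazyClientHolder:
    """Pickles only connection parameters; the (unpicklable) SDK
    client is dropped from state and rebuilt lazily on first use."""

    def __init__(self, api_key: str, namespace: str):
        self.api_key = api_key
        self.namespace = namespace
        self._client = None  # created on demand

    @property
    def client(self):
        if self._client is None:
            # stand-in for constructing the real SDK client
            self._client = {"api_key": self.api_key, "namespace": self.namespace}
        return self._client

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_client"] = None  # never pickle the live client
        return state


holder = LazyClientHolder("tpuf_...", "my-docs")
restored = pickle.loads(pickle.dumps(holder))
```

this is what lets a search client be shipped to training workers: each worker unpickles the parameters and opens its own connection.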